Dogcow Land

The Beginnings of a Mac Search Engine

05 Mar 2009 4:44 pm

Mood: Excited

Since an Apple/Mac web search engine is on a list of potential Mac GUI projects, I spent last evening doing some more research on search engines in general. Among many documents, I came across a tutorial on building one's own search engine.

It came with a database populator and search front-end. The populator would accept a URL as input, download the web page, and index every word along with the number of times it occurred.
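As a rough illustration of what such a populator does (this is not the tutorial's code, and the names `count_words`, `index_page`, and the in-memory `index` dict are hypothetical stand-ins for a real database), the word-counting step might look something like this in Python:

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)

def index_page(index, url, text):
    """Store per-page word counts in `index`, a dict of word -> {url: count}.

    A plain dict stands in for the search database here (an assumption).
    """
    for word, count in count_words(text).items():
        index.setdefault(word, {})[url] = count

index = {}
index_page(index, "http://example.com/a", "The Finder is the Finder.")
```

Here `index["finder"]` ends up as `{"http://example.com/a": 2}`, which is the kind of word-to-page-count mapping a search front-end can query.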

After a few minutes of indexing some Mac GUI pages, then using the search front-end to type in keywords and find matching hits, I decided to build an automated crawler.

I had written a basic crawler a few months ago, so it was easy to take this code and refit it for my purposes. The crawler is quite simple: enter a starting URL, and it will download that page, use a regular expression to get the page title and body, as well as all URLs. These URLs are placed into a queue, and the page body is scanned for keywords. After the keywords are put in the search database, the crawler then puts this URL into the "done" list and proceeds on to the next URL in the queue. This process is repeated until there are no more URLs. My crawler was able to index roughly 2,000 pages in about 30 minutes.
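The queue-and-done-list loop described above can be sketched roughly as follows. This is a minimal sketch under several assumptions, not my actual crawler: `fetch` and `handle_page` are hypothetical callbacks (one downloads a page, the other indexes it), links are assumed to be absolute already, and the URL is marked done before fetching so a failing page is never retried.

```python
import re
from collections import deque

# Deliberately simple patterns; real-world HTML is messier than this.
LINK_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def crawl(start_url, fetch, handle_page, max_pages=100):
    """Breadth-first crawl from start_url; returns the set of visited URLs.

    fetch(url) returns the page HTML or None on failure (hypothetical).
    handle_page(url, title, html) does the indexing (hypothetical).
    """
    queue = deque([start_url])
    done = set()
    while queue and len(done) < max_pages:
        url = queue.popleft()
        if url in done:
            continue
        done.add(url)
        html = fetch(url)
        if html is None:
            continue
        match = TITLE_RE.search(html)
        title = match.group(1).strip() if match else url
        handle_page(url, title, html)
        for link in LINK_RE.findall(html):
            if link not in done:
                queue.append(link)
    return done

# Usage with a tiny in-memory "site" instead of real downloads.
pages = {
    "http://example.com/": '<title>Home</title><a href="http://example.com/about">About</a>',
    "http://example.com/about": '<title>About</title><a href="http://example.com/">Home</a>',
}
seen = []
visited = crawl("http://example.com/", pages.get,
                lambda url, title, html: seen.append((url, title)))
```

Passing the downloader in as a function keeps the loop itself easy to try out without touching the network; a `max_pages` cap is an extra safety net that the description above does not mention.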

Some difficulties: parsing HTML! This is the big one, since keywords should only come from text which users will see, so things such as JavaScript, CSS, HTML comments, everything in the <head>, etc. should be stripped out.
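One possible way to do this stripping, sketched here with Python's standard `html.parser` module rather than regular expressions (the class and function names are made up for illustration, and this is not what my crawler currently does), is to collect text only while outside `<script>`, `<style>`, and `<head>`:

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text data, ignoring anything inside script, style, or head."""

    SKIP_TAGS = {"script", "style", "head"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

    # HTML comments go to handle_comment, which does nothing by default,
    # so they never reach self.chunks.

def visible_text(html):
    """Return the user-visible text of an HTML page, whitespace-normalized."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())

sample = ("<html><head><title>T</title><style>p{color:red}</style></head>"
          "<body><!-- note --><p>Hello <b>Mac</b> users</p>"
          "<script>var x = 1;</script></body></html>")
print(visible_text(sample))  # prints: Hello Mac users
```

A page with an unclosed `<head>` would lose all of its text under this approach, so it is only a starting point for well-formed pages.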

Directory traversal and link-finding are tricky as well. There are a few different ways links may be coded in the HTML: as absolute, starting with http://; as relative to the same directory, starting with either a filename or ./; or as relative to a parent directory, going up one or more levels with ../
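For what it's worth, Python's standard library handles all of these link forms with `urllib.parse.urljoin`. The sketch below uses a hypothetical helper name, `resolve_link`, and also drops any `#fragment` so the same page is not queued twice, which is an extra step not described above:

```python
from urllib.parse import urljoin, urldefrag

def resolve_link(page_url, href):
    """Turn an href found on page_url into an absolute URL without a fragment."""
    absolute = urljoin(page_url, href)
    return urldefrag(absolute).url

base = "http://example.com/docs/guide/index.html"
for href in ["http://other.example/x.html", "page2.html", "./page2.html",
             "../up.html", "../../top.html", "page2.html#section"]:
    print(href, "->", resolve_link(base, href))
```

Relative to the example base URL, `page2.html` and `./page2.html` both resolve into `/docs/guide/`, while each `../` climbs one directory.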

The final part of the whole system is a ranking process and the front-end page which returns the results matching the keywords. The ranking as of now is simple: pages with the most occurrences of a keyword are ranked higher. The results page is similar to any other search engine: each result has the page title, a snippet of context which contains the keywords in bold, and finally the URL.
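Ranking by occurrence count is simple enough to sketch in a few lines. The example assumes a hypothetical in-memory index shaped as word -> {url: count}; a real search database would do the same thing with a sorted query:

```python
def search(index, keyword):
    """Return (url, count) pairs for keyword, highest occurrence count first."""
    postings = index.get(keyword.lower(), {})
    return sorted(postings.items(), key=lambda item: item[1], reverse=True)

# Made-up sample data for illustration only.
sample_index = {
    "finder": {"http://example.com/a": 2, "http://example.com/b": 5},
    "dock": {"http://example.com/a": 1},
}
results = search(sample_index, "Finder")
```

In this sample, the page with five occurrences of "finder" comes before the page with two.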

One of the trickiest things to generate is the context, since as of now, a lot of "useless" text comes up, rather than actual paragraphs that the user would be interested in.
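A very basic way to build such a context snippet is to take a window of text around the first keyword match and wrap each match in bold tags. The sketch below is only that (the function name `make_snippet` and the `width` parameter are assumptions), and it does nothing yet to prefer real paragraphs over "useless" text:

```python
import re
from html import escape

def make_snippet(text, keyword, width=60):
    """Return an HTML snippet around the first match, keyword in <b> tags."""
    if not keyword:
        return escape(text[:width])
    pattern = re.escape(keyword)
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return escape(text[:width])
    # Keep roughly width/2 characters on each side of the first match.
    start = max(0, match.start() - width // 2)
    end = min(len(text), match.end() + width // 2)
    window = text[start:end]
    # Splitting on a capturing group keeps the matches at odd indexes.
    parts = re.split("(" + pattern + ")", window, flags=re.IGNORECASE)
    body = "".join(
        "<b>" + escape(part) + "</b>" if i % 2 else escape(part)
        for i, part in enumerate(parts)
    )
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + body + suffix

snippet = make_snippet("The Finder is the heart of the Mac.", "finder", width=10)
```

The page text is HTML-escaped before the bold tags are added, so stray `<` or `&` characters in the page cannot break the results page.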

There are many things to improve, but it's an exciting start, I think.


   
