macgui.com is almost 6 years old!
20 Jul 2010 6:45 pm
Phew! I just tried one of those egotistical "what is your domain name worth" tools, where a computer runs some Google searches to estimate how valuable your domain name might be.
One of the factors it uses is the age of the domain name. It reported that macgui.com is 5 years, 11 months, and 29 days old! I had almost forgotten that August 20th is the birthday of macgui.com. That's about a month from now.
Of course, Mac GUI is older than that. Mac GUI started in February of 2004 at http://macgui.bravehost.com/. It was only when the site started to "take off," so to speak, that I decided it was worth investing $10 in a domain name from Yahoo. When I first registered it, I was still on a free web hosting plan, so typing macgui.com would just redirect you to the full URL mentioned above. It wasn't until around the time of macgui.net that I started to pay for Mac GUI's web hosting.
Looking back, choosing Yahoo as the registrar was likely one of the best decisions I've ever made for this site. Later in 2005, I also registered macgui.net with Yahoo. I've never had a single problem with Yahoo since then.
Even today, I sometimes can't believe that no one had already thought of the 6-letter domain. It's so simple and obvious. Mac GUI certainly wasn't the only Mac customization site around at the time, and still isn't today. Sometimes, you just get lucky, I guess.
The domain registration just renewed two days ago, so we're good for another year. And you can be darn sure we will still be around then.
Posted By: Dog Cow 0 Comments (Post your comment)
|
Usenet Archiving Continues
25 Apr 2010 1:52 am
1 million Usenet posts = 5 GB
That's one statistic out of many that I can tell you after about 11 months of archiving Usenet. It's hard to believe it's been almost a year since my first attempts at it. This coming week will mark a full year of archiving, though I started with only the Mac and Apple II groups.
Since then, I've acquired an estimated 100 million Usenet articles, weighing in at an (again) estimated 500 GB. That's half a terabyte. Of course, these are just my estimates based on various counters. There are actually 3 or 4 separate archiving projects going on, each with its own counters. Almost all of the data is stored compressed and will likely take up less than 100 GB in that form.
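The estimates in this post hang together as simple arithmetic. Here is a minimal sketch of that calculation, assuming decimal gigabytes and using the "1 million posts = 5 GB" figure as the sample rate. The function name is made up for illustration.

```python
def estimate_archive_gb(posts, sample_posts=1_000_000, sample_gb=5):
    """Scale the measured sample (1 million posts = 5 GB) up to `posts` articles."""
    return posts * sample_gb / sample_posts


total_gb = estimate_archive_gb(100_000_000)  # 100 million articles
# If the compressed archive fits in under 100 GB, the overall
# compression ratio has to be better than total_gb / 100.
min_compression_ratio = total_gb / 100
print(total_gb, min_compression_ratio)
```

At this rate the average post works out to roughly 5 KB, and the compressed figure implies a ratio of better than 5 to 1.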
I've got posts as recent as 30 minutes ago, and as old as the 1980s, thanks to Mr. David Wiseman at the University of Western Ontario.
Someday, I hope to be able to get a server with big enough disks to be able to build a search index and display all of these posts online, and also allow anyone to download them en masse. Until then, they are likely to end up compressed, sitting on some servers and portable hard drives, waiting to be looked at.
This morning, I deployed version 1.0.8 of the NNSync software, the first update since December of 2009. This version addresses a problem: getting total article and disk size statistics was taking as long as 45 minutes! So, the solution is to store the total article count and total disk space count in two text files, updated every time the program finishes, which is once an hour. Of course, these counters only start today, so I will need to add in the previous statistics, which I've been posting every 1 or 2 weeks, to get the grand total.
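The running-counter idea can be sketched in a few lines. This is only an illustration, not NNSync's actual code: the file names, function names, and directory layout are all assumptions.

```python
from pathlib import Path


def add_to_counter(path: Path, delta: int) -> int:
    """Add `delta` to the integer stored in `path` and return the new total."""
    text = path.read_text().strip() if path.exists() else ""
    total = (int(text) if text else 0) + delta
    path.write_text(f"{total}\n")
    return total


def record_run(stats_dir: Path, new_articles: int, new_bytes: int):
    """Update both counters at the end of one (hypothetical) hourly run."""
    stats_dir.mkdir(parents=True, exist_ok=True)
    articles = add_to_counter(stats_dir / "total_articles.txt", new_articles)
    size = add_to_counter(stats_dir / "total_bytes.txt", new_bytes)
    return articles, size
```

Reading two small files takes no time at all compared with walking the whole archive. Earlier statistics could be folded in by calling `record_run` once with the historical totals.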
A good question you might be asking is "why?" Why devote a portion of my time to this project, which doesn't seem to have any visible or useful results? Part of it is just the fun of seeing the numbers rise. It's all imaginary, I know, but how many people can say that they have tens of millions of something? More practically, there don't seem to be many other people archiving Usenet. Or if there are, they aren't making a big deal of it. Usenet belongs to the world. Every post on it is publicly available for everyone to read and perhaps benefit from. If no one is saving it, these posts slowly, silently expire and disappear. And that's a shame, considering how much useful information is there, and how much time people all over spend on it.
Eventually, I hope to be able to open my little Usenet archive to everyone else.
Posted By: Dog Cow 3 Comments (Post your comment)
|
Outline of a Search Engine
12 Apr 2010 10:11 pm
[ Mood: Excited ] [ Currently: Listening to someone's headphones which are too loud ]
Since a Search Engine is my Next Big Project, it's the thing which has been on my mind for the past week. So why am I planning on building a search engine? Two reasons. The first and most important:
I want it. It's something that I would find to be useful.
The second reason is more practical: perhaps others will find it of use.
Considering that I know just a tiny fraction of all there is to know about search indexing techniques, I am sure that this will be a long and interesting project. There is no finish date or deadline in mind. It could even be that the project is never finished. But it will be started.
In case you've wondered what goes into a search engine, here's a brief outline, presented in the tentative order that I will attempt to start on each component. Nothing is easy; everything is hard.
1.) HTML Parser
This is what I'm working on first. The HTML parser has a few responsibilities: detect whether a web page is in fact HTML and, if so, which version of (X)HTML it's in. The parser then has the task of separating the wheat from the chaff, so to speak. It needs to extract the page title and description, keywords and other meta tags, as well as all of the useful content from a web page.
This is a lot more than simply stripping all HTML tags from a page with a regular expression. One needs to account for:
- detecting the markup language used (HTML 4.01 vs XHTML 1.1, for example)
- invalid markup
- normalizing text encoding
- page content which is not relevant.
The first two items are fairly straightforward. The final point, irrelevant page content, is not. What this entails is making sure that little bits such as breadcrumbs, sidebars, form labels and the like are not counted when indexing the words on a page. Or, if they are indexed, making sure that they are not overemphasized.
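To give a feel for the extraction step, here is a minimal sketch built on Python's standard `html.parser` module. It pulls out the title, the meta tags, and the body text, and it skips whatever sits inside a hand-picked set of tags. That skip list (`SKIP_TAGS`) is a crude assumption standing in for real "irrelevant content" detection, and the class and function names are made up for illustration. It does not detect the markup version or normalize encodings.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is treated as "not relevant".
SKIP_TAGS = {"script", "style", "nav", "aside", "form"}


class PageExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}        # lower-cased meta name -> content
        self.text_parts = []  # visible text fragments, in document order
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside any SKIP_TAGS element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return the title, meta tags, and indexable text of one page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "text": " ".join(parser.text_parts),
    }
```

Note that an unclosed skipped tag would swallow the rest of the page in this sketch, which is exactly the kind of invalid markup a real parser has to survive.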
2.) Web Crawler and URL server
These are two distinct components, but they are tied to each other, so I will likely develop the two concurrently.
The URL server is rather straightforward. It stores a list of all known URLs and doles them out to a group of web crawlers. The URL server should keep track of whether a crawler currently has a specific URL in its queue and when a URL should next be crawled, and it should expire outdated and bogus URLs, among other things. Finally, the URL server should distribute URLs to crawlers in an efficient manner. This is described below.
The web crawlers are a bit more complicated, but only just. Essentially, each crawler will follow the same set of steps:
1.) Contact the URL server, get a list of URLs
2.) Visit each URL and attempt to download whatever data it presents
3.) Collect all results
4.) Report back to the URL server the results
5.) Hand data off to Indexer
6.) Repeat
Here, "results" includes any URLs found and any exceptional HTTP responses. "Data" means the actual HTML pages which were successfully downloaded.
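One pass through the crawler steps above might look like the following sketch. The four callables (`get_batch`, `fetch`, `report`, `hand_off`) are hypothetical stand-ins for the URL server, the HTTP client, and the indexer, so the loop can be shown without any real network code. The link regex is a deliberately crude placeholder for proper link extraction.

```python
import re
from dataclasses import dataclass, field

# Crude placeholder: only finds absolute links in double-quoted href attributes.
HREF = re.compile(r'href="(http[^"]+)"')


@dataclass
class CrawlResults:
    found_urls: list = field(default_factory=list)  # links discovered on pages
    errors: dict = field(default_factory=dict)      # url -> exceptional HTTP status
    pages: dict = field(default_factory=dict)       # url -> downloaded HTML


def crawl_once(get_batch, fetch, report, hand_off):
    """Run steps 1 to 5 once; the caller repeats this for step 6."""
    results = CrawlResults()
    for url in get_batch():                      # 1. get a list of URLs
        status, body = fetch(url)                # 2. visit each URL
        if status == 200:                        # 3. collect all results
            results.pages[url] = body
            results.found_urls.extend(HREF.findall(body))
        else:
            results.errors[url] = status
    report(results.found_urls, results.errors)   # 4. report back to the URL server
    hand_off(results.pages)                      # 5. hand data off to the indexer
    return results
```

Passing the collaborators in as functions keeps the loop itself trivial, which fits the claim that the crawler is only a bit more complicated than the URL server.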
URLs will come in batches of a few hundred. When the URL server sends out URLs to a crawler, it will mark those URLs as "active" so that no two crawlers are attempting to download the same URL. Additionally, URLs from the same domain will always be sent to one crawler, and they will be interleaved so that a single domain is not flooded with hits, whether from multiple crawlers or from one crawler requesting its pages in rapid succession.
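The batching rules can be sketched as follows. This is one possible reading of them, not a finished design: the class and method names are invented, the host name is used as a stand-in for "domain", and a CRC32 hash is just one simple way to pin each host to a single crawler.

```python
import zlib
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def host_of(url):
    return urlsplit(url).hostname or ""


class UrlServer:
    def __init__(self, num_crawlers):
        self.num_crawlers = num_crawlers
        self.known = set()                # every URL ever added
        self.pending = defaultdict(list)  # host -> URLs waiting to be crawled
        self.active = set()               # URLs currently held by a crawler

    def add(self, url):
        if url not in self.known:
            self.known.add(url)
            self.pending[host_of(url)].append(url)

    def crawler_for(self, host):
        # The same host always hashes to the same crawler.
        return zlib.crc32(host.encode()) % self.num_crawlers

    def next_batch(self, crawler_id, size=300):
        queues = [urls for host, urls in self.pending.items()
                  if self.crawler_for(host) == crawler_id]
        # Round-robin across hosts so consecutive URLs hit different sites.
        interleaved = [u for group in zip_longest(*queues)
                       for u in group if u is not None]
        batch = interleaved[:size]
        for url in batch:
            self.pending[host_of(url)].remove(url)
            self.active.add(url)          # mark as "active"
        return batch

    def done(self, url):
        self.active.discard(url)
```

A real server would also need the re-crawl schedule and URL expiry mentioned earlier, plus a per-host delay, since interleaving alone does not help when a crawler holds URLs for only one host.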
The crawler should be able to detect anomalous conditions, such as a 404 error, 500 error, 301 redirect, etc., take whatever action is necessary, and always report the results back to the URL server.
Finally, the crawler should be able to parse the robots.txt file and obey any directives found within. The crawlers will be multi-threaded for maximum efficiency.
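For the robots.txt part, Python's standard library already ships a parser, `urllib.robotparser`. The sketch below feeds it the text of a robots.txt file that has already been downloaded, so no network access is involved. The helper names and the "ExampleCrawler" user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed(rules, url, agent="ExampleCrawler"):
    """True if `agent` may fetch `url` under the parsed rules."""
    return rules.can_fetch(agent, url)


rules = build_rules("User-agent: *\nDisallow: /private/\n")
print(allowed(rules, "http://example.com/index.html"))
print(allowed(rules, "http://example.com/private/notes.html"))
```

A crawler would typically fetch and cache one rule set per host, then check `allowed` before each download.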
3.) Indexer
Here comes a significant portion of the search engine: taking the raw HTML pages, and turning them into inverted indices of word-URL pairs. As the crawler is doing its work, it will be building a collection of HTML pages. The indexer will come along and use the HTML Parser to extract all of the useful content of these pages and update the search index as needed.
Aspects of the page, such as word order, frequency, style, and other elements will all be taken into consideration for ranking purposes.
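A toy inverted index shows the basic shape of the data. This sketch assumes plain text has already been extracted from each page and that words are just runs of ASCII letters and digits. It stores word positions, which is enough to recover both frequency and word order. Style and the other ranking signals are left out, and the names are made up for illustration.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return WORD.findall(text.lower())


class InvertedIndex:
    def __init__(self):
        # word -> {url: [positions of the word on that page]}
        self.postings = defaultdict(dict)

    def add_page(self, url, text):
        for position, word in enumerate(tokenize(text)):
            self.postings[word].setdefault(url, []).append(position)

    def urls_for(self, word):
        """All URLs whose text contains `word`."""
        return set(self.postings.get(word.lower(), {}))

    def frequency(self, word, url):
        """How many times `word` occurs on the page at `url`."""
        return len(self.postings.get(word.lower(), {}).get(url, []))
```

Everything here lives in memory. An index covering millions of pages would need an on-disk format, which is a large part of why this component is significant.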
4.) Query Parser and Executer
Again, these are two components which are tied together. The query parser takes the user's search input string and sanitizes, normalizes, and expands it as necessary. The query parser should be able to handle boolean operators, word exclusion, and exact phrase searching, among other things. Common synonyms for words may also be added, and very frequent words may be dropped unless the user explicitly says otherwise.
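A small sketch of the parsing step follows. The syntax is an assumption borrowed from common search engines: double quotes for an exact phrase, a leading `-` to exclude, and a leading `+` to keep a word that would otherwise be dropped as too frequent. Explicit boolean operators and synonym expansion are left out, and the stop-word list is a tiny placeholder.

```python
import re
from dataclasses import dataclass, field

# Placeholder list of "very frequent" words.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

# An optional +/- sign, then either a "quoted phrase" or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


@dataclass
class Query:
    required: list = field(default_factory=list)  # words that must appear
    phrases: list = field(default_factory=list)   # exact phrases that must appear
    excluded: list = field(default_factory=list)  # words or phrases that must not


def parse_query(raw):
    query = Query()
    for sign, phrase, word in TOKEN.findall(raw):
        if phrase:
            normalized = " ".join(phrase.lower().split())
            (query.excluded if sign == "-" else query.phrases).append(normalized)
            continue
        word = word.lower().strip('"')
        if not word:
            continue
        if sign == "-":
            query.excluded.append(word)
        elif word in STOP_WORDS and sign != "+":
            continue  # dropped unless the user explicitly asked for it
        else:
            query.required.append(word)
    return query
```

For example, `parse_query('Mac "System 7" -windows the +the')` keeps `mac`, keeps the phrase `system 7`, excludes `windows`, drops the first `the`, and keeps the second.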
Once the query has been parsed, it is handed over to the search execution engine, which carries out the final steps: consulting the dictionary, indices, and ranking data, then ordering and returning the results.
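The execution step can be sketched against a very simple index of the form word -> {URL: occurrence count}. That index shape, the function name, and ranking by raw term frequency are all assumptions made to keep the example short. Real ranking would weigh many more signals.

```python
def execute(index, required, excluded=()):
    """Return URLs containing every required word and no excluded word,
    ordered by total occurrences of the required words (highest first)."""
    if not required:
        return []
    candidates = set(index.get(required[0], {}))
    for word in required[1:]:
        candidates &= set(index.get(word, {}))  # implicit AND
    for word in excluded:
        candidates -= set(index.get(word, {}))  # word exclusion

    def score(url):
        return sum(index[word][url] for word in required)

    # Sort by score, breaking ties by URL so the order is stable.
    return sorted(candidates, key=lambda url: (-score(url), url))


index = {
    "mac": {"u1": 3, "u2": 1},
    "gui": {"u1": 1, "u2": 1, "u3": 5},
    "windows": {"u2": 2},
}
print(execute(index, ["mac", "gui"]))
print(execute(index, ["mac", "gui"], excluded=["windows"]))
```

Intersecting the smallest posting list first would be a natural optimization once the lists get long.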
5.) Front End
The front end is merely the user interface. It is where the search begins and ends. This part of the search engine does the least amount of work. It merely accepts the user's raw search query, and then displays the results it got from running the query.
Posted By: Dog Cow 4 Comments (Post your comment)
|
6 Years Have Come And Gone
26 Feb 2010 12:46 am
[ Mood: Amused ] [ Currently: Listening to some girls chattering nearby ]
I don't even know where to begin. What to say? I've been working on this site for 6 years and 2 weeks. Though if you really want to get technical about it, Mac GUI didn't go "live" online until later in March.
It really amazes me to think back to those cold, snowy days in February of 2004, when I first thought of Mac GUI. I still have the planner book with my original sketch of Mac GUI's home page, done in pencil in the back seat during a car ride.
Mac GUI's first design was based on some of my favorite sites at the time, Cool Macintosh and The Mushroom Kingdom. Since I was both a.) a design newbie and b.) lacking in tools, Mac GUI's first designs were really terrible compared to today. The first color scheme was purple, for goodness' sake!
Back then, I was using Adobe PageMill 3.0 and Photoshop 5.5LE to do the HTML and graphics for Mac GUI. All of this was done on a Power Mac 6100/66 running OS 9. I didn't get the blue & white G3 until later in 2004, and that G3 carried me all the way up to summer 2008, when I finally got the then-newest model of Intel Mac mini.
Obviously, everything was strictly static HTML back then. That's one of the big differences I see between then and now. Today it seems like everyone needs to have a CMS or blog, even for a really small site with just a handful of pages (say, fewer than 50 pages).
Mac GUI didn't get fully converted to a PHP database system until Mac GUI 5 in September of 2006. Even then, it still took a few more months to get all 1,800 or so downloads from the static HTML pages into the new database system. Now there are over 24,000 downloads.
Having everything in a database is nice in many ways, but it also makes updating the site harder. When everything was in HTML, updates were as simple as copying the new or updated files from my Mac onto a disk and then uploading them. Now, updates are in the database, and so I have to generate .SQL files and then import those. Tuesday's Vault update took over an hour just to generate the .SQL file. I had even written a script to do it, but there is still some manual work involved.
Still today, I do not have a home Internet connection. In the very earliest days of Mac GUI, I would copy all of the updated HTML pages onto a PC-formatted 3.5" floppy disk and go out to the nearest public Internet place to upload them. When I got the G3, I substituted a 100 MB Zip disk. In 2005, I was an early buyer of a USB flash drive. It held 256 MB of data and cost roughly $30. Now I use a 4 GB drive which cost about $15.
So things certainly have changed a lot in 6 years. I wonder if we'll make it to 10 years.
Posted By: Dog Cow 2 Comments (Post your comment)
|
Projects for 2010
25 Jan 2010 11:49 pm
[ Mood: Excited ]
I already know what projects I have in mind for 2010, but you don't, so I'll let you know.
The first one is a search index for all 800,000-some Usenet posts, and perhaps an additional 6 million beyond that. It's a feature which more than one person has requested since the 10 years' historical archive was put online a few weeks ago.
The second project is far more grand and ambitious, a search engine, and it's one which I have been thinking about for a while. The only difference between then and now is that I know a lot more about how to design such a system. This project is a search engine which only searches web sites about Apple products. It would also have features tailored for Mac and Apple users, such as listings of various compression formats and disk catalogs.
The third project is to finally finish the Bookmarks system which I started writing a few months ago. It's nothing major, but I've found that it can be fun to find a list of links and be led to other sites which one would have never stumbled upon before. I'd like to make my own list of bookmarks for others, and I wonder if there would be people besides myself interested in such a feature.
The fourth project is to enhance Vault with more photos, software, text files and better descriptions of the existing material. Hopefully there will be other people who would like to help. If that's you, then take a look at the Contribute link on the Vault files, and see what you can do to fill it in!
The final project is sort of on-going and that is "Mac GUI City 2," which is essentially a combination of a site redesign and efforts to make the site more concise by removing extra features, options, and areas which don't get used so often.
I'll have more to say on all of these projects later.
Posted By: Dog Cow 1 Comments (Post your comment)
About Dog Cow
Joined: 11 Dec 2004 5:20 pm
Location: USA
Occupation: Assistant chicken-keeper
Interests: Mac computers, as well as many of the same things that you like!
Blog
Blog Started: 20 Jul 2005 12:50 am
Total entries: 149
Blog Age: 1,837 days
Total replies: 279
Visits: 30,393