Well obviously I didn't want to sit around and make these myself. Too boring!
So of course we make our computer slave do the task for us!
The general outline of the algorithm is pretty simple:
1.) Get a file from Vault
2.) Load it in the emulator
3.) Wait for it to load
4.) Take some screenshots
5.) Repeat with next file
Well, step 1 is pretty simple. I have a local copy of the entire Vault Collection, so I can write the program to iterate over every file. Now I only want to load disk images into the emulator, so I use the Format Converter to skip all files except for DOS 3.3, ProDOS, and Pascal. In the future, Format Converter will be called to convert ShrinkIt archives into disk images.
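Here is a rough Python sketch of what step 1 could look like. The `detect_format` callable stands in for the Format Converter, whose real interface isn't shown here, so that name and the format labels it returns are assumptions made for illustration.

```python
import os
from typing import Callable, Iterator, Optional

# Formats worth booting in the emulator (labels are illustrative assumptions).
BOOTABLE_FORMATS = {"DOS 3.3", "ProDOS", "Pascal"}


def disk_images(
    root: str, detect_format: Callable[[str], Optional[str]]
) -> Iterator[str]:
    """Yield every path under root whose detected format is a bootable disk image."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # detect_format is a hypothetical stand-in for the Format Converter.
            if detect_format(path) in BOOTABLE_FORMATS:
                yield path
```

Everything the detector doesn't recognise as one of the three formats is simply skipped, which leaves room to add a conversion step for archives later.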
The emulator of choice turns out to be Virtual ][ by Gerard Putter since it is AppleScript-aware (scriptable).
I wrote the AppleScript to load a disk image into the emulator, an Apple IIe, then wait a few seconds for the disk to load. This wait period had to be determined by some trial and error with a small, random sample of disks. I also check that the drive light is off before taking the first screenshot.
The script will take a series of four screenshots. It will press some keys, such as Return, 1, and A, then wait a few seconds. This gets the introduction screen, and usually the next few screens of a program. Not perfect, but still good enough in most cases.
Now this process takes about 35 seconds per disk image, and the whole screenshot-taking process lasted around 90 hours in total. I was asleep during many hours of this time.
Each screenshot is tagged with its file ID and sequence number (1 to 4) in the set.
So now I have ended up with just over 21,000 screenshots. Looking through these, it is obvious that many of them are duplicates, errors, black screens, or ProDOS loading screens.
Time to write some more programs!
The first job is to eliminate exact duplicates within each set of screenshots for a file. This is a job for the cryptographic hashing algorithm, MD5! I simply take the MD5 hash of each file, then compare it to the others in the set. If there's a match, then I delete that matching file.
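A minimal sketch of that step in Python, assuming each set arrives as a list of file paths in sequence order. The function names are made up for this example.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def remove_exact_duplicates(screenshot_set: Iterable[Path]) -> List[Path]:
    """Delete files whose hash matches an earlier file in the set; return the kept paths."""
    seen = set()
    kept = []
    for path in screenshot_set:
        digest = md5_of_file(path)
        if digest in seen:
            path.unlink()  # byte-for-byte duplicate of an earlier screenshot
        else:
            seen.add(digest)
            kept.append(path)
    return kept
```

The first file with a given hash is the one that stays, so the earliest screenshot in each set always survives.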
Now the count is down to around 17,000 files. Not bad, but I still need to remove more.
Next job is to remove low-quality screenshots. These were mentioned earlier: loading screens, "please wait" screens, errors, etc. I automate this process by building a list of MD5 hashes of these screenshots, then comparing every screenshot's hash against that list and deleting any that match.
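Something like the following Python would do it, assuming a handful of known-bad screenshots have been picked out by hand as examples. The function names are made up for this sketch.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Set


def build_reject_list(examples: Iterable[Path]) -> Set[str]:
    """Hash hand-picked bad screenshots (loading screens, errors, black screens)."""
    return {hashlib.md5(p.read_bytes()).hexdigest() for p in examples}


def remove_rejected(screenshots: Iterable[Path], reject_hashes: Set[str]) -> List[Path]:
    """Delete any screenshot whose hash is in the reject list; return the kept paths."""
    kept = []
    for path in screenshots:
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in reject_hashes:
            path.unlink()
        else:
            kept.append(path)
    return kept
```

This only catches screens that are byte-for-byte identical to one of the examples, which is why it works well for things like a standard loading screen.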
Getting rid of these screenshots reduces the count to about 13,000 files. Still more work.
As I keep looking through the screenshots, I see that there are a number of near duplicates. These are screenshots that to the computer are unique, but to humans are identical. One common example is a cursor flashing on in one shot, but off in the next.
The algorithm to eliminate these is interesting. I can't use an MD5 hash of the entire screenshot, since a single changed pixel gives a completely different hash, but I can cut the screenshot into squares and compare the hash of each square. If more than a certain number match, then the two screenshots are near matches, and the second one is deleted. I end up cutting screenshots into about 80 squares, and if more than 77 match, then that's a near duplicate.
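The idea can be sketched in Python like this. It assumes each screenshot has already been decoded into a flat buffer of one byte per pixel (the decoding step isn't shown), and it picks a 10 by 8 grid to get exactly 80 tiles. The grid shape and the function names are assumptions for illustration; only the "about 80 squares, more than 77 matching" rule comes from the text above.

```python
import hashlib
from typing import List


def tile_hashes(
    pixels: bytes, width: int, height: int, cols: int = 10, rows: int = 8
) -> List[str]:
    """Split a one-byte-per-pixel image into cols x rows tiles and MD5 each tile."""
    if len(pixels) != width * height:
        raise ValueError("pixel buffer does not match width * height")
    hashes = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            h = hashlib.md5()
            # Feed the tile to the hash one pixel row at a time.
            for y in range(y0, y1):
                h.update(pixels[y * width + x0 : y * width + x1])
            hashes.append(h.hexdigest())
    return hashes


def is_near_duplicate(
    a: bytes, b: bytes, width: int, height: int, threshold: int = 77
) -> bool:
    """True if more than `threshold` of the tiles hash identically."""
    matches = sum(
        x == y
        for x, y in zip(tile_hashes(a, width, height), tile_hashes(b, width, height))
    )
    return matches > threshold
```

A blinking cursor only touches one or two tiles, so the two shots still share 78 or 79 tile hashes and the second one gets flagged. A change that spreads over three or more tiles falls to 77 matches or fewer and both shots are kept.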
Finally the count is down to 10,657 screenshots for 4,668 programs. Good enough! I'll write the program to enter these into the Vault database, and away we go!