So to recap where we left off, I had all of my messages stored stored in the filesystem named by their hash values and all of my folders are stored as CSV files.
Neither of these were authoritative, though. Everything was still being loaded from the old GDBM files.
My first step after that was to begin loading from the message store. Unfortunately, the folder indexes stored in GDBM don't store the hashes. To reconstruct them I would have needed to load the messages from GDBM, hash them and then load the same messages from the new message store which would be kind of silly.
The GDBM indexes reference messages by a monotonically increasing numeric id which is the message key in the GDBM file. I didn't really want to mess with the contents of the GDBM files, but it seemed like the best option was to store the hashes in the GDBM indexes.
That didn't work. Many of the folders are too big to allow me to rewrite the index. So during folder write I wrote out another CSV file mapping the GDBM message key to the message hash. When we load an index, we just lookup the key in the message key to hash index and tack on the associated hash.
With the addition of the index-to-hash lookup, everything works quite nicely. But then, there's the garbage collection thing. The old GDBM messages got deleted from GDBM when the index is written. I suppose I could have just deleted the files from the message store at the same time, but I kind of like the idea of them laying around in the trash for a while, just in case there's something I want to dig out.
Conceptually, garbage collection on this system is pretty easy. Since all messages are currently referenced by folders, we can just go through the message store and collect all of the message hashes. The ones that aren't referenced by folders can be deleted.
The first time I did that, I just printed the senders and subject lines of the messages to be deleted (I've learned not to do mass deletes on the first try). Of course, most of the stuff printed out was stuff that I very much wanted to keep.
The first problem was that the hashes generated from the loaded messages didn't always match those in the index. Insignificant changes to the contents produce a different hash value and the old code doesn't necessarily serialize the message to exactly the same string that it reads. So after reloading and reserializing a message to compute the folder hash, you can end up with a different hash (and I don't exactly recall the details of how this happens).
Some of the discrepancies were my fault, preserving the original contents wasn't a goal of the original system and there were places where I took liberties like adding an extra newline in an attachment. Other problems were deeper, the code that parses headers in python's rfc822 module doesn't reproduce the original newlines when serializing them back out.
The solution to all of this was simple: when loading a message, just store the original contents.
I still had some duplicates in the message store: these were due to messages that had been reserialized during a move from the Inbox to the folder where I wanted to keep them. To exclude these, I generated a mapping from message ids to message hash.
Every legitimate e-mail message contains a "Message-Id" header that should be a globally unique identifier (I could have used this as a key instead of hash except that my draft messages don't have them and also I'm generally distrustful). So for any given message hash that's not in a folder, if there is another message with the same id that is in a folder, we can safely delete the original message.
So running the garbage collector against the current store, I have 1459 messages that would be deleted. But again, because I'm untrusting, I have to look at them all to verify. This doesn't scale well. Also, the scan ends up taking a lot of time.
So I added a journal. Whenever there's an action on the index (add, delete, undelete) I record the action, folder and hash. There's a lighter weight garbage collector that scans the journal, collapses the actions (to deal with delete from one folder, add to another and undelete) and then deletes the messages that were finally marked as deleted. It also does a sanity check to verify that the message is not currently listed in a folder.
So now I can take down spugmail and run the lightweight GC script to clean things up. Eventually I'll want to just do the journal collapsing in the client and write a "trash" index and delete the oldest ones in a cron job, but that can wait.
Now to finally remove those pesky GDBM files...