Saint Aardvark's Axiom of Self-Righteous Anger

A user at work wanted to move from a desktop machine to a laptop. The Windows profile moved over just fine, so all that was left to do was copy over his outlook.pst. Only it turns out his desktop's hard drive has been quietly failing for a while, and there's some corruption right in his 1.2GB Outlook file. Well, fuck.

The Inbox Repair Tool is meant to help with this sort of thing. It took me a while to find a mention of that, longer to realize that it was actually called scanpst.exe, and even longer to decide that the Windows search tool wasn't going to find it in C:\Program Files\Common Files\MAPI\1033 -- a fact that is fucking buried in Microsoft's Office support section. (Why 1033? It's the Windows locale ID for US English.) Of course, it didn't work.

So okay, what about getting Outlook to export to another file? Good idea! Only it fails about 700MB through, and there's no indication what worked and what didn't -- so no chance for the user to decide if that's enough or not.

So what about exporting a subset of the folders, seeing what fails, and then repeating the process without the failing folder? A little tedious, sure, but it'll work, right? Wrong: you can export one folder, or you can export one folder and its subfolders, but you cannot export more than one folder at one time. Jesus fucking Christ!

Workaround for that was to copy folders (one at a fucking time, natch) to another folder (call it Backup) and try exporting that -- and then see what fails, yadda yadda. But of course that doesn't work either. You have to watch closely to see which folders are being exported, and anyway a folder may be displayed as being exported more than once, so you still don't know whether a given folder actually worked.

Plus, there was the failing hard drive (remember that?); I suspect that this new Backup folder was just getting thrown on the same crappy chunk of hard drive, making the export of the Backup folder fail in interestingly inconsistent ways. And of course, the whole process takes fifteen minutes to fail, during which time I can't do anything else and neither can the user.

And in the middle of my frustration and rage, an even greater rage welled up in me when I realized that Outlook had totally ruined this guy's email.

Think about it! Here's all this plain text email -- even attachments are encoded in ASCII -- and it has been completely fucking borked by being irretrievably (well, in this case anyway) converted to some proprietary binary format that is completely opaque to me, without at least the saving grace of having good tools for its manipulation available. Redundancy, ease of recovery and ease of manipulation have been thrown away for the sake of (let's be generous here) speed and functionality (indexing, correlation, etc). It's completely ridiculous.

This led to the formation of Saint Aardvark's Axiom of Information Utility:

Any sufficiently important information must be indistinguishable from plain text.

Plain text is redundant, easily (though not necessarily speedily) recognized by the human brain, and has many automated tools to deal with it (think of Unix). All these things make it very, very recoverable. If the information is that important, you need to be able to get at it even if there's a hardware failure. Binary formats throw that away, and that is simply wrong.
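To make the recoverability claim a bit more concrete, here's a rough Python sketch. It zeroes out a chunk of a plain text file, and of a zlib-compressed copy of the same bytes, as a stand-in for a bad patch of disk. Everything in it is made up for illustration: the function names (damage, salvage_text, try_decompress), the fake message and the damage offsets. It says nothing about how a PST file is actually laid out.

```python
import re
import zlib
from typing import List, Optional


def damage(data: bytes, start: int, length: int) -> bytes:
    """Simulate a bad disk sector by overwriting a span with zero bytes."""
    return data[:start] + b"\x00" * length + data[start + length:]


def salvage_text(data: bytes, min_run: int = 8) -> List[str]:
    """Pull out runs of printable ASCII, roughly what strings(1) does."""
    pattern = rb"[\x20-\x7e\t]{%d,}" % min_run
    return [run.decode("ascii") for run in re.findall(pattern, data)]


def try_decompress(data: bytes) -> Optional[bytes]:
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


# A fake 200-line "mailbox" (hypothetical content, 34 bytes per line).
message = b"".join(
    b"Line %03d: meet at noon on Tuesday\n" % i for i in range(200)
)

# Plain text: 100 bytes go bad, everything outside the hole is still readable.
recovered = salvage_text(damage(message, 1000, 100))
print(len(recovered), "readable runs salvaged from the damaged plain text")

# Compressed copy: a much smaller hole, but the stream no longer decodes.
packed = zlib.compress(message)
result = try_decompress(damage(packed, len(packed) // 2, 16))
print("compressed copy recovered" if result is not None else "compressed copy lost")
```

With the plain text you lose the lines the damage landed on and keep the rest. With the compressed copy, a far smaller hole is normally enough for zlib to refuse the whole stream, and getting anything back out takes a lot more than a regex.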

But what's a self-important axiom without an equally self-important corollary?

Any gains in the functionality or speed of information access must be obtained from derived versions of the original information, leaving the original in its plain text form.

I'm perfectly willing to give Outlook the benefit of the doubt in this case; having used a PDA for all of two weeks, I feel uniquely qualified to recognize the utility of having cross-referenced contacts, to-do lists, email, and so on. But this must not come at the expense of recovery!

Think of source code. It's possible to hack on a binary with a hex editor or a disassembler. You can even fix bugs or change the way a program works in this way. But you would never maintain a program in this way: it's hard to understand, it's easy to make a mistake, and it's hard to (say) port to a new language or hardware platform. That's what source code is for: it's easy to understand (assuming you're a programmer), and even if some of it gets garbled it's easy to recover. Plus, you can use tools like indent to change how it looks, or grep to pick out interesting bits, or tags to cross-reference function calls with their definitions.

Of course, you wouldn't try to run source code -- that's what a compiler is for. You gain speed by transforming the source code while still leaving that source code intact: nothing is lost in the process. And that's what Outlook should have done: compiled the plain text email into whatever database (I'm assuming) format Outlook likes, that allows Outlook to do Outlook stuff quickly, while still leaving the original source code -- the email -- intact.

Of course, you don't have to imagine recompiling Outlook's PST file each time; this'd be an incremental thing. And really, it shouldn't be that much different from what it does now...same speed, just a little more disk space taken up. And if the PST file gets borked, no matter -- the recovery tool is nothing more than a compiler that regenerates it from the original email.
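Here's a minimal sketch of that idea in Python, under some loudly stated assumptions: each message is kept as its own plain text .eml file in a directory, and the "compiled" form is nothing fancier than a JSON index mapping senders to filenames. The layout and the function names (build_index, load_index, add_message) are hypothetical, and none of this reflects how Outlook actually stores anything. The point is only the shape: originals are written first and never modified, the index is updated incrementally, and the recovery tool is just the compiler run again.

```python
import json
from email import message_from_bytes
from pathlib import Path
from typing import Dict, List

Index = Dict[str, List[str]]


def sender_of(raw: bytes) -> str:
    """Read the From: header out of a raw plain text message."""
    return str(message_from_bytes(raw).get("From", "(unknown)"))


def build_index(mail_dir: Path) -> Index:
    """'Compile' the index from scratch using only the original messages."""
    index: Index = {}
    for path in sorted(mail_dir.glob("*.eml")):
        index.setdefault(sender_of(path.read_bytes()), []).append(path.name)
    return index


def save_index(index_path: Path, index: Index) -> None:
    index_path.write_text(json.dumps(index), encoding="utf-8")


def load_index(mail_dir: Path, index_path: Path) -> Index:
    """Use the derived index if it is readable, otherwise regenerate it."""
    try:
        return json.loads(index_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or corrupt index: no matter, the originals are intact.
        index = build_index(mail_dir)
        save_index(index_path, index)
        return index


def add_message(mail_dir: Path, index_path: Path, name: str, raw: bytes) -> None:
    """Incremental update: store the original, then touch only its index entry."""
    index = load_index(mail_dir, index_path)
    (mail_dir / name).write_bytes(raw)  # the plain text original comes first
    names = index.setdefault(sender_of(raw), [])
    if name not in names:
        names.append(name)
    save_index(index_path, index)
```

If the index file gets trashed, the next load_index call quietly rebuilds it from the messages. The cost is what the paragraph above says: a bit more disk space, plus a full rebuild on the rare occasion the derived copy goes bad.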

As much as I'm picking on Outlook though, this isn't Outlook's problem alone. I've written before about how PHPWiki obscures the information it stores in MySQL. And I did a similar thing to myself years ago by compressing email, since I was running out of disk space. Somewhere along the way the files got corrupted, and I can't get that email back because gzip barfs on it.

And of course, this is just my opinion, formed in the heat of anger. It's almost certainly not a new idea, and might even be wrong. I'd love to hear some feedback on this.