The Life of a Sysadmin

Carousel is a lie!

Saint Aardvark's Axiom of Self-Righteous Anger
2006-02-04 13:14:28

A user at work wanted to move from a desktop machine to a laptop. The Windows profile moved over just fine, so all that was left to do was copy over his outlook.pst. Only it turns out his desktop's hard drive has been quietly failing for a while, and there's some corruption right in his 1.2GB Outlook file. Well, fuck.

The Inbox Repair Tool is meant to help with this sort of thing. It took me a while to find a mention of that, longer to realize that it was actually called scanpst.exe, and even longer to decide that the Windows search tool wasn't going to find C:\Program Files\Common Files\MAPI\1033 -- a fact that is fucking buried in Microsoft's Office support section. (Why 1033? It's the Windows locale ID for US English.) Of course, it didn't work.

So okay, what about getting Outlook to export to another file? Good idea! Only it fails about 700MB through, and there's no indication what worked and what didn't -- so no chance for the user to decide if that's enough or not.

So what about exporting a subset of the folders, seeing what fails, and then repeating the process without the failing folder? A little tedious, sure, but it'll work, right? Wrong: you can export one folder, or you can export one folder and its subfolders, but you cannot export more than one folder at one time. Jesus fucking Christ!

The workaround for that was to copy folders (one at a fucking time, natch) to another folder (call it Backup) and try exporting that -- and then see what fails, yadda yadda. But natch, that doesn't work either. You have to watch closely to see what folders are being exported, and anyway a folder may be displayed as being exported more than once, so you still don't know whether a given folder actually worked.

Plus, there was the failing hard drive (remember that?); I suspect that this new backup folder was just getting thrown on the same crappy chunk of hard drive, making the export of the Backup folder fail in interestingly inconsistent ways. And of course, the whole process takes fifteen minutes to fail, during which time I can't do anything else and neither can the user.

And in the middle of my frustration and rage, an even greater rage welled up in me when I realized that Outlook had totally ruined this guy's email.

Think about it! Here's all this plain text email -- even attachments are encoded in ASCII -- and it has been completely fucking borked by being irretrievably (well, in this case anyway) converted to some proprietary binary format that is completely opaque to me, without at least the saving grace of having good tools for its manipulation available. Redundancy, ease of recovery and ease of manipulation have been thrown away for the sake of (let's be generous here) speed and functionality (indexing, correlation, etc). It's completely ridiculous.

This led to the formation of Saint Aardvark's Axiom of Information Utility:

Any sufficiently important information must be indistinguishable from plain text.

Plain text is redundant, easily (though not necessarily speedily) recognized by the human brain, and has many automated tools to deal with it (think of Unix). All these things make it very, very recoverable. If the information is that important, you need to be able to get at it even if there's a hardware failure. Binary formats throw that away, and that is simply wrong.

But what's a self-important axiom without an equally self-important corollary?

Any gains in the functionality or speed of information access must be obtained from derived versions of the original information, leaving the original in its plain text form.

I'm perfectly willing to give Outlook the benefit of the doubt in this case; having used a PDA for all of two weeks, I feel uniquely qualified to recognize the utility of having cross-referenced contacts, to-do lists, email, and so on. But this must not come at the expense of recovery!

Think of source code. It's possible to hack on a binary with a hex editor or a disassembler. You can even fix bugs or change the way a program works in this way. But you would never maintain a program in this way: it's hard to understand, it's easy to make a mistake, and it's hard to (say) port to a new language or hardware platform. That's what source code is for: it's easy to understand (assuming you're a programmer), and even if some of it gets garbled it's easy to recover. Plus, you can use tools like indent to change how it looks, or grep to pick out interesting bits, or tags to cross-reference function calls with their definitions.

Of course, you wouldn't try to run source code -- that's what a compiler is for. You gain speed by transforming the source code while still leaving that source code intact: nothing is lost in the process. And that's what Outlook should have done: compiled the plain text email into whatever database (I'm assuming) format Outlook likes, that allows Outlook to do Outlook stuff quickly, while still leaving the original source code -- the email -- intact.

Of course, you don't have to imagine recompiling Outlook's PST file each time; this'd be an incremental thing. And really, it shouldn't be that much different from what it does now...same speed, just a little more disk space taken up. And if the PST file gets borked, no matter -- the recovery tool is nothing more than a compiler that regenerates it from the original email.
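
To make the compiler analogy concrete, here is a minimal sketch in Python. It has nothing to do with how Outlook actually works: it assumes each message lives in its own plain text .eml file in one directory, and the names build_index, load_index and index.json are made up for illustration. The index is a throwaway derived file, so a missing or corrupt one is simply rebuilt from the original email.

```python
import json
from email import message_from_string
from pathlib import Path


def build_index(mail_dir: Path) -> dict:
    """Compile a derived index (subject -> file names) from plain text mail."""
    index = {}
    for path in sorted(mail_dir.glob("*.eml")):
        # The originals are only read here, never modified.
        text = path.read_text(encoding="utf-8", errors="replace")
        subject = message_from_string(text).get("Subject", "")
        index.setdefault(subject, []).append(path.name)
    return index


def load_index(mail_dir: Path, index_path: Path) -> dict:
    """Load the derived index, regenerating it if it is missing or corrupt."""
    try:
        return json.loads(index_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # "Recovery" is just recompiling from the plain text originals.
        index = build_index(mail_dir)
        index_path.write_text(json.dumps(index), encoding="utf-8")
        return index
```

A real mail client would want an incremental update and a faster on-disk format than JSON, but the shape is the same: the fast thing is derived, and the plain text is the thing you keep.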

As much as I'm picking on Outlook though, this isn't Outlook's problem alone. I've written before about how PHPWiki obscures the information it stores in MySQL. And I did a similar thing to myself years ago by compressing email, since I was running out of disk space. Somewhere along the way the files got corrupted, and I can't get that email back because gzip barfs on it.
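
The gzip case is easy to reproduce in miniature. The following is only a sketch with made-up helper names (flip_middle_byte, surviving_lines, gzip_readable), and it damages a single byte rather than imitating whatever actually happened to my files. The idea is to apply the same damage to a plain text file and to its gzipped copy, then see how much of each is still usable.

```python
import gzip
import zlib


def flip_middle_byte(data: bytes) -> bytes:
    """Return a copy of data with the byte at the midpoint inverted."""
    damaged = bytearray(data)
    damaged[len(damaged) // 2] ^= 0xFF
    return bytes(damaged)


def surviving_lines(original: bytes, damaged: bytes) -> int:
    """Count lines that are identical at the same position in both copies."""
    pairs = zip(original.split(b"\n"), damaged.split(b"\n"))
    return sum(a == b for a, b in pairs)


def gzip_readable(blob: bytes) -> bool:
    """Report whether gzip can still decompress blob without complaint."""
    try:
        gzip.decompress(blob)
        return True
    except (OSError, EOFError, zlib.error):
        # Bad CRC, truncated stream, or an invalid deflate stream.
        return False
```

Run flip_middle_byte over a few hundred lines of text and surviving_lines should show nearly all of them intact; run it over gzip.compress of the same text and gzip_readable should come back False. You lose one line in the first case and, without special recovery tools, the whole file in the second.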

And of course, this is just my opinion, formed in the heat of anger. It's almost certainly not a new idea, and might even be wrong. I'd love to hear some feedback on this.

Comments On This Entry

Toby
Submitted at 14:07:53 on 01 March 2006
And HTTP intervenes to send our nonsense to blog comments. Sigh.
Josh Cheney
Submitted at 13:55:44 on 04 February 2006
I would very heartily agree with you on this point. I find that no matter how good of an exporter something claims to have, you will always have a need to access this information directly, and if the format that it is stored in is some sort of binary file, you are pretty much SOL. I think that vpopmail strikes a nice balance, and exemplifies that which you stated above. The password file for vpopmail, if I understand it correctly, and I may not, holds the passwords, mailbox size limits, etc, for each user. The actual file accessed in the course of checking mail is a binary file, but that binary file is built from a plain text file that can be read and edited by hand. Being able to edit the configuration of a program without using that program is one of the things that has made the various Unices my platform of choice, rather than some sort of philosophical love of free software. It isn't unusual for me to bork something up to the point where whatever program it is refuses to start, and if the configuration can only be interpreted by the program, then there is no way to fix that, short of reinstalling the program, and possibly losing data in the process. In general, yes, I would agree with your two axioms.
Arwen
Submitted at 23:13:42 on 04 February 2006
Agreed. And Agreed. As someone who has wrangled with a Borked Outlook .pst, and also on general premises.
Joy Gosai
Submitted at 20:40:01 on 06 February 2006
Your Holiness, I am a regular reader of your blog, something of an acolyte you might say. But when I tried to post my first comment today I was rudely shooed away by the daemons guarding the gate who said that my ISP-assigned address was on a spam blacklist. Getting a new address from the ISP, registering with Wordpress and even using an IPv6-to-IPv4 gateway did not help. While no one likes spam, may I humbly request some miracle to make life easier for us sinners: captcha for example. Here is what I wanted to say about your Outlook post: I can think of some counterexamples to your first axiom: filesystems, RDBMSs and IP packet headers all carry sufficiently important data and yet are binary only. In any one of these cases mirroring the binary-only data in a plain text format would not be acceptable for performance reasons. Moreover, we do not mind the binary-only formats since we have reliable tools for examining and manipulating the data and for converting into other formats. So my counter-axioms would be: (1) a new data format should be created only when all old data formats have been demonstrated to be unusable (2) the creator of a new data format must provide tools that allow the data to be manipulated with the greatest generality possible.
Jyotirmoy
Submitted at 21:16:03 on 06 February 2006
Saints intervene to transmit our words to heaven.
Bax
Submitted at 09:47:17 on 17 May 2006
Out of curiosity, why isn't his mail on a server somewhere? Only running POP?
Saint Aardvark
Submitted at 20:04:38 on 17 May 2006
'Cos I didn't have the foresight to set him up w/IMAP. Sigh.
Jyotirmoy
Submitted at 22:02:30 on 06 February 2006
Saints intervene to send our words to heaven.
Saint Aardvark
Submitted at 20:52:55 on 06 February 2006
> While no one likes spam, may I humbly request some miracle to make
> life easier for us sinners: captcha for example.
The image-based ones irritate me because they shut out folks with text browsers (like me) or the blind. But I used to have one in place that just had a very simple question...I'll look into this again.
>  Here is what I wanted to say about your Outlook post:
>  I can think of some counterexamples to your first axiom:
>  filesystems, RDBMSs and IP packet headers all carry sufficiently
>  important data and yet are binary only. In any one these cases
>  mirroring the binary-ony data in a plain text format would not be
>  acceptable for performance reasons
That's a very good point. I'd argue that IP packets are inherently transitory (weird things like storing files in SMTP transactions aside...see http://lcamtuf.coredump.cx/juggling_with_packets.txt), and not what I meant. However, I hadn't actually got so far as to think of that...it's only this example from you that prompts this thought. So thanks...and perhaps a good revision of the first axiom would be "Any *long-term storage* of sufficiently important information must be indistinguishable from plain text." That takes care of IP packets, but the others I don't have a good answer for. I'd thought of things like filesystems and databases, and as I couldn't get a clear idea of how to treat these things I skipped around it. And anyhow, I think you address it well with this point:
>  Moreover, we do not mind the
>  binary-only formats since we have reliable tools for examining and
>  manipulating the data and for converting into other formats.
Very true. One point that I had thought of when originally ranting about Outlook, but had failed to develop in the blog: by "indistinguishable from plain text", I'm not only trying to be cute and sound like Arthur C. Clarke. I mean that there are a couple of things that distinguish plain text: (1) it's prima facie identifiable and understandable; (2) there are excellent tools to, as you say, examine, manipulate and convert it. The first I wrote about, but the second I did not. I'd originally thought of extending the definition of "plain text" to perhaps include well-documented formats that come with good tools, but this got lost in the shuffle.
>  So my counter-axioms would be: (1) a new data format should be
>  created only when all old data formats have been demonstrated to be
>  unusable (2) the creator of a new data format must provide tools
>  that allow the data to be manipulated with the greatest generality
>  possible. 
Interesting. #2 I agree with, and had originally intended to write about as a corollary or consequence of the axioms I was proposing. #1...#1 I'm not so sure about, as it appears to place what seem to be arbitrary or unnecessary limits on programmers. Again, it might be better to say that a new data format *for long-term storage* should only be etc.