Epicycles

Yesterday I was asked to restore a backup for a Windows desktop, and I couldn't: I'd been backing up "Documents and Settings", not "Users". The former is appropriate for XP, which this workstation'd had at some point, but not Windows 7 which it had now. I'd missed the 286-byte size of full backups. Luckily the user had another way to retrieve his data. But I felt pretty sick for a while; still do.

When shit like this happens, I try to come up with a Nagios test to watch for it. It's the regression test for sysadmins: is Nagios okay? Then at least you aren't repeating any mistakes. But how the hell do I test for this case? I'm not sure when the change happened, because the full backups I had (going back three months; our usual policy) were all 286 bytes. I thought I could settle for "alert me about full backups under...oh, I dunno, 100KB." But a search for that in the catalog turns up maybe ten or so, nine of them legitimate, meaning an alert for this will give 90% false positives.

So all right, a list of exceptions. Except that needs to be maintained. So imagine this sequence:

  1. A tiny filesystem is being backed up, and it's on the don't-bug-me-if-it's-small list.
  2. It actually starts holding files, which are now backed up, so it's probably important.
  3. But I don't update the don't-bug-me-if-it's-small list.
  4. Something goes wrong and the backups go back to being small.
  5. Someone requests the restore, and I can't provide it.

I need some way of saying "Oh, that's unusual..." Which makes me think of statistics, which I don't understand very well, and I start to think this is a bigger task than I realize and I'm maybe trying to create AI in a Bash script.

And really, I've got don't-bug-me-if-this lists, and local checks and exceptions, and I've documented things as well as I can but it's never enough. I've tried hard to make things easy for my eventual successor (I'm not switching jobs any time soon; just thinking of the future), and if not easy then at least documented, but I have this nagging feeling that she'll look at all this and just shake her head, the way I've done at other setups. It feels like this baroque, Balkanized, over-intricate set of kludges, special cases, homebrown scripts littered with FIXMEs and I don't know what-all. I've got Nagios invoking Bacula, and Cfengine managing some but not all, and it just feels overgrown. Weedy. Some days I don't know the way out.

And the stupid part is that NONE OF THIS WOULD HAVE FIXED THE ORIGINAL PROBLEM: I screwed up and did not adjust the files I was backing up for a client. And that realization -- that after cycling through all these dark worryings about how I'm doing my job, I'm right back where I started, a gutkick suspicion that I shouldn't be allowed to do what I do and I can't even begin to make a go at fixing things -- that is one hell of a way to end a day at work.