Epicycles

13 Mar 2013

Yesterday I was asked to restore a backup for a Windows desktop, and I couldn't: I'd been backing up "Documents and Settings", not "Users". The former is right for XP, which this workstation had run at some point, but not for Windows 7, which it runs now. I'd missed that the full backups were only 286 bytes each. Luckily the user had another way to retrieve his data. But I felt pretty sick for a while; still do.

When shit like this happens, I try to come up with a Nagios test to watch for it. It's the regression test for sysadmins: is Nagios okay? Then at least you aren't repeating any mistakes. But how the hell do I test for this case? I'm not sure when the change happened, because the full backups I had (going back three months; our usual policy) were all 286 bytes. I thought I could settle for "alert me about full backups under...oh, I dunno, 100KB." But a search for that in the catalog turns up maybe ten or so, nine of them legitimate, meaning an alert for this will give 90% false positives.

So all right, a list of exceptions. Except that needs to be maintained. So imagine this sequence:

1. A tiny filesystem is being backed up, and it's on the don't-bug-me-if-it's-small list.
2. It actually starts holding files, which are now backed up, so it's probably important.
3. But I don't update the don't-bug-me-if-it's-small list.
4. Something goes wrong and the backups go back to being small.
5. Someone requests the restore, and I can't provide it.
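
One way to keep that list from going stale, sketched in Python: have the check also complain when something on the list stops being small. Everything here is made up for illustration (the function name, the job names, the 100KB cutoff borrowed from above), and the sizes would have to come from the backup catalog.

```python
SMALL_BYTES = 100 * 1024  # hypothetical cutoff: the "100KB" figure from above


def stale_exceptions(exceptions, latest_sizes, threshold=SMALL_BYTES):
    """Return jobs on the don't-bug-me list whose latest full backup is no longer small.

    exceptions:   set of job names that are allowed to be tiny
    latest_sizes: dict mapping job name -> size in bytes of its latest full backup
    """
    return sorted(
        name for name in exceptions
        if latest_sizes.get(name, 0) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical job names and sizes, for illustration only.
    allowed_small = {"tinyfs", "scratch"}
    sizes = {"tinyfs": 5_000_000, "scratch": 286, "desktop": 10**9}
    for job in stale_exceptions(allowed_small, sizes):
        print(f"{job} is on the small list but is not small any more")
```

That turns step 3 into something a machine nags about instead of something I have to remember.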

I need some way of saying "Oh, that's unusual..." Which makes me think of statistics, which I don't understand very well, and I start to think this is a bigger task than I realize and I'm maybe trying to create AI in a Bash script.
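
The simplest version of "that's unusual" may not need much statistics: compare the newest full backup against the median of the earlier ones. This is a rough sketch in Python, not a real plugin. The function name, the 50% tolerance and the three-backup minimum are assumptions, and the sizes would have to be pulled out of the backup catalog somehow. The return codes follow the Nagios plugin convention (0 OK, 1 WARNING, 3 UNKNOWN).

```python
from statistics import median

# Standard Nagios plugin exit codes
OK, WARNING, UNKNOWN = 0, 1, 3


def check_backup_size(history, latest, tolerance=0.5, min_history=3):
    """Compare the latest full backup size to the median of earlier ones.

    history:     sizes in bytes of earlier full backups for one job
    latest:      size in bytes of the newest full backup
    tolerance:   allowed fractional deviation from the median (0.5 = 50%)
    min_history: how many earlier backups are needed before judging
    """
    if len(history) < min_history:
        return UNKNOWN, f"only {len(history)} earlier backups; not enough to judge"

    typical = median(history)
    # max(..., 1) avoids dividing by zero when the typical size is 0 bytes
    deviation = abs(latest - typical) / max(typical, 1)

    if deviation > tolerance:
        return WARNING, f"latest is {latest} bytes, typical is {typical:.0f} bytes"
    return OK, f"latest is {latest} bytes, within {tolerance:.0%} of typical"


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    code, message = check_backup_size([1000, 1100, 900, 1050], 286)
    print(code, message)
```

Note that a history of nothing but 286-byte backups looks perfectly normal to this check. It catches a change, not a job that was wrong for as long as the catalog remembers.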

And really, I've got don't-bug-me-if-this lists, and local checks and exceptions, and I've documented things as well as I can but it's never enough. I've tried hard to make things easy for my eventual successor (I'm not switching jobs any time soon; just thinking of the future), and if not easy then at least documented, but I have this nagging feeling that she'll look at all this and just shake her head, the way I've done at other setups. It feels like this baroque, Balkanized, over-intricate set of kludges, special cases, homebrewed scripts littered with FIXMEs and I don't know what-all. I've got Nagios invoking Bacula, and Cfengine managing some but not all, and it just feels overgrown. Weedy. Some days I don't know the way out.

And the stupid part is that NONE OF THIS WOULD HAVE FIXED THE ORIGINAL PROBLEM: I screwed up and did not adjust the files I was backing up for a client. And that realization -- that after cycling through all these dark worryings about how I'm doing my job, I'm right back where I started, a gutkick suspicion that I shouldn't be allowed to do what I do and I can't even begin to make a go at fixing things -- that is one hell of a way to end a day at work.

4 Comments

From: Frederic Woodbridge
13 March 2013 15:32:20

LOL!

I think you're being way too hard on yourself, but that's just me.

From: Zen Render
13 March 2013 15:43:06

Impostor effect (I've always called it Intruder Complex): the ugly big brother of Dunning-Kruger.
http://www.standalone-sysadmin.com/blog/2013/02/the-impostor-effect-vs-dunning-kruger/

Can Nagios be set to get upset if a backup's size changes day to day or week to week by more than some % of the last one?

If backup is bigger/smaller than 50% of what it was last week, let me know? So if someone drops an entire software repository on their desktop one day, you'll see it, but you'll also see it if they relocate their User folder to a new/larger drive that they just installed?

From: Saint Aardvark the Carpeted
15 March 2013 13:41:39

@Frederic: Thanks!

@Zen Render: Yes, I think I can accomplish that by getting Nagios to run judicious MySQL queries against Bacula's catalog DB. I'll be putting that in place shortly. But this is one of the things that makes me feel like I'm trying to write AI in a Bash script. :-\

From: Andrew
19 March 2013 21:44:37

Perhaps, unfortunately, this is something that Nagios monitoring cannot cure.
A program of random restores across everything you are backing up might help. We try and do one or two test restores a year, but we are not selecting them randomly.
If there is someone other than the backup operator who can do the practice restores that might be worthwhile.
This is time consuming, but I suppose so is writing convoluted Nagios checks.
