# sudo netstat -tupan | grep 9103 tcp 0 0 0.0.0.0:9103 0.0.0.0:* LISTEN -
So a while ago, I wrote about the li'l ol' laptop under the TV; an old, old Dell with a P3 processor that was finally coming to an end. Oliver Hookins, bless his heart, recommended the Zotac ZBox; after a bit of research, I agreed and bought the CI-320. (Sorry, Wout, but I wanted something with a bit more horsepower than a Banana Pi.) I bought 4GB of RAM for it, and I already had a 64GB SSD lying around. Debian installed on it w/o any problems whatsoever, and I migrated everything over a couple of weeks ago.
It's pretty wonderful, not least because it's completely silent. It's passively cooled, and with the SSD there are no moving parts. I've got an external HD attached to it via USB (though this thing has also got eSATA, GigE, wireless, HDMI...), and it does backup for the house. I finally got rid of my crappy, crappy-ass rsync wrapper and set up rsnapshot; I've been told to check out elkarbackup, a nice-looking web interface for it. (Now if I can only get off my butt and set up duply and offsite encrypted storage...)
And the name? Zombie.saintaardvarkthecarpeted.com. Zbox...what're you gonna do?
Canada's CSEC tracked travellers at Canadian airports who used the free WiFi. Not only that, tracked 'em afterward and backward as they showed up at other public hotspots across Canada. Oh, lovely.
A TSA screener explains: Yes, we saw you naked and we laughed.
ESR writes about dragging Emacs forward -- switching to git, and away from Texinfo, all to keep Emacs relevant. There are about eleven thousand comments. Quote:
And if the idea of RMS and ESR cooperating to subvert Emacs's decades-old culture from within strikes you as both entertaining and bizarrely funny...yeah, it is. Ours has always been a more complex relationship than most people understand.
My wife takes out our younger son's stuffed dogs for the day, and gets all the space she needs at Costco. WIN.
Looks like the supernova in Ursa Major has peaked at magnitude 10.5 or so.
Have I mentioned Adlibre backup before? 'Cos it's really quite awesome. Written in shell, uses rsync and ZFS to back up hosts. Simple and good.
Maclean's sent a sketch artist to cover Justin Bieber getting booked. I'd like to sketch that well.
Yesterday I was asked to restore a backup for a Windows desktop, and I couldn't: I'd been backing up "Documents and Settings", not "Users". The former is appropriate for XP, which this workstation'd had at some point, but not Windows 7 which it had now. I'd missed the 286-byte size of full backups. Luckily the user had another way to retrieve his data. But I felt pretty sick for a while; still do.
When shit like this happens, I try to come up with a Nagios test to watch for it. It's the regression test for sysadmins: is Nagios okay? Then at least you aren't repeating any mistakes. But how the hell do I test for this case? I'm not sure when the change happened, because the full backups I had (going back three months; our usual policy) were all 286 bytes. I thought I could settle for "alert me about full backups under...oh, I dunno, 100KB." But a search for that in the catalog turns up maybe ten or so, nine of them legitimate, meaning an alert for this will give 90% false positives.
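For the record, the catalog search I mean is just a query against Bacula's Job table; something like this sketch (it assumes a PostgreSQL catalog named "bacula", and 100KB is the arbitrary threshold from above):
#!/usr/bin/perl
# Sketch: list successful Full backups whose total size is suspiciously small.
# Assumes a PostgreSQL catalog named "bacula"; adjust the DSN to taste.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=bacula', 'bacula', '', { RaiseError => 1 });

my $rows = $dbh->selectall_arrayref(q{
    SELECT Name, JobId, EndTime, JobBytes
      FROM Job
     WHERE Level = 'F'            -- Full backups only
       AND JobStatus = 'T'        -- jobs that terminated OK
       AND JobBytes < 102400      -- "under...oh, I dunno, 100KB"
  ORDER BY EndTime DESC
});

printf "%-40s %8d %-20s %12d\n", @$_ for @$rows;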
So all right, a list of exceptions. Except that needs to be maintained. So imagine this sequence:
I need some way of saying "Oh, that's unusual..." Which makes me think of statistics, which I don't understand very well, and I start to think this is a bigger task than I realize and I'm maybe trying to create AI in a Bash script.
And really, I've got don't-bug-me-if-this lists, and local checks and exceptions, and I've documented things as well as I can but it's never enough. I've tried hard to make things easy for my eventual successor (I'm not switching jobs any time soon; just thinking of the future), and if not easy then at least documented, but I have this nagging feeling that she'll look at all this and just shake her head, the way I've done at other setups. It feels like this baroque, Balkanized, over-intricate set of kludges, special cases, homegrown scripts littered with FIXMEs and I don't know what-all. I've got Nagios invoking Bacula, and Cfengine managing some but not all, and it just feels overgrown. Weedy. Some days I don't know the way out.
And the stupid part is that NONE OF THIS WOULD HAVE FIXED THE ORIGINAL PROBLEM: I screwed up and did not adjust the files I was backing up for a client. And that realization -- that after cycling through all these dark worryings about how I'm doing my job, I'm right back where I started, a gutkick suspicion that I shouldn't be allowed to do what I do and I can't even begin to make a go at fixing things -- that is one hell of a way to end a day at work.
I have a love-hate relationship with Bacula. It works, it's got clients for Windows, and it uses a database for its catalog (a big improvement over what I'd been used to, back in the day, from Amanda...though that's probably changed since then). OTOH, it has had an annoying set of bugs, the database can be a real bear to deal with, and scheduling....oh, scheduling. I'm going to hold off on ranting on scheduling. But you should know that in Bacula, you have to be explicit:
Schedule {
Name = "WeeklyCycle"
Run = Level=Full Pool=Monthly 1st sat at 2:05
Run = Level=Differential Pool=Daily 2nd-5th sat at 2:05
Run = Level=Incremental IncrementalPool=Daily FullPool=Monthly 1st-5th mon-fri, 2nd-5th sun at 00:41
}
This leads to problems on the first Saturday of the month, when all those full backups kick off. In the server room itself, where the backup server (and tape library) are located, it's not too bad; there's a GigE network, lots of bandwidth, and it's a dull roar, as you may say. But I also back up clients on a couple of other networks on campus -- one of which is 100 Mbit. Backing up 12 x 500GB home partitions on a remote 100 Mbit network means a) no one on that network can connect to their servers anymore, and b) everything takes days to complete, making it entirely likely that something will fail in that time and you've just lost your backup.
One way to do that is to adjust the schedule. Maybe you say that you only want to do full backups every two months, and to not do everything on the same Saturday. That leads to crap like this:
Schedule {
Name = "TwoMonthExpiryWeeklyCycleWednesdayFull"
Run = Level=Full Pool=MonthlyTwoMonthExpiry 2nd Wed jan,mar,may,jun,sep,nov at 20:41
Run = Level=Differential Pool=Daily 2nd-5th sat at 2:05
Run = Level=Differential Pool=Daily 1st sat feb,apr,jun,aug,oct,dec at 2:05
Run = Level=Incremental IncrementalPool=Daily FullPool=MonthlyTwoMonthExpiry 1st-5th mon-tue,thu-fri,sun, 2nd-5th wed at 20:41
Run = Level=Incremental IncrementalPool=Daily FullPool=MonthlyTwoMonthExpiry 2nd Wed jan,mar,may,jun,sep,nov at 20:41
}
That is awful; it's difficult to make sure you've caught everything, and you have to do something like this for Thursday, Friday, Tuesday...
I guess I did rant about Bacula scheduling after all.
A while back I realized that what I really wanted was a queue: a list of jobs for the slow network that would get done one at a time. I looked around, and found Perl's IPC::DirQueue. It's a pretty simple module that uses directories and files to manage queues, and it's safe over NFS. It seemed a good place to start.
So here's what I've got so far: there's an IPC::DirQueue-managed queue that has a list of jobs like this:
I've got a simple Perl script that, using IPC::DirQueue, takes the first job and runs it like so:
open (BCONSOLE, "| /usr/bin/bconsole");
print BCONSOLE "run job=" . $job . " level=Full pool=Monthly yes";
close (BCONSOLE);
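That's just the bconsole end of it; the queue-handling half, fleshed out, looks roughly like this (a sketch rather than the actual script -- the queue path and the way the job name is stored are assumptions):
#!/usr/bin/perl
# Sketch of the queue runner: pick up the oldest queued job and hand it to
# bconsole.  Assumes each queue entry's data is just a Bacula job name.
use strict;
use warnings;
use IPC::DirQueue;

my $dq = IPC::DirQueue->new({ dir => '/path/to/job_queue' });

my $job = $dq->pickup_queued_job() or exit 0;   # nothing queued; nothing to do

open(my $fh, '<', $job->get_data_path()) or die "can't read job data: $!";
chomp(my $jobname = <$fh>);
close($fh);

open(BCONSOLE, "| /usr/bin/bconsole") or die "can't run bconsole: $!";
print BCONSOLE "run job=" . $jobname . " level=Full pool=Monthly yes\n";
close(BCONSOLE);

$job->finish();   # mark the queue entry as done so it isn't run again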
I've set up a separate job definition for the 100Mbit-network clients:
JobDefs {
Name = "100MbitNetworkJob
Type = Backup
Client = agnatha-fd
Level = Incremental
Schedule = "WeeklyCycleNoFull"
Storage = tape
Messages = Standard
Priority = 10
SpoolData = yes
Pool = Daily
Cancel Lower Level Duplicates = yes
Cancel Queued Duplicates = yes
RunScript {
RunsWhen = After
Runs On Client = No
Command = "/path/to/job_queue/bacula_queue_mgr -c %l -r"
}
}
"WeeklyCycleNoFull" is just what it sounds like: daily incrementals, weekly diffs, but no fulls; those are taken care of by the queue. The RunScript stanza is the interesting part: it runs baculaqueuemgr (my Perl script) after each job has completed. It includes the level of the job that just finished (Incremental, Differential or Full), and the "-r" argument to run a job.
The Perl script in question will only run a job if the one that just finished was a Full level. This was meant to be a crappy^Wsimple way of ensuring that we run Fulls one at a time -- no triggering a Full if an Incremental has just finished, since I might well be running a bunch of Incrementals at once.
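The guard itself is about as simple as it sounds; the top of the script does something like this (sketch):
# Sketch of the "only chain off a Full" guard in bacula_queue_mgr:
# -c carries Bacula's %l (the level of the job that just finished),
# -r means "go ahead and run the next queued job".
use strict;
use warnings;
use Getopt::Std;

my %opt;
getopts('c:r', \%opt);

# Don't kick off the next Full on the back of an Incremental or
# Differential -- a bunch of those may be finishing at the same time.
exit 0 unless $opt{r} && defined $opt{c} && $opt{c} eq 'Full';

# ...then fall through to the queue-pickup code shown earlier...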
It's not yet entirely working. It works well enough if I run the queue manually (which is actually tolerable compared to what I had before), but Bacula running the "bacula_queue_mgr" command does not quite work. The queue module has a built-in assumption about job lifetimes, and while I can tweak it to be something like days (instead of the default, which I think is 15 minutes), the script still notes that it's removing a lot of stale lockfiles, and there's nothing left to run because they're all old jobs. I'm still working on this, and I may end up switching to some other queue module. (Any suggestions, let me know; pretty much open to any scripting language.)
A future feature will be getting these jobs queued up automagically by Nagios. I point Nagios at Bacula to make sure that jobs get run often enough, and it should be possible to have Nagios' event handler enqueue a job when it figures it's overdue.
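The enqueue side should only be a few lines -- something like this, assuming the event handler gets passed the name of the overdue Bacula job, and with the queue path made up:
#!/usr/bin/perl
# Sketch of a Nagios event handler that queues up an overdue Bacula job.
# Assumes the job name arrives as the first argument.
use strict;
use warnings;
use IPC::DirQueue;

my $jobname = shift or die "usage: $0 <bacula job name>\n";

my $dq = IPC::DirQueue->new({ dir => '/path/to/job_queue' });
$dq->enqueue_string($jobname);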
For now, though, I'm glad this has worked out as well as it has. I still feel guilty about trying to duplicate Amanda's scheduling features, but I feel a bit like Macbeth: in blood stepped in so far....So I keep going. For now.
A user at $WORK was running a series of jobs on the cluster -- dozens at any moment. Other users have their quota set to 60 GB, but this user was not (long story). His home directory is at 400GB, but it was closer to a terabyte not so long ago....right when we had a hard drive and a tape drive fail at the same time on our backup server.
We do backups every night to tape using Bacula. Most backups are incremental (whatever changed since the last backup, usually the day before) and are small...maybe tens of GB per day. But backups for this user, because of the proliferation of logs from his jobs, were closer to the size of his home directory every day -- simply because all these log files were being updated as each job progressed.
Ordinarily this wouldn't be a problem, but the cluster of hardware failures have really fucked things up; they're better now, but I'm very slowly playing catchup backups. Eating a tape or more every day is not in my budget right this moment.
I asked him if any of the log files could be excluded from backups without any great loss. After talking it over with him, we came to this agreement:
This would exclude lots of other files like "1rep2.foo", "8rep9.log", etc, and would cut out about 200 GB of useless churn every day.
Bacula has the ability to do this sort of thing...but I found its methods somewhat counterintuitive, so I want to set down what I did and how I tested it.
First off, the original, let's-include-everything FileSet looked like this:
FileSet {
Name = "example"
Include {
File = /home/example
Options {
signature = SHA1
}
}
Exclude {
File = /proc
File = /tmp
File = /.journal
File = /.fsck
File = /.zfs
}
}
We back up everything under /home/example, we keep SHA1 signatures, and we exclude a handful of directories (most of which are boilerplate, applied to every FileSet by default).
In order to get Bacula to change the FileSet definition, you have to get the director to reload its configuration file. But some errors -- not all -- cause a running bacula-dir process to die. So before I started fiddling around, I added a Makefile to the /opt/bacula/etc directory that looked like this:
test:
@/opt/bacula/sbin/bacula-dir -t && echo "bacula-dir.conf looks good" || echo "problem with bacula-dir.conf"
reload: test
echo "reload" | /opt/bacula/sbin/bconsole
Whenever I made a change, I'd run "make reload", which would test the configuration first; if it failed, bacula would not be reloaded. (The "@" prefix in a Makefile keeps make from echoing the command itself.)
Next, I needed a listing of what we were backing up now, before I started fiddling with things:
echo "estimate job=fileserver-example listing" | bconsole > /tmp/listing-before
The "estimate" command gets Bacula to estimate how big the job is; the "listing" argument tells it to list the files it'd back up. By default it gives you the info for a full backup. (You can also append a joblevel, so you can see how big a Differential or Incremental; I didn't need that here, but it's worth remembering for next time.)
After that, I made another Makefile that looked like this:
test: estimate shouldwork shouldfail
estimate:
@echo "estimate job=fileserver-example listing" | bconsole > /tmp/listing-after ; wc -l /tmp/listing*
shouldwork: estimate
grep rep0 /tmp/listing-before | grep projects/output | while read i ; do grep -q $$i /tmp/listing-after || exit 1 ; done
shouldfail:
grep rep2 /tmp/listing-before | grep projects/output | while read i ; do if grep -q $$i /tmp/listing-after ; then exit 1 ; fi ; done
This is a little hackish, so in detail:
The estimate target gets an updated listing of what Bacula will back up; the line count lets me eyeball how it compares to the old, all-inclusive listing.
The shouldwork target gives me a quick way to make sure that all the files with "rep0" in the name and "projects/output" in the path are still in that updated listing. We grep for each of those files in the new listing; the loop either finds them all or exits with error code 1, which make catches and declares an error.
The shouldfail target is the mirror image: it makes sure that files with "rep2" in the name are excluded from the new listing. Here the loop exits with an error (again caught by make) the moment it finds one of those files in the new listing, and falls through successfully if none turn up. That's what turns this test into a "MUST NOT".
Anyhow: after each change, I'd run "make reload" as root to make sure that the syntax worked. After that, I'd run "make test" as an ordinary user (no need for root privileges) to make sure that I was on the right track. After a while, I got this:
FileSet {
Name = "example"
Include {
File = /home/example
Options {
signature = SHA1
Wilddir = /home/example/projects/output
Exclude = yes
}
}
Include {
File = /home/example/projects/output
Options {
WildFile = "*rep0*"
Signature = SHA1
}
Options {
Exclude = yes
RegexFile = ".*"
}
}
Exclude {
File = /proc
File = /tmp
File = /.journal
File = /.fsck
File = /.zfs
}
}
Again, this is a little counterintuitive to me, so here's how it works out.
The first "Include" stanza is the same, except that in the "Options" section we're excluding "/home/example/projects/output". That's what the "Wilddir" and "Exclude = yes" directives are for.
The second "Include" stanza puts the "/home/example/projects/output" back in, but modified with two "Options" sections: the first to include "rep0" (a simple fileglob) and the second to exclude everything. What ends up being included by this stanza is the union of those two options: only files named "rep0" in the directory "/home/example/projects/output".
Last, the third stanza is our standard "Exclude" boilerplate.
After I was confident that I had the right set of files excluded, I sent the user a list of files to confirm that all was well:
cat /tmp/listing-before | while read i ; do grep -q $i /tmp/listing-after || echo $i ; done > /tmp/excluded
Now, I'm the first to admit that that is ugly. Diff, useless use of cat...lots of objections to raise. But it's been a long day and I got what I wanted. I pointed the user at it, made sure it was okay, and committed the changes.
All in all, this gave me a good loop for testing: it caught fatal errors before they happened, it let me be sure I was excluding the right things, and I was able to work in a stepwise fashion to get where I wanted.
This is an attempt to lay out my problems with Bacula, and to be explicit about what I hope to achieve by replacing it (if, in fact, I do go ahead with that). If I'm wrong, correct me.
Too many long jobs monopolize spool space, storage job slots, and generally hold up production.
My largest jobs right now are around 1-2 TB -- and in order to accomplish that, I need to manually split up filesystems using a messy syntax. A job running that long will cycle through spooling and despooling many, many times. During spooling, it takes up spool space and one of the storage daemon's job slots. During despooling, no other job can despool to that tape drive. Often, this ends up holding up a lot of other jobs. If there's a problem, I'm faced with a choice between killing a job that's been running for days, or letting lots of other stuff go without backups until/unless it finishes.
More generally, I'm faced with a choice between letting everything run forever at the beginning of the month (because it's simplest to schedule fulls for the first Saturday or some such), or juggling schedules manually to stagger things (which I'm doing now, and leads to schedules like FullBackupSecondSundayAfterLent).
Possible fixes:
Bacula seems to get confused easily about what tapes are available for use.
Bacula's storage daemon seems to often hold on to outdated info about what tapes are in what state.
Example: the daily pool is full, so jobs are halted. Status storage shows it's waiting for a volume to be created for the daily pool. I move a volume from another pool, then have to attempt to mount it manually in the appropriate drive -- the storage daemon doesn't pick up on this change automatically.
Sometimes this works, and sometimes it doesn't. Sometimes both are waiting for a tape from the same pool; creating one doesn't let the jobs queued up on the other drive run on that new tape, but rather you need to create a second new tape and mount it. On top of that, sometimes the jobs hang around on the storage daemon still waiting for a new tape -- or something...because they don't get out of the way, and let other jobs run in their place, unless they're cancelled (and sometimes only when bacula-sd is restarted).
This may be fixed with the upgrade to 5.2.6. However....
The new version of Bacula crashes when I run too many jobs at once.
That's 5.2.6, which I upgraded to from 5.0.2 (time got away on me, yes). And by too many I mean, like, 50. That's not too many! I'm not sure what the hell's going on, though at least now I have a backtrace. I'm seriously pissed off about this point. Yes, I'll file a bug, but this is annoying.
All in all, I spend far too much time babysitting Bacula.
It's extremely high maintenance, and that's pissing me off. Understand, this is coming after a long weekend spent babysitting it, trying to make sure some jobs got written. There are other problems at work, yes, but this is not meant to be so hard.
Periodically I remove tapes at $WORK from our tape library to keep them somewhere else. getmonthlytapes is a Perl script that helps me do just that. Released under the GPL; share and enjoy!
I've got a tape library at work with two tape drives. Today, one of the drives was doing (full) backups and the second was free for a restore job. However, when that restore job ran, I got this error:
JobId 62397: Forward spacing Volume "000039" to file:block 7:0.
JobId 62397: Error: block.c:1016 Read error on fd=7 at file:blk 3:0 on device "Drive-0" (/dev/nst1). ERR=Input/output error.
JobId 62397: End of Volume at file 3 on device "Drive-0" (/dev/nst1), Volume "000039"
JobId 62397: Fatal error: acquire.c:72 Acquire read: num_writers=1 not zero. Job 62397 canceled.
JobId 62397: Fatal error: mount.c:844 Cannot open Dev="Drive-0" (/dev/nst1), Vol=000039
JobId 62397: End of all volumes.
JobId 62397: Error: Bacula cbs-01-dir 5.0.2 (28Apr10): 03-May-2011 12:09:20
The problem wasn't that it encountered the end of the volume -- the job spanned a number of volumes, so that was okay.
No, the problem was that after the restore job had run, a number of other regular backups had started. These were incrementals, and thus were unable to use the first drive. When the restore job ran into the EOM on the first volume, it appears to have released the drive -- at which point the incrementals started up and denied the use of the second drive to the restore job. The restore job promptly gave up and called it an error.
As I was in a hurry, I tried killing off the incrementals and re-running the restore job. This worked just fine. Arguably it's a bug, but I suspect I just need to tweak the priority for restore jobs instead.
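If I go the priority route, it's a one-line change in the restore job definition -- lower numbers run first in Bacula, so something like this sketch (made-up job name) should let restores jump ahead of queued backups:
Job {
Name = "RestoreFiles"
Type = Restore
# everything else as before...
Priority = 5     # the backup jobs here run at Priority = 10, so restores win
}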
(Two entries in one day...woot!)
I came across this tip on an old posting to the Bacula mailing list. To determine if exclusions in a fileset are working, run these commands in bconsole:
@output some-file
estimate job=<job-name> listing level=Full
@output
The file will contain a list of files Bacula will include in the backup.
(Incidentally, I came across this while trying to figure out why my exclusions weren't working; turned out I needed to remove the trailing slash in my directory names in the Exclude section.)
I'm trying to get Bacula to make a separate copy of monthly full backups that can be kept off-site. To do this, I'm experimenting with its "Copy" directive. I was hoping to get a complete set of tapes ready to keep offsite before I left, but it was taking much longer than anticipated (2 days to copy 2 tapes). So I cancelled the jobs, typed "unmount" at bconsole, and went home thinking Bacula would just grab the right tape from the autochanger when backups came.
What I should have typed was "release". "release" lets Bacula grab whatever tape it needs; "unmount" leaves Bacula unwilling to do anything on its own, and it waits for the operator (ie, me) to do something.
Result: 3 weeks of no backups. Welcome back, chump.
There are a number of things I can do to make sure this doesn't happen again. There's a thread on the Bacula-users mailing list (came up in my absence, even) detailing how to make sure something's mounted. I can use "release" the way Kern intended. I can set up a separate check that goes to my cell phone directly, and not through Nagios. I can run a small backup job manually on Fridays just to make sure it's going to work. And on it goes.
I knew enough not to make changes as root on Friday before going on vacation. But now I know that includes backups.
I've been setting up some new VMs for a separate project at work. I've realized that this is painful for two reasons: Bacula and Nagios.
Both are important...can't have a service without monitoring, and can't have a machine without backups. But each of these is configured by vast files; Bacula's is monolithic (the director's, anyhow, which is where you add new jobs) and Nagios' is legion. And they're hard to configure automagically with sed/awk/perl or cfengine; their stanzas span lines, and whitespace is important.
I've recently added a short script to my Nagios config; it regenerates a file that monitors all the Bacula jobs and makes sure they happen often enough. This is good...and I want more.
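Roughly, the idea is something like this (a stripped-down sketch, not the actual script; "check_bacula_job", the host name and the output path are placeholders):
#!/usr/bin/perl
# Sketch: regenerate a Nagios config fragment with one service per Bacula
# job, using bconsole's ".jobs" command to get the list of job names.
use strict;
use warnings;

my @lines = `echo .jobs | /opt/bacula/sbin/bconsole`;
chomp @lines;
# keep only lines that look like job names, not the bconsole banner/prompt
my @jobs = grep { /^[\w.-]+$/ } @lines;

# wherever your nagios.cfg picks up extra config
open(my $out, '>', '/etc/nagios/conf.d/bacula_jobs.cfg') or die "can't write config: $!";
for my $job (@jobs) {
    print $out <<"EOF";
define service {
    use                     generic-service
    host_name               backup-server
    service_description     Bacula job $job
    check_command           check_bacula_job!$job
}
EOF
}
close($out);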
I found pynag, a Python module to parse and configure Nagios files. This is a start. I've had problems getting its head around my config files, because it didn't understand recursion in hostgroups (which I think is a recent feature of Nagios) or a hostname equal to "*". I've got the first working, and I'm banging my head against the second. The three books I got recently on Python should help (wow, IronPython looks nice).
There are a lot of example scripts with pynag. None do exactly what I want, but it looks like it should be possible to generate Nagios config files from some kind of list of hosts and services. This would be a big improvement.
But then there's Augeas, which does bi-directional parsing of config files. Have a look at the walk-through...it's pretty astounding. I realized that I've been looking for something like this for a long time: an easier way of managing all sorts of config files. Cfengine (v2 to be sure) just isn't cutting it anymore for me.
Now, the problem with Augeas for my present task is that there isn't anything in the current tree that does what I want, either. There is a commit for parsing nagios.cfg -- not sure if it's been released, or if it will parse everything in a Nagios config_dir. There's nothing for Bacula, either. This will mean a lot more work to get my ideal configuration management tool.
(On a side note, my wife said something to me the other day that was quite striking: I need tasks that can be divvied up into 45-minute chunks. That's how much free time I've got in the morning, bus rides to and from work, and the evening. Commute + kids != long blocks of free time.)
And I've got a congenital weakness for grand overarching syntheses of all existing knowledge...or at least big tasks like managing config files. So I'm trying to be aware of my brain.
...and there's son #2 waking up. Time to post.
I think I've finally figured out what's going on with my bacula-sd hangs. At the risk of repeating myself, this post is adapted from the bug report I've just filed.
Here's the situation: with the latest Bacula release (5.0.1), I regularly see bacula-sd hang when running jobs; it happens often and at seemingly random times; and when this happens, I see two bacula processes, a parent and a child. (Since bacula is multi-threaded, I usually just see one storage daemon process.) This came to my attention when I came back from vacation to find a week's worth of backups stalled (sigh).
When bacula-sd hangs, the traceback of the relevant thread in the parent process looks like this:
Thread 10 (Thread 0x466c0940 (LWP 12926)):
#0 0x00000035aa4c5f3b in read () from /lib64/libc.so.6
#1 0x00000035aa46cc07 in _IO_new_file_underflow (fp=<value optimized out>) at fileops.c:590
#2 0x00000035aa46d5ce in _IO_default_uflow (fp=<value optimized out>) at genops.c:435
#3 0x00000035aa468e8b in _IO_getc (fp=<value optimized out>) at getc.c:41
#4 0x00002b76479565c0 in bfgets (s=0x466bf710 "", size=<value optimized out>, fd=0x60080a0) at bsys.c:617
#5 0x000000000040d432 in release_device (dcr=0x60988b8) at acquire.c:533
[snip]
Here, bacula's storage daemon has just finished running a job, and before it releases the tape drive to someone else it runs the "Alert" command. This is specified in the config file for the storage daemon, and is meant to see if the drive has, say, run out of magnetism during the last job. Here's the source code in stored/acquire.c:
alert = get_pool_memory(PM_FNAME);
alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
bpipe = open_bpipe(alert, 0, "r");
if (bpipe) {
while (fgets(line, sizeof(line), bpipe->rfd)) { /* AardvarkNote: This is where the parent hangs */
Jmsg(jcr, M_ALERT, 0, _("Alert: %s"), line);
}
status = close_bpipe(bpipe);
}
Meanwhile, the child process stack looks like this:
Thread 1 (Thread 0x466c0940 (LWP 13000)):
#0 0x00000035aa4df9ee in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00000035aa4d06a5 in _L_lock_1206 () from /lib64/libc.so.6
#2 0x00000035aa4d05be in closelog () at ../misc/syslog.c:419
#3 0x00002b764795cd35 in open_bpipe (prog=<value optimized out>, wait=0, mode=<value optimized out>) at bpipe.c:138
#4 0x000000000040d3f1 in release_device (dcr=0x60988b8) at acquire.c:531
[snip]
open_bpipe() can be found in lib/bpipe.c; it's a routine for forking a child process (FORESHADOWING: not another thread!) and setting up a pipe between parent and child. The relevant bits look like this:
/* Start worker process */
switch (bpipe->worker_pid = fork()) {
[snip]
case 0: /* child */
if (mode_write) {
close(writep[1]);
dup2(writep[0], 0); /* Dup our write to his stdin */
}
if (mode_read) {
close(readp[0]); /* Close unused child fds */
dup2(readp[1], 1); /* dup our read to his stdout */
dup2(readp[1], 2); /* and his stderr */
}
/* AardvarkNote: This is where the child hangs: */
closelog(); /* close syslog if open */
for (i=3; i<=32; i++) { /* close any open file descriptors */
close(i);
}
execvp(bargv[0], bargv); /* call the program */
closelog() itself is simple: it "closes the current Syslog connection, if there is one." But running strace on the child process just shows a lot of futex calls...nothing very useful there at all. So what the hell is going on, and why is it hanging at closelog()?
Some background info on threads: In Linux at least, they're user-land things, and the kernel doesn't know about them. To the kernel, it's just another process with a PID. Implementing threads is left as an exercise to the reader...or in this case, to glibc and NPTL.
Since threads are part of the same process, they share memory. (A fork(), by contrast, copies over parent memory to the child -- and then the child has its own copy of everything, separate from the parent.) glibc/NPTL implements locks for certain things to make sure that one thread doesn't stomp all over another thread's memory willy-nilly. And those locks are done with futexes, which are provided by the kernel...which explains why it would show up in strace, which tracks system calls.
Why is this relevant? Because in the glibc code, closelog() looks like this:
void
closelog ()
{
/* Protect against multiple users and cancellation. */
__libc_cleanup_push (cancel_handler, NULL);
__libc_lock_lock (syslog_lock); /* AardvarkNote: This is where things hang */
closelog_internal ();
LogTag = NULL;
LogType = SOCK_DGRAM; /* this is the default */
/* Free the lock. */
__libc_cleanup_pop (1);
}
That __libc_lock_lock (syslog_lock) call is there to prevent two threads trying to mess with the syslog file handle at one time. Sure enough, the info for frame #2 of the child process shows that the process is trying to get the syslog_lock mutex:
#2 0x00000035aa4d05be in closelog () at ../misc/syslog.c:419
419 __libc_lock_lock (syslog_lock);
ignore1 = <value optimized out>
ignore2 = <value optimized out>
ignore3 = <value optimized out>
As noted in the mailing list discussion, closelog() should be a no-op if there's no descriptor open to syslog. However, in my case there is such a file descriptor, because I've got bacula-sd configured to log to syslog.
Well, as the Bible notes, mixing fork() and threading is problematic:
There are at least two serious problems with the semantics of fork() in a multi-threaded program. One problem has to do with state (for example, memory) covered by mutexes. Consider the case where one thread has a mutex locked and the state covered by that mutex is inconsistent while another thread calls fork(). In the child, the mutex is in the locked state (locked by a nonexistent thread and thus can never be unlocked). Having the child simply reinitialize the mutex is unsatisfactory since this approach does not resolve the question about how to correct or otherwise deal with the inconsistent state in the child.
And hey, doesn't that sound familiar?
Now that I had an idea of what was going on, I was able to find a number of similar problems that people have encountered:
This post describes a multi-threaded app that hung when a signal was received during a syslog() call.
This thread from the libc-alpha mailing list describes a similar problem:
> The particular problem I'm seeing is with syslog_lock.
>
> If another thread is writing to syslog when the fork happens,
> the syslog_lock mutex is cloned in a locked state.
This Debian bug describes a multithreaded app deadlocking when syslog() is called after a fork().
The Bible describes the POSIX-ish way around this, the pthread_atfork() call.
This blog entry has an overview of the problem with threads and fork(), and of pthread_atfork().
So: what seems to be happening is that, when the stars align, the child process is being created at the same moment that the parent process is logging to syslog. The child is created with a locked syslog_lock mutex, but without the thread that had been holding it...and thus, without anything that can release it. The child blocks waiting for the mutex, the parent blocks on the child, and backup jobs halt (well, at least the spooling of jobs to tape) until I kill the child manually.
This was complicated to find for a number of reasons:
My gut feeling (see also: handwavy assertion) is that logging to syslog is relatively unusual, which would explain why this problem has taken a while to surface.
It's a race condition that's exacerbated by having multiple jobs running at once; I'd only recently implemented this in my Bacula setup.
btraceback, a shell wrapper around gdb that comes with Bacula, is meant to run the debugger on a hung Bacula process. I used it to get the tracebacks shown above. It's great if you don't know what you're doing with gdb (and I certainly don't!). But it has a number of problems.
btraceback is usually called by the unprivileged bacula user. That meant I had to modify the script to run gdb with sudo, and change sudoers to allow bacula to run "sudo gdb" without a password. Otherwise, the dump was useless -- particularly frustrating before I'd figured out how to duplicate the problem.
btraceback can be triggered by sending a fatal signal to the bacula-sd process (say, "kill -6"). That's great when you notice a hang and can run the script. But it won't trace child processes -- which was where the interesting things were -- and it was a while before it occurred to me to do that manually.
Bacula has an --enable-lockmgr option that's meant to catch deadlocks, kill itself and run btraceback. However, it's not enabled by default in the SRPMs I was using to build bacula, and in any case it watches for deadlocks on Bacula's own locks -- not external locks like syslog_lock.
So what to do?
For now, I'm removing the log-to-syslog option from bacula-sd.conf. When I do that, I see no problems at all running jobs.
On the programming side -- and keep in mind I'm no programmer -- it looks like there are a number of options here:
Don't call closelog() after fork() in open_bpipe(). This may leave an open descriptor to syslog, which may be a problem. (Or maybe it's just ugly. Don't know.)
Don't fork() a child process when running the alert command, but start another thread instead. I have no idea why a fork() is preferred over a new thread in this scenario, but I presume there's a good reason.
Use pthread_atfork() to set things up for a fork(). That's what The Bible says to do, but I don't know what Bacula would actually need to do with it in order to make this deadlock go away.
Good lord, I'm closing in on 1700 words here. If you've stuck with me this long, I hope it was interesting; once I got over the panic it was a fascinating problem, and it's taught me a lot about processes, threads and deadlocks. And how can that be a bad thing?
I've been off on vacation for a week while my parents have been visiting. It's been fun, relaxing, and generally a good time. But today it was back to work.
I found out that backups have not been running since the day after I went on vacation. There were something like 650 jobs in Bacula stacked up, waiting to run; the storage daemon was stuck trying to do something, and nothing was getting done.
Near as I can tell, the storage daemon was stuck in a deadlock. I've got some backtraces, I've posted to the devel mailing list, and it looks like there's a bug, recently fixed, that addresses the problem. Inna meantime I've put in Nagios monitoring for the size of the run queue, and I'm figuring out how to watch for the extra bacula-sd process that showed up.
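For the extra-process check, something like this ought to do (a sketch; it assumes exactly one bacula-sd process is the norm):
#!/usr/bin/perl
# Sketch of a Nagios check: complain if there's more (or less) than one
# bacula-sd process running.
use strict;
use warnings;

my @pids = `pgrep -x bacula-sd`;
chomp @pids;

if (@pids == 0) {
    print "CRITICAL: no bacula-sd process running\n";
    exit 2;
}
elsif (@pids > 1) {
    # a second process has, so far, meant a hung child
    print "WARNING: ", scalar(@pids), " bacula-sd processes running: @pids\n";
    exit 1;
}
print "OK: one bacula-sd process (pid @pids)\n";
exit 0;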
Tonight, at least, the same problem happened again. This is good, because now I have a chance to repeat it. Got another backtrace, killed the job, and things are moving on. Once things are finished here, I think I'm going to try running that job again and seeing if I can trigger the problem.
Sigh. This was going to be a good day and a good post. But it's late and I've got a lot to do.
I mentioned that I've been having problems with Bacula recently. These have been aggravated by the fact that the trigger seems to be a job that takes 53 hours to finish.
Well, I think I've got a handle on one part of the problem. See, when Bacula is doing this big job, other jobs stack up behind it -- despite having two tape drives, and two separate pools of tapes, and concurrent jobs set up, the daily jobs don't finish. The director says this:
9279 Full BackupCatalog.2010-02-20_21.10.00_10 is waiting for higher priority jobs to finish
9496 Full BackupCatalog.2010-02-23_21.10.00_13 is waiting execution
9498 Full bigass_server-d_drive.2010-02-24_03.05.01_15 is running
9520 Increme little_server-var.2010-02-24_21.05.00_38 is waiting on Storage tape
9521 Increme little_server-opt.2010-02-24_21.05.00_39 is waiting on max Storage jobs
but storage says this:
Running Jobs:
Writing: Full Backup job bigass_server-d_drive JobId=9498
Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1)
spooling=1 despooling=0 despool_wait=0
Files=708,555 Bytes=1,052,080,331,191 Bytes/sec=11,195,559
FDReadSeqNo=22,294,829 in_msg=20170567 out_msg=5 fd=16
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=156 Bytes=3,403,527,093 Bytes/sec=72,415
FDReadSeqNo=53,041 in_msg=52667 out_msg=9 fd=9
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=9 Bytes=183,606 Bytes/sec=3
FDReadSeqNo=72 in_msg=50 out_msg=9 fd=10
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=5 Bytes=182,029 Bytes/sec=3
FDReadSeqNo=45 in_msg=32 out_msg=9 fd=19
Writing: Incremental Backup job other_little_server-var JobId=9525 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=0 Bytes=0 Bytes/sec=0
FDSocket closed
Out of desperation I tried running "unmount" for the drive holding the daily tape, thinking that might reset things somehow...but the console just sat there, and never returned a prompt or an error message. Meanwhile, storage was logging this:
cbs-01-sd: dircmd.c:218-0 <dird: unmount SL-500 drive=1
cbs-01-sd: dircmd.c:232-0 Do command: unmount
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-0
cbs-01-sd: dircmd.c:617-0 Device SL-500 drive wrong: want=1 got=0 skipping
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-1
cbs-01-sd: dircmd.c:612-0 Found changer device Drive-1
cbs-01-sd: dircmd.c:625-0 Found device Drive-1
cbs-01-sd: block.c:133-0 Returning new block=39cee10
cbs-01-sd: acquire.c:647-0 JobId=0 enter attach_dcr_to_dev
...and then just hung there. "Aha, race condition!" I thought, and sure enough a bit of searching found this commit in November: "Fix SD DCR race condition that causes seg faults". No, I don't have a segfault, but the commit touches the last routine I see logged (along with a buncha others).
This commit is in the 5.0.1 release; I wasn't planning to upgrade to this just yet, but I think I may have to. But I'm going on vacation week after next, and I'm reluctant to do this right before I'm away for a week. What to do, what to do...
Backups: Bacula has been giving me problems the last week or so. I've got this file server I'm trying to back up; it's got a 2TB partition, and I've been naively trying to just grab it all in one go. Partly that's because it hasn't been backed up before, and I figured this'd be the quickest, simplest way to get going.
What's happened is that after slurping 2 TB over a 100 Mbit connection (no, there's no way to make that quicker), which takes 53 hours, the writing to tape fails for reasons I've yet to figure out. Bacula doesn't say "Oh, the first bit worked so I can just grab that next time...." (To be fair, that's probably a much harder problem than I imagine.) And in the meantime, despite having two drives and two pools of tapes, backups for other stuff pile up behind this big backup and then don't work: they get put on spool space, but then despooling to tape fails.
Contact manglement: I've been looking for a contact management program for $WORK. Requirements:
This turns out to be surprisingly hard to find, and not just because Freshmeat's interface is terrible. Applications appear to fall into n categories:
So now I'm trying to decide between using Dadabik, which'll let me make a frontend w/o much work as long as I can come up with a schema, or modifying one of the complete-but-bletcherous apps and getting a prettier page. (I'm always paranoid about people refusing to use a web-based tool because it isn't pretty enough; I don't know how to make it prettier and it's not something I personally care about enough to do something about, so I'm caught between don't care and don't know how to fix it if I do care. As a result I panic.)
Family: Son #2 went to the hospital Sunday night with his mom; he's fine, but I was up 'til they got back at midnight. Still got up at 5:30am as usual, thinking I'd catch up last night. Then Son #1 had a bad nightmare last night and it took a while to get him calmed down. Spent a couple hours after that staring at the ceiling, trying to get myself calmed down. Still up at 5:30am as usual.
Dentist: Root canal didn't work. My former dentist, who is the second most graceless dentist I've ever seen, couldn't get through and referred me to an endodontist (someone who does root canals; thank you, Wikipedia). My appointment for them is on April 1st.
And that is that.
Ran into a little problem this week when I tried to do a restore from a backup at work. Bacula loaded the tape, then said it couldn't read the label. Wha?
After much investigation, during which I completely neglected to cut-n-paste the error messages, I think I've figured out what happened:
Ack. Needless to say, this was not good. Fortunately, the file in question was not a terribly important one; unfortunately, that's about the last 2 weeks of incrementals gone. Lesson learned: don't assume your backup program knows what's going on when hardware reboots from under it.
In other news: on Thursday I got 5 new Dell servers. Woot! One of 'em will be our new LDAP/web/email/FTP server (Xen ftw!); the rest are going to be running protein search engines for various researchers across BC. They're racked and I'm stoked, except that it turns out the difference between the DRAC6 Express and Enterprise, besides a few hundred dollars, is that the Enterprise does console redirection and the Express doesn't. Dammit.
I'm going to see if there's any trickery that can be done, but I'm not holding out hope. I have got a 32-port console server, but it's two racks away...might have to run a small batch o' cables up and over to make this work.
First, it occurred to me today that the problems I've been having with bacula-sd dying or becoming unresponsive may be because of the way Nagios has monitored it. I've been using the check_tcp plugin, and when I looked on the backup machine there were, at one point, 21 connections to the sd port. Half were from the monitoring machine and were in the CLOSE_WAIT state. The max concurrent jobs for -sd is set to 20. I've turned off Nagios monitoring for now; we'll see how that does.
Second -- edit: sorry, stupid error. I withdraw the point.
Weird...Just ran into a problem with restarting bacula-sd. For some reason, the previous instance had died badly and left a zombie process. I restarted bacula-sd but was left with an open port:
# sudo netstat -tupan | grep 9103 tcp 0 0 0.0.0.0:9103 0.0.0.0:* LISTEN -
which meant that bconsole hung every time it tried to get the status of bacula-sd. Unsure what to do, I tried telnetting to it for fun and then quit; after that the port was freed up and grabbed by the already-running storage daemon:
tcp 0 0 0.0.0.0:9103 0.0.0.0:* LISTEN 16254/bacula-sd
and bconsole was able to see it just fine:
Connecting to Storage daemon tape at bacula.example.com:9103
example-sd Version: 3.0.1 (30 April 2009) x86_64-example-linux-gnu example
Daemon started 06-Jul-09 10:18, 0 Jobs run since started.
Heap: heap=180,224 smbytes=25,009 max_bytes=122,270 bufs=94 max_bufs=96
Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8
I've run into an interesting problem with the new backup machine.
It's a Sun X4240 with 10 x 15k disks in it: 2 x 73GB (mirrored for the OS) and 8 x, um, a bunch (250GB?), RAID0 for Bacula spooling. (I want fast disk access, so RAID0 it is.) RAID is taken care of by an onboard RAID card, so these look like regular disks to Linux.
Now the spool disk works out to about 2.2TB or so — which is big enough to make baby fdisk cry:
WARNING: The size of this disk is 2.4 TB (2391994793984 bytes). DOS partition table format can not be used on drives for volumes larger than 2.2 TB (2199023255040 bytes). Use parted(1) and GUID partition table format (GPT).
Well, okay, haven't used parted before but that's no reason to hold back. I follow directions and eventually figure out that "mkpart gpt ext3 0 2392G" will do what I want. GPT? Piece of cake! And then I rebooted, and I couldn't boot up again. Blank screen after the POST. Crap!
The first time this happened, the reboot also coincided with some additional problems during the POST where too many cards were trying to shove their ROM into the BIOS memory (or some such); I thought the two were connected. But then I did it again today, and I finally started digging.
The problem is that parted overwrites the MBR when setting up a GPT disklabel. This has been noted and argued over. My understanding of the two sides of the debate is:
Meanwhile, the parted camp has a number of bugs dealing with this very issue, two opened a year ago, and none have any response in them.
This enterprising soul submitted a patch back in December 2008, which appears to have fallen to the floor.
As for me, I was able to convince the BIOS to boot from the smaller disk, and then get a rescue CentOS image going via PXE booting, and then reinstall grub on the smaller disk. Sorted. All I had to do was change "root (hd1,0)" to "root (hd0,0)" in grub.conf.
A touch anti-climactic after all that, perhaps. But it was interesting a) to learn about all this (I hadn't really thought about successors to the DOS partition format before), and b) to see what a slender thread we (okay, I) hang our hopes on sometimes. It's a necessary, sobering thing to realize how much of what I use, depend on, believe in is created by volunteers who are smart, hard-working people — they argue and focus and forget just like real people, not inhabitants of some shining city on a hill I sometimes take them for ("Next beer in Jerusalem!").
"label barcode" just failed miserably. (Neat command that.) And I had thought that DTE meant the arm, but no: upon reflection, it's a subtle/obtuse (not the right word, but oh well) way of referring to the tape drive itself.
This sounds like when I was at my previous employer and they asked if I could develop a web-based system to take surveys. I nearly said, "yes" because, well, I know perl, I know CGI, and I could do it. However, I was smart enough to say "no, but surveymonkey.com will do it for cheap." Best of all it was self-service and the HR person was able to do it entirely without me. If I had said I could write such a program, it would have been days of back-and-forth changes which would have driven me crazy. Instead, she was happy to be empowered to do it herself. In fact, doing it herself without any help became a feather in her cap.
The lesson I learned is that "can I do it?" includes "do I want to do it?". If I can do something but don't want to, the answer is, "No, I don't know how" not "I know how but don't want to". The first makes you look like you know your limits. The latter sounds like you are just being difficult.
Gave a tour of the new server room today to about 30-odd people in the department. Ended on a bit of a low note ("and that's the end! Any questions?") but other than that it went well. Even got an ounce of champagne at the end of it.
Oh, and yesterday I found out that our SL-500 has three fibre channel interfaces, compared to the one interface in the server we bought. I think the sales folks assumed we had a fibre switch, and I didn't realize it all (data + control) wouldn't go over one cable. Arghh.
Just saw a character named Terence on "Entourage" who was not Terrance Stamp. Now I want to see "Bowfinger" and "The Limey", in that order.
I'm testing Bacula 3; the new release has just come out, and I'm very much looking forward to rolling it out here.
One of the things I've been doing is trying to get TLS working, which I utterly failed at in my last job. I must've failed to see these pages, which a) point out that the otherwise-excellent Bacula manual is (ahem) sparing when it comes to TLS, and b) you need to put the cert files in places that strike me as unexpected.
Thus, in bacula-dir.conf you put the directives listing the director's cert/key in the client section — IOW, you say "and use this key/cert combo when connecting to client foo." Meanwhile, on client foo, you add the client's cert/key directives in the director section ("and use this key/cert when talking to the director"), along with things like the CA cert and required CNs.
Oh, and did you know that you can debug SSL handshakes with openssl? True story.
I can't believe it...my youngest son, after nearly three weeks of being up four or five times each night, slept nearly all the way through without a break: he only woke up at 1am and 5:15am, which is close enough to my usual wakeup time as makes no difference. It was wonderful to have a bit of sleep.
This comes after staying up late (11pm!) on Sunday bottling the latest batch of beer, a Grapefruit Bitter recipe from the local homebrew shop. You know, it really does taste like grapefruit, and even this early I'm really looking forward to this beer.
My laptop has a broken hinge, dammit. I carry it around in my backpack without any padding, so I guess I'm lucky it's lasted this long. Fortunately the monitor still works and mostly stays upright. I've had a look at some directions on how to replace it; it looks fiddly, but spending $20 on a new set of hinges from eBay is a lot more attractive than spending $100. Of course, the other consideration is whether I can get three hours to work on it….But in the meantime, I've got it on the SkyTrain for the first time in a week; it's been hard to want to do anything but sleep lately.
Work is still busy:
I'm trying to get tinyMCE and img_assist to work with Drupal
Contacting vendors to look at backup hardware. So far we're looking at the Dell ML6010 and the Sun SL500. They're both modular, which is nice; we've got (low) tens of TB now but that'll ramp up quickly. The SL500 seems to have some weird things; according to this post, it takes up to 30 minutes to boot (!) and you can't change its IP address without a visit from the service engineer (!!). Those posts are two years old, so perhaps things have changed.
Trying to figure out what we want for backup software, too. I'm used to Bacula (which works well with the ML6010) and Amanda, but I've been working a little bit with Tivoli lately. One of the advantages of Tivoli is the ease of restoring it gives to the users…very nice. I'm reading Backup and Recovery again, trying to get a sense of what we want, and reviewing Preston's presentation at LISA06 called "Seriously, tape-only backup systems are dead". So what do we put in front of this thing? Not sure yet…
Speaking of Tivoli, it's suddenly stopped working for us: it backed up filesystems on our Thumper just fine (though we had to point it at individual ZFS filesystems, rather than telling it to just go), then stopped; it hangs on files over a certain size (somewhere around 500kb or so) and just sits there, trying to renew the connection over and over again. I've been suspecting firewall problems, but I haven't changed anything and I can't see any logged blocked packets. Weird.
Update: turned out to be an MTU problem:
I had no idea there were GigE NICs that did not support Jumbo frames. Though maybe that's just the OpenBSD driver for it. Hm.
Matt asked how Amanda worked for people, and whether they'd recommend anything else. I tried to leave a comment, but Blogger's CAPTCHA (god, I hate that acronym) never seems to work for me. So here goes. (Irony of a man w/an email-based comment system complaining about someone else's left as exercise f/t reader.)
Amanda: Nice, but: At my last job (2.5 years ago now), we started running into problems when backing up a 1TB RAID5 array...simple Promise disk array, nothing special or terribly fast. Amanda would take hours to do an estimate of the backups…which, since Amanda tries to pack tapes as full as it can, it does all the time. This got to be a huge pain, and we didn't find a solution to this problem before I left. (We were using GNU tar for Amanda; not sure if that had anything to do with it, and I can't remember what the alternatives were…maybe dump? Dunno.) Not sure what the current state is.
Bacula: +1 on the nice. Very, very good at my current job; absolutely no problems with it at all. And the documentation is enough to cry for, it's so complete and wonderful and thorough and accurate and well done. Clients for Unix, Windows, and Mac. Total filesystems here are…uh…less than 1TB, definitely, although it's creeping up there. So the smaller size may have something to do with it.
Work...hell, life is busy these days.
At work, our (only) tape drive failed a couple of weeks ago; Bacula asked for a new tape, I put it in, and suddenly the "Drive Error" LED started blinking and the drive would not eject the tape. No combination of power cycling, paperclips or pleading would help. Fortunately, $UNIVERSITY_VENDOR had an external HP Ultrium 960 tape drive + 24 tapes in a local warehouse. Hurray for expedited shipping from Richmond!
Not only that, the Ultrium 3 drive can still read/write our Ultrium 2 media. By this I mean that a) I'd forgotten that the LTO standard calls for R/W for the last generation, not R/O, and b) the few tests I've been able to do with reading random old backups and reading/writing random new backups seem to go just fine.
Question for the peanut gallery: Has anyone had an Ultrium tape written by one drive that couldn't be read by another? I've read about tapes not being readable by drives other than the one that wrote it, but haven't heard any accounts first-hand for modern stuff.
Another question for the peanut gallery: I ended up finding instructions from HP that showed how to take apart a tape drive and manually eject a stuck tape. I did it for the old Ultrium 2. (No, it wasn't an HP drive, but they're all made in Hungary...so how many companies can be making these things, really?) The question is, do I trust this thing or not? My instinct is "not as far as I can throw it", but the instructions didn't mention anything one way or the other.
In other news, $NEW_ASSIGNMENT is looking to build a machine room in the basement of a building across the way, and I'm (natch) involved in that. Unfortunately, I've never been involved in one before. Fortunately, I got training on this when I went to LISA in 2006, and there's also Limoncelli, Hogan and Chalup to help out. (That link sends the author a few pennies, BTW; if you haven't bought it yet, get your boss to buy it for you.)
As part of the movement of servers from one data centre across town to new, temporary space here (in advance of this new machine room), another chunk of $UNIVERSITY has volunteered to help out with backups by sucking data over the ether with Tivoli. Nice, neighbourly thing of them to do!
I met with the two sysadmins today and got a tour of their server room. (Not strictly necessary when arranging for backups, but was I gonna turn down the chance to tour a 1500-node cluster? No, I was not.) And oh, it was nice. Proper cable management...I just about cried. :-) Big racks full of blades, batteries, fibre everywhere, and a big-ass robotic Ultrium 2 tape cabinet. (I was surprised that it was 2, and not U3 or U4, but they pointed out that this had all been bought about four or five years ago…and like I've heard about other government-funded efforts, there's millions for capital and little for maintenance or upgrades.)
They told me about assembling most of it from scratch...partly for the experience, partly because they weren't happy with the way the vendor was doing it ("learning as they went along" was how they described it). I urged them to think about presenting at LISA, and was surprised that they hadn't heard of the conference or considered writing up their efforts.
Similarly, I was arranging for MX service for the new place with the university IT department, and the guy I was speaking to mentioned using Postfix. That surprised me, as I'd been under the impression that they used Sendmail, and I said so. He said that they had, but they switched to Postfix a year ago and were quite happy with it: excellent performance as an MTA (I think he said millions of emails per day, which I think is higher than my entire career total :-) and much better Milter performance than Sendmail. I told him he should make a presentation to the university sysadmin group, and he said he'd never considered it.
Oh, and I've completely passed over the A/C leak in my main job's server room…or the buttload of new servers we're gonna be getting at the new job…or adding the Sieve plugin for Dovecot on a CentOS box...or OpenBSD on a Dell R300 (completely fine; the only thing I've got to figure out is how it'll handle the onboard RAID if a drive fails). I've just been busy busy busy: two work places, still a 90-minute commute by transit, and two kids, one of whom is about to wake up right now.
Not that I'm complaining. Things are going great, and they're only getting better.
Last note: I'm seriously considering moving to Steve Kemp's Chronicle engine. Chris Siebenmann's note about the attraction of file-based systems for techies is quite true, as is his note about it being hard to do well. I haven't done it well, and I don't think I've got the time to make it good. Chronicle looks damn nice, even if it does mean opening up comments via the web again…which might mean actually getting comments every now and then. Anyhow, another project for the pile.
I was able to get Quickbooks 2007 working with a non-admin account today…woot! Here's what I did:
This isn't ideal — the explorer process in QB is still running privileged — but at least that's the only IE process running as admin.
And Bacula: tripped over a small thing. I'm running the btape utility to make sure our tape drive works with it. I ran bfill, rather than fill, then wondered why I got errors at the end. Turns out to be an old command that probably shouldn't be around anymore.
Now to run fill…another couple hours to go.
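For reference, a btape run looks roughly like this (the config path and device name are assumptions, not necessarily what we've got here):

    # point btape at the storage daemon config and the tape device
    btape -c /etc/bacula/bacula-sd.conf /dev/nst0
    # then, at the btape prompt:
    *test       # basic compatibility checks first
    *fill       # write until the tape is full, then read it all back -- slow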
Came across a mention of BSDstats.org on the Dragonfly BSD Digest, and I've set it up on my home machine. There are a ton of FreeBSD machines, and only 64 OpenBSD clients reported…time to change that!
I'm reading the documentation for Bacula right now, and it's amazing. Clearly written, thorough and extensive — almost 800 pages long. I'm very impressed.
At work we use Amanda for backups, and it's pretty good -- but for various reasons we don't use the Amanda server/client on every single machine. For those exceptions, we point Amanda at the host it's running on, where we keep copies of the important stuff via rsync. This usually works pretty well, and it also fits in well with our other backup mechanism: the copy of yesterday. This is a copy of home directories and some other things, updated with rsync every morning at 3am. It gives people an easy way to get something they had yesterday, which means fewer trips to the backup tapes.
We also do a couple of sets of backups with Amanda: daily, where we let Amanda juggle full vs. incremental in the usual way, and weekly/monthly, where we tell Amanda to just do full backups. For those, we point Amanda at the copy of yesterday rather than grabbing full backups over the network.
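To make that concrete, here's a rough sketch of the moving parts -- the paths and the disklist entry are hypothetical, not our actual config:

    # root's crontab on the backup server: refresh the copy of yesterday at 3am
    0 3 * * *    rsync -a --delete remotehost:/home/ /backup/yesterday/home/

    # Amanda's weekly/monthly disklist then points at the local copy,
    # not at the live host over the network:
    # backupserver.example.com  /backup/yesterday/home  comp-user-tar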
I've run into problems over the last few weeks, though, where weekly backups have failed for a few home directories -- the fullest ones, natch. It's taken me a while to figure out what's going on, but I think I've got a handle on it finally.
Full weekly/monthly backups take a while to do -- typically two full days, because we don't have an automated tape changer. While they're running, I let regular backups pile up on the holding disk (close to half a terabyte available), then flush them when the weeklies are done. Here's the error that amstatus shows:
wait for dumping driver: (aborted:nak error: amandad busy)
Thanks to this post (Nabble? Never heard of 'em...) I finally clued in to the obvious: Amanda sometimes asks the local host for backups twice -- once as part of a daily backup, and once as part of a weekly backup. If this is right (why haven't I come across this more often?) it's going to cause pain. We don't have a tape changer, so backups just plain take a long time; there's no one here at 3am to switch a tape. I'm uncomfortable with the idea of turning off regular backups for two days a week. I really don't want to have to come in on weekends to switch tapes. Hm.
Maybe I'll look at just letting weeklies dump to disk over the weekend, then flush 'em during the week. That might work pretty well, actually.
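Something like this is what I'm picturing (a sketch only -- the config name, holding-disk path, sizes, and the reserve setting are all assumptions):

    # amanda.conf for the weekly config
    holdingdisk hd1 {
        directory "/amanda/holding"
        use 400 Gb
    }
    reserve 0      # let full dumps land on the holding disk even with no tape mounted

    # Friday night, from cron: dump everything to the holding disk
    amdump weekly
    # Monday morning, with a fresh tape in the drive: flush it all to tape
    amflush weekly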
In other news, got a bunch of HP workstations in from CDW, and I'm quite happy with them. At last count, the company has 879 people starting in January (no, not really), and the idea of setting up that many Shuttles (my usual workstation of choice) and manually installing XP on each (no, no automated install yet) just filled me w/dread. The HPs are nice, very well put together (they're built like fucking tanks and weigh just as much), and they come with XP Pro installed. But hey, the manager wanted 160GB drives and these came with 80GB drives. What to do?
Turns out you can take out the old, put in the new (bigger) drive, and just use the restore disk. Boo hiss restore disks with no full copy of the OS, but damn it's nice: very few questions, and when you're done you're ready to go. And by "ready to go" I mean of course "ready to turn off all the crap, turn on other crap and install even more crap". I've either got to swallow my pride and get an AD controller in here (Noooooooo!) or else figure out some other way of automating all this.
...when the passing of a fire truck in the middle of the night means you obsess for half an hour about getting backup tapes out of your apartment.
Add this to amanda.conf to get better formatted reports:
columnspec "Disk=1:18,HostName=0:10,OutKB=1:10,DumpRate=1:10,TapeRate=1:10"
I got the iBook, I got the Slashdot t-shirt, I got the beard...but do you think I can get a wireless signal? Oh no. Thanks, Broadcom. But hey, enough complaining. Time for an update.
The wireless ISP is gonna do a point-to-point link between windows of our old and new temporary offices. Should give us 100Mb/s access or so. Which is good, because for a while I thought I'd have to walk down to London Drugs, grab some Linksys routers, and install my own firmware to do it. Which would have been a lot of fun...but would have been a fuck of a lot to get ready in, like, three days. Now I just have to get OpenVPN talking at either end, get Shaw installed, and set up a firewall. Oboy.
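The OpenVPN end of it should be the easy part -- a static-key point-to-point tunnel is only a few lines per side. A sketch, with made-up hostnames, addresses, and key path:

    # generate a shared key once and copy it to both ends
    openvpn --genkey --secret /etc/openvpn/static.key

    # office A (listens):
    dev tun
    ifconfig 10.8.0.1 10.8.0.2
    secret /etc/openvpn/static.key

    # office B (connects out):
    remote office-a.example.com
    dev tun
    ifconfig 10.8.0.2 10.8.0.1
    secret /etc/openvpn/static.key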
And then there's the troubles I've been having with our backup server. A while back I decided to start racking all the boxes we've been using as servers -- transfer the hard drives to proper servers, then use the old shell as a desktop for a new hire. Welp, the backup server was the first to go, and man it's been a headache.
First off, I didn't take care of cooling properly, and the tape drive (HP Ultrium 215, for those paying attention) suffered a nice little nervous breakdown and kept spitting out the tape. I tried downloading the HP diagnostic tool, but it only runs on Linux and the server runs FreeBSD -- neither Linux compatibility mode (not surprising) nor a Knoppix disk (kept hanging) allowed it to work. So I had no real idea what was going on other than the drive was too hot for my liking.
But HP, bless their souls, came to the rescue. Once I made it through their speech recognition voicemail tree hell, they just sent out another one -- they didn't even bitch about not being able to run the diagnostic tool. Not only that, it came the next day, and we don't even have any special contract with them -- that's just warranty. Thumbs up for them.
But now I've got different problems: the damn machine keeps seizing up on me. See, I've got this 500GB concatenated Vinum array of three disks that I use as a copy of yesterday's home directory for people, and I'm trying to move it to a four-disk RAID5 volume on the Promise array. I tried using rsync, and eventually it just froze. I thought maybe rsync was spending too much CPU time figuring out what to transfer, so today I tried using dump | restore -- and sure enough, it froze again.
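For the record, the dump | restore attempt was the usual one-liner -- roughly this, with hypothetical mount points:

    # level-0 dump of the old Vinum-backed filesystem, piped straight into
    # restore on the new RAID5 volume
    cd /promise/yesterday && dump -0af - /yesterday | restore -rf -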
I plugged in a monitor, hoping for a panic or something, but nope -- just unresponsive. I've found some mention in the FreeBSD mailing lists about possible problems with write caching and the Adaptec 3960D SCSI controller (which I thought was a 39160 SCSI controller, but I guess not). I'll have to see if that does the trick or not -- but in the meantime I'm wondering how I'm gonna get yesterday on the Promise. Of course, figuring out why it's crashing in the first place would be even better...
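If it does turn out to be the disks' write cache, camcontrol can at least show and toggle the WCE bit on FreeBSD -- a sketch, assuming the disk shows up as da0:

    # show the caching mode page (page 8) for the disk
    camcontrol modepage da0 -m 8
    # edit it interactively and set WCE to 0 to disable the write cache
    camcontrol modepage da0 -m 8 -e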
But it's not all bad news: earlier this week, the support manager at Promise that I've been dealing with called to tell me that the word had come down from on high. Yep...Promise is going to follow the GPL and properly release the Linux and Busybox source code for the firmware that goes into the VTrak 15100. Hurray! I'll have to watch, of course, and make sure it shows up...but it sounds good. "Let's put it this way," said the manager. "It's on my desk for me to do. And I don't want it there for long." To the home front, now.
As if I didn't have enough on the go, I've blown my tax return on the makings of a MythTV backend: 2.4GHz P4, umpty-GB hard drive, the PCHDTV-300 (get it while you can!), generic 128MB Nvidia (no onboard video on this mobo, or I would've stuck with that), a Hauppauge PVR-500MCE, and a nice Asus mobo in an Antec case to tie it all together. Random notes:
And now for something completely different: new mottoes for Harley Davidson:
"Harley-Davidson: Because social contracts are for weak pussy-ass losers with small dicks."
"Harley-Davidson: Because those other people aren't really human. Not like you and me."
"Harley-Davidson: You deserve it. So do they."
"Harley-Davidson: Because if you pissed in their faces, you'd be arrested."
"Harley-Davidson: Because 'Fuck you!' is just too damned hard to remember."
"Harley-Davidson: Because 'Fuck you!' is just too damned eloquent."