Queues for Bacula

I have a love-hate relationship with Bacula. It works, it's got clients for Windows, and it uses a database for its catalog (a big improvement over what I'd been used to, back in the day, from Amanda...though that's probably changed since then). OTOH, it has had an annoying set of bugs, the database can be a real bear to deal with, and scheduling....oh, scheduling. I'm going to hold off on ranting on scheduling. But you should know that in Bacula, you have to be explicit:

Schedule {
  Name = "WeeklyCycle"
  Run = Level=Full Pool=Monthly 1st sat at 2:05
  Run = Level=Differential Pool=Daily 2nd-5th sat at 2:05
  Run = Level=Incremental IncrementalPool=Daily FullPool=Monthly 1st-5th mon-fri, 2nd-5th sun at 00:41
}

This leads to problems on the first Saturday of the month, when all those full backups kick off. In the server room itself, where the backup server (and tape library) are located, it's not too bad; there's a GigE network, lots of bandwidth, and it's a dull roar, as you may say. But I also back up clients on a couple of other networks on campus -- one of which is 100 Mbit. Backing up 12 x 500GB home partitions over a remote 100 Mbit network (12.5 MB/s at best, so more than five days if those partitions were full) means a) no one on that network can connect to their servers anymore, and b) everything takes days to complete, making it entirely likely that something will fail in that time and you'll have lost your backup.

One way to deal with that is to adjust the schedule. Maybe you say that you only want to do full backups every two months, and not to do everything on the same Saturday. That leads to crap like this:

Schedule {
  Name = "TwoMonthExpiryWeeklyCycleWednesdayFull"
  Run = Level=Full Pool=MonthlyTwoMonthExpiry 2nd Wed jan,mar,may,jul,sep,nov at 20:41
  Run = Level=Differential Pool=Daily 2nd-5th sat at 2:05
  Run = Level=Differential Pool=Daily 1st sat feb,apr,jun,aug,oct,dec at 2:05
  Run = Level=Incremental IncrementalPool=Daily FullPool=MonthlyTwoMonthExpiry 1st-5th mon-tue,thu-fri,sun, 2nd-5th wed at 20:41
  Run = Level=Incremental IncrementalPool=Daily FullPool=MonthlyTwoMonthExpiry 2nd Wed jan,mar,may,jul,sep,nov at 20:41
}

That is awful; it's difficult to make sure you've caught everything, and you have to do something like this for Thursday, Friday, Tuesday...

I guess I did rant about Bacula scheduling after all.

A while back I realized that what I really wanted was a queue: a list of jobs for the slow network that would get done one at a time. I looked around, and found Perl's IPC::DirQueue. It's a pretty simple module that uses directories and files to manage queues, and it's safe over NFS. It seemed a good place to start.

So here's what I've got so far: there's an IPC::DirQueue-managed queue that holds the list of jobs to run.

I've got a simple Perl script that, using IPC::DirQueue, takes the first job and runs it like so:

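# Pipe the run command to bconsole; the trailing "yes" skips the confirmation prompt.
open(my $bconsole, '|-', '/usr/bin/bconsole') or die "Can't start bconsole: $!";
print {$bconsole} "run job=$job level=Full pool=Monthly yes\n";
close($bconsole) or warn "bconsole exited with status $?";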
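
For comparison, here is a minimal Python sketch of the same step. It assumes bconsole lives at /usr/bin/bconsole, as in the Perl snippet above; the function names build_run_command and run_full_backup are made up for illustration, and the command string is the one the Perl snippet sends.

```python
import subprocess

BCONSOLE = "/usr/bin/bconsole"  # same path as in the Perl snippet


def build_run_command(job, level="Full", pool="Monthly"):
    """Return the bconsole command that starts one backup job."""
    # The trailing "yes" tells bconsole to skip its confirmation prompt.
    return f"run job={job} level={level} pool={pool} yes\n"


def run_full_backup(job, bconsole=BCONSOLE):
    """Pipe the run command to bconsole and return its exit status."""
    result = subprocess.run(
        [bconsole],
        input=build_run_command(job),
        text=True,
        check=False,  # let the caller decide what a non-zero status means
    )
    return result.returncode
```

Keeping the command string in its own function makes it easy to check what would be sent without actually starting a backup.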

I've set up a separate job definition for the 100Mbit-network clients:

JobDefs {
  Name = "100MbitNetworkJob"
  Type = Backup
  Client = agnatha-fd
  Level = Incremental
  Schedule = "WeeklyCycleNoFull"
  Storage = tape
  Messages = Standard
  Priority = 10
  SpoolData = yes
  Pool = Daily
  Cancel Lower Level Duplicates = yes
  Cancel Queued Duplicates = yes
  RunScript {
    RunsWhen = After
    Runs On Client = No
    Command = "/path/to/job_queue/bacula_queue_mgr -c %l -r"
  }
}

"WeeklyCycleNoFull" is just what it sounds like: daily incrementals, weekly diffs, but no fulls; those are taken care of by the queue. The RunScript stanza is the interesting part: it runs bacula_queue_mgr (my Perl script) after each job has completed. The command line passes the level of the job that just finished (Incremental, Differential or Full) via "-c %l", and the "-r" argument to run a job.

The Perl script in question will only run a job if the one that just finished was a Full level. This was meant to be a crappy^Wsimple way of ensuring that we run Fulls one at a time -- no triggering a Full if an Incremental has just finished, since I might well be running a bunch of Incrementals at once.
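
The decision described above can be sketched in a few lines of Python. This is not the actual Perl script, just an illustration of the rule; it assumes the same -c and -r flags that appear in the RunScript command, and the function names are hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse the flags used in the RunScript command: -c LEVEL and -r."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", dest="level", required=True,
                        help="level of the job that just finished")
    parser.add_argument("-r", dest="run", action="store_true",
                        help="start the next queued job if appropriate")
    return parser.parse_args(argv)


def should_start_next(args):
    """Chain a new Full only when the job that just ended was itself a Full."""
    return args.run and args.level.strip().lower() == "full"
```

With this rule, a finished Incremental or Differential never pulls anything off the queue, so Fulls only ever follow other Fulls (or a manual kick-off).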

It's not yet entirely working. It works well enough if I run the queue manually (which is actually tolerable compared to what I had before), but having Bacula run the bacula_queue_mgr command does not quite work. The queue module has a built-in assumption about job lifetimes, and while I can tweak it to be something like days (instead of the default, which I think is 15 minutes), the script still notes that it's removing a lot of stale lockfiles, and there's nothing left to run because they're all old jobs. I'm still working on this, and I may end up switching to some other queue module. (Any suggestions, let me know; pretty much open to any scripting language.)
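
One alternative that sidesteps job lifetimes entirely is a file-per-job queue where claiming a job is just a rename from one directory to another. The sketch below is a hypothetical stand-in, not IPC::DirQueue: the class name, directory names, and file naming scheme are all invented for illustration, and it assumes the queue lives on a single local filesystem (behaviour over NFS is not something this sketch has been checked against).

```python
import os
import time
import uuid


class DirQueue:
    """File-per-job queue with no job lifetime: entries wait until claimed."""

    def __init__(self, root):
        self.queued = os.path.join(root, "queued")
        self.active = os.path.join(root, "active")
        os.makedirs(self.queued, exist_ok=True)
        os.makedirs(self.active, exist_ok=True)

    def enqueue(self, job_name):
        # Write under a dot-name first, then rename, so a reader
        # never sees a half-written entry.
        name = f"{time.time_ns():020d}-{uuid.uuid4().hex}"
        tmp = os.path.join(self.queued, "." + name)
        with open(tmp, "w") as fh:
            fh.write(job_name + "\n")
        os.rename(tmp, os.path.join(self.queued, name))

    def claim_next(self):
        """Move the oldest entry to active/; return (path, job) or None."""
        for name in sorted(os.listdir(self.queued)):
            if name.startswith("."):
                continue  # still being written
            dest = os.path.join(self.active, name)
            try:
                os.rename(os.path.join(self.queued, name), dest)
            except FileNotFoundError:
                continue  # another process claimed it first
            with open(dest) as fh:
                return dest, fh.read().strip()
        return None

    def finish(self, path):
        """Drop a claimed entry once its backup has completed."""
        os.remove(path)
```

Because nothing expires, a job queued on Monday is still there on Friday. The trade-off is that an entry stuck in active/ after a crash has to be cleaned up by hand.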

A future feature will be getting these jobs queued up automagically by Nagios. I point Nagios at Bacula to make sure that jobs get run often enough, and it should be possible to have Nagios' event handler enqueue a job when it figures it's overdue.
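
A rough sketch of what such an event handler might decide, in Python. It assumes the handler is called with Nagios' $SERVICESTATE$ and $SERVICESTATETYPE$ macros plus a job name, and that an overdue backup shows up as a CRITICAL state; both the calling convention and the function names are assumptions, not something already set up here.

```python
def should_enqueue(state, state_type):
    """Act only once Nagios has confirmed the problem (a HARD CRITICAL)."""
    return state == "CRITICAL" and state_type == "HARD"


def handle_event(state, state_type, job, enqueue):
    """Call enqueue(job) for an overdue backup; return whether it was queued.

    `enqueue` is any callable that adds a job name to the queue.
    """
    if should_enqueue(state, state_type):
        enqueue(job)
        return True
    return False
```

Ignoring SOFT states means the handler would not queue a Full while Nagios is still retrying the check.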

For now, though, I'm glad this has worked out as well as it has. I still feel guilty about trying to duplicate Amanda's scheduling features, but I feel a bit like Macbeth: in blood stepped in so far....So I keep going. For now.