Debugging Bacula FileSet exclusions -- an example

A user at $WORK was running a series of jobs on the cluster -- dozens at any moment. Other users have their quota set to 60 GB, but this user did not (long story). His home directory is at 400 GB, but it was closer to a terabyte not so long ago...right when we had a hard drive and a tape drive fail at the same time on our backup server.

We do backups every night to tape using Bacula. Most backups are incremental (whatever changed since the last backup, usually the day before) and are small...maybe tens of GB per day. But backups for this user, because of the proliferation of logs from his jobs, were closer to the size of his home directory every day -- simply because all these log files were being updated as each job progressed.

Ordinarily this wouldn't be a problem, but the cluster of hardware failures has really fucked things up; things are better now, but I'm very slowly playing catch-up on backups. Eating a tape or more every day is not in my budget right this moment.

I asked him if any of the log files could be excluded from backups without any great loss. After talking it over with him, we came to this agreement: under /home/example/projects/output, we'd keep any file matching "*rep0*" and exclude everything else.

This would exclude lots of other files like "1rep2.foo", "8rep9.log", etc., and would cut out about 200 GB of useless churn every day.
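
To put a rough number on that, a quick du over the files that would be dropped is enough. A sketch, not what I actually ran -- it assumes GNU find and du, and uses the path and pattern from the agreement above:

    # Rough total size of everything under projects/output that is NOT a *rep0* file,
    # i.e. the files we agreed to stop backing up
    find /home/example/projects/output -type f ! -name '*rep0*' -print0 \
        | du -ch --files0-from=- | tail -1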

Bacula has the ability to do this sort of thing...but I found its methods somewhat counterintuitive, so I want to set down what I did and how I tested it.

First off, the original, let's-include-everything FileSet looked like this:

FileSet {
  Name = "example"
  Include {
    File = /home/example
    Options {
      signature = SHA1
    }
  }
  Exclude {
    File = /proc
    File = /tmp
    File = /.journal
    File = /.fsck
    File = /.zfs
  }
}

We back up everything under /home/example, we keep SHA1 signatures, and we exclude a handful of directories (most of which are boilerplate that ends up in every FileSet here).

In order to get Bacula to change the FileSet definition, you have to get the director to reload its configuration file. But some errors -- not all -- cause a running bacula-dir process to die. So before I started fiddling around, I added a Makefile to the /opt/bacula/etc directory that looked like this:

test:
        @/opt/bacula/sbin/bacula-dir -t && echo "bacula-dir.conf looks good" || { echo "problem with bacula-dir.conf" ; exit 1 ; }

reload: test
        echo "reload" | /opt/bacula/sbin/bconsole

Whenever I made a change, I'd run "make reload", which would test the configuration first; if the test failed, Bacula would not be reloaded. (The "@" symbol, in a Makefile, keeps make from echoing the command itself; the "exit 1" is what makes the target actually fail -- and so stops the reload -- when bacula-dir objects to the config.)
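
If you'd rather skip the Makefile, the same guard is just a shell one-liner; a sketch, using the same paths as above:

    # Only tell the director to reload if the config parses cleanly
    /opt/bacula/sbin/bacula-dir -t && echo "reload" | /opt/bacula/sbin/bconsole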

Next, I needed a listing of what we were backing up now, before I started fiddling with things:

    echo "estimate job=fileserver-example listing" | bconsole > /tmp/listing-before

The "estimate" command gets Bacula to estimate how big the job is; the "listing" argument tells it to list the files it'd back up. By default it gives you the info for a full backup. (You can also append a joblevel, so you can see how big a Differential or Incremental; I didn't need that here, but it's worth remembering for next time.)

After that, I made another Makefile that looked like this:

test: estimate shouldwork shouldfail

estimate:
        @echo "estimate job=fileserver-example listing" | bconsole > /tmp/listing-after ; wc -l /tmp/listing*

shouldwork: estimate
        grep rep0 /tmp/listing-before | grep projects/output | while read i ; do grep -q "$$i" /tmp/listing-after || exit 1 ; done

shouldfail:
        grep rep2 /tmp/listing-before | grep projects/output | while read i ; do grep -q "$$i" /tmp/listing-after && exit 1 ; done ; true

This is a little hackish, so in detail:

- "estimate" regenerates the listing of what Bacula would back up now, and shows line counts for both listings so I can eyeball the difference.
- "shouldwork" takes every projects/output line mentioning rep0 from the old listing and makes sure it still shows up in the new one; if any has gone missing, the loop exits 1 and the target fails.
- "shouldfail" does the opposite: if any rep2 line from the old listing shows up in the new one, the target fails. (The trailing "true" is there so the target succeeds when the loop finds nothing.)

Anyhow: after each change, I'd run "make reload" as root to make sure that the syntax worked. After that, I'd run "make test" as an ordinary user (no need for root privileges) to make sure that I was on the right track. After a while, I got this:

FileSet {
  Name = "example"
  Include {
    File = /home/example
    Options {
      signature = SHA1
      WildDir = /home/example/projects/output
      Exclude = yes
    }
  }
  Include {
    File = /home/example/projects/output
    Options {
      WildFile = "*rep0*"
      Signature = SHA1
    }
    Options {
      Exclude = yes
      RegexFile = ".*"
    }
  }
  Exclude {
    File = /proc
    File = /tmp
    File = /.journal
    File = /.fsck
    File = /.zfs
  }
}

Again, this is a little counterintuitive to me, so here's how it works out. The first Include grabs everything under /home/example, but its Options block excludes the projects/output directory, so nothing under there gets picked up by that stanza. The second Include then covers /home/example/projects/output on its own. Within an Include, Options blocks are tried in order and the first one whose pattern matches a file wins: anything matching "*rep0*" hits the first Options block, which has no "Exclude = yes", so it gets backed up; everything else falls through to the second Options block, whose RegexFile of ".*" matches anything, and gets excluded.
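
A quick spot-check against the new listing tells the same story (a sketch; the rep0/rep2 patterns are the same ones the test Makefile uses):

    # rep0 files under projects/output should still show up in the new listing...
    grep -c 'projects/output.*rep0' /tmp/listing-after
    # ...while rep2 files (and the rest) should not.
    grep -c 'projects/output.*rep2' /tmp/listing-after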

After I was confident that I had the right set of files excluded, I sent the user a list of files to confirm that all was well:

cat /tmp/listing-before | while read i ; do grep -q "$i" /tmp/listing-after || echo "$i" ; done > /tmp/excluded

Now, I'm the first to admit that that is ugly: I should have just used diff or comm, there's a useless use of cat...lots of objections to raise. But it's been a long day and I got what I wanted. I pointed the user at it, made sure it was okay, and committed the changes.
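
For next time, something like comm would do the same job more tidily; a sketch (bash syntax) that prints lines present in the old listing but not the new one:

    # Lines only in the "before" listing -- i.e. files that would no longer be backed up
    comm -23 <(sort /tmp/listing-before) <(sort /tmp/listing-after) > /tmp/excluded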

All in all, this gave me a good loop for testing: it caught fatal errors before they happened, it let me be sure I was excluding the right things, and I was able to work in a stepwise fashion to get where I wanted.