# sudo netstat -tupan | grep 9103 tcp 0 0 0.0.0.0:9103 0.0.0.0:* LISTEN -
So a while ago, I wrote about the li'l ol' laptop under the TV; an old, old Dell with a P3 processor that was finally coming to an end. Oliver Hookins, bless his heart, recommended the Zotac ZBox; after a bit of research, I agreed and bought the CI-320. (Sorry, Wout, but I wanted something with a bit more horsepower than a Banana Pi.) I bought 4GB of RAM for it, and I already had a 64GB SSD lying around. Debian installed on it w/o any problems whatsoever, and I migrated everything over a couple of weeks ago.
It's pretty wonderful, not least because it's completely silent. It's passively cooled, and with the SSD there are no moving parts. I've got an external HD attached to it via USB (though this thing has also got eSATA, GigE, wireless, HDMI...), and it does backup for the house. I finally got rid of my crappy, crappy-ass rsync wrapper and set up rsnapshot; I've been told to check out elkarbackup, a nice-looking web interface for it. (Now if I can only get off my butt and set up duply and offsite encrypted storage...)
And the name? Zombie.saintaardvarkthecarpeted.com. Zbox...what're you gonna do?
Canada's CSEC tracked travellers at Canadian airports who used the free WiFi. Not only that, tracked 'em afterward and backward as they showed up at other public hotspots across Canada. Oh, lovely.
A TSA screener explains: Yes, we saw you naked and we laughed.
ESR writes about dragging Emacs forward -- switching to git, and away from Texinfo, all to keep Emacs relevant. There are about eleven thousand comments. Quote:
And if the idea of RMS and ESR cooperating to subvert Emacs's decades-old culture from within strikes you as both entertaining and bizarrely funny...yeah, it is. Ours has always been a more complex relationship than most people understand.
My wife takes out our younger son's stuffed dogs for the day, and gets all the space she needs at Costco. WIN.
Looks like the supernova in Ursa Major has peaked at magnitude 10.5 or so.
Have I mentioned Adlibre backup before? 'Cos it's really quite awesome. Written in shell, uses rsync and ZFS to back up hosts. Simple and good.
Maclean's sent a sketch artist to cover Justin Bieber getting booked. I'd like to sketch that well.
Yesterday I was asked to restore a backup for a Windows desktop, and I couldn't: I'd been backing up "Documents and Settings", not "Users". The former is appropriate for XP, which this workstation'd had at some point, but not Windows 7 which it had now. I'd missed the 286-byte size of full backups. Luckily the user had another way to retrieve his data. But I felt pretty sick for a while; still do.
When shit like this happens, I try to come up with a Nagios test to watch for it. It's the regression test for sysadmins: is Nagios okay? Then at least you aren't repeating any mistakes. But how the hell do I test for this case? I'm not sure when the change happened, because the full backups I had (going back three months; our usual policy) were all 286 bytes. I thought I could settle for "alert me about full backups under...oh, I dunno, 100KB." But a search for that in the catalog turns up maybe ten or so, nine of them legitimate, meaning an alert for this will give 90% false positives.
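For the record, the catalog search I mean is just a query against Bacula's Job table; something like this sketch (it assumes a PostgreSQL catalog named "bacula", and 100KB is the arbitrary threshold from above):
#!/usr/bin/perl
# Sketch: list successful Full backups whose total size is suspiciously small.
# Assumes a PostgreSQL catalog named "bacula"; adjust the DSN to taste.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=bacula', 'bacula', '', { RaiseError => 1 });

my $rows = $dbh->selectall_arrayref(q{
    SELECT Name, JobId, EndTime, JobBytes
      FROM Job
     WHERE Level = 'F'            -- Full backups only
       AND JobStatus = 'T'        -- jobs that terminated OK
       AND JobBytes < 102400      -- "under...oh, I dunno, 100KB"
  ORDER BY EndTime DESC
});

printf "%-40s %8d %-20s %12d\n", @$_ for @$rows;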
So all right, a list of exceptions. Except that needs to be maintained. So imagine this sequence:
I need some way of saying "Oh, that's unusual..." Which makes me think of statistics, which I don't understand very well, and I start to think this is a bigger task than I realize and I'm maybe trying to create AI in a Bash script.
And really, I've got don't-bug-me-if-this lists, and local checks and exceptions, and I've documented things as well as I can but it's never enough. I've tried hard to make things easy for my eventual successor (I'm not switching jobs any time soon; just thinking of the future), and if not easy then at least documented, but I have this nagging feeling that she'll look at all this and just shake her head, the way I've done at other setups. It feels like this baroque, Balkanized, over-intricate set of kludges, special cases, homegrown scripts littered with FIXMEs and I don't know what-all. I've got Nagios invoking Bacula, and Cfengine managing some but not all, and it just feels overgrown. Weedy. Some days I don't know the way out.
And the stupid part is that NONE OF THIS WOULD HAVE FIXED THE ORIGINAL PROBLEM: I screwed up and did not adjust the files I was backing up for a client. And that realization -- that after cycling through all these dark worryings about how I'm doing my job, I'm right back where I started, a gutkick suspicion that I shouldn't be allowed to do what I do and I can't even begin to make a go at fixing things -- that is one hell of a way to end a day at work.
I have a love-hate relationship with Bacula. It works, it's got clients for Windows, and it uses a database for its catalog (a big improvement over what I'd been used to, back in the day, from Amanda...though that's probably changed since then). OTOH, it has had an annoying set of bugs, the database can be a real bear to deal with, and scheduling....oh, scheduling. I'm going to hold off on ranting on scheduling. But you should know that in Bacula, you have to be explicit:
Schedule {
Name = "WeeklyCycle"
Run = Level=Full Pool=Monthly 1st sat at 2:05
Run = Level=Differential Pool=Daily 2nd-5th sat at 2:05
Run = Level=Incremental IncrementalPool=Daily FullPool=Monthly 1st-5th mon-fri, 2nd-5th sun at 00:41
}
This leads to problems on the first Saturday of the month, when all those full backups kick off. In the server room itself, where the backup server (and tape library) are located, it's not too bad; there's a GigE network, lots of bandwidth, and it's a dull roar, as you may say. But I also back up clients on a couple of other networks on campus -- one of which is 100 Mbit. Backing up 12 x 500GB home partitions on a remote 100 Mbit network means a) no one on that network can connect to their servers anymore, and b) everything takes days to complete, making it entirely likely that something will fail in that time and you've just lost your backup.
One way to do that is to adjust the schedule. Maybe you say that you only want to do full backups every two months, and to not do everything on the same Saturday. That leads to crap like this:
Schedule {
Name = "TwoMonthExpiryWeeklyCycleWednesdayFull"
Run = Level=Full Pool=MonthlyTwoMonthExpiry 2nd Wed jan,mar,may,jun,sep,nov at 20:41
Run = Level=Differential Pool=Daily 2nd-5th sat at 2:05
Run = Level=Differential Pool=Daily 1st sat feb,apr,jun,aug,oct,dec at 2:05
Run = Level=Incremental IncrementalPool=Daily FullPool=MonthlyTwoMonthExpiry 1st-5th mon-tue,thu-fri,sun, 2nd-5th wed at 20:41
Run = Level=Incremental IncrementalPool=Daily FullPool=MonthlyTwoMonthExpiry 2nd Wed jan,mar,may,jun,sep,nov at 20:41
}
That is awful; it's difficult to make sure you've caught everything, and you have to do something like this for Thursday, Friday, Tuesday...
I guess I did rant about Bacula scheduling after all.
A while back I realized that what I really wanted was a queue: a list of jobs for the slow network that would get done one at a time. I looked around, and found Perl's IPC::DirQueue. It's a pretty simple module that uses directories and files to manage queues, and it's safe over NFS. It seemed a good place to start.
So here's what I've got so far: there's an IPC::DirQueue-managed queue that has a list of jobs like this:
I've got a simple Perl script that, using IPC::DirQueue, takes the first job and runs it like so:
open (BCONSOLE, "| /usr/bin/bconsole");
print BCONSOLE "run job=" . $job . " level=Full pool=Monthly yes";
close (BCONSOLE);
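That's just the bconsole end of it; the queue-handling half, fleshed out, looks roughly like this (a sketch rather than the actual script -- the queue path and the way the job name is stored are assumptions):
#!/usr/bin/perl
# Sketch of the queue runner: pick up the oldest queued job and hand it to
# bconsole.  Assumes each queue entry's data is just a Bacula job name.
use strict;
use warnings;
use IPC::DirQueue;

my $dq = IPC::DirQueue->new({ dir => '/path/to/job_queue' });

my $job = $dq->pickup_queued_job() or exit 0;   # nothing queued; nothing to do

open(my $fh, '<', $job->get_data_path()) or die "can't read job data: $!";
chomp(my $jobname = <$fh>);
close($fh);

open(BCONSOLE, "| /usr/bin/bconsole") or die "can't run bconsole: $!";
print BCONSOLE "run job=" . $jobname . " level=Full pool=Monthly yes\n";
close(BCONSOLE);

$job->finish();   # mark the queue entry as done so it isn't run again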
I've set up a separate job definition for the 100Mbit-network clients:
JobDefs {
Name = "100MbitNetworkJob
Type = Backup
Client = agnatha-fd
Level = Incremental
Schedule = "WeeklyCycleNoFull"
Storage = tape
Messages = Standard
Priority = 10
SpoolData = yes
Pool = Daily
Cancel Lower Level Duplicates = yes
Cancel Queued Duplicates = yes
RunScript {
RunsWhen = After
Runs On Client = No
Command = "/path/to/job_queue/bacula_queue_mgr -c %l -r"
}
}
"WeeklyCycleNoFull" is just what it sounds like: daily incrementals, weekly diffs, but no fulls; those are taken care of by the queue. The RunScript stanza is the interesting part: it runs baculaqueuemgr (my Perl script) after each job has completed. It includes the level of the job that just finished (Incremental, Differential or Full), and the "-r" argument to run a job.
The Perl script in question will only run a job if the one that just finished was a Full level. This was meant to be a crappy^Wsimple way of ensuring that we run Fulls one at a time -- no triggering a Full if an Incremental has just finished, since I might well be running a bunch of Incrementals at once.
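The guard itself is about as simple as it sounds; the top of the script does something like this (sketch):
# Sketch of the "only chain off a Full" guard in bacula_queue_mgr:
# -c carries Bacula's %l (the level of the job that just finished),
# -r means "go ahead and run the next queued job".
use strict;
use warnings;
use Getopt::Std;

my %opt;
getopts('c:r', \%opt);

# Don't kick off the next Full on the back of an Incremental or
# Differential -- a bunch of those may be finishing at the same time.
exit 0 unless $opt{r} && defined $opt{c} && $opt{c} eq 'Full';

# ...then fall through to the queue-pickup code shown earlier...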
It's not yet entirely working. It works well enough if I run the queue manually (which is actually tolerable compared to what I had before), but Bacula running the "bacula_queue_mgr" command does not quite work. The queue module has a built-in assumption about job lifetimes, and while I can tweak it to be something like days (instead of the default, which I think is 15 minutes), the script still notes that it's removing a lot of stale lockfiles, and there's nothing left to run because they're all old jobs. I'm still working on this, and I may end up switching to some other queue module. (Any suggestions, let me know; pretty much open to any scripting language.)
A future feature will be getting these jobs queued up automagically by Nagios. I point Nagios at Bacula to make sure that jobs get run often enough, and it should be possible to have Nagios' event handler enqueue a job when it figures it's overdue.
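The enqueue side should only be a few lines -- something like this, assuming the event handler gets passed the name of the overdue Bacula job, and with the queue path made up:
#!/usr/bin/perl
# Sketch of a Nagios event handler that queues up an overdue Bacula job.
# Assumes the job name arrives as the first argument.
use strict;
use warnings;
use IPC::DirQueue;

my $jobname = shift or die "usage: $0 <bacula job name>\n";

my $dq = IPC::DirQueue->new({ dir => '/path/to/job_queue' });
$dq->enqueue_string($jobname);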
For now, though, I'm glad this has worked out as well as it has. I still feel guilty about trying to duplicate Amanda's scheduling features, but I feel a bit like Macbeth: in blood stepped in so far....So I keep going. For now.
A user at $WORK was running a series of jobs on the cluster -- dozens at any moment. Other users have their quota set to 60 GB, but this user was not (long story). His home directory is at 400GB, but it was closer to a terabyte not so long ago....right when we had a hard drive and a tape drive fail at the same time on our backup server.
We do backups every night to tape using Bacula. Most backups are incremental (whatever changed since the last backup, usually the day before) and are small...maybe tens of GB per day. But backups for this user, because of the proliferation of logs from his jobs, were closer to the size of his home directory every day -- simply because all these log files were being updated as each job progressed.
Ordinarily this wouldn't be a problem, but the cluster of hardware failures have really fucked things up; they're better now, but I'm very slowly playing catchup backups. Eating a tape or more every day is not in my budget right this moment.
I asked him if any of the log files could be excluded from backups without any great loss. After talking it over with him, we came to this agreement:
This would exclude lots of other files like "1rep2.foo", "8rep9.log", etc, and would cut out about 200 GB of useless churn every day.
Bacula has the ability to do this sort of thing...but I found its methods somewhat counterintuitive, so I want to set down what I did and how I tested it.
First off, the original, let's-include-everything FileSet looked like this:
FileSet {
Name = "example"
Include {
File = /home/example
Options {
signature = SHA1
}
}
Exclude {
File = /proc
File = /tmp
File = /.journal
File = /.fsck
File = /.zfs
}
}
We back up everything under /home/example, we keep SHA1 signatures, and we exclude a handful of directories (most of which are boilerplate, applied to every FileSet by default).
In order to get Bacula to change the FileSet definition, you have to get the director to reload its configuration file. But some errors -- not all -- cause a running bacula-dir process to die. So before I started fiddling around, I added a Makefile to the /opt/bacula/etc directory that looked like this:
test:
@/opt/bacula/sbin/bacula-dir -t && echo "bacula-dir.conf looks good" || echo "problem with bacula-dir.conf"
reload: test
echo "reload" | /opt/bacula/sbin/bconsole
Whenever I made a change, I'd run "make reload", which would test the configuration first; if it failed, bacula would not be reloaded. (The "@" prefix in a Makefile keeps make from echoing the command itself.)
Next, I needed a listing of what we were backing up now, before I started fiddling with things:
echo "estimate job=fileserver-example listing" | bconsole > /tmp/listing-before
The "estimate" command gets Bacula to estimate how big the job is; the "listing" argument tells it to list the files it'd back up. By default it gives you the info for a full backup. (You can also append a joblevel, so you can see how big a Differential or Incremental; I didn't need that here, but it's worth remembering for next time.)
After that, I made another Makefile that looked like this:
test: estimate shouldwork shouldfail
estimate:
@echo "estimate job=fileserver-example listing" | bconsole > /tmp/listing-after ; wc -l /tmp/listing*
shouldwork: estimate
grep rep0 /tmp/listing-before | grep projects/output | while read i ; do grep -q $$i /tmp/listing-after || exit 1 ; done
shouldfail:
grep rep2 /tmp/listing-before | grep projects/output | while read i ; do if grep -q $$i /tmp/listing-after ; then exit 1 ; fi ; done
This is a little hackish, so in detail:
The estimate target gets an updated listing of what Bacula will back up; the line count lets me eyeball how it compares to the old, all-inclusive listing.
The shouldwork target gives me a quick way to make sure that all the files with "rep0" in the name and "projects/output" in the path are still in that updated listing. We grep for each of those files in the new listing; the loop either finds them all or exits with error code 1, which make catches and declares an error.
The shouldfail target is the mirror image: it makes sure that files with "rep2" in the name are excluded from the new listing. Here the loop exits with an error (again caught by make) the moment it finds one of those files in the new listing, and falls through successfully if none turn up. That's what turns this test into a "MUST NOT".
Anyhow: after each change, I'd run "make reload" as root to make sure that the syntax worked. After that, I'd run "make test" as an ordinary user (no need for root privileges) to make sure that I was on the right track. After a while, I got this:
FileSet {
Name = "example"
Include {
File = /home/example
Options {
signature = SHA1
Wilddir = /home/example/projects/output
Exclude = yes
}
}
Include {
File = /home/example/projects/output
Options {
WildFile = "*rep0*"
Signature = SHA1
}
Options {
Exclude = yes
RegexFile = ".*"
}
}
Exclude {
File = /proc
File = /tmp
File = /.journal
File = /.fsck
File = /.zfs
}
}
Again, this is a little counterintuitive to me, so here's how it works out.
The first "Include" stanza is the same, except that in the "Options" section we're excluding "/home/example/projects/output". That's what the "Wilddir" and "Exclude = yes" directives are for.
The second "Include" stanza puts the "/home/example/projects/output" back in, but modified with two "Options" sections: the first to include "rep0" (a simple fileglob) and the second to exclude everything. What ends up being included by this stanza is the union of those two options: only files named "rep0" in the directory "/home/example/projects/output".
Last, the third stanza is our standard "Exclude" boilerplate.
After I was confident that I had the right set of files excluded, I sent the user a list of files to confirm that all was well:
cat /tmp/listing-before | while read i ; do grep -q $i /tmp/listing-after || echo $i ; done > /tmp/excluded
Now, I'm the first to admit that that is ugly. Diff, useless use of cat...lots of objections to raise. But it's been a long day and I got what I wanted. I pointed the user at it, made sure it was okay, and committed the changes.
All in all, this gave me a good loop for testing: it caught fatal errors before they happened, it let me be sure I was excluding the right things, and I was able to work in a stepwise fashion to get where I wanted.
This is an attempt to lay out my problems with Bacula, and to be explicit about what I hope to achieve by replacing it (if, in fact, I do go ahead with that). If I'm wrong, correct me.
Too many long jobs monopolize spool space, storage job slots, and generally hold up production.
My largest jobs right now are around 1-2 TB -- and in order to accomplish that, I need to manually split up filesystems using a messy syntax. A job running that long will cycle through spooling and despooling many, many times. During spooling, it takes up spool space and one of the storage daemon's job slots. During despooling, no other job can despool to that tape drive. Often, this ends up holding up a lot of other jobs. If there's a problem, I'm faced with a choice between killing a job that's been running for days, or letting lots of other stuff go without backups until/unless it finishes.
More generally, I'm faced with a choice between letting everything run forever at the beginning of the month (because it's simplest to schedule fulls for the first Saturday or some such), or juggling schedules manually to stagger things (which I'm doing now, and leads to schedules like FullBackupSecondSundayAfterLent).
Possible fixes:
Bacula seems to get confused easily about what tapes are available for use.
Bacula's storage daemon seems to often hold on to outdated info about what tapes are in what state.
Example: the daily pool is full, so jobs are halted. Status storage shows it's waiting for a volume to be created for the daily pool. I move a volume from another pool, then have to attempt to mount it manually in the appropriate drive -- the storage daemon doesn't pick up on this change automatically.
Sometimes this works, and sometimes it doesn't. Sometimes both are waiting for a tape from the same pool; creating one doesn't let the jobs queued up on the other drive run on that new tape, but rather you need to create a second new tape and mount it. On top of that, sometimes the jobs hang around on the storage daemon still waiting for a new tape -- or something...because they don't get out of the way, and let other jobs run in their place, unless they're cancelled (and sometimes only when bacula-sd is restarted).
This may be fixed with the upgrade to 5.2.6. However....
The new version of Bacula crashes when I run too many jobs at once.
That's 5.2.6, which I upgraded to from 5.0.2 (time got away on me, yes). And by too many I mean, like, 50. That's not too many! I'm not sure what the hell's going on, though at least now I have a backtrace. I'm seriously pissed off about this point. Yes, I'll file a bug, but this is annoying.
All in all, I spend far too much time babysitting Bacula.
It's extremely high maintenance, and that's pissing me off. Understand, this is coming after a long weekend spent babysitting it, trying to make sure some jobs got written. There are other problems at work, yes, but this is not meant to be so hard.
Periodically I remove tapes at $WORK from our tape library to keep them somewhere else. getmonthlytapes is a Perl script that helps me do just that. Released under the GPL; share and enjoy!
I've got a tape library at work with two tape drives. Today, one of the drives was doing (full) backups and the second was free for a restore job. However, when that restore job ran, I got this error:
JobId 62397: Forward spacing Volume "000039" to file:block 7:0.
JobId 62397: Error: block.c:1016 Read error on fd=7 at file:blk 3:0 on device "Drive-0" (/dev/nst1). ERR=Input/output error.
JobId 62397: End of Volume at file 3 on device "Drive-0" (/dev/nst1), Volume "000039"
JobId 62397: Fatal error: acquire.c:72 Acquire read: num_writers=1 not zero. Job 62397 canceled.
JobId 62397: Fatal error: mount.c:844 Cannot open Dev="Drive-0" (/dev/nst1), Vol=000039
JobId 62397: End of all volumes.
JobId 62397: Error: Bacula cbs-01-dir 5.0.2 (28Apr10): 03-May-2011 12:09:20
The problem wasn't that it encountered the end of the volume -- the job spanned a number of volumes, so that was okay.
No, the problem was that after the restore job had run, a number of other regular backups had started. These were incrementals, and thus were unable to use the first drive. When the restore job ran into the EOM on the first volume, it appears to have released the drive -- at which point the incrementals started up and denied the use of the second drive to the restore job. The restore job promptly gave up and called it an error.
As I was in a hurry, I tried killing off the incrementals and re-running the restore job. This worked just fine. Arguably it's a bug, but I suspect I just need to tweak the priority for restore jobs instead.
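If I go the priority route, it's a one-line change in the restore job definition -- lower numbers run first in Bacula, so something like this sketch (made-up job name) should let restores jump ahead of queued backups:
Job {
Name = "RestoreFiles"
Type = Restore
# everything else as before...
Priority = 5     # the backup jobs here run at Priority = 10, so restores win
}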
(Two entries in one day...woot!)
I came across this tip on an old posting to the Bacula mailing list. To determine if exclusions in a fileset are working, run these commands in bconsole:
@output some-file
estimate job=<job-name> listing level=Full
@output
The file will contain a list of files Bacula will include in the backup.
(Incidentally, I came across this while trying to figure out why my exclusions weren't working; turned out I needed to remove the trailing slash in my directory names in the Exclude section.)
I'm trying to get Bacula to make a separate copy of monthly full backups that can be kept off-site. To do this, I'm experimenting with its "Copy" directive. I was hoping to get a complete set of tapes ready to keep offsite before I left, but it was taking much longer than anticipated (2 days to copy 2 tapes). So I cancelled the jobs, typed "unmount" at bconsole, and went home thinking Bacula would just grab the right tape from the autochanger when backups came.
What I should have typed was "release". "release" lets Bacula grab whatever tape it needs; "unmount" leaves Bacula unwilling to do anything on its own, and it waits for the operator (ie, me) to do something.
Result: 3 weeks of no backups. Welcome back, chump.
There are a number of things I can do to make sure this doesn't happen again. There's a thread on the Bacula-users mailing list (came up in my absence, even) detailing how to make sure something's mounted. I can use "release" the way Kern intended. I can set up a separate check that goes to my cell phone directly, and not through Nagios. I can run a small backup job manually on Fridays just to make sure it's going to work. And on it goes.
I knew enough not to make changes as root on Friday before going on vacation. But now I know that includes backups.
I've been setting up some new VMs for a separate project at work. I've realized that this is painful for two reasons: Bacula and Nagios.
Both are important...can't have a service without monitoring, and can't have a machine without backups. But each of these is configured by vast files; Bacula's is monolithic (the director's, anyhow, which is where you add new jobs) and Nagios' is legion. And they're hard to configure automagically with sed/awk/perl or cfengine; their stanzas span lines, and whitespace is important.
I've recently added a short script to my Nagios config; it regenerates a file that monitors all the Bacula jobs and makes sure they happen often enough. This is good...and I want more.
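Roughly, the idea is something like this (a stripped-down sketch, not the actual script; "check_bacula_job", the host name and the output path are placeholders):
#!/usr/bin/perl
# Sketch: regenerate a Nagios config fragment with one service per Bacula
# job, using bconsole's ".jobs" command to get the list of job names.
use strict;
use warnings;

my @lines = `echo .jobs | /opt/bacula/sbin/bconsole`;
chomp @lines;
# keep only lines that look like job names, not the bconsole banner/prompt
my @jobs = grep { /^[\w.-]+$/ } @lines;

# wherever your nagios.cfg picks up extra config
open(my $out, '>', '/etc/nagios/conf.d/bacula_jobs.cfg') or die "can't write config: $!";
for my $job (@jobs) {
    print $out <<"EOF";
define service {
    use                     generic-service
    host_name               backup-server
    service_description     Bacula job $job
    check_command           check_bacula_job!$job
}
EOF
}
close($out);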
I found pynag, a Python module to parse and configure Nagios files. This is a start. I've had problems getting its head around my config files, because it didn't understand recursion in hostgroups (which I think is a recent feature of Nagios) or a hostname equal to "*". I've got the first working, and I'm banging my head against the second. The three books I got recently on Python should help (wow, IronPython looks nice).
There are a lot of example scripts with pynag. None do exactly what I want, but it looks like it should be possible to generate Nagios config files from some kind of list of hosts and services. This would be a big improvement.
But then there's Augeas, which does bi-directional parsing of config files. Have a look at the walk-through...it's pretty astounding. I realized that I've been looking for something like this for a long time: an easier way of managing all sorts of config files. Cfengine (v2 to be sure) just isn't cutting it anymore for me.
Now, the problem with Augeas for my present task is that there isn't anything in the current tree that does what I want, either. There is a commit for parsing nagios.cfg -- not sure if it's been released, or if it will parse everything in a Nagios config_dir. There's nothing for Bacula, either. This will mean a lot more work to get my ideal configuration management tool.
(On a side note, my wife said something to me the other day that was quite striking: I need tasks that can be divvied up into 45-minute chunks. That's how much free time I've got in the morning, bus rides to and from work, and the evening. Commute + kids != long blocks of free time.)
And I've got a congenital weakness for grand overarching syntheses of all existing knowledge...or at least big tasks like managing config files. So I'm trying to be aware of my brain.
...and there's son #2 waking up. Time to post.
I think I've finally figured out what's going on with my bacula-sd hangs. At the risk of repeating myself, this post is adapted from the bug report I've just filed.
Here's the situation: with the latest Bacula release (5.0.1), I regularly see bacula-sd hang when running jobs; it happens often and at seemingly random times; and when this happens, I see two bacula processes, a parent and a child. (Since bacula is multi-threaded, I usually just see one storage daemon process.) This came to my attention when I came back from vacation to find a week's worth of backups stalled (sigh).
When bacula-sd hangs, the traceback of the relevant thread in the parent process looks like this:
Thread 10 (Thread 0x466c0940 (LWP 12926)):
#0 0x00000035aa4c5f3b in read () from /lib64/libc.so.6
#1 0x00000035aa46cc07 in _IO_new_file_underflow (fp=<value optimized out>) at fileops.c:590
#2 0x00000035aa46d5ce in _IO_default_uflow (fp=<value optimized out>) at genops.c:435
#3 0x00000035aa468e8b in _IO_getc (fp=<value optimized out>) at getc.c:41
#4 0x00002b76479565c0 in bfgets (s=0x466bf710 "", size=<value optimized out>, fd=0x60080a0) at bsys.c:617
#5 0x000000000040d432 in release_device (dcr=0x60988b8) at acquire.c:533
[snip]
Here, bacula's storage daemon has just finished running a job, and before it releases the tape drive to someone else it runs the "Alert" command. This is specified in the config file for the storage daemon, and is meant to see if the drive has, say, run out of magnetism during the last job. Here's the source code in stored/acquire.c:
alert = get_pool_memory(PM_FNAME);
alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
bpipe = open_bpipe(alert, 0, "r");
if (bpipe) {
while (fgets(line, sizeof(line), bpipe->rfd)) { /* AardvarkNote: This is where the parent hangs */
Jmsg(jcr, M_ALERT, 0, _("Alert: %s"), line);
}
status = close_bpipe(bpipe);
}
Meanwhile, the child process stack looks like this:
Thread 1 (Thread 0x466c0940 (LWP 13000)):
#0 0x00000035aa4df9ee in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00000035aa4d06a5 in _L_lock_1206 () from /lib64/libc.so.6
#2 0x00000035aa4d05be in closelog () at ../misc/syslog.c:419
#3 0x00002b764795cd35 in open_bpipe (prog=<value optimized out>, wait=0, mode=<value optimized out>) at bpipe.c:138
#4 0x000000000040d3f1 in release_device (dcr=0x60988b8) at acquire.c:531
[snip]
open_bpipe() can be found in lib/bpipe.c; it's a routine for forking a child process (FORESHADOWING: not another thread!) and setting up a pipe between parent and child. The relevant bits look like this:
/* Start worker process */
switch (bpipe->worker_pid = fork()) {
[snip]
case 0: /* child */
if (mode_write) {
close(writep[1]);
dup2(writep[0], 0); /* Dup our write to his stdin */
}
if (mode_read) {
close(readp[0]); /* Close unused child fds */
dup2(readp[1], 1); /* dup our read to his stdout */
dup2(readp[1], 2); /* and his stderr */
}
/* AardvarkNote: This is where the child hangs: */
closelog(); /* close syslog if open */
for (i=3; i<=32; i++) { /* close any open file descriptors */
close(i);
}
execvp(bargv[0], bargv); /* call the program */
closelog() itself is simple: it "closes the current Syslog connection, if there is one." But running strace on the child process just shows a lot of futex calls...nothing very useful there at all. So what the hell is going on, and why is it hanging at closelog()?
Some background info on threads: In Linux at least, they're user-land things, and the kernel doesn't know about them. To the kernel, it's just another process with a PID. Implementing threads is left as an exercise to the reader...or in this case, to glibc and NPTL.
Since threads are part of the same process, they share memory. (A fork(), by contrast, copies over parent memory to the child -- and then the child has its own copy of everything, separate from the parent.) glibc/NPTL implements locks for certain things to make sure that one thread doesn't stomp all over another thread's memory willy-nilly. And those locks are done with futexes, which are provided by the kernel...which explains why it would show up in strace, which tracks system calls.
Why is this relevant? Because in the glibc code, closelog() looks like this:
void
closelog ()
{
/* Protect against multiple users and cancellation. */
__libc_cleanup_push (cancel_handler, NULL);
__libc_lock_lock (syslog_lock); /* AardvarkNote: This is where things hang */
closelog_internal ();
LogTag = NULL;
LogType = SOCK_DGRAM; /* this is the default */
/* Free the lock. */
__libc_cleanup_pop (1);
}
That __libc_lock_lock (syslog_lock) call is there to prevent two threads trying to mess with the syslog file handle at one time. Sure enough, the info for frame #2 of the child process shows that the process is trying to get the syslog_lock mutex:
#2 0x00000035aa4d05be in closelog () at ../misc/syslog.c:419
419 __libc_lock_lock (syslog_lock);
ignore1 = <value optimized out>
ignore2 = <value optimized out>
ignore3 = <value optimized out>
As noted in the mailing list discussion, closelog() should be a no-op if there's no descriptor open to syslog. However, in my case there is such a file descriptor, because I've got bacula-sd configured to log to syslog.
Well, as the Bible notes, mixing fork() and threading is problematic:
There are at least two serious problems with the semantics of fork() in a multi-threaded program. One problem has to do with state (for example, memory) covered by mutexes. Consider the case where one thread has a mutex locked and the state covered by that mutex is inconsistent while another thread calls fork(). In the child, the mutex is in the locked state (locked by a nonexistent thread and thus can never be unlocked). Having the child simply reinitialize the mutex is unsatisfactory since this approach does not resolve the question about how to correct or otherwise deal with the inconsistent state in the child.
And hey, doesn't that sound familiar?
Now that I had an idea of what was going on, I was able to find a number of similar problems that people have encountered:
This post describes a multi-threaded app that hung when a signal was received during a syslog() call.
This thread from the libc-alpha mailing list describes a similar problem:
> The particular problem I'm seeing is with syslog_lock.
>
> If another thread is writing to syslog when the fork happens,
> the syslog_lock mutex is cloned in a locked state.
This Debian bug describes a multithreaded app deadlocking when syslog() is called after a fork().
The Bible describes the POSIX-ish way around this, the pthread_atfork() call.
This blog entry has an overview of the problem with threads and fork(), and of pthread_atfork().
So: what seems to be happening is that, when the stars align, the child process is being created at the same moment that the parent process is logging to syslog. The child is created with a locked syslog_lock mutex, but without the thread that had been holding it...and thus, without anything that can release it. The child blocks waiting for the mutex, the parent blocks on the child, and backup jobs halt (well, at least the spooling of jobs to tape) until I kill the child manually.
This was complicated to find for a number of reasons:
My gut feeling (see also: handwavy assertion) is that logging to syslog is relatively unusual, which would explain why this problem has taken a while to surface.
It's a race condition that's exacerbated by having multiple jobs running at once; I'd only recently implemented this in my Bacula setup.
btraceback, a shell wrapper around gdb that comes with Bacula, is meant to run the debugger on a hung Bacula process. I used it to get the tracebacks shown above. It's great if you don't know what you're doing with gdb (and I certainly don't!). But it has a number of problems.
btraceback is usually called by the unprivileged bacula user. That meant I had to modify the script to run gdb with sudo, and change sudoers to allow bacula to run "sudo gdb" without a password. Otherwise, the dump was useless -- particularly frustrating before I'd figured out how to duplicate the problem.
btraceback can be triggered by sending a fatal signal to the bacula-sd process (say, "kill -6"). That's great when you notice a hang and can run the script. But it won't trace child processes -- which was where the interesting things were -- and it was a while before it occurred to me to do that manually.
Bacula has an --enable-lockmgr option that's meant to catch deadlocks, kill itself and run btraceback. However, it's not enabled by default in the SRPMs I was using to build bacula, and in any case it watches for deadlocks on Bacula's own locks -- not external locks like syslog_lock.
So what to do?
For now, I'm removing the log-to-syslog option from bacula-sd.conf. When I do that, I see no problems at all running jobs.
On the programming side -- and keep in mind I'm no programmer -- it looks like there are a number of options here:
Don't call closelog() after fork() in open_bpipe(). This may leave an open descriptor to syslog, which may be a problem. (Or maybe it's just ugly. Don't know.)
Don't fork() a child process when running the alert command, but start another thread instead. I have no idea why a fork() is preferred over a new thread in this scenario, but I presume there's a good reason.
Use pthread_atfork() to set things up for a fork(). That's what The Bible says to do, but I don't know what Bacula would actually need to do with it in order to make this deadlock go away.
Good lord, I'm closing in on 1700 words here. If you've stuck with me this long, I hope it was interesting; once I got over the panic it was a fascinating problem, and it's taught me a lot about processes, threads and deadlocks. And how can that be a bad thing?
I've been off on vacation for a week while my parents have been visiting. It's been fun, relaxing, and generally a good time. But today it was back to work.
I found out that backups have not been running since the day after I went on vacation. There were something like 650 jobs in Bacula stacked up, waiting to run; the storage daemon was stuck trying to do something, and nothing was getting done.
Near as I can tell, the storage daemon was stuck in a deadlock. I've got some backtraces, I've posted to the devel mailing list, and it looks like there's a bug, recently fixed, that addresses the problem. Inna meantime I've put in Nagios monitoring for the size of the run queue, and I'm figuring out how to watch for the extra bacula-sd process that showed up.
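For the extra-process check, something like this ought to do (a sketch; it assumes exactly one bacula-sd process is the norm):
#!/usr/bin/perl
# Sketch of a Nagios check: complain if there's more (or less) than one
# bacula-sd process running.
use strict;
use warnings;

my @pids = `pgrep -x bacula-sd`;
chomp @pids;

if (@pids == 0) {
    print "CRITICAL: no bacula-sd process running\n";
    exit 2;
}
elsif (@pids > 1) {
    # a second process has, so far, meant a hung child
    print "WARNING: ", scalar(@pids), " bacula-sd processes running: @pids\n";
    exit 1;
}
print "OK: one bacula-sd process (pid @pids)\n";
exit 0;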
Tonight, at least, the same problem happened again. This is good, because now I have a chance to repeat it. Got another backtrace, killed the job, and things are moving on. Once things are finished here, I think I'm going to try running that job again and seeing if I can trigger the problem.
Sigh. This was going to be a good day and a good post. But it's late and I've got a lot to do.
I mentioned that I've been having problems with Bacula recently. These have been aggravated by the fact that the trigger seems to be a job that takes 53 hours to finish.
Well, I think I've got a handle on one part of the problem. See, when Bacula is doing this big job, other jobs stack up behind it -- despite having two tape drives, and two separate pools of tapes, and concurrent jobs set up, the daily jobs don't finish. The director says this:
9279 Full BackupCatalog.2010-02-20_21.10.00_10 is waiting for higher priority jobs to finish
9496 Full BackupCatalog.2010-02-23_21.10.00_13 is waiting execution
9498 Full bigass_server-d_drive.2010-02-24_03.05.01_15 is running
9520 Increme little_server-var.2010-02-24_21.05.00_38 is waiting on Storage tape
9521 Increme little_server-opt.2010-02-24_21.05.00_39 is waiting on max Storage jobs
but storage says this:
Running Jobs:
Writing: Full Backup job bigass_server-d_drive JobId=9498
Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1)
spooling=1 despooling=0 despool_wait=0
Files=708,555 Bytes=1,052,080,331,191 Bytes/sec=11,195,559
FDReadSeqNo=22,294,829 in_msg=20170567 out_msg=5 fd=16
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=156 Bytes=3,403,527,093 Bytes/sec=72,415
FDReadSeqNo=53,041 in_msg=52667 out_msg=9 fd=9
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=9 Bytes=183,606 Bytes/sec=3
FDReadSeqNo=72 in_msg=50 out_msg=9 fd=10
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=5 Bytes=182,029 Bytes/sec=3
FDReadSeqNo=45 in_msg=32 out_msg=9 fd=19
Writing: Incremental Backup job other_little_server-var JobId=9525 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=0 Bytes=0 Bytes/sec=0
FDSocket closed
Out of desperation I tried running "unmount" for the drive holding the daily tape, thinking that might reset things somehow...but the console just sat there, and never returned a prompt or an error message. Meanwhile, storage was logging this:
cbs-01-sd: dircmd.c:218-0 <dird: unmount SL-500 drive=1
cbs-01-sd: dircmd.c:232-0 Do command: unmount
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-0
cbs-01-sd: dircmd.c:617-0 Device SL-500 drive wrong: want=1 got=0 skipping
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-1
cbs-01-sd: dircmd.c:612-0 Found changer device Drive-1
cbs-01-sd: dircmd.c:625-0 Found device Drive-1
cbs-01-sd: block.c:133-0 Returning new block=39cee10
cbs-01-sd: acquire.c:647-0 JobId=0 enter attach_dcr_to_dev
...and then just hung there. "Aha, race condition!" I thought, and sure enough a bit of searching found this commit in November: "Fix SD DCR race condition that causes seg faults". No, I don't have a segfault, but the commit touches the last routine I see logged (along with a buncha others).
This commit is in the 5.0.1 release; I wasn't planning to upgrade to this just yet, but I think I may have to. But I'm going on vacation week after next, and I'm reluctant to do this right before I'm away for a week. What to do, what to do...
Backups: Bacula has been giving me problems the last week or so. I've got this file server I'm trying to back up; it's got a 2TB partition, and I've been naively trying to just grab it all in one go. Partly that's because it hasn't been backed up before, and I figured this'd be the quickest, simplest way to get going.
What's happened is that after slurping 2 TB over a 100 Mbit connection (no, there's no way to make that quicker), which takes 53 hours, the writing to tape fails for reasons I've yet to figure out. Bacula doesn't say "Oh, the first bit worked so I can just grab that next time...." (To be fair, that's probably a much harder problem than I imagine.) And in the meantime, despite having two drives and two pools of tapes, backups for other stuff pile up behind this big backup and then don't work: they get put on spool space, but then despooling to tape fails.
Contact manglement: I've been looking for a contact management program for $WORK. Requirements:
This turns out to be surprisingly hard to find, and not just because Freshmeat's interface is terrible. Applications appear to fall into n categories:
So now I'm trying to decide between using Dadabik, which'll let me make a frontend w/o much work as long as I can come up with a schema, or modifying one of the complete-but-bletcherous apps and getting a prettier page. (I'm always paranoid about people refusing to use a web-based tool because it isn't pretty enough; I don't know how to make it prettier and it's not something I personally care about enough to do something about, so I'm caught between don't care and don't know how to fix it if I do care. As a result I panic.)
Family: Son #2 went to the hospital Sunday night with his mom; he's fine, but I was up 'til they got back at midnight. Still got up at 5:30am as usual, thinking I'd catch up last night. Then Son #1 had a bad nightmare last night and it took a while to get him calmed down. Spent a couple hours after that staring at the ceiling, trying to get myself calmed down. Still up at 5:30am as usual.
Dentist: Root canal didn't work. My former dentist, who is the second most graceless dentist I've ever seen, couldn't get through and referred me to an endodontist (someone who does root canals; thank you, Wikipedia). My appointment for them is on April 1st.
And that is that.
Ran into a little problem this week when I tried to do a restore from a backup at work. Bacula loaded the tape, then said it couldn't read the label. Wha?
After much investigation, during which I completely neglected to cut-n-paste the error messages, I think I've figured out what happened:
Ack. Needless to say, this was not good. Fortunately, the file in question was not a terribly important one; unfortunately, that's about the last 2 weeks of incrementals gone. Lesson learned: don't assume your backup program knows what's going on when hardware reboots from under it.
In other news: on Thursday I got 5 new Dell servers. Woot! One of 'em will be our new LDAP/web/email/FTP server (Xen ftw!); the rest are going to be running protein search engines for various researchers across BC. They're racked and I'm stoked, except that it turns out the difference between the DRAC6 Express and Enterprise, besides a few hundred dollars, is that the Enterprise does console redirection and the Express doesn't. Dammit.
I'm going to see if there's any trickery that can be done, but I'm not holding out hope. I have got a 32-port console server, but it's two racks away...might have to run a small batch o' cables up and over to make this work.
First, it occurred to me today that the problems I've been having with bacula-sd dying or becoming unresponsive may be because of the way Nagios has monitored it. I've been using the check_tcp plugin, and when I looked on the backup machine there were, at one point, 21 connections to the sd port. Half were from the monitoring machine and were in the CLOSE_WAIT state. The max concurrent jobs for -sd is set to 20. I've turned off Nagios monitoring for now; we'll see how that does.
Second -- edit: sorry, stupid error. I withdraw the point.
Weird...Just ran into a problem with restarting bacula-sd. For some reason, the previous instance had died badly and left a zombie process. I restarted bacula-sd but was left with an open port:
# sudo netstat -tupan | grep 9103 tcp 0 0 0.0.0.0:9103 0.0.0.0:* LISTEN -
which meant that bconsole hung every time it tried to get the status of bacula-sd. Unsure what to do, I tried telnetting to it for fun and then quit; after that the port was freed up and grabbed by the already-running storage daemon:
tcp 0 0 0.0.0.0:9103 0.0.0.0:* LISTEN 16254/bacula-sd
and bconsole was able to see it just fine:
Connecting to Storage daemon tape at bacula.example.com:9103
example-sd Version: 3.0.1 (30 April 2009) x86_64-example-linux-gnu example
Daemon started 06-Jul-09 10:18, 0 Jobs run since started.
Heap: heap=180,224 smbytes=25,009 max_bytes=122,270 bufs=94 max_bufs=96
Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8
I've run into an interesting problem with the new backup machine.
It's a Sun X4240 with 10 x 15k disks in it: 2 x 73GB (mirrored for the OS) and 8 x, um, a bunch (250GB?), RAID0 for Bacula spooling. (I want fast disk access, so RAID0 it is.) RAID is taken care of by an onboard RAID card, so these look like regular disks to Linux.
Now the spool disk works out to about 2.2TB or so — which is big enough to make baby fdisk cry:
WARNING: The size of this disk is 2.4 TB (2391994793984 bytes). DOS partition table format can not be used on drives for volumes larger than 2.2 TB (2199023255040 bytes). Use parted(1) and GUID partition table format (GPT).
Well, okay, haven't used parted before but that's no reason to hold back. I follow directions and eventually figure out that "mkpart gpt ext3 0 2392G" will do what I want. GPT? Piece of cake! And then I rebooted, and I couldn't boot up again. Blank screen after the POST. Crap!
The first time this happened, the reboot also coincided with some additional problems during the POST where too many cards were trying to shove their ROM into the BIOS memory (or some such); I thought the two were connected. But then I did it again today, and I finally started digging.
The problem is that parted overwrites the MBR when setting up a GPT disklabel. This has been noted and argued over. My understanding of the two sides of the debate is:
Meanwhile, the parted camp has a number of bugs dealing with this very issue, two opened a year ago, and none have any response in them.
This enterprising soul submitted a patch back in December 2008, which appears to have fallen to the floor.
As for me, I was able to convince the BIOS to boot from the smaller disk, and then get a rescue CentOS image going via PXE booting, and then reinstall grub on the smaller disk. Sorted. All I had to do was change "root (hd1,0)" to "root (hd0,0)" in grub.conf.
A touch anti-climactic after all that, perhaps. But it was interesting a) to learn about all this (I hadn't really thought about successors to the DOS partition format before), and b) to see what a slender thread we (okay, I) hang our hopes on sometimes. It's a necessary, sobering thing to realize how much of what I use, depend on, believe in is created by volunteers who are smart, hard-working people — they argue and focus and forget just like real people, not inhabitants of some shining city on a hill I sometimes take them for ("Next beer in Jerusalem!").
"label barcode" just failed miserably. (Neat command that.) And I had thought that DTE meant the arm, but no: upon reflection, it's a subtle/obtuse (not the right word, but oh well) way of referring to the tape drive itself.
This sounds like when I was at my previous employer and they asked if I could develop a web-based system to take surveys. I nearly said, "yes" because, well, I know perl, I know CGI, and I could do it. However, I was smart enough to say "no, but surveymonkey.com will do it for cheap." Best of all it was self-service and the HR person was able to do it entirely without me. If I had said I could write such a program, it would have been days of back-and-forth changes which would have driven me crazy. Instead, she was happy to be empowered to do it herself. In fact, doing it herself without any help became a feather in her cap.
The lesson I learned is that "can I do it?" includes "do I want to do it?". If I can do something but don't want to, the answer is, "No, I don't know how" not "I know how but don't want to". The first makes you look like you know your limits. The latter sounds like you are just being difficult.
Gave a tour of the new server room today to about 30-odd people in the department. Ended on a bit of a low note ("and that's the end! Any questions?") but other than that it went well. Even got an ounce of champagne at the end of it.
Oh, and yesterday I found out that our SL-500 has three fibre channel interfaces, compared to the one interface in the server we bought. I think the sales folks assumed we had a fibre switch, and I didn't realize it all (data + control) wouldn't go over one cable. Arghh.
Just saw a character named Terence on "Entourage" who was not Terrance Stamp. Now I want to see "Bowfinger" and "The Limey", in that order.
I'm testing Bacula 3; the new release has just come out, and I'm very much looking forward to rolling it out here.
One of the things I've been doing is trying to get TLS working, which I utterly failed at in my last job. I must've failed to see these pages, which a) point out that the otherwise-excellent Bacula manual is (ahem) sparing when it comes to TLS, and b) you need to put the cert files in places that strike me as unexpected.
Thus, in bacula-dir.conf you put the directives listing the director's cert/key in the client section — IOW, you say "and use this key/cert combo when connecting to client foo." Meanwhile, on client foo, you add the client's cert/key directives in the director section ("and use this key/cert when talking to the director"), along with things like the CA cert and required CNs.
Oh, and did you know that you can debug SSL handshakes with openssl? True story.
I can't believe it...my youngest son, after nearly three weeks of being up four or five times each night, slept nearly all the way through without a break: he only woke up at 1am and 5:15am, which is close enough to my usual wakeup time as makes no difference. It was wonderful to have a bit of sleep.
This comes after staying up late (11pm!) on Sunday bottling the latest batch of beer, a Grapefruit Bitter recipe from the local homebrew shop. You know, it really does taste like grapefruit, and even this early I'm really looking forward to this beer.
My laptop has a broken hinge, dammit. I carry it around in my backpack without any padding, so I guess I'm lucky it's lasted this long. Fortunately the monitor still works and mostly stays upright. I've had a look at some directions on how to replace it; it looks fiddly, but spending $20 on a new set of hinges from eBay is a lot more attractive than spending $100. Of course, the other consideration is whether I can get three hours to work on it….But in the meantime, I've got it on the SkyTrain for the first time in a week; it's been hard to want to do anything but sleep lately.
Work is still busy:
I'm trying to get tinyMCE and img_assist to work with Drupal
Contacting vendors to look at backup hardware. So far we're looking at the Dell ML6010 and the Sun SL500. They're both modular, which is nice; we've got (low) tens of TB now but that'll ramp up quickly. The SL500 seems to have some weird things; according to this post, it takes up to 30 minutes to boot (!) and you can't change its IP address without a visit from the service engineer (!!). Those posts are two years old, so perhaps things have changed.
Trying to figure out what we want for backup software, too. I'm used to Bacula (which works well with the ML6010) and Amanda, but I've been working a little bit with Tivoli lately. One of the advantages of Tivoli is the ease of restoring it gives to the users…very nice. I'm reading Backup and Recovery again, trying to get a sense of what we want, and reviewing Preston's presentation at LISA06 called "Seriously, tape-only backup systems are dead". So what do we put in front of this thing? Not sure yet…
Speaking of Tivoli, it's suddenly stopped working for us: it backed up filesystems on our Thumper just fine (though we had to point it at individual ZFS filesystems, rather than telling it to just go), then stopped; it hangs on files over a certain size (somewhere around 500kb or so) and just sits there, trying to renew the connection over and over again. I've been suspecting firewall problems, but I haven't changed anything and I can't see any logged blocked packets. Weird.
Update: turned out to be an MTU problem:
I had no idea there were GigE NICs that did not support Jumbo frames. Though maybe that's just the OpenBSD driver for it. Hm.
Matt asked how Amanda worked for people, and whether they'd recommend anything else. I tried to leave a comment, but Blogger's CAPTCHA (god, I hate that acronym) never seems to work for me. So here goes. (Irony of a man w/an email-based comment system complaining about someone else's left as exercise f/t reader.)
Amanda: Nice, but: At my last job (2.5 years ago now), we started running into problems when backing up a 1TB RAID5 array...simple Promise disk array, nothing special or terribly fast. Amanda would take hours to do an estimate of the backups…which, since Amanda tries to pack tapes as full as it can, it does all the time. This got to be a huge pain, and we didn't find a solution to this problem before I left. (We were using GNU tar for Amanda; not sure if that had anything to do with it, and I can't remember what the alternatives were…maybe dump? Dunno.) Not sure what the current state is.
Bacula: +1 on the nice. Very, very good at my current job; absolutely no problems with it at all. And the documentation is enough to cry for, it's so complete and wonderful and thorough and accurate and well done. Clients for Unix, Windows, and Mac. Total filesystems here are…uh…less than 1TB, definitely, although it's creeping up there. So the smaller size may have something to do with it.
Work...hell, life is busy these days.
At work, our (only) tape drive failed a couple of weeks ago; Bacula asked for a new tape, I put it in, and suddenly the "Drive Error" LED started blinking and the drive would not eject the tape. No combination of power cycling, paperclips or pleading would help. Fortunately, $UNIVERSITY_VENDOR had an external HP Ultrium 960 tape drive + 24 tapes in a local warehouse. Hurray for expedited shipping from Richmond!
Not only that, the Ultrium 3 drive can still read/write our Ultrium 2 media. By this I mean that a) I'd forgotten that the LTO standard calls for R/W for the last generation, not R/O, and b) the few tests I've been able to do with reading random old backups and reading/writing random new backups seem to go just fine.
Question for the peanut gallery: Has anyone had an Ultrium tape written by one drive that couldn't be read by another? I've read about tapes not being readable by drives other than the one that wrote it, but haven't heard any accounts first-hand for modern stuff.
Another question for the peanut gallery: I ended up finding instructions from HP that showed how to take apart a tape drive and manually eject a stuck tape. I did it for the old Ultrium 2. (No, it wasn't an HP drive, but they're all made in Hungary...so how many companies can be making these things, really?) The question is, do I trust this thing or not? My instinct is "not as far as I can throw it", but the instructions didn't mention anything one way or the other.
In other news, $NEW_ASSIGNMENT is looking to build a machine room in the basement of a building across the way, and I'm (natch) involved in that. Unfortunately, I've never been involved in one before. Fortunately, I got training on this when I went to LISA in 2006, and there's also Limoncelli, Hogan and Chalup to help out. (That link sends the author a few pennies, BTW; if you haven't bought it yet, get your boss to buy it for you.)
As part of the movement of servers from one data centre across town to new, temporary space here (in advance of this new machine room), another chunk of $UNIVERSITY has volunteered to help out with backups by sucking data over the ether with Tivoli. Nice, neighbourly thing of them to do!
I met with the two sysadmins today and got a tour of their server room. (Not strictly necessary when arranging for backups, but was I gonna turn down the chance to tour a 1500-node cluster? No, I was not.) And oh, it was nice. Proper cable management...I just about cried. :-) Big racks full of blades, batteries, fibre everywhere, and a big-ass robotic Ultrium 2 tape cabinet. (I was surprised that it was 2, and not U3 or U4, but they pointed out that this had all been bought about four or five years ago…and like I've heard about other government-funded efforts, there's millions for capital and little for maintenance or upgrades.)
They told me about assembling most of it from scratch...partly for the experience, partly because they weren't happy with the way the vendor was doing it ("learning as they went along" was how they described it). I urged them to think about presenting at LISA, and was surprised that they hadn't heard of the conference or considered writing up their efforts.
Similarly, I was arranging for MX service for the new place with the university IT department, and the guy I was speaking to mentioned using Postfix. That surprised me, as I'd been under the impression that they used Sendmail, and I said so. He said that they had, but they switched to Postfix a year ago and were quite happy with it: excellent performance as an MTA (I think he said millions of emails per day, which I think is higher than my entire career total :-) and much better Milter performance than Sendmail. I told him he should make a presentation to the university sysadmin group, and he said he'd never considered it.
Oh, and I've completely passed over the A/C leak in my main job's server room…or the buttload of new servers we're gonna be getting at the new job…or adding the Sieve plugin for Dovecot on a CentOS box...or OpenBSD on a Dell R300 (completely fine; the only thing I've got to figure out is how it'll handle the onboard RAID if a drive fails). I've just been busy busy busy: two work places, still a 90-minute commute by transit, and two kids, one of whom is about to wake up right now.
Not that I'm complaining. Things are going great, and they're only getting better.
Last note: I'm seriously considering moving to Steve Kemp's Chronicle engine. Chris Siebenmann's note about the attraction of file-based systems for techies is quite true, as is his note about it being hard to do well. I haven't done it well, and I don't think I've got the time to make it good. Chronicle looks damn nice, even if it does mean opening up comments via the web again…which might mean actually getting comments every now and then. Anyhow, another project for the pile.
I was able to get Quickbooks 2007 working with a non-admin account today…woot! Here's what I did:
This isn't ideal — the explorer process in QB is still running privileged — but at least that's the only IE process running as admin.
And Bacula: tripped over a small thing. I'm running the btape utility to make sure our tape drive works with it. I ran bfill, rather than fill, then wondered why I got errors at the end. Turns out to be an old command that probably shouldn't be around anymore.
Now to run fill…another couple hours to go.
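For reference, a btape run looks roughly like this (the config path and device name are assumptions, not necessarily what we've got here):

    # point btape at the storage daemon config and the tape device
    btape -c /etc/bacula/bacula-sd.conf /dev/nst0
    # then, at the btape prompt:
    *test       # basic compatibility checks first
    *fill       # write until the tape is full, then read it all back -- slow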
Came across a mention of BSDstats.org on the Dragonfly BSD Digest, and I've set it up on my home machine. There are a ton of FreeBSD machines, and only 64 OpenBSD clients reported…time to change that!
I'm reading the documentation for Bacula right now, and it's amazing. Clearly written, thorough and extensive — almost 800 pages long. I'm very impressed.
At work we use Amanda for backups, and it's pretty good -- but for various reasons we don't use the Amanda server/client on every single machine. For those exceptions, we point Amanda at the host it's running on, where we keep copies of the important stuff via rsync. This usually works pretty well, and it also fits in well with our other backup mechanism: the copy of yesterday. This is a copy of home directories and some other things, updated with rsync every morning at 3am. It gives people an easy way to get something they had yesterday, which means fewer trips to the backup tapes.
We also do a couple of sets of backups with Amanda: daily, where we let Amanda juggle full vs. incremental in the usual way, and weekly/monthly, where we tell Amanda to just do full backups. For those, we point Amanda at the copy of yesterday rather than grabbing full backups over the network.
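To make that concrete, here's a rough sketch of the moving parts -- the paths and the disklist entry are hypothetical, not our actual config:

    # root's crontab on the backup server: refresh the copy of yesterday at 3am
    0 3 * * *    rsync -a --delete remotehost:/home/ /backup/yesterday/home/

    # Amanda's weekly/monthly disklist then points at the local copy,
    # not at the live host over the network:
    # backupserver.example.com  /backup/yesterday/home  comp-user-tar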
I've run into problems over the last few weeks, though, where weekly backups have failed for a few home directories -- the fullest ones, natch. It's taken me a while to figure out what's going on, but I think I've got a handle on it finally.
Full weekly/monthly backups take a while to do -- typically two full days, because we don't have an automated tape changer. While they're running, I let regular backups pile up on the holding disk (close to half a terabyte available), then flush them when the weeklies are done. Here's the error that amstatus shows:
wait for dumping driver: (aborted:nak error: amandad busy)
Thanks to this post (Nabble? Never heard of 'em...) I finally clued in to the obvious: Amanda sometimes asks the local host for backups twice -- once as part of a daily backup, and once as part of a weekly backup. If this is right (why haven't I come across this more often?) it's going to cause pain. We don't have a tape changer, so backups just plain take a long time; there's no one here at 3am to switch a tape. I'm uncomfortable with the idea of turning off regular backups for two days a week. I really don't want to have to come in on weekends to switch tapes. Hm.
Maybe I'll look at just letting weeklies dump to disk over the weekend, then flush 'em during the week. That might work pretty well, actually.
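Something like this is what I'm picturing (a sketch only -- the config name, holding-disk path, sizes, and the reserve setting are all assumptions):

    # amanda.conf for the weekly config
    holdingdisk hd1 {
        directory "/amanda/holding"
        use 400 Gb
    }
    reserve 0      # let full dumps land on the holding disk even with no tape mounted

    # Friday night, from cron: dump everything to the holding disk
    amdump weekly
    # Monday morning, with a fresh tape in the drive: flush it all to tape
    amflush weekly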
In other news, got a bunch of HP workstations in from CDW, and I'm quite happy with them. At last count, the company has 879 people starting in January (no, not really), and the idea of setting up that many Shuttles (my usual workstation of choice) and manually installing XP on each (no, no automated install yet) just filled me w/dread. The HPs are nice, very well put together (they're built like fucking tanks and weigh just as much), and they come with XP Pro installed. But hey, the manager wanted 160GB drives and these came with 80GB drives. What to do?
Turns out you can take out the old, put in the new (bigger) drive, and just use the restore disk. Boo hiss restore disks with no full copy of the OS, but damn it's nice: very few questions, and when you're done you're ready to go. And by "ready to go" I mean of course "ready to turn off all the crap, turn on other crap and install even more crap". I've either got to swallow my pride and get an AD controller in here (Noooooooo!) or else figure out some other way of automating all this.
...when the passing of a fire truck in the middle of the night means you obsess for half an hour about getting backup tapes out of your apartment.
Add this to amanda.conf to get better formatted reports:
columnspec "Disk=1:18,HostName=0:10,OutKB=1:10,DumpRate=1:10,TapeRate=1:10"
I got the iBook, I got the Slashdot t-shirt, I got the beard...but do you think I can get a wireless signal? Oh no. Thanks, Broadcom. But hey, enough complaining. Time for an update.
The wireless ISP is gonna do a point-to-point link between windows of our old and new temporary offices. Should give us 100Mb/s access or so. Which is good, because for a while I thought I'd have to walk down to London Drugs, grab some Linksys routers, and install my own firmware to do it. Which would have been a lot of fun...but would have been a fuck of a lot to get ready in, like, three days. Now I just have to get OpenVPN talking at either end, get Shaw installed, and set up a firewall. Oboy.
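The OpenVPN end of it should be the easy part -- a static-key point-to-point tunnel is only a few lines per side. A sketch, with made-up hostnames, addresses, and key path:

    # generate a shared key once and copy it to both ends
    openvpn --genkey --secret /etc/openvpn/static.key

    # office A (listens):
    dev tun
    ifconfig 10.8.0.1 10.8.0.2
    secret /etc/openvpn/static.key

    # office B (connects out):
    remote office-a.example.com
    dev tun
    ifconfig 10.8.0.2 10.8.0.1
    secret /etc/openvpn/static.key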
And then there's the troubles I've been having with our backup server. A while back I decided to start racking all the boxes we've been using as servers -- transfer the hard drives to proper servers, then use the old shell as a desktop for a new hire. Welp, the backup server was the first to go, and man it's been a headache.
First off, I didn't take care of cooling properly, and the tape drive (HP Ultrium 215, for those paying attention) suffered a nice little nervous breakdown and kept spitting out the tape. I tried downloading the HP diagnostic tool, but it only runs on Linux and the server runs FreeBSD -- neither Linux compatibility mode (not surprising) nor a Knoppix disk (kept hanging) allowed it to work. So I had no real idea what was going on other than the drive was too hot for my liking.
But HP, bless their souls, came to the rescue. Once I made it through their speech recognition voicemail tree hell, they just sent out another one -- they didn't even bitch about not being able to run the diagnostic tool. Not only that, it came the next day, and we don't even have any special contract with them -- that's just warranty. Thumbs up for them.
But now I've got different problems: the damn machine keeps seizing up on me. See, I've got this 500GB concatenated Vinum array of three disks that I use as a copy of yesterday's home directory for people, and I'm trying to move it to a four-disk RAID5 volume on the Promise array. I tried using rsync, and eventually it just froze. I thought maybe rsync was spending too much CPU time figuring out what to transfer, so today I tried using dump | restore -- and sure enough, it froze again.
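For the record, the dump | restore attempt was the usual one-liner -- roughly this, with hypothetical mount points:

    # level-0 dump of the old Vinum-backed filesystem, piped straight into
    # restore on the new RAID5 volume
    cd /promise/yesterday && dump -0af - /yesterday | restore -rf -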
I plugged in a monitor, hoping for a panic or something, but nope -- just unresponsive. I've found some mention in the FreeBSD mailing lists about possible problems with write caching and the Adaptec 3960D SCSI controller (which I thought was a 39160 SCSI controller, but I guess not). I'll have to see if that does the trick or not -- but in the meantime I'm wondering how I'm gonna get yesterday on the Promise. Of course, figuring out why it's crashing in the first place would be even better...
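If it does turn out to be the disks' write cache, camcontrol can at least show and toggle the WCE bit on FreeBSD -- a sketch, assuming the disk shows up as da0:

    # show the caching mode page (page 8) for the disk
    camcontrol modepage da0 -m 8
    # edit it interactively and set WCE to 0 to disable the write cache
    camcontrol modepage da0 -m 8 -e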
But it's not all bad news: earlier this week, the support manager at Promise that I've been dealing with called to tell me that the word had come down from on high. Yep...Promise is going to follow the GPL and properly release the Linux and Busybox source code for the firmware that goes into the VTrak 15100. Hurray! I'll have to watch, of course, and make sure it shows up...but it sounds good. "Let's put it this way," said the manager. "It's on my desk for me to do. And I don't want it there for long." To the home front, now.
As if I didn't have enough on the go, I've blown my tax return on the makings of a MythTV backend: 2.4GHz P4, umpty-GB hard drive, the PCHDTV-300 (get it while you can!), generic 128MB Nvidia (no onboard video on this mobo, or I would've stuck with that), a Hauppauge PVR-500MCE, and a nice Asus mobo in an Antec case to tie it all together. Random notes:
And now for something completely different: new mottoes for Harley Davidson:
"Harley-Davidson: Because social contracts are for weak pussy-ass losers with small dicks."
"Harley-Davidson: Because those other people aren't really human. Not like you and me."
"Harley-Davidson: You deserve it. So do they."
"Harley-Davidson: Because if you pissed in their faces, you'd be arrested."
"Harley-Davidson: Because 'Fuck you!' is just too damned hard to remember."
"Harley-Davidson: Because 'Fuck you!' is just too damned eloquent."