The Life of a Sysadmin

Carousel is a lie!

Entries from July 2009.

Bacula, gossip, advice
Thu Jul 2 16:31:35 PDT 2009
This sounds like when I was at my previous employer and they asked if
I could develop a web-based system to take surveys.  I nearly said,
"yes" because, well, I know perl, I know CGI, and I could do it.
However, I was smart enough to say "no, but surveymonkey.com will do
it for cheap."  Best of all it was self-service and the HR person was
able to do it entirely without me.  If I had said I could write such a
program, it would have been days of back-and-forth changes which would
have driven me crazy.  Instead, she was happy to be empowered to do it
herself.  In fact, doing it herself without any help became a feather
in her cap.

The lesson I learned is that "can I do it?" includes "do I want to do
it?".  If I can do something but don't want to, the answer is, "No, I
don't know how" not "I know how but don't want to".  The former makes
you look like you know your limits.  The latter sounds like you are
just being difficult.
Tags: backups, cfengine, reading.
GPT and MBR
Fri Jul 3 12:17:25 PDT 2009

I've run into an interesting problem with the new backup machine.

It's a Sun X4240 with 10 x 15k disks in it: 2 x 73GB (mirrored for the OS) and 8 x, um, a bunch (250GB?), RAID0 for Bacula spooling. (I want fast disk access, so RAID0 it is.) RAID is taken care of by an onboard RAID card, so these look like regular disks to Linux.

Now the spool disk works out to about 2.2TB or so — which is big enough to make baby fdisk cry:

WARNING: The size of this disk is 2.4 TB (2391994793984 bytes).
DOS partition table format can not be used on drives for volumes
larger than 2.2 TB (2199023255040 bytes). Use parted(1) and GUID
partition table format (GPT).

Well, okay, haven't used parted before but that's no reason to hold back. I follow directions and eventually figure out that mkpart gpt ext3 0 2392G will do what I want. GPT? Piece of cake! And then I rebooted, and I couldn't boot up again. Blank screen after the POST. Crap!
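
For the curious, the 2.2 TB figure in the fdisk warning can be reproduced with a little arithmetic. This is only a sketch, assuming 512-byte sectors and the 32-bit sector counts that DOS partition entries use; the names below are mine, not anything fdisk or parted exposes.

```python
# MBR partition entries store start and length as 32-bit sector counts.
SECTOR_BYTES = 512          # assumed logical sector size
MAX_SECTORS = 2**32 - 1     # largest value a 32-bit field can hold

mbr_limit_bytes = MAX_SECTORS * SECTOR_BYTES
disk_bytes = 2391994793984  # size reported by fdisk above


def needs_gpt(size_bytes, limit=mbr_limit_bytes):
    """Return True when a disk is too large for a DOS/MBR partition table."""
    return size_bytes > limit


print(mbr_limit_bytes)        # 2199023255040, the figure in the fdisk warning
print(needs_gpt(disk_bytes))  # True
```

So the spool array is roughly 190 GB over the line, which is why fdisk points at parted and GPT.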

The first time this happened, the reboot also coincided with some additional problems during the POST where too many cards were trying to shove their ROM into the BIOS memory (or some such); I thought the two were connected. But then I did it again today, and I finally started digging.

The problem is that parted overwrites the MBR when setting up a GPT disklabel. This has been noted and argued over. My understanding of the two sides of the debate is:

Meanwhile, the parted camp has a number of bugs dealing with this very issue, two opened a year ago, and none have any response in them.

This enterprising soul submitted a patch back in December 2008, which appears to have fallen to the floor.

As for me, I was able to convince the BIOS to boot from the smaller disk, and then get a rescue CentOS image going via PXE booting, and then reinstall grub on the smaller disk. Sorted. All I had to do was change root (hd1,0) to root (hd0,0) in grub.conf.

A touch anti-climactic after all that, perhaps. But it was interesting a) to learn about all this (I hadn't really thought about successors to the DOS partition format before), and b) to see what a slender thread we (okay, I) hang our hopes on sometimes. It's a necessary, sobering thing to realize how much of what I use, depend on, believe in is created by volunteers who are smart, hard-working people; they argue and focus and forget just like real people, not inhabitants of some shining city on a hill I sometimes take them for ("Next beer in Jerusalem!").

Tags: backups, hardware, linux.
Zombie bacula-sd and open port
Mon Jul 6 10:23:04 PDT 2009

Weird... Just ran into a problem with restarting bacula-sd. For some reason, the previous instance had died badly and left a zombie process. I restarted bacula-sd but was left with an open port:

# sudo netstat -tupan | grep 9103
tcp        0      0 0.0.0.0:9103                0.0.0.0:* LISTEN      -

which meant that bconsole hung every time it tried to get the status of bacula-sd. Unsure what to do, I tried telnetting to it for fun and then quit; after that the port was freed up and grabbed by the already-running storage daemon:

tcp        0      0 0.0.0.0:9103                0.0.0.0:* LISTEN      16254/bacula-sd

and bconsole was able to see it just fine:

Connecting to Storage daemon tape at bacula.example.com:9103

example-sd Version: 3.0.1 (30 April 2009) x86_64-example-linux-gnu example
Daemon started 06-Jul-09 10:18, 0 Jobs run since started.
 Heap: heap=180,224 smbytes=25,009 max_bytes=122,270 bufs=94 max_bufs=96
Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8
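
The telnet poke can be approximated in a few lines of Python. This is a rough sketch, not something I ran at the time: port_accepts is a made-up helper name, and all it shows is whether the TCP handshake completes. It says nothing about which process owns the port or whether the daemon will actually answer.

```python
import socket


def port_accepts(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # The with-block closes the socket again as soon as we are done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False
```

Calling port_accepts("bacula.example.com", 9103) would be the equivalent of the telnet-and-quit above, using the host and port from the bconsole output.
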
Tags: backups, networking.
Cacti debugging
Thu Jul 16 16:15:54 PDT 2009

The saga of the UPS continues. Yesterday I got the SNMP card set up and working. I also found this Cacti template, which promised lots of pretty graphs. But there were a few bumps along the way.

First, Cacti was convinced that the UPS was down. Actually, it took me a while to figure this out because the logs didn't say anything about this host at all. Eventually I tracked it down to Cacti using SNMP queries to see if it was up; turns out that this machine doesn't like being queried at the OID 0.1, and just doesn't respond. Changing the upness-detecting algorithm (heh) to TCP ping did the trick nicely.

Next, the graphs for the UPS were still not being produced, even though the RRDs were now being updated. I got the debug info for a graph and ran the rrdtool command by hand. "The RRD does not contain an RRA matching the chosen CF" was the response.

This thread showed a lot of people having the same problem. Since some of these problems were fixed by an upgrade, I did so; there were a few CentOS updates waiting for that machine anyhow. That made it worse: no graphs were being shown now. rrdtool said that there were no fonts present, so maybe fontconfig was out of order. Installing dejavu-lgc-fonts did the trick nicely, and I got my graphs back.

Well, all except the UPS ones I was after in the first place. I was still getting the error about not containing the chosen CF. Well, when all else fails keep reading the forum, right?

The rrdtool command used the LAST function; this was the culprit. If I ran s/LAST/AVERAGE/g on the command, it worked a treat. Thus, one option would have been to edit the template. However, I decided on an alternate approach, suggested in the forum: I removed the UPS RRDs, went to Data Sources -> RRAs in the Console menu, selected each RRA in turn and added LAST to the consolidation function.
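
A quick way to see the mismatch is to list which consolidation functions an RRD actually has and compare that with what the graph command asks for. The sketch below assumes `rrdtool info` prints lines of the form rra[N].cf = "NAME"; the sample text is invented for illustration and is not the output from the UPS RRDs.

```python
import re


def consolidation_functions(info_text):
    """Collect the CF names listed in `rrdtool info` output."""
    return set(re.findall(r'^rra\[\d+\]\.cf = "(\w+)"', info_text, re.MULTILINE))


# Illustrative excerpt, not taken from the UPS RRD in this entry.
SAMPLE_INFO = '''\
rra[0].cf = "AVERAGE"
rra[0].rows = 600
rra[1].cf = "MAX"
rra[1].rows = 700
'''

available = consolidation_functions(SAMPLE_INFO)
print("LAST" in available)     # False: a DEF asking for LAST would fail
print("AVERAGE" in available)  # True
```

If LAST is missing from that set, either the graph has to ask for a CF that exists or the RRD has to be recreated with LAST included, which is what removing the RRDs and editing the RRAs amounts to.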

Finally! Whee! Except for one: the graph of voltage vs input frequency. I still don't know what this means to me, but I wasn't about to give up now.

Again, rrdtool provided the error: "For a logarithmic yaxis you must specify a lower limit > 0". Bug reports to the rescue: Console -> Graph Templates -> Voltage/Freq, and set Lower Limit to 0.1.

All that and I'm still the only one looking at these graphs. Man, I should frame them.

Tags: monitoring, web.
Mailman: NameError: global name 'DumperSwitchboard' is not defined
Mon Jul 20 14:15:12 PDT 2009

I came across a problem today trying to recover a subscribers list from an old version of mailman, using a new-ish (2.1.9) version. I dug up the config.db file for the list, then ran dumpdb on it:

$ /usr/lib/mailman/bin/dumpdb -n config.db
Traceback (most recent call last):
  File "/usr/lib/mailman/bin/dumpdb", line 159, in ?
    msg = main()
  File "/usr/lib/mailman/bin/dumpdb", line 126, in main
    d = DumperSwitchboard().read(filename)
NameError: global name 'DumperSwitchboard' is not defined

After a bit of digging, I found this mailing list post that gave the solution:

--- bin/dumpdb  2007-06-18 08:35:57.000000000 -0700
+++ bin/dumpdb  2007-08-02 17:45:42.187500000 -0700
@@ -49,6 +49,7 @@
 import sys
 import getopt
 import pprint
+import marshal
 from cPickle import load
 from types import StringType

@@ -121,9 +122,7 @@
     # Handle dbs
     pp = pprint.PrettyPrinter(indent=4)
     if filetype == 1:
-        # BAW: this probably doesn't work if there are mixed types of .db
-        # files (i.e. some marshals, some bdbs).
-        d = DumperSwitchboard().read(filename)
+        d = marshal.load(open(filename))
         if doprint:
             pp.pprint(d)
         return d

I copied dumpdb to my home directory, patched it, then ran it like so:

PYTHONPATH=/usr/lib/mailman/bin/ ./dumpdb config.db

Bingo!
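
The heart of the patch is a plain marshal.load on the file. Here is a minimal round-trip sketch of that call; the dictionary is a hypothetical stand-in for a list's config.db, and load_marshal_db is my own name, not part of Mailman. The marshal format is tied to the Python version, so a real config.db may only load under an interpreter close to the one Mailman was running on.

```python
import marshal
import os
import tempfile


def load_marshal_db(path):
    """Read one marshalled object from path, as the patched dumpdb does."""
    with open(path, "rb") as fh:   # binary mode: marshal data is not text
        return marshal.load(fh)


# Hypothetical stand-in for a list's config.db; real files hold far more keys.
sample = {"real_name": "example-list", "members": {"alice@example.com": 0}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.db")
    with open(path, "wb") as fh:
        marshal.dump(sample, fh)
    restored = load_marshal_db(path)

print(sorted(restored["members"]))  # ['alice@example.com']
```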

No tags
Two x "aha!" re: Bacula
Tue Jul 21 14:49:50 PDT 2009

First, it occurred to me today that the problems I've been having with bacula-sd dying or becoming unresponsive may be because of the way Nagios has monitored it. I've been using the check_tcp plugin, and when I looked on the backup machine there were, at one point, 21 connections to the sd port. Half were from the monitoring machine and were in the CLOSE_WAIT state. The max concurrent jobs for -sd is set to 20. I've turned off Nagios monitoring for now; we'll see how that does.
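
Counting connection states by hand gets old quickly, so here is a sketch that tallies them from netstat-style text. It assumes the usual column order (proto, recv-q, send-q, local address, foreign address, state); the sample lines are made up and count_states is a hypothetical helper.

```python
from collections import Counter


def count_states(netstat_text, port):
    """Count TCP states for connections whose local address ends in :port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(":%d" % port):
            counts[fields[5]] += 1
    return counts


# Made-up sample lines; addresses and counts are illustrative only.
SAMPLE = """\
tcp   0   0 0.0.0.0:9103      0.0.0.0:*        LISTEN
tcp   1   0 10.0.0.5:9103     10.0.0.9:40112   CLOSE_WAIT
tcp   1   0 10.0.0.5:9103     10.0.0.9:40113   CLOSE_WAIT
tcp   0   0 10.0.0.5:9103     10.0.0.7:51844   ESTABLISHED
tcp   0   0 10.0.0.5:22       10.0.0.7:51900   ESTABLISHED
"""

states = count_states(SAMPLE, 9103)
print(states["CLOSE_WAIT"])  # 2
```

A pile of CLOSE_WAIT entries on the sd port would fit the theory that the monitoring checks are eating connection slots.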

Second -- edit: sorry, stupid error. I withdraw the point.

Tags: backups.
In honour of the day...
Fri Jul 31 11:34:47 PDT 2009

I am a) going for beer and b) actually blogging. Yay me!

Tags: meta.
