The Life of a Sysadmin

Carousel is a lie!

Entries from October 2009.

Eject, *then* reboot
Sat Oct 3 13:56:22 PDT 2009

Ran into a little problem this week when I tried to do a restore from a backup at work. Bacula loaded the tape, then said it couldn't read the label. Wha?

After much investigation, during which I completely neglected to cut-n-paste the error messages, I think I've figured out what happened:

Ack. Needless to say, this was not good. Fortunately, the file in question was not a terribly important one; unfortunately, that's about the last 2 weeks of incrementals gone. Lesson learned: don't assume your backup program knows what's going on when hardware reboots from under it.

In other news: on Thursday I got 5 new Dell servers. Woot! One of 'em will be our new LDAP/web/email/FTP server (Xen ftw!); the rest are going to be running protein search engines for various researchers across BC. They're racked and I'm stoked, except that it turns out the difference between the DRAC6 Express and Enterprise, besides a few hundred dollars, is that the Enterprise does console redirection and the Express doesn't. Dammit.

I'm going to see if there's any trickery that can be done, but I'm not holding out hope. I have got a 32-port console server, but it's two racks away...might have to run a small batch o' cables up and over to make this work.

2 comments. Tags: backups, dell, hardware, oops, virtualization.
Where'd that bridge go?
Mon Oct 5 05:41:04 PDT 2009

Yesterday I got paged by one of my two Nagios boxes (learned that trick the hard way): a bunch of the machines in our server room had dropped off the network. Weirdly, this did not include the other Nagios box that's over there. WTF?

I logged into the server room's Nagios box, and sure enough couldn't ping the servers or the firewall. I could ping the console server...which was also on the Outside VLAN along with the monitoring box, as opposed to the Inside VLAN with the servers, which sat behind our firewall.

I was also able to ping the management cards/ILOMs/SPs/whatever the kids are calling them in the servers. Thankfully they're Sun boxes, so no Vista-like maze of flavours there...they all come with console redirection. I logged in and fired up a console, panicking because I thought that perhaps the newly-installed NUT clients had shut down the machines because I'd overlooked something.

But no...the machines were up, though hung if you tried to do any LDAP lookups. (Through an oversight, the LDAP server was also on the Outside VLAN. I'll be fixing that today.) Modulo that, they seemed fine.

So I logged into the firewall, which runs OpenBSD 4.3 in bridging mode. And this is where the weirdness lay: the bridge, and/or its component cards, was not working. ifconfig and brconfig said they were up and fine, and the ARP table was still populated (not sure what the lifetime of entries is -- isn't it around 20 minutes or so? must check -- but by this time the problem had been going on for about an hour). Yet I couldn't ping the firewall (one of those cards has an address) from either side, and I couldn't ping anything from the firewall.

pfctl -s all didn't show anything suspicious. There were no obvious problems in dmesg or /var/log/messages. I disabled, then re-enabled, the firewall to no effect. I ran /etc/netstart to no effect.

I even checked on the switches to see if the firewall's MAC address was showing up anywhere, and it was not -- not even directly after pinging it (and getting no response).

In the end I rebooted the machine and all was well.

The NIC in question is a dual-port Intel Pro 1000 (MT, I believe) that I've never had problems with. I've never come across problems like this before on OpenBSD (or, I think, anywhere else). The onboard Broadcom (boo, hiss) was acting fine...it was also on the ILOM's VLAN, and could see the other ILOMs just fine. (In fact, I should have just SSHd to the firewall using that VLAN from the Nagios box, rather than futz around with a 9600 bps console. Next time.)

So...that's my mystery for the weekend.

In other news, my older son (3.25 yrs) has taken to the stage in a big way: he now stands on top of the steps going up from our living room and sings us songs into one of at least two microphones. "Barbara Ann", anything by The Wiggles, and "Yo Gabba Gabba!" songs are prominent. This is after at least three solid weeks of guitar playing, where anything and everything gets strummed while being cradled in his arms while he sings, or maybe makes feedback sounds that'd make Yo La Tengo proud.

Meanwhile, my younger (1.5 yrs) has started saying lots of different phonemes, which is a real contrast to using "Dat!" for monkey, cereal, ball, yes, no, President Barack Obama's attempted health care reforms, and Linux. He has also begun sleeping in 'til 6:30 or 7:00 in the morning, which lets me write things like this. Both are infinitely endearing.

And incidentally, I really need to set up Nagios dependencies. I've had to ACK 27 services in a row (an unrelated, I think, problem with ILOM temperature readings means the SNMP checks are timing out). Either that or there's some way to select n services in Nagios and ack them all at once. Anyone?
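
One possible answer to the bulk-ack question: Nagios accepts ACKNOWLEDGE_SVC_PROBLEM lines through its external command file, so a list of services can be acknowledged in one go. What follows is only a sketch, assuming external commands are enabled; the function name is made up, and the command file path in the usage note is a guess that depends on the install.

```shell
# Hypothetical helper: read "host;service" pairs on stdin and print one
# ACKNOWLEDGE_SVC_PROBLEM external command per pair.
# $1 = epoch timestamp, $2 = author, $3 = comment.
# The three numeric fields are sticky=2 (ack stays until the service
# recovers), notify=1 and persistent=1.
nagios_ack_lines() {
    ts=$1
    author=$2
    comment=$3
    while IFS=';' read -r host service; do
        # Skip blank or incomplete lines.
        [ -n "$host" ] && [ -n "$service" ] || continue
        printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;2;1;1;%s;%s\n' \
            "$ts" "$host" "$service" "$author" "$comment"
    done
}
```

With the affected services listed one per line in a file, something like `nagios_ack_lines "$(date +%s)" admin 'known ILOM issue' < services.txt >> /usr/local/nagios/var/rw/nagios.cmd` would hand them all to Nagios at once, assuming that is where the command file lives and that `date` understands `+%s`.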

4 comments. Tags: networking, openbsd.
Instructions for yak shaving
Mon Oct 5 12:45:51 PDT 2009
  1. Install logwatch on Solaris fileserver.

  2. Notice that logwatch emails are not coming in.

  3. Log in and run logwatch by hand.

  4. Inspect mail log and notice lack of any entries.

  5. Notice that Postfix is in maintenance mode; start it up.

  6. Notice continued lack of emails.

  7. Notice that Postfix is already running, which confuses svcadm when it is told to start Postfix; it fails to do so and fails to log this.

  8. killall postfix, svcadm enable postfix.

  9. man svcadm; svcadm clear postfix; svcadm enable postfix.

  10. Run logwatch by hand; notice emailed report to "root@localhost.localdomain", which gets bounced by Postfix on the mail server because it's a non-existent host.

  11. Resist temptation to go down that rabbit hole just now, and stick to the problem at hand.

  12. Edit /opt/csw/etc/log.d/logwatch.conf and set MailTo to proper address.

  13. Re-run logwatch and note that reports are still going to root@localhost.

  14. After much swearing, notice that actually, logwatch is set to look in /opt/csw/etc/log.d/conf/logwatch.conf for configuration.

  15. Edit that file, re-run logwatch.

  16. Notice errors from Postfix: "postdrop[13848]: [ID 947731 mail.warning] warning: mail_queue_enter: create file maildrop/908447.13848: Permission denied".

  17. Run "postfix set-permissions". Test mail; still failing.

  18. Check permissions on another system and set by hand.

  19. Re-run logwatch. Still no email. Re-run with debug=high and get email.

  20. Wonder idly about futility of self-aware log watching system that can't report on its own heisenbug-induced failure, crappy packaging practices, inability to check end-to-end email connectivity, other career options.

  21. (Update) Realize that the emails show up if "Detail" is set to Medium or High; Low, the default, makes the report silent.

  22. (Update) Uninstall the package and reinstall, only to find that the symlink to conf/logwatch.conf is set up at installation, and that this is probably a case of $EDITOR breaking the symlink. Apply head to desk.
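
The broken symlink from step 22 can be spotted before and after an edit with nothing more than `test -L`. A small sketch; the function name is made up for illustration, and the path in the usage note is the one from the steps above.

```shell
# Hypothetical helper: say whether a config path is still a symlink, so an
# editor that saves by renaming a temp file over it gets noticed.
describe_config() {
    if [ -L "$1" ]; then
        echo "symlink: $1"
    elif [ -e "$1" ]; then
        echo "not a symlink: $1"
    else
        echo "missing: $1"
    fi
}
```

Running `describe_config /opt/csw/etc/log.d/logwatch.conf` before and after editing would show whether the link survived.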

2 comments. Tags: monitoring, packagemanagement, solaris, yakshaving.
Wrong, wrong, wrong
Fri Oct 9 16:18:06 PDT 2009

I'm not sure exactly where I saw that DRAC6 Express does not do console redirection -- it was on a mailing list somewhere -- but that turns out to be just wrong:

(For the record, it was the "External Serial Connector" in BIOS that got me; it should be "serial device 1", not "Remote Access Device".)

I can now SSH to the DRAC and get a console just fine. I wish to apologize to Dell, the people of Monaco and the constellation Sagittarius.

3 comments. Tags: correction, dell, hardware.
Problems installing NUT on Solaris 10
Thu Oct 15 09:44:16 PDT 2009

I ran into a couple problems compiling NUT on Solaris 10 today. They were pretty much due to bad setup on my part, but they did take a while to track down. For the record:

Tags: solaris.
Grabbing Confluence markup from an XML export
Thu Oct 15 12:27:22 PDT 2009

As part of a slow migration from Confluence to FosWiki, I had to grab the Confluence markup from an XML dump. I found a Python script to do this, but I think the XML format must have changed in the meantime; the script was unable to grab the body content.

I've lashed together a version available here that works with Confluence 2.10 and Python 2.5. Now to convert the pages to Foswiki markup...

Ack! Just discovered this updated script in the comments section. Looks like that one grabs a lot more than this does (labels, attachments). Oh well, I needed the practice with Python.

Tags: python.
Foswiki-to-PDF Makefile
Fri Oct 23 12:49:45 PDT 2009

At $WORK I've just switched to using Foswiki (formerly TWiki) for documentation. I miss editing files directly from Emacs like you can with Confluence, but I'll get over it. The main reason I like Foswiki is that, at heart, the source files are plain text, and are available as plain text -- no need to trawl through a database.

Another nice feature that Confluence has is the ability to export a space (Foswiki calls it a web) directly as PDF. A bit of scripting takes care of that, but since this is the second time I've lashed together a Makefile to generate a PDF from Foswiki, I figure it's time to post it.

You can find the Makefile here. It uses lynx, wget and htmldoc; of those, I suppose only htmldoc is hard to replace. There's one important assumption built into the Makefile, though: that every page is linked to from the front page of the web, which is how I organize my pages.
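
For the curious, the link-gathering step might look roughly like this. It is a sketch of the general idea, not the Makefile itself; the function name and the example.com URLs in the usage note are invented, and it leans on the same assumption that every page is linked from the front page.

```shell
# Hypothetical helper: read `lynx -dump -listonly` output on stdin and print,
# once each, the links that start with the given prefix ($1).
web_topic_urls() {
    awk -v prefix="$1" '
        # Link lines look like "   3. http://wiki.example.com/bin/view/Docs/Backups".
        $1 ~ /^[0-9]+\.$/ && index($2, prefix) == 1 && !seen[$2]++ { print $2 }
    '
}
```

Piping `lynx -dump -listonly http://wiki.example.com/bin/view/Docs/WebHome` into `web_topic_urls http://wiki.example.com/bin/view/Docs/` yields the page list; each page can then be saved with `wget -q -O` and the saved files handed to `htmldoc --webpage -f docs.pdf`.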

Share and enjoy!

Update: Just showed the boss the printed version, and he was very impressed. Yay me! :-)

2 comments. Tags: documentation, scripting.
Where'd that bridge go? Redux
Wed Oct 28 10:57:13 PDT 2009

So this morning, again, I got paged about machines in our server room dropping off the network. And again, it was the bridge that was the problem. This time, though, I think I've figured out what the problem is.

The firewall has two interfaces, em0 (on the outside) and em1 (on the inside), which are bridged. em1 has an IP address. I was able to SSH to the machine from the outside and poke around a bit. I still didn't find anything in the logs, but I did notice this (edited for brevity):

$ ifconfig
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 9000
    lladdr 00:15:17:ab:cd:ef
    media: Ethernet autoselect (1000baseT full-duplex)
    status: active
    inet6 fe80::215:17ff:feab:cdef%em0 prefixlen 64 scopeid 0x1
em1: flags=8d43<UP,BROADCAST,RUNNING,PROMISC,OACTIVE,SIMPLEX,MULTICAST> mtu 9000
    lladdr 00:15:17:ab:cd:ee
    groups: egress
    media: Ethernet autoselect (1000baseT full-duplex)
    status: active
    inet 10.0.0.1 netmask 0xffffff80 broadcast 10.0.0.1
    inet6 fe80::215:17ff:feab:cdee%em1 prefixlen 64 scopeid 0x2

See that? em1 has OACTIVE set. A quick search turned up some interesting hits, so for fun I tried resetting the interface:

$ sudo ifconfig em1 down
$ sudo ifconfig em1 up

and huzzah! it worked.
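
Spotting the flag by eye is easy to miss at 3am, so here is a rough sketch of a check that could be run from cron or a Nagios plugin. The function name is made up, and it only assumes ifconfig output shaped like the listing above.

```shell
# Hypothetical helper: read ifconfig output on stdin and print the name of
# every interface whose flags include OACTIVE.
oactive_ifaces() {
    awk '
        # Interface header lines look like "em1: flags=8d43<UP,...> mtu 9000".
        /^[a-z]+[0-9]+: flags=/ {
            name = $1
            sub(/:$/, "", name)
            if ($0 ~ /[<,]OACTIVE[,>]/) print name
        }
    '
}
```

`ifconfig | oactive_ifaces` prints nothing when all is well. OACTIVE can show up briefly on a busy interface, so it probably makes sense to alert only when it persists across a couple of checks rather than to bounce the interface automatically.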

When I got to work I did some more digging and figured out that this and the earlier outage were almost certainly caused by running a full backup, via Bacula, of the /home partition on the machine. The timing was just about exact. The weird thing, though, is that /home holds less data than /var, which was backed up successfully both times:

$ df -hl
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/sd0a      509M   42.4M    442M     9%    /
/dev/sd0g      106G   11.4G   89.1G    11%    /home
/dev/sd0d      3.9G    6.0K    3.7G     0%    /tmp
/dev/sd0f     15.7G    2.4G   12.5G    16%    /usr
/dev/sd0e     15.7G   13.6G    1.4G    91%    /var

The bacula file daemon logged this on the firewall:

Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe
Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Error: bsock.c:306 Write error sending 36841 bytes to Storage daemon:backup.example.com:9103: ERR=Broken pipe

With the earlier outage it was 65536 bytes, but otherwise the same error.

Okay, so the firewall's working again...now what? I'm about to head off to LISA in three days, so I can't very well upgrade to the latest OpenBSD right now. I settled for:

Hopefully that'll keep things going 'til I get back.

4 comments. Tags: lisa, networking, openbsd.
There it was, gone
Fri Oct 30 12:41:27 PDT 2009

Following in Matt's footsteps, I ran into a serious problem just before heading to LISA.

Wednesday afternoon, I'm showing my (sort of) backup how to connect to the console server. Since we're already on the firewall, I get him to SSH to it from there, I show him how to connect to a serial port, and we move on.

About an hour later, I get paged about problems with the database server: SSH and SNMP aren't responding. I try to log in, and sure enough it hangs. I connect to its console and log in as root; it works instantly. Uhoh, I smell LDAP problems...only there's nothing in the logs, and id <uid> works fine. I flip to another terminal and try SSHing to another machine, and that doesn't work either. But already-existing sessions work fine until I try to run sudo or do ls -l. So yeah, that's LDAP.

I try connecting via openssl to the LDAP server (stick alias telnets='openssl s_client -connect' in your .bashrc today!) and get this:

CONNECTED(00000003)

...and that's all. Wha? I tried connecting to it from the other LDAP server and got the usual (certificate, certificate chain, cipher, driver's license, note from mom, etc). Now that's just weird.

After a long and fruitless hour trying to figure out if the LDAP server had suddenly decided that SSL was for suckers and chumps, I finally thought to run tcpdump on the client, the LDAP server and the firewall (which sits between the two). And there it was, plain as day:

Near as I can figure, this was the sequence of events:

This took me two hours to figure out, and another 90 minutes to fix; setting the link speed manually on the firewall just convinced the NIC/driver/kernel that there was no carrier there. In the end the combination that worked was telling the switch it was a gigabit port, but letting it negotiate duplexiciousnessity.

Gah. Just gah.

4 comments. Tags: jumboframes, lisa, networking, openbsd, warstory.
Conference Organization BoF at LISA
Fri Oct 30 13:11:52 PDT 2009

Hey, everyone -- I'm organizing a BoF at LISA this year on conference organization. For a couple of years, I've wanted to create a local conference on system administration here in Vancouver, but I've been unsure how to start. I figure what better place to brainstorm and seek advice than at LISA?

So if you have questions or knowledge to share on:

then drop on by the Dover C room on Thursday, November 5th, between 8:30 and 9:30pm. C'mon, you've gotta kill that hour before Matt's BoFs somehow...

1 comment. Tags: conferenceorganization, lisa.
Is Chicago, Is Not Chicago
Sat Oct 31 12:29:35 PDT 2009

Thanks to this conference's theme band, Soul Coughing!

Saskatoon is in the room
Pyongyang is in the room...
Is Chicago
Is not Chicago

"Is Chicago, Is Not Chicago" -- Soul Coughing

Midway through my flight to Baltimore and I'm in Chicago, listening to periodic announcements that the Threat Advisory Level is Orange. The wifi here isn't working for me (associates fine but no address by DHCP), so I'm sitting at my gate, with two hours 'til I leave, wondering if any of the people around me are going to LISA as well.

The airport here has this amazing tunnel that goes between two concourses. Again, it made me think I was in Logan's Run and it was only the thought of being arrested that kept me from running down the moving sidewalk, shouting "Carousel is a LIE!"

Chicago Airport Logan's Run Tunnel

Departure was entirely uneventful; I didn't even get pulled over for extra questions. One odd thing was that (like O'Hare) the customs section of YVR was quite warm, and each of the customs officers had identical clip-on fans placed above them. The cords curled down out of sight, and the reflection in the cubicle glass reminded me of spines; I kept thinking they were skeleton decorations for Hallowe'en.

Tags: lisa.
