Is Chicago, Is Not Chicago

31 Oct 2009

Thanks to this conference's theme band, Soul Coughing!

Saskatoon is in the room
Pyongyang is in the room...
Is Chicago
Is not Chicago

"Is Chicago, Is Not Chicago" -- Soul Coughing

Midway through my flight to Baltimore and I'm in Chicago, listening to periodic announcements that the Threat Advisory Level is Orange. The wifi here isn't working for me (associates fine but no address by DHCP), so I'm sitting at my gate, with two hours 'til I leave, wondering if any of the people around me are going to LISA as well.

The airport here has this amazing tunnel that goes between two concourses. Again, it made me think I was in Logan's Run and it was only the thought of being arrested that kept me from running down the moving sidewalk, shouting "Carousel is a LIE!"

Chicago Airport Logan's Run Tunnel

Departure was entirely uneventful; I didn't even get pulled over for extra questions. One odd thing was that (like O'Hare) the customs section of YVR was quite warm, and each of the customs officers had identical clip-on fans placed above them. The cords curled down out of sight, and the reflection in the cubicle glass reminded me of spines; I kept thinking they were skeleton decorations for Hallowe'en.

Conference Organization BoF at LISA

30 Oct 2009

Hey, everyone -- I'm organizing a BoF at LISA this year on conference organization. For a couple of years, I've wanted to create a local conference on system administration here in Vancouver, but I've been unsure how to start. I figure what better place to brainstorm and seek advice than at LISA?

So if you have questions or knowledge to share on:

Scheduling talks and getting speakers
Technical and organizational requirements
Finding volunteers and sponsors
Figuring out a budget (or "Just how far does this shoestring have to stretch?")

then drop on by the Dover C room on Thursday, November 5th, between 8:30 and 9:30pm. C'mon, you've gotta kill that hour before Matt's BoFs somehow...

There it was, gone

30 Oct 2009

Following in Matt's footsteps, I ran into a serious problem just before heading to LISA.

Wednesday afternoon, I'm showing my (sort of) backup how to connect to the console server. Since we're already on the firewall, I get him to SSH to it from there, I show him how to connect to a serial port, and we move on.

About an hour later, I get paged about problems with the database server: SSH and SNMP aren't responding. I try to log in, and sure enough it hangs. I connect to its console and log in as root; it works instantly. Uhoh, I smell LDAP problems...only there's nothing in the logs, and id <uid> works fine. I flip to another terminal and try SSHing to another machine, and that doesn't work either. But already-existing sessions work fine until I try to run sudo or do ls -l. So yeah, that's LDAP.

I try connecting via openssl to the LDAP server (stick alias telnets='openssl s_client -connect' in your .bashrc today!) and get this:

CONNECTED(00000003)

...and that's all. Wha? I tried connecting to it from the other LDAP server and got the usual (certificate, certificate chain, cipher, driver's license, note from mom, etc). Now that's just weird.

After a long and fruitless hour trying to figure out if the LDAP server had suddenly decided that SSL was for suckers and chumps, I finally thought to run tcpdump on the client, the LDAP server and the firewall (which sits between the two). And there it was, plain as day:

3-way handshake
client says "I speak SSL!"
server says "I speak SSL too! Here you go!"
but the client never sees that packet
and neither does the firewall.

Near as I can figure, this was the sequence of events:

We SSH'd from the firewall, with its two bridged Intel GigE jumbo-enabled NICs
to the console server, which only does 10/100
which somehow prompted a renegotiation of the link speed on the firewall's interface
which settled on 100 MBit, full duplex, but with jumbo frames
which the switch saw as completely bogus
which prompted the switch to (silently, natch) drop all jumbo frames directed at the firewall's outside interface
which, in the context of an LDAP lookup done by a client inside the firewall, meant that the first packet that failed was the "I speak SSL too! Here you go!" packet
which left the client with an established TCP connection to the LDAP server, waiting for a certificate
which meant that it never actually failed over to the other LDAP server.
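
That last step is the sneaky one: from the client's side the TCP connect succeeded, so nothing ever told it to give up and try the other server. Here's a rough Python sketch of a probe that tells "can't connect" apart from "connected, but the handshake never finishes". The name probe_tls and the result strings are made up for illustration, and certificate verification is deliberately switched off because the only question being asked is whether the handshake completes at all.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Classify a TLS endpoint as 'tcp_failed', 'tls_stalled', 'tls_failed' or 'tls_ok'."""
    try:
        # The timeout set here also applies to the handshake below.
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "tcp_failed"

    # Assumption for this sketch: we only care whether the handshake
    # completes, so certificate checks are disabled.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    try:
        with ctx.wrap_socket(raw, server_hostname=host):
            return "tls_ok"
    except socket.timeout:
        # TCP is up but the server's reply never arrived: the case above.
        return "tls_stalled"
    except (ssl.SSLError, OSError):
        return "tls_failed"
    finally:
        raw.close()
```

Something like probe_tls("ldap.example.com", 636) run from a monitoring box would have flagged "tls_stalled" here, where a plain port check would have stayed green.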

This took me two hours to figure out, and another 90 minutes to fix; setting the link speed manually on the firewall just convinced the NIC/driver/kernel that there was no carrier there. In the end the combination that worked was telling the switch it was a gigabit port, but letting it negotiate duplexiciousnessity.

Gah. Just gah.

Where'd that bridge go? Redux

28 Oct 2009

So this morning, again, I got paged about machines in our server room dropping off the network. And again, it was the bridge that was the problem. This time, though, I think I've figured out what the problem is.

The firewall has two interfaces, em0 (on the outside) and em1 (on the inside), which are bridged. em1 has an IP address. I was able to SSH to the machine from the outside and poke around a bit. I still didn't find anything in the logs, but I did notice this (edited for brevity):

$ ifconfig
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 9000

    lladdr 00:15:17:ab:cd:ef
    media: Ethernet autoselect (1000baseT full-duplex)
    status: active
    inet6 fe80::215:17ff:feab:cdef%em0 prefixlen 64 scopeid 0x1

em1: flags=8d43<UP,BROADCAST,RUNNING,PROMISC,OACTIVE,SIMPLEX,MULTICAST> mtu 9000

    lladdr 00:15:17:ab:cd:ee
    groups: egress
    media: Ethernet autoselect (1000baseT full-duplex)
    status: active
    inet 10.0.0.1 netmask 0xffffff80 broadcast 10.0.0.1
    inet6 fe80::215:17ff:feab:cdee%em1 prefixlen 64 scopeid 0x2

See that? em1 has OACTIVE set. A quick search turned up some interesting hits, so for fun I tried resetting the interface:

$ sudo ifconfig em1 down
$ sudo ifconfig em1 up

and huzzah! it worked.

When I got to work I did some more digging and figured out that this and the earlier outage were almost certainly caused by running a full backup, via Bacula, of the /home partition on the machine. The timing was just about exact. The weird thing, though, is that /home holds less data than /var, which was backed up successfully both times:

$ df -hl
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/sd0a      509M   42.4M    442M     9%    /
/dev/sd0g      106G   11.4G   89.1G    11%    /home
/dev/sd0d      3.9G    6.0K    3.7G     0%    /tmp
/dev/sd0f     15.7G    2.4G   12.5G    16%    /usr
/dev/sd0e     15.7G   13.6G    1.4G    91%    /var

The bacula file daemon logged this on the firewall:

Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe
Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Error: bsock.c:306 Write error sending 36841 bytes to Storage daemon:backup.example.com:9103: ERR=Broken pipe

With the earlier outage it was 65536 bytes, but otherwise the same error.

Okay, so the firewall's working again...now what? I'm about to head off to LISA in three days, so I can't very well upgrade to the latest OpenBSD right now. I settled for:

turning off full backups on the firewall (everything important is kept in Subversion anyhow), and
running a script from cron every 10 minutes that checks for the OACTIVE flag and, if found, resets the interface.

Hopefully that'll keep things going 'til I get back.
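
For the curious, the cron check boils down to something like the Python sketch below. This is not the script I actually run, just the idea: the function names and the default list of watched interfaces are invented for illustration, and it assumes ifconfig prints a flags=...<...> line per interface like the output above.

```python
import re
import subprocess

# Matches lines such as:
# em1: flags=8d43<UP,BROADCAST,RUNNING,PROMISC,OACTIVE,SIMPLEX,MULTICAST> mtu 9000
FLAGS_RE = re.compile(r"^(\S+): flags=[0-9a-f]+<([^>]*)>")


def interface_flags(ifconfig_output):
    """Map interface name -> set of flag names found in ifconfig output."""
    flags = {}
    for line in ifconfig_output.splitlines():
        match = FLAGS_RE.match(line)
        if match:
            flags[match.group(1)] = set(match.group(2).split(","))
    return flags


def stuck_interfaces(ifconfig_output, watched=("em0", "em1")):
    """Return the watched interfaces that currently have OACTIVE set."""
    flags = interface_flags(ifconfig_output)
    return [name for name in watched if "OACTIVE" in flags.get(name, set())]


def reset_interface(name):
    """Bounce the interface, same as running ifconfig down/up by hand."""
    subprocess.check_call(["ifconfig", name, "down"])
    subprocess.check_call(["ifconfig", name, "up"])


def main():
    """Entry point for a cron job (needs root to reset interfaces)."""
    output = subprocess.check_output(["ifconfig"], text=True)
    for name in stuck_interfaces(output):
        reset_interface(name)
```

Keeping the parsing separate from the reset means the flag check can be tried against saved ifconfig output without touching a live interface.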

Foswiki-to-PDF Makefile

23 Oct 2009

At $WORK I've just switched to using Foswiki (formerly TWiki) for documentation. I miss editing files directly from Emacs like you can with Confluence, but I'll get over it. The main reason I like Foswiki is that, at heart, the source files are plain text, and are available as plain text -- no need to trawl through a database.

Another nice feature that Confluence has is the ability to export a space (Foswiki calls it a web) directly as PDF. A bit of scripting takes care of that, but since this is the second time I've lashed together a Makefile to generate a PDF from Foswiki, I figure it's time to post it.

You can find the Makefile here. It uses lynx, wget and htmldoc; of those, I suppose only htmldoc is hard to replace. There's one important assumption built into the Makefile, though: that every page is linked to from the front page of the web, which is how I organize my pages.
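
To make that assumption concrete: the page list comes from the links on the web's front page, and nothing else. The Makefile does this with lynx; the Python sketch below shows the same idea and is not what the Makefile contains. The function name and the URL layout in the usage note are hypothetical, so adjust them for your own install.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def pages_linked_from(front_html, front_url, web_prefix):
    """Return absolute URLs of pages in the same web, in first-seen order."""
    collector = LinkCollector()
    collector.feed(front_html)
    front_host = urlparse(front_url).netloc
    pages = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(front_url, href).split("#")[0]
        parts = urlparse(absolute)
        if parts.netloc != front_host or not parts.path.startswith(web_prefix):
            continue  # a different site, or a different web
        if absolute != front_url and absolute not in pages:
            pages.append(absolute)
    return pages
```

With a front page at http://wiki.example.com/bin/view/Docs/WebHome and a prefix of /bin/view/Docs/, the result is the list of pages to fetch and feed to htmldoc. Any page not linked from the front page simply won't make it into the PDF.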

Share and enjoy!

Update: Just showed the boss the printed version, and he was very impressed. Yay me! :-)

Grabbing Confluence markup from an XML export

15 Oct 2009

As part of a slow migration from Confluence to Foswiki, I had to grab the Confluence markup from an XML dump. I found a Python script to do this, but I think the XML format must have changed in the meantime; the script was unable to grab the body content.

I've lashed together a version available here that works with Confluence 2.10 and Python 2.5. Now to convert the pages to Foswiki markup...
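
The general shape of the job, as a sketch in current Python rather than the 2.5 script linked above: walk the export, match each body to its page, and keep the markup. The element and attribute names used here (object, property, class="Page", class="BodyContent", name="body") are assumptions about the export layout for illustration only, so check them against your own dump before trusting any of it.

```python
import xml.etree.ElementTree as ET


def markup_by_title(xml_text):
    """Return {page title: wiki markup} from an export shaped as assumed above."""
    root = ET.fromstring(xml_text)
    titles = {}  # page id -> title
    bodies = {}  # page id -> markup
    for obj in root.iter("object"):
        props = {p.get("name"): p for p in obj.findall("property")}
        if obj.get("class") == "Page" and "title" in props:
            page_id = obj.findtext("id", default="").strip()
            titles[page_id] = (props["title"].text or "").strip()
        elif obj.get("class") == "BodyContent" and "body" in props and "content" in props:
            # Assumed layout: the "content" property holds the owning page's id.
            page_id = props["content"].findtext("id", default="").strip()
            bodies[page_id] = props["body"].text or ""
    # Keep only bodies whose page we also saw.
    return {titles[i]: body for i, body in bodies.items() if i in titles}
```

If the body content comes back empty, the property names are the first thing to check.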

Ack! Just discovered this updated script in the comments section. Looks like that one grabs a lot more than this does (labels, attachments). Oh well, I needed the practice with Python.

Problems installing NUT on Solaris 10

15 Oct 2009

I ran into a couple problems compiling NUT on Solaris 10 today. They were pretty much due to bad setup on my part, but they did take a while to track down. For the record:

libtool: link: only absolute run-paths are allowed: This turned out to be an obscure way of saying "You don't have libsnmp installed". Solution: configure --without-snmp.
false cru: The full error was:

    libtool: link: false cru .libs/libparseconf.a .libs/parseconf.o
    gmake[1]: *** [libparseconf.la] Error 1

This turned out to be a consequence of not having /usr/ccs/bin in my $PATH.
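
My understanding is that "false cru" appears because the build substitutes false for an archiver it couldn't find, so the real problem is a missing ar. A quick pre-flight check would have saved me the head-scratching. Here's a minimal Python sketch; the function name and the choice of which tools to check are mine, not anything NUT ships.

```python
import shutil


def missing_tools(tools, search_path=None):
    """Return the tools that can't be found on search_path (default: $PATH)."""
    return [tool for tool in tools if shutil.which(tool, path=search_path) is None]
```

Running missing_tools(["ar", "gmake"]) before configure would, presumably, have returned ["ar"] on my box until /usr/ccs/bin was added to $PATH.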

Wrong, wrong, wrong

09 Oct 2009

I'm not sure exactly where I saw that DRAC6 Express does not do console redirection -- it was on a mailing list somewhere -- but that turns out to be just wrong:

the command has been renamed
and if you RTFM you'll find the right settings for the BIOS and grub.

(For the record, it was the "External Serial Connector" in BIOS that got me; it should be "serial device 1", not "Remote Access Device".)

I can now SSH to the DRAC and get a console just fine. I wish to apologize to Dell, the people of Monaco and the constellation Sagittarius.

Instructions for yak shaving

05 Oct 2009

Install logwatch on Solaris fileserver.
Notice that logwatch emails are not coming in.
Log in and run logwatch by hand.
Inspect mail log and notice lack of any entries.
Notice that Postfix is in maintenance mode; start it up.
Notice continued lack of emails.
Notice that Postfix is running, which confused svcadm when told to start up Postfix. It fails to do so and fails to log this.
killall postfix, svcadm enable postfix.
man svcadm; svcadm clear postfix; svcadm enable postfix.
Run logwatch by hand; notice emailed report to "root@localhost.localdomain", which gets bounced by Postfix on the mail server because it's a non-existent host.
Resist temptation to go down that rabbit hole just now, and stick to the problem at hand.
Edit /opt/csw/etc/log.d/logwatch.conf and set MailTo to proper address.
Re-run logwatch and note that reports are still going to root@localhost.
After much swearing, notice that actually, logwatch is set to look in /opt/csw/etc/log.d/conf/logwatch.conf for configuration.
Edit that file, re-run logwatch.
Notice errors from Postfix: "postdrop[13848]: [ID 947731 mail.warning] warning: mailqueueenter: create file maildrop/908447.13848: Permission denied".
Run "postfix set-permissions". Test mail; still failing.
Check permissions on another system and set by hand.
Re-run logwatch. Still no email. Re-run with debug=high and get email.
Wonder idly about futility of self-aware log watching system that can't report on its own heisenbug-induced failure, crappy packaging practices, inability to check end-to-end email connectivity, other career options.
(Update) Realize that the emails show up if "Detail" is set to Medium or High; Low, the default, makes the report silent.
(Update) Uninstall the package and reinstall, only to find that the symlink to conf/logwatch.conf is set up at installation, and that this is probably a case of $EDITOR breaking the symlink. Apply head to desk.
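
That last discovery is cheap to check for. Below is a small Python sketch that reports whether a config symlink is still a symlink and still points where the package put it. The function name and the status strings are invented for this example.

```python
import os


def config_link_status(link_path, target_path):
    """Describe the state of a config symlink relative to its expected target."""
    if not os.path.lexists(link_path):
        return "missing"
    if not os.path.islink(link_path):
        # What an editor leaves behind when it writes a new file in place.
        return "replaced by a regular file"
    if not os.path.exists(link_path):
        return "dangling symlink"
    if os.path.exists(target_path) and os.path.samefile(link_path, target_path):
        return "ok"
    return "points elsewhere"
```

In my case config_link_status("/opt/csw/etc/log.d/logwatch.conf", "/opt/csw/etc/log.d/conf/logwatch.conf") would presumably have said "replaced by a regular file", which explains why my edits to the first path were ignored.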

Where'd that bridge go?

05 Oct 2009

Yesterday I got paged by one of my two Nagios boxes (learned that trick the hard way): a bunch of the machines in our server room had dropped off the network. Weirdly, this did not include the other Nagios box that's over there. WTF?

I logged into the server room's Nagios box, and sure enough couldn't ping the servers or the firewall. I could ping the console server...which was also on the Outside VLAN along with the monitoring box, as opposed to the Inside VLAN with the servers, which sat behind our firewall.

I was also able to ping the management cards/ILOMs/SPs/whatever the kids are calling them in the servers. Thankfully they're Sun boxes, so no Vista-like maze of flavours there...they all come with console redirection. I logged in and fired up a console, panicking because I thought that perhaps the newly-installed NUT clients had shut down the machines because I'd overlooked something.

But no...the machines were up, though hung if you tried to do any LDAP lookups. (Through an oversight, the LDAP server was also on the Outside VLAN. I'll be fixing that today.) Modulo that, they seemed fine.

So I logged into the firewall, which runs OpenBSD 4.3 in bridging mode. And this is where the weirdness lay: the bridge, and/or its component cards, was not working. ifconfig and brconfig said they were up and fine, and the ARP table was still populated (not sure what the lifetime of entries is -- isn't it around 20 minutes or so? must check -- but by this time the problem had been going on for about an hour). Yet I couldn't ping the firewall (one of those cards has an address) from either side, and I couldn't ping anything from the firewall.

pfctl -s all didn't show anything suspicious. There were no obvious problems in dmesg or /var/log/messages. I disabled, then re-enabled, the firewall to no effect. I ran /etc/netstart to no effect.

I even checked on the switches to see if the firewall's MAC address was showing up anywhere, and it was not -- not even directly after pinging it (and getting no response).

In the end I rebooted the machine and all was well.

The NIC in question is a dual-port Intel Pro 1000 (MT, I believe) that I've never had problems with. I've never come across problems like this before on OpenBSD (or, I think, anywhere else). The onboard Broadcom (boo, hiss) was acting fine...it was also on the ILOM's VLAN, and could see the other ILOMs just fine. (In fact, I should have just SSHd to the firewall using that VLAN from the Nagios box, rather than futz around with a 9600 bps console. Next time.)

So...that's my mystery for the weekend.

In other news, my older son (3.25 yrs) has taken to the stage in a big way: he now stands on top of the steps going up from our living room and sings us songs into one of at least two microphones. "Barbara Ann", anything by The Wiggles, and "Yo Gabba Gabba!" songs are prominent. This is after at least three solid weeks of guitar playing, where anything and everything gets strummed while being cradled in his arms while he sings, or maybe makes feedback sounds that'd make Yo La Tengo proud.

Meanwhile, my younger (1.5 yrs) has started saying lots of different phonemes, which is a real contrast to using "Dat!" for monkey, cereal, ball, yes, no, President Barack Obama's attempted health care reforms, and Linux. He has also begun sleeping in 'til 6:30 or 7:00 in the morning, which lets me write things like this. Both are infinitely endearing.

And incidentally, I really need to set up Nagios dependencies. I've had to ACK 27 services in a row (unrelated (I think) problem with ILOM temperature taking means SNMP checks are timing out). Either that or there's some way that you can select n services in Nagios to ack all at once. Anyone?

Eject, then reboot

03 Oct 2009

Ran into a little problem this week when I tried to do a restore from a backup at work. Bacula loaded the tape, then said it couldn't read the label. Wha?

After much investigation, during which I completely neglected to cut-n-paste the error messages, I think I've figured out what happened:

I upgraded the license key for our storage library;
I rebooted the library, 'cos that's what you gotta do;
but the tape was still in there, say halfway through after the last batch of backups;
so the drive rewound the tape after being power-cycled;
and Bacula didn't know this;
so it wrote the next backups that night at the beginning of the tape, not realizing this would be a Bad Thing(tm).

Ack. Needless to say, this was not good. Fortunately, the file in question was not a terribly important one; unfortunately, that's about the last 2 weeks of incrementals gone. Lesson learned: don't assume your backup program knows what's going on when hardware reboots from under it.

In other news: on Thursday I got 5 new Dell servers. Woot! One of 'em will be our new LDAP/web/email/FTP server (Xen ftw!); the rest are going to be running protein search engines for various researchers across BC. They're racked and I'm stoked, except that it turns out the difference between the DRAC6 Express and Enterprise, besides a few hundred dollars, is that the Enterprise does console redirection and the Express doesn't. Dammit.

I'm going to see if there's any trickery that can be done, but I'm not holding out hope. I have got a 32-port console server, but it's two racks away...might have to run a small batch o' cables up and over to make this work.

LISA updates

30 Sep 2009

I've come across a few LISA items today, and it's only 9am...

Matt Simmons is going, and got one of the blogger gigs too.
The BoFs are starting to fill up: Matt's got one for bloggers and another for small infrastructure, there's one for lightning talks, and one for uninvited talks.
OpenDNS is hosting a happy hour at a nice-looking pub, which alleges it was actually "designed and built in Ireland and shipped over in the fall of 2002, where it was then fitted on site." Huh.

Man, I'm looking forward to this.

Server cracked, restored

28 Sep 2009

"I say we take off and nuke the entire site from orbit. It's the only way to be sure."

Saturday afternoon my home web server got cracked. I found out because Google started refusing my searches, asking me to fill out a CAPTCHA form (incidentally, I hate the word CAPTCHA, and even typing it gives me hives) to prove I was human. What the hell?

So I checked on the server, which is also our firewall, which isn't good but frankly I was tired of maintaining a complex network at home, and sure enough there was some perl script running as user www-data (which Debian uses to run the webserver), sending off tons of Google queries and taking commands on IRC the way I keep hearing nobody does anymore. Crap.

Fortunately I've been running Bacula for a while now, backing up to an external hard drive, and so I figured that even though it probably would go away when I rebooted, I'd Do The Right Thing(tm) and rebuild from scratch.

This had to wait 'til the evening, so I shut down the webserver, ran backups a bunch more times, got more info, and moved the machine (a tiny li'l Shuttle box) from my youngest son's bedroom (apparently the only room in the house w/a phone outlet not covered by an ADSL filter) to our bedroom upstairs, running the network cable up the stairs.

In the end, it all went pretty smoothly. I was able to get all my packages back and restore from backup; the only thing I messed up was getting the ownership wrong on my restored crontab. (Debian uses a pool of UIDs for daemons, so you're not guaranteed to get the same UIDs if you reinstall.)

As a bandaid, I've firewalled off www-data from initiating connections out. I should have done this long before. Now I'm starting to think about the next step -- Xen, maybe, or SELinux. (I did briefly consider other distros, or even a BSD: CentOS for SELinux, FreeBSD for pf and jails. But I decided that one problem at a time was quite enough, thanks.)

What to ask when taking over external servers?

21 Sep 2009

At $WORK, I'm going to be taking over the administration of four servers that currently do stuff for a variety of researchers scattered around the province. There are a number of players here:

My department, which contains:
- Me, the guy whose services are being promised, and
- The researcher who's arranging all this (my local contact)
The agency that owns them, who I don't think has any technical staff
The agency that currently hosts and administers the servers

The owning agency has also ponied up for an upgrade to the four servers; I'll be taking delivery some time next week.

I've got some preliminary information -- what the servers do, how the users use the thing, etc -- but I'm preparing a more detailed plan. In the meantime, I've compiled a list of questions for my local contact.

In the middle of that, it occurred to me that this would be a good discussion topic. Have I missed anything? Let me know!

Will the old servers be moved over, or will the new ones replace them?
What's the primary means of talking to users? (Mailing list, status page)
Where's the list of those users? (one of the above, spreadsheet)
What info do users/owners expect from us? How? (Mailing list, status page; 2 weeks notice of downtime, monthly stats by CPU)
- Are any funding decisions influenced by this information?
Where is the info for the software?
- media
- license #, what we have licenses for (unlimited use, # cores, etc)
- support #, what it covers
Can I see a demo of the software?
Do any of the labs have shell access? What do they do with it?
What exactly is involved in maintenance? Where is this documented?
What DNS changes will be made? Who makes them?
Who makes policy/purchase decisions about these servers? How do I contact them?

Can I send email spam from your servers?

21 Sep 2009

Depressing.

I'm going to LISA '09!

16 Sep 2009

Just got the approval from the boss...LISA, here I come! w00t!

My submission to Canada's Consultation on Copyright

11 Sep 2009

In the spirit of Michael Geist, here's my submission on copyright reform. Originally I intended to write about how this affects me as a sysadmin, but then the stuff about my kids just came out...

Bad Time Equals LDAP Failure

09 Sep 2009

Just ran into an interesting problem: after replacing memory on a server, CentOS booting hung at "Starting system message bus..."

So what does dbus have to do with anything? This turned out to be an LDAP failure; dbus was trying to run as UID root, and since the LDAP server couldn't be contacted it hung. Why couldn't the LDAP server be contacted? The LDAP server logs only showed this:

[09/Sep/2009:12:04:32 -0700] conn=41492 op=-1 fd=112 closed - SSL
peer cannot verify your certificate.

The CA cert I use was in place, and another machine had just rebooted w/o problems (all this is taken care of with cfengine, so they were identical in this respect). I could connect to the LDAP server on the right port without any problems.

I finally figured out what was going on when I ran:

openssl s_client -connect ldap.example.com:636 -CApath /path/to/cacert_directory

and saw:

Verify return code: 9 (certificate is not yet valid)

date said it was December 31, 2001. What the what now? ntpdate to set things correctly, then I got:

Verify return code: 0 (ok)

I figure the CMOS clock (or whatever the kids are calling it these days) got reset when we had to remove the CPU daughtercard to get at the memory underneath.
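
The check that tripped is simple enough to write down: is "now" inside the certificate's validity window? Here's a minimal Python sketch, with a function name and example dates of my own invention; the two messages copy the wording openssl uses.

```python
import ssl
import time


def validity_problem(not_before, not_after, now=None):
    """Return a description of the problem, or None if the clock is in range.

    not_before/not_after use the certificate date format,
    e.g. "Jan 15 00:00:00 2009 GMT"; now is seconds since the epoch (UTC).
    """
    now = time.time() if now is None else now
    if now < ssl.cert_time_to_seconds(not_before):
        return "certificate is not yet valid"
    if now > ssl.cert_time_to_seconds(not_after):
        return "certificate has expired"
    return None
```

Feed it a clock stuck in December 2001 and a certificate issued in 2009, and you get the "not yet valid" answer. So when that message shows up on a box that's just been opened up, check date before you go blaming the certificate.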

And now you know...the rest of the story.

OpenBSD needs help

09 Sep 2009

I just saw on Undeadly.org that orders for OpenBSD CDs are 'way down this year. Without OpenSSH and pf, I wouldn't be able to do my job nearly as well as I do. I've ordered a set for work (good excuse to upgrade that firewall), and ordered a set for home and tossed 'em $50 as well. I encourage you to do the same.

In the words of the original rant:

Do you use OpenBSD for fun? Contribute. Do you use OpenBSD for work? Contribute. Does OpenBSD allow you to worry about the problem you are trying to solve rather than the tools? Contribute. Do you wish your employer used the OpenBSD quality standard in your work? Contribute. Does your employer use OpenBSD? Ask them to contribute (after you do, of course). Do you bundle OpenBSD or subprojects like OpenSSH into your product? Contribute big! (you won't, you rarely do, but hey, I'll ask anyway) Do you find yourself wondering why so few take computer software quality seriously? Contribute!

Start of school

08 Sep 2009

It's the start of school here at $UNIVERSITY, and for some reason I find myself noticing it more than last year. Then and now, my job has been one that is not flooded in September with new students (unlike a lot of my friends and coworkers), but rather it's more like a steady trickle. Grad students show up days or even weeks late; new faculty come in when they're good and ready; no one really has a firm idea when someone's showing up, but everyone's confident they'll be here Real Soon Now.

As a result, the biggest effect this usually has on me is the press of humanity in the bus and SkyTrain. My commute is a long one -- bus, SkyTrain, then another bus -- and it takes between 90 and 100 minutes, door to door. I get a lot of reading done, or I listen to podcasts, or if the Lithium Ion Gods are with me I fiddle with Emacs. This happens no matter what, but in September you've got all the people learning how the bus works, how far in advance they need to show up, and so on. The buses and SkyTrains are crowded because everyone's afraid it's the last one, or they'll be late for class, or everyone else is getting on so they must know something I don't.

And then it calms down. Some get tired of the bus and drive. Most figure out how late they can sleep in. (I vaguely remember that, in the same way that I vaguely remember kindergarten.) Things thin out. Before you know it it's December and it gets really empty. Winter brings humidity and rain, wet smells and drips on book pages.

And then it's spring, and then summer, and things get positively luxurious. There's room to stretch out, room for laptops, and lots to see. The kids' birthdays come around.

And then...September again.