Fridays...don't talk to me about Fridays

Last Friday was not a good day. First, I try installing a couple of hard drives I bought for a Dell MD3200 disk array, and it rejects them; turns out the array will not work with drives that are not pre-approved. It's right there in the documentation. I was aware of last year's kerfuffle with Dell announcing that their servers would no longer work w/unapproved drives, and their subsequent backdown on that...but the disk arrays are different. So now I have a couple of extra 3 TB SATA drives to find a place for, and a couple of drives to buy from Dell.

While I'm in the server room staring at the blinking error light on the disk array and wondering if I'd just brought down the server it was attached to, I notice another blinking error light. This one is on the server that hosts the Xen VMs that run LDAP, monitoring, and a web server. It has a failing drive. Good thing it's in RAID 6, right? Sure, but the drive failed nearly a month ago -- I had not set up email alerts on the server's ILOM, so I never got notified about this. Fuck.

I send off an email to order a drive, then figure out how to get alerted about this sort of thing. Email alerts get configured, but belt and suspenders: I grab the CLI tool for the RAID card, find a Nagios plugin that runs it, and add the check to Nagios, with the check itself running on the server's dom0. Hurrah, it alerts me! I ack the alert, and now it's time to head home.
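
The plugin is basically a thin wrapper around the vendor's CLI. Here's the shape of the thing, sketched in Python with a made-up "raidctl" standing in for the real tool (the plugin I actually used is someone else's work, and not this): run the tool, grep its output, exit with the standard Nagios codes.

    #!/usr/bin/env python3
    # check_raid.py -- a hypothetical sketch, not the actual plugin.
    # "raidctl" stands in for whatever CLI the RAID vendor ships; exit codes
    # are the standard Nagios ones: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def main():
        try:
            result = subprocess.run(["raidctl", "status"],   # made-up vendor CLI
                                    capture_output=True, text=True)
        except FileNotFoundError:
            print("RAID UNKNOWN - raidctl not installed")
            return UNKNOWN

        output = result.stdout + result.stderr
        if result.returncode != 0 or "Failed" in output or "Degraded" in output:
            print("RAID CRITICAL - array degraded or drive failed")
            return CRITICAL
        print("RAID OK - all drives optimal")
        return OK

    if __name__ == "__main__":
        sys.exit(main())

Note the conspicuous absence of any timeout on that subprocess call -- which is going to matter in a few minutes.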

On my way home I start getting pages about the VMs on this machine -- nothing down, but lots of timeouts. The machine recovers, then stumbles and stays down. (These alerts were coming from a second instance of Nagios I have set up, which is mostly there to monitor the main instance that runs on this server.) My commute is 90 minutes, and I have no net access along the way. When I finally get home, I SSH to work and find that the machine is hung; as far as I can tell, the CLI tool was just not exiting, and after enough copies of it accumulated, the RAID card just stopped responding entirely. I reboot the machine, and ten minutes later we're back up.
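
The first fix I'll be making: never trust the vendor CLI to exit. A sketch of the change, using the same made-up raidctl as above -- put a hard timeout on the subprocess and bail out with UNKNOWN when it blows, so I get paged about a flaky check instead of the dom0 quietly filling up with wedged processes.

    # Hypothetical tweak to the sketch above: cap how long the vendor CLI gets.
    import subprocess
    import sys

    UNKNOWN = 3

    try:
        result = subprocess.run(["raidctl", "status"],   # same made-up CLI
                                capture_output=True, text=True,
                                timeout=30)              # hard cap on the child
    except subprocess.TimeoutExpired:
        print("RAID UNKNOWN - raidctl hung for 30s; go poke the controller")
        sys.exit(UNKNOWN)

(If the tool is stuck in uninterruptible sleep, even the kill that follows the timeout won't reap it, but at least the page changes from "machine down" to "check broken".)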

Ten minutes after that, I realize I'm still in trouble: I'm getting pages about a few other machines that are not responding. Remember how one of the VMs on the original server ran LDAP? It's one of three LDAP servers I have, because I fucking hate it when LDAP goes down. The clients are configured to fail over if their preferred server (the VM) isn't responding. I check on one of the machines, and nscd has about a thousand open sockets...which makes me think that the sequence was something like this:

I'm thinking about putting in a check for the number of open FDs nscd has, but I'm starting to second-guess myself; it feels a bit circular somehow. Not the right word, but I'm tired and can't think of a better one.
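
If I do put the check in, it would probably be something like this -- a sketch only, with made-up thresholds: count the entries in /proc/<pid>/fd for nscd and complain when there are too many. (It has to run as root, or as whatever user nscd runs as, to read that directory.)

    #!/usr/bin/env python3
    # check_nscd_fds.py -- hypothetical sketch; thresholds are placeholders.
    import os
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
    WARN_AT, CRIT_AT = 500, 900

    def main():
        try:
            pid = subprocess.run(["pidof", "nscd"], capture_output=True,
                                 text=True, check=True).stdout.split()[0]
        except (subprocess.CalledProcessError, FileNotFoundError, IndexError):
            print("NSCD UNKNOWN - nscd not running?")
            return UNKNOWN

        try:
            nfds = len(os.listdir(f"/proc/{pid}/fd"))   # needs root or nscd's user
        except PermissionError:
            print(f"NSCD UNKNOWN - cannot read /proc/{pid}/fd")
            return UNKNOWN

        if nfds >= CRIT_AT:
            print(f"NSCD CRITICAL - {nfds} open FDs")
            return CRITICAL
        if nfds >= WARN_AT:
            print(f"NSCD WARNING - {nfds} open FDs")
            return WARNING
        print(f"NSCD OK - {nfds} open FDs")
        return OK

    if __name__ == "__main__":
        sys.exit(main())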

Gah.