I am too old for this
15 Apr 2011

About 10.45 yesterday I noticed that my SSH connections to the servers in our server room had stopped responding, and I was unable to make any new ones. I checked the Nagios instance on the machine by my desk (multiple Nagios FTW!) and found that it had noticed problems a few minutes before. I ran over to see what was going on.
After a few minutes of checking, I'd found:
- our firewall machine seemed to be up just fine
- but I couldn't ping anything: I couldn't get a DHCP lease, and a manually configured interface didn't work either
I called IT Services and asked if they were having problems; they said no, so I double-checked. Suddenly I could ping the firewall and the other machines, but SSHing to them hung.
My guess at this point was LDAP problems. I connected a monitor and keyboard to the machine hosting the LDAP server and found it responsive, but with a CRAPTON of errors on eth1 (which Xen handily renames/clones as peth1):
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:9000  Metric:1
          RX packets:1895496748 errors:1500778269 dropped:1505784776 overruns:0 frame:1500778269
          TX packets:340023247 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000
          RX bytes:186743473052 (173.9 GiB)  TX bytes:384744601794 (358.3 GiB)
          Interrupt:18 Memory:ec000000-ec012800
I didn't know what to make of this, so I replaced the cable to peth1 and ran ifconfig down/up on the interface -- and got the connection back, at which point LDAP came back up, the machines started working again, and so on.
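For the record, the bounce was nothing fancier than this -- a sketch, with the interface name taken from above; the watch afterwards is just one way to confirm the RX error counters have stopped climbing after the cable swap:

# Bounce the suspect interface after swapping the cable:
ifconfig peth1 down
ifconfig peth1 up

# Then keep an eye on the RX errors/dropped counters -- they should stop climbing:
watch -n 5 "ifconfig peth1 | grep RX"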
Okay, weird -- but at least it's working again. I went back to my desk to try and figure out what had happened. While I was doing that, I started losing connectivity to the machines in the server room for 30 seconds at a time. What the hell?
After that, frankly, it's a blur. I was there 'til 7.45pm and here's what I think was going on.
First, the Xen host was having big memory problems that affected its networking, and the networking of the VMs within it. I was seeing a crapton of these messages:
Apr 14 15:12:51 kernel: xen_net: Memory squeeze in netback driver.
This bug said it was fixed in CentOS 5.6, so I tried upgrading to that (I was at 5.5, so not a big jump). Nope. Then I saw a suggestion that the problem was in memory ballooning -- that the dom0 was sucking up all the memory for some reason. The suggested fix was to add a "dom0_mem=" argument to the Xen entry in Grub to cap the dom0's memory.
My first attempt at that caused the machine to panic and reboot -- but because console output only went to the serial port, and because the IPMI console wasn't working, I couldn't see why. I had to edit the Grub entries on the fly to remove those arguments from the kernel line, see what was going on, and then set the memory limit correctly.
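For reference, the end state looks roughly like this on a CentOS 5 Xen box: the dom0_mem argument goes on the hypervisor (xen.gz) line in grub.conf, and xend-config.sxp should agree with it so xend doesn't balloon dom0 back down. The kernel versions and the 1024M figure below are illustrative, not the values I actually used:

# /boot/grub/grub.conf (illustrative versions and values)
title CentOS (2.6.18-238.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-238.el5 dom0_mem=1024M
        module /vmlinuz-2.6.18-238.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-238.el5xen.img

# /etc/xen/xend-config.sxp -- keep xend from ballooning dom0 below the same figure:
(dom0-min-mem 1024)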
After rebooting with a working memory limit, top showed ksoftirqd/0 eating an enormous amount of CPU time -- 98% of one core -- and pretty much all of it was down to eth0 interrupts. tcpdump showed a lot of traffic from the management subnet, which this machine shouldn't have been seeing at all. I checked the switch and saw that the management VLAN WAS tagged onto that port (the normal, inside VLAN was the default, untagged one). I removed the tagged VLAN from the port on the switch, rebooted the machine, and things pretty much went back to normal.
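If you're ever chasing the same thing: the interrupt load shows up plainly in /proc/interrupts, and tcpdump will tell you whose traffic it is. Something like this -- the subnet below is a made-up placeholder:

# Which IRQ is spinning? The counter for the busy NIC climbs much faster than the rest:
watch -n 1 'cat /proc/interrupts'

# What's actually arriving on the interface? Here, seeing the management subnet at all
# was the giveaway (10.10.10.0/24 is a placeholder):
tcpdump -nni eth0 -c 200 net 10.10.10.0/24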
All of that was doubly unfortunate because, of the four VMs on there, two are the only LDAP servers in that room -- the third LDAP server is on another network, but it took a long time for the clients to fail over to it. This is piss-poor planning on my part.
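The slow failover is at least partly a client configuration problem: with nss_ldap on CentOS 5, /etc/ldap.conf can list several servers and be told not to hang forever on a dead one. A sketch of what I should have had in place -- hostnames and timeouts are placeholders:

# /etc/ldap.conf (nss_ldap) -- placeholder hostnames
uri ldap://ldap1.example.com ldap://ldap2.example.com ldap://ldap3.example.com
# Give up on an unreachable server quickly instead of blocking logins:
bind_timelimit 5
timelimit 10
# Don't retry a dead server forever:
bind_policy soft

(bind_policy soft trades hangs for failed lookups when every server is down, so it's worth testing before relying on it.)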
As if that wasn't enough, another server's disk array disappeared, which killed MySQL and took down the website running on it. It turned out the machine had been booted with the wrong multipath drivers: when one path had problems, the array came back as a different device (/dev/sde1 instead of /dev/sdd1). This took a while to figure out, but I finally got the machine rebooted and the drive array back.
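The lesson there is never to refer to a multipathed array by its /dev/sdX name: mount the device-mapper map (or the filesystem UUID) so a path flap can't rename it out from under MySQL. Roughly, with placeholder device and mount point names:

# Confirm multipathd actually has the paths grouped under one map:
multipath -ll

# /etc/fstab -- point at the stable device-mapper name (or a UUID), never /dev/sdd1 directly:
/dev/mapper/mpath0p1   /var/lib/mysql   ext3   defaults   0 0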
Now things were mostly back to normal -- except that the connection to the management VLAN seemed to be coming and going. Both Nagios ("foo-ilom up! foo-ilom down!") and good old-fashioned pings showed it: a given ILOM/SP would respond to pings for 30 seconds, then go down; five minutes later it'd come back for 30 seconds, then disappear again. For the nth time: what the hell?
Then I remembered that, back when all this had begun, we'd been configuring a new cluster. Working on its switches, in fact, which were from a different vendor than our usual (package deal, dontcha know). I began to suspect that the problem might somehow lie there. I removed the two patch cables connecting the new switches to our network...and at last the management VLAN connection came back up and stayed up.
In all I was in the server room 'til 7.45pm last night. Part of it was spent reinstalling CentOS on a separate machine in hopes of at least getting an LDAP server up on it. I didn't stick around for that, as the VMs came back up fine, but that's definitely on the agenda.