Where'd that bridge go?

Yesterday I got paged by one of my two Nagios boxes (learned that trick the hard way): a bunch of the machines in our server room had dropped off the network. Weirdly, this did not include the other Nagios box that's over there. WTF?

I logged into the server room's Nagios box, and sure enough couldn't ping the servers or the firewall. I could ping the console server...which was also on the Outside VLAN along with the monitoring box, as opposed to the Inside VLAN with the servers, which sat behind our firewall.

I was also able to ping the management cards/ILOMs/SPs/whatever the kids are calling them in the servers. Thankfully they're Sun boxes, so no Vista-like maze of flavours there...they all come with console redirection. I logged in and fired up a console, panicking because I thought that perhaps the newly-installed NUT clients had shut down the machines because I'd overlooked something.

But no...the machines were up, though hung if you tried to do any LDAP lookups. (Through an oversight, the LDAP server was also on the Outside VLAN. I'll be fixing that today.) Modulo that, they seemed fine.

So I logged into the firewall, which runs OpenBSD 4.3 in bridging mode. And this is where the weirdness lay: the bridge, and/or its component cards, was not working. ifconfig and brconfig said they were up and fine, and the ARP table was still populated (not sure what the lifetime of entries is -- isn't it around 20 minutes or so? must check -- but by this time the problem had been going on for about an hour). Yet I couldn't ping the firewall (one of those cards has an address) from either side, and I couldn't ping anything from the firewall.

pfctl -s all didn't show anything suspicious. There were no obvious problems in dmesg or /var/log/messages. I disabled, then re-enabled, the firewall to no effect. I ran /etc/netstart to no effect.

I even checked on the switches to see if the firewall's MAC address was showing up anywhere, and it was not -- not even directly after pinging it (and getting no response).

In the end I rebooted the machine and all was well.

The NIC in question is a dual-port Intel Pro 1000 (MT, I believe) that I've never had problems with. I've never come across problems like this before on OpenBSD (or, I think, anywhere else). The onboard Broadcom (boo, hiss) was acting fine...it was also on the ILOMs' VLAN, and could see the other ILOMs just fine. (In fact, I should have just SSHed to the firewall over that VLAN from the Nagios box, rather than futzing around with a 9600 bps console. Next time.)

So...that's my mystery for the weekend.

In other news, my older son (3.25 yrs) has taken to the stage in a big way: he now stands on top of the steps going up from our living room and sings us songs into one of at least two microphones. "Barbara Ann", anything by The Wiggles, and "Yo Gabba Gabba!" songs are prominent. This is after at least three solid weeks of guitar playing, where anything and everything gets strummed while being cradled in his arms while he sings, or maybe makes feedback sounds that'd make Yo La Tengo proud.

Meanwhile, my younger (1.5 yrs) has started saying lots of different phonemes, which is a real contrast to using "Dat!" for monkey, cereal, ball, yes, no, President Barack Obama's attempted health care reforms, and Linux. He has also begun sleeping in 'til 6:30 or 7:00 in the morning, which lets me write things like this. Both are infinitely endearing.

And incidentally, I really need to set up Nagios dependencies. I've had to ACK 27 services in a row (an unrelated, I think, problem with ILOM temperature readings means the SNMP checks are timing out). Either that or there's some way to select n services in Nagios and ack them all at once. Anyone?
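
For my own notes, here's a rough sketch of what a Nagios service dependency might look like for this sort of thing. The host name (ilom-web1) and both service descriptions are made up for illustration, not taken from my actual config. The idea is that when a basic SNMP check against the ILOM is failing, the temperature check that relies on it stops paging me.

```
# Hypothetical names: ilom-web1, "SNMP Reachability", "ILOM Temperature"
define servicedependency {
    # The "master" service that the other check relies on
    host_name                       ilom-web1
    service_description             SNMP Reachability

    # The service that should go quiet when the master is unhappy
    dependent_host_name             ilom-web1
    dependent_service_description   ILOM Temperature

    # Suppress notifications while the master is WARNING, UNKNOWN or CRITICAL
    notification_failure_criteria   w,u,c

    # Skip running the dependent check while the master is UNKNOWN or CRITICAL
    execution_failure_criteria      u,c
}
```

This wouldn't get rid of the 27 alerts by itself; each dependent service needs to be covered by a definition like this (the dependent_service_description directive takes a comma-separated list, which should cut down on the typing).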