Where'd that bridge go?
05 Oct 2009
Yesterday I got paged by one of my two Nagios boxes (learned that trick the hard way): a bunch of the machines in our server room had dropped off the network. Weirdly, this did not include the other Nagios box that's over there. WTF?
I logged into the server room's Nagios box, and sure enough couldn't ping the servers or the firewall. I could ping the console server...which was also on the Outside VLAN along with the monitoring box, as opposed to the Inside VLAN with the servers, which sat behind our firewall.
I was also able to ping the management cards/ILOMs/SPs/whatever the kids are calling them in the servers. Thankfully they're Sun boxes, so no Vista-like maze of flavours there...they all come with console redirection. I logged in and fired up a console, panicking because I thought that perhaps the newly-installed NUT clients had shut down the machines because I'd overlooked something.
But no...the machines were up, though hung if you tried to do any LDAP lookups. (Through an oversight, the LDAP server was also on the Outside VLAN. I'll be fixing that today.) Modulo that, they seemed fine.
So I logged into the firewall, which runs OpenBSD 4.3 in bridging mode. And this is where the weirdness lay: the bridge, and/or its component cards, was not working. ifconfig and brconfig said they were up and fine, and the ARP table was still populated (not sure what the lifetime of entries is -- isn't it around 20 minutes or so? must check -- but by this time the problem had been going on for about an hour). Yet I couldn't ping the firewall (one of those cards has an address) from either side, and I couldn't ping anything from the firewall.
pfctl -s all didn't show anything suspicious. There were no obvious problems in dmesg or /var/log/messages. I disabled, then re-enabled, the firewall to no effect. I ran /etc/netstart to no effect.
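Since a reboot wipes out the evidence, it might be worth grabbing the output of those same commands in one go before pulling the trigger. Here's a minimal Python sketch of that idea. The snapshot() helper and the COMMANDS list are my own illustration, not anything that was run at the time; arp -an is added to capture the ARP table, brconfig -a assumes an OpenBSD old enough to still ship brconfig, and it needs a Python recent enough to have subprocess.run.

```python
import subprocess
import time

# Commands named in the post, plus `arp -an` for the ARP table.
# Assumption: brconfig is still a separate tool on this OpenBSD release.
COMMANDS = [
    ["ifconfig"],
    ["brconfig", "-a"],
    ["arp", "-an"],
    ["pfctl", "-s", "all"],
    ["dmesg"],
]


def snapshot(commands=COMMANDS, timeout=30):
    """Run each command and return a single text report of all the output."""
    sections = [time.strftime("# snapshot taken %Y-%m-%d %H:%M:%S")]
    for argv in commands:
        sections.append("## " + " ".join(argv))
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            sections.append(result.stdout + result.stderr)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing or hung command should not stop the other captures.
            sections.append("failed: %s" % exc)
    return "\n".join(sections)
```

Calling print(snapshot()) as root and redirecting it to a file on local disk would leave something to compare against after the reboot.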
I even checked on the switches to see if the firewall's MAC address was showing up anywhere, and it was not -- not even directly after pinging it (and getting no response).
In the end I rebooted the machine and all was well.
The NIC in question is a dual-port Intel Pro 1000 (MT, I believe) that I've never had problems with. I've never come across problems like this before on OpenBSD (or, I think, anywhere else). The onboard Broadcom (boo, hiss) was acting fine...it was also on the ILOM's VLAN, and could see the other ILOMs just fine. (In fact, I should have just SSHd to the firewall using that VLAN from the Nagios box, rather than futz around with a 9600 bps console. Next time.)
So...that's my mystery for the weekend.
In other news, my older son (3.25 yrs) has taken to the stage in a big way: he now stands on top of the steps going up from our living room and sings us songs into one of at least two microphones. "Barbara Ann", anything by The Wiggles, and "Yo Gabba Gabba!" songs are prominent. This is after at least three solid weeks of guitar playing, where anything and everything gets strummed while being cradled in his arms while he sings, or maybe makes feedback sounds that'd make Yo La Tengo proud.
Meanwhile, my younger (1.5 yrs) has started saying lots of different phonemes, which is a real contrast to using "Dat!" for monkey, cereal, ball, yes, no, President Barack Obama's attempted health care reforms, and Linux. He has also begun sleeping in 'til 6:30 or 7:00 in the morning, which lets me write things like this. Both are infinitely endearing.
And incidentally, I really need to set up Nagios dependencies. I've had to ACK 27 services in a row (an unrelated, I think, problem with the ILOM temperature readings means SNMP checks are timing out). Either that or there's some way that you can select n services in Nagios to ack all at once. Anyone?
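One possible answer to the ack-them-all question: Nagios accepts ACKNOWLEDGE_SVC_PROBLEM lines written to its external command file, so a short script can build one line per service. What follows is a sketch under assumptions, not a tested recipe for this setup: the ack_commands() and submit() helpers are hypothetical names, the command file location depends on how Nagios was installed, and the sticky/notify/persistent values (2, 0, 1) are just one reasonable choice.

```python
import time


def ack_commands(services, author, comment, now=None):
    """Build one ACKNOWLEDGE_SVC_PROBLEM line per (host, service) pair."""
    ts = int(time.time() if now is None else now)
    lines = []
    for host, service in services:
        # Fields after the command name:
        # host;service;sticky;notify;persistent;author;comment
        # sticky=2 keeps the ack until the service recovers,
        # notify=0 skips the acknowledgement notification,
        # persistent=1 keeps the comment across Nagios restarts.
        lines.append(
            "[%d] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;2;0;1;%s;%s"
            % (ts, host, service, author, comment)
        )
    return lines


def submit(lines, command_file):
    """Write the lines to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as fh:
        for line in lines:
            fh.write(line + "\n")
```

Usage would be along the lines of submit(ack_commands(pairs, "me", "ILOM SNMP timeouts"), path), where pairs is the list of (host, service) tuples to acknowledge and path is whatever command_file is set to in nagios.cfg. External commands have to be enabled in Nagios for this to do anything.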