Yesterday I got paged by one of my two Nagios boxes (learned that
trick the hard way): a bunch of the machines in our server room
had dropped off the network. Weirdly, this did not include the
other Nagios box that's over there. WTF?
I logged into the server room's Nagios box, and sure enough couldn't
ping the servers or the firewall. I could ping the console
server...which was also on the Outside VLAN along with the monitoring
box, as opposed to the Inside VLAN with the servers, which sat behind
our firewall.
I was also able to ping the management cards/ILOMs/SPs/whatever the
kids are calling them in the servers. Thankfully they're Sun boxes,
so no Vista-like maze of flavours there...they all come with
console redirection. I logged in and fired up a console, panicking
because I thought that perhaps the newly-installed NUT clients
had shut down the machines because I'd overlooked something.
But no...the machines were up, though hung if you tried to do any LDAP
lookups. (Through an oversight, the LDAP server was also on the
Outside VLAN. I'll be fixing that today.) Modulo that, they seemed
fine.
So I logged into the firewall, which runs OpenBSD 4.3 in bridging
mode. And this is where the weirdness lay: the bridge, and/or its
component cards, was not working. ifconfig and brconfig said
they were up and fine, and the ARP table was still populated (not sure
what the lifetime of entries is -- isn't it around 20 minutes or so?
must check -- but by this time the problem had been going on for about
an hour). Yet I couldn't ping the firewall (one of those cards has an
address) from either side, and I couldn't ping anything from the
firewall.
pfctl -s all didn't show anything suspicious. There were no obvious
problems in dmesg or /var/log/messages. I disabled, then
re-enabled, the firewall to no effect. I ran /etc/netstart to no
effect.
I even checked on the switches to see if the firewall's MAC address
was showing up anywhere, and it was not -- not even directly after
pinging it (and getting no response).
In the end I rebooted the machine and all was well.
The NIC in question is a dual-port Intel PRO/1000 (MT, I believe) that
I've never had problems with. I've never come across problems like
this before on OpenBSD (or, I think, anywhere else). The onboard
Broadcom (boo, hiss) was acting fine...it was also on the ILOMs'
VLAN, and could see the other ILOMs just fine. (In fact, I should
have just SSHd to the firewall using that VLAN from the Nagios box,
rather than futz around with a 9600 bps console. Next time.)
So...that's my mystery for the weekend.
In other news, my older son (3.25 yrs) has taken to the stage in a
big way: he now stands on top of the steps going up from our living
room and sings us songs into one of at least two microphones.
"Barbara Ann", anything by The Wiggles, and "Yo Gabba Gabba!" songs
are prominent. This is after at least three solid weeks of guitar
playing, where anything and everything gets strummed while being
cradled in his arms while he sings, or maybe makes feedback sounds
that'd make Yo La Tengo proud.
Meanwhile, my younger (1.5 yrs) has started saying lots of different
phonemes, which is a real contrast to using "Dat!" for monkey, cereal,
ball, yes, no, President Barack Obama's attempted health care reforms,
and Linux. He has also begun sleeping in 'til 6:30 or 7:00 in the
morning, which lets me write things like this. Both are infinitely
endearing.
And incidentally, I really need to set up Nagios dependencies. I've
had to ACK 27 services in a row (an unrelated, I think, problem with
ILOM temperature readings means the SNMP checks are timing out). Either that or
there's some way that you can select n services in Nagios to ack all
at once. Anyone?
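
One possible answer to the batch-ack question, short of setting up proper
dependencies: Nagios accepts external commands written to its command
file, and ACKNOWLEDGE_SVC_PROBLEM is one of them. Below is a minimal
sketch in Python that builds one such line per service. The function
names, the host and service names, and the command file path are all
assumptions for illustration (the path depends on how Nagios was
installed, and external commands have to be enabled in nagios.cfg), so
treat it as a starting point rather than a tested tool.

```python
import time


def build_ack_commands(services, author, comment, now=None,
                       sticky=2, notify=1, persistent=1):
    """Return one ACKNOWLEDGE_SVC_PROBLEM line per (host, service) pair.

    Line format, per the Nagios external command interface:
    [time] ACKNOWLEDGE_SVC_PROBLEM;host;service;sticky;notify;persistent;author;comment
    sticky=2 keeps the acknowledgement until the service recovers.
    """
    ts = int(time.time() if now is None else now)
    return [
        f"[{ts}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};"
        f"{sticky};{notify};{persistent};{author};{comment}\n"
        for host, service in services
    ]


def submit(lines, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the command lines to the Nagios command file (a named pipe).

    The default path is an assumption; check command_file in nagios.cfg.
    """
    with open(command_file, "w") as fifo:
        fifo.writelines(lines)
```

Usage would be something like building a list of (host, service) tuples
for the 27 offenders, calling build_ack_commands(pairs, "admin", "known
SNMP timeout"), and passing the result to submit() as a user allowed to
write to the command file.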