Sun Jan 20 20:07:35 PST 2008
Tuesday, January 15: Notify users that there will be a brief
interruption in our Internet access due to $UNIVERSITY network dep't
cutover of our connection from old Bay switches to new Cisco switches.
The cutover will be on Friday at 6:30am; the network dep't has said an
hour, but it's expected to only be about 20 minutes.
Friday, January 18, 8:30am: Get into work to find that our Internet
connection is down. I didn't get notified because the Nagios box
can't send email to my cell phone if it can't get access to the
Internet. Call network help desk and ask if there were problems; they
say no, and everyone else is working just fine. I go to our server
room and start trying to figure out what's wrong; can't find a thing.
Call help desk back, who say they're going to escalate it.
10am: Get call back from the team that did the cutover. They
tell me everything looks fine at their end; as we're the Nth
connection to be cut over, it's not like they haven't had practice
with it. I debug things with them some more, and we still can't find
anything wrong: their settings are correct, mine haven't changed and
yet I can't ping our gateway. (The firewall is an OpenBSD box with
two interfaces, set up as a transparent bridging firewall.) As the
firewall box is an older desktop that had been pressed into service
long ago, I decide it'd be worth taking the new, currently spare (YOU
NEVER HEARD ME SAY THAT) desktop machine and trying that.
Noon: Realize I have no spare ethernet cards (wha'?). Find two Intel
Pro 100s at the second store I go to. Install OpenBSD 4.2 (yay for
ordering the CD!), copy over config files, and put it into place. No
luck. Still can't ping gateway. While working on the firewall, I
notice something weird: I've accidentally set up a bridge with only
one interface, while my laptop sits behind pinging the gateway
(fruitlessly) ten times a second. (I got desperate.) When I add the
second interface, the connection works — but only for 0.3 seconds.
The behaviour is repeatable.
3pm: Right after that, the network people show up to see how things
are going. I tell them the results (nothing except for 0.3 seconds)
and they're mystified. We decide to back out the change from the
morning and debug it next week. Things work again instantly. As the
new firewall works, I leave it in place.
7.02pm: The connection goes down again. I don't get notified.
Saturday, January 19, Noon: I get a call from the boss, who tells me
that a meeting at the offices isn't going well because they have no
Internet access. Call and verify that, yep, that's the case, and I
can't ping there from home. Drive into work.
1.30pm: Arrive and start debugging. Again, nothing wrong that I can
see but I can't ping our gateway or see its MAC address. Call help
desk who say they have no record of problems. They'll put in a
trouble ticket, but would like me to double-check before they escalate
it. That's fine — I didn't wait long before calling them — so I do.
2pm: I get a call from the head of the network team that did the
cutover; he'd seen the ticket and is calling to see what's going on.
He and I debug further for 90 minutes. We try hooking up my laptop to
the port the firewall is usually connected to, but that doesn't work;
he can see my laptop's MAC address, but I can't see his.
4pm: He calls The Big Kahuna, who calls me and starts debugging
further while his osso bucco cooks. We still can't get anywhere. I
try putting my laptop on another port in another room, hoping that net
access will work from there and maybe I can just string a cable
across. It doesn't.
6pm: We call it a night; he and the other guy are going to come in
tomorrow to track it down. I call nine bosses and one sysadmin to
keep them filled in.
6.30pm: Drive home.
Sunday, January 20, 10.30am: We all show up and start working. We
still can't find anything wrong. The boss calls to ask me to set up a
meeting with the network department for tomorrow; I tell him I will
after we finish fixing the problem.
11.30am: The network team lead gets desperate enough to suggest
rebooting the switch stack. It works. We all slap our heads in
disgust. Turns out that a broadcast storm on Friday evening triggered
a logical failure in the switch we were connected to, resulting in the
firewall's port alone being turned off.
Noon: The boss shows up to see how things are going. He talks with
the network lead while I'm on the phone with The Big Kahuna; we've
decided to try moving to the Cisco switches and make that work while
everyone's here.
12.30pm: The Big Kahuna tells me that the problem is the Spanning Tree
Protocol packets coming from my firewall box; the Cisco switch doesn't
like that and shuts down the port. I go through man pages until I
find the blocknonip option for brconfig. 30 seconds later,
everything is working. Apparently, I'm the only one they've come
across who's running a transparent bridging firewall, so this is the
first time they've seen this problem.
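For anyone who hits the same thing, here is a rough sketch of what that fix looks like on the command line. The bridge and interface names (bridge0, fxp0, fxp1) are assumptions for illustration, not copied from the real config; substitute whatever your firewall actually uses.

```shell
# Sketch only: bridge0, fxp0 and fxp1 are assumed names, not the real config.
# Run as root on the OpenBSD bridging firewall.

# Build the transparent bridge from the two member interfaces.
brconfig bridge0 add fxp0 add fxp1 up

# Mark each member so that only IPv4, IPv6, ARP and RARP frames are
# accepted from it or forwarded to it. Spanning Tree BPDUs are not IP,
# so they no longer cross the bridge toward the Cisco switch.
brconfig bridge0 blocknonip fxp0
brconfig bridge0 blocknonip fxp1

# Show the bridge and its members to check that the flag took effect.
brconfig bridge0
```

To make the change survive a reboot, the same options would also have to go into whatever startup file configures the bridge.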
1pm: Debrief the boss. Notify other bosses, sysadmins and users that
everything is back up again, then do some last-minute maintenance.
2pm: Drive home.
One thing: the usual configuration for other departments (that don't
run their own firewall) is to have two Cisco switches running HSRP;
they act as redundant gateways/firewalls that fail over automagically.
The Big Kahuna mentions in passing that this doesn't work with OpenBSD
bridging firewalls. I don't understand why that would be the case, so
I'm going to ask him about it tomorrow; I mention it in case it's
helpful to someone. (Our configuration had been simplified to one
switch only on Friday as part of debugging the first problem.)
Comment from Jason Antman
Date: Wed, 23 Jan 2008 19:14:29 -0500
Interesting Story....
As to the Nagios alerts, I've had the same problem with my personal
systems. Have you looked into either a hardware SMS device (with GSM,
you can use a cheap Nokia cell and a serial interface, or Bluetooth with
many newer ones) or text-to-speech through a minimal Asterisk setup?
On the other hand, sounds like a wonderful work environment. I work for
$OTHERuniversity. Our helpdesk is open pretty much all day and evening,
but the NOC is only staffed 8-6 M-F, so network issues (even
connectivity to whole dorm buildings housing 600+ residents) can go for
up to 18 hours without any work being done.
-Jason