Tuesday, January 15: Notify users that there will be a brief
interruption in our Internet access due to $UNIVERSITY network dep't
cutover of our connection from old Bay switches to new Cisco
switches. The cutover will be on Friday at 6:30am; the network dep't
has said an hour, but it's expected to only be about 20 minutes.
Friday, January 18, 8:30am: Get into work to find that our Internet
connection is down. I didn't get notified because the Nagios box can't
send email to my cell phone if it can't get access to the
Internet. Call network help desk and ask if there were problems; they
say no, and everyone else is working just fine. I go to our server
room and start trying to figure out what's wrong; can't find a
thing. Call help desk back, who say they're going to escalate it.
10am: Get call back from the team that did the cutover. They tell me
everything looks fine at their end; as we're the Nth connection to be
cut over, it's not like they haven't had practice with it. I debug
things with them some more, and we still can't find anything wrong:
their settings are correct, mine haven't changed and yet I can't ping
our gateway. (The firewall is an OpenBSD box with two interfaces, set
up as a transparent bridging firewall.) As the firewall box is an
older desktop that had been pressed into service long ago, I decide
it'd be worth taking the new, currently spare (YOU NEVER HEARD ME SAY
THAT) desktop machine and trying that.
Noon: Realize I have no spare Ethernet cards (wha'?). Find two Intel
Pro 100s at the second store I go to. Install OpenBSD 4.2 (yay for
ordering the CD!), copy over config files, and put it into place. No
luck. Still can't ping gateway. While working on the firewall, I
notice something weird: I've accidentally set up a bridge with only
one interface, while my laptop sits behind pinging the gateway
(fruitlessly) ten times a second. (I got desperate.) When I add the
second interface, the connection works — but only for 0.3 seconds. The
behaviour is repeatable.
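
For anyone wanting to put a number on a window like that 0.3 seconds, here is a minimal Python sketch. It assumes you have already collected (timestamp, reachable) samples from whatever probe you like. The function name `up_windows` and the sample data are made up for illustration; nothing here comes from the original debugging session.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, bool]  # (seconds since start, did the probe get a reply?)


def up_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of consecutive successful probes.

    `end` is the timestamp of the last successful probe in the run, so the
    measured length is only as precise as the probe interval.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ok = None
    for ts, ok in samples:
        if ok:
            if start is None:
                start = ts  # a new run of replies begins here
            last_ok = ts
        elif start is not None:
            windows.append((start, last_ok))  # the run just ended
            start = None
    if start is not None:
        windows.append((start, last_ok))  # still up at the final sample
    return windows


# Hypothetical data: probes every 0.1 s, replies only from 1.0 s to 1.3 s.
samples = [(i / 10, 10 <= i <= 13) for i in range(30)]
for begin, end in up_windows(samples):
    print(f"up from {begin:.1f}s to {end:.1f}s ({end - begin:.1f}s)")
```

Probing ten times a second gives a resolution of about 0.1 seconds, which is roughly what it takes to notice a window that short at all.
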
3pm: Right after that, the network people show up to see how things
are going. I tell them the results (nothing except for 0.3 seconds)
and they're mystified. We decide to back out the change from the
morning and debug it next week. Things work again instantly. As the
new firewall works, I leave it in place.
7.02pm: The connection goes down again. I don't get notified.
Saturday, January 19, Noon: I get a call from the boss, who tells me
that a meeting at the offices isn't going well because they have no
Internet access. Call and verify that, yep, that's the case, and I
can't ping there from home. Drive into work.
1.30pm: Arrive and start debugging. Again, nothing wrong that I can
see, but I can't ping our gateway or see its MAC address. Call help
desk, who say they have no record of problems. They'll put in a trouble
ticket, but would like me to double-check before they escalate
it. That's fine — I didn't wait long before calling them — so I do.
2pm: I get a call from the head of the network team that did the
cutover; he'd seen the ticket and is calling to see what's going
on. He and I debug further for 90 minutes. We try hooking up my laptop
to the port the firewall is usually connected to, but that doesn't
work; he can see my laptop's MAC address, but I can't see his.
4pm: He calls The Big Kahuna, who calls me and starts debugging
further while his osso buco cooks. We still can't get anywhere. I try
putting my laptop on another port in another room, hoping that net
access will work from there and maybe I can just string a cable
across. It doesn't.
6pm: We call it a night; he and the other guy are going to come in
tomorrow to track it down. I call nine bosses and one sysadmin to keep
them filled in.
6.30pm: Drive home.
Sunday, January 20, 10.30am: We all show up and start working. We
still can't find anything wrong. The boss calls to ask me to set up a
meeting with the network department for tomorrow; I tell him I will
after we finish fixing the problem.
11.30am: The network team lead gets desperate enough to suggest
rebooting the switch stack. It works. We all slap our heads in
disgust. Turns out that a broadcast storm on Friday evening triggered
a logical failure in the switch we were connected to, resulting in the
firewall's port alone being turned off.
Noon: The boss shows up to see how things are going. He talks with the
network lead while I'm on the phone with The Big Kahuna; we've decided
to try moving to the Cisco switches and make that work while
everyone's here.
12.30pm: The Big Kahuna tells me that the problem is the Spanning
Tree Protocol packets coming from my firewall box; the Cisco switch
doesn't like that and shuts down our port. I go through man pages
until I find the blocknonip option for brconfig. 30 seconds later,
everything is working. Apparently, I'm the only one they've come
across who's running a transparent bridging firewall, so this is the
first time they've seen this problem.
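
As far as I understand the brconfig man page, blocknonip marks a bridge member so that only IPv4, IPv6, ARP and Reverse ARP frames are accepted or forwarded; everything else, Spanning Tree BPDUs included, is dropped. The fix is a one-liner along the lines of `brconfig bridge0 blocknonip fxp0`, where the bridge and interface names are placeholders. The Python sketch below mimics that rule on raw Ethernet frames to show why a BPDU doesn't get through. It illustrates the documented behaviour and is not the kernel's code; `passes_blocknonip` and the sample frames are invented, and VLAN-tagged frames are ignored.

```python
import struct

# EtherTypes that blocknonip is documented to let through.
ALLOWED_ETHERTYPES = {
    0x0800,  # IPv4
    0x86DD,  # IPv6
    0x0806,  # ARP
    0x8035,  # Reverse ARP
}


def passes_blocknonip(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame would pass the filter."""
    if len(frame) < 14:
        return False  # too short to hold an Ethernet header
    # Bytes 12-13 hold the EtherType, or a length for 802.3/LLC frames.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return ethertype in ALLOWED_ETHERTYPES


# Hypothetical frames: real headers, zero-filled payloads.
src = bytes.fromhex("001122334455")
arp_request = (
    bytes.fromhex("ffffffffffff") + src + struct.pack("!H", 0x0806) + bytes(28)
)
# STP BPDUs go to 01:80:c2:00:00:00 as 802.3 frames: a length field
# followed by the LLC header 42 42 03, so there is no IP EtherType to match.
bpdu = (
    bytes.fromhex("0180c2000000") + src + struct.pack("!H", 38)
    + bytes.fromhex("424203") + bytes(35)
)

print(passes_blocknonip(arp_request))  # True
print(passes_blocknonip(bpdu))         # False
```
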
1pm: Debrief the boss. Notify other bosses, sysadmins and users that
everything is back up again, then do some last-minute maintenance.
2pm: Drive home.
One thing: the usual configuration for other departments (that don't
run their own firewall) is to have two Cisco switches running HSRP;
they act as redundant gateways/firewalls that fail over
automagically. The Big Kahuna mentions in passing that this doesn't
work with OpenBSD bridging firewalls. (Our configuration had been
simplified to one switch only on Friday as part of debugging the first
problem; I mention this in case it's helpful to someone. I don't
understand why HSRP wouldn't work with a bridging firewall, so I'm
going to ask him about it tomorrow.)