The Weekend

Tuesday, January 15: Notify users that there will be a brief interruption in our Internet access while the $UNIVERSITY network dep't cuts our connection over from the old Bay switches to new Cisco switches. The cutover will be on Friday at 6:30am; the network dep't has scheduled an hour for it, but it's expected to take only about 20 minutes.

Friday, January 18, 8:30am: Get into work to find that our Internet connection is down. I didn't get notified because the Nagios box can't send email to my cell phone if it can't reach the Internet. Call network help desk and ask if there were problems; they say no, and everyone else is working just fine. I go to our server room and start trying to figure out what's wrong; can't find a thing. Call help desk back, who say they're going to escalate it.

10am: Get call back from the team that did the cutover. They tell me everything looks fine at their end; as we're the Nth connection to be cut over, it's not like they haven't had practice with it. I debug things with them some more, and we still can't find anything wrong: their settings are correct, mine haven't changed and yet I can't ping our gateway. (The firewall is an OpenBSD box with two interfaces, set up as a transparent bridging firewall.) As the firewall box is an older desktop that had been pressed into service long ago, I decide it'd be worth taking the new, currently spare (YOU NEVER HEARD ME SAY THAT) desktop machine and trying that.
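
For context, a transparent bridging firewall on OpenBSD of that vintage is just a two-member bridge(4), with pf doing the filtering on the member interfaces and no IP addresses on the bridged ports. Roughly, using fxp0 and fxp1 as illustrative interface names (not necessarily ours), the whole setup looks like this:

    /etc/hostname.fxp0 and /etc/hostname.fxp1 (no address, just bring the interface up):
        up

    /etc/bridgename.bridge0 (glue the two interfaces together):
        add fxp0
        add fxp1
        up

Traffic crossing the bridge is still run through pf on the members, which is what makes the box a firewall rather than just a repeater.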

Noon: Realize I have no spare Ethernet cards (wha'?). Find two Intel Pro 100s at the second store I go to. Install OpenBSD 4.2 (yay for ordering the CD!), copy over the config files, and put it into place. No luck. Still can't ping the gateway. While working on the firewall, I notice something weird: I've accidentally set up a bridge with only one interface, while my laptop sits behind it, pinging the gateway (fruitlessly) ten times a second. (I got desperate.) When I add the second interface, the connection works, but only for 0.3 seconds. The behaviour is repeatable.
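
For what it's worth, spotting and fixing the one-interface mistake is quick at the command line; something like the following, with fxp1 standing in for whichever member was missing:

    brconfig bridge0                 # no arguments: list the bridge's members and flags
    brconfig bridge0 add fxp1 up     # add the missing member and make sure the bridge is up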

3pm: Right after that, the network people show up to see how things are going. I tell them the results (nothing except for 0.3 seconds) and they're mystified. We decide to back out the change from the morning and debug it next week. Things work again instantly. As the new firewall works, I leave it in place.

7:02pm: The connection goes down again. I don't get notified.

Saturday, January 19, Noon: I get a call from the boss, who tells me that a meeting at the offices isn't going well because they have no Internet access. Call and verify that, yep, that's the case, and I can't ping there from home. Drive into work.

1:30pm: Arrive and start debugging. Again, nothing wrong that I can see, but I can't ping our gateway or see its MAC address. Call help desk, who say they have no record of problems. They'll put in a trouble ticket, but would like me to double-check before they escalate it. That's fine (I didn't wait long before calling them), so I do.

2pm: I get a call from the head of the network team that did the cutover; he'd seen the ticket and is calling to see what's going on. He and I debug further for 90 minutes. We try hooking up my laptop to the port the firewall is usually connected to, but that doesn't work; he can see my laptop's MAC address, but I can't see his.

4pm: He calls The Big Kahuna, who calls me and starts debugging further while his osso buco cooks. We still can't get anywhere. I try putting my laptop on another port in another room, hoping that net access will work from there and maybe I can just string a cable across. It doesn't.

6pm: We call it a night; he and the other guy are going to come in tomorrow to track it down. I call nine bosses and one sysadmin to keep them filled in.

6:30pm: Drive home.

Sunday, January 20, 10:30am: We all show up and start working. We still can't find anything wrong. The boss calls to ask me to set up a meeting with the network department for tomorrow; I tell him I will after we finish fixing the problem.

11:30am: The network team lead gets desperate enough to suggest rebooting the switch stack. It works. We all slap our heads in disgust. Turns out that a broadcast storm on Friday evening triggered a logical failure in the switch we were connected to, resulting in the firewall's port alone being turned off.

Noon: The boss shows up to see how things are going. He talks with the network lead while I'm on the phone with The Big Kahuna; we've decided to retry the move to the Cisco switches and make it work while everyone's here.

12:30pm: The Big Kahuna tells me that the problem is the Spanning Tree Protocol packets coming from my firewall box; the Cisco switch doesn't like them and shuts the port down. I go through man pages until I find the blocknonip option for brconfig. 30 seconds later, everything is working. Apparently, I'm the only one they've come across who's running a transparent bridging firewall, so this is the first time they've seen this problem.
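
For anyone else running an OpenBSD bridging firewall into Cisco gear, the fix is a one-liner; bridge0 and fxp0 below are illustrative stand-ins for the real bridge and the member facing the switch. blocknonip tells the bridge to block non-IP frames, STP BPDUs included, on that member, so they never reach the Cisco port:

    brconfig bridge0 blocknonip fxp0    # stop non-IP frames (e.g. STP BPDUs) on this member

Adding the same blocknonip line to /etc/bridgename.bridge0 should make it stick across reboots.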

1pm: Debrief the boss. Notify other bosses, sysadmins and users that everything is back up again, then do some last-minute maintenance.

2pm: Drive home.

One thing: the usual configuration for other departments (ones that don't run their own firewall) is two Cisco switches running HSRP, acting as redundant gateways/firewalls that fail over automagically. The Big Kahuna mentions in passing that this doesn't work with OpenBSD bridging firewalls. (Our configuration had been simplified to just one switch on Friday as part of debugging the first problem; I mention this in case it's helpful to someone.) I don't understand why HSRP and a bridging firewall wouldn't get along, so I'm going to ask him about it tomorrow.