the life of a sysadmin.
Carousel is a lie!


Email: aardvark at saintaardvarkthecarpeted dot com

New toy

Fri Jan 25 13:12:30 PST 2008

My workplace just got me a new cell phone: the Sony Ericsson W200a Walkman phone. The provider is Rogers; minus two points for not letting me make an MP3 into a ring tone, but plus three for letting MidpSSH work. It was a lark to be able to check mail on my firewall box; Mutt was surprisingly usable. No idea how much data costs on the plan I've got, and I don't plan on actually SSHing around very much, if at all…but still, fun. And, as mentioned elsewhere, kudos for including a USB cable and making the phone show up as an ordinary mass storage device.


One day without interruptions

Wed Jan 23 16:24:35 PST 2008

It was everything I thought it would be. APCUPSd set up, new Postfix map in place for verdammt Sympa lists (replacing the old regexp-based one that allowed far too much backscatter), and a new (though very minimal) offsite Nagios installation. Beautiful.
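The post doesn't show either Postfix map, so the following is only a guess at the general idea, sketched in Python rather than in Postfix's own map syntax. It assumes the old regexp accepted anything shaped like a list address, so mail to lists that don't exist got accepted and bounced later (backscatter), while an explicit table only accepts addresses that really exist. The domain, list names and suffixes are all made up for illustration.

```python
import re

# Hypothetical old-style rule: anything that *looks* like a list address is accepted.
LIST_PATTERN = re.compile(r"^[a-z0-9-]+(-request|-owner)?@lists\.example\.org$")

# Hypothetical explicit table: only addresses for lists that actually exist.
KNOWN_LISTS = {"staff", "announce"}
SUFFIXES = ("", "-request", "-owner")
KNOWN_ADDRESSES = {
    f"{name}{suffix}@lists.example.org"
    for name in KNOWN_LISTS
    for suffix in SUFFIXES
}


def accepted_by_regexp(rcpt: str) -> bool:
    """Accept any recipient whose shape matches the pattern."""
    return LIST_PATTERN.match(rcpt.lower()) is not None


def accepted_by_table(rcpt: str) -> bool:
    """Accept only recipients listed explicitly in the table."""
    return rcpt.lower() in KNOWN_ADDRESSES


if __name__ == "__main__":
    for rcpt in ("staff-request@lists.example.org", "nosuchlist@lists.example.org"):
        print(rcpt, accepted_by_regexp(rcpt), accepted_by_table(rcpt))
```

The point of the explicit table is that mail for a nonexistent list can be refused during the SMTP conversation instead of being accepted and bounced to a (probably forged) sender afterwards.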


Project U-13, 0.0.3

Wed Jan 23 05:58:00 PST 2008

Version 0.0.3 of Project U-13, a distro for sysadmins, has been released!

The main change is the addition of RackMonkey, which its website describes as "a web-based tool for managing racks of equipment such as web servers, video encoders, routers and storage devices", at the suggestion of Andy Seely. Lynx has been installed as well, and there are the skeletal beginnings of a Cfengine config file.

The ISO has been signed with my public key. Share and enjoy, and comments on a postcard, please.


LOLCODE

Mon Jan 21 05:54:22 PST 2008

Yes, I love LOLcats with a love that is fierce. (Though the comments all written in LOLcat just strike me as unnecessary. I know, but that's where the line is for me.)

But LOLCODE just makes me laugh and laugh and laugh:

HAI
CAN HAS STDIO?
PLZ OPEN FILE "LOLCATS.TXT"?
        AWSUM THX
                VISIBLE FILE
        O NOES
                INVISIBLE "ERROR!"
KTHXBYE

I may have to ask for the t-shirt for my birthday. Or maybe I'll just print out the syntax for the wall of my office.


The Weekend

Sun Jan 20 20:07:35 PST 2008

Tuesday, January 15: Notify users that there will be a brief interruption in our Internet access due to $UNIVERSITY network dep't cutover of our connection from old Bay switches to new Cisco switches. The cutover will be on Friday at 6:30am; the network dep't has said an hour, but it's expected to only be about 20 minutes.

Friday, January 18, 8:30am: Get into work to find that our Internet connection is down. I didn't get notified because the Nagios box can't send email to my cell phone if it can't get access to the Internet. Call network help desk and ask if there were problems; they say no, and everyone else is working just fine. I go to our server room and start trying to figure out what's wrong; can't find a thing. Call help desk back, who say they're going to escalate it.

10am: Get call back from the team that did the cutover. They tell me everything looks fine at their end; as we're the Nth connection to be cut over, it's not like they haven't had practice with it. I debug things with them some more, and we still can't find anything wrong: their settings are correct, mine haven't changed and yet I can't ping our gateway. (The firewall is an OpenBSD box with two interfaces, set up as a transparent bridging firewall.) As the firewall box is an older desktop that had been pressed into service long ago, I decide it'd be worth taking the new, currently spare (YOU NEVER HEARD ME SAY THAT) desktop machine and trying that.

Noon: Realize I have no spare ethernet cards (wha'?). Find two Intel Pro 100s at the second store I go to. Install OpenBSD 4.2 (yay for ordering the CD!), copy over config files, and put it into place. No luck. Still can't ping gateway. While working on the firewall, I notice something weird: I've accidentally set up a bridge with only one interface, while my laptop sits behind pinging the gateway (fruitlessly) ten times a second. (I got desperate.) When I add the second interface, the connection works — but only for 0.3 seconds. The behaviour is repeatable.

3pm: Right after that, the network people show up to see how things are going. I tell them the results (nothing except for 0.3 seconds) and they're mystified. We decide to back out the change from the morning and debug it next week. Things work again instantly. As the new firewall works, I leave it in place.

7.02pm: The connection goes down again. I don't get notified.

Saturday January 19, Noon: I get a call from the boss, who tells me that a meeting at the offices isn't going well because they have no Internet access. Call and verify that, yep, that's the case, and I can't ping there from home. Drive into work.

1.30pm: Arrive and start debugging. Again, nothing wrong that I can see but I can't ping our gateway or see its MAC address. Call help desk who say they have no record of problems. They'll put in a trouble ticket, but would like me to double-check before they escalate it. That's fine — I didn't wait long before calling them — so I do.

2pm: I get a call from the head of the network team that did the cutover; he'd seen the ticket and is calling to see what's going on. He and I debug further for 90 minutes. We try hooking up my laptop to the port the firewall is usually connected to, but that doesn't work; he can see my laptop's MAC address, but I can't see his.

4pm: He calls The Big Kahuna, who calls me and starts debugging further while his osso buco cooks. We still can't get anywhere. I try putting my laptop on another port in another room, hoping that net access will work from there and maybe I can just string a cable across. It doesn't.

6pm: We call it a night; he and the other guy are going to come in tomorrow to track it down. I call nine bosses and one sysadmin to keep them filled in.

6.30pm: Drive home.

Sunday, January 20, 10.30am: We all show up and start working. We still can't find anything wrong. The boss calls to ask me to set up a meeting with the network department for tomorrow; I tell him I will after we finish fixing the problem.

11.30am: The network team lead gets desperate enough to suggest rebooting the switch stack. It works. We all slap our heads in disgust. Turns out that a broadcast storm on Friday evening triggered a logical failure in the switch we were connected to, resulting in the firewall's port alone being turned off.

Noon: The boss shows up to see how things are going. He talks with the network lead while I'm on the phone with The Big Kahuna; we've decided to try moving to the Cisco switches and make that work while everyone's here.

12.30pm: The Big Kahuna tells me that the problem is the Spanning Tree Protocol packets coming from my firewall box; the Cisco switch doesn't like that and shuts down the port. I go through man pages until I find the blocknonip option for brconfig. 30 seconds later, everything is working. Apparently, I'm the only one they've come across who's running a transparent bridging firewall, so this is the first time they've seen this problem.
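A rough Python sketch of why blocknonip would make the spanning tree packets go away. It assumes, going by the brconfig man page description, that the option only lets IPv4, IPv6, ARP and Reverse ARP frames through on the marked interface. This is not OpenBSD's code: the function names and sample frames are invented for illustration, and VLAN tags are ignored. Spanning tree BPDUs are 802.3/LLC frames, so the two bytes where an EtherType would normally sit hold a length instead, and the frame falls outside the allowed set.

```python
# Illustrative only: a toy model of a "pass only IP-related frames" filter.
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETHERTYPE_RARP = 0x8035
ETHERTYPE_IPV6 = 0x86DD

ALLOWED_ETHERTYPES = frozenset(
    {ETHERTYPE_IPV4, ETHERTYPE_ARP, ETHERTYPE_RARP, ETHERTYPE_IPV6}
)


def passes_nonip_filter(frame: bytes) -> bool:
    """Return True if the frame carries IPv4, IPv6, ARP or Reverse ARP."""
    if len(frame) < 14:  # too short to hold an Ethernet header
        return False
    # Bytes 12-13 hold the EtherType (or, for 802.3 frames, a length <= 1500).
    type_or_length = int.from_bytes(frame[12:14], "big")
    return type_or_length in ALLOWED_ETHERTYPES


def make_frame(dst: bytes, src: bytes, type_or_length: int, payload: bytes) -> bytes:
    """Build a bare Ethernet frame (no VLAN tag, no checksum)."""
    return dst + src + type_or_length.to_bytes(2, "big") + payload


STP_MULTICAST = bytes.fromhex("0180c2000000")  # destination used by spanning tree
BROADCAST = b"\xff" * 6
HOST = bytes.fromhex("020000000001")  # made-up, locally administered address

# A BPDU: length field (38) followed by the LLC header 0x42 0x42 0x03.
bpdu = make_frame(STP_MULTICAST, HOST, 38, b"\x42\x42\x03" + bytes(35))
arp = make_frame(BROADCAST, HOST, ETHERTYPE_ARP, bytes(28))

if __name__ == "__main__":
    print("ARP passes:", passes_nonip_filter(arp))
    print("BPDU passes:", passes_nonip_filter(bpdu))
```

With a filter like this in place the bridge stops handing spanning tree frames to the switch port, which fits with the connection coming back as soon as the option was set.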

1pm: Debrief the boss. Notify other bosses, sysadmins and users that everything is back up again, then do some last-minute maintenance.

2pm: Drive home.

One thing: the usual configuration for other departments (that don't run their own firewall) is to have two Cisco switches running HSRP; they act as redundant gateways/firewalls that fail over automagically. The Big Kahuna mentions in passing that this doesn't work with OpenBSD bridging firewalls. (Our configuration had been simplified to one switch only on Friday as part of debugging the first problem; I mention this in case this is helpful to someone. I don't understand why this might be the case, so I'm going to ask him about this tomorrow.)


Coming up

Fri Jan 18 06:07:07 PST 2008

My laptop hard drive started giving scary errors a couple days ago on the way to work (I've got a 90-minute commute by public transit [uck] so I fill the time by reading, listening to podcasts, or working on Project U-13). Fortunately, working at a university means that there are two computer stores on campus. I ran out at lunch, picked up a 100GB drive, and had things back to normal by the next morning.

Well, normal modulo one false start with Debian; I decided to try encrypted filesystems just for fun. But then I suspended, came back with a newer kernel, and it could not read the encrypted LVM group anymore. Whoops.

Still lots of free space on this thing, and I'm thinking of installing Ubuntu, FreeBSD and maybe NetBSD just for fun. Of course, I've got to do it all via PXE since this thing doesn't have any CDROM drive, but that just adds to the geek points.

Project U-13 is coming up on 0.0.3, btw; Andy suggested adding RackMonkey, which looks quite cool. There's no package for it, so I'm having to do some rather ugly scripted installation…but I can stand it for now. And I've got the barest skeleton of a Cfengine file in there too. Watch the skies!
