This bit me in the ass today: my workstation's MAC address was suddenly changing to aa:00:04:00:0a:04. The problem turns out to be DECnet, which got added when I installed cmus. RAWR.
The bge driver for OpenBSD says that the Broadcom BCM5700 series of interfaces has two MIPS R4000 cpus. And you can run Linux on an R4000, or NetBSD.
Must...stop...recursion...
Reminder to myself: Got a file called .nfs.*? Here's what's going on:
# These files are created by NFS clients when an open file is
# removed. To preserve some semblance of Unix semantics the client
# renames the file to a unique name so that the file appears to have
# been removed from the directory, but is still usable by the process
# that has the file open.
That quote is from /usr/lib/fs/nfs/nfsfind, a shell script on Solaris 10 that's run once a week from root's crontab. Some references:
Arghh...I just spent 24 hours trying to figure out why shadow migration was causing our new 7310 to hang. The answer? Because jumbo frames were not enabled on the switch the 7310 was on, and they were on the machine we're migrating from. Arghh, I say!
Following in Matt's footsteps, I ran into a serious problem just before heading to LISA.
Wednesday afternoon, I'm showing my (sort of) backup how to connect to the console server. Since we're already on the firewall, I get him to SSH to it from there, I show him how to connect to a serial port, and we move on.
About an hour later, I get paged about problems with the database server: SSH and SNMP aren't responding. I try to log in, and sure enough it hangs. I connect to its console and log in as root; it works instantly. Uh-oh, I smell LDAP problems...only there's nothing in the logs, and id <uid> works fine. I flip to another terminal and try SSHing to another machine, and that doesn't work either. But already-existing sessions work fine until I try to run sudo or do ls -l. So yeah, that's LDAP.
I try connecting via openssl to the LDAP server (stick alias telnets='openssl s_client -connect' in your .bashrc today!) and get this:
CONNECTED(00000003)
...and that's all. Wha? I tried connecting to it from the other LDAP server and got the usual (certificate, certificate chain, cipher, driver's license, note from mom, etc). Now that's just weird.
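(For reference, here's how that alias gets used in anger; the hostname and port below are made up, not our real server:)
$ alias telnets='openssl s_client -connect'
$ telnets ldap.example.com:636
A healthy LDAPS endpoint spits back the certificate, the chain and the negotiated cipher; a bare CONNECTED(00000003) means the TCP handshake worked but the TLS negotiation went nowhere.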
After a long and fruitless hour trying to figure out if the LDAP server had suddenly decided that SSL was for suckers and chumps, I finally thought to run tcpdump on the client, the LDAP server and the firewall (which sits between the two). And there it was, plain as day:
Near as I can figure, this was the sequence of events:
This took me two hours to figure out, and another 90 minutes to fix; setting the link speed manually on the firewall just convinced the nic/driver/kernel that there was no carrier there. In the end the combination that worked was telling the switch it was a gigabit port, but letting it negotiate duplexiciousnessity.
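For my own notes, this is roughly what I mean by forcing versus negotiating on the OpenBSD side; the interface name is an assumption, and I'm not claiming these are the exact commands I ran:
$ sudo ifconfig em0 media 1000baseT mediaopt full-duplex    # hard-set speed and duplex; this is what convinced it there was no carrier
$ sudo ifconfig em0 media autoselect                        # back to letting it negotiate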
Gah. Just gah.
So this morning, again, I got paged about machines in our server room dropping off the network. And again, it was the bridge that was the problem. This time, though, I think I've figured out what the problem is.
The firewall has two interfaces, em0 (on the outside) and em1 (on the inside), which are bridged. em1 has an IP address. I was able to SSH to the machine from the outside and poke around a bit. I still didn't find anything in the logs, but I did notice this (edited for brevity):
$ ifconfig
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 9000
lladdr 00:15:17:ab:cd:ef
media: Ethernet autoselect (1000baseT full-duplex)
status: active
inet6 fe80::215:17ff:feab:cdef%em0 prefixlen 64 scopeid 0x1
em1: flags=8d43<UP,BROADCAST,RUNNING,PROMISC,OACTIVE,SIMPLEX,MULTICAST> mtu 9000
lladdr 00:15:17:ab:cd:ee
groups: egress
media: Ethernet autoselect (1000baseT full-duplex)
status: active
inet 10.0.0.1 netmask 0xffffff80 broadcast 10.0.0.1
inet6 fe80::215:17ff:feab:cdee%em1 prefixlen 64 scopeid 0x2
See that? em1 has OACTIVE set. A quick search turned up some interesting hits, so for fun I tried resetting the interface:
$ sudo ifconfig em1 down
$ sudo ifconfig em1 up
and huzzah! it worked.
When I got to work I did some more digging and figured out that this and the earlier outage were almost certainly caused by running a full backup, via Bacula, of the /home partition on the machine. The timing was just about exact. The weird thing, though, is that there's less data on it than on /var, which was backed up successfully both times:
$ df -hl
Filesystem Size Used Avail Capacity Mounted on
/dev/sd0a 509M 42.4M 442M 9% /
/dev/sd0g 106G 11.4G 89.1G 11% /home
/dev/sd0d 3.9G 6.0K 3.7G 0% /tmp
/dev/sd0f 15.7G 2.4G 12.5G 16% /usr
/dev/sd0e 15.7G 13.6G 1.4G 91% /var
The bacula file daemon logged this on the firewall:
Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe
Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Error: bsock.c:306 Write error sending 36841 bytes to Storage daemon:backup.example.com:9103: ERR=Broken pipe
With the earlier outage it was 65536 bytes, but otherwise the same error.
Okay, so the firewall's working again...now what? I'm about to head off to LISA in three days, so I can't very well upgrade to the latest OpenBSD right now. I settled for a quick check that watches for the OACTIVE flag and, if found, resets the interface. Hopefully that'll keep things going 'til I get back.
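The check is nothing fancy; something along these lines, run out of cron every few minutes (em1 hard-coded, and I make no claim this is a proper fix rather than a band-aid):
#!/bin/sh
# If em1 is wedged with OACTIVE set, log it and bounce the interface.
IF=em1
if ifconfig $IF | grep -q OACTIVE; then
    logger "bouncing $IF: OACTIVE set"
    ifconfig $IF down && ifconfig $IF up
fi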
Yesterday I got paged by one of my two Nagios boxes (learned that trick the hard way): a bunch of the machines in our server room had dropped off the network. Weirdly, this did not include the other Nagios box that's over there. WTF?
I logged into the server room's Nagios box, and sure enough couldn't ping the servers or the firewall. I could ping the console server...which was also on the Outside VLAN along with the monitoring box, as opposed to the Inside VLAN with the servers, which sat behind our firewall.
I was also able to ping the management cards/ILOMs/SPs/whatever the kids are calling them in the servers. Thankfully they're Sun boxes, so no Vista-like maze of flavours there...they all come with console redirection. I logged in and fired up a console, panicking because I thought that perhaps the newly-installed NUT clients had shut down the machines because I'd overlooked something.
But no...the machines were up, though hung if you tried to do any LDAP lookups. (Through an oversight, the LDAP server was also on the Outside VLAN. I'll be fixing that today.) Modulo that, they seemed fine.
So I logged into the firewall, which runs OpenBSD 4.3 in bridging mode. And this is where the weirdness lay: the bridge, and/or its component cards, was not working. ifconfig and brconfig said they were up and fine, and the ARP table was still populated (not sure what the lifetime of entries is -- isn't it around 20 minutes or so? must check -- but by this time the problem had been going on for about an hour). Yet I couldn't ping the firewall (one of those cards has an address) from either side, and I couldn't ping anything from the firewall.
pfctl -s all didn't show anything suspicious. There were no obvious problems in dmesg or /var/log/messages. I disabled, then re-enabled, the firewall to no effect. I ran /etc/netstart to no effect.
I even checked on the switches to see if the firewall's MAC address was showing up anywhere, and it was not -- not even directly after pinging it (and getting no response).
In the end I rebooted the machine and all was well.
The NIC in question is a dual-port Intel Pro 1000 (MT, I believe) that I've never had problems with. I've never come across problems like this before on OpenBSD (or, I think, anywhere else). The onboard Broadcom (boo, hiss) was acting fine...it was also on the ILOM's VLAN, and could see the other ILOMs just fine. (In fact, I should have just SSHd to the firewall using that VLAN from the Nagios box, rather than futz around with a 9600 bps console. Next time.)
So...that's my mystery for the weekend.
In other news, my older son (3.25 yrs) has taken to the stage in a big way: he now stands on top of the steps going up from our living room and sings us songs into one of at least two microphones. "Barbara Ann", anything by The Wiggles, and "Yo Gabba Gabba!" songs are prominent. This is after at least three solid weeks of guitar playing, where anything and everything gets strummed while being cradled in his arms while he sings, or maybe makes feedback sounds that'd make Yo La Tengo proud.
Meanwhile, my younger (1.5 yrs) has started saying lots of different phonemes, which is a real contrast to using "Dat!" for monkey, cereal, ball, yes, no, President Barack Obama's attempted health care reforms, and Linux. He has also begun sleeping in 'til 6:30 or 7:00 in the morning, which lets me write things like this. Both are infinitely endearing.
And incidentally, I really need to set up Nagios dependencies. I've had to ACK 27 services in a row (unrelated (I think) problem with ILOM temperature taking means SNMP checks are timing out). Either that or there's some way that you can select n services in Nagios to ack all at once. Anyone?
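(One way I know of, if you don't mind skipping the web UI: write ACKNOWLEDGE_SVC_PROBLEM lines straight into Nagios's external command file. The path and the host/service names below are made up, and the real path depends on how Nagios was built:)
NOW=$(date +%s)
CMD=/var/spool/nagios/cmd/nagios.cmd        # wherever nagios.cfg points command_file
for SVC in SNMP Temperature Load; do
    echo "[$NOW] ACKNOWLEDGE_SVC_PROBLEM;ilom-host1;$SVC;1;0;1;me;ILOM temps flaking out" >> $CMD
done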
Weird...Just ran into a problem with restarting bacula-sd. For some reason, the previous instance had died badly and left a zombie process. I restarted bacula-sd but was left with an open port:
# sudo netstat -tupan | grep 9103 tcp 0 0 0.0.0.0:9103 0.0.0.0:* LISTEN -
which meant that bconsole hung every time it tried to get the status of bacula-sd. Unsure what to do, I tried telnetting to it for fun and then quit; after that the port was freed up and grabbed by the already-running storage daemon:
tcp 0 0 0.0.0.0:9103 0.0.0.0:* LISTEN 16254/bacula-sd
and bconsole was able to see it just fine:
Connecting to Storage daemon tape at bacula.example.com:9103
example-sd Version: 3.0.1 (30 April 2009) x86_64-example-linux-gnu example
Daemon started 06-Jul-09 10:18, 0 Jobs run since started.
Heap: heap=180,224 smbytes=25,009 max_bytes=122,270 bufs=94 max_bufs=96
Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8
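Note to self for next time: ask who actually owns the socket before poking at it with telnet. Any of these (Linux-flavoured) would have told me:
$ sudo netstat -tlnp | grep 9103
$ sudo fuser -n tcp 9103
$ sudo lsof -i tcp:9103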
Full day:
Just discovered, while trying to test the mail server at $WORK, that my ISP filters outgoing port 25. I'd give them a call but I can't dig up my account info at the moment.
...that TCP Offload Engines (TOE) were so detested by Linux kernel folks. The arguments here make interesting reading and seem convincing to me.
(From Andy Grover's blog.)
NetSNMP uses 32-bit counters for disk sizes. Guess what happens when you've got one of these?
Due to be fixed in the next release, so at least that's something.
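Back-of-the-envelope, on my reading that the size gets reported in 1 KB units as a signed 32-bit integer (an assumption on my part; I haven't gone digging in the MIB):
$ echo $(( 2147483647 / 1024 / 1024 ))
2047
In other words, anything much past 2 TB wraps around.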
Okay, I feel like a bit of a tool for never realizing how cool suspend-to-ram is in a laptop. My new laptop for work is a Dell D630, which I'd got 'cos its hardware is pretty much completely compatible w/Linux. However, I've also figured out that a) Ubuntu does suspend-to-ram quite nicely (aside from a couple times when the keyboard doesn't work, but closing/reopening the lid makes it work), and b) it just sips — sips, I tell you! — from the battery.
Now to try and make it work on my own laptop, which is currently sitting at the shop waiting for me to pick it up.
Today's agenda:
See? I am still a sysadmin.
I can't believe it...my youngest son, after nearly three weeks of being up four or five times each night, slept nearly all the way through without a break: he only woke up at 1am and 5:15am, which is close enough to my usual wakeup time as makes no difference. It was wonderful to have a bit of sleep.
This comes after staying up late (11pm!) on Sunday bottling the latest batch of beer, a Grapefruit Bitter recipe from the local homebrew shop. You know, it really does taste like grapefruit, and even this early I'm really looking forward to this beer.
My laptop has a broken hinge, dammit. I carry it around in my backpack without any padding, so I guess I'm lucky it's lasted this long. Fortunately the monitor still works and mostly stays upright. I've had a look at some directions on how to replace it; it looks fiddly, but spending $20 on a new set of hinges from eBay is a lot more attractive than spending $100. Of course, the other consideration is whether I can get three hours to work on it….But in the meantime, I've got it on the SkyTrain for the first time in a week; it's been hard to want to do anything but sleep lately.
Work is still busy:
I'm trying to get tinyMCE and img_assist to work with Drupal.
Contacting vendors to look at backup hardware. So far we're looking at the Dell ML6010 and the Sun SL500. They're both modular, which is nice; we've got (low) tens of TB now but that'll ramp up quickly. The SL500 seems to have some weird things; according to this post, it takes up to 30 minutes to boot (!) and you can't change its IP address without a visit from the service engineer (!!). Those posts are two years old, so perhaps things have changed.
Trying to figure out what we want for backup software, too. I'm used to Bacula (which works well with the ML6010) and Amanda, but I've been working a little bit with Tivoli lately. One of the advantages of Tivoli is the ease of restoring it gives to the users…very nice. I'm reading Backup and Recovery again, trying to get a sense of what we want, and reviewing Preston's presentation at LISA06 called "Seriously, tape-only backup systems are dead". So what do we put in front of this thing? Not sure yet…
Speaking of Tivoli, it's suddenly stopped working for us: it backed up filesystems on our Thumper just fine (though we had to point it at individual ZFS filesystems, rather than telling it to just go), then stopped; it hangs on files over a certain size (somewhere around 500kb or so) and just sits there, trying to renew the connection over and over again. I've been suspecting firewall problems, but I haven't changed anything and I can't see any logged blocked packets. Weird.
Update: turned out to be an MTU problem:
I had no idea there were GigE NICs that did not support Jumbo frames. Though maybe that's just the OpenBSD driver for it. Hm.
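A quick way to smoke this sort of thing out is don't-fragment pings just under the jumbo MTU (Linux ping flags shown; 8972 is 9000 minus 28 bytes of IP and ICMP headers; the hostname is made up):
$ ping -M do -s 8972 -c 3 thumper.example.com   # jumbo-sized, don't fragment
$ ping -M do -s 1472 -c 3 thumper.example.com   # same test sized for a 1500 MTU
If the first one gets "Message too long" or silence while the second works fine, something in the path isn't passing jumbo frames.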
When I was at LISA, one of the sysadmins I met mentioned a firewall unit testing script that a coworker of his had come up with. The idea was to run your OpenBSD firewall in a QEMU instance, then try passing traffic back and forth to make sure everything worked as expected. I've been looking for that tool to be released, but haven't seen it....or anything else like it either…
Until today, that is, when I stumbled on NetUnit. It's a Java-based tool that tests basic network connectivity, using XML files to specify tests. So far he's got tests for ICMP/port 7 (which I never knew was the echo port), TCP ports, HTTP/HTTPS and MySQL. Not bad at all, except for my lack of Java experience.
Of course, now I want to write my own tester using Perl and QEMU. Like I've got time. But here's an idea for anyone who can use it: test your firewall using three instances of QEMU (inside, outside and firewall), and have the inside and outside hosts communicate using the serial port. "I'm gonna send an echo request, did you see it?" "Yes, did you see the reply?" It's a bit more feedback than simply noting the lack of the expected reply.
And it's not at all like conversations that start out with, "I sent you an email. Did you get it?"
Work...hell, life is busy these days.
At work, our (only) tape drive failed a couple of weeks ago; Bacula asked for a new tape, I put it in, and suddenly the "Drive Error" LED started blinking and the drive would not eject the tape. No combination of power cycling, paperclips or pleading would help. Fortunately, $UNIVERSITY_VENDOR had an external HP Ultrium 960 tape drive + 24 tapes in a local warehouse. Hurray for expedited shipping from Richmond!
Not only that, the Ultrium 3 drive can still read/write our Ultrium 2 media. By this I mean that a) I'd forgotten that the LTO standard calls for R/W for the last generation, not R/O, and b) the few tests I've been able to do with reading random old backups and reading/writing random new backups seem to go just fine.
Question for the peanut gallery: Has anyone had an Ultrium tape written by one drive that couldn't be read by another? I've read about tapes not being readable by drives other than the one that wrote it, but haven't heard any accounts first-hand for modern stuff.
Another question for the peanut gallery: I ended up finding instructions from HP that showed how to take apart a tape drive and manually eject a stuck tape. I did it for the old Ultrium 2. (No, it wasn't an HP drive, but they're all made in Hungary...so how many companies can be making these things, really?) The question is, do I trust this thing or not? My instinct is "not as far as I can throw it", but the instructions didn't mention anything one way or the other.
In other news, $NEW_ASSIGNMENT is looking to build a machine room in the basement of a building across the way, and I'm (natch) involved in that. Unfortunately, I've never been involved in one before. Fortunately, I got training on this when I went to LISA in 2006, and there's also Limoncelli, Hogan and Chalup to help out. (That link sends the author a few pennies, BTW; if you haven't bought it yet, get your boss to buy it for you.)
As part of the movement of servers from one data centre across town to new, temporary space here (in advance of this new machine room), another chunk of $UNIVERSITY has volunteered to help out with backups by sucking data over the ether with Tivoli. Nice, neighbourly thing of them to do!
I met with the two sysadmins today and got a tour of their server room. (Not strictly necessary when arranging for backups, but was I gonna turn down the chance to tour a 1500-node cluster? No, I was not.) And oh, it was nice. Proper cable management...I just about cried. :-) Big racks full of blades, batteries, fibre everywhere, and a big-ass robotic Ultrium 2 tape cabinet. (I was surprised that it was 2, and not U3 or U4, but they pointed out that this had all been bought about four or five years ago…and like I've heard about other government-funded efforts, there's millions for capital and little for maintenance or upgrades.)
They told me about assembling most of it from scratch...partly for the experience, partly because they weren't happy with the way the vendor was doing it ("learning as they went along" was how they described it). I urged them to think about presenting at LISA, and was surprised that they hadn't heard of the conference or considered writing up their efforts.
Similarly, I was arranging for MX service for the new place with the university IT department, and the guy I was speaking to mentioned using Postfix. That surprised me, as I'd been under the impression that they used Sendmail, and I said so. He said that they had, but they switched to Postfix a year ago and were quite happy with it: excellent performance as an MTA (I think he said millions of emails per day, which I think is higher than my entire career total :-) and much better Milter performance than Sendmail. I told him he should make a presentation to the university sysadmin group, and he said he'd never considered it.
Oh, and I've completely passed over the A/C leak in my main job's server room…or the buttload of new servers we're gonna be getting at the new job…or adding the Sieve plugin for Dovecot on a CentOS box...or OpenBSD on a Dell R300 (completely fine; the only thing I've got to figure out is how it'll handle the onboard RAID if a drive fails). I've just been busy busy busy: two work places, still a 90-minute commute by transit, and two kids, one of whom is about to wake up right now.
Not that I'm complaining. Things are going great, and they're only getting better.
Last note: I'm seriously considering moving to Steve Kemp's Chronicle engine. Chris Siebenmann's note about the attraction of file-based systems for techies is quite true, as is his note about it being hard to do well. I haven't done it well, and I don't think I've got the time to make it good. Chronicle looks damn nice, even if it does mean opening up comments via the web again…which might mean actually getting comments every now and then. Anyhow, another project for the pile.
Interesting article from Threat Level about the Defcon NOC. Now there'd be an interesting job...
Tuesday, January 15: Notify users that there will be a brief interruption in our Internet access due to $UNIVERSITY network dep't cutover of our connection from old Bay switches to new Cisco switches. The cutover will be on Friday at 6:30am; the network dep't has said an hour, but it's expected to only be about 20 minutes.
Friday, January 18, 8:30am: Get into work to find that our Internet connection is down. I didn't get notified because the Nagios box can't send email to my cell phone if it can't get access to the Internet. Call network help desk and ask if there were problems; they say no, and everyone else is working just fine. I go to our server room and start trying to figure out what's wrong; can't find a thing. Call help desk back, who say they're going to escalate it.
10am: Get call back from the team that did the cutover. They tell me everything looks fine at their end; as we're the Nth connection to be cut over, it's not like they haven't had practice with it. I debug things with them some more, and we still can't find anything wrong: their settings are correct, mine haven't changed and yet I can't ping our gateway. (The firewall is an OpenBSD box with two interfaces, set up as a transparent bridging firewall.) As the firewall box is an older desktop that had been pressed into service long ago, I decide it'd be worth taking the new, currently spare (YOU NEVER HEARD ME SAY THAT) desktop machine and trying that.
Noon: Realize I have no spare ethernet cards (wha'?). Find two Intel Pro 100s at the second store I go to. Install OpenBSD 4.2 (yay for ordering the CD!), copy over config files, and put it into place. No luck. Still can't ping gateway. While working on the firewall, I notice something weird: I've accidentally set up a bridge with only one interface, while my laptop sits behind pinging the gateway (fruitlessly) ten times a second. (I got desperate.) When I add the second interface, the connection works — but only for 0.3 seconds. The behaviour is repeatable.
3pm: Right after that, the network people show up to see how things are going. I tell them the results (nothing except for 0.3 seconds) and they're mystified. We decide to back out the change from the morning and debug it next week. Things work again instantly. As the new firewall works, I leave it in place.
7.02pm: The connection goes down again. I don't get notified.
Saturday January 19, Noon: I get a call from the boss, who tells me that a meeting at the offices isn't going well because they have no Internet access. Call and verify that, yep, that's the case, and I can't ping there from home. Drive into work.
1.30pm: Arrive and start debugging. Again, nothing wrong that I can see but I can't ping our gateway or see its MAC address. Call help desk who say they have no record of problems. They'll put in a trouble ticket, but would like me to double-check before they escalate it. That's fine — I didn't wait long before calling them — so I do.
2pm: I get a call from the head of the network team that did the cutover; he'd seen the ticket and is calling to see what's going on. He and I debug further for 90 minutes. We try hooking up my laptop to the port the firewall is usually connected to, but that doesn't work; he can see my laptop's MAC address, but I can't see his.
4pm: He calls The Big Kahuna, who calls me and starts debugging further while his osso bucco cooks. We still can't get anywhere. I try putting my laptop on another port in another room, hoping that net access will work from there and maybe I can just string a cable across. It doesn't.
6pm: We call it a night; he and the other guy are going to come in tomorrow to track it down. I call nine bosses and one sysadmin to keep them filled in.
6.30pm: Drive home.
Sunday, January 20, 10.30am: We all show up and start working. We still can't find anything wrong. The boss calls to ask me to set up a meeting with the network department for tomorrow; I tell him I will after we finish fixing the problem.
11.30am: The network team lead gets desperate enough to suggest rebooting the switch stack. It works. We all slap our heads in disgust. Turns out that a broadcast storm on Friday evening triggered a logical failure in the switch we were connected to, resulting in the firewall's port alone being turned off.
Noon: The boss shows up to see how things are going. He talks with the network lead while I'm on the phone with The Big Kahuna; we've decided to try moving to the Cisco switches and make that work while everyone's here.
12.30pm: The Big Kahuna tells me that the problem is the Spanning Tree Protocol packets coming from my firewall box; the Cisco switch doesn't like that and shuts down the port. I go through man pages until I find the blocknonip option for brconfig. 30 seconds later, everything is working. Apparently, I'm the only one they've come across who's running a transparent bridging firewall, so this is the first time they've seen this problem.
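For anyone else running a transparent bridge into Cisco gear, the fix boiled down to one line; the bridge and interface names here are guesses, not my actual config:
$ sudo brconfig bridge0 blocknonip em0 blocknonip em1
blocknonip keeps non-IP frames, spanning-tree BPDUs included, from being bridged out those members, which is what was upsetting the Cisco side.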
1pm: Debrief the boss. Notify other bosses, sysadmins and users that everything is back up again, then do some last-minute maintenance.
2pm: Drive home.
One thing: the usual configuration for other departments (that don't run their own firewall) is to have two Cisco switches running HSRP; they act as redundant gateways/firewalls that fail over automagically. The Big Kahuna mentions in passing that this doesn't work with OpenBSD bridging firewalls. (Our configuration had been simplified to one switch only on Friday as part of debugging the first problem; I mention this in case this is helpful to someone. I don't understand why this might be the case, so I'm going to ask him about this tomorrow.)
One of the problems I've been working on since the upgrade to Solaris 10 has been the slowness of the SunRay terminals. There are a few different problems here, but one of 'em is that after typing in your password and hitting Enter, it takes about a minute to get the JDS "Loading your desktop…" icons up.
I scratched my head over this one for a long time 'til I saw this:
$ ptree 10533
906   /usr/dt/bin/dtlogin -daemon -udpPort 0
  10445 /usr/dt/bin/dtlogin -daemon -udpPort 0
    10533 /bin/ksh /usr/dt/config/Xstartup
      10551 /bin/ksh -p /opt/SUNWut/lib/utdmsession -c 4
        10585 /bin/ksh -p /etc/opt/SUNWut/basedir/lib/utscrevent -c 4 -z utdmsession
          10587 ksh -c echo 'CREATE_SESSION 4 # utdmsession' >/dev/tcp/127.0.0.1/7013
which just sat there and sat there for, oh, about a minute. So I run netcat on port 7013, log out and log in again, and boom! quick as anything.
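The netcat bit was roughly this (OpenBSD-style nc syntax; older netcats want -l -p 7013 instead, and classic nc exits after one connection, hence the loop):
$ while true; do nc -l 7013 < /dev/null > /dev/null; done &
All it does is accept the utscrevent connection and drop it, so the login doesn't sit there waiting for a timeout.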
/etc/services says:
utscreventd 7013/tcp # SUNWut SRCOM event deamon
which we're not running; something to do with smart cards. So why does it hang so long? Because for some reason, the host isn't sending back an RST packet (I presume; can't listen to find out) to kill the connection, like it does on $other_server.
So now I'm trying to figure out why that is. It's not the firewall; they're identical. I've tried looking at ndd /dev/tcp \? but I don't see anything obvious there. My google-fu doesn't appear to be up to the task either. I may have to cheat and go visit a fellow sysadmin to find out.
Last month, my work got a new H.323 video conferencing unit, and today we had our first real test: a lecture given at SFU that was streamed to us. For the most part, it went really well; there were no big screw-ups and everything went as planned. During the second half of the conference, though, the audio was intermittently choppy. I'm not certain, but I think that a local user's Internet radio stream may have caused the problems.
If that's the case — and it would surprise me, since I'd assumed we had a pretty damned fast connection to the Internet — then I'll need to start adding traffic shaping to our firewall. Working on the firewall is something I've been putting off for a while, since it's a bit obscure…lovely pf firewall, littered through with quick rules. But there's a good tool for pf unit testing I've been meaning to try out since I heard about it at LISA. Probably won't be as big a help with the traffic shaping stuff, but at least I'll be reasonably sure I'm not screwing anything else up.
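Not shaping-specific, but the thing that keeps me from locking myself out while poking at that ruleset is staging the change and syntax-checking it first; the filenames are just my habit, nothing official:
$ sudo pfctl -nf /etc/pf.conf.new                 # parse only, load nothing
$ sudo pfctl -f /etc/pf.conf.new                  # load the candidate rules
$ (sleep 300 && sudo pfctl -f /etc/pf.conf) &     # fall back to the known-good rules in five minutes unless I kill this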
And now I'm wondering just how hard it would be to come up with (handwave) something that would combine automatic form generation, web-based testing code and summary code. We have these multiple conferences that need registration pages; while some of the information is the same (name, email address) some is different (one conference has a banquet, another wants to know if you're going to be attending all three days). Putting all this in a database and using something like Formitable to generate the form seems perfect.
Since I'm already using Perl's WWW::Mechanize and Test::More to test the pages, it'd be nice to have it look at the stuff used to generate the form and use that to test the page. (That's not the clearest way I could put that, but if I don't write this down now I'll never write it down.) And if I could add something that'd automatically generate summary pages for conference organizers, it'd be even better; stuff like email and address is always easy, but being aware of special questions would be nice too. (Though maybe not necessary…how hard is it to generate summary pages?)
Trouble is, this is a lot of deep thinking that I've never really had to do before. I suspect this sort of thing is a good programmer's bread and butter, but I've never been a programmer (good or otherwise). The more I think about this, the more I can't decide whether this is really hard, possible but too much effort to be worth it, or already done by something I haven't come across yet.
The little things I can handle, though. This crash looks like it's happening because of a mixup between rand(3) and random(3). In Linux, both have a maximum of RAND_MAX, but in Solaris the latter has a maximum of 2^31. This wreaks havoc with the let's-shuffle-the-playlist routine in XMMS, and we end up with a crash. Once I figure out how to program in C, it shouldn't be too hard to get it fixed. :-)
Problem: You are behind a FreeBSD firewall using natd. You are listening to an Internet radio station with a limited number of streams. It has taken you six tries to get in, but at last you're there. Suddenly it's time for lunch, though, and you want to take your laptop (which you've been using to listen) with you. When you come back, you'll need to try connecting all over again.
Solution: natd is just a userland program. Hack it so that, upon receiving a certain signal (USR1, say, or maybe something sent over a listening Unix or TCP socket), it will remap a certain connection to another incoming point. End effect: instead of the radio stream being directed to your laptop, it'll be redirected to your workstation where you'll have netcat or something similar to grab the stream and keep things going. Switch back once you're back from lunch.