Mail server up, ego down

date: 28 September 2002

So the new mail server is up and running. It looks, though, like I missed some fairly important things, and at least one critically important thing.

I was working on it from home last night around 8pm, and alla sudden it wasn't responding to ssh or pings. It came back up, and sure enough it had panicked and crashed. Fortunately the sysadmin (Hi Dave!) was there and was able to look at it and figure out what was wrong. My first thought was there were problems w/vinum and the Promise controller again, but no: it had run out of file descriptors. Given that it's a moderate-to-fairly busy mail server that gets lots of spam, this was a pretty big fuckup.

Second, I'd set up vinum to make a bunch of separate raid5...um, partitions out of separate disk slices on each of four hard disks. Turns out you can divide one big vinum raid5 partition into separate slice-like entities (I'm still learning all this; forgive the imprecision of what I'm writing).

Third, I'd set MAXUSERS too low: 32, which seemed reasonable given that hardly anyone would be logging on to it. Of course, this setting controls lots of other resources, so I hadn't really thought that through. The SA set it to 0, which means FreeBSD will adjust it on the fly.
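To put rough numbers on why 32 hurts, here is a small sketch of how the kernel scales two of its limits from MAXUSERS. The formulas (NPROC = 20 + 16 * MAXUSERS, MAXFILES = 2 * NPROC) are the ones I recall from FreeBSD 4.x's param.c, so treat them as an assumption and check your own source tree. The function name `derived_limits` is made up for illustration.

```python
# Sketch only: how a FreeBSD 4.x-era kernel is assumed to derive limits
# from MAXUSERS. Formulas recalled from sys/conf/param.c, not verified
# against any particular release.

def derived_limits(maxusers: int) -> dict:
    """Return the process and open-file limits implied by a MAXUSERS value."""
    nproc = 20 + 16 * maxusers   # assumed NPROC formula
    maxfiles = 2 * nproc         # assumed MAXFILES formula
    return {"maxproc": nproc, "maxfiles": maxfiles}

# Compare the value I picked against a couple of larger ones.
for mu in (32, 128, 256):
    print(mu, derived_limits(mu))
```

If those formulas hold, MAXUSERS=32 works out to about a thousand open files for the whole system, which is not much for a box juggling lots of SMTP connections, and would go some way toward explaining the descriptor shortage.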

Fourth, he's got FreeBSD-stable (I think it's stable) built weekly on a box there, and I should've installed & mounted everything via NFS rather than installing from CDROM and putting everything on the disks.

All in all, I'm feeling a bunch humbler this morning. I did almost all of this on my own and thought I was doing pretty damned well, but I still have an awful lot to learn -- plus, my first idea of what happened was completely wrong (not that I wouldn't have been able to figure it out eventually, probably, but that's not good). I'm starting to think I should set up some of the boxen I have at home here as a test lab -- I haven't really done much since setting up one as a honeypot -- and fuck around w/stuff like this: set up one as a mail server, set up the other one to hammer it w/a million messages, that sort of thing, and see where the bottleneck is. Plus NFS, plus all sorts of stuff. Plus buying Michael Lucas' excellent book and reading it cover to cover. Plus actually trying to understand Sendmail. Plus plus plus plus plus.
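The "hammer it w/a million messages" idea could start out as something like the sketch below, using Python's standard smtplib. The host name, addresses, and the function names `build_message` and `hammer` are all hypothetical placeholders, and a single connection sending one message at a time is only the crudest kind of load; a real test would want many parallel senders.

```python
# Sketch of a crude SMTP load generator for a lab mail server.
# Host and addresses are placeholders; point it only at a box you own.
import smtplib
import time
from email.message import EmailMessage

def build_message(i, sender="loadtest@example.test", rcpt="sink@example.test"):
    """Build a small numbered test message (roughly 1 KB of body)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = rcpt
    msg["Subject"] = f"load test {i}"
    msg.set_content(("x" * 72 + "\n") * 14)
    return msg

def hammer(host, count, port=25):
    """Send `count` messages over one connection; return messages per second."""
    start = time.monotonic()
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        for i in range(count):
            smtp.send_message(build_message(i))
    elapsed = time.monotonic() - start
    return count / elapsed if elapsed > 0 else float("inf")

# Example call against a hypothetical lab host:
#   rate = hammer("mail.lab.example.test", 1000)
```

Watching open file counts and load on the receiving box while this runs should show which limit gives out first.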

Sigh. On the other hand, the new mail server appears to be doing really well since Dave rebuilt the kernel last night, and the load on the back-end server (which customers use to send/receive mail) is dramatically lower -- only a half dozen big load spikes overnight, as opposed to one every half hour or less.
