So yesterday I got an email from another sysadmin: "Hey, looks like there's a lot of blocked connections to your server X. Anything happening there?" Me: "Well, I moved email from X to Y on Tuesday...but I changed the MX to point at Y. What's happening there?"
Turns out I'd missed a fucking domain: its MX was still pointing to the old server instead of the new one. And when I turned off the mail server on the old machine, delivery to this domain stopped. Fortunately I was able to get things going again: I changed the MX to point at the new server, and turned the old server back on to handle things until the new record propagated.
So how in hell did this happen? I can see two things I did wrong:
Poor planning: my plans and checklists included all the steps I needed to do, but did not list the actual domains being moved. I relied on memory, which meant I remembered (and tested) two and forgot the third. I should have written down every domain, each with a note to check its MX settings and a test of email delivery.
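A written-down domain list also makes the check scriptable. Here's a minimal sketch of what that could look like, assuming `dig` is installed; the domain names, the `EXPECTED_MX` table and the function names are all made up for illustration and are not what I actually ran.

```python
import subprocess

# Hypothetical inventory: every domain being moved and the MX host it should end up with.
EXPECTED_MX = {
    "example.com": "mail-new.example.net",
    "example.org": "mail-new.example.net",
    "example.net": "mail-new.example.net",
}


def parse_dig_mx(output):
    """Parse `dig +short MX` lines such as '10 mail.example.com.' into lowercase hostnames."""
    hosts = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            hosts.append(parts[1].rstrip(".").lower())
    return hosts


def dig_mx(domain):
    """Return the MX hostnames currently published for a domain, using dig."""
    result = subprocess.run(
        ["dig", "+short", "MX", domain],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return parse_dig_mx(result.stdout)


def find_stale_mx(expected, lookup=dig_mx):
    """Return {domain: actual_hosts} for every domain whose MX set is not exactly the expected host."""
    stale = {}
    for domain, want in expected.items():
        got = lookup(domain)
        if set(got) != {want.lower()}:
            stale[domain] = got
    return stale
```

Calling `find_stale_mx(EXPECTED_MX)` before switching off the old server should return an empty dict; anything else is a domain that got missed. The `lookup` parameter is only there so the comparison can be exercised without touching DNS.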
No email delivery check by Nagios: Nagios checks that the email server is up, displays a banner and so on, but does not check actual email delivery for the domains I'm responsible for. There's a plugin for that, of course, and I'm going to be adding that.
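To be clear about what such a check involves, here's a rough sketch in the shape of a Nagios plugin (exit code plus one status line). It is not the existing plugin mentioned above, and it only confirms that each domain's mail server accepts a probe message over SMTP, not that the message lands in a mailbox. The probe addresses, host names and function names are hypothetical.

```python
import smtplib
from email.message import EmailMessage

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def smtp_send(mx_host, sender, recipient):
    """Submit one probe message to mx_host on port 25; raises on refusal or timeout."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "delivery probe"
    msg.set_content("Automated delivery check.")
    with smtplib.SMTP(mx_host, 25, timeout=30) as smtp:
        smtp.send_message(msg)


def check_delivery(probes, sender, send=smtp_send):
    """probes maps a recipient address to the MX host that should accept it.

    Returns (exit_code, status_line) in the form Nagios expects from a plugin.
    """
    failures = []
    for recipient, mx_host in probes.items():
        try:
            send(mx_host, sender, recipient)
        except (smtplib.SMTPException, OSError) as exc:
            failures.append(f"{recipient} via {mx_host}: {exc}")
    if failures:
        return CRITICAL, "MAIL CRITICAL - " + "; ".join(failures)
    return OK, f"MAIL OK - {len(probes)} probe(s) accepted"
```

A wrapper script would print the status line and call `sys.exit()` with the code. With one probe address per domain, the forgotten third domain would have gone CRITICAL as soon as the old server was switched off.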
I try to make a point of writing down things that go bad at $WORK, along with things that go well. This is one of those things.
Okay, so it isn't quite as bad as the time I threw 3,000 incoming
messages for an ISP into my home directory. But I've just figured out
that the reason a) $VENDOR didn't get back to me and b) it's been so
quiet for the last few days is because all email was going to a file
called X-Original-Sender because of one missing *. (In fact, that
may also have been the cause of the first big error...)
God, I hate procmail sometimes.
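For the record, the trap works like this: in a procmail recipe, condition lines start with `*`, and the first line after them is the action. Drop the `*` and the would-be condition becomes the action, which procmail treats as a mailbox to deliver to. Here's a rough lint sketch that flags action lines that look like header conditions. It's a heuristic built on assumptions (it ignores backslash continuations, and the function name is made up), not a real procmailrc parser.

```python
import re

# A recipe starts with ':0' (optionally followed by flags or a lockfile).
RECIPE_START = re.compile(r"^:0")
# Something shaped like '^Header-Name:' or 'Header-Name:' is probably meant as a condition.
HEADER_LIKE = re.compile(r"^\^?[A-Za-z][A-Za-z0-9-]*:")


def suspicious_actions(procmailrc_text):
    """Yield (line_number, line) for action lines that look like conditions missing their '*'."""
    in_recipe = False
    for number, raw in enumerate(procmailrc_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if RECIPE_START.match(line):
            in_recipe = True
            continue
        if not in_recipe:
            continue  # variable assignments and other lines outside recipes
        if line.startswith("*"):
            continue  # a proper condition line
        # The first non-condition line after ':0' is the action line.
        in_recipe = False
        if HEADER_LIKE.match(line):
            yield number, line
```

Running it over `~/.procmailrc` and getting any output at all is a good reason to stare at that recipe before trusting it with real mail.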