Recently I had to track messages to make sure they were being delivered, and this turned out to be much harder than I thought. Some of this is due to the way we have our mail set up, and some of it is just due to confusion (mine) about how Postfix logs things. So in the spirit of Chris Siebenmann, here's my explanation for my future self of how this works.
First, an explanation of the flow of messages at my workplace, whether sent or received from a local network or a remote network:
Here's a log with an example message. It was sent from pepperoni@pepperoni.pizza to otheruser@example.net, which we forward off to ilikesandwiches@club.sandwich. (Rememver the days when you could be assured that "club.sandwich" was a totally bogus TLD?)
Apr 9 13:25:13 liddy postfix/smtpd[12203]: 4DC3914C086: client=mail.example.com[192.168.0.2]
Apr 9 13:25:13 liddy postfix/cleanup[12206]: 4DC3914C086: message-id=<deadbeef@example.com>
Apr 9 13:25:13 liddy postfix/qmgr[995]: 4DC3914C086: from=<pepperoni@pepperoni.pizza>, size=2019, nrcpt=1 (queue active)
Apr 9 13:25:13 liddy spamd[5110]: spamd: processing message <deadbeef@example.com> for nobody:99
Apr 9 13:25:13 liddy spamd[5110]: spamd: result: . 0 - scantime=0.2,size=1981,user=nobody,uid=99,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=42925,mid=<deadbeef@example.com>,autolearn=failed
Apr 9 13:25:13 liddy postfix/pickup[11886]: 85E1F14C089: uid=99 from=<pepperoni@pepperoni.pizza>
Apr 9 13:25:13 liddy postfix/cleanup[12206]: 85E1F14C089: message-id=<deadbeef@example.com>
Apr 9 13:25:13 liddy postfix/pipe[12207]: 4DC3914C086: to=<otheruser@club.sandwich>, orig_to=<ilikesandwiches@example.org>, relay=spamassassin, delay=0.25, delays=0.05/0/0/0.2, dsn=2.0.0, status=sent (delivered via spamassassin service)
Apr 9 13:25:13 liddy postfix/qmgr[995]: 4DC3914C086: removed
Apr 9 13:25:13 liddy postfix/qmgr[995]: 85E1F14C089: from=<pepperoni@pepperoni.pizza>, size=2323, nrcpt=1 (queue active)
Apr 9 13:25:43 liddy postfix/smtp[12212]: 85E1F14C089: to=<otheruser@club.sandwich>, relay=mail-relay.isp.tld[10.0.0.1]:25, delay=30, delays=0.01/0.01/30/0.35, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 92585180F0)
Apr 9 13:25:43 liddy postfix/qmgr[995]: 85E1F14C089: removed
My confusion stemmed from thinking of the Postfix queue ID as the message ID. THIS IS WRONG. The message ID does not change. The queue ID changes if the message goes through the queue again. And, the way we have things set up, it goes through the queue twice: once when received, then again when SpamAssassin is done with it.
(Incidentally, I had wondered why SpamAssassin did not include the Postfix ID. This makes perfect sense when you realize it's a queue ID, and SA is not part of the Postfix queue.)
The other thing that makes this difficult is that the entries for a particular message are:
Thus, the process for tracking down what's happened to a particular mail goes like this:
Look for the bit you have (who the message is from, let's say):
$ grep pepperoni@pepperoni.pizza /var/log/maillog
Apr 9 13:25:13 liddy postfix/qmgr[995]: 4DC3914C086: from=<pepperoni@pepperoni.pizza>, size=2019, nrcpt=1 (queue active)
Apr 9 13:25:13 liddy postfix/pickup[11886]: 85E1F14C089: uid=99 from=<pepperoni@pepperoni.pizza>
Apr 9 13:25:13 liddy postfix/qmgr[995]: 85E1F14C089: from=<pepperoni@pepperoni.pizza>, size=2323, nrcpt=1 (queue active)
Note the two queue IDs and search for those:
$ grep -E '85E1F14C089|4DC3914C086' /var/log/maillog.1
Apr 9 13:25:13 liddy postfix/smtpd[12203]: 4DC3914C086: client=mail.example.com[192.168.0.2]
Apr 9 13:25:13 liddy postfix/cleanup[12206]: 4DC3914C086: message-id=<deadbeef@example.com>
Apr 9 13:25:13 liddy postfix/qmgr[995]: 4DC3914C086: from=<pepperoni@pepperoni.pizza>, size=2019, nrcpt=1 (queue active)
Apr 9 13:25:13 liddy postfix/pickup[11886]: 85E1F14C089: uid=99 from=<pepperoni@pepperoni.pizza>
Apr 9 13:25:13 liddy postfix/cleanup[12206]: 85E1F14C089: message-id=<deadbeef@example.com>
Apr 9 13:25:13 liddy postfix/pipe[12207]: 4DC3914C086: to=<otheruser@club.sandwich>, orig_to=<ilikesandwiches@example.org>, relay=spamassassin, delay=0.25, delays=0.05/0/0/0.2, dsn=2.0.0, status=sent (delivered via spamassassin service)
Apr 9 13:25:13 liddy postfix/qmgr[995]: 4DC3914C086: removed
Apr 9 13:25:13 liddy postfix/qmgr[995]: 85E1F14C089: from=<pepperoni@pepperoni.pizza>, size=2323, nrcpt=1 (queue active)
Apr 9 13:25:43 liddy postfix/smtp[12212]: 85E1F14C089: to=<otheruser@club.sandwich>, relay=mail-relay.isp.tld[10.0.0.1]:25, delay=30, delays=0.01/0.01/30/0.35, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 92585180F0)
Apr 9 13:25:43 liddy postfix/qmgr[995]: 85E1F14C089: removed
Note the message ID and add it to the mix to get what SpamAssassin thought of it:
$ grep -E '85E1F14C089|4DC3914C086|deadbeef@example.com' /var/log/maillog.1
Apr 9 13:25:13 liddy postfix/smtpd[12203]: 4DC3914C086: client=mail.example.com[209.85.160.41]
Apr 9 13:25:13 liddy postfix/cleanup[12206]: 4DC3914C086: message-id=<deadbeef@example.com>
Apr 9 13:25:13 liddy postfix/qmgr[995]: 4DC3914C086: from=<pepperoni@pepperoni.pizza>, size=2019, nrcpt=1 (queue active)
Apr 9 13:25:13 liddy spamd[5110]: spamd: processing message <deadbeef@example.com> for nobody:99
Apr 9 13:25:13 liddy spamd[5110]: spamd: result: . 0 - scantime=0.2,size=1981,user=nobody,uid=99,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=42925,mid=<deadbeef@example.com>,autolearn=failed
Apr 9 13:25:13 liddy postfix/pickup[11886]: 85E1F14C089: uid=99 from=<pepperoni@pepperoni.pizza>
Apr 9 13:25:13 liddy postfix/cleanup[12206]: 85E1F14C089: message-id=<deadbeef@example.com>
Apr 9 13:25:13 liddy postfix/pipe[12207]: 4DC3914C086: to=<otheruser@club.sandwich>, orig_to=<ilikesandwiches@example.org>, relay=spamassassin, delay=0.25, delays=0.05/0/0/0.2, dsn=2.0.0, status=sent (delivered via spamassassin service)
Apr 9 13:25:13 liddy postfix/qmgr[995]: 4DC3914C086: removed
Apr 9 13:25:13 liddy postfix/qmgr[995]: 85E1F14C089: from=<pepperoni@pepperoni.pizza>, size=2323, nrcpt=1 (queue active)
Apr 9 13:25:43 liddy postfix/smtp[12212]: 85E1F14C089: to=<otheruser@club.sandwich>, relay=mail-relay.isp.tld[137.82.224.74]:25, delay=30, delays=0.01/0.01/30/0.35, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 92585180F0)
Apr 9 13:25:43 liddy postfix/qmgr[995]: 85E1F14C089: removed
And now you know...the rest of the story.
Did you know that Postfix's cleanup daemon consolidates messages going to the same envelope recipient? I did not.
Just came across maillog, which looks very cool. From TFM:
Maillog is a powerful tool for selecting and formatting entries from a
sendmail or postfix log. When a message is selected, it collects all
the mailer entries related to that message's queue id and formats them
in a more readable fashion. By default, the log fields that are
printed are: date, from, to, ctladdr, stat, and notes.
This is much better than my cobbled-together multiple-grep scripts. Rather surprised to not find it in Debian...
A while back I set up greylisting on Postfix for my home
server. It works well, but I have the same concerns now that I did
then. The script (smtpd-policy.pl
from the examples section of
Postfix' source) feels like a bit of a crock; yes, it's just the
example script, but I don't like the Berkeley DB files, and comments
in the code like "DO NOT create the greylist database in a file system
that can run out of space" make me nervous. It hasn't been a problem
-- in, oh, six months of running the file is only up to about 5.5
MB. But still: there's no provision for removing old entries, which
means an awful soul-searching battle with the database if you ever
need to trim it.
I had a brief look at the script tonight, hoping to find a way to maybe hack in MySQL support, but decided to check with Saint Google first. Sure enough, there's gps, the Greylist Policy Service for Postfix. Uses C++ for speed and MySQL/PostgreSQL for the backend, which is nice. I should be able to hack up a migration script for the old entries (just as soon as I hack up a migration script for all the old journal entries...), and all should be good.
One thing I'm noticing with greylisting, though, is just how many attempts are being made from multiple IP addresses within a short time; one attempt, today, had attempts from four different IP addresses within five minutes, all from the same made-up email address. The original Perl script has the advantage that I can change it easily -- I know Perl, and I'd be pretty much starting from scratch with C++ -- and maybe add the ability to track this sort of thing. It'd be nice to be able to tarpit attempts to do this, say on the third attempt.
Tarpitting...another problem with Linux. The TARPIT module for
netfilter has yet to be updated to work with the 2.6 kernel, and I
really don't want to switch back to 2.4 just for this. LaBrea is
nice, and I'm running a lashed-together natd
configuration on my
FreeBSD firewall box in conjunction with LaBrea running on my desktop
on a second interface. It works, but it doesn't work in the case of a
Linux webserver running on its own, outside the main firewall. I'm
even less a kernel hacker than I am a C++ programmer, and figuring out
the compiling problems and changed skbuff
route structures (say) is
beyond me. It's things like this that make me want to move to
OpenBSD. Yeah, rebuilding a server and learning a new firewall
language is a pain in the ass, but at least it's one I can handle.