WordPress Upgrades Part Two: Comment Spammers

As I mentioned, it's been a busy weekend for Gecko and me. With anything good and joyous on the Internet come spammers. Comment spam has been a minor irritant for a while -- nothing I couldn't handle by logging into MySQL directly and running DELETE statements with extreme prejudice -- but in the last few weeks it's gone off the hook. With dozens of spam comments a day, it was time to start dealing with them automatically.

WordPress is pretty good this way -- you can set up your comments so that everything needs to be approved by the admin, or just stuff that matches certain words in the comment or URL fields. That worked for a while -- "poker", "debt" and "cialis" took care of most things. But it isn't a very sophisticated filter, so I started looking around for something else.
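To show why a plain word list runs out of steam, here's a rough Python sketch of that kind of filter. This is not WordPress's actual code; the function name and the substring matching are my own assumptions, and the word list is just the three examples above.

```python
# Hypothetical moderation words, taken from the examples in this post.
MODERATION_WORDS = ["poker", "debt", "cialis"]


def needs_moderation(comment_text, author_url, words=MODERATION_WORDS):
    """Return True if any moderation word appears in the comment body or URL.

    Assumption: plain case-insensitive substring matching, nothing smarter.
    """
    haystack = f"{comment_text} {author_url}".lower()
    return any(word.lower() in haystack for word in words)


# An obvious spam comment gets held for approval.
print(needs_moderation("Play POKER online!", "http://example.com"))
# A trivially obfuscated one slips through.
print(needs_moderation("Play p0ker online!", "http://example.com"))
# And an innocent comment gets caught because "indebted" contains "debt".
print(needs_moderation("I'm indebted to you for this post", ""))
```

The last two calls are the problem in a nutshell: the filter misses anything misspelled on purpose and flags real comments that happen to contain a listed word.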

I found Fahim Farook's WPBlacklist plugin, and it works pretty damned well. It imports a copy of Jay Allen's blacklist, then holds for approval anything that matches the HOLY CRAP two thousand three hundred forty-five lines of regexes (a few) and domains (the bulk of the list). Plus, you can tell it to delete a comment and harvest information from it -- so it knows to watch out for that domain or email address in the future. All in all, I was pretty happy.
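For the curious, the general idea of a blacklist made of domains plus a few regexes, with a "harvest" step, looks roughly like the Python sketch below. I haven't read the plugin's source, so the class, its method names and the matching rules are all assumptions of mine, not how WPBlacklist actually does it.

```python
import re
from urllib.parse import urlparse


class Blacklist:
    """Hypothetical blacklist: known domains, known emails, and regexes."""

    def __init__(self, domains=(), patterns=()):
        self.domains = {d.lower() for d in domains}
        self.emails = set()
        self.patterns = [re.compile(p, re.IGNORECASE) for p in patterns]

    def matches(self, text, url="", email=""):
        """Return True if the comment should be held for approval."""
        host = urlparse(url).hostname or ""
        # Match the listed domain itself or any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in self.domains):
            return True
        if email.lower() in self.emails:
            return True
        return any(p.search(text) for p in self.patterns)

    def harvest(self, url="", email=""):
        """Remember the domain and email of a comment deleted as spam."""
        host = urlparse(url).hostname
        if host:
            self.domains.add(host.lower())
        if email:
            self.emails.add(email.lower())


# Made-up entries for illustration; the real list is far longer.
bl = Blacklist(domains=["spam.example"], patterns=[r"cheap\s+pills"])
print(bl.matches("hi there", url="http://www.spam.example/buy"))
bl.harvest(url="http://new-spam.example/page", email="bot@new-spam.example")
print(bl.matches("hello again", url="http://new-spam.example/other"))
```

The harvest step is what keeps the list growing: every spam comment you delete by hand makes the next one from the same source easier to catch.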

But then Gecko pointed out this elegant solution. My first name is not so obvious ("Saint? What kinda first name is that? Damn kids..."), so I put in my own simple question.

It's a brilliant idea, really: come up with a question whose answer is obvious if you a) are at the site and b) are not a spammer's computer. Which makes me wonder what'll happen when/if AI gets a bit more common, or if spammers will start funding natural language parsing research...shudder.
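The check itself can be tiny. Here's a minimal Python sketch, assuming the comment form has one extra field for the answer; the question, the expected answer and the function names are all made up for illustration, not what this site uses.

```python
# Hypothetical question and answer, for illustration only.
QUESTION = "What is the first word of this blog's title?"
EXPECTED_ANSWER = "example"


def normalize(answer):
    """Lowercase, trim, and collapse whitespace so small typos in spacing pass."""
    return " ".join(answer.strip().lower().split())


def passes_challenge(submitted, expected=EXPECTED_ANSWER):
    """Return True if the submitted answer matches the expected one."""
    return normalize(submitted) == normalize(expected)


# A human who reads the page gets through, sloppy capitalization and all.
print(passes_challenge("  Example "))
# A bot that leaves the field blank or stuffs it with a URL does not.
print(passes_challenge(""))
print(passes_challenge("http://spam.example"))
```

Normalizing before comparing is a deliberate choice: the point is to stop scripts, not to punish real people for a stray space or a capital letter.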

In other comment spammer news, there's a really good article here about what one guy managed to find out about a comment spammer. Finally, turns out that what I was going to say was said a year ago:

...but just like everything else, the weblogging community seems intent on (a) thinking they're special and unique and nobody has ever had their problems before, and proceeding to (b) ignore all the work that has come before and reinventing the wheel. Now, certainly some adaptation of code and algorithms will be necessary. Existing tools probably can't be used as-is. Email spam fighting relies a lot on the structure of an email, the chain of headers that give away so much information to the trained eye, and none of that information is available in weblog spam. But I see from Jay's Comment Spam Clearinghouse that the latest and greatest tool available to us is a master list of domain names and a few regular expressions. No offense to Jay or all the people who have contributed to the list so far, but how quaint! I mean really. Savor this moment, folks. You can tell your children stories of how, back in the early days of weblogging, you could print out the entire spam blacklist on a single sheet of paper. Maybe with two or three columns and a smallish font, but still. Boy, those were the days.

Holy crap. I thought I was cynical. The entire article is highly recommended.