Whitelist, Blacklist, Greylist

I recently got into a spirited discussion about Akismet. What is Akismet?


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2006/10/whitelist-blacklist-greylist.html

Thanks for the linkage.

Since I posted that on your captcha post, I’ve got a live example of “test spam” that made it through Akismet: http://engtech.wordpress.com/2006/09/20/vistaprint-business-cards-isnt-a-scam/#comment-2822

What was kind of cool is that within 24 hrs it was followed by another 80-95 that Akismet all caught.

The website payload was usually articles on design, Wikipedia references, etc. Stuff that wouldn’t be considered spam by most.

I’m of a different opinion. Akismet has proved excellent for me, catching 1843 spam messages so far with zero normal comments marked as spam. Around five spam messages have gotten through, but I caught those in manual moderation.

CAPTCHAs are awful usability-wise. Heck, I have run into CAPTCHAs that I can’t even read. That doesn’t even get into people who are colorblind or, worse, really are blind and rely on screen readers. I would not recommend CAPTCHAs on any blog. I can deal with one spam comment sneaking through once in a while.

SPAM SPAM SPAM SPAM SPAM SPAM SPAM SPAM SPAM

I never bothered with Akismet, because it wasn’t standard when I abandoned MovableType and its crummy spam-fighting tools for WordPress. Instead I ended up with SpamKarma2 (http://unknowngenius.com/blog/wordpress/spam-karma/), which I’ve been using ever since.

It uses multiple rules to develop a spam score. No one metric is necessarily enough to mark a comment as spam (or not spam). It uses a blacklist, but that’s just one of the criteria.

It’s worked amazingly well for me. I may have had one or two false positives in the 800+ spams it’s caught. I’ve had almost no false negatives, even though spam volumes seem to have jumped up dramatically in the last few weeks.
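
For anyone curious what "multiple rules feeding one spam score" can look like, here is a minimal Python sketch. It is not SpamKarma2's actual code or rule set; the rule names, weights, word list, and threshold are all made up for illustration. Each rule is capped below the threshold, so no single rule can mark a comment as spam on its own.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Comment:
    author: str
    body: str
    url: str = ""


# A rule looks at a comment and returns spam points (0.0 means "no opinion").
Rule = Callable[[Comment], float]

# Hypothetical word list; a real one would be far longer and maintained over time.
BLACKLIST = ("casino", "pills")


def link_count_rule(comment: Comment) -> float:
    """One point per link beyond the second, capped at 3 points."""
    links = len(re.findall(r"https?://", comment.body))
    return min(3.0, 1.0 * max(0, links - 2))


def blacklist_rule(comment: Comment) -> float:
    """Three points if any blacklisted term appears in the body or URL."""
    text = f"{comment.body} {comment.url}".lower()
    return 3.0 if any(term in text for term in BLACKLIST) else 0.0


def short_body_rule(comment: Comment) -> float:
    """One point for very short comments (fewer than three words)."""
    return 1.0 if len(comment.body.split()) < 3 else 0.0


RULES: Sequence[Rule] = (link_count_rule, blacklist_rule, short_body_rule)


def spam_score(comment: Comment, rules: Sequence[Rule] = RULES) -> float:
    """Sum the points from every rule."""
    return sum(rule(comment) for rule in rules)


def is_spam(comment: Comment, threshold: float = 4.0) -> bool:
    """Every rule is capped at 3 points, so at least two rules must fire."""
    return spam_score(comment) >= threshold
```

With these made-up weights, a comment that only trips the blacklist scores 3 and gets through, while one that trips the blacklist and is stuffed with links crosses the threshold.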

I agree with combining local measures with Akismet. For trackbacks though, combining CAPTCHA won’t work, for now at least.

http://haacked.com/archive/2006/10/31/CAPTCHA_For_Trackbacks.aspx

What was kind of cool is that within 24 hrs it was followed by another 80-95 that Akismet all caught. The website payload was usually articles on design, Wikipedia references, etc. Stuff that wouldn’t be considered spam by most.

Still, there’s no way this stuff would make it through CAPTCHA. And using CAPTCHA (for comments, obviously) would reduce the load on Akismet substantially. It’s one less HTTP round trip for data you know with 99% certainty is already bad.

Don’t like CAPTCHA? Take your pick. ASCII art, simple math problems, naive captcha, or Phil’s Javascript “Invisible CAPTCHA” which has a clever fallback to HTML.

Chris G, read the last post on this blog before you make such an overarching proclamation.

Does Akismet use the sbl-xbl? Currently Spamhaus is the most respected RBL, and most comment spam comes from the same sources, so it’s natural to integrate with them. You don’t even need Akismet for that; you can easily modify your comment submission page to do it (if such a mod isn’t available now).
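
As a rough illustration of what such a check on the comment submission page might involve, here is a Python sketch of a DNS blacklist lookup. DNSBLs are queried by reversing the octets of the client's IPv4 address and appending the list's zone; a listed address resolves to something in 127.0.0.0/8, and an unlisted one fails to resolve. The function names are hypothetical, the zone name is taken from the list mentioned in this thread, and this version treats any 127.x answer as a listing, which is a simplification: real lists define specific return codes, so check the list's own documentation.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str = "sbl-xbl.spamhaus.org") -> str:
    """Build the DNS name to look up: reversed IPv4 octets + the list's zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str = "sbl-xbl.spamhaus.org",
              resolver=socket.gethostbyname) -> bool:
    """Return True if the blacklist zone has an entry for this address.

    `resolver` is injectable so the logic can be exercised without network access.
    """
    try:
        answer = resolver(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # Name did not resolve: the address is not on the list.
        return False
    return answer.startswith("127.")
```

A submission handler could call `is_listed(client_ip)` before doing any more expensive filtering and reject or hold the comment when it returns True.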

Not sure I understand why info should be on the graylist; I have my personal website as an info site.

Have you really seen lots of spam with info URLs? I sure haven’t; most of them are com.

In fact, having an info domain protected me against spam for a long time; it seemed their email harvesting programs didn’t understand there were a bunch of new top domains. In the last year or so, however, that has changed.

I think it’s difficult to compare fighting email and blog spam. When AOL uses a DNS blacklist to block incoming mail, you can bet that there’ll be damage done. But when Joe Blogger uses the same blacklist to block possible web spammers, what’s the worst thing that can happen?

I’m using a DNS blacklist (sbl-xbl by spamhaus.org) quite successfully on my wiki. Never had a false positive and blocked lots and lots of spam (see http://wiki.chongqed.org/CaughtSpam). It’s no silver bullet, but it saves a lot of CPU time.

Did I mention that I hate CAPTCHA? Must admit that yours is the best I’ve seen so far.

Why would spammers all collect to a particular top domain? That would be pretty stupid, wouldn’t it?

Spammers are not exactly known for their genius. I don’t like blocking *.inf0 or blogsp0t.com either, but it’s sadly necessary due to the volume of spam coming from there.

I agree that CAPTCHAs are terrible for usability.

Akismet has been great for me so far. Sometimes it does make Type I errors, but I have a back-end set up to fix that. I’m not going back to CAPTCHAs as long as Akismet is around.

Have you really seen lots of spam with info URLs

http://chris.pirillo.com/2006/08/17/info-domains-are-dead/

Did I mention that I hate CAPTCHA?

There are a lot of things I hate, such as the security line at the airport, locks on my doors, and waiting in line at the department of motor vehicles. But they’re all necessary because the alternatives are even worse.

It’s no silver bullet, but it saves a lot of CPU time.

Which is crazy, because we have essentially infinite CPU time, and more being created every day. What we don’t have is infinite bandwidth, or infinite mental bandwidth.

That’s why CAPTCHA is such a good idea: it optimizes for people using a resource that is already plentiful and getting more plentiful every day.

Yeah, much agreement with Jeff and Chris Pirillo on .i-fo being dead. A lot of spam comes from there.

I’ve noticed Akismet will mark . info comments as spam even if they’re valid – the .i-fo domain is that bad.

(I can’t even type . info on this site: Your comment could not be submitted due to questionable content: .i-fo matching (.i-fo))

A good alternative (at least for now) to the CAPTCHA is to ask the user’s browser to solve a math problem in JavaScript. No user interaction is involved, so there is no usability concern. WordPress has a plugin called ‘HashCash’ that does this. It keeps the amount of spam you have to moderate down to a minimum of human-entered spam:

url: http://elliottback.com/wp/archives/2005/10/23/wordpress-hashcash-30-beta/

I use this along with Akismet and it’s like a Teflon wall.
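
To give a feel for the "browser solves a math problem" idea, here is a generic hashcash-style proof-of-work sketch. It is written in Python rather than JavaScript, and it is not the WordPress plugin's actual algorithm, which isn't described here. The assumptions: the server hands out a challenge string, the client searches for a nonce whose SHA-256 hash has a given number of leading zero bits, and the server checks the answer with a single hash. All function names and the challenge format are made up for illustration.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force nonces until one satisfies the difficulty."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

The asymmetry is the point: the client needs about 2^difficulty hashes on average, which is invisible for one human comment but adds up for a bot posting thousands, while the server verifies with one hash. A bot that runs the script can still pay the cost, which is presumably why the alternative is only good "for now".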

That’s ridiculous to target a particular top domain like that. Why would spammers all collect to a particular top domain? That would be pretty stupid, wouldn’t it?

I don’t see the rules being any different for registering under another top domain so there is no reason to discriminate against info.

He goes as far as saying that the people registering one are basically idiots. Well, I registered mine the same day it was released, so there weren’t any spammers under info at that point.

This is just ridiculous, and it really makes me mad. So now we are blocking top domains? Great, why not start blocking complete countries as well?

So you’re blocking links to blogsp0t, but anything that points to google.com goes through automatically?

There’s a potential loophole there.
http://www.google.com/url?sa=Dq=http%3A//yes-it-really-is-this-easy.blogspot.com/

So you’re blocking links to blogsp0t, but anything that points to google.com goes through automatically?

I have actually seen a number of trackbacks exactly like that, but the blacklist takes precedence over the whitelist, so anything containing blogsp0t will be discarded.

I’ve relaxed the blogsp0t rule recently because I closed a lot of my old posts to new trackbacks.
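
A minimal Python sketch of that precedence rule, for illustration only. The only behavior taken from this thread is that a blacklist match wins over a whitelist match; the function name, the substring matching, the outcome labels, and how greylist matches are treated (held for moderation here) are all assumptions.

```python
from typing import Iterable


def classify(text: str,
             blacklist: Iterable[str],
             whitelist: Iterable[str],
             greylist: Iterable[str]) -> str:
    """Check lists in order of precedence and return the first verdict."""
    lowered = text.lower()
    # Blacklist first: a match here wins even if a whitelisted term also appears.
    if any(term in lowered for term in blacklist):
        return "discard"
    if any(term in lowered for term in whitelist):
        return "accept"
    # Assumed behavior: greylisted terms are held for a human to review.
    if any(term in lowered for term in greylist):
        return "moderate"
    return "unknown"
```

Under this ordering, a redirect through a whitelisted host does not help: `classify("http://www.google.com/url?q=http%3A//example.blogsp0t.com/", ["blogsp0t"], ["google.com"], [])` returns `"discard"` because the blacklisted substring is still present in the URL.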