CAPTCHA Effectiveness

If you've used the internet at all in the last few years, I'm sure you've seen your share of CAPTCHAs:


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2006/10/captcha-effectiveness.html

I work with some of the more clever OCR technologies as part of my day job, and can tell you now - if the spammers are using anything like the technology we throw at scanning business documents, virtually all of your examples could be read.

Technical measures like captchas are useless for blocking spam. Sure, they can block spam on a particular website, but that’s only because spammers go against the easiest targets.

Just like they adapted to every other anti-spam feature, including Bayesian networks, which were claimed to be so great that "there is no way they can get around that" (http://paulgraham.com/spam.html).

If almost everyone was using captchas, spammers would use programs that break simpler ones. If everyone used complex captchas, spammers would find some other tricks. Spam is not technology-bound at all. It’s not even the slightest bit lower now than it was 10 years ago, in spite of all technical changes that were supposed to limit it.

And while the spammers can adapt, users cannot, and will suffer the consequences of the anti-spam technology. Do you really want captchas everywhere if they won’t limit spam even a little bit, only move it from one place to another?

Jeff, another problem with CAPTCHAs is users with visual disabilities. MSN/Windows Live uses a system where the user can listen to the words and type them in to continue. What do you think about this?

I don’t know where this idea that they can “easily” beat it comes from. Sure, theoretically they can, but as long as there isn’t a good free OCR library out there I don’t think we have much to worry about.

Captchas are very effective against most spam, as Jeff pointed out with his own example.

Besides, you could use something like SVG to create an image on the fly (unfortunately still not supported by IE7), or even use JavaScript and create a word using overlapping divs. How would they ever be able to catch that?

I was recently a target of a spam-based DoS attack. My hosting provider recently disabled my blog due to excessive CPU and database use, all of which was due to an avalanche of comment spam. Fortunately I was able to get back on the air using an IP based block, with the promise of implementing CAPTCHA at a later date. So your post is very timely.

Your claim about “naive CAPTCHA” being effective makes me wonder: what about text-based CAPTCHAs? E.g. “please enter the name of a curved yellow fruit”. This would seem to be just as difficult to parse automatically but would not require users to view an image.

(BTW CAPTCHAs don’t just affect users with visual disabilities. They affect all users who for whatever reason are unable to see images. Example: text-based browser users.)
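A text-based question CAPTCHA like the fruit example could be sketched roughly as follows. This is only an illustration: the question pool, the accepted answers, and the function names are made up here, and a real site would want a much larger and less guessable pool.

```python
import random

# Hypothetical question pool: question text -> set of accepted answers.
QUESTIONS = {
    "Please enter the name of a curved yellow fruit": {"banana", "bananas"},
    "What colour is a clear daytime sky?": {"blue"},
}

def pick_question(rng=random):
    """Choose a question to show in the comment form."""
    return rng.choice(list(QUESTIONS))

def check_answer(question, answer):
    """Accept the answer if it matches after trimming and lowercasing."""
    accepted = QUESTIONS.get(question, set())
    return answer.strip().lower() in accepted
```

Since no image is involved, this also works in text-based browsers and with screen readers.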

Some issues with your captcha post.

A general purpose OCR engine is not the best way to defeat a captcha. Much simpler approaches with greater accuracy are possible.

A captcha image is highly constrained. It typically uses a small image with a single font and a fixed number of letters. The letters are distributed roughly evenly and sequentially across the box.

  • The spammer may actually rely on simple statistics like density analysis to determine the likely character. The spammer only needs to know which letter the value of a particular statistic
    correlates best with.

  • A relatively simple neural network for character recognition is also not so hard to build. Brains-N-Brawn wrote a captcha defeater in a day: http://www.brains-n-brawn.com/default.aspx?vDir=aicaptcha
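To make the density idea concrete, here is a minimal sketch, assuming the constrained kind of captcha described above: a fixed number of characters, evenly spaced, no overlap, no noise. The toy 3x3 glyphs and all function names are invented for illustration and are not taken from any real captcha breaker.

```python
# Images are lists of strings: '#' is ink, '.' is background.
# Toy training glyphs (hypothetical), chosen so each has a distinct density.
GLYPHS = {
    "I": [".#.", ".#.", ".#."],
    "L": ["#..", "#..", "###"],
    "O": ["###", "#.#", "###"],
}

def split_cells(image, n_chars):
    """Cut the image into n_chars equal-width vertical slices."""
    cell_width = len(image[0]) // n_chars
    return [
        [row[i * cell_width:(i + 1) * cell_width] for row in image]
        for i in range(n_chars)
    ]

def density(cell):
    """Fraction of pixels in the cell that are ink."""
    total = sum(len(row) for row in cell)
    ink = sum(row.count("#") for row in cell)
    return ink / total

def train(glyphs):
    """Record the ink density of each known letter."""
    return {letter: density(cell) for letter, cell in glyphs.items()}

def classify(cell, model):
    """Pick the letter whose recorded density is closest to this cell's."""
    d = density(cell)
    return min(model, key=lambda letter: abs(model[letter] - d))

def read_captcha(image, n_chars, model):
    """Split the image into cells and classify each one."""
    return "".join(classify(c, model) for c in split_cells(image, n_chars))
```

A single statistic like this only separates letters whose densities differ, which is why it falls apart once characters overlap or lines are drawn through them.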

A naive captcha may be best for your site, since your site does not offer a spammer the best value (no offense): breaking it takes work and wins only one site. It’s not Yahoo or Google. A spammer could put more effort into targeting many other defenseless sites which in aggregate may have more traffic than you.

Wes, I think that’s a big point of Jeff’s post. For his target audience (other bloggers, etc.), CAPTCHA is still a very effective means for blocking spam.

For example, my Invisible CAPTCHA is trivially easy to beat. A comment spam bot would simply need to execute javascript.

In practice though, it is tremendously effective!
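For readers who haven't seen the technique, here is a rough sketch of how a JavaScript-dependent check of this kind can work on the server side. It is not the actual Invisible CAPTCHA code: the arithmetic challenge, the field names, and `SECRET_KEY` are all assumptions made for the example. The page would carry a small script that adds `a` and `b` and writes the result into a hidden `answer` field, so a bot that never runs JavaScript submits no answer.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"change-me"  # hypothetical server-side secret

def _sign(a, b):
    """Sign the two numbers so the client cannot swap in its own challenge."""
    msg = f"{a}:{b}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

def issue_challenge():
    """Values to embed in the form; client-side script must fill in a + b."""
    a = secrets.randbelow(1000)
    b = secrets.randbelow(1000)
    return {"a": a, "b": b, "token": _sign(a, b)}

def verify(form):
    """Accept only if the token is genuine and the hidden answer equals a + b."""
    try:
        a, b, answer = int(form["a"]), int(form["b"]), int(form["answer"])
    except (KeyError, ValueError, TypeError):
        return False  # missing or non-numeric fields: script did not run
    token = str(form.get("token", ""))
    if not hmac.compare_digest(_sign(a, b).encode(), token.encode()):
        return False
    return answer == a + b
```

Note that this sketch has no expiry or replay protection, and any bot that executes JavaScript passes it.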

The REAL weakness of CAPTCHA is that it does nothing against trackbacks and pingbacks. This makes total sense since CAPTCHA is about filtering out non-humans, but trackbacks are by definition left by other computers.

That’s where other solutions are necessary. I use Akismet for my comment filtering. Invisible CAPTCHA pretty much blocks all comment spam, and Akismet catches all the trackback spam.

I have a phpbb2 message board that I run and I was getting boatloads of spam. Adding in the default captcha did not reduce the spam in the slightest. That is, it was completely, 100% broken, contrary to your assertion that it isn’t being used in the wild.

I ended up adding a single extra required field and that blocks out 99.99% of the spam. I’ve gotten maybe 2 or 3 spams since implementing it. It’s not even a question. And it’s not even a picture. It’s just a field in the form that says “please type 1234 here”.
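The server-side check for a field like that is tiny. A minimal sketch, written in Python for illustration rather than the PHP that phpBB uses, with a hypothetical field name `human_check`:

```python
REQUIRED_VALUE = "1234"  # the text the form asks visitors to type

def accept_post(form):
    """Reject the submission unless the extra field holds the expected text."""
    return form.get("human_check", "").strip() == REQUIRED_VALUE
```

A stock spam script that only knows the default form fields never sends `human_check`, so its posts are dropped before they reach the database.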

I think the key isn’t captcha, per se, but just being different. Security through obscurity in a sense. There’s no benefit for some spammer to fix his script to handle my dorky custom web forum. But there’s a huge benefit to cracking the default phpbb2 captcha algorithms because most users are going to just use the defaults.

-David

Brains-N-Brawn wrote a captcha defeater in a day.

For a very weak captcha-- none of the characters overlapped, and there was no background noise or lines overwriting the characters.

It’s not Yahoo or Google. A spammer could put more effort on targeting many other defenseless sites which in aggregate may have more traffic than you.

And yet spammers are unable to defeat the more advanced captchas on Yahoo, Google, and Hotmail. Otherwise, why would Yahoo, Google and Hotmail continue to use them? To torture their poor users? Did you read the quote from the Google employee in the Wall Street Journal article?


“Researchers are really good, and the attackers really are not,” says Mr. Jeske of Google, based in Mountain View, Calif. “Having [CAPTCHA] in place we find extremely effective against automated malicious attackers.”

Ahem. extremely effective.

The proof is in the pudding, and the pudding contains… CAPTCHA. Good ones, using the rules I outlined.

  • per-character perturbation
  • characters that overlap
  • noise or lines that overlay and touch the characters

If you look closely at the Chinese CAPTCHA-defeating page linked in the first comment, you’ll see that all the “unbroken” captchas have these three things in common.

and can tell you now - if the spammers are using anything like the technology we throw at scanning business documents, virtually all of your examples could be read.

Then prove it. Use the sample images provided and post the results from your OCR engine.

This makes total sense since CAPTCHA is about filtering out non-humans, but trackbacks are by definition left by other computers.

Well, kinda, but not really. Trackbacks/pingbacks are left on behalf of an author who just made a blog post. Making the author pass a CAPTCHA before leaving the pingback/trackback would be totally reasonable… of course the technology doesn’t exist to enforce this, so it’s really a moot point.

Jeff, if your orange captcha is so effective, why do you need to filter out bee-el-oh-gee-ess-pee=oh-tee website addresses?

if the spammers are using anything like the technology we throw at scanning business documents, virtually all of your examples could be read.
– Jonathan

Then prove it. Use the sample images provided and post the results from your OCR engine.
– Jeff

But even if the OCR technology developed by Jonathan’s company can break difficult CAPTCHAs, it is probably proprietary research NOT available to or easily developed by the average hacker. Which re-emphasizes the “researchers are really good, and the attackers really are not” quote.

At wordpress.com we have Akismet instead of Captcha. Instead of putting up any barrier, everything goes through and Akismet learns what is spam. Once something is marked as spam for one person, it is marked as spam for everyone. Instead of preventing spammers, it’s 1000s of eyes cleaning up the results.

It’s been pretty effective, but there have been a few interesting cases:

  • compliment spam (“great post!” with website field linking to their p-rn/adsense splog site)
  • only attacking blogs that appear to still have the default post as the first post – less likely to monitor spam.
  • one p-rn spammer who finds political/pop culture keywords in a post and inserts a human-crafted message. Like: “Some people say Matt Damon isn’t that good of an actor, I really liked him in Talented Mr. Ripley” whenever it finds a post with “Matt Damon”

The one thing it has absolutely sucked at is spammers-to-be: people who are just testing out spam generation algorithms that have no payload. So you’ll get random gibberish from an IP address and it will take a few days for Akismet to learn.

What I don’t get is why we haven’t moved past text for CAPTCHAs. I expect that in very short order the massive amount of information Google is storing with their Image Labeler (http://images.google.com/imagelabeler/) will be used to determine if a human is at the keyboard. It’s going to take some near-impossible computing for attackers to determine that what they’re looking at is a “bird”, whereas a human can do it in seconds.

why do you need to filter out bee-el-oh-gee-ess-pee=oh-tee website addresses

Because of trackbacks. Blogsp0t is spam central.

Instead of putting up any barrier, everything goes through and Akismet learns what is spam.

This is a bad idea.

I am all for Akismet testing a comment after validating it with CAPTCHA, but letting all the comments go directly through to Akismet without any local CAPTCHA verification is asking for trouble. Every comment example you’ve given would have been stopped cold in its tracks by CAPTCHA.

Plus, you could reduce the comment load on the Akismet servers by a thousand percent with a simple CAPTCHA in the host comments. Security starts at home.

Of course for trackbacks, which are machine entered, it’s a different issue. I get 75 spam trackbacks per hour on this blog. Multiply that by the number of blogs, and… well… like I was saying, security starts at home. :wink:

Here’s an interesting alternative: the ASCII art CAPTCHA. :wink:

http://www.thephppro.com/products/captcha/

How about CAPTCHA with random images instead of characters?

I heard about KittenAuth a few months ago and it seems like a great idea. It shows a grid of random images of fluffy animals, and you click on 3 kittens to get through to the web site:

http://arstechnica.com/news.ars/post/20060407-6554.html
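The pick-the-kittens idea can be sketched in a few lines. This is not KittenAuth's actual code: the 9-image grid, the file names, and the function names are assumptions for illustration only.

```python
import random

def build_grid(kittens, others, size=9, n_kittens=3, rng=random):
    """Return a shuffled grid of image names and the set of kitten positions."""
    chosen = rng.sample(kittens, n_kittens) + rng.sample(others, size - n_kittens)
    rng.shuffle(chosen)
    kitten_set = set(kittens)
    answer = {i for i, img in enumerate(chosen) if img in kitten_set}
    return chosen, answer

def check_clicks(clicked, answer):
    """Pass only if the clicked positions are exactly the kitten positions."""
    return set(clicked) == answer
```

With 9 images and 3 kittens there are 84 possible selections of three, so a bot guessing blindly gets through about 1 time in 84. That is a much weaker barrier than a typed word unless the grid is larger or failed attempts are rate-limited.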