CAPTCHA Effectiveness

There is a program used to download files from various file-sharing sites. It includes an OCR engine that works with rapidshare.de, among others. And it usually does better than I do. Could it be that I am a robot??? :slight_smile:
Rapidshare uses low-contrast captchas, which surely gives OCR an advantage over the human eye. It also uses overlapping semitransparent symbols with colored line/dot/curve/character noise. The OCR handles this. And it seems to be open source.

Here’s a Chinese site that offers Captcha-defeating software, for a price.

http://www.lafdc.com/captcha/

(warning VERY slow to load)

This site considers the CAPTCHAs used by Yahoo, Google, and Hotmail “Very Difficult” and cannot actually solve them. I used it, along with the results on this site…

http://sam.zoy.org/pwntcha/

… as my template for what a good CAPTCHA should be:

  1. low contrast only hurts humans, and doesn’t affect OCR results at all. So high contrast.
  2. per-character perturbation
  3. characters that overlap
  4. Some noise or lines that touch the characters

I haven’t seen any software that can deal with #3 and #4, which is what Yahoo, Google, and Hotmail use. Note the disclaimer at the second link:

–
PWNtcha does not work [on any CAPTCHA]. It is not an intelligent program that tries to decode a random CAPTCHA. Such a program would be nearly impossible to do. PWNtcha is simply a toolkit of image manipulation functions, and a list of known CAPTCHAs with the associated list of image operations to apply in order to decode each of them. If I have never seen your CAPTCHA, then PWNtcha does not know about it, and there is absolutely no way it could decode it.

On my blog, I tried an “accessible captcha” that asks very simple questions instead of showing words. For example, “in 656486473, what number comes before 3?” or “what’s the result of twenty two plus nineteen?” (it also sometimes uses a visual captcha). It’s available for dotclear (but it’s in French) at http://www.atelierphp5.com/un-captcha-accessible.html

Anyway, the idea behind it is quite simple (and free) to make.
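
For anyone curious what a question of the first kind might look like in code, here is a minimal sketch in Python. It is not the dotclear plugin's code; the function names `digit_before_question` and `check_answer` are made up for illustration, and the sketch assumes the digits in the number are all distinct so that the answer is unambiguous.

```python
import random


def digit_before_question(rng):
    """Build a 'what digit comes before X?' challenge and its answer."""
    # Distinct digits, so the target appears exactly once in the number.
    digits = rng.sample("0123456789", 9)
    # Skip position 0 so that a preceding digit always exists.
    pos = rng.randrange(1, len(digits))
    number = "".join(digits)
    question = f"In {number}, what digit comes before {digits[pos]}?"
    return question, digits[pos - 1]


def check_answer(expected, submitted):
    """Compare the visitor's answer, ignoring surrounding whitespace."""
    return submitted.strip() == expected


question, answer = digit_before_question(random.Random())
print(question)
```

The expected answer would be kept on the server (for instance in the session) and never sent to the page along with the question.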

To the people still saying captcha is broken: like Jeff says, prove it instead of saying it. Of course someone spending lots of time on it can “break” most captchas, but that’s not what we are talking about. We are talking about building a good enough OCR engine into a spambot that it can effectively spam sites that use captchas.

If it takes them five minutes per blog to find the text, post the spam, check if it’s posted, and read the next image or test the next match, it’s not worth it; that would effectively stop their spamming.

One thing that people seem to miss is that there is no obvious way to automatically determine what image is the captcha image, you would need to test every image on the page.

Let’s say you create five captchas on the server and then emit JavaScript that creates and arranges a set of divs so that the correct combination appears. Tell me, how could any OCR software catch this?

I suppose you could make something that takes a screenshot of the page and tries to find the text but then again, where is the captcha located among all the rest of the text ?

Captchas are far from broken and, as Jeff pointed out, even simple captchas are effective.

At least in this world there are real solutions; it hasn’t gotten as bad as email, where we have to have blocklists and all kinds of stupid checks to block spam.

A user wrote…

“That pwntcha site is a goatse in disguise, please don’t post it.”

and it is in fact not. There is an unfortunate use of a distorted goatse image in one of the captcha examples, but the site itself is valid and worthy of being posted.

I’d think the world would be inured to goatse by now. ahem Anyway, captchas are good, for sure, and I’ve been poking a site I moderate off and on to implement them for anonymous comments for some time now.

To everyone who says do stuff with javascript: well, that’s the fast track to leaving disabled users, and the occasional person with javascript off or unavailable, out in the cold. You can also make an unbreakable image captcha, and the spammer will move to Dragon NaturallySpeaking if you provide an audio version. (I hear that used to work pretty well, until Hotmail started distorting the heck out of them.)

In general, what’s going to save you is the sheer heterogeneity of captchas. There are a number of libraries available, which can each be tweaked in various ways. If spammers break through, just swap the library out for another. Lazy admins get spammed because they’re too lazy to do maintenance like that, not just because their captcha is weak. (In which case their laziness will generally manifest in many other ways that endanger their server.)

Come on Jeff, explain why you’re happy to discriminate against people with visual disabilities, contrary to legislation.

Captchas are an effective tool in the war on spammers, but unless they are implemented like MSN’s, with alternative mechanisms, they discriminate against and exclude people, exactly what the internet was not supposed to do.

I don’t believe weighing spam against a significant percentage of internet users is fair.

Wow, I called this article months ago :slight_smile:

Trackbacks/pingbacks are left on behalf of an
author who just made a blog post. Making the
author pass a CAPTCHA before leaving the
pingback/trackback would be totally reasonable…
of course the technology doesn’t exist to enforce
this, so it’s really a moot point.

Well that would require a modification of the Trackback API and the Pingback API. The same issue applies to the Comment API. None of these APIs even consider the idea of SPAM.

I’ve tried to get in contact with the authors of the Comment API to no avail. I think it’s time we updated these APIs.

Who’s with me?

Accessibility is an issue. As mentioned before, some implementations provide an audio version so visually impaired people can get access. As the Wikipedia article (linked in the original post) mentions, people who are both visually and hearing impaired are left out.

Also mentioned in the Wikipedia article is the possibility of using a challenge that requires thinking, such as solving a simple math equation or answering a trivia question. While this inevitably isn’t totally safe from compromise, it seems at least as good as using the images, and it’s nice that it could include virtually everyone.

Another thing to do to block bots and improve accessibility is to have users re-enter the same text they put in an earlier part of the form (chosen randomly, preferably a required element, e.g. the name field on this blog). And to keep bots even more confused, keep the captcha image but hide it (turn off its display, move it off the page, size it to 0, etc.) so bots pick up on it and enter wrong info into the captcha box. This method maintains accessibility for all users (since they don’t need to read an image or listen to a sound clip).
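
A rough sketch of the server-side check for this idea, in Python. The field names (`name`, `echo`, `captcha_decoy`) and the function `looks_like_bot` are hypothetical, and the form is assumed to arrive as a plain dict of strings.

```python
def looks_like_bot(form, echo_source="name"):
    """Return True when a submission fails the echo or the decoy check.

    `form` maps submitted field names to values. All field names used
    here are made up for illustration.
    """
    # The decoy box is hidden from people, so any content came from a script.
    if form.get("captcha_decoy", "").strip():
        return True
    # The visitor was asked to retype the value of an earlier field.
    original = form.get(echo_source, "").strip()
    echoed = form.get("echo", "").strip()
    return not original or original.lower() != echoed.lower()
```

A bot that copies one value into every box would still pass the echo check but trip the decoy, which is why the two checks are used together here.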

My only problem with captchas is when I can’t read them. I don’t have any real vision problems, but there are fonts that don’t seem to make it into my brain (some cursive fonts that are used in logos get mis-parsed). When I get a captcha that is too bent or twisted, I often have to do it twice.

More than that and I just leave, and give up on the site.

At my company, we deal with OCR on a daily basis. Based on that experience, your findings are not surprising at all to me, Jeff. Even though we use very advanced OCR engines here, the data that comes out isn’t the best.

I have no problem with CAPTCHAs or filling them out. The ones I do have a problem with would be example #5 you give: Extreme perturbation. That one is kind of hard for me to read because the 7 is getting overwritten by the A (though I still can read it obviously).

But why is my “Enter the word” the same word, every day? If I have to enter a word, at least give me a new one every so often. Why not put up an image and have a person answer what it is. “Circle” “Triangle” “Bill Gates in jail” (oh I just chuckle every time I see that mug shot)

Ticketmaster is the worst CAPTCHA I have experienced. Sometimes it takes me two or three tries to figure out the words they are spewing forth. Eventually it gives me an easy one, but then again, wouldn’t the software eventually get in if the sites eventually give it an easy one?

Phil Haack (is that his real name?) has blogged about ‘invisible CAPTCHAs’ that use embedded javascript to solve the CAPTCHAs, so the user never actually sees the CAPTCHA (unless the browser doesn’t support javascript).

It’s an interesting idea based on the fact that spambots don’t interpret javascript:

http://haacked.com/archive/2006/09/26/Lightweight_Invisible_CAPTCHA_Validator_Control.aspx
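
Here is a very rough Python sketch of the general idea, not the actual control from the link above. The server hands the page two numbers plus a signed token, a script in the page adds them and fills a hidden field, and the server checks the sum. Everything here (`SECRET`, `issue_challenge`, `render_script`, `verify`, the `sum` element id) is made up for illustration, and there is no expiry or replay protection.

```python
import hashlib
import hmac
import secrets

# Per-process signing key; a real site would persist its own secret.
SECRET = secrets.token_bytes(32)


def _sign(a, b):
    """Sign the two operands so the client cannot swap in its own."""
    return hmac.new(SECRET, f"{a}+{b}".encode("utf-8"), hashlib.sha256).hexdigest()


def issue_challenge():
    """Pick two numbers and return them with their signature."""
    a = secrets.randbelow(900) + 100
    b = secrets.randbelow(900) + 100
    return a, b, _sign(a, b)


def render_script(a, b):
    """Script that fills a hidden field with id 'sum' (hypothetical id)."""
    return f'<script>document.getElementById("sum").value = {a} + {b};</script>'


def verify(a, b, token, submitted_sum):
    """Accept only an untampered challenge with the correct sum."""
    expected = _sign(a, b).encode("utf-8")
    if not hmac.compare_digest(token.encode("utf-8"), expected):
        return False
    return submitted_sum.strip() == str(a + b)
```

When the hidden field comes back empty, the site could fall back to showing a visible challenge, which covers browsers without javascript.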

CAPTCHA, 99.9% effective in blocking the visually impaired.

What about just making it hard for spammers to actually submit the comments? Like use different names and order for HTML form elements on every blog?

The key enabler for spam is software monoculture.

In the case of email, we have no choice, you need a standard protocol for mail. But for a human activity, like commenting on a blog, all that’s required is that the user understand how to operate the interface-- what it looks like.

I don’t use Movable Type anymore, but when I did, I used a nonstandard installation and renamed some of the directories. By looking at the server logs, I saw that this stopped at least some spambots. The next step would have been to hack the MT code to change the names of the form elements, and maybe even add some “honeypot” form elements (invisible fake post buttons and comment boxes maybe?).
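
To make that next step concrete, here is a small Python sketch (not Movable Type code) of per-installation form element names plus a honeypot. The field names are derived from a site secret with HMAC, so every installation ends up with different ones. The names `field_name`, `read_comment`, and the honeypot field `email` are all hypothetical.

```python
import hashlib
import hmac


def field_name(site_secret, logical_name):
    """Derive a stable, installation-specific name for a form field."""
    digest = hmac.new(site_secret, logical_name.encode("utf-8"), hashlib.sha256)
    # Prefix with a letter so the result also works as an HTML id.
    return "f" + digest.hexdigest()[:12]


def read_comment(site_secret, form):
    """Map a submitted form back to logical names; None means rejected."""
    # Honeypot: a conventionally named field that is hidden from people.
    if form.get("email", "").strip():
        return None
    comment = form.get(field_name(site_secret, "comment"))
    author = form.get(field_name(site_secret, "author"))
    if not comment or not author:
        return None
    return {"author": author, "comment": comment}
```

A bot that posts to the stock field names (`comment`, `author`) gets rejected because those names do not exist on this installation, and one that fills in every visible-looking field trips the honeypot.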

Doug, I think everyone would be happy to not “discriminate”, as you call it, if it were easy. Unfortunately, it is not very easy for a private person to make his site fully available to people with disabilities.

These laws that exist in some countries, requiring all sites to be available to people with disabilities, are absurd. Who’s going to provide that technology and pick up the costs? Where does it stop? There is always something more that could be done for a particular group with disabilities, isn’t there?

Granted, all government sites should be fully accessible, but other than that these laws just prove how out of touch politicians are and how easily they can be persuaded by interest groups.

mikeb, the MS Ajax Toolkit has something similar named the NoBot component. However, how hard would it really be to create a spambot built on top of IE or Gecko that would easily pass the javascript tests? It wouldn’t be hard at all.