Designing For Evil

Craigslist also uses rDNS as an antispam measure (when sending emails).

Spammers concentrate their efforts on major, active sites. So unless Stack Overflow becomes one, I don’t believe it will be a target.

I don’t see spam on this site and it’s pretty active.

A Flash captcha sounds good. To make it really tough for spammers, create an animation where the image of the text is split into two parts that scroll sideways in opposite directions; when they meet at some point, they form the text. I don’t see how any software could figure that out.

One (maybe) possible solution: after a comment gets submitted, show it immediately, but then run a background process that sends the comment to a Gmail account and checks whether the mail ended up in the Inbox or the spam folder. Although I’m not sure Google would be happy with such (ab)use of their service :slight_smile:
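For what it’s worth, here’s a minimal sketch of that idea in Python, assuming a throwaway Gmail account set up for the purpose; the address, password, and wait time are all placeholders, and this is exactly the kind of use Google’s terms may frown on:

```python
# Sketch: mail the submitted comment to a canary Gmail account, then
# check over IMAP whether Gmail filed it in the Inbox or in Spam.
# The account, password, and delay below are placeholders.
import imaplib
import smtplib
import time
from email.mime.text import MIMEText

CANARY = "spam-canary@gmail.com"   # hypothetical throwaway account
PASSWORD = "app-password-here"     # placeholder credential

def classify_comment(comment_text, token):
    """token: a unique string used as the subject, to find the mail again."""
    msg = MIMEText(comment_text)
    msg["Subject"] = token
    msg["From"] = CANARY
    msg["To"] = CANARY
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(CANARY, PASSWORD)
        smtp.send_message(msg)

    time.sleep(60)  # give Gmail time to classify the message

    imap = imaplib.IMAP4_SSL("imap.gmail.com")
    imap.login(CANARY, PASSWORD)
    try:
        for folder, verdict in (("[Gmail]/Spam", "spam"), ("INBOX", "ham")):
            imap.select(folder, readonly=True)
            _, data = imap.search(None, f'(SUBJECT "{token}")')
            if data[0]:
                return verdict
        return "unknown"  # not delivered yet; retry later
    finally:
        imap.logout()
```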

To address a couple of issues: it has been pointed out that in the scenarios mentioned above there is a limited number of cat pictures (3,000 for Microsoft’s version, I believe). One way to deal with this is to continually expand and change the pool, using a method similar to reCAPTCHA’s. Taking the cat images as an example, we can surely alter the images to change the MD5 sum and filename to try to fool machines, but we can also add a couple of random images from Flickr and/or other image services to each test. Keep track of which of these added images have been tagged as “cats,” and once they have been selected [by humans] enough times, they become part of the cat pool. These extra images would, of course, not count for or against the determination of a human. Don’t forget, we must also expire images after they have been around too long; a success rate of even 1% for a machine can add up.
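A minimal sketch of that promote-and-expire bookkeeping, with made-up thresholds (the vote count and lifetime are arbitrary numbers, not anything Asirra actually uses):

```python
# Sketch: candidate images pulled from an outside service are promoted
# into the real "cat pool" once enough humans have tagged them, and
# every pool image expires after a fixed lifetime.
import time

PROMOTE_AFTER = 25            # human "cat" votes needed to join the pool
MAX_AGE_SECONDS = 30 * 86400  # expire pool images after ~30 days

class ImagePool:
    def __init__(self):
        self.pool = {}        # image_id -> time added (counts in the test)
        self.candidates = {}  # image_id -> number of human "cat" votes

    def add_candidate(self, image_id):
        self.candidates.setdefault(image_id, 0)

    def record_human_vote(self, image_id):
        # Votes on candidates never count for or against the human test;
        # they only decide whether the image graduates into the pool.
        if image_id in self.candidates:
            self.candidates[image_id] += 1
            if self.candidates[image_id] >= PROMOTE_AFTER:
                del self.candidates[image_id]
                self.pool[image_id] = time.time()

    def expire_old_images(self):
        now = time.time()
        for image_id, added in list(self.pool.items()):
            if now - added > MAX_AGE_SECONDS:
                del self.pool[image_id]
```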

I like the idea of a database of questions that are specific to the website in question (i.e., programming for a coding site, hiking for an outdoors site, etc.). The questions could even be a little difficult and require some research. But again, you would have to keep adding new questions [not from easily available public sources] and expiring old ones. Not an easy task in this case.

It is a good thing that for most websites just a little deterrent is enough to keep spammers at bay.

@Mike Cohen:
What about mailing lists? That is a legal and useful use of what some would term spam software. However, now we get into the tricky area of the DMCA. Are you for the DMCA?

An idea that just occurred to me: make the terms of use for your site stipulate that any use of bots or scrapers is tantamount to reverse engineering… which, of course, is disallowed under the DMCA. Then level a DMCA notice at the company being advertised.

It has a few flaws, I know, but maybe some feedback?

I hope Craigslist survives… I found my current apartment (I live in NYC) there so easily and the thought of paying a realtor again makes me sick. Even the newspaper is a horribly stressful method here.

Personally, I feel it’s a theoretically impossible task: the difference between bots and humans is the concept of humanity itself. We keep seeing that, in order to overcome bots, we make the validation techniques more and more human; but it keeps failing…

This is because it’s like the Allies holding a secret meeting at a Nazi headquarters - in German. IMO the only way you can avoid it is by developing a medium of information that computers can’t intrude on: speak in a language the Nazis can’t know, or get out of there. Not something they will have trouble with, but something they CAN’T do.
Another option is to eliminate anonymity - which is a scary, 1984-ish idea…

MySpace has been fighting this battle for years. They’re mostly sorta on top of it now, to the vast detriment of the flexibility of the markup and JavaScript they allow you to use to pimp your profile.

Since adding their new developer platform (with OpenSocial), there’s been an upsurge in spam from badly coded apps. They’re plugging those holes pretty quickly, but some of the solutions involve limiting access by apps to the system.

Sigh

I agree: you need someone on your side who understands the mind, tools and tricks of the enemy. That is, you need your own private police force.

Jeff,

You will need a (trained) Bayesian filter to win spam battles. Spam is your friend: it trains the filter, which in turn becomes more effective against spam. The best way to fight spam is to use it against itself.

Captchas are annoying and useless, as there are scripts out there that can work around them. Nothing can work around Bayesian filters.
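For the curious, here’s a bare-bones filter along those lines, in the spirit of Paul Graham’s “A Plan for Spam.” A production filter would add token normalization, better smoothing, and persistence; this just shows the train/score loop:

```python
# Sketch: naive Bayes spam filter. Every classified message feeds back
# into the filter, which is how spam ends up training its own adversary.
import math
from collections import Counter

class BayesFilter:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # label is "spam" or "ham"
        for word in text.lower().split():
            self.counts[label][word] += 1
        self.totals[label] += 1

    def spam_probability(self, text):
        words = text.lower().split()
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.counts[label].values())
            prior = (self.totals[label] + 1) / (sum(self.totals.values()) + 2)
            score = math.log(prior)
            for word in words:
                # Laplace (add-one) smoothing so unseen words don't zero out.
                score += math.log(
                    (self.counts[label][word] + 1)
                    / (total_words + len(vocab) + 1)
                )
            scores[label] = score
        # Convert log scores back to a probability of spam.
        m = max(scores.values())
        exp = {k: math.exp(v - m) for k, v in scores.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])
```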

I’d be curious to see some statistics on how many posts each section gets in a day. This problem seems ideally suited to traditional machine-learning techniques, but maybe the size of the data sets makes it infeasible for Craigslist. Assuming the data set were small enough, or Craigslist had sufficient resources to allocate, they could use something like an unsupervised learner that clusters posts based on a series of attributes, then use community input to label the clusters (something that is already happening with the ‘Mark as Spam’ links on each post).
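As a rough illustration of what that could look like, here’s a sketch using scikit-learn (my choice, not anything Craigslist is known to use); the cluster count and flag threshold are arbitrary:

```python
# Sketch: vectorize recent posts, cluster them, and let community
# "Mark as Spam" flags label whole clusters at once, catching
# near-duplicate spam posts that nobody flagged individually.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def label_clusters(posts, flagged, n_clusters=50, spam_ratio=0.2):
    """posts: list of post texts; flagged: set of indices users flagged."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)

    spam_clusters = set()
    for cluster in range(n_clusters):
        members = [i for i, c in enumerate(labels) if c == cluster]
        flags = sum(1 for i in members if i in flagged)
        # If enough members of a cluster were flagged, treat the whole
        # cluster as spam.
        if members and flags / len(members) >= spam_ratio:
            spam_clusters.add(cluster)

    # One boolean per post: is it in a spam-labeled cluster?
    return [labels[i] in spam_clusters for i in range(len(posts))]
```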

CAPTCHAs are incredibly annoying for the good guys and don’t actually stop the baddies, so please give up on them. Use the amusing alternative where you get three random tiny photos and have to click on the kitten.

I assume the most popular protection methods are going to be targeted by the spammers first, so using your own off-the-wall solution might actually work best of all!

I have 2 cents to chime in about this topic. And it’s a very philosophical 2 cents really, so bear with me or just skip over.

I notice that spam is almost developing into a hive mind that creeps into the regions which “deserve” it the most. Yeah, sure, there’s email spam, but one Bayesian filter later I get practically no spam in my mail. Greylisting is also very powerful.

But what I notice on sites like eBay and Craigslist is that whenever we as a society get lazy and try to do things “the easy way” - think of all the boardroom-meeting parodies we’ve all seen by now, where a 22-year-old “genius” says “We’ll bring the pet store to their fingertips and make a profit” - whenever we do that, spam shows up.

I mean, personals are a notoriously lazy way to date, and really, even without spam, I could never trust a posting on something as anonymous as the internet. Call me untrusting or kooky, but it’s simply absurd for me to try to reconcile something as intimate as dating with something as anonymous as the internet. Go clubbing/jogging/walking your dog if you want to meet random people.

All in all, the point I’m trying to make is that spam creeps into places where we stretch the reach of our daily experience further than it’s meant to go. The reason Craigslist or any other online site has difficulty separating spam from ham might be that there’s almost no difference between the two: indeed, how could you possibly tell whether a personal ad is genuine or not?
In that sense, I think that as long as stackoverflow.com keeps a clear difference between genuine content and content posted for profit, you will have no problem getting rid of spam. Patterns will be easy to detect, text will be easy to recognize, and you can even set up tests that are extremely task-specific, like programming-language riddles or whatever.
As soon as you introduce “profit-making elements,” like “hire a coder” style stuff, you will be faced with a squall of impossible-to-detect spam.

There’s a reason why only certain areas of Craigslist are spammed. When you are looking to buy cheap $5 stools from people moving out, you are unlikely to be a big spender, and your tolerance is very low; you are, after all, looking for a five-buck stool. But go to the real-estate-for-sale area, and you will find hundreds of spam postings. In the same way, spammers naturally go where people either have to be gullible to begin with or must lower their paranoia threshold to participate (like personal ads).

My 2 long cents.

Buggy,

“Stating” and “doing” are two very different things. Anyone who has used Craigslist extensively will know that spam there is just noise. As I just pointed out in another post, if you want to put spam on the back burner, use a Bayesian filter. Captcha is too primitive for that.

Captchas (specifically, pictures of words) have been broken. I spent a day researching the state of the art in breaking captchas, and it turns out there is code available out there (OCaml and Python versions were found in about 10 minutes). The supposed gold standard of captchas, Gmail’s sign-up, has reportedly been broken. If you actually sit down and spend a day thinking about how to break one, and you’re even remotely talented at programming, the solutions become pretty obvious. Clearly, reasonably talented programmers are doing this (what programmer do you know who knows OCaml but is also incompetent?).

Another issue is that certain types of attacks can be jump-started by human interaction. It turns out that if you are spamming a site over and over, you get a pretty good idea of what the correct answers to its captchas are. If you have a team of low-paid workers spend an hour entering them, that is usually enough of a seed to overcome the captcha. This can also work with the aforementioned pr0n-site redirecting.

Realistically, for a spammer to be effective, they only have to get the captcha right about 25% of the time. Anything with choices that can be guessed (like 4 picture options with kittens) is an immediate fail, since blind guessing already succeeds 25% of the time.

The underlying problem is that there is HUGE money in this activity. An out-of-work programmer could easily support himself. I know people who own million-dollar-per-year businesses built on this kind of activity. And those businesses serve a specific niche; I can’t imagine what a general operation would look like.

Go read about Asirra before posting something like “oh, but spammers can build a database of all the 100 or so images and do an MD5 hash to determine which are cats and dogs.”

Asirra has a database of about 3 million images, and it’s always growing thanks to its relationship with petfinder.com. Imagine if all pet websites contributed: Asirra would probably grow faster than spammers could keep up.

Admittedly, the fact that a user has to sort 12 images into 2 categories means a blind guess succeeds with probability 1/2^12 = 1/4096, so some spam posts will still get through. But combine this with requiring users to register for an account to post and giving them the option to flag posts as spam, and I think this could be very effective.

Given how popular this blog has become, I’m surprised your ORANGE captcha still works so well.

@keppla

It may only be 4096 possibilities, but it resets when you get it wrong. It’d be pointless to blindly guess.

Ryan

Pretty much every technology, from the rock forward, was invented for the purposes of Good. (Caveman Ogg smashes wheat with rock, makes flour.) Almost inevitably, someone eventually comes along and uses the technology for Evil. (Caveman Grogg steals Ogg’s rock, smashes Ogg in head, pwns Ogg’s rock.) This has been repeated many, many times. We always think the problem is with the technology. (Maybe if we wrapped the rock in something soft so it couldn’t hurt people… or if we only issued rocks to people we trust… or invented rock-proof armor…) Maybe we should investigate fixing people and not technology? :slight_smile:

While the idea of having people prove themselves by adding non-spam content is attractive, there may be an initial hurdle: people will be disinclined to try out your site if, the first time they try to contribute, they are apparently ignored because their post is held for moderation.

Could we address this by building a trust metric on top of OpenID? I’m thinking of something vaguely like Advogato, except that you build up reputation on several participating sites, and that serves as your letter of introduction to another site that trusts those sites to know whom to trust…
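Very roughly, the aggregation might look something like this; the site list, weights, and threshold are all invented for illustration:

```python
# Sketch: each participating site vouches for an OpenID with a score,
# and a new site combines vouches from peers it trusts into a single
# "letter of introduction" score.
TRUSTED_SITES = {
    "https://siteA.example.com": 1.0,  # fully trusted peer
    "https://siteB.example.com": 0.5,  # partially trusted peer
}

def introduction_score(vouches):
    """vouches: dict of issuer site URL -> that site's reputation score (0..1)."""
    total = sum(
        TRUSTED_SITES.get(site, 0.0) * score
        for site, score in vouches.items()
    )
    weight = sum(TRUSTED_SITES.get(site, 0.0) for site in vouches)
    return total / weight if weight else 0.0

# An OpenID vouched for by both peers arrives with some earned trust;
# a score above, say, 0.5 might skip the "first post moderated" hurdle.
score = introduction_score({
    "https://siteA.example.com": 0.9,
    "https://siteB.example.com": 0.6,
})
```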

I’ve always wondered whether a CAPTCHA that ‘just’ targets URLs is feasible. If a user wants to spam a product, surely the only way to get anything out of it is to add a URL to the message?

Perhaps the new method of fighting spam will consist of a centralized human group that moderates every URL posted, checks it personally, and verifies the message. Imagine something like Akismet for WordPress, but run only on URLs to verify whether a message is spam. I can imagine that a centralized website which manually verifies every URL posted in any forum or blog software, using paid workers, could be very effective, although I’m positive it can’t be that simple.
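The gate itself is simple to sketch; the verification service below is entirely hypothetical:

```python
# Sketch: posts without links go straight through, while any extracted
# URL is checked against a (hypothetical) central verification service
# before the post appears.
import re
import urllib.parse
import urllib.request

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
VERIFY_ENDPOINT = "https://url-checker.example.com/check"  # hypothetical

def moderate_post(text):
    urls = URL_PATTERN.findall(text)
    if not urls:
        return "publish"  # no links, nothing for a spammer to gain
    for url in urls:
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen(f"{VERIFY_ENDPOINT}?{query}") as resp:
            verdict = resp.read().decode().strip()
        if verdict != "ok":
            return "hold"  # send to the human moderation queue
    return "publish"
```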

One thing worth noting is that a would-be spammer can use a captured captcha image from your site as a captcha on their site, thereby getting humans to do the OCR. One way to undermine that strategy is to include information that identifies your site in the captcha. Similarly, bits of text that are obviously irrelevant (to a human) can break OCR-based attacks.
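A toy illustration using Pillow (my choice, not anything from the comment): brand the image with the site name and add decoy text that a human will skip but an OCR pipeline will faithfully transcribe:

```python
# Sketch: render a captcha that names the site it belongs to and adds
# a decoy line. A human solving it replayed on another site can see it
# was lifted, and OCR attacks will transcribe the decoy with the word.
from PIL import Image, ImageDraw

def make_captcha(challenge, site_name):
    img = Image.new("RGB", (400, 120), "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, 5), f"For use on {site_name} only", fill="gray")
    draw.text((10, 40), "ignore this line entirely", fill="lightgray")
    draw.text((10, 75), f"type this word: {challenge}", fill="black")
    return img

# Example: make_captcha("orange", "codinghorror.com").save("captcha.png")
```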

One thing you failed to mention about Wikipedia is the vast amount of bot work that reverts vandalism over there. If humans were solely in charge of keeping Wikipedia in good shape, it would be in shambles. There is an IRC channel that receives every edit made to Wikipedia; a bot then checks the page for known bad URLs and strings, and reverts if necessary. Also, Wikipedia sets nofollow on all external links.
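The core of such a bot reduces to a blocklist check on the edit’s added text; a sketch, with made-up blocklist entries and the IRC plumbing left out:

```python
# Sketch: scan an edit's added text for blocklisted URLs or strings
# and flag it for automatic reversion. Real bots feed this check from
# Wikipedia's live edit feed; the entries below are examples.
import re

BAD_URLS = {"cheap-pills.example.com", "casino.example.net"}
BAD_STRINGS = [re.compile(r"buy\s+viagra", re.IGNORECASE)]

def should_revert(added_text):
    for host in BAD_URLS:
        if host in added_text:
            return True
    return any(pattern.search(added_text) for pattern in BAD_STRINGS)

# Example: an edit inserting a known-bad link gets reverted on sight.
assert should_revert("Great deals at http://cheap-pills.example.com/!")
assert not should_revert("Fixed a typo in the History section.")
```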