Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?

I'm not a huge fan of The Daily WTF for reasons I've previously outlined. There is, however, the occasional gem – such as this one posted by ezrec:


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html

My girlfriend lives in ScuntHORPE and the Scunthorpe problem keeps catching me out.

I work in a school, and we have filtered Internet access to prevent the kids from searching for porn, etc. So if I ask Google Maps to generate a map taking me from Scunthorpe to somewhere… it’s banned.

And yet I can type holly into Google Images and get pages of porn.

Web filters will ALWAYS fail because once you realise there’s one in place, the game becomes how can I defeat it?, and there’s nothing better at brute-force testing than a class full of bored students with Internet access.

Really, if you think your web filter is any good, deploy it in a school and then look at the logs.

It’ll take them about half an hour to find a web proxy - even if you blacklist the word proxy. And once they’re on a proxy everything is defeated.

Our web filter is notoriously bad, banning parts of session strings, or branding parts of Google as “Blacklisted image filter” or “weighted phrase limit exceeded”. One day the whole of Google was banned. Searching for The Simpsons doesn’t work, and neither does computer keyboard or The BBC. We have no idea why.

Unfortunately we buy our filtered Internet access from our local education authority, so have no control over it.

Oh, I remember that way too well. There are actually much older systems that tried to filter obscene words. I remember writing f*** and sh** many years ago, because fuck and shit caused your post to never go anywhere. I fail to see the point of censoring people. What’s so bad if I say fuck? Will the world stop spinning? I said the evil word, so what? If people want a kid-safe network, they should start by creating a .kid top-level domain, with very strict rules about who may obtain one (and regular checks on what people are doing with it). This part of the web can then be highly censored, which I guess will have two effects: children will find tricks to circumvent the filtering (as they always do ;-)) and teenagers will pester their parents to remove the Internet filtering as soon as possible, saying .kid is only for babies. Forcing me to surf only on these pages is like forcing me to play in the sandbox. I’m getting too old for that… and I remember way too well that if you keep bothering your parents long enough (and if you are really good at it), they will give in sooner or later :P

I said it before and I like to repeat myself here:

Technology is not for solving society problems. It’s hardly ever the cause of such a problem, thus why should it be the solution to one?

Just a few weeks ago, every single lower-case ‘t’ on www.cisco.com was missing. I mean all of them: the entire rendered HTML had been post-processed with a rampant regex, so all the JavaScript and CSS were broken too. I took a screenshot if anyone is interested:
http://img396.imageshack.us/img396/66/ciscotfailwv8.jpg

My favorite example was a filter to ensure proper references to the Queen of England. Applied to a story about honeybees.

With its highly evolved social structure of tens of thousands of worker bees commanded by Queen Elizabeth, the honey bee genome could also improve the search for genes linked to social behavior.

Queen Elizabeth has 10 times the lifespan of workers, and lays up to 2,000 eggs a day.

http://www.regrettheerror.com/wire-service/reuters-typo-tells-us-queen-elizabeth-has-10-times-the-lifespan-of-workers-and-lays-up-to-2000-eggs-a-day

I disagreeumptions have been proven right…

Yeah, it’s really not a good idea to just do global replaces for words that you think are obscene. I can see a programmer doing it if they are working on a forum for Disney or something.

You could always just replace Carlin’s Seven Dirty Words. Most of those don’t appear as substrings of other words.

http://en.wikipedia.org/wiki/Seven_dirty_words

But then even if you filter, people will find ways around it.
fuck could be fcuk, kcuf, f0k, phawk or p#4K, chinga in Spanish, or even a fake word that means the same from a TV show like frell or frak.

It is entertaining, though, to see how far people will go in filtering kids’ toys. I’ve tested a few of my kids’ noisy spelling toys, and a few of them will say something like oops! if you spell a curse word.

One could argue the ‘superior’ regex is still flawed
\btit(s?)\b = breast$1
Following the capture as little as possible guideline.

Totally whining and nitpicking of course.
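
For anyone who wants to see the difference, here is a minimal Python sketch comparing a plain substring replace with the word-boundary pattern quoted above. The function names are made up for illustration, and Python writes the backreference as `\1` rather than `$1`. It is only a sketch of the idea, not what any particular filter actually runs.

```python
import re

def naive_filter(text: str) -> str:
    # Plain substring replacement: rewrites "tit" wherever it occurs,
    # including in the middle of innocent words.
    return text.replace("tit", "breast")

# Word-boundary version of the pattern from the comment above.
# \b only matches at the edge of a word, so "constitution" is left alone.
BOUNDED = re.compile(r"\btit(s?)\b")

def bounded_filter(text: str) -> str:
    # \1 re-inserts the optional plural "s" captured by the group.
    return BOUNDED.sub(r"breast\1", text)

print(naive_filter("constitution"))    # consbreastution
print(bounded_filter("constitution"))  # constitution
print(bounded_filter("blue tits"))     # blue breasts
```

The last line shows the catch: the boundary version stops mangling longer words, but it still cannot tell a bird from an insult.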

  1. How many different ways can you enter penis?

that’s what she said

Google could do a better job if they wanted to. They are able to filter content quite effectively in various countries, at the behest of those countries many times…

This is such a silly game.

Good post.

I implemented one of these systems. In Australia all data on a user actually belongs to the user and they can get a copy whenever they want. So I provided a warning mechanism that attempts to reduce swear words in notes on that customer while allowing the users to proceed if they are sure it is OK. Also, it doesn’t stop the user saving because the note may quote the customer swearing at us.

However, there were some interesting parts of this project:

  1. It is actually quite a hard task to locate a good list of offensive words. Many words you think are OK are offensive to some people.

  2. Context is key. There are religious words that are ok in some situations and offensive in others. Also, we have found that there are many people with last names such as Cockburn (pronounced Coburn), Dong and… well, you get the picture.
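
A warn-but-allow check along these lines might look roughly like the Python sketch below. The word list, `check_note`, and `save_note` are all hypothetical names invented for illustration; the real system is not described in enough detail to reproduce.

```python
import re

# Hypothetical flag list, for illustration only. A real list is hard to source.
FLAGGED = {"damn", "hell"}

def check_note(text: str) -> list[str]:
    """Return the whole words in the note that appear on the flag list."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() in FLAGGED]

def save_note(text: str, confirmed: bool = False) -> bool:
    """Warn on flagged words, but let the user proceed once they confirm.

    Returns True if the note would be saved, False if confirmation is needed.
    """
    flagged = check_note(text)
    if flagged and not confirmed:
        return False  # caller shows a warning listing the flagged words
    return True       # nothing flagged, or the user confirmed it is OK

print(save_note("Customer was polite."))                            # True
print(save_note("Customer said: go to hell."))                      # False
print(save_note("Customer said: go to hell.", confirmed=True))      # True
```

Matching whole words and asking for confirmation instead of blocking is one way to cope with quoted customer language and with surnames that happen to contain a flagged word.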

Good post, perhaps one day there’ll be an effective white list plugin for the popular blogs complete with several gigabytes worth of dictionary?

I’ve seen this in a company intranet a while back, all kinds of hilarity ensued!

This is the problem with software-buttisted obscenity filters.

It’s a very difficult problem, especially when you’re working on sites aimed at young people where there is any degree of user-supplied content. Kids love to swear and they love to break stuff, so the only real way of getting it to work is through full moderation. I remember implementing a content filter for a message board on an educational site one time. We were smart enough to have tests to pick up most swearing that was standalone, then cram all the letters together to catch anything that was done by hyphenating words and so on, and substitute basic 1337 characters for the relevant letters to be ready for that one. All in all it was a pretty good basic system for this task and it worked really well. Also, the file that contained the dictionary of forbidden words was great: a list of all the variants of every potentially rude word you can think of is surprisingly funny.
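
The three passes described above (standalone words, letters crammed together, basic 1337 substitution) can be sketched in Python roughly as follows. The word list and the character map are assumptions made up for this example, not the contents of the original dictionary file.

```python
import re

# Hypothetical forbidden list and 1337 map, for illustration only.
FORBIDDEN = {"damn", "crap"}
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

def is_blocked(message: str) -> bool:
    # Undo basic 1337 substitutions first, e.g. "d4mn" becomes "damn".
    lowered = message.lower().translate(LEET)

    # Pass 1: standalone words.
    words = re.findall(r"[a-z]+", lowered)
    if any(word in FORBIDDEN for word in words):
        return True

    # Pass 2: cram all the letters together so that "d-a-m-n" or
    # "d a m n" collapses to "damn", then look for substrings.
    squashed = re.sub(r"[^a-z]", "", lowered)
    return any(bad in squashed for bad in FORBIDDEN)

print(is_blocked("hello world"))  # False
print(is_blocked("D4MN"))         # True
print(is_blocked("d-a-m-n"))      # True
print(is_blocked("scrap metal"))  # True
```

Note the last result: the squashing pass is a substring search, so it brings the Scunthorpe problem straight back for words like scrap.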

The one thing we weren’t expecting was for the users to self-censor, so the first time we saw a message where the little dears were telling one another to f*** off was quite a surprise to us.

Turns out if you need to swear in a content filtered environment, the easy way is to drop in an html entity code for one of your letters.
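
A small Python sketch of why that works, and of one possible countermeasure: decode entities before checking. The forbidden list and function names are hypothetical, and `html.unescape` is from the Python standard library. This is a sketch under those assumptions, not a complete defence.

```python
import html

# Hypothetical forbidden list, for illustration only.
FORBIDDEN = {"damn"}

def raw_check(text: str) -> bool:
    # Checks the text exactly as submitted.
    return any(bad in text.lower() for bad in FORBIDDEN)

def decoded_check(text: str) -> bool:
    # Decode HTML entities first, so "&#97;" is seen as "a",
    # which is what the browser will eventually display.
    return raw_check(html.unescape(text))

message = "d&#97;mn"
print(raw_check(message))      # False
print(decoded_check(message))  # True
```

The filter has to look at what the reader will see, not at the bytes the poster typed.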

When my kids were younger they used to play the Flash games over on NickJr.com. My daughter, Kassandra, called me over saying that the game wasn’t working. She was around 3, meaning that this could be anything from the focus being on the wrong app to her having hit a new key sequence that would take me a few hours to undo. Thankfully, this one landed in the middle. The game would, you guessed it, not accept her name – her name was a foul word. K-ass-andra.

Great implementation.

The funny thing is even Rock Band 2 has this problem. They call unsuitable Band names not classy and will refuse to list them on Xbox Live, but it’s a mystery why the filter is triggered:

http://www.rockband.com/forums/showthread.php?t=90574

I understand the certain 7 words used in a George Carlin bit can’t be used as well as other profanity. That’s fine. Now let’s figure out some other things.

My wife made a band called Stinkfoot in honor of our cat who had an infected cut on his foot. (All fine now!) So, I’m guessing the word Stink is the equivalent of $%# and *^#$#$ now?

I find that it doesn’t matter b/c people kan’t spel ineeway.

Disney has tried several times to create a safe chat environment – meaning one where people can’t communicate anything negative. For example, they tried a system where only words from a whitelist are permitted. It never works, as people always find a way to route around the restrictions.

For a great post on the topic, see http://thefarmers.org/Habitat/2007/03/the_untold_history_of_toontown_1.html

Funny, just posted about this too. Was responding to an ITT, basically saying that automatic content filtering like that doesn’t work. My favourite story was about Tyson Gay (you can guess where that’s going):

http://www.guardian.co.uk/technology/blog/2008/jun/30/computerautocorrectssurname

Actually, one thought I had was that there’s also a question of culture in this. I had to sit and think what was meant by ‘clbuttic’, 'cos here in the UK ‘butt’ is something you store rainwater in, part of a cigarette, or something football fans do with their heads…