Even word boundaries are extremely flawed. Try discussing the works of Philip K. Dick on the internet…
Ilari, that is easily the best link I have read all day, and it speaks volumes about human nature.
I want to stick my long-necked Giraffe up your fluffy white bunny.
Several years ago, I wrote a porn filter for a web search engine. The goal was to detect porn, not substitute naughty words. Testing it was quite interesting. I was paid to browse porn for a while.
It was also interesting trying to get the word list internationalized. We contracted with freelance translators and we’d get back non-porn words. For example, if we asked for the translation of “tit” we’d get “breast” in the target language, which was not what we wanted.
Add Cumbria to your list of difficult UK locations. I had that pop up in one of my systems.
These filters also have a social engineering effect: my wife has stopped referring to our cats as pussies or pussy-cats in emails, because the emails are bounced. The effect is to kill off all but the obscene usage of the word and reinforce its offensiveness.
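For anyone wondering why Cumbria trips these systems, here is a minimal Python sketch. The blocklist and both function names are made up for this example and are not taken from any real filter. It compares naive substring matching with word-boundary matching.

```python
import re

# Hypothetical blocklist, chosen only to reproduce the examples in this thread.
BLOCKLIST = ["cum", "dick"]

def substring_hits(text):
    """Return blocked words found anywhere in the text, even inside other words."""
    lowered = text.lower()
    return [w for w in BLOCKLIST if w in lowered]

def boundary_hits(text):
    """Return blocked words that appear as whole words only."""
    lowered = text.lower()
    return [w for w in BLOCKLIST
            if re.search(r"\b" + re.escape(w) + r"\b", lowered)]

if __name__ == "__main__":
    for sample in ("Greetings from Cumbria", "the works of Philip K. Dick"):
        print(sample, substring_hits(sample), boundary_hits(sample))
```

Substring matching flags Cumbria. Word boundaries let Cumbria through but still flag Philip K. Dick, which is the point made at the top of the thread.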
I wonder if non-American countries have this problem. This urge to limit certain types of speech utterances at any cost strikes me as a particularly American fantasy. One of the more amusing things about the internet as a whole is how it creates a huge tension between the stated American desire for free expression and the fact that once Americans see what free expression really means, they recoil in horror.
Did anyone find this post a bit ironic given Jeff’s previous post on rolling his own HTML filter for Markdown?
It seems when it comes to obscenity filtering, only fools would try and roll their own naive implementation, but when it comes to HTML filtering, well then, that’s a different matter altogether.
The subject came up on a mailing list I’m on, and somebody mentioned a university site that tried to ban the word ‘#1089;ialis’, making it impossible to discuss socialism…
(The ‘c’ in the above example is actually a Cyrillic ‘s’, because the site, with an impressive ironic flair, gave me a ‘Your comment could not be submitted due to questionable content’ message the first time.)
OK, Jeff, you got me that time. Once more: #1089;ialis
Well, you know what the word is, anyway. I shall admit to having been defeated by /your/ filter.
Of course, as I say that about Americans, it’s no doubt that China is probably better at censoring the internet. But at least they don’t pretend to care about free speech as much. My understanding of the Chinese approach is to skip automated filters for the most part and instead use an army of human censors. If I were Disney I’d probably do it that way — although I’d outsource my censors to India or China.
The effect being to kill off all but the obscene usage of the word and reinforce its offensiveness.
On a related note, watch the Lemon Demon Song of the Count video – http://www.youtube.com/watch?v=6AXPnH0C9UA – and try to hear the original lyrics instead of what your brain fills in due to years upon years of media conditioning. (I’m not being a conspiracy theorist; you’ll understand what I mean when you watch it.)
when it comes to HTML filtering, well then, that’s a different matter altogether
They sort of are two totally different things: HTML is a parseable computer language, not an infinitely malleable human language.
- How many different ways can you enter penis?
- How many different ways can you enter a hyperlink?
Sure, #2 is large (har har) but #1 is pretty much INFINITE.
Don’t forget foreign names too. Fk and st are common in some Japanese and German names.
When I was a bit younger and definitely more innocent, I used to play the game Ultima V. It’s still my all-time favorite game. I’m from Norway, and at the time my English wasn’t perfect. It still isn’t, but it’s at least better. I know some curses now.
Anyway, in Ultima V you had actual conversations with in-game characters. Not the variant that’s popular these days, where you get maybe three alternatives and click on the one you’d like to say to the character. No, you actually had to type your question. Or, rather, you had to type at least the first four letters of your question. The internal database had some keywords that triggered the appropriate response. So when I typed “job” the character would answer something like “I’m a blacksmith.” And if I then typed “blacksmith” he would elaborate. Whenever I typed a dirty word, the character would say “With language like that, how did you become an avatar?”
Then one day I spoke with Sven. I asked him about his job. He told me he was a glassblower. I asked him about glass, and he told me some stuff about glass. I asked him about “blow” and he told me off for being naughty. And, being young and naive (and not English), I didn’t understand why. Ah, good old days. These days if I ask someone about “blow” and “job”, at least I know what I’m asking for.
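Here is a rough Python sketch of how that kind of four-letter keyword matching could produce the Sven incident. This is a guess at the mechanism based only on the story above, not the game's actual code or data. The names `RUDE`, `SVEN` and `reply`, the glass text, and the fallback reply are all placeholders.

```python
# Hypothetical reconstruction; the real game's tables and logic are not known here.
RUDE = {"blow"}  # assumed entry, included only to reproduce the anecdote

SVEN = {
    "job": "I'm a glassblower.",
    "glas": "(placeholder: some stuff about glass)",
}

SCOLDING = "With language like that, how did you become an avatar?"

def reply(npc, typed):
    """Look up a response using only the first four letters of the input."""
    key = typed.strip().lower()[:4]
    if key in RUDE:               # the rude-word check runs before the keyword lookup
        return SCOLDING
    return npc.get(key, "(placeholder: no response)")

if __name__ == "__main__":
    for word in ("job", "glass", "blow"):
        print(word, "->", reply(SVEN, word))
```

With a global rude list checked first, an innocent follow-up question to a glassblower gets the same scolding as an actual insult.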
Worse than useless, filters can encourage swearing. If you set an arbitrary line of what is unacceptable, you are making everything which isn’t caught implicitly acceptable. I was once told of a forum which was generally quite polite (apparently they do exist) and never had a problem with swearing until they implemented a filter. After that it was socially acceptable to say sh1t because the system said you could.
Even more annoying are the ones that wipe out your entire sentence or post. I’ve been on forums before and had my paragraph changed to something like _____ said a bad word
That is frustrating, because you just lost everything you typed. Also, once people figure out you have a swear filter, they start doing things to avoid it, such as s<b></b>hit and sh1t. It really is quite pointless.
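To make the cat-and-mouse game concrete, here is a small Python sketch. The blocklist, the substitution table and the function names are invented for illustration. It shows a literal word filter missing sh1t, and a normalizing filter catching it while still missing the next trivial variant.

```python
# Hypothetical one-word blocklist and substitution table, for illustration only.
BLOCKLIST = {"shit"}
LEET = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "$": "s"})

def literal_filter(text):
    """True if any whitespace-separated word is exactly on the blocklist."""
    return any(word in BLOCKLIST for word in text.lower().split())

def normalized_filter(text):
    """Undo a few common character swaps, then apply the literal filter."""
    return literal_filter(text.translate(LEET))

if __name__ == "__main__":
    for sample in ("well shit", "well sh1t", "well s.h.i.t"):
        print(sample, literal_filter(sample), normalized_filter(sample))
```

Each normalization rule you add closes one hole and leaves the rest open, and every rule also widens the set of innocent strings that can be flagged.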
On another related note, I have been on other forums where clicking the Post button gives an error that says “Your post contains an inappropriate word”, or something similar, without telling you what the word is. On that particular forum I’ve had it block things like “stupid” but not “ass”. It was very random.
Word filters are bullshit.
Interestingly enough, I just posted something yesterday on my blog where something slipped by me and I didn’t notice until someone pointed out that I had unfortunately left out the first vowel in the word count. I wouldn’t have minded at least a little squiggly underline (maybe in yellow instead of red?) to suggest that in an article on HttpWebRequests and cookie-based authentication such a word was somewhat suspect. After all, Google is my blogging platform, and while statistically they know that both the word I wanted and the word I actually typed are valid spellings of those two words, they could probably also intuit that one doesn’t belong in the current article. No heavy-handed replacement strategy, just a nudge would have done it.