Stop Me If You Think You've Seen This Word Before

If you've ever searched for anything, you've probably run into stop words. Stop words are words so common they are typically ignored for search purposes. That is, if you type in a stop word as one of your search terms, the search engine will ignore that word (if it can). If you attempt to search using nothing but stop words, the search engine will throw up its hands and tell you to try again.
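As a rough illustration of the behavior described above, here is a minimal sketch of stop-word filtering. The `STOP_WORDS` set and the `filter_query` helper are hypothetical names made up for this example; real search engines use their own lists and handle the all-stop-words case in their own ways.

```python
# Hypothetical, tiny stop list for illustration only; real engines use their own lists.
STOP_WORDS = {"a", "an", "and", "be", "is", "it", "not", "of", "or", "the", "to", "who"}


def filter_query(query):
    """Drop stop words from a query; raise if nothing searchable remains."""
    terms = query.lower().split()
    kept = [term for term in terms if term not in STOP_WORDS]
    if not kept:
        # The "throw up its hands" case: every term was a stop word.
        raise ValueError("query contains only stop words, try again")
    return kept
```

With this sketch, a query like "the album of the year" keeps only the non-stop terms, while a query such as "The The" is rejected outright.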


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/11/stop-me-if-you-think-youve-seen-this-word-before.html

I didn’t notice that until I read this!

Oh, my, you don’t know how many times I tried to find any The The album in various web stores… If I didn’t know the exact album title, I ended up with either no matches or too many.

I remember once having an enormously frustrating time trying to use the internet to discover the name of the album by the band ‘A’ that contained their hit single, ‘Nothing’.

In that case:

If I ever have a band I’ll call it It is. Our first album will be called No Matches.

And my stage name will be Orange.

An all-stop-words title that probably gets used fairly often: What might have been.

I’m almost positive that Google didn’t ignore stopwords back in 2004 either. I remember searching for stopword-only queries such as ‘the the’ or ‘the who’ (a matter of personal taste in music :) ) and getting nice results.

I think the differences in the stopword lists could be partly attributed to the different corpora used for frequency counts, plus some tuning of the list length after checking the results. If Oracle used their own corporate archive to decide the stopwords and MySQL used the Wall Street Journal archive, they’ll probably get different lists (though obviously the most frequent words will be the same).
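To make that guess concrete, here is a small sketch of deriving a stop list from raw frequency counts. The two toy "corpora" and the `top_words` helper are invented for illustration and are not how Oracle or MySQL actually built their lists.

```python
import re
from collections import Counter


def top_words(corpus, n):
    """Return the n most frequent words in a corpus, most frequent first."""
    words = re.findall(r"[a-z']+", corpus.lower())
    return [word for word, _count in Counter(words).most_common(n)]


# Made-up stand-ins for a corporate archive and a newspaper archive.
corporate = "the database stores the table and the index of the table"
newspaper = "the market fell and the price of the stock fell"

print(top_words(corporate, 2))  # ['the', 'table']
print(top_words(newspaper, 2))  # ['the', 'fell']
```

Both lists start with "the", but they diverge right after it, which is where the hand tuning would come in: a domain word like "table" is frequent in one corpus without being a sensible stopword.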

Btw, back in 2004 Google was granting API keys good for 1,000 automated queries a day. Use 10 email accounts and obtain 10 keys… without violating their policy.

Stop words have always been a pain, because they lack granularity (they just don’t exist). But going without any kind of search filtering is just as annoying because it brings up unwanted data, and even Google still falls for this every so often.

What happens if you’re looking for video games written in PHP? A search for php video game on Google returns a vast heap of results for video game that merely happen to have a php extension somewhere in their URL. The same happens quite frequently when you’re looking for esoteric PHP-related concepts on the web. In this situation, neither stop-wording php nor avoiding stop words altogether solves the problem: a cleverer technique is required to eliminate or rank down some occurrences but not others.

Hard to believe someone in this day and age would search on the the and not “the the”.

One key point is that the list Jeff gives (522,000,000 for ‘the’ and so on) is not the frequency of the word, but the number of pages containing that word.

The word itself may appear many times within the page, meaning that the relative frequency of ‘the’ and ‘of’ compared with ‘reviews’ is much greater than indicated. A typical 500-word page will probably have a few dozen occurrences of ‘the’.
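A quick sketch of the distinction, using three made-up "pages" (the helper names are hypothetical): the count of pages containing a word is not the same as the number of times the word occurs.

```python
def document_frequency(pages, word):
    """Number of pages that contain the word at least once."""
    return sum(1 for page in pages if word in page.lower().split())


def total_occurrences(pages, word):
    """Number of times the word occurs across all pages."""
    return sum(page.lower().split().count(word) for page in pages)


# Invented sample pages for illustration.
pages = [
    "the review of the film",
    "the the are a band",
    "reviews of the album",
]

print(document_frequency(pages, "the"))  # 3
print(total_occurrences(pages, "the"))   # 5
```

A search engine's "about N results" figure corresponds to the first number, so it understates how dominant a word like ‘the’ really is.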

I imagine Google has plenty of rules added by hand for these words, to reflect the fact that the semantics of ‘review’ are much more specific than those of ‘of’. Plus they seem to give a higher ranking to results which use the search keywords as a phrase rather than separately; so ‘the the’ and ‘to be or not to be’ are probably handled in that way. If you rearrange the words in ‘to be or not to be’ you will get different results.

We’re missing out, potentially, on more great band names:
What the …?
Who the …?

Stop me, oh, stop me…
Stop me if you think that you’ve heard this one before

I’m getting only 11 billion results for the:
Results 1 - 10 of about 11,840,000,000 for the. (0.30 seconds)

The worst of all in that respect is the band Can. Searching for them has always been a real pain. At least that’s not a terrible name. But the band Ours is both troublesome to search for and an annoying name. I avoid mentioning them to people even though I quite like their music.

Wikipedia, on the other hand, takes search terms literally, without parsing. Searching for The The in Wikipedia is a five second job.

How’s it going with integrating Lucene into Stackoverflow?

Having worked with Lucene in the past, I can truly say that search is one of the most interesting technologies I’ve worked with.

http://bash.org/?514353
… but Google has learned their lesson.

Doing a naive search for the the on Google UK gives a more reasonable result.

http://www.google.co.uk/search?rlz=1C1GGLS_en-GBGB291&sourceid=chrome&ie=UTF-8&q=The+The

Halfway down the first page it suggests the the band as an alternative search query.

Out of interest, did the dictionary you used for your Google experiment contain any profanity?

I’m an internet search engine developer, and yes, we have to carefully handle stop words; we never ignore them.