Stop Me If You Think You've Seen This Word Before

Johnny - quit that.

If you search for The The with quotes, Google returns the band’s official website, their Wikipedia page, etc. Exactly what you’re looking for. Specifying an exact phrase instead of just a set of words seems to solve a lot of the issues you identify.

One key point is that the list Jeff gives (522,000,000 for ‘the’ and so on) is not the frequency of the word, but the number of pages containing that word. The word itself may appear many times within a page, meaning that the relative frequency of ‘the’ and ‘of’ compared with ‘reviews’ is much greater than indicated. A typical 500-word page will probably contain ‘the’ a few dozen times.

An excellent point, and of course you’re right. The actual frequency count won’t be the same as the “appeared at least once on the page” count that Google reports.
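
To make the distinction concrete, here is a rough sketch that counts both numbers over a toy corpus. The three "pages" are made up for illustration and have nothing to do with Google's real figures.

```python
from collections import Counter

# Hypothetical three-page corpus, invented for this example.
pages = [
    "the reviews of the band the the",
    "the history of the smiths",
    "reviews of guitars",
]

term_freq = Counter()  # total occurrences across all pages
doc_freq = Counter()   # number of pages containing the word at least once

for page in pages:
    words = page.split()
    term_freq.update(words)
    doc_freq.update(set(words))  # set() so each page counts once per word

for word in ("the", "of", "reviews"):
    print(word, "pages:", doc_freq[word], "occurrences:", term_freq[word])
```

In this toy corpus 'the' and 'reviews' each appear on two pages, yet 'the' occurs three times as often, which is the gap the page counts hide.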

Specifying an exact phrase instead of just a set of words seems to solve a lot of the issues you identify

Right, you and I know that, but the average search user doesn’t know to put things in quotes – they just type stuff and expect it to work.

My earliest experience with stopwords was 8 or 9 years ago. I searched for the who (without quotes) on Google. It filtered both of the words out and told me it couldn’t find any results. I didn’t use it for 3 or 4 months after that, but we ALL eventually turn back to Google.

There’s a brand new remastered Smiths greatest hits out with some rarities. Both Marr and Moz endorsed it:

The Sound Of The Smiths
http://www.amazon.com/Sound-Smiths-Very-Best-Deluxe/dp/B001EX6DNK/ref=wl_itt_dp?ie=UTF8&coliid=I3UFG84NDTRPWE&colid=1OFNT6W8VZ56T

Just a silly follow-up to the “appeared at least once on the page” nature of the discussion: is /that/ even true? If a billion pages link the word orange to a given page, won’t that page turn up pretty high in searches for orange, even if it doesn’t contain that word in the content?

I was reading about a project (Nutch, I think) the other day where each stop word is combined with the word that follows it to form a new, uncommon word. For example:

The band The The was a great band

would be analyzed and produce something like: band thethe great band

(edit to last message)
well, I would suppose it would probably produce: theband band thethe wasa agreat great band
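
A minimal sketch of that idea, assuming a tiny hand-picked stop list and plain whitespace tokenizing. It is not Nutch's actual implementation, and the function name is made up. Note that this simple version also emits thewas for the second "the", which the guess above leaves out.

```python
STOP_WORDS = {"the", "was", "a"}  # tiny illustrative list, not Nutch's


def combine_stop_words(text):
    """Glue each stop word onto the word that follows it."""
    tokens = text.lower().split()
    out = []
    for i, token in enumerate(tokens):
        if token in STOP_WORDS and i + 1 < len(tokens):
            # Stop word with a successor: emit the combined, rarer term.
            out.append(token + tokens[i + 1])
        elif token not in STOP_WORDS:
            # Ordinary word: keep it as-is.
            out.append(token)
        # A trailing stop word with nothing after it is dropped.
    return out


print(" ".join(combine_stop_words("The band The The was a great band")))
# theband band thethe thewas wasa agreat great band
```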

I’m using Lucene too, and I’m very satisfied with it.
The only problem I have is when an index is updated, inserted into, or deleted from very often. I sometimes get an error message saying that an index file isn’t readable.
I haven’t found a real solution for that so far. For now I store all the IDs to index or delete in a table, and a cron job works through that table and does all the index updates. So it’s not a truly live search; there is a delay of about 15 minutes. Any suggestions for that?

Btw. I love your blog and read all the books you have recommended here…
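
For anyone curious what the queue-table workaround described above might look like, here is a rough sketch. It uses SQLite and a plain dict standing in for the search index; none of this is Lucene's API, the table and function names are made up, and it does not address the unreadable-file error itself. The point is only that requests are recorded in a table and a single scheduled job applies them in order.

```python
import sqlite3


def enqueue(conn, doc_id, action):
    """Record a pending operation instead of touching the index directly."""
    conn.execute(
        "INSERT INTO pending (doc_id, action) VALUES (?, ?)", (doc_id, action)
    )
    conn.commit()


def drain(conn, index, documents):
    """Apply all pending operations in order; meant to run from one job only."""
    rows = conn.execute(
        "SELECT seq, doc_id, action FROM pending ORDER BY seq"
    ).fetchall()
    for seq, doc_id, action in rows:
        if action == "delete":
            index.pop(doc_id, None)
        else:
            index[doc_id] = documents[doc_id]
        conn.execute("DELETE FROM pending WHERE seq = ?", (seq,))
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pending ("
    "seq INTEGER PRIMARY KEY AUTOINCREMENT, doc_id INTEGER, action TEXT)"
)

documents = {1: "the the", 2: "the smiths"}  # hypothetical source records
index = {}                                   # dict standing in for the index

enqueue(conn, 1, "index")
enqueue(conn, 2, "index")
enqueue(conn, 1, "delete")
processed = drain(conn, index, documents)
print(processed, index)
```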

Actually the trick for searching common English words (at least with Google) is to search in another language:

http://www.google.pt/search?hl=pt-PT&q=the+the
or
http://www.google.es/search?hl=es&q=the+the

will yield the correct results: The The, the band, is the first result.

My pet peeve is programming sites that don’t let you search for things like c++, or stl::hash
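
One likely cause is a tokenizer that throws away everything except letters and digits. A small sketch of the problem and one possible fix follows; both regexes are my own illustration, not what any particular site uses.

```python
import re


def naive_tokens(text):
    # \w+ keeps only letters, digits and underscore, so symbols vanish.
    return re.findall(r"\w+", text.lower())


def code_aware_tokens(text):
    # Also keep "::" between words and any trailing run of + or #.
    return re.findall(r"\w+(?:::\w+)*[+#]*", text.lower())


print(naive_tokens("c++ stl::hash"))       # ['c', 'stl', 'hash']
print(code_aware_tokens("c++ stl::hash"))  # ['c++', 'stl::hash']
```

With the naive version a search for c++ becomes a search for the single letter c, which is about as useful as it sounds.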

The worst of all in that respect is the band Can

I can has search?

No, but seriously, the worst group to search for (at work) is the Barenaked Ladies.

I remember trying to look up the TV series As If a few years ago with no success. It works in Google now - nice.

Another thing to consider is that (I think) Google also uses N-Gram models (I seem to recall that they released a set of models up to 3-Gram or 5-Gram from their corpus).

http://en.wikipedia.org/wiki/N-gram
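
For readers who haven't met the term, an n-gram is just a run of n consecutive tokens. A minimal sketch of extracting and counting them; it says nothing about how Google actually uses its models.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the band the the was a great band".split()
bigram_counts = Counter(ngrams(tokens, 2))

print(bigram_counts[("the", "the")])  # 1
print(len(ngrams(tokens, 3)))         # 6
```

Counting the pair ("the", "the") as its own unit is what lets a phrase like The The stand out even though each word alone is everywhere.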

And in a weird bit of serendipity, Johnny Marr played in The The for several years.

Stop words aren’t a relic of early '90s computing, they’re a relic of standard pre-web information retrieval systems (reaching much farther back than the '90s!). Stop words were an enhancement to the quality of search results, just like word stemming or tf-idf.

This is from a world that searched databases of scholarly or otherwise serious information–you wanted to get everything relevant to your query (all 1000+ relevant documents perhaps), and nothing that wasn’t relevant.

Stop words allowed you to avoid the situation of returning a document that said “the the are the the.” for a query asking for “the white house”, just because you included the word “the”.
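
A toy version of that situation, assuming a matcher that returns any document sharing at least one term with the query. The stop list and function name are made up for illustration.

```python
STOP_WORDS = {"the", "a", "of", "are"}  # tiny illustrative list


def matches(query, document, use_stop_words):
    """True if the document shares at least one term with the query."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if use_stop_words:
        query_terms -= STOP_WORDS  # drop stop words before matching
    return bool(query_terms & doc_terms)


junk = "the the are the the"
print(matches("the white house", junk, use_stop_words=False))  # True
print(matches("the white house", junk, use_stop_words=True))   # False
```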

Google is in a whole new world. You will likely have several thousand results for any query, and you want only the best few, so stop words are certainly less relevant than they were before. If you’ve got great results in your top ten, who cares if you return “the the are the the” as your 151st result?

Having been a web surfer since the days of NCSA Mosaic, I just got in the habit of not even bothering to type in stop words (which I generally define as any word that would not be capitalized in a title) on my search queries.

I guess I need to break that habit now. Thanks for the info.

Sites should do an exact phrase match unioned with a non-phrase match for any local search.

When I type something into Google I almost always use quotes. If Google doesn’t find the exact match it automatically falls back to dropping the quotes and executing the query again without my intervention.
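
A rough sketch of what that could look like for a small local search, assuming an in-memory list of documents and a crude substring check for the phrase. The function name and sample data are made up.

```python
def search(query, documents):
    """Exact-phrase matches first, then all-words matches, without duplicates."""
    phrase = query.lower()
    words = phrase.split()
    # Crude phrase test: the query appears verbatim somewhere in the text.
    phrase_hits = [d for d in documents if phrase in d.lower()]
    # Non-phrase test: every query word appears somewhere in the document.
    word_hits = [
        d for d in documents
        if d not in phrase_hits and all(w in d.lower().split() for w in words)
    ]
    return phrase_hits + word_hits


docs = ["The The official site", "The Smiths and the Who", "Reviews of guitars"]
print(search("the the", docs))
# ['The The official site', 'The Smiths and the Who']
```

Putting the phrase hits first means people who never type quotes still see the exact match at the top, and the looser matches only fill in below it.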

The Smiths and The The?

Jeff - Your 80s are showing.