Johnny - quit that.
If you search for "The The" with quotes, Google returns the band's official website, their Wikipedia page, etc. Exactly what you're looking for. Specifying an exact phrase instead of just a set of words seems to solve a lot of the issues you identify.
One key point is that the list Jeff gives (522,000,000 for "the" and so on) is not the frequency of the word, but the number of pages containing that word. The word itself may appear many times within the page, meaning that the relative frequency of "the" and "of" compared with "reviews" is much greater than indicated. A typical 500-word page will probably have a few dozen "the"s.
An excellent point, and of course you're right. The actual frequency count won't be the same as the "appeared at least once on the page" count that the Google results give.
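For anyone who wants to see the difference between "pages containing the word" and "times the word appears", here is a minimal Python sketch. The three-page corpus is invented for illustration, and the names document_frequency and term_frequency are just labels chosen here, not anything Google exposes.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
pages = [
    "the cat sat on the mat near the door",
    "reviews of the best cat toys",
    "the the is a band",
]

def document_frequency(pages):
    """Count how many pages contain each word at least once."""
    df = Counter()
    for page in pages:
        df.update(set(page.split()))  # set() so each page counts once per word
    return df

def term_frequency(pages):
    """Count every occurrence of each word across all pages."""
    tf = Counter()
    for page in pages:
        tf.update(page.split())
    return tf

df = document_frequency(pages)
tf = term_frequency(pages)

print("the:", df["the"], "pages,", tf["the"], "occurrences")
print("reviews:", df["reviews"], "pages,", tf["reviews"], "occurrences")
```

In this toy data "the" is on 3 pages but occurs 6 times, while "reviews" is on 1 page and occurs once, so the page counts understate how much more common "the" really is.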
Specifying an exact phrase instead of just a set of words seems to solve a lot of the issues you identify
Right, you and I know that, but average search users don't know to put things in quotes; they just type stuff and expect it to work.
My earliest experience with stopwords was 8 or 9 years ago. I searched for the who (without quotes) on Google. It filtered both of the words out and told me it couldn't find any results. I didn't use it for 3 or 4 months after that, but we ALL eventually turn back to Google.
There's a brand new remastered Smiths greatest hits out with some rarities. Both Marr and Moz endorsed it:
The Sound Of The Smiths
http://www.amazon.com/Sound-Smiths-Very-Best-Deluxe/dp/B001EX6DNK/ref=wl_itt_dp?ie=UTF8&coliid=I3UFG84NDTRPWE&colid=1OFNT6W8VZ56T
Just a silly follow-up to the "appeared at least once on the page" nature of the discussion: is /that/ even true? If a billion pages link the word "orange" to a given page, won't that page turn up pretty high in searches for "orange", even if it doesn't contain that word in the content?
I was reading about a project (Nutch, I think) the other day where each stop word is combined with the word that follows it to form a new, uncommon word. For example:
The band The The was a great band
would be analyzed and produce something like: band thethe great band
(edit to last message)
well, I would suppose it would probably produce: theband band thethe wasa agreat great band
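A rough Python sketch of that idea, for the curious. This is not Nutch's actual code; the stop list and the rule (glue each stop word onto the word after it, pass other words through unchanged) are assumptions made to match the example above.

```python
# Assumed stop list, just large enough for the example sentence.
STOP_WORDS = {"the", "was", "a"}

def combine_stop_words(text, stop_words=STOP_WORDS):
    """Replace each stop word with stopword+nextword; keep other words as-is."""
    tokens = text.lower().split()
    out = []
    for i, tok in enumerate(tokens):
        if tok in stop_words:
            # A stop word at the very end has nothing to combine with,
            # so this sketch simply drops it.
            if i + 1 < len(tokens):
                out.append(tok + tokens[i + 1])
        else:
            out.append(tok)
    return out

print(" ".join(combine_stop_words("The band The The was a great band")))
```

Applied strictly, this rule also emits "thewas" for the second "The" of the band name, which the guess above leaves out: theband band thethe thewas wasa agreat great band. The point stands either way, since "thethe" becomes a rare term that can be indexed and searched like any other word.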
I'm using Lucene too. I'm very satisfied with it.
The only problem I have is when an index is updated, inserted into, or deleted from very often. I sometimes get an error message saying that an index file isn't readable.
I haven't found a real solution for that so far. For now I'm storing all the IDs to index or delete in a table. A cron job takes care of this table and does all the index work. So it's not a real live search, but one with a delay of roughly 15 minutes. Any suggestions for that?
Btw, I love your blog and read all the books you have recommended here...
Actually, the trick for searching common English words (at least with Google) is to search in another language:
http://www.google.pt/search?hl=pt-PT&q=the+the
or
http://www.google.es/search?hl=es&q=the+the
Either will yield the correct results: The The, the band, is the first result.
My pet peeve is programming sites that don't let you search for things like c++ or stl::hash.
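A guess at why that happens: many site searches split text on anything that isn't a letter or digit, so c++ collapses to c and stl::hash falls apart into two words. Here is a small Python sketch of the difference. The regex in tokenize_code_aware is only one illustrative pattern, not what any particular site or search library uses.

```python
import re

def tokenize_naive(text):
    """Split on non-word characters, as many simple indexers do."""
    return re.findall(r"\w+", text.lower())

def tokenize_code_aware(text):
    """Keep '::'-qualified names together and keep trailing '+' or '#'."""
    pattern = r"[A-Za-z0-9_]+(?:::[A-Za-z0-9_]+)*[+#]*"
    return re.findall(pattern, text.lower())

query = "how to use stl::hash in c++"
print(tokenize_naive(query))
print(tokenize_code_aware(query))
```

The naive version turns the query into how / to / use / stl / hash / in / c, while the second keeps stl::hash and c++ intact so they can actually be matched.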
The worst of all in that respect is the band Can
I can has search?
No, but seriously, the worst group to search for (at work) is the Barenaked Ladies.
I remember trying to look up the TV series "As If" a few years ago with no success. It works in Google now. Nice.
Another thing to consider is that (I think) Google also uses N-Gram models (I seem to recall that they released a set of models up to 3-Gram or 5-Gram from their corpus).
And in a weird bit of serendipity, Johnny Marr played in The The for several years.
Stop words aren't a relic of early '90s computing, they're a relic of standard pre-web information retrieval systems (reaching much farther back than the '90s!). Stop words were an enhancement to the quality of search results, just like word stemming or tf-idf.
This is from a world that searched databases of scholarly or otherwise serious information: you wanted to get everything relevant to your query (all 1000+ relevant documents, perhaps), and nothing that wasn't relevant.
Stop words allowed you to avoid the situation of returning a document that said "the the are the the." to a query asking for "the white house", just because you included the word "the".
Google is in a whole new world. You will likely have several thousand results for any query, and you want only the best few, so stop words are certainly less relevant than they were before. If you've got great results in your top ten, who cares if you return "the the are the the" as your 151st result?
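To make the "the white house" case concrete, here is a toy Python sketch of any-word matching with and without a stop list. The stop list and the matches function are invented for illustration, and real retrieval systems rank results rather than return a yes/no.

```python
# Assumed stop list for the example; real systems use longer ones.
STOP_WORDS = frozenset({"the", "a", "an", "of", "are"})

def matches(query, document, stop_words=frozenset()):
    """Return True if any (non-stop) query word appears in the document."""
    query_words = {w for w in query.lower().split() if w not in stop_words}
    doc_words = set(document.lower().rstrip(".").split())
    return bool(query_words & doc_words)

junk = "the the are the the."
relevant = "a tour of the white house"

print(matches("the white house", junk))                  # matched only via "the"
print(matches("the white house", junk, STOP_WORDS))      # stop words ignored
print(matches("the white house", relevant, STOP_WORDS))  # still found
```

Without the stop list the junk document matches purely because of "the". With it, only "white" and "house" count, so the junk document drops out and the relevant one is still found.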
Having been a web surfer since the days of NCSA Mosaic, I just got in the habit of not even bothering to type in stop words (which I generally define as any word that would not be capitalized in a title) on my search queries.
I guess I need to break that habit now. Thanks for the info.
Sites should do an exact phrase match unioned with a non-phrase match for any local search.
When I type something into Google I almost always use quotes. If Google doesn't find an exact match, it automatically falls back to dropping the quotes and running the query again without my intervention.
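One way a site's local search could combine the two, sketched in Python. This is only an illustration of "phrase matches first, then all-words matches" over an in-memory list; the search function and the sample documents are made up, and a real site would do this inside its search engine.

```python
def search(query, documents):
    """Return exact-phrase matches first, followed by all-words matches."""
    words = query.lower().split()
    phrase = " ".join(words)
    phrase_hits, word_hits = [], []
    for doc in documents:
        tokens = doc.lower().split()
        normalized = " ".join(tokens)
        # Pad with spaces so the phrase only matches on whole-word boundaries.
        if f" {phrase} " in f" {normalized} ":
            phrase_hits.append(doc)
        elif all(w in tokens for w in words):
            word_hits.append(doc)
    return phrase_hits + word_hits

docs = [
    "the band played the hits",
    "The The is a band from London",
    "reviews of cat toys",
]
for hit in search("the the", docs):
    print(hit)
```

Here the document containing the exact phrase is listed first, the one that merely contains the word comes second, and the unrelated one is left out. If there are no phrase hits at all, the result is just the all-words matches, which is the fallback behaviour described above.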
The Smiths and The The?
Jeff - Your 80s are showing.