Stop Me If You Think You've Seen This Word Before

I am still wondering why ‘sex’ is not at the top of Google’s list… :P

A search for The The on Yahoo gives excellent results.

Searching for a phrase with the same stop word is weird! If I search for The in Google, it gives me 13,490,000,000 results; if I search for The The The The, I get 1,160,000,000 results, and so on! Indexing?

It sure would be nice if Google would stop trimming special characters out of a search, even when the search string is enclosed in quotation marks! ANY way at all to escape important symbols would be great. (My example the last time I complained about this was wanting to quickly look up the syntax for the $get shorthand; man, is that a useless search string once the $ gets trimmed out.)

Saj,

I assume there are about 12.33 billion pages that have the word ‘the’ at least once, but not more than 3 times.

Searching for The The? Jeff, if I didn’t adore you before, I certainly do now!

I didn’t notice those words until I read this article.

@Sobani:

Exactly. Which makes the indexing bizarre! Not as smart as it could be.

Especially since stopwords differ so much from one language to another… THE is the French word for tea, OR is the French word for gold, THESE also means thesis, and so on… Dunno about other languages, but it’s pretty hard to find a good golden thesis on tea…

@Grank
Here is what you get if you use google’s code search engine:
http://www.google.com/codesearch?hl=en&lr=&q=%22%24get%22

Thanks for posting this. I’m building a search application and the information from your MySQL link might make things considerably faster.

Thanks again for posting this topic!

Apparently, at least to Google, stop word warnings are a thing of the past.

Which is not to say that people should start implementing that in their applications, or that databases need to change.

This is Google you’re talking about, the best search algorithm (by far) up to now.

Your example is funny … I just started a vinyl record website and I have this comment in my code:

#TODO, ALLOW EXCEPTIONS ON STOP WORDS, FOR EXAMPLE, ARTIST The The
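
For what it's worth, here is a minimal sketch of how a TODO like that might be handled. It assumes a small hand-picked stop word list and a hypothetical set of protected artist names; both are made up for illustration and are not taken from the site's actual code.

```python
# Illustrative stop word list; a real search backend ships its own.
STOP_WORDS = {"a", "an", "the", "of", "is", "what"}

# Hypothetical exceptions: artist names made up entirely of stop words.
PROTECTED_PHRASES = {"the the", "a"}


def filter_query(query):
    """Drop stop words unless the whole query is a protected phrase."""
    normalized = " ".join(query.lower().split())
    if normalized in PROTECTED_PHRASES:
        # Keep every word so the artist can still be found.
        return normalized.split()
    return [word for word in normalized.split() if word not in STOP_WORDS]


print(filter_query("The The"))
print(filter_query("the best of the smiths"))
```

Checking the whole query against the exception set is the simplest option; matching protected names inside longer queries would need a phrase-aware tokenizer.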

The band Live is almost impossible to search for on Google. The first couple of results are relevant (Wikipedia and the official website), but after that, it’s all about live bands. ‘Live music’, and ‘Live CDs’ are equally worthless queries. Of course, this isn’t so much about stopwords as a semantic failure.

There’s one band that’s even harder to search for than A; it’s the outfit that brilliantly decided to call themselves !!!:

http://en.wikipedia.org/wiki/!!!

(The article mentions that you can find them by searching for chk chk chk, like searching for love symbol to find Prince in his glyph period.)

And behold, the comment auto-link-highlighter can’t believe ! could be in a URL :-).

No, but seriously, the worst group to search for (at work) is the Barenaked Ladies.

Anal Cunt is worse, trust me. If you don’t get fired for searching those words, you’ll certainly be fired for your eclectic taste in music.

I work for Barnes and Noble as a lowly bookselling drone, and get a kick every time that someone asks for What is the What[1], since all four words of the title are stop words in the internal search system!

[1] http://search.barnesandnoble.com/What-Is-the-What/Dave-Eggers/e/9780307385901
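
To illustrate the failure mode, here is a small sketch of a title being emptied out by stop word filtering, plus one possible fallback. The stop word list is an assumption for the example and has nothing to do with the real internal system.

```python
# Illustrative list only; not the actual list used by any bookseller.
STOP_WORDS = {"what", "is", "the", "a", "of"}


def strip_stop_words(title):
    """Return the words of the title that are not stop words."""
    return [word for word in title.lower().split() if word not in STOP_WORDS]


def search_terms(title):
    """Fall back to the unfiltered words when filtering leaves nothing."""
    kept = strip_stop_words(title)
    return kept if kept else title.lower().split()


print(strip_stop_words("What is the What"))  # nothing survives the filter
print(search_terms("What is the What"))      # fallback keeps all four words
```

The fallback is a design choice, not a standard: it trades some index size and speed for being able to find titles like this one at all.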

Saj, on results for the the the the: What you’ve discovered is not a billion pages with that phrase, but a flaw in Google’s page-count estimator that makes it think there are a billion. When you get the page with the first ten results, it doesn’t actually go count how many pages there are with the phrase; it makes up an estimate based on word frequencies and such. That estimate is often way too high.

I’m not sure how accurate their page counts are for single individual words, but I wouldn’t necessarily trust Jeff’s 2004 results to be exact, either.

Having stop words in Google’s phrase queries is something that I actually miss, because there was a clever hack that involved them. If you searched for, say, row the boat, what it would really do was a wildcard search for row * boat, where * is any single word. That was occasionally quite useful when I could half-remember a phrase I wanted to search for, as I could just use the as a wildcard for the words I couldn’t remember. And as far as I can tell there’s no other way to do exactly that search.
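
The behaviour described above can be roughly imitated over plain text with a regular expression. This is only a sketch under stated assumptions: the stop word list is invented, and it says nothing about how Google actually implemented phrase queries.

```python
import re

# Illustrative stop words; each one in a phrase acts as a one-word wildcard.
STOP_WORDS = {"the", "a", "an", "of"}


def phrase_to_regex(phrase):
    """Build a regex where every stop word matches any single word."""
    parts = []
    for word in phrase.lower().split():
        if word in STOP_WORDS:
            parts.append(r"\w+")            # wildcard: exactly one word
        else:
            parts.append(re.escape(word))   # literal word
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def matches(phrase, text):
    """True if the text contains the phrase, with stop words as wildcards."""
    return phrase_to_regex(phrase).search(text) is not None


print(matches("row the boat", "Row, row, row your boat"))
print(matches("row the boat", "row boat"))
```

Because the wildcard is `\w+`, it stands in for exactly one word, which mirrors the row * boat example; it will not match zero words or several.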