Disambiguating Search with Quasi-Evil Hierarchies

Let's say I was to search Google for the word Jaguar:

There's an immediate problem. The semantics of Jaguar only exist in my head, not in any search box. Did I mean...


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2005/11/disambiguating-search-with-quasi-evil-hierarchies.html

Google has no metadata about content except the stuff that goes into the PageRank system, and they’re already using that. If I understand correctly, any after-the-fact categorization in any search engine is done with human intervention

Certain words are much more likely to follow other words in text:

http://www.cs.bell-labs.com/cm/cs/pearls/sec153.html

Eg, if the words “Jaguar” and “cat” are found together (or nearby) many times more often than, say, “Jaguar” and “can-opener”…
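To make that concrete, here is a minimal sketch of the co-occurrence idea. It is only an illustration: the function name `cooccurrence_counts`, the `window` parameter, and the three sample sentences are all made up for this example, and it says nothing about how Google or any real search engine actually does it.

```python
from collections import Counter

def cooccurrence_counts(documents, target, window=5):
    """Count words that appear within `window` tokens of `target`."""
    counts = Counter()
    target = target.lower()
    for doc in documents:
        # Crude tokenizer: split on whitespace, trim punctuation, lowercase.
        tokens = [t.strip('.,!?";:()').lower() for t in doc.split()]
        tokens = [t for t in tokens if t]
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

# Hypothetical mini-corpus, invented for this sketch.
docs = [
    "The jaguar is a big cat native to the Americas.",
    "A jaguar is the largest cat in South America.",
    "The new Jaguar car has a powerful engine.",
]
counts = cooccurrence_counts(docs, "jaguar", window=4)
print(counts["cat"], counts["car"], counts["can-opener"])
```

On a real corpus you would compare these counts against each word's overall frequency, so that common words like "the" don't dominate.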

This is also the basis of many excellent spam filters, so it’s already known to be eminently automatable:

http://www.codinghorror.com/blog/archives/000423.html

That said, the “human touch” could still be relevant for tweaking results on common queries. And why not? If you took the top 100 queries (probably all sex related, but humor me) and hand-optimized the results using information science experts, is that a bad thing?

Google did have something like this in the labs section, don’t know whatever happened to it. IIRC as you typed in information it used AJAX to generate a popup list of available topics based on what you had typed so far.

Your example of word proximity analysis does not necessarily solve the problem, does it? You still want to be able to type “Jaguar” and have Google know you mean cat, not car. There’s no guarantee that even with proximity analysis the results would be any different than what you’re getting now. Or am I missing something here?

Hand-optimization, IMO, is a “bad thing” inasmuch as it essentially breaks the Google model. Some queries are “optimized,” some are not; the algorithm used in one search is not the same as that used in others. Moreover, the human touch implies judgement by humans who still might not see the world the way that people do who are actually doing the searches. And that’s assuming that Google could even keep up with their Top 100 searches, which surely change by the second.

I stand by my statement that Google already has powerful syntactical tools that can help you find dang near anything, and that a two-word search will winnow your list by a huge percentage. Based on the Google searches that bring people to my blog (/BlogGoogleSearches.aspx), it seems that people frequently type MORE, not less, than they need. Perhaps that’s a self-selected audience, but still.

Try http://clusty.com - it does what you need.

Your example of word proximity analysis does not necessarily solve the problem, does it?

I think it does, since that’s how http://www.clusty.com appears to work…

still want to be able to type “Jaguar” and have Google know you mean cat, not car

Not quite: I want Google to give me a one-click method of refining my search, in exactly the same way they do with the existing “Did you mean…” feature.

Hand-optimization, IMO, is a “bad thing” inasmuch as it essentially breaks the Google model

The whole historical argument is that hand-built directories like DMOZ and Yahoo are obsolete. I agree, but that’s considering them as opposing poles of an either/or solution. When considered alone, search is the clear winner, but I don’t think it has to be a simpleminded choice of one method or the other. They can be quite complementary when used together.

So, therefore, “hand optimization” (eg, categorization) can still be useful.

I’m not entirely sure we’re talking about the same thing, though. You seem to be implying that somebody would go in and re-order search results, which isn’t what I’m proposing at all. I propose exactly what is shown in my screenshots: showing the hierarchy as an optional aid to filtering your search.
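As a rough sketch of what I mean by "optional aid": assume each result already carries a category label (where those labels come from is the hard part, and is not shown here). The refinement links are then just a grouped count, and the ranking itself is never touched. The names `group_by_category` and `refinement_links`, and the sample results, are hypothetical.

```python
from collections import defaultdict

def group_by_category(results):
    """Map each category to the result titles filed under it."""
    groups = defaultdict(list)
    for title, category in results:
        groups[category].append(title)
    return dict(groups)

def refinement_links(results):
    """Return (category, hit count) pairs, most populous category first."""
    groups = group_by_category(results)
    pairs = ((cat, len(titles)) for cat, titles in groups.items())
    # Sort by descending count, then alphabetically for stable output.
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

# Hypothetical, pre-categorized results for the query "jaguar".
results = [
    ("Jaguar Cars official site", "Automobiles"),
    ("Jaguar (animal) overview", "Animals"),
    ("Jacksonville Jaguars team page", "Sports"),
    ("Used Jaguar listings", "Automobiles"),
]
links = refinement_links(results)
for category, hits in links:
    print(f"{category} ({hits})")
```

Clicking a category would simply filter the existing result list down to that group; the original order within the group stays as the search engine ranked it.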

Have you looked at Vivisimo (www.vivisimo.com) or Clusty (www.clusty.com)? They both do exactly what you’re talking about, and even bring up results for the Jacksonville Jaguars (aka “Steeler Fodder”).

Compare the search results from searching “half.com” on Google, versus the results on Yahoo. Yahoo pulls up results about the dinky township that changed their name to half.com, and searching for “half.com books” is required to get me to the site I want.

The site is technically www.half.ebay.com now, but still… get it together, Yahoo.

(Working on multiple machines not all under my direct control, Yahoo was the home page, yadda yadda. Besides, I learned something!)

+1 for Clusty. I rarely use it except when I can’t figure out the words I want to use to search on. Google would be way more useful with something like that, but I’d bet there are stupid patent issues.

Google did have something like this in the labs section, don’t know whatever happened to it. IIRC as you typed in information it used AJAX to generate a popup list of available topics based on what you had typed so far.

That is Google Suggest
http://www.google.com/webhp?complete=1&hl=en

A variation of which is included in the latest Google Toolbar, at least the one in Firefox.

Holy crap, Clusty IS exactly what I wanted. Why had I never heard of this until today? Is Google really so dominant that the mainstream doesn’t publicize these great alternatives?

Also, I don’t think Google suggest is quite the same thing. That’s popular searches, not semantically related ones.

I guess I’m in the minority here, but I don’t actually see much of a problem. Given a) the incredible syntactical tools that Google gives you to construct your search with and b) the Google API that anyone is free to leverage and, say, add their own front end to, it just isn’t that hard to zero in on what you want. When I look something up and it isn’t first or second in the list, I reflexively look at the number of hits, and if it’s in the gajillions, I refine my search. Add one more word to your search (any word remotely connected with your topic) and you’re golden in 99.999% of the cases. (Statistics © 2005 Mike Pope, any resemblance to real people or numbers strictly coincidental.)

Don’t forget that all Google is really doing is showing you the results of a popularity contest for your term. Just because YOU didn’t mean “Jaguar-the-mispronounced-car-name” doesn’t mean others – the majority of others, poor blokes – were not searching for overpriced vehicles.

I’m also interested in how you think Google could actually implement categories. Google has no metadata about content except the stuff that goes into the PageRank system, and they’re already using that. If I understand correctly, any after-the-fact categorization in any search engine is done with human intervention. (?)

Incidentally, there’s a certain irony here in that Google’s initial success was precisely that their ranking algorithm was spookily prescient about what you meant, as distinct from engines that weighted pages based on, say, word count. We sure have become spoiled … :)

PS Google Suggest would in this case not help – it wouldn’t be until the second word that the search would be sufficiently refined … other than that you’d know it before hitting I Feel Lucky, I guess.

Some related posts:

Help Grandfather Google Improve Search With Wiki Directories

http://www.marketanomaly.com/?p=63

The answer you’re searching for is “browse”

http://www.humanfactors.com/downloads/jan05.asp

Your example of word proximity analysis does not necessarily solve the problem, does it?

I think it does, since that’s how http://www.clusty.com appears to work…

Well, I typed “Jaguar” into Clusty and the first hit was for the car. They pull up Wikipedia’s (not their) page as a disambiguator – is that the result of the proximity analysis?

It seems that Google does do what you ask, sometimes:

http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLG,GGLG:2005-31,GGLG:en&q=%40%40identity

Notice a few hits down it says:
See results for: @@identity sql

So it kind of does what you’re looking for. However, after reading these comments I’m going to check out Clusty…

They pull up Wikipedia’s (not their) page as a disambiguator – is that the result of the proximity analysis?

Well, I’m referring to the category list on the left of the Clusty results. That pretty effectively mirrors what I see in the eBay screenshot.

Notice a few hits down it says: See results for: @@identity sql

Hmm, interesting, that is what I’m proposing. But in the zillions of Google searches I’ve performed, that’s the very first time I’ve seen that behavior!

There must be some special consideration given to a technical search term like “@@identity”?

From what I’ve been able to see (in 5 minutes of playing on Google), when you search for a word that is highly visible from one or two very different vocabularies (e.g. Tacoma or Basic) you can get those suggestions to come up. It’ll be interesting to see if I see that type of behavior more often. This is a very interesting topic indeed.

Google has people with Ph.D.s that pick up trash and clean bathrooms. Why would they care that an idiot like you wants to be able to put in one word and find what you want?

Maybe because making a good search engine is what Google is all about?

Now, it seems eBay have started to also second guess what we’re looking for, but I guess they only have people with degrees cleaning their toilets since it doesn’t work that well…

http://www.piku.org.uk/diary/2008/07/17/ebays-search-system-is-broken