Trouble In the House of Google


Google does a great job filtering spam in Gmail. I’m not sure how much the “report spam” button contributes to that, but it is certainly somewhat satisfying to press it.

Where is the will to do the same thing for their search results?

I’d like to see a similar button in Chrome (for starters) for social rating of spammy websites. Other comments have noted that this would just shift the goal posts: spammers would start gaming the social rating system instead.

Maybe one solution would be to weight ratings by reputation. Google detects that you are someone who rates spammy web sites highly, and devalues the ratings you apply to all other sites.
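
Something like this toy sketch is what I mean (the sites, scores, and thresholds are all made up):

```python
# Toy sketch of reputation-weighted ratings: a user whose positive ratings
# mostly go to known spam sites earns a low reputation, so their votes on
# every other site count for less.

def reputation(user_ratings, known_spam):
    """Fraction of a user's positive ratings that did NOT go to known spam sites."""
    positive = [site for site, score in user_ratings.items() if score > 0]
    if not positive:
        return 1.0
    return sum(1 for site in positive if site not in known_spam) / len(positive)

def weighted_score(site, all_ratings, known_spam):
    """Average rating for a site, with each vote scaled by the rater's reputation."""
    total = weight = 0.0
    for ratings in all_ratings.values():
        if site in ratings:
            rep = reputation(ratings, known_spam)
            total += rep * ratings[site]
            weight += rep
    return total / weight if weight else 0.0

known_spam = {"scraper-example.com"}
all_ratings = {
    "honest_user": {"stackoverflow.com": 1, "scraper-example.com": -1},
    "spammy_user": {"scraper-example.com": 1, "stackoverflow.com": -1},
}
print(weighted_score("stackoverflow.com", all_ratings, known_spam))
# honest_user (reputation 1.0) outweighs spammy_user (reputation 0.0) -> 1.0
```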

You’re right on except for this one statement:
“when was the last time you clicked through to a page that was nothing more than a legally copied, properly attributed Wikipedia entry encrusted in advertisements?”

On a growing number of search queries in Google, I’m seeing results from Ask.com that are Wikipedia articles with ads outranking the original Wikipedia posting.

I’ve been running into a flood of these scraper sites in my search results, and more than anything I just want to exclude them. I would like to click a link next to the result to exclude that site from future searches; any content of theirs I’m interested in will show up as a hit on the original source anyway.

Providing that feature might solve the problem for two reasons. First, I don’t see the scraper sites, so my searches are more to my liking (and Google works for me, so I come back to it). Second, Google can use a large number of explicit “exclusions” to affect the rankings; they could treat each one as feedback, equivalent to a user saying “this site is not relevant.”
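
A toy sketch of that double use (the domains and numbers here are made up):

```python
# A per-user exclusion list filters my own results, and the aggregated
# exclusions feed back into ranking as a "this site is not relevant" signal.
from collections import Counter

user_exclusions = {
    "alice": {"scraper-example.com", "experts-exchange.com"},
    "bob":   {"scraper-example.com"},
}

def filter_results(user, results):
    """Hide results from domains this user has explicitly excluded."""
    excluded = user_exclusions.get(user, set())
    return [r for r in results if r["domain"] not in excluded]

def demotion_factor(domain, total_users):
    """Scale a site's ranking score down as more users exclude it (capped at 50%)."""
    counts = Counter(d for domains in user_exclusions.values() for d in domains)
    return 1.0 - min(0.5, counts[domain] / max(total_users, 1))

results = [{"domain": "scraper-example.com"}, {"domain": "stackoverflow.com"}]
print(filter_results("alice", results))                       # only stackoverflow.com
print(demotion_factor("scraper-example.com", total_users=2))  # 0.5
```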

I actually have been hoping for a change in Google for a while. While Blekko shows promise, it isn’t exactly what I was hoping for. And, though I know some of the following are a bit of a stretch now, they will be invaluable in the future.

  1. First, I want to be able to filter results out of my search. (This is the opposite of what Bbulkow suggested, but his option would be good too.) I want to be able to click something which says, "This site is bogus and should not be in this result set" or "That has nothing to do with what I am looking for". When I look for a legitimate answer to a question, I want to be able to tell Google to take about.com and shove it.
  2. I want to be able to search for symbols. I mean seriously, if I'm trying to find an email address, why does it need to be changed from "foo@bar.com" to "foo bar com"? (I'm a bit sensitive here; my last name is Allen-Poole.)
  3. True Boolean logic. I want to look for ((this and that) or (that and another)) and not (some-other-thing).
  4. I want a means to search for linguistic constructs. For example, if I am looking up John Smith, I want a search which looks for the name: two words in close proximity, possibly separated by a middle name or a middle initial. This is more than possible (see the sketch after this list).
  5. I want regexp. That is just insane though. I don't expect to grep the web any time in the near (or maybe even distant) future.
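
To make item 4 concrete (and give a taste of item 5), here is a rough sketch; the pattern and the sample sentences are mine, just to illustrate the kind of construct I mean:

```python
# Match "John Smith", "John Q. Smith", or "John Quincy Smith" as one construct:
# two names in close proximity, optionally separated by a middle name/initial.
import re

NAME = re.compile(r"\bJohn(?:\s+(?:[A-Z]\.|[A-Z][a-z]+))?\s+Smith\b")

for text in ["John Smith spoke first.",
             "John Q. Smith spoke first.",
             "John Quincy Smith spoke first.",
             "Johnson Smithfield spoke first."]:
    print(bool(NAME.search(text)), text)
# True, True, True, False
```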
And what am I willing to trade? Time. I remember the '90s. I remember preferring AltaVista because its results were just slightly better and its logic seemed more reliable. The amount of time I would save through proper results is worth far more than whatever extra seconds (even extra minutes!) it takes to crunch the numbers on their end.

Just think about this: it takes at least a second to read the title of a Google result. It takes another couple of seconds to evaluate the text beneath it. It is also not unreasonable for a website to take 3-5 seconds to load completely (tabbed browsing can hide some of that, though then you lose the full <title> as a clue). It takes an additional 3 seconds (minimum) to read and parse a site, realize it is not what you wanted, and move your hand back to the mouse.

Now suppose you have a questionable search (say, the dishwasher ratings example): the first result is bad, the second and third are maybes, the fourth is Amazon, and the fifth is the one that is actually useful. That means you waste at least 5 * link-text + 4 * subtext + 2 * site-viewing to get to the result (assuming you stay on the good result). That makes a minimum of 19 seconds of completely wasted time before reaching something truly useful, and in practice probably more, since you will likely linger on the mediocre results longer than 3 seconds.
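
Working that out with the estimates above (1 second per title, 2 per snippet, 3 per wasted visit):

```python
# Back-of-the-envelope check of the 19-second figure, using the per-step
# estimates from the previous paragraphs.
titles_read   = 5   # scan all five titles
snippets_read = 4   # evaluate the text under the first four
sites_visited = 2   # click through to the two "maybe" results

wasted_seconds = titles_read * 1 + snippets_read * 2 + sites_visited * 3
print(wasted_seconds)  # 19 seconds before reaching the genuinely useful result
```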

If Google were to give us these options, if it were to make our searches better, we would end up with a net benefit even at a 100% increase in search time (I’ve not had many searches take 10 seconds recently). The first point alone could yield extreme benefits, and it reminds me of the ant colony optimization approach to the travelling salesman problem (http://en.wikipedia.org/wiki/Travelling_salesman_problem#Ant_colony_optimization). And while this is still something which advertisers could use to our disadvantage, it would be a lot harder for them to do so, especially if these features were implemented on a per-user basis.

Now, I know that I am a lowly voice in a sea of spam, but seriously. Google has the ability to implement this. I’ve read their specs and I think that, if they wanted, they could even make a way to grep the web. For the first task, it wouldn’t even need to involve stored data – it could all be tracked within one session. The next question is whether Google will care.

Amusingly, I feel it obligatory to add a link to http://allen-poole.com so that some day Google may look upon me and smile.

The only feature I need Google to implement right now is giving me the ability to blacklist sites in all my queries. I’ve long wished that I could blacklist experts-exchange, and with the proliferation of scraping sites over the last year, that desire has become even greater.

You could possibly make it social (my “friends” blacklists can be added to my own), but don’t use blacklists to influence rankings. And no, don’t do any peer voting for rankings either, as that will lead to more abuse and just be added to the list of SEO techniques.

This can’t be that hard. I can already add “-site:experts-exchange.com” to my queries to remove those sites. Why can’t there be an option in my Google account settings to add that to all my searches?
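
A minimal sketch of the setting I’m asking for (the second blacklisted domain is just a placeholder; no such account option actually exists):

```python
# Keep a list of blacklisted domains and append a -site: operator for each one
# to every query. This is all the machinery the feature would need on my end.

BLACKLIST = ["experts-exchange.com", "scraper-example.com"]

def with_blacklist(query, blacklist=BLACKLIST):
    """Append a -site: exclusion for each blacklisted domain to the query."""
    exclusions = " ".join(f"-site:{domain}" for domain in blacklist)
    return f"{query} {exclusions}".strip()

print(with_blacklist("jquery ajax error handling"))
# jquery ajax error handling -site:experts-exchange.com -site:scraper-example.com
```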

I find it hard to believe that this isn’t intentional by Google, although I imagine the attention gathered by this article will change things dramatically.

By the way, there is an extension for Google Chrome with a button to report spam. It automates part of filling in Google’s report form.

It’s not you, dude. Google is becoming the new Yahoo, one spam result at a time.

The sad part is that without Google the web is nothing. For all the technology improvements, nothing has really improved.

It’s time to get VC money out of tech and start building things that work.
The web has turned into a get-rich-quick scheme.

Hi Jeff, I passed on the examples that you sent back in December, and the team is actively looking at improvements and changes they can make based on that feedback. Thanks for sending it.

I was curious about the link to “Google, Google, Why Hast Thou Forsaken the Manolo?” and so I checked that one out. It’s true that our algorithms don’t currently think that’s a great site, so I looked into it more. The disclaimer says “Manolo the Shoeblogger is not Mr. Manolo Blahnik.” It’s a different Manolo in the shoe industry.

So I picked a URL, let’s say http://basement.shoeblogs.com/category/bedding/ . Pretty much every post looked like “buy this type of bedding,” usually with an affiliate link. And over on the right-hand side are links like “Shop hassle free and buy unique Duvet Covers at thecompanystore.com” that look an awful lot to us like paid links that pass PageRank.

I support the right of this blogger to put whatever they want on their domain, but I also support Google’s right to decide how to rank our search results, and I don’t think we should be obligated to rank that site highly.

I appreciated the rest of your post and it’s safe to say that people inside Google are discussing it and how we can do better.

I’m no expert, but what about taking the new syndication-source and original-source meta tags a step further?

Original content could be pinged, timestamped, etc. with these tags. Webmaster Tools could be used to report sites that are outranking the original content, and a database would verify the claim and adjust rankings.

Rewrites and the like would still happen, but this should help clean things up a bit, in addition to giving content producers (and Google) an easier way of dealing with the problem.
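
To sketch what a verification pass might look like on the crawler side (the tag names are the ones Google announced; everything else here, including the sample page, is hand-waved):

```python
# Collect syndication-source / original-source meta tags from a fetched page.
# Crawling, timestamping, and the actual ranking adjustment are omitted.
from html.parser import HTMLParser

class SourceTagParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sources = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") in ("syndication-source", "original-source"):
                self.sources[attrs["name"]] = attrs.get("content")

page = '<meta name="original-source" content="http://example.com/original-post">'
parser = SourceTagParser()
parser.feed(page)
print(parser.sources)  # {'original-source': 'http://example.com/original-post'}
```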

This looks pretty inevitable.

Two aspects come to mind:

  1. It’s an algorithm, not human thought, that’s at work here. That gives you an arms race: the dark side of SEO will catch up even if they started out pretty dumb.

  2. Google makes its money from adverts. A site that has a lot of adverts is working for Google, so I really can’t imagine them stamping on such sites like they’re bugs. For me, some of them are exactly that: bugs. So we have a disconnect! In the absence of a published algorithm (even one that needs updating daily or more often), this sort of suspicion can’t be resolved.

Human-judged content (DMOZ, anyone?) looks like an answer. But many times when I look at what “social” delivers, I shudder. A great average of everybody is not, it seems to me, the answer.

Maybe the web just needs to fracture: personal control over how your own search works, sharing data with people whose opinions you respect, sites that work your way, less rubbish, less wasted time, more productivity.

We could end up with different worlds, as SciFi books have sketched for a long time: those who live on the web, consuming, following, never creating; and those who disconnect, think for themselves, and deliver new and valuable work.

The web has altered our lives. It’s time those who care got back into the loop. Control your web so that your life is yours, not a side effect of a cacophony of “important” web companies.

Seems to me that relying purely on content for indexing isn’t going to work any more. Each web site comes from a hierarchical division of address blocks. The existence of a “bad” web site within a given address block can and should impair the score of every other web site within that address block, to a lesser degree as we ascend the address block hierarchy. The same concept should be applied to registrars.

In other words, if my ISP hosts a lot of spam sites or there are a lot of them in my address block, my site is going to take a penalty, regardless of its content. I therefore have an incentive to seek out a reputable ISP, and reputable ISPs have a very solid reason to push out spam sites.

Eradicating this trash means making it harder and harder for it to find a “home”. I can’t think of a better way to do that than to have ISPs actively working on the problem, to retain their wider customer base. If they don’t have a wider customer base, and it’s all spam? Page ranks from that ISP will snuggle up to each other at the bottom of the pit.

The basic problem here is that it’s too hard to keep adapting like crazy to all the ways of restructuring content, times all the possible web sites. The number of ISPs and address blocks is, however, entirely tractable for this kind of problem.
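
A toy version of that scoring idea, with made-up spam hosts, block sizes, and penalty weights:

```python
# A spam site drags down its neighbors, with the penalty shrinking as we move
# up the address-block hierarchy (exact host -> /24 block -> /16 block).
import ipaddress

SPAM_IPS = {"203.0.113.7", "203.0.113.9"}

def block_penalty(ip, spam_ips=SPAM_IPS, weights=(1.0, 0.5, 0.1)):
    """Full penalty if the host itself is spam, smaller if its /24 or /16
    address block contains known spam hosts."""
    if ip in spam_ips:
        return weights[0]
    spam_addrs = [ipaddress.ip_address(s) for s in spam_ips]
    net24 = ipaddress.ip_network(f"{ip}/24", strict=False)
    net16 = ipaddress.ip_network(f"{ip}/16", strict=False)
    if any(a in net24 for a in spam_addrs):
        return weights[1]
    if any(a in net16 for a in spam_addrs):
        return weights[2]
    return 0.0

print(block_penalty("203.0.113.50"))  # 0.5: same /24 as known spam hosts
print(block_penalty("203.0.200.1"))   # 0.1: same /16 only
```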

Of course, this can punish entirely innocent web sites, until the system as a whole shakes itself out. It would be nice to have this particular omelette be break-free, but I don’t see how to do that.

I use Bing at work and Google at home (don’t ask) and as odd as it is, I do get better results with Bing.

There was an algorithmic thing back in 2006 that included TrustRank. While it wasn’t exactly a social recommendation type of thing, it did distribute GoogleJuice based on links from trusted sources.

http://weblogs.asp.net/jgalloway/archive/2006/01/11/435076.aspx

If there’s an element of TrustRank in the current algorithms, it seems like that probably needs both a reset and a higher weighting.
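
For reference, a toy version of that kind of trust propagation (the graph, damping factor, and iteration count are all made up):

```python
# Trust starts at a hand-picked seed and decays as it flows along outbound
# links, so a scraper with no trusted inbound links ends up with none.

links = {                                    # page -> pages it links to
    "seed.example":     ["goodblog.example"],
    "goodblog.example": [],
    "scraper.example":  ["scraper.example"],  # only links to itself
}
seeds = {"seed.example"}
trust = {page: (1.0 if page in seeds else 0.0) for page in links}

damping = 0.85
for _ in range(20):
    new_trust = {}
    for page in links:
        inbound = sum(trust[src] / len(outs)
                      for src, outs in links.items() if page in outs)
        base = 1.0 if page in seeds else 0.0
        new_trust[page] = (1 - damping) * base + damping * inbound
    trust = new_trust

print(trust)  # seed highest, goodblog next, scraper.example stays at 0.0
```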

@Matt Cutts

I can see you took the time to read, analyze, and post a comment. That’s very decent of you. Unfortunately, I can also see you only addressed Jeff, ignored the comments from other commenters here, and approached the matter purely as a ranking issue.

Since Google Search is meant to be a service to the “user who searches” and not a service to the “user who publishes”, I’m unsatisfied by your comment. But not surprised.

The whole PageRank conundrum reminds me of the parable in Gödel, Escher, Bach about the phonograph that breaks when you play a specific, carefully designed record. GEB was referring to incompleteness, but it’s an equally good metaphor for computer security and quality algorithms like PageRank.

If there’s sufficient motivation to find your algorithm’s weak points and exploit them, it’s going to happen. Complicated algorithms just require more complicated and better designed inputs.

Hi Mario, it’s actually my 11-year anniversary this week. I’m out of town with my wife, so I only have limited time to slip away and post responses. Suffice it to say that plenty of people at Google have read this article and the other articles Jeff mentioned, and lots of people will be discussing what we need to do next to improve things.

@Matt Cutts

Yes, Manolo the Shoeblogger’s site is like a lot of fashion blogs, in that it has a decent number of affiliate links.

However, you didn’t answer the central question posed by the Manolo in the post you’ve referenced, “why are the scrapers ranking higher than the original content?”

You, and others at Google, have harped for years on the need to produce interesting and original content, and yet, if a site which produces plenty of original content doesn’t throw exactly the right levers in Google’s Rube Goldberg system, you’ll give preference to a dozen content scrapers over it.

Manolo is very well known among fashion people and in the fashion press, in fact he pretty much invented the Fashion Blog…

http://en.wikipedia.org/wiki/Fashion_blog#Early_fashion_blogs

So, again, why should the content scrapers who are stealing his work be ranked higher than he is?

Let’s hope they do and things get improved, Matt. There’s been a growing disconnect between Google Search and its users for the past couple of years, I’d say, to the point that previously very rare statements like “the Google search engine isn’t good anymore” are becoming more prevalent. That would have been unthinkable before.

Given that this is also the period in which Google introduced the most significant new features and changes to the search engine UI since its inception, maybe it’s time (and excuse my bluntness) for Google to realize that may not be what users actually need the most.

I’m also prepared to accept that we are simply a non-representative minority. But I do seem to witness a growing cry of protest. With alternative search engines taking their place in the market and offering competitive possibilities, no amount of care is too much. Remember how Google itself rose.

And my congratulations, BTW!