Trouble In the House of Google

I have been increasingly plagued by these auto-generated sites over the past months. As of this morning (Monday 17 January 2011, European time), my Google searches are almost entirely spam-free. Is this wishful thinking or has someone at Google been reading this thread?

I’m a little late reading this blog, but the example searches seem fine: searching for “dishwasher reviews” yields lots of legitimate-looking dishwasher reviews. The first hit for “iphone 4 cases” had more case models for sale than I thought existed. StackOverflow continues to rank very (very!) high in my search results. I don’t recall being annoyed at seeing republished copies of answers, so personalization may put the real SO higher than the scrapers for me. (We all get personalized search results, so we really can’t talk about a single Google search result anymore.)

I have noticed spammy-looking republished content in search results, of course. Does Google manually adjust search results as people point them out? Or was there a very recent change to the spam-recognition algorithms?

Here’s an example of a broken Google:

There’s a fairly popular YouTube video called Waffles by Julian Smith.
http://www.youtube.com/watch?v=Mj00ii1BLV8
I used a quote from it for the title of a post on my blog. Now, if you type in the quote, or something like unto it (“i’m not retarded i ate a jellyfish”) into Google, the very first result is that blog post. Which has absolutely nothing to do with that popular video.

It is even worse when unethical news organizations like NPR use automated tools to scrape other people’s content without permission: http://topics.npr.org/page/about-automated-pages

Check out the NYTimes article on how to game negative customer feedback and turn it into a high page rank. Gotta love the “no comment” from Google.

http://www.nytimes.com/2010/11/28/business/28borker.html?pagewanted=all

Google’s blog links to your post! http://googleblog.blogspot.com/2011/01/google-search-and-search-engine-spam.html

In short: there is no issue, and there is less spam than before, but they are constantly working on improving their algorithms :slight_smile:

Hey, look! Google wrote a blog post on this:
http://googleblog.blogspot.com/2011/01/google-search-and-search-engine-spam.html
While updating my site and reading Google’s RSS feed at the same time, I stumbled onto it.
They took your opinion seriously, and they are working on it. Congrats! :slight_smile:

One thing to consider here is: who is it that Google is failing? They’re failing you, as the owner of Stack Overflow, because you want and expect (and have every reason to expect, based on the guidelines you quote) that your results should show up before scraped results. But are they failing their users, who are looking for content? I don’t think so. If I’m searching for something that has a result on Stack Overflow, then I’m looking for the content of that page. If the scraped page has all that same content, then I can find what I’m looking for, whether I go to stackoverflow.com or hijackstackoverflowtraffic.com. As a random user who is not part of the SO community, why would I care which site I find it on?

I haven’t personally observed this with SO, but I see it all the time with mailing list archives. I don’t care whether the archive link that comes up first is the site that officially owns the mailing list or some other random site. As long as they have the messages in a readable format, I can find the answer I’m looking for.

In the case of Wikipedia, we noticed around 2005 that a search on a piece of Wikipedia text would get three pages of mirror sites before us. I believe some people contacted Google asking “hey, what’s up with that?” and then a short while later it was fixed.

(It was about then Wikipedia started showing up as the top of every Google result for everything …)

I think we assumed that this was a more general algorithm penalising duplicate content. But your example suggests it isn’t.

There’s always the option to limit Google searches to a particular site.

E.g., to search for pink butterflies on SO:
“site:stackoverflow.com pink butterflies”
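If you want to build that kind of site-restricted query programmatically, here’s a minimal sketch in Python. The `site_search_url` helper name is my own invention; it just prepends the `site:` operator and URL-encodes the whole thing:

```python
from urllib.parse import urlencode

def site_search_url(site, query):
    """Build a Google search URL restricted to one site
    via the site: operator."""
    q = f"site:{site} {query}"
    return "https://www.google.com/search?" + urlencode({"q": q})

# Open this URL in a browser to search only Stack Overflow:
print(site_search_url("stackoverflow.com", "pink butterflies"))
# → https://www.google.com/search?q=site%3Astackoverflow.com+pink+butterflies
```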

But, I agree. Google search is starting to suck and syndication sites are taking over the internet. I really wish Google could find a way to drop the value of purely syndicated/scraped sites so the good content could be allowed to float back to the surface.

I was considering writing something to the Stack Overflow team after this happened to me the first time but, obviously, plenty of other SO users beat me to it.

http://birdswithteeth.wordpress.com/2011/02/10/libraries-raising-the-numerator/

thanks. cribbed some stuff.

@Jeff Atwood,

Google changed their algorithm just for you! Must be nice to have that kind of pull, :P.

http://googleblog.blogspot.com/2011/02/finding-more-high-quality-sites-in.html

I just tried a typical search, “python algorithm sum”, and Stack Overflow was in the first three results.

http://goo.gl/7N2pI

“Google Algorithm Change Targets Content Farms”

Finally!!! People are waking up to the tholian web. Bill Joy (in some pop Java guide) had a pretty clear grasp of how useful this all really is. He didn’t seem impressed. Lately I see Google as the incredible shrink-wrapping graph. You see, there is a nasty little self-referential dividend they get with your every return visit. Ultimately the ‘game’ plays itself and expands its /mindshare/ like a brain slug, till you can’t remember anything without them.

It’s going to be tough, but some part of myself misses actual pages and words that don’t move or flash. In the end I owe everything to books. The other thing is, I’m very edgy about the ‘underside’, you know, those ‘spooky’ pages that seem to have died since Google… whoa, spellcheck likes G not g oogle… creepy. I have a vague memory of loving to hunt and having tons of links to keep track of. As for what can ‘we’ do? I think we may need to build tools for keeping track of sites, capable of analyzing and presenting the connections in useful ways.