Trouble In the House of Google

It would seem that Google’s perfect system already exists in Gmail.

I’ve been using Gmail since I started with it in 2005. As of today I have over 50,000 archived emails. When I first started using Gmail, I would get a couple of spam emails a month in my inbox and would immediately report them as spam using the dedicated button.

In the past 3 years, spam emails in my inbox have completely disappeared. They still accumulate by the hundreds in my spam folder, but I never get them in my inbox.

Thanks to their reporting system, a minority of Gmail users provide cross-referenced data that lets spam be identified and properly categorized, on top of whatever other signals Google uses. I believe Google published a white paper on their anti-spam technology a couple of years ago describing their techniques.

In contrast, I have an old GMX.de free email account that is now nothing but hundreds of spam emails a month in the inbox.
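To make that crowd-reporting idea concrete, here is a toy sketch of the general mechanism, purely as an illustration (the threshold, the fingerprinting scheme, and every name in it are my own assumptions, not Gmail’s actual system): once enough independent users report messages sharing the same content fingerprint, that fingerprint gets treated as spam for everyone.

```python
from collections import defaultdict
import hashlib

# Toy illustration of crowd-sourced spam flagging (NOT Gmail's real
# pipeline): enough distinct users reporting messages with the same
# content fingerprint marks that fingerprint as spam for everyone.
REPORT_THRESHOLD = 50  # hypothetical tuning knob

reports: dict[str, set[str]] = defaultdict(set)  # fingerprint -> reporting user ids

def fingerprint(message_body: str) -> str:
    """Fingerprint a message by hashing its whitespace-normalized text."""
    normalized = " ".join(message_body.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def report_spam(user_id: str, message_body: str) -> None:
    """Record one user's 'report spam' click for this message."""
    reports[fingerprint(message_body)].add(user_id)

def is_spam(message_body: str) -> bool:
    """True once enough independent users have reported this content."""
    return len(reports[fingerprint(message_body)]) >= REPORT_THRESHOLD
```

The real pipeline is obviously far more sophisticated, but the leverage is the same: a minority of users who bother to report spam end up protecting everyone else’s inbox.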

Google is hardly a monopoly. I’ve been using DuckDuckGo for a while now, and it seems to work very well. I’ve also started playing with Blekko. If more people start leaving Google, maybe they’ll make more of an effort to weed out spam sites.

When search first rose to prominence, the public couldn’t believe that a single search box could ever return adequate results from across the entire web. SEO has evolved enormously since then, but when it comes to results that more exactly reflect your needs, SEO is not the holy grail.
Business apps, for example, use other indexing mechanisms, such as metadata combined with social distance, in addition to ranking. I believe the public’s gut feeling that search will not always provide the right answer will prove true in the end, at least in part. I also believe the quality of results could improve by adding a social component to search.
So is it time to redefine gravity and implement a new model to ensure that we will still find relevant and authentic information in the future? I do think it is time to innovate.

Great article. Thanks.

I personally find myself spending more and more time filtering out results that I know (or suspect) are scraped Stack Overflow content.

Google should make it possible to flag another site’s content as a duplicate of the content on your own site (of course they would need to verify that it actually is a duplicate), so that sites that go to extremes with scraping get degraded in the results.

I guess Google lost the human touch a long time ago, and is increasingly becoming a system to beat.

Gravity is not broken, gravity is just gravity.

But yes, there needs to be an explicitly maintained catalog of things, because otherwise search results can contain whatever loosely related material happens to rank.

I just tried today to search for information on how to pay a certain well-known company through a certain well-known bank. Page after page of spam, spam, spam, spam, spam, spam, spam, baked beans and spam. I must have tried 10 different searches with various synonyms, phrases and exclusions, and all it did was slightly change the order and keyword relevance of the spam.

The worst part is that the results from all of these sites are identical. It would be nice if Google had at least a modicum of intelligence to say, “Hey, if Mr. User here isn’t interested in result #1, he’s probably not going to be interested in all of these identical copies of it down below.”

Three years ago, the content I was looking for often didn’t exist at all or wasn’t indexed, and I was okay with irrelevant results. Today I’d gladly settle for even a 50% chance of getting the results I want, in place of the ocean of obvious, pathetic spam I seem to get 90% of the time.

Here’s a thought, Google: How about tossing all of these garbage copypasta spam sites into a “mirrors” link for the original result? Surely you can figure out which site is actually the original; you invented PageRank, so checking a few indexing dates should be practically a “hello world” level of difficulty.
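That “mirrors” idea is at least conceptually simple. Here is a minimal sketch, under the assumption (mine, not Google’s) that the index records a first-crawled date for each page: group results whose text is effectively identical, keep the earliest-indexed copy as the canonical hit, and fold the rest into a mirrors list.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple

class Result(NamedTuple):
    url: str
    content: str          # page text, already extracted
    first_indexed: date   # hypothetical crawl metadata

def fold_mirrors(results: list[Result]) -> list[dict]:
    """Collapse effectively identical pages: the earliest-indexed copy
    stays as the canonical result, the rest become its 'mirrors'."""
    groups: dict[int, list[Result]] = defaultdict(list)
    for r in results:
        # Naive duplicate detection via a hash of normalized text;
        # a real system would use shingling / near-duplicate hashing.
        groups[hash(" ".join(r.content.lower().split()))].append(r)

    folded = []
    for copies in groups.values():
        copies.sort(key=lambda r: r.first_indexed)
        original, mirrors = copies[0], copies[1:]
        folded.append({"url": original.url,
                       "mirrors": [m.url for m in mirrors]})
    return folded
```

Deciding which copy is the true original is harder than this in practice (a scraper can get crawled before the source), but date of first indexing is the obvious first-order signal.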

I have been increasingly plagued by these auto-generated sites over the past months. As of this morning (Monday 17 January 2011, European time), my Google searches are almost entirely spam-free. Is this wishful thinking or has someone at Google been reading this thread?

I’m a little late reading this blog, but the example searches seem fine: searching for “dishwasher reviews” yields lots of legitimate-looking dishwasher reviews. The first hit for “iphone 4 cases” had more case models for sale than I thought existed. Stack Overflow continues to rank very (very!) high in my search results. I don’t recall being annoyed at seeing republished copies of answers, so personalization may put the real SO higher than the scammers for me. (We all get personalized search results, so we really can’t talk about a single Google search result anymore.)

I have noticed spammy-looking republished content in search results, of course. Does Google manually adjust search results as people point them out? Or was there a very recent change to the spam recognition algorithms?

Here’s an example of a broken Google:

There’s a fairly popular YouTube video called Waffles by Julian Smith.
http://www.youtube.com/watch?v=Mj00ii1BLV8
I used a quote from it for the title of a post on my blog. Now, if you type that quote, or something like unto it (“i’m not retarded i ate a jellyfish”), into Google, the very first result is that blog post, which has absolutely nothing to do with the popular video.

It is even worse when unethical news organizations like NPR use automated tools to scrape other people’s content without permission: http://topics.npr.org/page/about-automated-pages

Check out the NYTimes article on how to game negative customer feedback and turn it into a high page rank. Gotta love the “no comment” from Google.

http://www.nytimes.com/2010/11/28/business/28borker.html?pagewanted=all

Google’s blog links to your post! http://googleblog.blogspot.com/2011/01/google-search-and-search-engine-spam.html

In short: there is no issue and there is less spam than before, but they are constantly working on improving their algorithms :slight_smile:

Hey, look! Google wrote a blog post on this:
http://googleblog.blogspot.com/2011/01/google-search-and-search-engine-spam.html
While updating my site and reading Google’s RSS feed at the same time, I stumbled onto it by accident.
They took your opinion seriously, and are working on it. Congrats! :slight_smile:

One thing to consider here is: who is it that Google is failing? They’re failing you, as the owner of Stack Overflow, because you want and expect (and have every reason to expect, based on the guidelines you quote) that your results should show up before scraped results. But are they failing their users, who are looking for content? I don’t think so. If I’m searching for something that has a result on Stack Overflow, then I’m looking for the content of that page. If the scraped page has all that same content, then I can find what I’m looking for, whether I go to stackoverflow.com or hijackstackoverflowtraffic.com. As a random user who is not part of the SO community, why would I care which site I find it on?

I haven’t personally observed this with SO, but I see it all the time with mailing list archives. I don’t care whether the archive link that comes up first is the site that officially owns the mailing list or some other random site. As long as they have the messages in a readable format, I can find the answer I’m looking for.

In the case of Wikipedia, we noticed around 2005 that a search on a piece of Wikipedia text would get three pages of mirror sites before us. I believe some people contacted Google asking “hey, what’s up with that?” and then a short while later it was fixed.

(It was about then that Wikipedia started showing up at the top of every Google result for everything…)

I think we assumed that this was a more general algorithm penalising duplicate content. But your example suggests it isn’t.

There’s always the option to limit Google searches to a particular site.

Ex. to search pink butterflies on SO:
“site:stackoverflow.com pink butterflies”
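
For what it’s worth, that kind of site-restricted query is also easy to build programmatically; this little sketch (my own illustration, nothing official) just assembles the standard google.com/search URL with the site: operator:

```python
from urllib.parse import urlencode

def google_site_search_url(site: str, query: str) -> str:
    """Build a Google search URL restricted to one site via the site: operator."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{site} {query}"})

# The example above:
print(google_site_search_url("stackoverflow.com", "pink butterflies"))
# https://www.google.com/search?q=site%3Astackoverflow.com+pink+butterflies
```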

But, I agree. Google search is starting to suck and syndication sites are taking over the internet. I really wish Google could find a way to drop the value of purely syndicated/scraped sites so the good content could be allowed to float back to the surface.

I was considering writing something to the Stack Overflow team after this happened to me the first time but, obviously, plenty of other SO users beat me to it.

http://birdswithteeth.wordpress.com/2011/02/10/libraries-raising-the-numerator/

Thanks. Cribbed some stuff.

@Jeff Atwood,

Google changed their algorithm just for you! Must be nice to have that kind of pull. :P

http://googleblog.blogspot.com/2011/02/finding-more-high-quality-sites-in.html

I just tried the search for a typical “python algorithm sum”, and Stack Overflow was in the first three results.

http://goo.gl/7N2pI

“Google Algorithm Change Targets Content Farms”

Finally!!! People are waking up to the Tholian web. Bill Joy (in some pop Java guide) had a pretty clear grasp of how useful this all really is. He didn’t seem impressed. Lately I see Google as the incredible shrink-wrapping graph. You see, there is a nasty little self-referential dividend they get with your every return visit. Ultimately the “game” plays itself and expands its mindshare like a brain slug, till you can’t remember anything without them. It’s going to be tough, but some part of myself misses actual pages and words that don’t move or flash. In the end I owe everything to books. The other thing is I’m very edgy about the “underside”, you know, those “spooky” pages that seem to have died since Google… whoa, spell check likes “G”, not “g”, oogle… creepy. I have a vague memory of loving to hunt and having tons of links to keep track of. As for what can “we” do? I think we may need to build tools for keeping track of sites, capable of analyzing and presenting the connections in useful ways.