The Importance of Sitemaps

Having worked on a small crawler for a few years, I have at least concluded for myself that the idea of using sitemaps is completely reasonable. One problem that has come up is exactly what Mark stated above about crawling calendar widgets on sites.

Another issue involved stores encoding the breadcrumb path in the URL with their unique IDs, combined with the bug(?) of being able to essentially cycle through them infinitely, producing a URL such as /1/2/3/1/2/3/2/4/3/product.asp?product_id=987 and having a breadcrumb of Baseball Jerseys Mets Baseball Jerseys Mets Jerseys Away Jerseys Mets.

I think Google (and by Google, I mean the entire Search Engine Conglomerate) would be a lot more likely to believe your claim of “I really do have 1,000 pages, each linking to interesting questions on my site!” if it weren’t for the simple linking bugs other websites have (or if it were smarter about, for example, not following links on a calendar).

Jeff,

I think you are uniquely qualified to respond eloquently to a post on the Google Webmaster blog. I only wish I had the knowledge and the audience to truly address their claim that dynamic URLs are properly indexed. The evidence in this post seems to say otherwise, however.

http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html

As to the content of this post:

I did a redevelopment of http://sixteencolors.net almost a year ago. I switched to static URLs (via URL rewriting) and created sitemap XML files. My hits from search engines almost immediately increased to 50-100x the previous numbers. I previously received 1-5 visits from search engines, and the number now floats around 150-200. I didn’t really advertise any more than before (I was linked from a number of obscure blogs), and the number has remained steady since that time. My belief is that the mixture of sitemaps, static URLs, and switching to semantic HTML has been the reason for the increased number of hits.

Great post, Jeff! Glad to see you are back!

I didn’t see that anyone had posted this, but a great place to learn about the sitemap protocol is here:

http://www.sitemaps.org/
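To give a flavor of the protocol, a minimal sitemap file is just a list of url entries; only loc is required, the rest are optional hints. (The URL and date below are placeholders, not real pages.)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/some-page.html</loc>
    <lastmod>2008-10-14</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>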

Google has no real leverage to preserve their monopoly. The minute they start under-performing, they’re vulnerable to competition.

The people who have any reason to be scared are those who have all their email, calendars, and personal data stored in Google’s cloud.

@TehOne
If your employer is blocking your access to informative websites that will help you do your job, then it’s time to find a new employer.

It’d be like trusting a financial advisor to tell you that another company would be a safer bet than his own. It’d be nice if he did, but would you put money on it?

http://whimsley.typepad.com/whimsley/2008/03/mr-googles-guid.html

My work has now blocked access to stackoverflow

Does your work use blocking software?

We had a problem early on with Websense. I followed up with Websense on September 10th, and stackoverflow.com is currently classified as information technology, just as an example.

By way of apology, I’ll share a little statistic you might find interesting: the percentage of traffic from search engines at stackoverflow.com.

This is not much of an apology. You had 1 minute to post an “I am very busy right now with SO,” but you did not. This implies a certain disdain…

Jeff,

You might try using LINK rel=Next|Prev href=nextquestionURL.html in the HEAD of each StackOverflow question. Lots of blog templates use it. The W3C spec is here: http://www.w3.org/TR/REC-html40/struct/links.html#h-12.3
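For example, something along these lines in the HEAD of each question page (the URLs here are made up, just to show the shape of it):

<link rel="prev" href="http://stackoverflow.com/questions/41/previous-question">
<link rel="next" href="http://stackoverflow.com/questions/43/next-question">

That gives crawlers an explicit next/previous chain to walk, in addition to the visible pagination links.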

Also, I agree with other posters here: Google prefers stackoverflow.com/page/2/ to stackoverflow.com/questions?page=2 because Google wants to index stable pages, not the results of dynamic queries.

OT for WordPress users: there’s a good WordPress plugin to generate sitemap.xml. Search for “wordpress sitemap generator”; it’s the top result.

There are also limits on size. The sitemaps.xml file cannot exceed 10 megabytes in size, with no more than 50,000 URLs per file. But you can have multiple sitemaps in a sitemap index file, too. If you have millions of URLs, you can see where this starts to get hairy fast.
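For the curious, a sitemap index is itself just a small XML file pointing at the individual sitemap files, roughly like this (the file names are made up for illustration):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap-questions-1.xml</loc>
    <lastmod>2008-10-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-questions-2.xml</loc>
    <lastmod>2008-10-14</lastmod>
  </sitemap>
</sitemapindex>

Each referenced file is then an ordinary sitemap, subject to the same 50,000-URL / 10 MB limits.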

You can have several sitemaps for your site and even several sitemap indexes, so there are no scalability problems.

Also consider submitting your sitemap.xml through Google Webmaster Tools – it will allow you to see useful statistics/warnings/errors (for example, if some pages in your sitemap.xml can’t be accessed by Google).

For those wanting to know where you find out about sitemaps, take a look at Evil Google’s webmaster tools and read the sitemaps section.

Here’s a direct link to their info: http://www.google.com/support/webmasters/bin/answer.py?answer=40318&hl=en

After all, suppose a search engine came along that did everything Google did, and added a few brilliant features that we could really use. How would we find out about this wondrous product? Well, we’d… google… for…

How do you think we all learned about Google? Not from searching for search engines at whatever we each used prior to Google, I suspect.

Cuil launched not long ago, and immediately picked up a number of mentions in newspapers, blogs, forums, chat systems, watercooler conversations, and so forth. I heard about it. Odds are reasonably good that you did as well. If you didn’t–well, you just did now.

Of course, Cuil turned out to be not very good: it collapsed under the load of its own launch buzz and produced some hilariously bad results for various search terms. Nobody is talking about them any more. So it goes. (They’ve improved since then, but they’ll need to do something really special to overcome that bad first impression.)

The point is, Google’s eventual replacement doesn’t need to rely on Google to find users. The world is filled with ways of propagating information.

Jeff,

Googlebot CAN crawl all of your paginated pages. You need to use this format for your URLs:

http://stackoverflow.com/questions
http://stackoverflow.com/questions/page/2
http://stackoverflow.com/questions/page/3

It’s pretty well known in SEO circles that Google is wary of following links with GET arguments, because doing so can be dangerous: following these links can trigger unwanted script execution, since many programmers do not follow the HTTP standard and implement destructive operations in scripts that are accessed with GET.

For instance, you may have a poorly implemented command like

http://stackoverflow.com/admin.php?cmd=deleteEverything

If Google followed this kind of link it would cause obvious problems, so they don’t.

The solution to this is to encode the query in the URL path rather than in a query string. This is pretty simple using mod_rewrite, Django, Drupal, or any of a number of other frameworks.
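To illustrate with Apache’s mod_rewrite (purely a sketch; Stack Overflow isn’t on Apache, and the rule assumes a list page that reads a ?page= parameter):

# In .htaccess: rewrite /questions/page/2 internally to /questions?page=2
RewriteEngine On
RewriteRule ^questions/page/([0-9]+)/?$ /questions?page=$1 [L]

The visible links use the clean /questions/page/N form, while the application still receives the familiar query string.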

If you make this simple change to how your article list pages are accessed, the Google crawler will suddenly see your entire site.

I’m a little aggravated that we have to set up this special file for the Googlebot to do its job properly; it seems to me that web crawlers should be able to spider down our simple paging URL scheme without me giving them an explicit assist.

Interesting point, tying back to Google’s current dominance of the search market. You’d expect spiders to go out of their way to index websites, and in the case of a startup launching something new and eager to get more information and more usage, for example, they’d try to do just that. But Google, being the biggest show in town, gets to reverse the rules. If you want your website properly indexed, you’re expected to play by the rules - if you don’t, it’s your loss, not theirs.

That aside, I concur that the minute Google violates their Don’t Be Evil motto (if it actually happens, and I hope it doesn’t) and crosses one line or another, it won’t be very hard for disgruntled internet users to switch search engines.

Google has no real leverage to preserve their monopoly. The minute they start under-performing, they’re vulnerable to competition.

Well, when your company has become the default VERB for the task you perform, that may not be as true as you think, especially when that task is INFORMATION DELIVERY. After all, suppose a search engine came along that did everything Google did, and added a few brilliant features that we could really use. How would we find out about this wondrous product? Well, we’d… google… for…
Ah.

It’d be like trusting a financial advisor to tell you that another company would be a safer bet than his own. It’d be nice if he did, but would you put money on it?

Oh my god! There’s only one search engine and all the others have gone away! So if google starts giving unhelpful and useless search results, we’ll have nowhere to turn to!

We’re doomed!

Either that or we’ll go back to yahoo when yahoo is slightly better than google again.

And the good news is that if the sum of all human knowledge amounts to YouTube and MySpace, it would merely be evidence that being able to access the sum of all human knowledge just isn’t that important anyway.

My work has now blocked access to stackoverflow

So sad!!!

@Mediocre-Ninja.blogspot.com on October 14, 2008 07:44 AM:

Wikipedia: no Sitemap file
Maybe this is a stupid question, but how did you find that out?

On the Wikipedia site, robots.txt has no Sitemap reference and sitemap.xml doesn’t exist.
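For context, a published sitemap is normally discoverable either at the commonly used /sitemap.xml location or via the Sitemap directive the sitemaps.org protocol defines for robots.txt, i.e. a line like this (hypothetical, since Wikipedia doesn’t have one):

Sitemap: http://en.wikipedia.org/sitemap.xml

Neither is present, which is how you can tell.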

Hi Matt

Hey Jeff, do you still know of any queries where Google isn’t doing well in terms of returning your pages when you searched for the title of an article on stackoverflow.com?

Now that sitemaps.xml is in play, everything is working exactly as I would expect it to. We’ve done some pruning in robots.txt to remove duplicates, but that’s about it.

I’d be very interested to hear of concrete example queries that I could convey back to the crawl/indexing/ranking folks over at Google.

The only suggestion I have is the one in the article – it’s a bit disappointing that googlebot couldn’t seem to crawl all our questions, as they are directly hyperlinked from each page:

http://stackoverflow.com/questions
http://stackoverflow.com/questions?page=2
http://stackoverflow.com/questions?page=3

However, now that I’ve implemented sitemap.xml, I’m starting to come around to the concept. It’s probably more efficient to feed search engines a sitemap.xml file containing links to each question than it is for us to serve up full pages of markup, JavaScript, etc. to every single search engine out there!

Great article on sitemaps. I recently added a sitemap for my site, and while I have over 10,000 links, Google actually took about 200 pages out of their index, leaving me with just over 1,000 indexed pages. The website has been around for over a year now.

What would cause this? We even advertised with AdWords a time or two.

Thanks.