So I've been busy with this Stack Overflow thing over the last two weeks. By way of apology, I'll share a little statistic you might find interesting: the percentage of traffic from search engines at stackoverflow.com.
Seems like the time to develop a sitemap.xml (and the strategy for maintaining it as it grows) is a heck of a lot cheaper than creating the search engine yourself. Since you were already planning to just leverage Google for that feature, the cost associated with sitemap.xml doesn't seem all that bad.
So, rather than being "a little aggravated that we have to set up this special file," you should just go back to being happy that the problem of search was solved so easily for you.
Meta-comment: The fact that I don’t have to create an account is great, but the fact that you can’t edit posts makes for a lot of comment chaff (like this). Is there any way one could allow post editing based on e.g. possession of a cookie?
‘There are also limits on size. The sitemaps.xml file cannot exceed 10 megabytes in size, with no more than 50,000 URLs per file. But you can have multiple sitemaps in a sitemap index file, too. If you have millions of URLs, you can see where this starts to get hairy fast.’
I can see how these constraints will lead to some difficult-to-maintain hacks for the stackoverflow sitemap file. There has got to be a simpler way for the Googlebot to work correctly. Hairy indeed.
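For what it's worth, the bookkeeping doesn't have to be that hairy. Here's a rough Python sketch (not anything Stack Overflow actually runs; the URL patterns and file names are made up for illustration) of how a site could stay under the limits quoted above by chunking its URLs into sitemap files of at most 50,000 entries each and pointing a sitemap index file at them:

```python
# Rough sketch: split a big URL list into sitemap files that respect the
# 50,000-URL-per-file limit, then write a sitemap index referencing them.
# Real code would also check each file against the 10 MB size limit.
import math
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000

def write_sitemap_files(urls, base_url="https://stackoverflow.com/sitemap"):
    """Write sitemap-1.xml, sitemap-2.xml, ... and return their public URLs."""
    chunk_count = math.ceil(len(urls) / MAX_URLS_PER_FILE)
    sitemap_urls = []
    for i in range(chunk_count):
        chunk = urls[i * MAX_URLS_PER_FILE:(i + 1) * MAX_URLS_PER_FILE]
        with open(f"sitemap-{i + 1}.xml", "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(f'<urlset xmlns="{SITEMAP_NS}">\n')
            for url in chunk:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write("</urlset>\n")
        sitemap_urls.append(f"{base_url}-{i + 1}.xml")
    return sitemap_urls

def write_sitemap_index(sitemap_urls, filename="sitemap-index.xml"):
    """Write the index file that search engines are actually pointed at."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(f'<sitemapindex xmlns="{SITEMAP_NS}">\n')
        for url in sitemap_urls:
            f.write(f"  <sitemap><loc>{escape(url)}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")

if __name__ == "__main__":
    # Hypothetical question URLs; a real site would pull these from its database.
    urls = [f"https://stackoverflow.com/questions/{i}" for i in range(1, 120_001)]
    write_sitemap_index(write_sitemap_files(urls))
```

The ongoing maintenance cost is then mostly just regenerating the chunks on a schedule as new URLs are added.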
Scalability is (or should be) a non-issue for sitemap.xml. The purpose of this file for a large site isn't to list hundreds of thousands of unique URLs at once, but rather to allow spiders to discover those URLs ONE TIME. Once the initial discovery has happened, Google should (for a high-traffic, widely-linked-to site) continue to spider those URLs, which in turn link to neighbor URLs, and so forth. In this way you can spider an entire site of many tens of thousands of URLs via a few thousand URLs in the sitemap.
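To make that "a few thousand seed URLs" idea concrete, here's a minimal, purely hypothetical Python sketch: instead of enumerating every question, the sitemap lists only paginated listing pages, from which a crawler can reach every individual question by following links. The URL pattern and page size are assumptions for illustration, not Stack Overflow's actual setup.

```python
# Hypothetical: list only "hub" pages in the sitemap (paginated question
# listings), letting the crawler discover individual questions by following
# the links on those pages.
def seed_urls(total_questions, per_page=50):
    """Yield listing-page URLs that collectively link to every question."""
    pages = (total_questions + per_page - 1) // per_page  # ceiling division
    for page in range(1, pages + 1):
        yield f"https://stackoverflow.com/questions?page={page}"

if __name__ == "__main__":
    seeds = list(seed_urls(120_000))
    print(len(seeds))  # 2,400 sitemap entries instead of 120,000
```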
'I certainly never needed a sitemap on codinghorror.com.'
Jeff, I think in the case of Coding Horror there were plenty of trackbacks and other blogs linking to your posts, making it easier for the somewhat dimwitted Googlebot to find them.
On a side note, if I click the www.codingwheel.com author name in the comment above in Firefox 3, I get a "content encoding error, page cannot be displayed" message. Just a heads up =p
You don't need the sitemap; you can wait till Googlebot gets around to indexing your site. Apparently, that wasn't good enough for you. Don't blame your impatience on the poor bot.
You may be drawing causality from coincidence on the sitemap.
The Google algorithm usually displays new sites high in the rankings immediately, then sandboxes them for a few days or weeks until they gain PageRank. Finally, they pop back to an accurate position.
During that sandbox period, it's normal to search for unique terms and find other (non-sandboxed) sites, yet not your own.
I’ve seen that pattern play out with every new site I launch, independent of SEO efforts (including sitemaps).