The Importance of Sitemaps

Thanks for this review, really good info.

Those pixelated pictures look like Transport Tycoon managers.

@Damien Neil:

First off, the Google login also ties together people’s emails, browsing history, web browser (in the case of Chrome), and so on :). Many sites have search functions powered by Google, and of course their ads are everywhere, funding sites across the web, and are much harder to avoid.

But the problem is that it’s not really about you changing search engines. The search near-monopoly might be leveraged in the advertising field, in the same way that it’s not really about Microsoft cleaving to the x86 (and x64 and Itanium) architectures; it’s about the other end, where the desktop OS might be leveraged into other software markets.

In neither case am I convinced that anything is going on right now, but you have to be vigilant.

The other potentially scary things about Google are the information they can gather on you and the risk of censorship. The information-gathering you really can’t stop; you can only cut off the flow of new information, and even then, with tracking pixels and so forth, you can’t truly cut them off. On censorship, Google has a fairly decent record, China notwithstanding, and it’s more easily avoided by switching search engines…if you know that the censorship is going on. It doesn’t even have to be intentional censorship: you’ll have a hard time displacing Wikipedia even if you make a better product, in large part because Google will likely rank Wikipedia #1 for just about every possible search for a while to come.

sitemap.xml is not just a Google thing. Every major search engine understands it.
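For reference, a minimal sitemap.xml follows the sitemaps.org protocol; the URL and values below are placeholders, and only `<loc>` is required per entry:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/questions/1</loc>
    <lastmod>2008-09-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```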

Hey Jeff, do you still know of any queries where Google isn’t doing well in terms of returning your pages when you search for the title of an article on stackoverflow.com? I’d be very interested to hear concrete example queries that I could convey back to the crawl/indexing/ranking folks at Google. We always want to improve, or else people will go to another search engine.

I can’t tell from what you’ve written whether it really was a sitemaps file that solved the problem, or one of the other issues that are common to a new site launching.

Regardless, glad it was fixed. But I have to completely disagree with your idea that you somehow shouldn’t need to do something special for Google to do its job properly…

First, Sitemaps is a common standard supported by Microsoft and Yahoo, as well.

Second, people do all types of things in coding pages to accommodate those using a particular browser, particular plug-ins, and so on. Search engines are effectively the most common browser out there. Taking a minor amount of effort to ensure your site renders properly in them can deliver, as you have found, a huge traffic gain. So if you’re having to consider them a bit, that’s just the routine of good web development, not something special in my book.

Third, by and large, you don’t have to do anything special to be crawled. The vast majority of web sites don’t use sitemaps and get indexed just fine.

Why not reserve the page IDs, so that page 1 always lists the first 10 questions ever posted, page 2 always lists questions 11 to 20, and so on?

http://stackoverflow.com/questions?page=2

And when the page number is not specified, default to the last page (e.g. 9182).

That way the page content does not change every time the search engine indexes it.
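A rough sketch of that fixed-page scheme (the page size and function names here are invented for illustration): question IDs map to a stable page number, and when no page is specified, the site defaults to the last, newest page.

```python
PAGE_SIZE = 10  # questions per page (assumed)

def page_for_question(question_id):
    """Question IDs 1-10 land on page 1, 11-20 on page 2, and so on."""
    return (question_id - 1) // PAGE_SIZE + 1

def questions_on_page(page, total_questions):
    """Return the fixed ID range shown on a given page."""
    start = (page - 1) * PAGE_SIZE + 1
    end = min(page * PAGE_SIZE, total_questions)
    return list(range(start, end + 1))

def default_page(total_questions):
    """When no page is given in the URL, show the last (newest) page."""
    return page_for_question(total_questions)
```

Because a question’s page assignment never changes, a crawler revisiting `?page=2` always sees the same content, which is the whole point of the suggestion.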

What a joke.

@Matt Johnson:
Of course now that you posted that link to codinghorror, that article is the top result: http://www.google.com/search?q=N-ary+trees+in+C
:wink:

@Jeff:
I agree with Damien Neil. I’ll have to admit that, as a Microsoft guy, you don’t surprise me by saying that Google’s dominance scares you more than Microsoft’s, but let it be heard that changing your internet homepage is A LOT easier, quicker, cheaper, and less of a hassle than changing your desktop operating system.

Say it takes you ten seconds to change your homepage, and three hours to overhaul your PC. Mathematically, it’s 1,080 times more time-consuming to get rid of Microsoft than it is to get rid of Google.

If Google shuts its doors this afternoon, millions of people will start using Yahoo, Ask, or MSN instantly. If Microsoft does the same, millions of people will have no idea what to do when something goes wrong.

Market dominance always leads to complications, but as dependent as we choose to make ourselves on the web, it’s still a very volatile place.

crackOverFlow.com has needles in its logo. But crack is smoked, not injected.

Hmmm…would it be possible, I wonder, to do something crazy with mod_rewrite to generate this on the fly? If you run a database-backed site, you could write a script that dynamically dumps the current state of the site in sitemap format, then rewrite requests for sitemap.xml to your script.

The bot requests sitemap.xml and receives an up-to-the-second sitemap. If you’ve got a big site, I’m sure it wouldn’t be too hard to feed the bot a bunch of dynamically generated 50k-URL-per-file sitemaps.

This would certainly get past the maintainability problem, but would it work?
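As a sketch of that approach (framework-agnostic; the site URL and function names are invented here): a script pulls the current question IDs from the database, emits sitemap files of at most 50,000 URLs each (the protocol’s per-file limit), plus an index pointing at them, so the rewritten sitemap.xml request is always current.

```python
BASE = "http://example.com"  # hypothetical site root
MAX_URLS = 50000             # per-file URL limit from the sitemaps protocol

def sitemap_chunk(question_ids):
    """Render one <urlset> file for a batch of question IDs."""
    urls = "\n".join(
        "  <url><loc>%s/questions/%d</loc></url>" % (BASE, qid)
        for qid in question_ids
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            "%s\n</urlset>" % urls)

def sitemap_index(chunk_count):
    """Render the index file that points at each chunk file."""
    entries = "\n".join(
        "  <sitemap><loc>%s/sitemap-%d.xml</loc></sitemap>" % (BASE, i)
        for i in range(1, chunk_count + 1)
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            "%s\n</sitemapindex>" % entries)

def build_sitemaps(all_ids):
    """Split the full ID list into 50k chunks; return (index, chunk files)."""
    chunks = [all_ids[i:i + MAX_URLS]
              for i in range(0, len(all_ids), MAX_URLS)]
    return sitemap_index(len(chunks)), [sitemap_chunk(c) for c in chunks]
```

The rewrite rule would then map `sitemap.xml` to whatever serves `sitemap_index`, and `sitemap-N.xml` to the Nth chunk.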

I guess the need for sitemaps is simple: if a web crawler starts going wild, it may end up at links like, I dunno, http://site.com/questions?page=12345

And a human can tell there’s nothing on page 12345 BUT A BOT CAN’T

Hehe, I tried to see how you have it implemented now (if at all), but your robots.txt points to sitemap.xml, which doesn’t exist (yet?).
Now I’m off to try and find sitemap files on other websites to see how they have it covered. I’m really interested in what would be the best way to get your whole site (e.g. all the questions on SO) indexed within the limits of the sitemap file.

-Addition-
Google: lists a LOT of URLs, not really useful
Wikipedia: no sitemap file
MSDN Social: now this is interesting; not only do they specify multiple sitemap files inside their robots.txt (probably one for each category), they have crawler-specific URLs in them, probably each generating a list of posts (http://social.msdn.microsoft.com/robots.txt).
Tweakers.net (Dutch tech site): lists multiple sitemap files in their sitemap index file, each pointing to a range of IDs (http://tweakers.net/sitemap_index.xml).

This is all very interesting!

Thanks Jeff, this sitemap thing is new to me, too!

Wikipedia: no Sitemap file
Maybe this is a stupid question, but how did you find that?

@Qvasi: Ahh, I should have remembered the laws of internet quantum mechanics… if you complain about something not being found by Google on a popular website, it will be found :wink:

I have hundreds of interior categories on one of my sites that I wish would show up in search engines. I added a sitemap.xml, but that did not help. I kind of think you need a few (dozen?) outside links into your underlying categories for the search engines to take them seriously.

Very interesting stats on the power of Google, btw, thank you.

I have iGoogle as my browser’s start page. It has a nice layout for stuff and links. Plus, I use Google as a search engine, and I have two Google search bars in my browser too. They remember my last searched words and suggest new ones, so when I type “c” into the search box, it suggests Coding Horror at the top. Often I don’t have to type more than one letter and I get what I need without even visiting the Google.com page.

Google remembers posts that I have made to a forum but that a moderator deleted for some reason. I had a valid post, but it was deleted along with some other people’s posts. I typed a couple of words into Google and got my post back.

Two things:

  1. The dominance of Google is a little uncomfortable, but it will become less so if they ever give us developers a way to force the inclusion of symbols we need in a search. Quotes don’t do anything when you’re looking for information on anything that has a $ in it, for example. If you want to look up something like the $get ASP.NET AJAX shortcut, you’ll be sadly disappointed by how impossible it is to sort through all the results that googling for “get” will give you. :stuck_out_tongue:

  2. You must be doing something really right; I was googling around recently for clarification on an answer I was writing to an SO question that had just been posted, and the very first result on Google was… the question I was answering on SO, which somehow got indexed within about five minutes of being posted!