The Importance of Sitemaps

Isaac:

You are not correct. Here is an example of 50,000 pages on Google that are indexed with a GET parameter:

http://www.google.com/search?q=Line+1%3A+Incorrect+syntax+near++filetype%3Aasp

Jeff, on a semi-related note, if you added the title of the post to your URL on this blog (e.g. http://www.codinghorror.com/blog/archives/the-importance-of-sitemaps.html instead of http://www.codinghorror.com/blog/archives/001174.html), you would probably get a lot more search traffic. Google weighs text in the URL very heavily.

Well, well, well. Apparently I need to do more fact checking before I post well-known myths.

According to this article: http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html the problem may actually be the opposite of what I was saying - that links which appear to be static, such as http://stackoverflow.com/questions/24109/c-ide-for-linux (in fact a dynamic page with frequently changing content), actually hurt their chances of being re-indexed. From the article:

One recommendation is to avoid reformatting a dynamic URL to make it look static. It’s always advisable to use static content with static URLs as much as possible, but in cases where you decide to use dynamic content, you should give us the possibility to analyze your URL structure and not remove information by hiding parameters and making them look static.

Using GET parameters instead of the ‘permalink’ slug style actually causes Google to automatically assume the page is more dynamic.
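
To make that distinction concrete, here is a rough sketch of the kind of rewriting Google’s article warns about. It assumes an Apache-style setup with a hypothetical questions.php handler (not Stack Overflow’s actual stack), purely for illustration: the public URL looks static, but underneath it is still the same dynamic handler with an id parameter.

    # Hypothetical Apache rewrite rule, for illustration only:
    # the crawler sees /questions/24109/c-ide-for-linux and has no way to tell
    # that only the numeric id matters and the rest is decoration.
    RewriteEngine On
    RewriteRule ^questions/([0-9]+)(/.*)?$ /questions.php?id=$1 [L,QSA]

In other words, the slug form hides the parameter structure that Google says it would rather be able to see.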

It looks like there may also be an issue with links like this, which represent the answers to questions: http://stackoverflow.com/questions/24109/c-ide-for-linux#24119

It would seem that to Google, this appears to be a static anchor into a static file. Therefore they may, according to this article, assume that the file hasn’t changed. If anyone links to a URL like this one, Google will see it and decide that it does NOT need to look at the page again, because it isn’t dynamic content. Of course, if they parse out anchor tags for every indexed file and see in their database that #24119 wasn’t there when they last indexed it, they might decide to take another look at the file. However, this behavior is unlikely, as it goes against their recommendation NOT to make dynamic URLs appear to be static ones.

I really doubt that a sitemap with ~27,000 entries is the intended usage pattern, and suspect strongly that there are gains to be made by changing the way that things are accessed. Google says so, and anyway a sitemap that large certainly isn’t very elegant.
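
For reference, the sitemaps.org protocol allows up to 50,000 URLs per sitemap file, and larger sites split their entries across several files tied together by a sitemap index. A hypothetical index (file names invented for illustration) looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://stackoverflow.com/sitemap-questions-1.xml</loc>
        <lastmod>2008-10-25</lastmod>
      </sitemap>
      <sitemap>
        <loc>http://stackoverflow.com/sitemap-questions-2.xml</loc>
        <lastmod>2008-10-25</lastmod>
      </sitemap>
    </sitemapindex>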

The failure page for the “Enter the word” thing caused my links to be doubled up. Sorry about that. Might want to fix that bug too…

Jeff,
I don’t understand what you and some of the others have against Google. The best I can make out is that you must be following the old proverb: better to deal with the devil you know (Microsoft) than the devil you don’t know (Google). In my opinion that proverb does not apply in this case because… Google is not a devil. I don’t think anybody can seriously question that Microsoft has performed questionable acts. And they were in fact found to be a monopoly, one which was being used abusively.

Can someone please give one example of where Google has screwed any segment of the population? Did Google tell you the only way to put SO on the web was to go through them? Did they tell you what software to use? Did they mandate your operating system? Development tools? What exactly have they done to screw you (Jeff) or anyone?

What I know is this:

  1. I’ve been using them since 2000-2001.
  2. They’re good at searches for me. Very good. I’m a satisfied customer.
  3. I like the e-mail services. I can use POP and I do so.
  4. They don’t spam me.
  5. They don’t try to tell me what OS to use, or what browser.
  6. They don’t install spyware on my box.
  7. They don’t charge me to do searches.

What exactly is there not to like? I can see why Microsoft is nervous – they’re not in the search game and will never be at this rate – which means they don’t get any advertising $$. And they’re always afraid that someone like Google might push out an OS that makes Microsoft irrelevant for everything except maybe game consoles.

But what exactly have they done to you, Jeff, that makes you nervous?
You have worked with Microsoft products for almost two decades now. You know their business history and practices.

Microsoft doesn’t make you nervous, but Google does?!?!

Here’s another perspective that might help you see who you should really be worried about: if tonight you suddenly decided to replace the whole Microsoft toolchain with, say, LAMP (Linux, Apache, MySQL, Perl/Python/PHP) - OK, leave the web content exactly the same - would Google call you all pissed off and screw you over? Would they even care? Google does not care how you get the content onto your website, just that it’s there and it’s crawlable. Your content and your business are your own.

Hey, you can get a sitemap generated for you for free at http://www.bitbotapp.com
-h

I know it’s been a while since you did this post (and I commented on it: http://www.ninebyblue.com/blog/increasing-search-indexing-coverage-with-an-xml-sitemap/), but wanted to follow up and see how things were going.

Looking at the site more closely, I wonder if the issue wasn’t really about dynamic content or the pagination links, but more that URLs like stackoverflow.com/questions?page= look like search results. And if Googlebot originally decided not to crawl those pages because they looked like search results, then it might not have ever gotten to the article pages themselves (without the help of the XML sitemap).

In addition, Googlebot may have backed off crawling the questions?page= URLs because it detected infinite crawling issues, since you can view the question lists using different filters (number of questions listed per page, by tag, etc.).
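
One hedged sketch of how that crawl space can be bounded - with parameter names partly invented rather than whatever Stack Overflow actually exposes - is a robots.txt that keeps crawlers out of the filtered list views while leaving the individual question URLs alone:

    User-agent: *
    # Hypothetical rules, for illustration: keep crawlers off the endlessly
    # filterable list pages so they spend their budget on the question pages.
    Disallow: /questions?page=
    Disallow: /questions?sort=
    Disallow: /questions?pagesize=

With an XML sitemap feeding the question URLs directly, the paginated lists matter less as a discovery path anyway.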

I imagine what you’re really interested in is having the individual articles rank well anyway, and the question lists are there more for user navigation than as an entry point from a search result.

I would be curious to know how things have progressed since October.

I guess Stack Overflow gets special attention from Google. I checked a couple of questions by searching for their titles in Google. All of them were indexed within an hour of being posted.

My sitemap shows this error, please help me…

Your sitemap is being refreshed at the moment. Depending on your blog size this might take some time!

Warning: fopen(D:\Domains\gosshollywood.com\wwwroot\wordpress\wp-admin\options-general.php/sitemap.xml) [function.fopen]: failed to open stream: No such file or directory in D:\Domains\gosshollywood.com\wwwroot\wordpress\wp-content\plugins\google-sitemap-generator\sitemap-core.php on line 1692

Warning: gzopen(D:\Domains\gosshollywood.com\wwwroot\wordpress\wp-admin\options-general.php/sitemap.xml.gz) [function.gzopen]: failed to open stream: No such file or directory in D:\Domains\gosshollywood.com\wwwroot\wordpress\wp-content\plugins\google-sitemap-generator\sitemap-core.php on line 1705

Jeff,

All the conflict in the information can be explained. I am pretty sure it isn’t because of the URL parameters, but because of the nature of the content in the pages and its relation to the URLs.

While your paging allows browsing all the questions through normal GETs, the content served for each page in the question list constantly changes. This happens with all the categories a crawler could use to get to the pages. Questions move through the pages of some of the categories relatively fast.

The sitemap allowed you to avoid the above situation. The other real/direct benefit is on bandwidth: if you use the change date/time for each of the items, Google doesn’t have to check whether every single piece of info has changed.
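
Concretely, that bandwidth benefit comes from the optional <lastmod> (and <changefreq>) fields on each sitemap entry; a hypothetical entry, with an invented date, might look like:

    <url>
      <loc>http://stackoverflow.com/questions/24109/c-ide-for-linux</loc>
      <lastmod>2008-10-24</lastmod>
      <changefreq>daily</changefreq>
    </url>

If the <lastmod> value has not moved since the last visit, the crawler has a good reason to skip re-fetching that page.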

I have hundreds of interior categories on one of my sites that I wish would show up in search engines. I added a sitemap.xml, but that did not help. I kind of think you need a few (dozen?) outside links into your underlying categories for the search engines to take them seriously.

You have to be kidding when you compare Microsoft against Google. Microsoft has such a psychological hold on corporate America it’s not even funny. Some people are literally scared to try something else. Now, turning off Google for, say, bling.com, ask.com, or yahoo.com is easy…

Try that with Word, Excel, PowerPoint, SharePoint, Visual Studio…

Please!

Thanks for this gem of insight. One to add in to my new site rebuild.

Keep that coding crack coming!

I’m not at all worried about Google’s dominance of the web’s homepage. If Google were to disappear today, I doubt it would take very long at all for people to adapt - it’s not like it’s hard to use another search engine; there’s virtually no cost to switching.

Much more problematic would be the disruption to all those sites that depend on AdSense for a large portion of their income. If Google has become indispensable, it’s not for their search page, but for their services (mostly ads, but Gmail and some other apps also spring to mind). It’s much harder to switch services, and their relative advantage in that field is also far greater.

On the topic of poor indexing performance: given the occurrence of infinite sets of URLs (a bad paging control would suffice, as would the breadcrumb issue described by Dan), broken URLs with damaging side effects, and the necessity to actually pay for the servers used to store the indexes, it shouldn’t be that surprising that the crawler is at least somewhat conservative on new sites - it’s literally impossible to be perfect, since a naive crawler might damage the site and/or get lost in an infinite loop.

Incidentally, contrary to what Isaac states, Google does crawl URLs with a query string (of course, if URLs change all the time you’ll have problems).

A sitemap is very important because navigation of a site will be much easier with one. If your visitors browse your site and get lost among the thousands of pages, they can always refer to your sitemap to see where they are.
