URL Rewriting to Prevent Duplicate URLs

As a software developer, you may be familiar with the DRY principle: don't repeat yourself. It's absolute bedrock in software engineering, covered beautifully in The Pragmatic Programmer and even more succinctly in this brief IEEE Software article (pdf). If you haven't committed it to heart by now, go read those links first. We'll wait.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2007/02/url-rewriting-to-prevent-duplicate-urls.html

Ah, big fat softballs (spring training is already on, so there). Let’s take the thread down the other axis; maybe not. The article by the Pragmatic Programmer guys linked above references an essay on their site that treats DRY and the nature of OO more extensively. Near the top it has this:

Procedural code gets information then makes decisions. Object-oriented code tells objects to do things.
— Alec Sharp

That’s from a 1997 Smalltalk book. Yet, begat by the Apache folks among others, what passes for OO code to this day is, by this definition, still procedural. Tell, don’t ask.

The essay:
http://www.pragmaticprogrammer.com/ppllc/papers/1998_05.html

Is there any “correct” way a URL should be formed?

I mean, should it follow the files.ext and folders/ convention?

So this is correct:
…/blog/archives/000797/
…/blog/archives/000797.html
…/mypage.aspx

and this is wrong:
…/blog/archives/000797
…/mypage

In other words, should the url mapping represent a file system?

Why should the user be bothered about the file extension? php, aspx, ashx, etc…

“referer” is a known misspelling that was misspelled so long ago, they’ve decided to leave it alone in the name of familiarity.

Here’s an insane idea: let’s just dump HTTP altogether and come up with another standard.

HTTP? I meant the URL standard. My brain no worky so good as of late.

http://msdn2.microsoft.com/en-us/library/ms228302.aspx

Why should the user be bothered about the file extension? php, aspx, ashx, etc…

To a certain extent, that’s true-- for scripting languages. Advertising your language is like painting a big “hack me” target on your site.

But for media content types, it’s aggravating not to have the file extension:

http://www.megginson.com/blogs/quoderat/2007/02/15/rest-the-quick-pitch/

OK, a resource is a sort-of Platonic ideal of something (e.g. “a picture of Cairo”), while a representation is the resource’s physical manifestation (e.g. “an 800×600 24-bit RGB picture of Cairo in JPEG format”). Yes, as you’d guess, it was people with or working on Ph.D.’s who thought of that. For a long time, the W3C pushed the idea of URLs like http://www.example.org/pics/cairo instead of http://www.example.org/pics/cairo.jpg, under the assumption that web clients and servers could use content negotiation to decide on the best format to deliver. I guess that people hated the fact that HTTP was so simple, and wanted to find ways to make it more complicated. Fortunately, there were very few nibbles, and this is not a common practice on the web. Screw Plato! Viva materialism! Go ahead and put “.xml” at the end of your URLs.

And with video files, it’s aggravating even if you have the file extension. Good luck finding the correct magical combination of codecs you need to play that video.

DRY principles aside, why should I care about Google page rankings? Sounds like ego-boo to me.

I like to leave all my URL extensions .html regardless of what the scripting language is. After all, that’s the end product actually getting sent to the user. Then in the future, if I do decide to change to a different language, I can still keep the .html URLs.

Everything looks nice and static.
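
One way to get that effect is a purely internal rewrite, so the .html form is all the visitor ever sees. A minimal sketch in ISAPI_Rewrite-style syntax (like the rules from the original post); render.aspx and the slug parameter are made-up names for illustration:

    [ISAPI_Rewrite]

    # Serve static-looking URLs from a script without exposing the language.
    # No [R]/[RP] flag means no redirect is issued: the rewrite happens
    # server-side and the address bar keeps the .html URL.
    # (The \? escaping follows ISAPI_Rewrite conventions for the target;
    # other rewriters such as mod_rewrite differ.)
    RewriteRule ^/articles/([^/]+)\.html$ /articles/render.aspx\?slug=$1 [I,L]

If the back end later moves to PHP or anything else, only that one rule needs to change; every published .html link keeps working.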

why should I care about Google page rankings? Sounds like ego-boo to me

You care about it, because without decent PR, nobody can find your content. Unless you’re writing solely for yourself, in which case why bother putting it on the intarwebs?

But I agree with the sentiment. For me, it’s simple DRY-- I want one and only one URL for any given piece of content, and I certainly don’t want people bookmarking the non-www version of that URL, or the version with the annoying and unnecessary /index.htm appended.
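
Both of those can be collapsed into the canonical form with permanent redirects. A rough sketch along the lines of the rules in the original post, assuming ISAPI_Rewrite-style syntax, with example.com standing in for the real domain:

    [ISAPI_Rewrite]

    # Send the bare domain to the www form with a 301 ([RP] = permanent).
    RewriteCond Host: ^example\.com$
    RewriteRule (.*) http\://www.example.com$1 [I,RP]

    # Drop the unnecessary index/default page names from the end of URLs.
    RewriteRule (.*)/index\.htm$ $1/ [I,RP]
    RewriteRule (.*)/index\.html$ $1/ [I,RP]
    RewriteRule (.*)/default\.aspx$ $1/ [I,RP]

Because these are 301s, anyone who bookmarks or links the non-canonical form still lands on (and eventually passes rank to) the one true URL.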

Why should the user be bothered about the file extension? php, aspx, ashx, etc…

In the case of “ASP.NET + URL rewriting”, you need to tell IIS which file extensions you would like to “redirect”.

You are assigning an ISAPI filter to a particular extension.

For example
http://www.featurepics.com/online/Red-Fox-Pics178160.aspx

(Redirect rules applied to
http://www.featurepics.com/image/img.aspx?fid=178160&show=image)

works, but Red-Fox-Pics178160 (without the .aspx extension) doesn’t.
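
In other words, the friendly URL has to keep an extension that IIS actually routes to the rewriting component. A guess at the kind of rule involved, in ISAPI_Rewrite-style syntax; the fid/show parameters come from the URL above, the rest is assumed:

    [ISAPI_Rewrite]

    # IIS only hands requests with a mapped extension (.aspx here) to the
    # ASP.NET/ISAPI component doing the rewriting, so the friendly URL must
    # end in .aspx -- the extensionless form never reaches this rule at all.
    RewriteRule ^/online/.*?(\d+)\.aspx$ /image/img.aspx\?fid=$1&show=image [I,L]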

My favorite URL rewriting behavior of all time is SharePoint (WSS) v2, AKA the 2003 version. It will take a perfectly formed URL like

http://wss-site.intranet/site/subsite/

and will actively append default.aspx to it immediately. It’s a perfectly backwards implementation of URL rewriting: let’s take a good URL and make it less friendly! Awesome!

I guess I’m missing something. If you are using URL rewriting to map external links to different (or “cleaner”) URLs, how is this helping your page rank?

If you rewrite all variations of http://(www.)?foo.com(/index.html?)? to http://foo.com/ then aren’t you now supporting many external links - that will all get ranked separately - to the same ultimate destination? Isn’t that precisely what you were trying to avoid?

Am I misunderstanding how URL rewriting works?

external links to different (or “cleaner”) URLs

In the above case, we are mapping different URLs to a single URL. We are narrowing choice, not widening it.

Try it yourself. Navigate to http://google.com and see what shows up in your address bar. Or, try Scott’s site. Navigate to http://www.computerzen.com and, again, see what shows up in the address bar.

In the above case, we are mapping different URLs to a single URL. We are narrowing choice, not widening it.

I understand that but doesn’t that mean that people could be linking to you by many different URLs and Google would keep stats on each of those “source” URLs thereby dividing your stats across many variants of the same effective URL, and lowering your rank? Or does Google resolve the URL to its destination value and see them all as the same URL?

Of course nothing stops people from making up random hyperlinks, but you should do your damndest to make sure those links NEVER appear in the address bar of the browser.

does Google resolve the URL to its destination value and see them all as the same URL

That is the magic of the 301 Permanent Redirect:

http://www.seroundtable.com/archives/007233.html

In the rules you saw above, [RP] means 301 “permanent redirect”, and [R] means 302 “temporary redirect”.
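
In rule form, the difference is only the flag, but the effect on the index is very different. A small sketch, again assuming ISAPI_Rewrite-style syntax with hypothetical paths:

    [ISAPI_Rewrite]

    # [RP] sends 301 Moved Permanently: search engines transfer any
    # accumulated rank to the target and drop the old URL from the index.
    RewriteRule ^/old-page\.html$ /new-page.html [I,RP]

    # [R] sends a 302 (temporary) redirect: the old URL stays in the index,
    # so duplicates are never consolidated. Use it only for genuinely
    # temporary moves.
    RewriteRule ^/promo$ /current-campaign.html [I,R]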

So, the value of URL rewriting comes from the fact that Google correctly canonicalizes links its spider finds before entering them into their index. Otherwise, it would be a net loss, as URL rewriting would allow countless distinct URLs to the same content.

Interesting. Google was successful in the beginning because its search page wasn’t a crowded, cluttered mess of ads or irrelevant links, and it produced good results because its index modeled the real behavior of the web reasonably well. Now implementation details of its index scoring have become an independent variable determining how much traffic a web site gets.

Are we seeing Google transition from an “observe, and correctly evaluate” methodology to “observe, and define rules to influence mass behavior”? Something about that bothers me.

Another problem you don’t mention above involves Google’s PageRank algorithm (and probably other search engines): if your page content gets indexed under different URLs, Google counts it as duplicate content and automatically lowers the PageRank - on the version that was added last. You can use Google Sitemaps to reduce the problem, but you can’t completely rule it out unless you use redirects.

I’m pretty sure there’s a free version of ISAPI_Rewrite.