URL Shortening: Hashes In Practice

I don’t have time to look through all the previous comments, so forgive me if this has been said… But maybe they just use base-36 (or similar) IDs, like reddit. It probably doesn’t give as many uniques as hashing, though.

    1> 36#zzz.
    46655

But as far as I know, TinyURL and others expire their unused links periodically.

Sheesh, this guy got up on the wrong side of the bed:

Wow, Jeff. You either need to lay off the Mountain Dew, get more sleep, or spend more time programming and less time blogging.

Even if the ultimate answer to the question of how URL shorteners work is a simple one, the process whereby the design was reached is worthy of study; more generally, any opportunity to be inspired by a real-world example into a consideration of theory is worth taking. Jeff’s original post reminds me of Borges’ Pierre Menard, who writes a “technical article on the possibility of improving the game of chess, eliminating one of the rook’s pawns. Menard proposes, recommends, discusses and finally rejects this innovation.”

Disclosure: I’m the author of the next service.

You forgot ItGo.es (http://itgo.es/). For the shortened URL version of Steve McConnell’s blog you would type http://itgo.es/10x into your browser, which is shorter than any of the ones you’re discussing. You can also add a description with ‘?’, for example http://itgo.es/10x?blog.

I use Base 36 with a really simple database to create the shortened URLs.
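
For reference, base-36 conversion is only a few lines. Here is a rough Python sketch (the function names are mine, purely for illustration) of turning an auto-incremented ID into a short key and back:

    import string

    ALPHABET = string.digits + string.ascii_lowercase  # 0-9 then a-z: 36 symbols

    def encode_base36(n):
        # Turn a non-negative integer (e.g. an auto-incremented row ID) into a base-36 key.
        if n == 0:
            return "0"
        digits = []
        while n:
            n, rem = divmod(n, 36)
            digits.append(ALPHABET[rem])
        return "".join(reversed(digits))

    def decode_base36(key):
        # int() already understands bases up to 36, so the reverse direction is built in.
        return int(key, 36)

    print(encode_base36(46655))   # "zzz" -- the largest 3-character key
    print(decode_base36("zzz"))   # 46655

Three characters give you 36^3 = 46,656 keys, which is why these services spill over into four and five characters fairly quickly.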

Alex,

I would agree with you if this was actually “the process whereby the design is reached.” I may be wrong, but I doubt the developers of these URL shortening sites even considered using a hash function. The thought process was probably more along the lines of, “I know, I’ll store the URL in a table and use the autoincremented primary key as the new URL’s path. Done!”
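
For what it’s worth, that “autoincrement and go” idea is easy to sketch. Here is a minimal, hedged version using SQLite; the table and function names are invented for illustration and aren’t taken from any actual service:

    import sqlite3

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

    def to_base36(n):
        # Minimal base-36 encoder for the row ID.
        s = ""
        while True:
            n, rem = divmod(n, 36)
            s = ALPHABET[rem] + s
            if n == 0:
                return s

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE)")

    def shorten(url):
        # Reuse the existing row if this URL was shortened before, otherwise insert
        # and let the autoincremented primary key become the short path.
        row = conn.execute("SELECT id FROM urls WHERE url = ?", (url,)).fetchone()
        if row is None:
            cur = conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
            conn.commit()
            row = (cur.lastrowid,)
        return to_base36(row[0])

    def expand(key):
        # Decode the short path back to the primary key and look the URL up.
        row = conn.execute("SELECT url FROM urls WHERE id = ?", (int(key, 36),)).fetchone()
        return row[0] if row else None

    print(shorten("http://www.stevemcconnell.com/blog/"))  # "1" for the first row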

I doubt the developers of these URL shortening sites even considered using a hash function

Well, it’s the very first thing I thought of.

It’s a superior solution from a technical perspective, IMO, because you wouldn’t have to do duplicate checking. This is what hashes were born to do. Am I wrong?

As you can see in my post, I determined that the resulting length of most hashes makes this a prohibitive line of thought. I still think a 40- to 64-bit hash would totally work, but there’s absolutely no way you could get a hash-based approach down to 3 characters.
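
To put rough numbers on that (my own back-of-the-envelope sketch, not anything taken from the services themselves): even a heavily truncated hash needs far more than 3 characters once you render it in base 36.

    import hashlib
    import math

    def short_hash_key(url, bits):
        # Keep only the first `bits` bits of an MD5 digest and render them in base 36.
        digest = int.from_bytes(hashlib.md5(url.encode()).digest(), "big")
        truncated = digest >> (128 - bits)
        alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
        key = ""
        while True:
            truncated, rem = divmod(truncated, 36)
            key = alphabet[rem] + key
            if truncated == 0:
                return key

    url = "http://www.stevemcconnell.com/blog/"
    print(short_hash_key(url, 40))  # roughly 8 base-36 characters
    print(short_hash_key(url, 64))  # roughly 13 base-36 characters

    # 40 and 64 bits need ceil(bits / log2(36)) characters: 8 and 13 respectively,
    # while 3 base-36 characters can only name 36**3 = 46,656 distinct URLs.
    print(math.ceil(40 / math.log2(36)), math.ceil(64 / math.log2(36)))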

Wow. When I stumbled across TinyURL, I thought that some idiot had just come up with this idea and made a bunch of money. I knew I could program the CGI for it in about 3 seconds, and my only programming background comes from “Python for Dummies”. I never thought it would take that much thought. I guess I need to give more credit.

Imagine a URL shortening service that works a bit like Mailinator does for mail.

You request the shortened URL, it is “active” for a period of time, and then it goes away, with the key becoming available for reuse after a while.

How many of you expect a shortened URL to be good for a week? A month? A year?

url123.com sounds similar to http://notlong.com/, which is my current favorite. It’s not as short as some of your examples, but you do get to assign a (hopefully) descriptive nickname to the abbreviation.

A project for hosting redirected URLs yourself:
http://lilurl.sourceforge.net/

An elegant redirection service:
http://xhref.com/

My current favourite shortening service is http://linkpot.net/, which has a big list of short English words and just picks the next one off the list for each URL you shorten.

The idea is that linkpots are easy to type into mobile phones or to read out loud to someone, and are no harder to use on a computer than tinyurl.

Some kind of “hash the URL, take the first n bits, use them as an index to the list of words” scheme was suggested, but just popping a stack of words works equally well.
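
For the curious, that suggested scheme is only a couple of lines. Here is a rough sketch; the word list and function name are invented for illustration, and linkpot’s real list is presumably far larger:

    import hashlib

    # Hypothetical word list; a real service would hold thousands of short English words.
    WORDS = ["pot", "cat", "dog", "sun", "map", "tea", "fox", "ink"]  # 8 words = 3 bits

    def word_key(url):
        # Hash the URL and use the first few bits of the digest as an index into the list.
        digest = int.from_bytes(hashlib.md5(url.encode()).digest(), "big")
        index = digest >> (128 - 3)   # top 3 bits select one of the 8 words
        return WORDS[index]

    print(word_key("http://www.stevemcconnell.com/blog/"))

The stack approach sidesteps hash collisions entirely, which is presumably why just popping the next word works equally well in practice.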

For TwitterLit posts on Twitter I’ve been using Tweetl, which has the advantage of offering stats. (To find the stats for any URL, e.g., http://tweetl.com/000, add an s: http://s.tweetl.com/000.)

Sorry about the above link. I honestly was picking that example out of the blue. I didn’t intend it to be a link to one of my own sites.

I’m really surprised no one has mentioned Lore Sjoberg’s EvilURL (http://evilurl.com/), a great tongue-in-cheek URL shortener for those who get a giggle from swear words. It even gives you a preview of what your short URL is going to be before it saves it, so if you don’t like it, just try again!

It’s a superior solution from a technical perspective, IMO, because you wouldn’t have to do duplicate checking.

Sure, you don’t have to worry about duplicate entries for the same URL, but now you need to handle the rare collision.

After all, who wants to lug around a DB solution for something as trivial as string translation?

SQLite could handle this application easily. If you consider using SQLite “lugging around a DB solution,” well, sorry.

Therefore, although tinyurl adds new strings sequentially, it also avoids duplicates by returning previously cached URL links … And it does this very, very quickly.

I didn’t mention this because I thought it was fairly obvious. I assumed every URL shortening service does duplicate checking, as the power law of popularity still applies to URLs…

In discussing hash algorithms that could avoid duplication for URL shortening services, I was really trying to emphasize that the reverse lookup is performed very, very quickly. Even when you enter a huge URL hundreds of characters long (like Google/Yahoo/MapQuest/eBay links), the reverse lookup against the cache of previously entered URLs is very fast. That’s what I find impressive. Maybe I shouldn’t be impressed, but I am. That’s the part of the scheme I find most interesting: not that duplication is avoided, but the speed of the reverse lookup.
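
That speed presumably comes down to having an index on the URL column, so the reverse lookup is a B-tree search rather than a table scan. A minimal SQLite sketch (table and index names are mine):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT)")
    conn.execute("CREATE UNIQUE INDEX idx_urls_url ON urls(url)")

    # With the unique index in place, the duplicate check / reverse lookup
    # ("has this long URL been shortened before?") uses the index, not a full scan,
    # so it stays fast even with millions of rows and very long URLs.
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM urls WHERE url = ?",
        ("http://www.stevemcconnell.com/blog/",),
    ).fetchall()
    print(plan)  # the plan reports a search using idx_urls_url rather than a table scan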

And I’ve noticed we’ve been doing a little dance today over two topics – I hope we’ve both got Petzold’s sense of humor. :)

Quick note - love your site and your techclectic topics…

on this one, though, I think you drove off into the weeds:

“Some of the services use an absurdly small number of characters in their hash keys-- 1YU, 808, cuy-- to represent the entire Steve McConnell blog URL.”

The string at the end of the URL, for instance the “1YU” part, is just a simple rolling string index; the next hit on that site might return 1YV, then 1YW, then 1YX.

One good way to do this would be to keep two tables: one with the rolling string index and the sequence number of the request, and a second table to hold the source URL. The second table wouldn’t need a search index at all.

Table 1: Key | Request
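
As a rough sketch of that two-table layout (the table and column names here are my own guesses at what the comment describes, not anything from a real service):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE keys    (key TEXT PRIMARY KEY, request INTEGER);  -- rolling string index -> request number
        CREATE TABLE sources (request INTEGER PRIMARY KEY, url TEXT);  -- request number -> source URL
    """)

    def lookup(key):
        # Resolve a short key to its source URL via the request number.
        row = conn.execute("SELECT request FROM keys WHERE key = ?", (key,)).fetchone()
        if row is None:
            return None
        src = conn.execute("SELECT url FROM sources WHERE request = ?", row).fetchone()
        return src[0] if src else None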

I think you are right. In theory any key below the most recent one should resolve, and it seems to in the few tests I have made.

Try these randomly chosen URLs:

http://qurl.net/A
http://tinyurl.com/y

This is also a fun way of testing how popular they are (unless they inflate the index to seem more important). TinyURL has been used a lot, it seems.

Cool page: http://tinyurl.com/404 – there is a box saying “The storage meter is history.”

So basically it is “404 - storage meter not found”.

+1 to tinywords

You’re way overcomplicating things. The short keys in these URL redirectors – and by the way, you’re missing Carl Franklin’s favorite, Shrinkster – are just primary keys into a database table.

Why would you even need two tables? It seems to me that one would work just fine – and retrieval would be faster because joins wouldn’t be required.

I think your way of approaching it, though, is probably how they would do it – the hash seems like a bit of overkill for a simple problem.