URL Shortening: Hashes In Practice

You left out http://xrl.us/4gie

That’s shorter than anything else you posted.

I like shorl. It doesn’t have the shortest URLs, because they use some sort of syllable approach, but it’s just fun to pronounce the hashed URLs. Almost dadaistic.

Yeah, if it were me creating that kind of site, I’d be plopping every link in the database (being careful not to duplicate, of course) and just letting an auto-increment field do the work of assigning a unique ID. They may be using some ASCII conversion to mask the totally numeric key.
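
As a hedged sketch of that auto-increment-plus-ASCII-conversion idea (the alphabet and function names are just illustrative, not any particular service’s code), base-62 encoding the numeric key looks something like this in Python:

    # Illustrative sketch: turn a numeric auto-increment ID into a short key and back.
    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def encode_id(n: int) -> str:
        """Convert a non-negative database ID to a base-62 string."""
        if n == 0:
            return ALPHABET[0]
        chars = []
        while n > 0:
            n, remainder = divmod(n, 62)
            chars.append(ALPHABET[remainder])
        return "".join(reversed(chars))

    def decode_key(key: str) -> int:
        """Convert a base-62 string back to the database ID."""
        n = 0
        for ch in key:
            n = n * 62 + ALPHABET.index(ch)
        return n

    # 42 million rows (TinyURL's count, mentioned below) still fits in five characters:
    assert encode_id(42_000_000) == "2Qe7m" and decode_key("2Qe7m") == 42_000_000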

I use the mURL Firefox extension.

I’ll summarize the article for those in a hurry: URL shortening services could do something really dumb that produces too-long URLs that sometimes collide. They would do this by utterly misusing a common computer science concept. Instead, they do something completely obvious, which will be a surprise to only one person on earth: the author of this post.

I’m still trying to figure out Twitter and the likes. It all seems like communication for communication’s sake. Or maybe I’m just getting old…

We’ve discussed the topic a bit:
http://www.morningtoast.com/index.php/2007/07/communicating-is-now-out-of-control/

I’ve always thought that these sites just used primary keys, and then recycled keys that no one was clicking on?

My understanding is that tinyurl et al. don’t work as permanent links… if no one is accessing that URL then it will eventually be lost.

but I could be on crack. I just tried a tinyurl from 1 year ago and it still points to the same place.

“The URL no longer contains any hints whatsoever as to the content of the URL. It’s completely opaque. The only way to find out what’s behind that hyperlink is to actually click on it. This is not a great user experience for the person doing the clicking.”

Not sure about the other services, but TinyURL allows you to use http://preview.tinyurl.com/key which will redirect you through TinyURL’s site to show you the URL you will be taken to.

In other news, Twitter’s system is a bit flawed at the moment. Even though it automatically mini-fies a typed URL, it still counts the full URL as part of the 140 characters, resulting in a tweet that is shorter than 140 characters once it goes through the server. It means you either need to have a short URL to begin with, or you need to mini-fy it manually [both options defeat the purpose of having the service there in the first place.] I’d like to see it automagically count URLs as four characters, and then replace it with the word “link” once it goes through the server.

TinyURL used an incrementing table - it no longer does, but I suggest it’s still not a hash. From Wikipedia: someone counted up until they saw that the latest URL was tinyurl.com/dicj, then added the next one, so http://tinyurl.com/dick became a link to Dick Cheney’s page - now replaced with an apology.

In that respect: what’s the shortest base and protocol you could get? http is four letters. You could do shorter by providing wap: or ftp: links, and tinyurl.com is 11 characters. I think x.yy, where x is one letter and yy is the TLD (two letters minimum), is the shortest. Are there two-letter protocols? Probably none that most browsers understand, but ftp://g.to/xxx saves you some more letters, and www.g.to/xxx even more, if you can make sure your client understands links starting with www and without protocol:// - you’d beat anything but one-letter protocols. Sadly, web browsers don’t understand that…

Following up on George Santos’ comment, Jeff, another interesting way to think about this problem is compression. We want the URLs small, so any system that tried to use an algorithm instead of a simple lookup table would want output that is small and unique, not regular and random. The problem with hashing, as you say, is that the whole point is to not have collisions, which forces a much larger bit space than necessary (thanks to the birthday paradox). With compression, the space isn’t random or uniform; it’s carefully tailored to the specific type and frequency of the data.

The example URL might be broken down like this by the compression algorithm:

“http://forums.” (symbol 1, common)
“construx” (symbol 2, uncommon)
“.com/” (symbol 3, very common)
“blogs/” (symbol 4, common)
“stevemcc” (symbol 5, uncommon)
“/default.aspx” (symbol 6, very common)

The very common strings will sit near the top of the Huffman tree and can therefore be represented by fewer bits; the uncommon ones will sit deeper and require more bits. “.com/” is probably the most common symbol and thus may need only 1 bit to encode, while “construx” is an uncommon symbol and might need many more (perhaps 16).
Overall, these 6 symbols could likely be encoded in 40-50 bits, or 7-9 mixed-case alphanumerics. This is obviously more characters than the simple lookup table you mentioned (which is really a degenerate case of compression anyway), but it has advantages in privacy (the host doesn’t have to store a list of all URLs) and redundancy (the host can publish its encoding table, and anyone who has it can then reconstruct any address without needing a database - even a browser plug-in).
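
As a toy illustration of that Huffman idea (not anyone’s actual implementation), here is a sketch that computes code lengths for the six symbols above from made-up frequencies; a real service would derive both the symbol table and the counts from a large URL corpus:

    import heapq

    # Made-up frequencies for the six symbols above; real counts would come from a URL corpus.
    FREQUENCIES = {
        "http://forums.": 50, "construx": 1, ".com/": 500,
        "blogs/": 80, "stevemcc": 1, "/default.aspx": 300,
    }

    def huffman_code_lengths(freqs):
        """Return the number of bits Huffman coding assigns to each symbol."""
        heap = [(weight, [symbol]) for symbol, weight in freqs.items()]
        heapq.heapify(heap)
        depths = {symbol: 0 for symbol in freqs}
        while len(heap) > 1:
            w1, syms1 = heapq.heappop(heap)
            w2, syms2 = heapq.heappop(heap)
            for s in syms1 + syms2:  # each merge pushes both subtrees one level deeper
                depths[s] += 1
            heapq.heappush(heap, (w1 + w2, syms1 + syms2))
        return depths

    lengths = huffman_code_lengths(FREQUENCIES)
    print(lengths)  # ".com/" ends up with the shortest code, the two rare symbols the longest
    print(sum(lengths.values()), "bits to encode the example URL once")
    # With only six symbols even the rare ones need just 5 bits; a real table with
    # thousands of symbols would push them toward the ~16 bits estimated above.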

Joel: your math seems wrong.

a-zA-Z0-9 is 62 characters, which gets you almost 6 bits of data in a single character.

A 128-bit MD5 divided by roughly 6 bits per character works out to 22 characters to represent an MD5 in a-zA-Z0-9, which isn’t much of a savings over the 32 hex digits it would normally take.
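
A quick back-of-the-envelope check of that arithmetic (purely illustrative):

    import math

    bits_per_char = math.log2(62)                # ~5.95 bits per a-zA-Z0-9 character
    md5_chars = math.ceil(128 / bits_per_char)   # 22 characters for a full 128-bit MD5
    hex_chars = 128 // 4                         # versus 32 hex digits
    print(bits_per_char, md5_chars, hex_chars)   # 5.954..., 22, 32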

Jeff said: It’s a superior solution from a technical perspective, IMO, because you wouldn’t have to do duplicate checking. This is what hashes were born to do. Am I wrong?

I say: Yes, you are wrong; you’re replacing duplicate checking (something you only have to do once, when you request a new URL) with hash collisions - essentially breaking the algorithm you’re using.

The normal use of hashing is to cut down the search space (a kind of bucket sort) or to provide a secondary check(sum) that what you have is what you thought you should have (e.g. MD5 on zip files, essentially checking for tampering or corruption) - it’s not to provide a “unique” key.

The chances of meaningfully tampering with an MD5-checksummed file without altering its MD5 key are vanishingly small, but still greater than zero; in much the same way, the chances of generating the same MD5 (or any other kind of hash) key from two (or more) arbitrary strings (e.g. URLs) are vanishingly small, but still greater than zero.

you could do something really dumb that produces too-long urls that sometimes collide

I don’t think it’s dumb, personally. I’m not discounting the KISS solution, but it’s something worth thinking about.

For example, the idea of using some form of two-way URL specific compression on the URLs, as others have suggested, is quite clever.

TinyURL.com says on their homepage that they’ve minified 42 million URLs so far.

That means we’d need at least a 56-bit hash, which shouldn’t have significant collisions (based on the birthday paradox rule of thumb) until 2^28 or 268 million URLs were in play. To be very safe, we could go with a 60-bit hash, which would allow 2^30 or 1.07 billion URLs before collisions were a problem.

If we can encode a 128-bit GUID in 20 characters…

http://www.codinghorror.com/blog/archives/000409.html

Then we can probably encode a 60 bit hash in 10 characters.
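
A rough sketch of what that might look like, assuming we keep the top 60 bits of an MD5 and render them in base 62; the alphabet and function name are made up for illustration:

    import hashlib

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def short_key(url: str, bits: int = 60) -> str:
        """Keep the top `bits` bits of the URL's MD5 and render them in base 62."""
        digest = int.from_bytes(hashlib.md5(url.encode("utf-8")).digest(), "big")
        value = digest >> (128 - bits)
        chars = []
        while value:
            value, r = divmod(value, 62)
            chars.append(ALPHABET[r])
        return "".join(reversed(chars)) or ALPHABET[0]

    # Birthday rule of thumb: expect collisions around 2**(bits / 2), i.e. ~2**30 or ~1.07 billion URLs.
    # Caveat: 60 bits can need up to 11 base-62 characters; to guarantee 10 you need at
    # least 64 symbols (URL-safe base 64 just squeaks by, since 64**10 == 2**60).
    print(short_key("http://forums.construx.com/blogs/stevemcc/default.aspx"))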

Speaking of URL compression:

http://anres.cpe.ku.ac.th/pub/url-compression-ncsec.pdf

The prototype of this [URL compression] algorithm has been tested on a Pentium III 800 MHz machine, with 512 MB RAM. The test data set contains about 1.3 million unique URLs. The average size of the URLs is about 55 bytes. Size of a compressed URL is about 28 bytes on average, including all of the overhead, or about 20 bytes when the TreeNode is discarded. So the actual compressed data size, excluding all overhead is only 10 bytes. This yields about 50% reduction in size for web spiders’ purpose with an ability to store and retrieve the URLs on the fly. It yields about 64% size reduction for search engines, where only the ability to retrieve the URLs is required.

Really an interesting approach, and if anything, more of an argument that the large search engines should provide a URL compression service. They certainly have the data set to make the compression work…

I’m definitely with you in that storing a fixed-length representation of a potentially infinite string sounds like just the job for a hash function (collisions aside), and that it is a rather elegant solution. However, I’m curious: how would you then quickly retrieve the URL for a given hash (which, after all, one could argue is the primary function of these URL shortening services)?

If we can encode a 128-bit GUID in 20 characters…
http://www.codinghorror.com/blog/archives/000409.html
Then we can probably encode a 60 bit hash in 10 characters.

Really, don’t do that. Having worked on a well-known link-shortening site, I can tell you that it’s no fun at all fielding several emails a day from people who have entered the link incorrectly. These shortened links tend to find their way into print, where people re-type them. Characters get missed, ‘1’s become ‘I’s, ‘0’s become ‘O’s and so on. We once had a major national newspaper mis-transcribe a link this way.

We were using base36, so just A-Z, 0-9, case-insensitive, and people still had trouble. If you go using every symbol on the keyboard you’re in a world of hurt (especially given your example string: “[RbhlkkXVW+q4s(YSF0” would get URL-encoded as “%5BRbhlkkXVW+q4s(YSF0”). If I did it again, I’d use Doug Crockford’s base32 encoding (http://www.crockford.com/wrmg/base32.html), which drops easily confused characters and adds an optional check symbol to mitigate transcription errors.
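
For illustration, a minimal sketch of Crockford-style base 32 (not any particular service’s code); the useful property is that look-alike characters get folded together on decode. The optional check symbol from the spec is omitted here:

    # Crockford base32: no I, L, O or U, so '1'/'I'/'l' and '0'/'O' can't be confused.
    CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

    def encode32(n: int) -> str:
        chars = []
        while True:
            n, r = divmod(n, 32)
            chars.append(CROCKFORD[r])
            if n == 0:
                return "".join(reversed(chars))

    def decode32(s: str) -> int:
        # Fold the easily mis-read characters back onto their canonical digits.
        s = s.upper().replace("O", "0").replace("I", "1").replace("L", "1")
        n = 0
        for ch in s:
            n = n * 32 + CROCKFORD.index(ch)
        return n

    assert decode32(encode32(123456789)) == 123456789
    assert decode32("3NQG0VH5") == decode32("3NQGOVH5")  # an 'O' typo decodes the same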

Besides, as many here have pointed out, trying to hash a URL is completely pointless. URL shorteners just need to stick the URL in a database table, ASCII encode the primary key (and if you value your users’ privacy, salt the result and then use that as the lookup, to prevent guessability of links at the expense of creating slightly longer links).
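
And one hedged way to get the not-guessable property (a sketch, not necessarily what the commenter means by salting): run the sequential ID through a cheap invertible scramble before encoding it, so consecutive rows don’t yield consecutive links. The constant and modulus are arbitrary, and this is obfuscation, not security:

    MODULUS = 1 << 40                              # room for about 1.1 trillion IDs
    MULTIPLIER = 0x9E3779B97F4A7C15 % MODULUS | 1  # any odd constant is invertible mod 2**40

    def scramble(db_id: int) -> int:
        return (db_id * MULTIPLIER) % MODULUS

    def unscramble(key: int) -> int:
        return (key * pow(MULTIPLIER, -1, MODULUS)) % MODULUS  # modular inverse (Python 3.8+)

    assert unscramble(scramble(42_000_000)) == 42_000_000
    # Base-62 encode scramble(id) instead of id, then decode and unscramble on lookup.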

Finally, how in God’s name would you go about decoding a one way hash back into a URL string anyway?

I can tell you that it’s no fun at all fielding several emails a day from people who have entered the link incorrectly. These shortened links tend to find their way into print, where people re-type them. Characters get missed, ‘1’s become ‘I’s, ‘0’s become ‘O’s and so on. We once had a major national newspaper mis-transcribe a link this way.

Interesting. Then maybe the way to go is to use actual WORDS as the lookup value, rather than random-looking blobs of text. Will Thompson pointed out http://linkpot.net/, which apparently does this.

Anyone remember the old, old AOL password scheme? Two words joined by a dash? Like “JOKE-ORDER”, “BAR-COLD”, etcetera? Combining two common words might work. According to some quick preliminary research we can define “common words” at about 15,000 for most English speakers… so that’s 15,000 * 15,000 = 225,000,000 possible combinations.
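
A toy version of that two-word scheme; the word list here is obviously a stand-in for a real list of roughly 15,000 common words:

    import random

    # Hypothetical stand-in for a ~15,000-entry common-word list,
    # which would give 15,000 * 15,000 = 225,000,000 two-word combinations.
    COMMON_WORDS = ["joke", "order", "bar", "cold", "lake", "stone", "paper", "wind"]

    def word_key() -> str:
        """AOL-password-style pair like 'JOKE-ORDER'."""
        return "-".join(random.choice(COMMON_WORDS).upper() for _ in range(2))

    print(word_key())  # e.g. 'LAKE-WIND'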

Others have also mentioned some of these services allow you to explicitly name the short URLs, if the name isn’t taken. Then you could use short words that provided hints (or not) about the content… but I imagine these would get taken VERY quickly, just like the ongoing domain name land-grab.

how in God’s name would you go about decoding a one way hash back into a URL string anyway?

We still need the Table part of the HashTable; instead of an incrementing index, the hash becomes a GUID-like primary key. The more I think about this, and the more comments I read, the more I think …

  1. we should be exploring some type of URL compression, as in the paper I referenced above.

  2. Google, MSN, and Yahoo should be the people providing the link-shortening service, not random agencies and individuals on the internet.

@ Sean

Your favorite saying is incorrect. “Taken with a grain of salt” means there’s not much meat there… so little that you’ll only need a single grain of salt to season it.

So your favorite saying is actually saying the opposite of what you thought it said.

Jeff’s interpretation, not yours, is what people mean when they use the phrase. The whole little-meat, grain-of-salt thing sounds like folk etymology.

When someone says you should take something with “a grain of salt” you are being asked to use a measure of skepticism. When Jeff said you should take it with a salt lick, he was suggesting that you’d need considerably more than a healthy measure of skepticism; in fact, so much skepticism that you should be very distrustful whenever statistics are brought in. That’s a bit overstated; you should probably take Jeff’s advice with a grain of salt.

I think you’re incorrectly ignoring the problem of collisions. You shouldn’t be thinking about avoiding a “likely” collision. You should be thinking about avoiding them entirely. Even if there is a 1 in 100 chance of collision at your expected volume, it is much too high. The statistical likelihood of something is totally irrelevant once I send gramma a link to pictures of the kids and she winds up on “hot asian babes exposed.”

So in this application, collisions aren’t just an annoyance, they’re absolutely unacceptable. Hence, while hashing might seem like a good idea at first, their well-known “not unique” property should steer you away from them very early.

That’s not to say you shouldn’t use a smart hash internally to make duplication checks fast and easy. That, I believe, would be very smart.
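
A sketch of that combination, with plain dicts standing in for real database tables: the hash is only an internal index that makes the duplicate check a single lookup, while the public key stays a simple sequential one (encode it however you like):

    import hashlib
    import itertools

    keys_to_urls: dict[str, str] = {}    # stand-in for the main table: key -> URL
    url_hash_index: dict[str, str] = {}  # internal hash index used only for duplicate checks
    _ids = itertools.count(1)

    def shorten(url: str) -> str:
        """Return the existing key for a URL, or mint a new sequential one."""
        fingerprint = hashlib.md5(url.encode("utf-8")).hexdigest()
        if fingerprint in url_hash_index:  # duplicate check is one lookup, not a table scan
            return url_hash_index[fingerprint]
        key = format(next(_ids), "x")      # placeholder; base-62 encode in practice
        keys_to_urls[key] = url
        url_hash_index[fingerprint] = key
        return key

    assert shorten("http://example.com/a") == shorten("http://example.com/a")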

Some comments mention that certain services allow tagging on a descriptor like “example.com/xyz?blog”. You could do much the same thing with any service, though, by using a hash (sic!): “example.com/xyz#blog”.