URL Shortening: Hashes In Practice

codinghorror · August 21, 2007, 12:00am

I've become a big fan of Twitter. My philosophy is, when in doubt, make it public, and Twitter is essentially public instant messaging. This suits me fine. Well, when Twitter is actually up and running, at least. Its bouts of frequent downtime are legendary, even today.

This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2007/08/url-shortening-hashes-in-practice.html

Kyle · August 22, 2007, 12:00am

http://qurl.net/pages/list

Indeed. After those three are used up, it will bump up to a fourth letter. Given the 26+26+10 that it appears they are using, the method isn’t so complicated after all.

codinghorror · August 22, 2007, 12:00am

The string to the right of the URL, for instance the “1YU” part, is just a simple rolling string index; the next hit on that site might return 1YV, then 1YW, 1YX.

It helps if you read the entire post, because that’s exactly what I described in the last two paragraphs.

+1 to tinywords

Actually I count it as -1; I subtract one point for using two tables when one would suffice, and I subtract another point for not reading my post.

http://qurl.net/pages/list

Oof. Browsing this list, you can see why these sort of redirection services are quickly banned at a lot of sites. Unscrupulous people create a lot of spammy redirects…

Note that a GUID encoded in Base64 is 22 characters long instead of the usual 32-28 characters.

And in ASCII85, it’s only 20 characters…

http://www.codinghorror.com/blog/archives/000409.html

GeorgeS · August 22, 2007, 12:00am

You’re not considering that they might huffman compress their urls. If they did an analysis on the most common characters in urls, they could generate a table to compress any new url to tiny-ify. Then they could base64 the resulting string to probably get pretty tiny urls.

LucM · August 22, 2007, 12:00am

Something strange happens on my work station.

When I click on
http://tinyurl.com/404

My email from Yahoo opens !!!

I don’t take any drug.

Sean · August 22, 2007, 12:00am

Your favorite saying is incorrect. “Taken with a grain of salt” means there’s not much meat there… so little that you’ll only need a single grain of salt to season it.

So your favorite saying is actually saying the opposite of what you thought it said.

Ian_Johns · August 22, 2007, 12:00am

Although not directly stated even in this article’s last two paragraphs, a more interesting aspect of tinyurl’s URL cache is its exceptionally fast speed in reverse look-up of URL’s previously cached.

Yes, tinyurl stores new URLs sequentially in a “simple rolling string index”. This technique alone would be simple always fast since it only needs to referense/deference an index but would end up producing millions or billions of duplicate URL entries.

However, tinyurl also performs reverse-lookup returns the original index for any URL previously cached. And it does this very, very quickly. Therefore, although tinyurl adds new strings sequentially, it also avoids duplicates by returning previously cached URL links.

I just re-tested this using tinyurl links from emails 5-6 years old. These links immediately resolve to their original tinyurl link without creating a new tinyurl link.

codinghorror · August 22, 2007, 12:00am

Therefore, although tinyurl adds new strings sequentially, it also avoids duplicates by returning previously cached URL links.

I didn’t mention this because I thought it was fairly obvious. I assumed every URL shortening services does duplication checking, as the power law of popularity still applies to URLS…

Chubber · August 22, 2007, 12:00am

So what you are talking about here is a second-level DNS lookup table where not just the domain name is converted to a tight integer field (a domain name of N characters is reduced to 4 bytes) but also the rest of the URL.

And while you don’t worry about the birthday paradox, that formula calculates the number of instances you need in order to have probability = 0.5 (or 50% chance). Try it again with the probability equal to 0.0001 (1/100th of a percent), about the error level that I would be happy with when requesting page A and actually getting page B. It takes a lot less URLs. MD5 makes a great hash when looking up sessions on a computer or verifying passwords, but logging every URL in the universe? Yikes!

And tinyurl has a nice function that lets you give out a “preview” url with a few more characters that shows the person where they are going before they get there.

Of course, it is only a matter of time before these services are infested with popup ads, doubleclick cookies and google ads.

jLl · August 22, 2007, 12:00am

It’s a symmetric key, not a hash.

Hashes are meant to be one-way encryption that, in all common encryptions (MD5, SHA-1, etc.), create a hash of a uniform length irrelavent to the length of the input.

qURL (at least) gives you the key to decrypt, though their cipher, and get the original URL. In this case, the length of the key varies with the length of the input text.

http://qurl.net/dH - http://google.com/

jLl · August 22, 2007, 12:00am

[correction]
It SEEMS more like a symmetric key then a hash.

RobertS · August 22, 2007, 12:00am

The other problem with the really short ones, of course, is that as they get more popular, the links get longer!

It would be foolish to have links expire; it’d make things far too easy for spammers.

Kevin · August 22, 2007, 12:00am

Am I missing something here? Why would you consider using a hash?

Since a hash is one-way, there would be no value in using it. It’s not like the server can convert the hash back into the actual URL. How is this “Hashes in practice”?

PWills · August 22, 2007, 12:00am

+1 to tinywords, to counter Jeff’s -1

With respect, Jeff, I think you missed the mark on this post.

You said: “But what really struck me about these services is how they’re a perfect embodiment of a classical computer science concept–the hash table”

Rich said: “by transforming the key using a HASH FUNCTION into a hash”

These services take a long URL, store them in a database, and then provide a key to that table. The key is computed based on the quantity of data in the table. It is NOT computer through a reproducible algorithm operating only on the URL input. Therefore what we’re talking about is not a ‘Hash Table’, it’s just a ‘Table’.

(Reminds me of the old quip “What do they call Chinese Food in China? Food, of course.”)

Or, said another way:

Database table != Hash table

Joel · August 22, 2007, 12:00am

Wow, Jeff. You either need to lay off the Mountain Dew, get more sleep, or spend more time programming and less time blogging.

Here’s a shorter version of how URL shortening services work:

They assign an incrementing integer to each new URL. This integer is usually encoded in the URL in some higher base (often 36 or 64). THE END

codinghorror · August 22, 2007, 12:00am

It is NOT computer through a reproducible algorithm operating only on the URL input. Therefore what we’re talking about is not a ‘Hash Table’, it’s just a ‘Table’.

That’s fair.

To generate VERY VERY small keys (eg, “808”, “cyu”, “etc”), hash functions clearly won’t work. Not enough bits, too many potential collisions. Clearly the short URL generators with tiny keys are not using hashes. It’s impossible.

But I think it is eminently possible to generate a typical 40-64 bit hash using an existing hash algorithm based on the URL, which would resolve down to 8-10 characters.

It might even be possible to write a custom hash that did better on URLs.

But in the end, the simplest answer is probably the correct one-- iterate through all possible combinations, growing the string as necessary when you exhaust the key space.

So I re-award tinywords 0.5 and subtract 0.5 from myself.

ian13 · August 22, 2007, 12:00am

I have only used url shortening services for the most obtuse of URLs. Generally speaking very complicated QStrings like to send my mom a link to a product at Costco or Amazon (sometimes they get really long).

While I love the succinct nature of shortened communication in text messages and twitter, I hope it doesn’t lead to a greater down of language as a whole. However, you have to admit that some thing http://sentenc.es/ is very sexy because it forces people to get to the point and saves time.

Ethan · August 22, 2007, 12:00am

sending out very long hyperlinks in email is always risky; you never know when
the email clients will insert line breaks in the links and render them unclickable

… which is why we need to pry those broken, ancient mailreaders from the 80-column terminal curmudgeons, and convince them that email is stopped being an 80 column, ASCII format 20 years ago, and if they insist on treating it as such, they have no right to complain when the rest of us want to move on and send formatted text with flowing word wrapping, reliable quoting, etc. (i.e. html-email)

Bryan_Price · August 22, 2007, 12:00am

A system that recycled their links wouldn’t be of much use.

Scott · August 22, 2007, 12:00am

But the web interface for Twitter still won’t let you enter urls if the resulting text BEFORE shortening the URL will be longer than 140 chars. It’s a design flaw. It should check the resulting string AFTER shortening the URL.

Heck, why even use the URL at all? Why not convert any URLs found to a short URL and represent the link with the word “link”? Then calculate the total length of the string and check to see if it 140 chars?

"I’ve become a big fan of Twitter. "
hmmm, quit a reversal from this comment?
http://www.lazycoder.com/weblog/index.php/archives/2007/03/07/twitter-when-users-use-your-app-creatively/#comment-59420