Primary Keys: IDs versus GUIDs

One drawback to incrementing integers as keys is that users notice the incrementing nature and then assume things about the data that is not reliable, such as order of creation, date of creation, data that was created near each other in time. And I’ve had tremendous difficulty trying to break them of this thinking, even after adding a column to record this seemingly necessary information.

Personally, I don’t think GUIDs NOR auto-increment integer IDs should be in the URL at all. Two reasons:

  1. http://codinghorror.com/blog/archives/primary-key-ids-vs-guids.html is way more readable than the two alternatives.

  2. Ideally, the primary key shouln’t be exposed for something as permanent and public as an URL. Keep the primary key private or transitory if exposed (such as an admin interface in which the url to edit a post isn’t likely to be shared).

As you can tell, I’m a fan of surrogate keys (http://en.wikipedia.org/wiki/Surrogate_key)

One real world scenario that causes problems with auto-increment primary keys being exposed in the URL is when users import data from one blog to another using BlogML (or something similar). Having the same PK in the new system is unlikely.

If you’re exposing database keys of any kind in your URLs, that’s a whole different problem.

http://www.codinghorror.com/blog/archives/000093.html

The traditional “auto incrementing 32-bit integer” identity column is plenty fast, but it has some problems. I highly recommend reading Joe Celko’s post on this.

http://groups.google.com/group/microsoft.public.sqlserver.programming/msg/6d61dbf80d6f0fb6?hl=enrnum=14

I too have made the mistake of building a database with GUIDs in the clustered index. Imagine my delight later, during performance tuning, when I realized that I had a table structure designed to perfectly fragment. Oops.

I agree with Dave Markle. Natural or candidate keys are by far the best way if you can get away with it. I think most people jump to using auto ids or guid without actually considering them. Also a lot of the time you can get away with using composite keys instead.

http://en.wikipedia.org/wiki/Relational_database#Keys

Yes, but I’m quoting the 3rd edition that is copyrighted in 2005. The Joe Celko article you linked was written in 2001. It doesn’t really matter, I’m sure he still feels the same way about both today.

The point is that Celko is not suggesting a GUID is a better alternative to an identity column. Rather, he says it’s worse. But, you used his post to imply his opinion is the other way around.

Wow. This is very… wrong. GUIDs don’t work like that, and don’t solve the problems you think they solve.

Let’s start at the beginning. Which flavor of UUID are we talking about? A proper GUID includes a machine-specific fixed portion, which is great for distributed systems, but vastly increases the probability of collisions in data from that node. A UUID consisting of almost exclusively random data may have a lovely distribution, but the probabilites in a high-throughput system can quickly reach, say, the chance of winning a state lottery or getting struck by lightning – small, but not negligible.

Therefore, any system that needs to merge multiple relations with unique identifiers needs to have robust methods of resolving collisions, however unlikely they may be. In short, if your code won’t work with plain integers as primary key identifiers, it’s still broken if you replace integers with UUIDs. You’re just postponing the problem.

Point of interest: Microsoft CRM 3 uses GUIDs for primary keys. Not sure about previous versions.

I’ve been developing for MSCRM for a little while now, and at first I was like ‘Eugh!’ but now I’m all like, ‘Ooooh.’

Gotta agree with Anonymous Cowherd. GUIDs are not actually “globally unique”, you know. Depending on how they’re generated, the chance of collision is small but it exists. That’s negligible for low-volume throwaway stuff like URLs or COM interfaces etc., but it’s absolutely unacceptable for large databases. Terrible idea.

Wow, so sorely mistaken. Good luck with all the storage arrays you’ll need with such setups.

Also plainly missing is anything about Oracle or PostgreSQL.

Once again CAPTCHA: orange
Strange setup…

GUIDs may be much worse than autoincrement numeric IDs as primary keys for specific database backends like InnoDB. A nice discussion of this topic specific to MySQL can be found here http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/

My suggestion:

Step 1) Figure out (perhaps by analyzing data flow) where data from various sources gets merged.

Step 2) Write and test code that detects spurious matches and handles them robustly. In general, this code may be dog slow and nasty, but it needs to work, not merely with unit tests (anecdotal evidence) but with stronger guarantees (a mathematical proof would be nice).

Step 3) Now, switch from integers to GUIDs. Your performance-sapping merging code will now be called once in a blue moon.

Starting out with GUIDs is premature optimization, and will obscure failures.

The Application I work on, switched to GUIDs as DataBase keys 3 years ago, and the only cons that we’re experiencing is the “debug hell” (Point 2).

As for Web URLs, having a GUID key in the DataBase does not force you to also write that key in the URL; there are many techniques to create user-friendly URLs even if you have GUIDs (URL-rewriting, static page generation, Key/Title caching) depending on the size of your Website.

On the other hand, if you’re going to use a key in the URL, ID or GUID are not so different (in user-friendliness) when your ID is more than 100000; eventually you’ll be copying/pasting the URL anyway…

And, is user-friendly URLs really a must? For blogs, maybe; for small catalogs, maybe; for Microsoft Downloads, definetly not.

Jeff:

Of course I agree that the world isn’t the same as it was back then. But I would suggest that the tables that most need GUIDs as PKs are the ones that would be hurt most in the area of performance, simply because they’d be so big – “Transaction” or “Order” tables come to mind. These are the tables that are going to get hurt the most with large key sizes.

And if you’re using GUIDs for other tables (state description tables, account tables, time-dimension tables, etc…), well then you’re going to be in for real trouble.

For me it comes down to replication and distributed transactions. If you aren’t doing that sort of stuff, you generally don’t need GUIDs. I prefer the ease of communication of a 32-bit integer when debugging or troubleshooting over a GUID any day.

And I wouldn’t be so quick to put Celko up on such a high pedestal – he sometimes has worthwhile things to say, but he really isn’t all that. It’s not just his tone that I take exception to. Ask him how we writes his joins, and what he does when he has to mix INNER JOINs and OUTER JOINs in the same query. You’ll vomit for sure.

GUID’s are probably OK for many online applications. But if you ever try to migrate a terabyte of legacy data and into a modern system, resolving data quality issues along the way, you will find there is a lot of benefit in a creating you keys to be more deterministic so you can track a record in the target back to the source easily. So long as you design you keys properly you can switch to a different key generation algorithm for new records when the target system comes online without any serious performance implications.

I have a right to exist!

0AFDC426-C52E-46F0-8752-5F9C71159FD9

I hate UIDs (or misuse and abuse of them). They are uninformative and easily hide bad design decisions, and as mentioned above, are rarely truly unique. Two object share an identity, when they are identical, i.e. equal by some standard. Giving objects distinct ID-numbers implies that they are not identical and that identity cannot be inferred from other available data; this is often not the case. It is preferable to use forms of identification that are descriptive, without resorting to artificial identification.

“rarely” unique ?? Well, I hope you have some math to back that up.

Also, someone said the chance of a conflict is unacceptable for large databases ?? I don’t know how you design your databases but that’s easily solved in any database I have worked with.

If it’s worth identifying, one day it may be worth exposing on the web. Give the thing a cool URI, something which won’t be broken if you change your DB setup. A uniform id string, which is backed by a system which enables global disambiguation. Better still give your relations a URI too, one day you may like to expose them to the Semantic Web.

Where I work, we only use GUIDs for FKeys. All PKeys are based off of other unique data.

All clustering discussions aside(a whole different beast than primary key concepts), let’s go through an example.

Let’s say you have a magazine subscription system. A potential subscriber wants to subscribe to your magazine. Well, his wife already has subscribed him and even provided the exact same data for this person, including contact and billing info.

Using a GUID as a primary key for the subscriber data, this person is now getting two subscriptions of the same magazine. Not good.

Basing it on such data as last name, first name, email, and possibly birth date or some other unique and distinguishing info, you TOTALLY eliminate this, ONLY if the unique primary key constraint isn’t circumvented by substituting some other piece of data for one element of the key. In that case, oh well, call customer service and have them cancel one subscription.

Now, let’s say the subscriber wants to subscribe to another magazine that you provide. You use GUIDs to identify the subscriber, also providing a user-friendly subscriber ID. In the subscriptions table you use the GUID to identify the subscriber’s subscriptions, non-uniquely, but indexed on the guid. In the magazines table you use the GUIDs to identify the magazine, unique indexed and as primary key. NOW you COULD use the pair of GUIDs in the subscriptions table as a primary key.

As look ups, GUIDs work great, for a primary key on parent data? I wouldn’t use it. I’m probably reading the blog entry all wrong, but it’s kind of a no brainer when it comes to primary key concepts.

Could we go to pineapple someday? I’m getting tired of oranges and need a change of taste. :wink: