Primary Keys: IDs versus GUIDs

What is the impact on using GUIDs for joins? Isn’t faster to join on an integer?

Does anyone realize that you can set specific GUID values in your table? I’ve had more than one occasion where I’ve had to put test data into our server, and using the GUID FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF has helped me tremendously. You can easily delete the sucker when you’re done, and have the system generate a new one on the fly.

Besides, GUIDs are required if you’re performing replication, as many have stated. If you know this is going to be a requirement, you might as well use them from the get-go.

Granted, I’ve seen some piss-poor design schemes where GUIDs could be reproduced (since there is no truly random number in any computer system, no matter how advanced), but this comes back to weighing the ease of using an auto-generated key vs. using your entire table as a key, and having one field being the difference.

The issue I see with using guids is their lack of perdictability. For example we have integration setup between two system. System A is a public facing web-site, System B is our billing system. We acquired a bunch of new buildings and we needed to put them on our public facing site, but keep them hidden. Rather then adding the records and then getting the GUIDs and putting them in our exlusions table, we simply were able to get the last ID value that SQL Server had generated (NOT THE MAX ID) and then insert the ID values into the exclusion table before the records were generated. This prevented anyone from getting even a momentary glance at data that is not ready for public viewing.

Predictablility has it’s place. How do you know if someone delete a record when you are using GUIDS?

If you are using sequential GUIDs, what’s the difference between the IDENTITY column? They are both predictable.

I will leave the GUIDs to heterogeneous items like software packages that need to have some uniqueness to them so they don’t overlap.

The only response I truly have seen that is a drawback to the predictable sequential IDs is the possibility of data scrapping. But if they got to one piece of information through some web page they can also get to the rest, they just have to write it differently.

Just to let everyone know, SQl Server and MySQl are not the only DB servers out there. Real database servers do not have any requirements on the data to do proper replication!

I have always used a primary key of an identity, with the GUID as the secured URL id, so my tables would look like this if they were update-able on the web page.

Table

id

created
createdBy
guid

else I just use the PK and that is IT.

I started to write something but then I read Dug Wilson’s entry and it summarizes exactly what I was thinking. So I’ll try to go a different direction.

This pro-con list is nice, in an abstract kind of fashion. When I think of real-life database and application design scenarios, there are very few times a GUID would be totally necessary. Almost every single one involves web-related apps with “exposed” data (like in the QueryString or Cookie).

Int counters exposes chronology for pages and users (like with MySpace userId’s) which you might not want for your given website, or if you fear about content scraping (where someone could program to hit your site with a QueryString that counts on its own).

The only other time I like GUIDs (or generated uniqueidentifier in SQL Server) is when you have multiple tables where the data from both might be used by the application in the same object. For instance I have seen apps where the Employees class could pull from separate tables with very similar data and a column identifier like “EmployeeType” was not feasible in this situation; they had to be separate tables. In that case a simple row counter could cause confusion.

We use a short string that serves as a “GUID” without actually being of the standard GUID format. We have other data which can be combined to generate a completely unique value across the enterprise (location-transation number). The value is slightly longer than a standard GUID but it is visually recognizable and sortable. We feel that this gives us the best of both worlds.

When implementing, people who consider using Natural keys obviously haven’t had to do anything with them (other than write pretty diagrams) I hate to throw a reality wrench in the works, but Natural / Candidate keys are stink unless they’re ‘just numbers’ and that’ll never change (yeah right) - you’re setting yourself up for failure.

On top of that, you’re really asking for trouble if you’re candidate key is alphanumeric and/or if your candidate key is composed of more than 1 column.

All for the sake of conceptual purity (wank). If you need to enforce a relationship between tuples, use a unique index on them instead and stop crying about needing to have more than 1 index on a table. Do you have 1 million inserts a second? If it does matter, you’ll have other issues - natural versus surrogate keys are the least of your concerns.

Primary Keys are best when meaningless (i.e. surrogate keys). UUID’s are meaningless too, so that’s fine. But UUID’s have a performance cost feel like overkill most of the time. I don’t really see the gain unless replication is necessary (and even then, figure out if it’s necessary first). UUID’s just feel kludgy. Don’t get me wrong, I’m no hater, but they’re best used sparingly.

  1. URL’s shouldn’t use the primary key be it an int or a GUID. Figure out a more user-friendly way. People always try to hack URL’s.
  2. Using GUID’s doesn’t stop screen-scrapers or robots (except maybe the real simple ones). If your site will/did link to them, they will find them.

If you need to generate keys without having to go to the database every time it’s simple - just find some schmucks crappy broken code on the net or spend beer money and read Fowlers Enterprise Patterns of Architecture.

(PS By all means use candidate keys for lookups, searches, and external identifiers, but for the love of god don’t use them as the primary or foreign key)

I’m guilty in the past of using candidate keys (just after graduating uni). Through trial-by-fire I have seen the light.

There is DB theory, and then there are the physical limitations of the DBMS being used, which you can either make work for you or against you.

With SQL Server, the smaller the key used in a clustered index, especially if you have non-clustered indices, the better. Hence, putting a PK (which is a clustered index) on a customer name of varchar(500) is bad in this situation. Putting it on their federal tax id is a little better. Using an autoinc field to id the customer makes no sense.

Using auto-incs for a State table, where the abbreviation is char(2), makes no sense, put it on the State code.

How did the (bad) concept of putting key info in URLs creep into this conversation?

If you don’t know the difference between a clustered and non-clustered index in SQL Server, then you should pause a read before building any more databases.

So what do you do if you have the GUID 7897F8AE-456B-FBC7-78900A0EEBAB and you want to know what it is?

I prefer to have a single sequence per database, and have my primary keys globally unique across the database, but not globally unique across the universe.

When surfacing these keys to the user (be it in GET parameters, or in text boxes, or on mailing labels), I like to format it in some special way that is somewhat unique to the application.

This way, at first glance I can see US-10257134, and know that it goes to one system while S347142 goes to another system, 238-73-2175 is a social security number, and XU721-8WF7L-172AB-3275A-A9FFX is a windows registration key. Maybe I don’t recognize the key format… but I’m still no better off than the GUID case.

Identities only make sense on really top-level things like Individuals and concrete instances. How does a GUID help a lineitem in an order?

Customer has Orders
Order has OrderItems

OrderItems have no context outside of an Order, so why would you attach a GUID to it? What really does it buy you?

Also, there’s no penalty for having both a simple ID for tables and queries as well as having a GUID for identity (if you need it) as well as any other logical id’s the thing might need in it’s lifetime.

The nice thing about autogen id’s is that you never update them (assuming you could, but there’s no need). You may want to update the External Id’s because yes, things like product identifiers change from inception to marketing.

(Person)
PersonId
PersonName
PersonGuid

(Product)
ProductId
ProductName
ProductGuid
ProductInternalId
ProductExternalId
CreatedDateTime
UpdatedDateTime
RowVer

And if you were that complicated, create a Keys table (sorry I couldn’t resist fizzbuzzing it.)

(Key)
KeyId - yet another (surrogate) autogen key that has no meaning
ProductId - references ProductId in Product table
KeyTypeId - (e.g. external, internal, guid)
KeyValue - {23-AD-23} or ‘VWTT240L99’ or whatever.

Also, another poster said that gaps in keys can indicate deleted records. In Sql Server (at least) that’s a lie your computer told you. Failed inserts or rolled back transactions will do the same. The key, once allocated, is gone regardless of the transaction outcome.

Heck in reality we don’t even delete rows on major entities (like Person or Order, although you might with OrderItem) - instead we only logically delete them (a whole other lot of problems).

The good thing about opinions … everyone’s got one.

Ixnay on blindly using natural keys for PKs. It leads to the dark side.

Case in point: I have seen at least three insurance systems that used “policy effective date” as a component of a complex natural PKey. Sure, it’s what uniquely identifies each Policy record. But guess what - effective dates can be changed through various business processes. Now all of the sudden a process which should be as simple as updating a single field in one row has turned into cascading a complex natural key value down to every other table that is unfortunate enough to reference it via a foreign key relationship.

Using GUIDS tends to promote use of surrogate keys, which in core entities is a Good Thing.

I say go with GUIDs for primary entities until you have a PROVEN need to reduce storage requirements on a specific table - which is a lot more rare than most people think. Also, if there is ANY chance of a future merge of two databases (you sell your system to two companies that later merge), then seriously consider going GUID to prevent a future nightmare when the time comes to integrate the two sources.

For domain entities (lookup tables), I say go with a natural key ONLY if it is guaranteed to be immutable. The choice between GUID and INT for other lookup tables should be on a case by case basis - depending on real usage patterns and the need for security (lots of good points brought up by others, no need to regurgitate).

Thats my 2 cents and a pickle.

I designed my first CRM database containing GUIDS as keys app. 10 years ago now (time flies). During the develoment cycle I’ve been cursing at myself for doing so, cause yes, debugging is hell (“what was that key again…”).

But when one of my cusomers asked me to migrate two of their branch offices into one, database migration was a breeze. The whole process took less than a day. I’d say that compensated for my debuging problems.

Joe Celko does not agree with your topic nor does the article you linked. His post merely states that the identity column is bad. He doesn’t say that a GUID is good. In fact, here is a quote from his book “SQL For Smarties”:

The algorithm used for generating new GUIDs has been widely criticized…
Besides the usual problems with exposed physical locators, each GUID requires 16 bytes of storage, while a simple INTEGER needs only 4 bytes on most machines…
Indexes and PRIMARY KEYs built on GUIDs will have worse performance than shorter key columns…
The real problem is that GUIDs are difficult to interpret, so it becomes difficult to work with them directly and trace them back to their source for validation…
The GUID cannot participate in queries involving aggregate functions; you would first have to cast it as a CHAR(36) and use the string value…
Other features of this data type are very proprietary and will not port out of a Microsoft environment…

Saying that Joe Celko agrees with you was very misleading. He only agrees with you that the identity column is bad.

Recently used GUID for a major product, the reason for it was that data was to be collect seperatly then combined into the main database. Using GUID it was was easy to handle this, just a copy from the other databases to the main, did not have to worry about keeping keys in sync or code to make sure that each copy of the collection software used a different key ID range.
Also because of the sources of data different programs had to be written, in different languages and tools, to get all the data.
That was a fun job, non-sarcastic.

Based on that here are some additional cons

The format for GUIDs is different depending on the library and the tool being used. This is even the case the in the microsoft family of software. It became to standardize on a format then write my own procedure to handle the languages/tools which did not generate that format using thier own commands.

Using the old int ID you had an easy way to see if something got deleted or if a range of data is missing. sequential GUID can help alot with this but are still not easy to catch with the eyeball.

For alot of things in debugging and coding integers are alot easier to work with then strings.

If users are pulling data directly from the database into things such as excel or word it is harder to explain that huge string vs a short number. Minor training issue, and not a problem in most cases.

You no longer have a short unique ID in every table that can be printed on reports/display so that users can identifiy the record when talking with each other. You have to add some more code to handle this.

His post merely states that the identity column is bad.

Which is, more or less, what I’m saying: mapping row identity to a traditional 4-byte incrementing integer can lead to some pathological behavior on the part of the database, users, and developers. Not that it is wrong, per se, but it’s less than ideal.

I’m interested in learning about alternatives to the traditional choices.

here is a quote from his book “SQL For Smarties”:

Which was originally published in 1996, because I have an original copy. The world of 1996 is not the world of 2007. I’ll agree that a larger PK field has significant performance ramifications, but given today’s hardware, you’d need to have a very large table before performance costs become prohibitive.

Remember that storage is never just disk space. It means memory cache and new I/O bottlenecks (disk, memory, and network bandwidth, as rare as network-constrained databases are) as well.

For basic testing the solution is simple, just MD5 some identifier and call it a GUID. =p As a collary, GUID columns are significantly faster for storing MD5 values if you ever have to search against them, for whatever reason. Too bad they don’t extend to larger SHAs.

There is one thing that I dont think I understand properly in the article that introduces “COMBs”.

The performance is very low when NEWID() is used, and it is greater when he introduces a new stored procedure that calls NEWID() and performs a lot of other operations too (like casting from one type to another, getting the current timestamp, etc).

How can operation_A be slower than operation_B, when operation_B is a complex set of steps, one of which is operation_A?

“Honestly, I would have never expected programmers to be so clueless. Not truly unique? What is this, the argument between the Sun blowing up 5 billion years from now or “only” 4.5 billion years? Who cares? There are less than 10^8 seconds in an year; if you generate a billion GUIDs per second, you will need a thousand years before the chance of a collision is 1/2. How many of you generate a billion GUIDs per second?”

The clueless fool is you. Please go back to school (in case you’ve already finished) and learn about basic probability theory.

You seem to believe that a low probability means you’re guaranteed a thousand years of GUID generation before you get a collision. Of course that’s rubbish; assuming perfectly equal distribution you could still get an identical pair tomorrow!

But the distribution of real-world GUID generators is not perfect. Chris Jaynes said he got duplicates every week, using the Microsoft generator. That’s not so great, eh?

And you also ignore that the “Pro” argument wasn’t just uniqueness in a single DB but across all servers in the world. Wanna bet that your “billion” GUIDs that make a collision likely would get created really fast if all DBs around the world started using them as keys?

Bottom line: Using GUIDs as DB keys because you expect them to be “globally unique”, without any further checking, is stupid. They simply aren’t.

You are more likely to be hit by lightening before you get two of the same, a fool saying he got it every week is doing something else wrong, that should be fairly obvious to anybody. Do the math for gods sake.

What are you really arguing about ? Guids do not need to be unique across all databases in the world, why the h-l would they need to ? Nobody expects that except the people like you trying naively and wrongly to use it as an argument against them.

Are autoincremented int’s unique in ANY way at all ?? What are you going on about ?? My good, there’s a lot of people here that shouldn’t even be in front of a computer.