Primary Keys: IDs versus GUIDs

I’m surprised that no one has mentioned the birthday paradox! In n-bits we have only 2^n/2 GUIDs that can be generated without collision. The probability that at least two of 50000 objects drawn from [1, 2^32] are the same is approximately 1 - e^(-(50000 * (50000 - 1))/(2^32)), or 44.1%. Autoincrement is guaranted to be a unique key in your database, but with GUID you’re pretty much assuring, on a high-use system, that you will eventually collide.

You will not collide using guids if you design your database properly, ever.

Your math is wrong, where did you get 32 bits from ? A guid is 128 bit.

For those that jsut can’t seem to get the math correct, this is the probabilitites:

http://en.wikipedia.org/wiki/Universally_Unique_Identifier

Thanks, RK. I loved the “shouldn’t even be in front of a computer” reply :slight_smile: And yes, I did take into account the birthday paradox in my calculation.

Bottom line, anyone who thinks collisions between GUIDs is a realistic possibility has… well, problems grin. I once saw a picture with a cow which fell down through a roof and destroyed a computer (this was in an advertisement for a backup program); anyone having nightmares about this happening?

(Somewhat related on the topic of programmers being unable to realistically estimate stuff like this, I have a friend who insisted that a 32-bit auto-incrementing field was too little - what if he reached the limit? He was creating a database for tracking customers.)

You stated: “You can generate IDs anywhere, instead of having to roundtrip to the database”

I assume you mean getting a new ID and then using it in other tables, databases?

If so can’t you solve this with transactions?

I would say their use is pretty obvious, changing worflow ? Not sure what you mean by that, always used guids because they make a h of a lot more sense than using ints or some made up varchar key (can you belive it?) that some people here suggested.

Seem to be some religious war against guids by some people, the math is there to support the claim they are GLOBALLY unique, for all intents and purposes they are globally unique, you are more likely to be hit by a meteorite than get a collission and even if you did, what’s the big deal ? You handle it gracefully and generate a new guid.

Either it’s a religious thing or you just lack the intelligence to understand guids. Something weird is certainly going on with this blatant bs about guids.

It seems obvious to me that GUIDs in databases are a very bad idea for many of the reasons stated in this comment stream as well as the fact that a lot of the supposed advantages seem like disadvantages to me. But let’s assume for the sake of discussion that they are better. Are they so much better that it’s worth changing minds and workflows? Are they twice as good? I doubt it. Usually a new solution has to be ten times better for it to be worth fighting to overcome the inertia of the old solution.

By the way, I had to stop reading the article about GUIDs in databases when it completely undermined its own credibility by claiming SSN is a natural key. There are people with two SSN. There are multiple people with the same SSN. SSN is a huge bureaucratic fustercluck and anybody who uses it as a natural key WILL eventually run into trouble if your database is the huge success you hope it will be. But don’t trust me. Google it for yourself.

@Anonymous Cowherd:

Actually when implemented correctly, using the machine’s MAC address (when the MAC address is taken from a serious vendor’s NIC) and a timestamp is the only way to guarantee GUIDs are actually unique.
The person having conflicting GUIDs complained they where conflicting with MS office. It is highly unlikely that the same MAC address has been used in the past, on a computer that had a clock way into the future, by microsoft and this person, incidentally corresponding to the moment you generate a new one. In case the conflicts arise regularly, probably type 4 GUIDs are generated by a tool that doesn’t use enough entropy to generate the random numbers (for instance seeds the standard random generator with the number of seconds since midnight).

@RK:

People get hit by lightning. People win state lotteries.

SQL is not relational, but has the features needed to implement relations. Making a SQL database a relational database – that is, ensuring that your SQL tables are relations and not tabular data – requires some effort. Spurious PK matches invalidate the chain of assumptions that make a SQL table act enough like a relation to maintain data integrity and permit joins to produce meaningful results.

GUIDs are a bit of a mess. There are various ways of generating them that have vastly different amounts of entropy. The guy who had weekly collisions was probably using an algorithm that included the machine’s MAC address and a timestamp. Under those circumstances, it is quite likely that collisions will take place.

Recognizing spurious PK matches is the cost of using surrogate keys. This cost must be paid, no matter what form the surrogate key takes. Clever usage of a good GUID algorithm can then be used to amortize this cost to nearly zero; this allows easy-to-understand and easy-to-test methods to be used to resolve conflicts, even if the runtime costs are relatively high.

GUIDs aren’t a loss, quite the opposite; but pretending that they make the underlying problems go away is naive.

no one has mentioned threading/locking yet, and this stuff has big consequences on scalability.

you can generate any number of guids at once, in parallel, but you can only generate one ‘autonumber’ at once (they have to be queued up). Hence, in a highly concurrent application, guids avoid one major bottleneck.

i think the article about MySQL 5 goes into this in detail.

That’s a pretty cool concept in my dumb old opinion. And not the sort of thing we usually think of since we grew up on boring old single-core computers.

So use guids, cause your gandkids kids are going to thank you.

lb

I think all this discussion shows is that there is no “right” way of doing something. People have their own experiences, their own needs, their own scenarios, that change the impact of using one key type versus another.

Although this discussion was quite good!

I am addicted to GUIDs. I love them. I work with data on many different databases and have a few apps at work that move data from oracle to sql and back and its no problem with GUIDs.

The cool thing about using Guids, in this fashion is that they reduce the amount of collision detection required when importing foreign data.

Our product has a plug-in architecture that relies on this detail to allow people to drop Dlls into a folder on our CMS and have new functionality without jumping through hoops to ensure that the references the plugin has to itself don’t point to something else in the database…

Shame you can’t use a Guid as an Attribute though (well you can use a string and a property that returns a guid but yeah …)

Remember that there are two keys on a table - the natural key and the surrogate jey, celko proposes only the natural key and he gets really confused over the use of IDENTITY and GUID for the surrogate - read Codd and Date on this rather than celko.

So long as you use a sequential guid rather than NEWID() for instance then you are ok from the locking and fragmentation stand point - consider the randomness of the NEWID(), say you are inserting 100 rows into your table, the liklihood on a table with even just 20 - 30K rows is taht each one of the 100 rows will be inserted on a seperate page, require index nodes to be locked and possibly page splits - the locking story here is horrendous; whereas IDENTITY or NEWSEQUENTIALID will at least cause significantly less of a locking burden.

When the GUID is being created outside the database, perhaps the middle tier or the client then you are going to have the locking horror, but a way round this is to prefix your guid with some sort of node number so that at least the rows being inserted by that node are around the same pages in the tables etc… There are a number of ways of doing that, and when it boils down to it you probably wouldn’t use GUID’s in that situation anyway.

Using your sample, the natural key would be ‘value’, but thats a bad name - it should be more representative, say ‘Colour’, then the ‘ID’ column is what’s termed the surrogate key which is then used in any foriegn key references in other tables (this is the mistake celko makes copying the natural key everywhere causing a locking disaster if the natural key needs to change - inconsistency between the database and application).

Some good links there !

Tony.

Thats a fantastic post, that helped me making a decision that I can have workflow Instance ID (Windows workflow foundation) as Primary Key http://aliwriteshere.wordpress.com/2007/04/02/worflow-instance-id-as-primary-key/.

Hey {CF76B38A-888E-4BEA-B08D-13AFBAAD8745}, That’s my name!

I know this blog entry is getting stale, but I’ve been trying to apply the concept of using GUID’s in a distributed environment, but the problem I’m running into is that the environment (MS FoxPro) I’m using does not support a native 16 byte implementation of the GUID (a la SQL’s Uniqueidentifier), I have no trouble generating a string representation of the GUID (by accessing the Windows API), and I can even come up with the binary 16 byte representation of the GUID programmatically.

So my question is this: What is the best way to represent a GUID in a non-MSSQL table? A 38 char field, or a 16-byte binary field? I know that the 38 char record is much larger than the 16 byte field, but the 16 byte hex field is extremely human-unfriendly… Which is the best way to go?

I supposed this article was targeted to outline the physical cons and pros of both.
I’m not going to repeat what one or two already said, but I’d like to underline: always look for a real primary key for your table, then and only then, if you didn’t found one, evaluate the best use of either integer identity or Guid columns.

Can anyone tell me if it’s necessary to use a GUID for your primary key in a database that is to be hosted in a SQL Server 2005 clustered environment?

I’ve been told if you were to use an Identity Column that there is a chance that concurrent requests may result in one or more servers in a cluster attempting to insert a row with the same primary key, resulting in a failure.

Is this really true?

Do developers need to concern themselves with the hosting environment when doing database design?

Do you have to use a GUID or some sort of composite primary key combined of a column with a serverid and another with an identity column?