Primary Keys: IDs versus GUIDs

What I meant by rarely truly unique was that UID schemes don’t usually guarantee unique IDs, i.e. “truly unique”, excuse for the bad wording.

Now, with the parent subscriber table having it’s non-GUID primary key, the index will actually be smaller and easier on those hashtables. Thus providing a performance boost when it comes to a transactional database. GUIDs as look ups aren’t that horrible IF they are either combined with other identifying data including another GUID. It provides a more rather unique way of identifying the row. And by providing a repeatable character set in you key combo, fragmentation should be reduced. Based on what I know from Oracle, fragmentation comes from not having enough repeatable character strings. I would assume SQL Server would be similar in that problem.

Disk space any more is a rather moot point with a new drive being so cheap. Performance is where all the good and bad happen. If you can’t find a performant index you are going to see the GUID performance issue and think it is solely based on this idea. Another option, is to not actually use the UNIQUEIDENTIFIER data type and save the data as a NVARCHAR. The performance of a NVARCHAR is going to be far superior to UNIQUEIDENTIFIER anyday.

(Putting on fire retardant suit now… :wink:

What’s wrong with using a true candidate key?, why make work and move away from the actual relationships within the data. What would Codd et al make of it all…:slight_smile:

I like the fact that once you’re through using a GUID, you can sell it on eBay. (Yes, I’ve seen it).

I have to say that I don’t think of a database where every table has it’s ID field as a GUID is very practical.

I also have to say that I’m not a big fan of IDENTITY columns either. If I were a perfect programmer it would be different. I prefer to keep lookup/supporting tables constraint and IDENTITY free which allows me to fix mistakes (like accidentily deleting a lookup record or worse). Those tables are often small enough to not hinder performance too much (enough to care) when foreign keys are not linked from main data tables.

Most databases have a composition of 90% supporting tables and 10% business data tables. Using a plain integer for the supporting tables and a GUID for the business data is a more practical approach.

Just my humble opinion.

GUID’s have there place but it isn’t as the PK of the table. Being able to see the sequence inserts of data far out ways everything else. SQL Server has a nice datatype of rowversion which is a timestamp but also a glorified GUID. It is unique across the database but has the benefit of changing when the row is updated.

I think it is a mistake to use a GUID for replication purposes only. Lets not mix apples with ‘orange’'s here :wink:

Since space is almost a non-factor… IMHO you should always include a pk, and rowversion (GUID) on every table. If you care about the rowversion changing on update then also add an additional column of a static guid to the table.

“Database guru Joe Celko seems to agree.”

You say that. I think Mr. Celko does not mean what you think he means. (Although I dislike his smarminess) he’s right - IDENTITY columns are horrible - but

GUID’s are far worse.

You should be using a ‘real’ key - ie. something derived from your data model - not a ‘surrogate’ key, which is what integer keys in all their variants

including IDENTITY and GUID are. A GUID is not different from IDENTITY in most situations and is worse in many.

A ‘real’ key is often called a candidate key and is derived from real attributes of the entity you’re modelling. There are very good reasons for this. Say

you’re a manufacturer with a range of products that you identify somehow - let’s assume “name”. Now “name” is probably used within your business from the

shop floor to the sales brochure. That’s a primary key and it has many advantages:-

  • it will be unique (you can’t have your people referring to two different products by the same name, or there’d be chaos)

  • everyone in your business will know the names of the products they deal with

  • if your data is presented outside the database antwhere from a brochure to a data dump, everyone will be able to see what it relates to on sight

  • if you can’t find a unique key like this there is something wrong with your data model which will lead to anomolies and bugs sooner rather than later

The only reason why IDENTITY (really ‘surrogate’) keys were ever used was that candidate keys can get pretty big. As a result many people use IDENTITY as an

“alternative key” to improve performance but make them invisible to the users. Only nongs use them as the actual primary key.

Ironically, the performance argument is nonsence. Let’s say you’ve got your PRODUCT_NAME but decide to introduce PRODUCT_ID as an IDENTITY field for the

primary key. This is what will happen:-

  • you now need two indexes, one on PRODUCT_ID for “performance” and another on PRODUCT_NAME so your users can do searches using the attribute they know.

Inserts now take longer because two indexes need to be updated - whoops

  • if that extra index on PRODUCT_NAME isn’t unique your users will be able to enter duplicate records and you’ll have data corruption until the end of time.

  • your code will become sprinkled with constructs like “SELECT DISTINCT …” because the parameters to the query (as entered by the users) will probably

have to use the now non-unique natural key for “usability”

  • if you ever need to recover or fix data corrupted by a buggy application, it is going to take your support staff a lot longer if they have to trace

relationships through a set of arbitrary numbers. Heaven knows how they’d cope with GUID’s (You don’t have buggy applications?). Also if you ever need to

migrate/replicate data to another application with a different schema the analysis is going to be a lot easier. That’s a common requirement in the world

these days with widescale enterprise integration. All those ETL (extract/transform/load) tool vendors are making money meeting a real need.

So you’ve now got a redundant index that is going to give you a lot of trouble, why not get rid of one? How about the arbitrary one that no user can

understand in the real world? ie. PRODUCT_ID. If you do that, your performance comes back, and you eliminate a whole lot of scope for error and corruption.

Now, let’s take the purported advantages of GUID/IDENTITY/integer:

Pro Con

Unique across every table, every database, How do you detect duplicates?

every server

Allows easy merging of records How do you detect duplicates? Where do you put the business

from different databases rules relating one record to another? In the ETL script
right? And then you have to keep the results in a mapping
table for ever afterwards because GUID/IDENTITY value is
not predictable. Congratulations, you’ve just got a
a multi-database, 3 way join. Perhaps not so easy after all.

Allows easy distribution of databases I’m not sure if you mean just duplication of read-only data for

across multiple servers performance, or multi-server capture of input. The first is
easy with flat out replication. The second requires more
work to prevent duplicate entry but I can’t see how GUID’s
help at all - you still need to check your input for the
candidate key to detect duplicates. This is not too hard
with industrial grade RDBMs (not easy, not impossible), but
you’d still need to do it with GUID’s and you need to do
it against the real natural-world key - a cross-database query.

You can generate IDs anywhere, instead of having Strawman. In the case of IDENTITY, detection of duplicates always

to roundtrip to the database requires the RDBMS. I’ll concede GUID’s can be generated
on the front end, but you’re not winning as you still need
to check for duplicates.

Most replication scenarios require GUID columns anyway True enough, but this is due to the technical deficiencies of

                                                      current products.  Implementation details should not drive
                                                      design, only influence pragmatism in implementation.  When
                                                      I use replication I stick the replication id on as an extra
                                                      field the users can't see.  If the real primary key isn't
                                                      enough for error detection, I add a trigger or two.

Lastly, getting back to the PRODUCT example. Let’s say you really have a need for distributed generation of keys. You couldn’t go wrong if you took a leaf

out of the real world and use a generated_but_determinate identifier. Let me take three examples that illustrate:

  1. Computer vendors

    Their product “names” are a form of structured key which is constructed according to an arbitrary convention, like:

      product line(3 chars)-model number-revision number
    

    These numbers appear in their catalogs and all internal documents. If you’ve ever ordered equipment you’ll know you’ve had to specify this stuff.
    This allows different units in the company to generate guaranteed unique id’s that have meaning

  2. MAC addresses

    As every one knows MAC addresses are not random, rather the leading edge is assigned to a manufactor who uses the remaining bits to structure an id.

  3. SNMP object ids

    These identify data objects inside network devices. Each is globally unique (across the world), and describe a hierarchy of objects inside each
    network device. Each manufactor is assigned a node on the tree and is free to assign any hierarchiachial structure they need underneath that node.

These are good id’s as they don’t overlap, can be generated without central control and are readily understood by people working with them. Neither IDENTITY

nor GUID are understood at all outside the database which contains them.

GUIDs make the job of bringing data together so much easier and it comes in very VERY handy if you have an application that is multi-user based or has multiple globally distributed databases.

I would argue that performance and disk space is not a problem unless you have millions of rows and hundreds of tables. A few hundred thousand rows will never be a problem!..even spazzy MS Access database can handle that.

It really depends on the needs for your applicaiton. I deal with distributed systems and disconnected systems constantly so a sequential increment of a int won’t cut it. You have to be able to spin up a surrogate key on a disconnectd client.
I know there is a performance / storage trade off. I also know if you are arguing over this, you are either missing largely impact issues elsewhere or building one of the top 10 largest sites in the world.
For the rest of us dealing with disconnected, remote applications (e.g. - real apps of any size) - you use a useful tool.

I don’t know anything about the mathematical probability of GUIDs being unique, but I do have some practical experience in the subject.

I worked on a team years ago where we had to generate our own GUIDS for COM type-libraries. Using Microsoft’s most modern UUID generator at the time, we ran into conflicts no less than every week or two. Often times, our conflicts were with MS Office, or other Microsoft GUIDs.

I’ve also seen inserts fail due to collisions in test databases where GUIDs were used. Even our simple load tests on a later application (at a different company) proved that GUIDs were not anywhere near “globally unique”…

To the people who keep blathering on about natural keys: You are obviously not maintenance programmers in commercial environments.

When the business changes, so does the meaning of your data, and so does the meaning of your natural keys. By using a surrogate key and not exposing it to the end-user, you can fiddle with the other constraints on your database tables without breaking foreign key relationships. While natural keys would be great in a perfect world, it’s far too easy to choose natural keys poorly at design time. For example, a lot of people seem to think that SSNs are a good natural key.

\begin{sarcasm}Because not everyone is Joe Celko, and we can sidestep our grievous inability to design databases by using surrogate keys.\end{sarcasm}

As for GUID vs IDENTITY, I think it’s as simple as asking yourself whether or not you really need a GLOBALLY unique identifier. Can data exist in multiple databases (Service Oriented Architecture blah blah)? Use a GUID. Data warehousing or replication? Use a GUID. Otherwise, default to IDENTITY and enjoy IDENTITY’s transparency.

Just my two cents. Thanks for listening!

i like guids so much i changed my name!

The traditional “auto incrementing 32-bit integer” identity column is plenty fast, but it has some problems. I highly recommend reading Joe Celko’s post on this.

Most of the “technical/hardware” issues Joe mentions are incorrect. You can read all about it in other posts in the same newsgroup (http://tinyurl.com/2jdo9y) by other individuals that are more knowledgeable about the inner workings of Oracle/SQL Server database engines than Joe Celko.

Also, as far as I know Joe Celko doesn’t “approve” of GUIDs. He preaches the use of natural keys.

In your example above, if the table is “Fruits”, why not use the fruit name as the key?

By the way, great blog.

Damn, this is worse than FizzBuzz. Honestly, I would have never expected programmers to be so clueless. Not truly unique? What is this, the argument between the Sun blowing up 5 billion years from now or “only” 4.5 billion years? Who cares? There are less than 10^8 seconds in an year; if you generate a billion GUIDs per second, you will need a thousand years before the chance of a collision is 1/2. How many of you generate a billion GUIDs per second? And if you do, what exactly do you propose that would do the job better?

As for (not) using surrogate keys… sorry; less than a month ago there was an article on the daily WTF about someone having to declare one of his twins’ birthdate in a different month than the other, because the database of the insurance company was using natural keys, and they couldn’t handle two people with the same names, addresses, and birthdate. Codd was a genius, but I simply can’t find a serious reason for not using surrogate keys; natural keys have too many problems.

a 64-bit GUID can be stored in an 8-byte integer (which are supported by most modern dbs) so the size concerns over a 4-byte integer are not as bad as with Jeff’s 16-byte CHAR representation decision.

Oracle is really a different beast in this respect.

First, it doesn’t have built-in support for auto-incrementing INTEGER keys. You have to set up a SEQUENCE object, then explicitly query that object’s NEXTVAL property to get a new value for your key. At best, you can automate this query step using a trigger, but we don’t tend to do this in practice where I work. Though that is mostly a result of our DBA team not having a good system for managing triggers.

Second, Oracle has no built-in UUID datatype. It does have a SYS_GUID function that returns a UUID, but it is stored as a 16-byte RAW value. RAW values are kind of a pain because you have to convert them to a numeric or text datatype before you can really do any logic on them.

So, performance penalties all around with Oracle. I might just have to do some tests and see what we get.

Using a GUID as a primary key for the subscriber data, this person is now getting two subscriptions of the same magazine. Not good.

You’re oversimplifying the problem, and the solution…

The tradeoff of getting the benefits of a surrogate key is that you may need to include an additional unique index on your table. Which I would think would be very obvious in this case. Even so, your natural key or unique index won’t protect you from this problem, as anyone with experience managing customer data knows. You also need to have a de-duping mechanism for information such as names and addresses, to deal with things like typos and whitespace inconsistencies.

The performance of a NVARCHAR is going to be far superior to UNIQUEIDENTIFIER anyday.

I don’t see how this could be. Barring conversions and any other implementational differences, a VARCHAR storing a text representation of a GUID is going to be at least twice as large as a GUID, isn’t it? I’d think there would be at least a small performance impact, just as there is with GUIDs over integers.

Guids suck, they take up more room in the DB, they take more DB horsepower to keep indexed, and they aren’t readible.

Natural keys stink for all the reasons mentioned.

I see nothing wrong with the good old, tried and true, auto-incrementing int ID as a key.

The ONLY argument for the guid is if you are going to use the id in a URL as it obfuscates things better. That’s it, and there are other ways around that now aren’t there?

RAW values are kind of a pain because you have to convert them to a numeric or text datatype before you can really do any logic on them

I should have checked that before I submitted. Apparently comparison and equality logic works fine on a RAW field. And you can make a RAW field a primary key in Oracle with no problems.

So, my guess is that the only drawback beyond INTEGER keys is the size.

Marcel, on a purely theoretical basis you would be correct. You must also take into account the algorithm and random number generator used to generate the GUID. In cases where conflicting GUIDs are being generated often I would suspect the GUID generators are at fault.