Primary Keys: IDs versus GUIDs

LOL - Forgot to enter my name… Anonymous on July 23, 2009 2:48 PM

Like Andrew, I think URLs should be in the CONS list. Having hackable URLs make it much easier to debug and content admins who use my apps are able to grok a number in the query string (/Page.aspx?PageId=14).

The problem with these pros/cons, is that none of these pros make a whole ton of sense.

“Unique across every table, every database, every server” - The primary key needs only be unique within it’s own table. Everything else here is overkill.

“Allows easy merging of records from different databases” - so does a properly tuned insert…select statement.

“You can generate IDs anywhere, instead of having to roundtrip to the database” - but, the primary key belongs to the database. you should only need to create one when you’re inserting into the DB anyway, and already incurring the roundtrip. Generating PK IDs outside of the DB feels like a serious design flaw, leaving your app and DB tied together in a way that seems to fly in the face of design considerations like orthogonality.

There may be specific considerations where this is a fit, but I hate to see this presented as a great idea for everyone when in most cases it’s unnecessary slow down for questionable gain.

1 Like

My strategy is to use identity keys for anything that’s a sequence (such as a very large stream of chronological data - and no, the timestamp is not a valid option). For tables containing records that are only loosely related or totally unrelated to each other, I use a GUID. It’s convenient because, as mentioned in the original post, it can be generated right away without a round-trip to the server.

If a table’s going to have billions or trillions of records (we have several of those), it’s rarely a good candidate for a GUID key because the potential for collisions does exist. It’s also better to use an int (or small/tinyint) for anything representing a “type” because it can be mapped to an enum or a set of constants at the application level.

Replication is a moot point. If you have multiple publishers sending to a single subscriber, just give each server a static key (one key for the whole server) and make the primary key on the subscriber a combination of the identity and the server ID. Perfect uniqueness, almost-perfect indexing.

Oh, just a P.S.: I read Joe Celko’s post and was kind of furrowing my eyebrow until he finally laid all his cards on the table with this statement:

“If you have designed your tables correctly, they will have a
meaningful primary key derived from the nature of the entity they
model.”

Sorry Joe - what you’re sellin’, I’m not buyin’.

And I almost missed this gem: “If you use IDENTITY as a key, the values tend to cluster on physical data pages because they are sequential.” No Joe, that’s what a clustered index is for. Identity keys don’t have to be the clustered index and mine usually aren’t.

I’m not going to pretend that I know everything about databases, but the only thing I read in Joe’s post that was actually correct was his statement that it’s a religious war. There’s definitely religion in there, fighting tooth and nail for frivolous causes based on no evidence or erroneous/outdated evidence.

1 Like

How about neither.

One should strive for natural keys in a database as opposed to generated ones.

I’ve seen both used in databases and for the most part they could have been replaced by natural keys. It is very rare that a table shouldn’t have a natrual key or a natural composite key. Most developers choose generated keys as a way of maintaining primary keys on tables, but really, you can avoid it if you try with a good data model.

  1. The id=4 URL problem:
    Not checking access to data/pages is criminal and one should think about changing his job if not doing it.

  2. Every number may change. A nice customer ID 12345 may change to M12345 or 200012345-2009. Maybe not tomorrow, not in a month. But at the next merger or the next product manager, CEO or legal issue.
    A GUID will never change.
    So keep em both if you need a second one, a GUID forever, and a nice ID (as TemporalObject) for showing up.

  3. I’ve almost never seen a customer or whatever business relevant ID of 1. Either they are 0001, 10001 or 2009001 (e.g. for invoices).

  4. Performance:
    Distinct between reading (reporting) and writing. Mostly you read x times more then write. And most of the time you don’t need operations on the GUID. If you use a GUID and a nice and readable ID do your db operations on the latter. Consider having special tables and views jsut for reading concerns, perhaps already prepared for the UI and use cases, avoiding any joins.

  5. 100% thumbs up for Groover.

Its very useful in scenarios when you have a url like

somewebsite.com/projects/1

and 1 is your projectId primary key, the user can change the value and try to see what he is not allowed to see, of course you will have an authfilter but still exposing integer primary key I have always found to be dangerous.

Same url with a guid would be

somewebsite.com/projects/C87FC84A-EE47-47EE-842C-29E969AC5131

even if he changes a few characters here and there the chances of it being present are very very very slim.

Suppose , we want to keep ,C87FC84A-EE47-47EE-842C-29E969AC5131 as Document Name.

Which will be stored in Fileserver , but only path and name is stored in DB.

If anyone hit below URL they able to download file (if proper permission given)

somewebsite.com/project1/C87FC84A-EE47-47EE-842C-29E969AC5131.txt

Yes , correct , chances are very slim , but intentionally, someone can write programming code and try to generate GUID random with particular sequence and hit browser and open/ download the Document forcefully. Chances of success to get document are very less but easily achievable by programming hack.

  1. What are the chances , that Filename of GUID will not be repeated.if all different projects / filename id on same database same server.
  2. Can we have same GUID in another project , where we have two different Database tables for different projects ?
    e.g : somewebsite.com/project2/C87FC84A-EE47-47EE-842C-29E969AC5131.txt

Any more good ideas / suggestions ?

Thanks in Advance…!

Hey

Great blog, and I know its been a long time. But I want to point out, that if you are already using int, transitioning to a guid might be hard as you need to backward compatible, older sdk’s etc.

Another strategy to keep ids unique across database across servers, you can use the way facebook generates primary ids for its users. Facebook uses a large sharded MySQL setup, both for generating id values and for storing the actual data itself (so it isn’t a separate system).

Each shard gets a very large chunk of id values out of a 64-bit range. It then uses an auto_increment MySQL field and some simple math to calculate the new value to generate for the id out of the range allocated to the shard. For example you could assign shard #1 a range of 0 to 100 million. shard #2 could be 100 million to 200 million, etc… Each could then allocate new unique ids independently of each other. Since we are dealing with 64-bit integers, you can make very large ranges for each and still have very large numbers of shard. Each shard keeps track of the id values generated and what type of data is associated with it, so given a completely unknown id you can figure it out.

In fact, if each shard had a range of 0 to 100 billion, you could still have 92 millions shards.

My view is to use the largest natural integer key that the database can use as the primary key. Especially if it can be used with a key bucket to prevent transaction locks waiting for generated IDs.

BUT I would also have a GUID or something like it as a business key. I would call this business key rather than something specific because for my business key I opted for a randomized string using SecureRandom to build a 16 character string of letters and numbers (no special symbols)

Another thing, the primary keys are NOT to be exposed outside of the database. So queries against the primary keys directly are not allowed through the API.

Having integer keys simplifies the indices needed by the database making FK lookups faster.

Here’s one thing you would likely notice for large enterprise applications, chances are not everything can be mapped to a table, but some things need to come out as a hierarchy of tables with nested 1-n relationships tied to a “top table” It is that top table taht would have the business key and querying said table will get the primary key needed for all the nested tables.

Couple of points:

GUIDs are not guaranteed to be unique. It’s just highly unlikely that two will be generated with the same value, but, if you were careful, you could do the same thing with semi-random, 128-bit integers.

GUIDs are frequently stored as strings. In addition to performance penalties related to size and how many you can fit in an index page, strings compare less efficiently than integers, and frequently there is code set conversion required for strings.

1 Like

The problem with natural keys is that sometimes they change when you think they can’t. For example, a person who is a victim of identity theft might change their SSN.

Also, you might consider making a distinction between indices used for performance and indices used for business logic. For example, when you create a unique constraint, the DBMS will create an index simply because that is the only efficient way to enforce the constraint. But, if you are creating an index to increase performance, that has no impact on business logic. For performance, the more keys you can fit on an index page the better your performance will be, assuming a large data set, and nothing is as compact or as easy to compare as an integer, assuming the integer type is within the capabilities of the CPU.

2 Likes