Primary Keys: IDs versus GUIDs

There is nothing wrong with exposing parameters into URL, just verify it! Anything can be entered into the URL including SQL so it should always be verified (against database, cookie and/or session), using a GUID to help here is just lazy and the wrong tool for the job. Remember the Web is stateless, each page needs to check who is logged in and that they have access to the resource that the URL (or cookie) is requesting.

Using GUIDs helps with security and obfuscation as well. When data must be selected and is databound in some control, such as a listbox, the user can easily figure out certain things about your database.

This is especially true for web pages where a user might be able to figure out how to gain access to certain resources they might not otherwise have access to (ie: view source - scroll to listbox - “OOH! I have access to IDs 1, 2 and 4, but not 3.” - “Let’s plug 3 into this hidden field and see what happens”). Or SQL injection. Or query strings. You get the point.

Using GUIDs will prevent this. BOING!

In weighing up the options, many people presume ‘replication’ and ‘merging’ is a rare occurance.

However, when you are dealing with disconnected datasets in the client (which is now more common than not), every transaction from client to server is effectively a database replication/merge scenario.

GUIDs are unique identifiers, nothing more. Numeric IDs can give you many useful feature ‘for free’. I disagree with GUIDs feeling natural - we’re talking about a database which is not natural - numbers are natural and mean something to humans plus you get chronological ordering built in. The performance on a numeric ID is so much better.

Regarding replication, the best way (not my opinion, but in every real world case) is to have a numeric primary key and a GUID column for replication.

Unfortunately the academic best practices which are thought up by the gurus do not translate into the best method for the real applications!

You have to think about how the database is going to be used as that is the absolute goal of a database. Doing things in the disputed ‘correct’ way is of no use if the solution does not give the project the best end result.

Consider this:

(employee) What is your customer ID?
(customer) 2A734AE4-E0EF-4D77-9F84-51A8365AC5A0

And you believe this will happen? Never never.

So does anyone has an practical solution?

I am pro-guid…

So you are making my points really!

Use both - they have complimentary and valid uses - that’s why I’ve still got a job!

I just cannot understand why people think that GUIDs are the new id - no they are there for a purpose - I don’t think sequential numbers will ever have their day.

As I say - think about it, don’t just say mine is better than yours - you won’t even get job!

I will not argue for using either GUID, IDENTITY columns or even natural keys. I think it all depends on your needs, if you are building a datawarehouse you’ll definitely NOT stick with IDENTITY, you would probably create the Id on-the-fly, SMALLINT, INT, BIGINT or whatever suit your needs.

IMHO using GUIDs when developing Web-applications is nice and easy to deal with, for instance creating new customer, new user etc. notice the usage!! Small tables, but wide access to functionality, much easier to generate the GUID in the application and commit, anyway that’s just my opinion. I saw another poster commented the cons of natural keys, I agree :slight_smile:
(GUID are also un-avoidable when dealing with merge-replication, and merge-replication should normally exist in a normal sized environment where users updates/ inserts to the frontend-SQL and the data needs to be replicated to the backend)

I have read through all the posts (I believe), some misconceptions exists here and there, but the one from :

Aaron G on March 21, 2007 8:00 AM
"If you use IDENTITY as a key, the values tend to cluster on physical
data pages because they are sequential. The result is that if the most
recent rows are the most likely to be accessed, there will be locking
contention for control of those physical data pages. What you really
wanted in a key is some spread of the rows over physical storage to
avoid having every user trying to get to the same page at the same
time.“
Aaron’s comment
"And I almost missed this gem: “If you use IDENTITY as a key, the values tend to cluster on physical data pages because they are sequential.” No Joe, that’s what a clustered index is for. Identity keys don’t have to be the clustered index and mine usually aren’t.”

  • Really hit my eyes… Aaron doesn’t know how a clustered index works :slight_smile: Joe does! Aaron missed the point regarding pages and Joe is completely right, it would be very nice to create a clustered index on different filegroups… however this can be accomplished at a physical level, but that is another story.

Willem Luijk on February 21, 2009 3:15 AM:
You are mixing customer id and surrogate values, don’t do that! The GUID is created for relationship and not some, in your case, customerId.
I have always drawn a sharp line between Id’s and numbers; IMHO as a rule of thumb: Id’s are arbitrary values and shouldn’t be used as anything else, whereas numbers are describing something.
So in your case, CustomerId is in fact a GUID, but you need to model a CustomerNo as well :slight_smile:
And the CustomerNo can be in a friendly easy reading format, i.e. Customer: Joe Montego, could have a CustomerNo like : Montego01 , easy to understand and Mr. Montego is very happy, he doesn’t need to look at some stupid insurance/ invoice or other crappy system generated, lazy db’s/ developer/ -solution where only the nerds thinks it’s cool to have a customernumber like 4533-49900.900-88 LOL right? (and where he usually ends up with… yada yada… I only have my phonenumber and the service employee ends up with looking their own nerdy-invented stupid customerId up using that phonenumber ROTFL)

LOL - Forgot to enter my name… Anonymous on July 23, 2009 2:48 PM

Like Andrew, I think URLs should be in the CONS list. Having hackable URLs make it much easier to debug and content admins who use my apps are able to grok a number in the query string (/Page.aspx?PageId=14).

The problem with these pros/cons, is that none of these pros make a whole ton of sense.

“Unique across every table, every database, every server” - The primary key needs only be unique within it’s own table. Everything else here is overkill.

“Allows easy merging of records from different databases” - so does a properly tuned insert…select statement.

“You can generate IDs anywhere, instead of having to roundtrip to the database” - but, the primary key belongs to the database. you should only need to create one when you’re inserting into the DB anyway, and already incurring the roundtrip. Generating PK IDs outside of the DB feels like a serious design flaw, leaving your app and DB tied together in a way that seems to fly in the face of design considerations like orthogonality.

There may be specific considerations where this is a fit, but I hate to see this presented as a great idea for everyone when in most cases it’s unnecessary slow down for questionable gain.

1 Like

My strategy is to use identity keys for anything that’s a sequence (such as a very large stream of chronological data - and no, the timestamp is not a valid option). For tables containing records that are only loosely related or totally unrelated to each other, I use a GUID. It’s convenient because, as mentioned in the original post, it can be generated right away without a round-trip to the server.

If a table’s going to have billions or trillions of records (we have several of those), it’s rarely a good candidate for a GUID key because the potential for collisions does exist. It’s also better to use an int (or small/tinyint) for anything representing a “type” because it can be mapped to an enum or a set of constants at the application level.

Replication is a moot point. If you have multiple publishers sending to a single subscriber, just give each server a static key (one key for the whole server) and make the primary key on the subscriber a combination of the identity and the server ID. Perfect uniqueness, almost-perfect indexing.

Oh, just a P.S.: I read Joe Celko’s post and was kind of furrowing my eyebrow until he finally laid all his cards on the table with this statement:

“If you have designed your tables correctly, they will have a
meaningful primary key derived from the nature of the entity they
model.”

Sorry Joe - what you’re sellin’, I’m not buyin’.

And I almost missed this gem: “If you use IDENTITY as a key, the values tend to cluster on physical data pages because they are sequential.” No Joe, that’s what a clustered index is for. Identity keys don’t have to be the clustered index and mine usually aren’t.

I’m not going to pretend that I know everything about databases, but the only thing I read in Joe’s post that was actually correct was his statement that it’s a religious war. There’s definitely religion in there, fighting tooth and nail for frivolous causes based on no evidence or erroneous/outdated evidence.

1 Like

How about neither.

One should strive for natural keys in a database as opposed to generated ones.

I’ve seen both used in databases and for the most part they could have been replaced by natural keys. It is very rare that a table shouldn’t have a natrual key or a natural composite key. Most developers choose generated keys as a way of maintaining primary keys on tables, but really, you can avoid it if you try with a good data model.

  1. The id=4 URL problem:
    Not checking access to data/pages is criminal and one should think about changing his job if not doing it.

  2. Every number may change. A nice customer ID 12345 may change to M12345 or 200012345-2009. Maybe not tomorrow, not in a month. But at the next merger or the next product manager, CEO or legal issue.
    A GUID will never change.
    So keep em both if you need a second one, a GUID forever, and a nice ID (as TemporalObject) for showing up.

  3. I’ve almost never seen a customer or whatever business relevant ID of 1. Either they are 0001, 10001 or 2009001 (e.g. for invoices).

  4. Performance:
    Distinct between reading (reporting) and writing. Mostly you read x times more then write. And most of the time you don’t need operations on the GUID. If you use a GUID and a nice and readable ID do your db operations on the latter. Consider having special tables and views jsut for reading concerns, perhaps already prepared for the UI and use cases, avoiding any joins.

  5. 100% thumbs up for Groover.

Its very useful in scenarios when you have a url like

somewebsite.com/projects/1

and 1 is your projectId primary key, the user can change the value and try to see what he is not allowed to see, of course you will have an authfilter but still exposing integer primary key I have always found to be dangerous.

Same url with a guid would be

somewebsite.com/projects/C87FC84A-EE47-47EE-842C-29E969AC5131

even if he changes a few characters here and there the chances of it being present are very very very slim.

Suppose , we want to keep ,C87FC84A-EE47-47EE-842C-29E969AC5131 as Document Name.

Which will be stored in Fileserver , but only path and name is stored in DB.

If anyone hit below URL they able to download file (if proper permission given)

somewebsite.com/project1/C87FC84A-EE47-47EE-842C-29E969AC5131.txt

Yes , correct , chances are very slim , but intentionally, someone can write programming code and try to generate GUID random with particular sequence and hit browser and open/ download the Document forcefully. Chances of success to get document are very less but easily achievable by programming hack.

  1. What are the chances , that Filename of GUID will not be repeated.if all different projects / filename id on same database same server.
  2. Can we have same GUID in another project , where we have two different Database tables for different projects ?
    e.g : somewebsite.com/project2/C87FC84A-EE47-47EE-842C-29E969AC5131.txt

Any more good ideas / suggestions ?

Thanks in Advance…!

Hey

Great blog, and I know its been a long time. But I want to point out, that if you are already using int, transitioning to a guid might be hard as you need to backward compatible, older sdk’s etc.

Another strategy to keep ids unique across database across servers, you can use the way facebook generates primary ids for its users. Facebook uses a large sharded MySQL setup, both for generating id values and for storing the actual data itself (so it isn’t a separate system).

Each shard gets a very large chunk of id values out of a 64-bit range. It then uses an auto_increment MySQL field and some simple math to calculate the new value to generate for the id out of the range allocated to the shard. For example you could assign shard #1 a range of 0 to 100 million. shard #2 could be 100 million to 200 million, etc… Each could then allocate new unique ids independently of each other. Since we are dealing with 64-bit integers, you can make very large ranges for each and still have very large numbers of shard. Each shard keeps track of the id values generated and what type of data is associated with it, so given a completely unknown id you can figure it out.

In fact, if each shard had a range of 0 to 100 billion, you could still have 92 millions shards.

My view is to use the largest natural integer key that the database can use as the primary key. Especially if it can be used with a key bucket to prevent transaction locks waiting for generated IDs.

BUT I would also have a GUID or something like it as a business key. I would call this business key rather than something specific because for my business key I opted for a randomized string using SecureRandom to build a 16 character string of letters and numbers (no special symbols)

Another thing, the primary keys are NOT to be exposed outside of the database. So queries against the primary keys directly are not allowed through the API.

Having integer keys simplifies the indices needed by the database making FK lookups faster.

Here’s one thing you would likely notice for large enterprise applications, chances are not everything can be mapped to a table, but some things need to come out as a hierarchy of tables with nested 1-n relationships tied to a “top table” It is that top table taht would have the business key and querying said table will get the primary key needed for all the nested tables.

Couple of points:

GUIDs are not guaranteed to be unique. It’s just highly unlikely that two will be generated with the same value, but, if you were careful, you could do the same thing with semi-random, 128-bit integers.

GUIDs are frequently stored as strings. In addition to performance penalties related to size and how many you can fit in an index page, strings compare less efficiently than integers, and frequently there is code set conversion required for strings.

1 Like

The problem with natural keys is that sometimes they change when you think they can’t. For example, a person who is a victim of identity theft might change their SSN.

Also, you might consider making a distinction between indices used for performance and indices used for business logic. For example, when you create a unique constraint, the DBMS will create an index simply because that is the only efficient way to enforce the constraint. But, if you are creating an index to increase performance, that has no impact on business logic. For performance, the more keys you can fit on an index page the better your performance will be, assuming a large data set, and nothing is as compact or as easy to compare as an integer, assuming the integer type is within the capabilities of the CPU.

2 Likes