Primary Keys: IDs versus GUIDs

Keep in mind that index size can adversly affect perf - smaller index size is always better - what happens when the index goes from fitting into cache to not fitting into cache - I guess I am just lucky in that I get to work with vldb data. in a vldb you should never use a guid - but if you need to send something to a cookie such as keeping track of a user then you should send a guid - and then change it on each visit - this makes it impossible to steal a users session.

Two seperate ideas here - guids are bigger - the web contains hazards that should not affect your db design performance wise.

I agree with those advocating natural keys in URLs.

They’re “nicer” looking, they’re more memorable, and they’re search engine friendly.

I run a video games website and we recently replaced all of our “review.php?id=123” links with “reviews/123/name_of_the_game.html” and got noticable improvements in search engine results with zero performance hit. In fairness the 123 was kept, to ensure backward compatibility with legacy links, but if writing from scratch, that would be gone.

Which is better?

www.myblogsite.com?userID=0AFDC426-C52E-46F0-8752-5F9C71159FD9
www.myblogsite.com?userID=7439
www.myblogsite.com/JeffSmith

Yeah, I think so too. Is there a performance hit to the latter? Yeah. But it’s minimal, and well worth it. All your internal joins, etc, stay on surrogate int keys.

As for security, userID=7439 CAN be changed to userID=7440. So? The URL string is the least secure thing on your site. Users can TYPE directly into it. If you’re trusting that, your choice of primary key types is the least of your problems.

The above post makes some valid points. But is flawed.

Natural keys should be used, but not to make “typeable” addresses. They should be used to make neat links.

People shouldn’t be typing in /pen/, they should be clicking on the “Pens” link. Url by guess is not adequate navigation. :slight_smile:

I was speaking with a colleague about this exact question earlier today, and he pointed something out that hasn’t been mentioned in this thread as of yet. So, for completeness I’ll toss it in.

If you can reasonably assume that each row will be unique across an entire schema, and gracefully handle when that’s != true, then Guids also buy you some “polymorphism” for your tables of sorts. For example, if you have an Image table, and images can be associated with multiple entities in your business logic layer, then instead of requiring relations tables ‘FooImage’, ‘BarImage’ etc., you can just create an ‘EntityImage’ table to reference the guid for any object, and the image guid.

This doesn’t really add to the debate over performance or integrity (and perhaps you don’t like the idea of such free association of images to, now, any other entity, or replacing manifold relations tables for one larger table which, you’ll argue, will perform poorly even with an unclustered guid PK), of course, but it might appeal to the more OO-minded developers.

I wrote a lengthy article on this very topic:
http://cookingwithsql.com/index.php?option=com_contenttask=viewid=52Itemid=58

In a nutshell, GUIDS suck for performance. If you are creating rinky dink apps, go for it. If you are building an application with a monster database, then stick to integers.

1 Like

@Datagod:

Thank you for nicely highlighting the “percentages” calculated off those tiny, tiny example tables. My tables happen to have many more columns with way larger data. Do the percentages hold then?

I must say I especially like your char(10) columns, which must be the newest breakthrough in enterprise development.

If you want to be taken seriously, change that script-kiddie name of yours too.

Ishmaeel, the example was simplified so as to make it easier to explain.

There is no call for being so rude.

When is the last time you performance tuned a terabyte database? I thought so.

Get bent.

Datagod:

How do the % look if you use the sequential GUIDs?

Guys, how do you handle merge replication when using autogenerated keys? In my setup there will be lots of PCs (at different locations)with their own SQL Express database. Tables have autogenerated primary and foreign keys. Now all this data need to replicated back to a central location on a daily basis. We can’t go with separate identity ranges as that makes managing this for the number of PCs would be hard.

I am looking for some pointers as to how to handle replications problems which will happen when 2 or more records have same primary or forign key. Is there a simpler way besides going row by row on each machine and inserting the records into the central server using the new keys generated by the central server?

  1. It’s amazing that seemingly intelligent people are concerned that there might be a collision with GUIDs and how they’re not “guaranteed” to be unique. They might want to brush up on their math skills. A 128 bit number isn’t easily duplicated. So what if you get a dup key on insert? It gets kicked back, you handle the error, you regen and retry. What’s the big deal? It should only happen once every few thousand years anyway. Your system won’t be in production that long and/or the problem won’t happen enough to be an issue.

  2. The extra 12 bytes isn’t a big deal in most cases. Notice I say most cases. If you have a million rows that’s only around 12MB right? If you have a super large db then of course you should think about it and maybe use a different strategy but for someone to say “I’m never going to use a GUID PK under any circumstance” is a little short sited IMHO. In computer science the answer is always “it depends”.

  3. Anyone that uses a clustered index on a guid (without a really good reason) probably doesn’t know what they’re doing in the first place. So this is user error, not a flaw in the datatype or db engine. If you didn’t know that the clustered index doesn’t have to be the same as the PK then you should study up and make sure you understand the tools and technology you’re using.

Wow, thanks to RK for the link. Maybe I should brush up on my math as well :slight_smile:

“1 trillion UUIDs would have to be created every nanosecond for 10 billion years to exhaust the number of UUIDs”

http://en.wikipedia.org/wiki/Universally_Unique_Identifier

“1 trillion UUIDs would have to be created every nanosecond for 10 billion years to exhaust the number of UUIDs”

The only problem with this logic is that nothing is keeping track of the UUIDs being generated. You have just as much chance of getting a duplicate on the second try as you do no the 100 Quintillionth.

Groover made an excellent point… HANDLE THE ERROR, RETRY!

Hi,

Excellent blog!!!

I am a developer (not a db admin) working designing a database - confirming to best
practices. I am not sure at this time, but at a later time in the app’s life, we may have to deal with replication, merging etc.

I read various pros and cons and concerns raised in this blog with auto-gen keys in this blog and the potential performance issues with GUID keys.

I have not read anyone commenting on the statement from Celko 'sarticle…

"There are other ways of getting a unique identifier for a table. The most portable method for getting a new identifier number which is not in the set is something like this:
INSERT INTO Foobar (keycol, a, b, c…)
VALUES(COALESCE((SELECT MAX(keycol) FROM Foobar) +1, 0),
aa, bb, cc, …); "…

Looks like performance wise, this method may take a ‘one-time hit during insertion’ because of the sub-query. Can anyone with past experience comment on issues with this approach when it comes to replication/merging and any other db management issues?

Thank you

  • Athadu

There is nothing wrong with exposing parameters into URL, just verify it! Anything can be entered into the URL including SQL so it should always be verified (against database, cookie and/or session), using a GUID to help here is just lazy and the wrong tool for the job. Remember the Web is stateless, each page needs to check who is logged in and that they have access to the resource that the URL (or cookie) is requesting.

Using GUIDs helps with security and obfuscation as well. When data must be selected and is databound in some control, such as a listbox, the user can easily figure out certain things about your database.

This is especially true for web pages where a user might be able to figure out how to gain access to certain resources they might not otherwise have access to (ie: view source - scroll to listbox - “OOH! I have access to IDs 1, 2 and 4, but not 3.” - “Let’s plug 3 into this hidden field and see what happens”). Or SQL injection. Or query strings. You get the point.

Using GUIDs will prevent this. BOING!

In weighing up the options, many people presume ‘replication’ and ‘merging’ is a rare occurance.

However, when you are dealing with disconnected datasets in the client (which is now more common than not), every transaction from client to server is effectively a database replication/merge scenario.

GUIDs are unique identifiers, nothing more. Numeric IDs can give you many useful feature ‘for free’. I disagree with GUIDs feeling natural - we’re talking about a database which is not natural - numbers are natural and mean something to humans plus you get chronological ordering built in. The performance on a numeric ID is so much better.

Regarding replication, the best way (not my opinion, but in every real world case) is to have a numeric primary key and a GUID column for replication.

Unfortunately the academic best practices which are thought up by the gurus do not translate into the best method for the real applications!

You have to think about how the database is going to be used as that is the absolute goal of a database. Doing things in the disputed ‘correct’ way is of no use if the solution does not give the project the best end result.

Consider this:

(employee) What is your customer ID?
(customer) 2A734AE4-E0EF-4D77-9F84-51A8365AC5A0

And you believe this will happen? Never never.

So does anyone has an practical solution?

I am pro-guid…

So you are making my points really!

Use both - they have complimentary and valid uses - that’s why I’ve still got a job!

I just cannot understand why people think that GUIDs are the new id - no they are there for a purpose - I don’t think sequential numbers will ever have their day.

As I say - think about it, don’t just say mine is better than yours - you won’t even get job!

I will not argue for using either GUID, IDENTITY columns or even natural keys. I think it all depends on your needs, if you are building a datawarehouse you’ll definitely NOT stick with IDENTITY, you would probably create the Id on-the-fly, SMALLINT, INT, BIGINT or whatever suit your needs.

IMHO using GUIDs when developing Web-applications is nice and easy to deal with, for instance creating new customer, new user etc. notice the usage!! Small tables, but wide access to functionality, much easier to generate the GUID in the application and commit, anyway that’s just my opinion. I saw another poster commented the cons of natural keys, I agree :slight_smile:
(GUID are also un-avoidable when dealing with merge-replication, and merge-replication should normally exist in a normal sized environment where users updates/ inserts to the frontend-SQL and the data needs to be replicated to the backend)

I have read through all the posts (I believe), some misconceptions exists here and there, but the one from :

Aaron G on March 21, 2007 8:00 AM
"If you use IDENTITY as a key, the values tend to cluster on physical
data pages because they are sequential. The result is that if the most
recent rows are the most likely to be accessed, there will be locking
contention for control of those physical data pages. What you really
wanted in a key is some spread of the rows over physical storage to
avoid having every user trying to get to the same page at the same
time.“
Aaron’s comment
"And I almost missed this gem: “If you use IDENTITY as a key, the values tend to cluster on physical data pages because they are sequential.” No Joe, that’s what a clustered index is for. Identity keys don’t have to be the clustered index and mine usually aren’t.”

  • Really hit my eyes… Aaron doesn’t know how a clustered index works :slight_smile: Joe does! Aaron missed the point regarding pages and Joe is completely right, it would be very nice to create a clustered index on different filegroups… however this can be accomplished at a physical level, but that is another story.

Willem Luijk on February 21, 2009 3:15 AM:
You are mixing customer id and surrogate values, don’t do that! The GUID is created for relationship and not some, in your case, customerId.
I have always drawn a sharp line between Id’s and numbers; IMHO as a rule of thumb: Id’s are arbitrary values and shouldn’t be used as anything else, whereas numbers are describing something.
So in your case, CustomerId is in fact a GUID, but you need to model a CustomerNo as well :slight_smile:
And the CustomerNo can be in a friendly easy reading format, i.e. Customer: Joe Montego, could have a CustomerNo like : Montego01 , easy to understand and Mr. Montego is very happy, he doesn’t need to look at some stupid insurance/ invoice or other crappy system generated, lazy db’s/ developer/ -solution where only the nerds thinks it’s cool to have a customernumber like 4533-49900.900-88 LOL right? (and where he usually ends up with… yada yada… I only have my phonenumber and the service employee ends up with looking their own nerdy-invented stupid customerId up using that phonenumber ROTFL)