Maybe Normalizing Isn't Normal

Another case where you would almost never want to denormalize would be a situation where you’d turn fixed-length rows into variable-length rows. If your table is all ints, you wouldn’t want to replace them with varchars.

However, denormalizing to create fixed-length rows instead of variable-length rows (or overnormalizing to do it) could sometimes be a performance benefit, if your application is able to reference things by their integer ids instead of their string values.
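As a rough sketch of that idea (table and column names are mine, not from the post's schema; SQL Server-style brackets assumed), referencing a small lookup table by integer id keeps the wide row fixed-length:

-- Lookup table holds the variable-length strings once.
CREATE TABLE ReligiousView (
    religious_view_id int PRIMARY KEY,
    name              varchar(50) NOT NULL
);

-- The wide table stores only fixed-width integer ids.
CREATE TABLE [User] (
    user_id           int PRIMARY KEY,
    religious_view_id int NULL REFERENCES ReligiousView (religious_view_id)
    -- other fixed-width columns ...
);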

-Max

Okay, from the very beginning it has been known that normalisation can cause problems with performance. I think every trained DB designer knows this.

But what happens is that people see Normalisation = Slow and assume that normalisation isn’t needed: my data retrieval needs to be fast, therefore I am not going to normalise!

So they end up with a database they call denormalised for performance, but really it isn’t, as the database was never normalised in the first place.

The only time I throw normalisation out the window is when I have a list of infrequently changing fixed-sized items which I can separate by commas and store in a single field instead of another table.

However, whilst it saves time on the querying side, the issue I find with this approach is that it slows down the processing side, as my application must split the list into an array or another structure before I can work with the data.
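A rough sketch of that trade-off, with made-up table names (not from the post):

-- Denormalised: one comma-separated field, no join to read it,
-- but the application has to split the string before using it.
CREATE TABLE UserInterest (
    user_id   int PRIMARY KEY,
    interests varchar(500)   -- e.g. 'chess,hiking,sql'
);

-- Normalised alternative: one row per item, no string parsing needed.
CREATE TABLE UserInterestItem (
    user_id  int NOT NULL,
    interest varchar(50) NOT NULL,
    PRIMARY KEY (user_id, interest)
);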

Personally, if the trade-off of each approach is near the same, I’d prefer a more logically structured database without null fields than a denormalized database which is hard to maintain.

Ariel and George,
Phone numbers don’t go into the user table because users can have multiple phone numbers (work, home, cell, fax, etc.). A properly normalized database extracts them out to their own table, especially since the majority of users will have lots of nulls in these fields. This is assuming that you are normalizing your DB to 3NF.

In addition, who says you wouldn’t want all the user data including phone numbers at once? Have you ever seen a profile page on FB? That’s exactly what it does.
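A sketch of what that extraction might look like (column names illustrative, not necessarily Jeff’s exact schema):

CREATE TABLE UserPhoneNumber (
    user_id      int NOT NULL REFERENCES [User] (user_id),
    phone_type   varchar(10) NOT NULL,  -- 'work', 'home', 'cell', 'fax'
    phone_number varchar(20) NOT NULL,
    PRIMARY KEY (user_id, phone_type)
);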

always wondered what database normalisation meant…

Can’t imagine that joins are exactly a heavy performance drain though… the data is indexed after all. I don’t actually have the experience of a database that big (the biggest DB I worked on only held info on about 1.5 million people, linked to various details in much smaller tables), but I would imagine you could have trillions of records with no problem given a good indexing implementation…

In my experience slowness comes from things like calculating a sum, where the database doesn’t update the sum on each record change but recalculates the whole damned thing every time it’s needed. Or poor application design, making hundreds of server requests when one would do just fine.
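For example (a sketch only; SQL Server trigger syntax assumed and the table names are invented), the sum can be maintained when the rows change instead of being re-aggregated on every read:

CREATE TRIGGER trg_OrderItem_Total ON OrderItem
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Recompute the total only for the orders whose items just changed.
    UPDATE o
    SET    order_total = (SELECT SUM(i.price * i.quantity)
                          FROM   OrderItem i
                          WHERE  i.order_id = o.order_id)
    FROM   Orders o
    WHERE  o.order_id IN (SELECT order_id FROM inserted
                          UNION
                          SELECT order_id FROM deleted);
END;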

Just taking a naive implementation: given numeric primary keys, you can reduce the number of records you actually have to search through by a factor of 10 for each digit you decide to take advantage of…

I wonder what modern database implementations actually do? Generic keys aren’t so useful, but similar logic can still be applied… at worst there is always the choice of a binary tree…

I am pretty sure your original normalized database in the diagram is poorly designed. After all you have two tables that appear to hold a user_id + affiliation_id relationship (UserAffiliation and UserWorkHistory). That, my friend, is wrong. And no wonder you have six joins, you have a database design that does not make sense.

You don’t want to write a query like that-- ever-- in real life. It doesn’t make sense. If you have three users who each have two phone numbers and two screen names, your join gets you a result set with twelve rows in whatever order the database deems appropriate.
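To make that concrete (schema abbreviated), the single combined query would look something like:

SELECT u.user_id, p.phone_number, s.screen_name
FROM   [User] u
JOIN   UserPhoneNumber p ON p.user_id = u.user_id
JOIN   UserScreenName  s ON s.user_id = u.user_id;

-- 3 users x 2 phone numbers x 2 screen names = 12 rows, with each
-- user's columns repeated four times.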

In my experience, the odds of a developer properly designing a database are extremely small. You have just proven that to me yet again. And I can guarantee you that bad designs are way more likely to cause performance problems than properly normalizing data. Not only that, bad database design is likely to cause other problems like inconsistent data, incomprehensible code, and inscrutable bugs.

@Niyaz: When something is updated in the tables, all these join statements will be needed to make the view. Right?

Yes - whenever you update / insert data it will in fact update / insert twice. But that’s akin to saying indexes reduce performance because you need to store the same data multiple times :). My favorite saying on this is: you can read fast and store slow, or you can store fast and read slow.

@Jheriko: The biggest performance killer is the so-called physical read. Finding and accessing data on disk is the slowest operation. Unless the child table has a clustered index and you’re using that clustered index in the join, you will be making lots of small random-access reads on the disk to find and access the child table data. That will be slow.
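A sketch of what that might look like (SQL Server syntax assumed):

-- Cluster the child table on the join key so one user's rows sit
-- together on disk and the join becomes a single range scan rather
-- than many random reads.
CREATE CLUSTERED INDEX IX_UserPhoneNumber_user_id
    ON UserPhoneNumber (user_id);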

In some cases you may be right, but what will you do if you need more entries in affiliation, work_history and screen_name? Do you really want to have so many empty fields in each entry?

@Jeff,

Your example is wrong. This type of denormalization will make designing the user interface, the source code and the rest of the SQL more complicated. For example, getting all users connected to a given affiliation will now need 3 times the queries it did before. And what if you want to extend WorkHistory to 4 (or 10) entries in a later release?

I also cannot imagine you need all this information in 1 call over a large number of users. To get this data for a single user it doesn’t matter much if you do a few more calls to the database. I assume that large queries probably go over user only, or user and affiliation.

Also the denormalized table isn’t complete, it’s missing 3 work_history_affiliations.

But some denormalization can be done on your sample without it becoming unnormalized:

Removing the UserPhoneNumber table is fine; that table is kind of useless, users don’t want to submit hundreds of phone numbers, and it isn’t likely there is an index on phone_number.

UserAffiliation and UserWorkHistory can be merged. Maybe add a type field if needed, otherwise an empty company_name should tell enough…

So you would end up with:

User
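One possible shape for that merged table (a sketch of my reading of the suggestion above; it is not the commenter’s actual schema):

CREATE TABLE UserAffiliation (
    user_id        int NOT NULL REFERENCES [User] (user_id),
    affiliation_id int NOT NULL REFERENCES Affiliation (affiliation_id),
    company_name   varchar(100) NULL,  -- NULL/empty for a plain affiliation
    PRIMARY KEY (user_id, affiliation_id)
);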

Why not cache denormalized sets of user data in a second database? That allows us the benefits of both.

The only companies running on denormalized databases are companies providing websites that have loosened constraints when it comes to data loss. Your credit card company is not going to denormalize their data in their transactional system anytime soon.

Atwood is wrong. There is a better way to do it, and that’s to normalize your data. This keeps your data integrity and DRY, and helps protect you from bugs and errors in your database. Correctness should always come first. Denormalization is up there with hack and kludge and should be treated as such, not as a reasonable alternative to normalization. (Albeit a sometimes necessary one, though probably employed more than it needs to be due to posts like this one, and also employed due to limitations of the shitty database these sites seem to run on top of, MySQL.)

The biggest scalability problems I face are with human processes, not computer processes. Aside from number crunching applications, I’ve seldom run into the need to restructure code for performance. But I’m always running into mental scalability problems, ways of doing work that are going to drive people crazy in the long run.

Another piece of savvy writing. Thank you so much for these writings, which come from experience and humble truth. Whatever that means.

Premature optimisation! Design the database fully normalised, measure performance, optimise, rinse and repeat until fast enough.

Don’t forget that the fastest database query is the one that doesn’t happen, i.e. caching is your friend.

If you want to have a look at the spec of a fully engineered and normalised database, check out the NHS Data Dictionary.

This diagram shows the person property:

http://www.datadictionary.nhs.uk/data_dictionary/diagrams/person__amp__person_property_imsp.asp?shownav=1

Wanted to repeat what Brian said, because it’s one of the more valuable things in the comments: don’t use SELECT * and you’ll have cut your query count almost in half.
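Something like this (column names invented for the example):

-- Name only the columns the page actually renders, instead of SELECT *.
SELECT u.user_id, u.display_name, u.location
FROM   [User] u
WHERE  u.user_id = 10;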

The database diagrams appear to have been made in a recent copy of Visual Studio.

While I agree with the general point Jeff, I do think there are other ways of achieving the effects of denormalization without squishing tons of fields into a table. Views, reporting databases, etc. If The Daily WTF is to be believed, just adding some (well-thought-out) indices should be everyone’s first step. Then learn how to use tuners and optimizers. Jumping straight to denormalizing a database is the same class of mistake as thinking a properly normalized database can’t be a problem: you’re assuming facts not in evidence based on what someone told you about their experience.
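A sketch of the view approach (column list abbreviated, display_name invented): the base tables stay normalized while callers get the flat shape.

CREATE VIEW UserProfile AS
SELECT u.user_id, u.display_name, p.phone_number, s.screen_name
FROM   [User] u
LEFT JOIN UserPhoneNumber p ON p.user_id = u.user_id
LEFT JOIN UserScreenName  s ON s.user_id = u.user_id;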

This post (and the comments) raises some very good points (kudos for that), but it falls a little short of leading people down the right path in some respects.

There are a number of other ways that the example schema could be partially denormalized (something I don’t think was made clear). While I don’t think you could have done much more in a single post, perhaps this is something that could be covered in a future one. If you’re up to it, that is: the expertise on display in some of these comments shows just how big a question you are trying to tackle. :)

I’ve read the odd article on this site now and then, but still haven’t decided whether Atwood is writing ironically or is actually incompetent.

Firstly, the sample schema is not normalised. The member_count field in the Affiliation table is a derived value (SELECT COUNT(*) FROM UserAffiliation WHERE affiliation_id = ?). The religious_views and political_views columns in the User table smell badly of multi-values. The purpose of the UserWorkHistory table is unclear, but looks unnormalised.

Secondly, the SELECT example given should use outer joins (see the sketch at the end of this comment).

Thirdly, using SELECT * is a no-no. What happens if the order of the columns changes? If you’re not using integer indices, then performance suffers. If you are using integer indices, then column order is significant. What if someone adds more columns that your code doesn’t need? You’ll still be pulling that data back.

Oh well, I’m going to add your example schema to our pre-interview exam (commonly known as the idiot filter) - and if anyone suggests your denormalised table as an improvement I’ll know to escort them out of the building uninterviewed.
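For the record, the outer-join version of a couple of those joins would look something like this (columns abbreviated), so users with no phone number or screen name still come back:

SELECT u.user_id, p.phone_number, s.screen_name
FROM   [User] u
LEFT OUTER JOIN UserPhoneNumber p ON p.user_id = u.user_id
LEFT OUTER JOIN UserScreenName  s ON s.user_id = u.user_id
WHERE  u.user_id = 10;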

You mean you aren’t using NHibernate, Jeff? Or even Linq to Sql?

In your example, if you need to get all of the data for a given user, your sql would look something like:

select * from Users where user_id = 10;
select * from UserPhoneNumbers where user_id = 10;
select * from UserScreenNames where user_id = 10;
select * from UserAffiliations where user_id = 10;
select * from Affiliations where affiliation_id in
    (select affiliation_id from UserAffiliations where user_id = 10);
select * from UserWorkHistory where user_id = 10;
select * from Affiliations where affiliation_id in
    (select affiliation_id from UserWorkHistory where user_id = 10);

and your ORM would handle populating your POCOs.

Even if you aren’t using an ORM of some kind, ADO.Net handles the simultaneous retrieval of multiple sets of records just fine.

Like Peter Becker said, Normalization is about design, denormalization is about optimization.

From experience: Normalize tables in databases (up to third normal form, but probably not more) when these will be mainly used to collect information (lots of inserts and updates).

Denormalize tables when they are going to be used for select-type queries (to feed reports, etc.), or better still, offload these to data warehouse cubes, which are optimized for analysis and reporting.
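For example (a sketch only, with invented names where they are not in the post), a reporting table can be refreshed periodically from the normalized tables so the select-heavy queries never have to join:

-- Flatten the OLTP tables into a wide reporting table.
INSERT INTO rpt_UserAffiliation (user_id, user_name, affiliation_name)
SELECT u.user_id, u.display_name, a.name
FROM   [User] u
JOIN   UserAffiliation ua ON ua.user_id = u.user_id
JOIN   Affiliation a      ON a.affiliation_id = ua.affiliation_id;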