Maybe Normalizing Isn't Normal

I can't believe I have to read all this bullplop.

You’d better go to work and denormalize your databases. All of you :smiley:

Even the “normalize it 'til it hurts, then denormalize” advice, which seems the nearest thing to a rational argument, doesn’t really get at the point.

The point of normalization is to analyze the data model in order to determine how the data all hangs together. In effect, once you have gone through normalization, you understand your data model, and have a representation that allows expressing information with minimal data; it doesn’t have redundancies that consume disk or cause update anomalies or generally require managing.

The next thing to do is not to denormalize, but rather to create the concrete schema based on the understood data model. If you have a decent query optimizer, most queries should perform perfectly well against a well-normalized concrete schema.

When exceptions arise, in the form of queries that perform badly, you need to react to them, and denormalizing isn’t the first thing to do.

  1. Easiest is to consider adding indices; if that resolves the problem cheaply, great!

  2. Second easiest is to see if the expensive query may be replaced with some cheaper query. No denormalization here.

  3. There may be DBMS-specific techniques, such as functional indexes or partial indexes, that precompute or avoid the expensive work (see the sketch after this list). These are declarative, and thus comparatively cheap, and again don’t involve denormalization.

  4. But then there is the possibility that it may be necessary to alter the data model to PRECOMPUTE something, so as to avoid having to process something (a join, say) on every query against the fully normalized data model.
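Here is the sketch promised in option 3, plus a plain index for option 1, in PostgreSQL syntax; the table and column names are invented, and the exact features available vary by DBMS:

    -- Option 1: a plain index on the column the slow query filters by
    CREATE INDEX idx_orders_customer ON orders (customer_id);

    -- Option 3: an expression ("functional") index precomputes lower(email)
    -- so that case-insensitive lookups can use the index
    CREATE INDEX idx_users_email_lower ON users (lower(email));

    -- Option 3: a partial index covers only the rows a hot query touches,
    -- keeping the index small and cheap to maintain
    CREATE INDEX idx_orders_open ON orders (customer_id)
    WHERE status = 'open';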

It is in the fourth case where we diverge from a normalized model. But that wasn’t termed denormalization; it was described as precomputing, that is, storing a precomputed result. You don’t just randomly denormalize - you look for things that are being computed that could be computed earlier and stored to save recalculating later.
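A computed column is one declarative way to store such a precomputed result; here is a rough sketch in SQL Server syntax, with invented names:

    -- PERSISTED stores the derived value on disk, so queries (and indexes)
    -- can use it without recalculating it on every read
    CREATE TABLE order_line (
        order_id    INT            NOT NULL,
        line_no     INT            NOT NULL,
        quantity    INT            NOT NULL,
        unit_price  DECIMAL(10, 2) NOT NULL,
        line_total  AS (quantity * unit_price) PERSISTED,
        CONSTRAINT pk_order_line PRIMARY KEY (order_id, line_no)
    );

    CREATE INDEX idx_order_line_total ON order_line (line_total);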

There’s some good advice here, but I can already see people citing it as a ‘we don’t need normalisation’ piece. And as I recall from my time studying normal forms (back in the early 90s), we were always taught that you would probably need to denormalise - did they stop teaching that part?

Personally, I think there are good reasons why courses drill the maxims of normalisation into students, and that’s because without learning them reflexively, the majority of designers and developers make a pig’s ear of data models, dumping data into the first entity they can find to hold it, or seeing the database as nothing more than a way of serialising objects.

A significant problem is that the relational model is strongly associated with SQL, which is rather like associating high-level programming with COBOL. SQL was simply the first relational query language, but its success unfortunately locked in some bad early design decisions. Check out Date and Darwen’s ‘The Third Manifesto’ for proposals made in the light of experience.

Later languages, and even SQL syntax revisions, have unfortunately failed to gain much traction - for instance, a lot of the code in your query could be eliminated by an RDBMS that actually understands the relationships between tables (which we currently specify in each query rather than in metadata). Some RDBMSs already support part of this in SQL itself.
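To be fair, standard SQL can already trim a little of that boilerplate when the key columns share a name, though the relationship still isn’t being read from declared metadata; a sketch with invented tables (JOIN ... USING works in PostgreSQL, MySQL and Oracle, but not SQL Server):

    -- The join condition spelled out explicitly, repeated in every query:
    SELECT u.name, p.phone_number
    FROM users u
    JOIN phone_numbers p ON p.user_id = u.user_id;

    -- USING lets the shared column name stand in for the condition:
    SELECT name, phone_number
    FROM users
    JOIN phone_numbers USING (user_id);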

Or how about a syntax where you would simply declare the user view as the user table extended by the other tables? The point is that none of these issues actually relate to the model per se.
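You can at least approximate that today with an ordinary view; a rough sketch, with invented table and column names:

    -- The "user extended by its related tables" presented as one relation
    CREATE VIEW user_details AS
    SELECT u.user_id,
           u.name,
           p.phone_number,
           s.screen_name,
           a.affiliation_name
    FROM users u
    LEFT JOIN user_phone_numbers p ON p.user_id = u.user_id
    LEFT JOIN user_screen_names  s ON s.user_id = u.user_id
    LEFT JOIN user_affiliations ua ON ua.user_id = u.user_id
    LEFT JOIN affiliations       a ON a.affiliation_id = ua.affiliation_id;

Callers then just write SELECT * FROM user_details WHERE user_id = 42, and the joins live in exactly one place.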

The second major problem is that the relational model (and SQL) was supposed to free us from thinking about the underlying physical model. We should not be concerned if a query joins 5 tables of 100 columns, or 500 tables, because we’re not supposed to know/care how it is implemented underneath.

The default response is that this is typical ivory tower theory - but recall that the same objections were raised by C developers towards OO, and by C++ developers towards managed code and the Java VM. RDBMSs have vastly improved in performance in much the same way that VM languages have.

The third problem is people using an RDBMS where they don’t have to, or where one doesn’t fit well, because they’ve gone with an off-the-shelf architecture. The equivalent mistake in the other direction is thinking that your DBMS-free architecture of choice is equally applicable to all problem spaces.

Final thought - some people reading this would be scared of joining 6 tables, yet would think nothing of having a class reference 6 other classes - is there really a difference? (What if those classes each make a d/b or file system query?)

Sure, dogmatic adherence to the principle of normalization no matter how much it hurts is foolish. And there are times when it does hurt. But normalization was invented for a reason.

In your example above, you give a simplified design with room for 2 screen names and 3 affiliations. But what if someone has 3 screen names or 4 affiliations? Where do we put the data? You’ve built in arbitrary limits.

And sure, it’s true that in the unnormalized schema, getting all the data about a single user becomes a simple select * from user. But what if we want to find all the users with a given affiliation? In the normalized schema, we write select user_id from userAffiliation where affiliation_id=‘foobar’. (Yes, if we need other data about the user besides his user id, there’s an extra join.) In the unnormalized schema, we have to write select * from user where affiliation_id_1=‘foobar’ or affiliation_id_2=‘foobar’ or affiliation_id_3=‘foobar’. We have to type everything three times, and if we want that query to run efficiently, we have to have three separate indexes.

I’m not sure if your intent was that the entire database consists of only one table, or if there are other tables for other categories of data. If there are, and if any of them link to the affiliation, then any join has to match against any of the three fields. If the other table was designed with a similar philosophy and also has three fields for affiliation, then we need 3 * 3 = 9 clauses in our where to catch every possible combination.

And if somewhere down the line somebody decides that we need to up the limit to 4, then we have to search all of our queries and make every one of them check this fourth slot in addition to the first three. If we miss one, we’ll get very subtle bugs: it will seem to work most of the time, only failing for the tiny number of users who have a fourth affiliation.
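To spell out the two query shapes, using the simplified table and column names from the example:

    -- Normalized: one predicate, and a single index on affiliation_id does the job
    SELECT user_id
    FROM userAffiliation
    WHERE affiliation_id = 'foobar';

    -- Unnormalized: every slot must be checked, and each slot needs its own
    -- index if the query is to run efficiently
    SELECT *
    FROM user
    WHERE affiliation_id_1 = 'foobar'
       OR affiliation_id_2 = 'foobar'
       OR affiliation_id_3 = 'foobar';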

I discuss the rules and when it’s good to break them in more depth in my book, A Sane Approach to Database Design. (Was that clever, the way I snuck in the subtle plug for my book? Okay, probably not all that subtle.)

Bit of a johnny-come-lately to this, but I would always go for full normalization at the data analysis/modelling stage, then take a pragmatic view at implementation. If, in the particular case, it is obvious that denormalization is the sane option, do it.

One comment I would make. If you are designing an information storage and retrieval system that is queried more often than it is updated, it might make sense to use a fully normalized schema for updates and a denormalized one for queries. When inserting, updating or deleting data, use triggers to refresh the query tables to reflect the changes. The SELECT command to query the denormalized table or tables would be far simpler than one with six joins in it, and data integrity would not be an issue.
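As a rough sketch of the idea, in T-SQL, with invented tables (user_affiliation is the normalized source, user_report is the denormalized query table holding a precomputed count):

    CREATE TRIGGER trg_sync_user_report
    ON user_affiliation
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Recompute the denormalized row for every user touched by this statement
        UPDATE r
        SET affiliation_count = (SELECT COUNT(*)
                                 FROM user_affiliation ua
                                 WHERE ua.user_id = r.user_id)
        FROM user_report r
        WHERE r.user_id IN (SELECT user_id FROM inserted
                            UNION
                            SELECT user_id FROM deleted);
    END;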

The way I see it, the goal of normalization is the creation of a logical design that gives you data integrity and expandability (in the design), and hence it does not consider query-evaluation performance. That is pushed to a lower layer in the design phase, the physical database design, which is where the volumes of data come in, and that’s where the problem is. We need a design methodology that looks at both the data and its usage before we design. Is denormalizing at that level a solution? Maybe partly.
Please comment, as I plan on researching this area.

Fascinating post, and a couple of small comments to add to the awesome list. IMHO: OLTP databases require “normalized” designs, not on principle but out of necessity. Reporting databases, on the other hand, require “denormalized” designs for performance and scalability. A single database should not have to solve both problems (performance and integrity). Understanding the usage of the database in question and applying the design it requires is the right answer; there isn’t one size that fits all…

Coding horror indeed.

Interesting, but clearly solving the wrong problem. Why are you using a relational database at all? It doesn’t fit your problem domain.

I’m sorry, but this article’s comparisons are irrelevant. Normalization and denormalization achieve different goals, and are used as such.

I will have to nod on the use of views, especially indexed views, here. Views help performance because they tell the database which queries with complex joins will be run most often, and allow the database to pre-store the execution plan (as opposed to ad hoc T-SQL). Indexing against the view will optimize it further.
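For example, an indexed view in SQL Server looks roughly like this; the table and column names are invented:

    CREATE VIEW dbo.user_with_affiliation
    WITH SCHEMABINDING
    AS
    SELECT u.user_id, u.name, ua.affiliation_id, a.affiliation_name
    FROM dbo.users u
    JOIN dbo.user_affiliation ua ON ua.user_id = u.user_id
    JOIN dbo.affiliations a ON a.affiliation_id = ua.affiliation_id;
    GO

    -- The unique clustered index is what actually materializes the view's rows
    CREATE UNIQUE CLUSTERED INDEX ix_user_with_affiliation
    ON dbo.user_with_affiliation (user_id, affiliation_id);
    GO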

If you use NHibernate, views have the additional advantage of being treated by NH as just another table, so you can map an object against a view in a much simpler manner.

Jeff,

It’s common in programmer circles to think there’s not a best way to do things, but in this case, that’s not true. The best way to store data in a relational database is to normalize it; that is, to take full advantage of its relational nature.

In your example, you do not need six joins to get all the user’s data: you choose what the pertinent data is for the situation at hand. If you do need all that data, and performance is an issue, materialized views or some other caching system are the way to go.
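In PostgreSQL, for instance, a materialized view holds the joined result and is refreshed explicitly, on whatever schedule suits the application (a SQL Server indexed view, by contrast, is maintained automatically); the names below are invented:

    CREATE MATERIALIZED VIEW user_summary AS
    SELECT u.user_id, u.name, a.affiliation_name
    FROM users u
    JOIN user_affiliations ua ON ua.user_id = u.user_id
    JOIN affiliations a ON a.affiliation_id = ua.affiliation_id;

    -- Reads hit the precomputed result:
    --   SELECT * FROM user_summary WHERE user_id = 42;

    -- The data is brought up to date explicitly, on your own schedule:
    REFRESH MATERIALIZED VIEW user_summary;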

I appreciate your post, but it’s very wrong, and harmful advice to an inexperienced programmer.

hey all you young guys - logical modelling (data model) is normalised (for understanding), physical model (database) is denormalised (for performance) - duh!

@[ICR]
Just a nitpick, I’d recommend using the term Gender over Sex

Do you sex a cat, or do you gender it?

Don’t use an incorrect term just because the politically correct crowd shy away from the very mention of a word that might also refer to a natural human activity. Fight the pussification of language!

If you want to know if someone is physically male or female, the question is sex, not gender.

http://en.wikipedia.org/wiki/Gender

^^
I don’t know what happened with the wikipedia URL, but I didn’t write the HTML tags. I just pasted the URL.

http://en.wikipedia.org/wiki/Gender

This is why this article is wrong, Jeff. This is why you’re an idiot…

In your opinion.

In my work building and maintaining content and asset management systems, I’ve seen far more problems arise from over-normalisation than under-normalisation. Far too often I’ve seen database designs that fall over because they’re so intricately normalised. They don’t fail technically, but they become a nightmare to write code against.

On the other hand, some of our most long-lasting table structures have been criminally flat. They have their limitations. They suffer some redundancy. But they’re rarely misunderstood, and always a joy to attach to your new systems, because their broad usage has spawned many valuable tools and code that work with them – and that’s far more valuable than having an unlimited number of phone numbers for each contact.

How do views speed up performance?

  1. They prevent bad joins like the one demonstrated here.
  2. You can use a materialized view which is the data denormalized for performance, but backed by normalized data for integrity.

The question is what is important: Your data or your performance?

If your limiting where clause is only on one table, you should not see a bad hit, because you will be hitting only the few select rows of each other table. You will take a hit for disk seeks if you do a poor job of laying out your physical storage, but that is a different issue.

If you care about data integrity, you need to be normalized, and you need things like foreign keys and constraints. Normalizing data can hurt performance, but so few applications are of the size to cause this issue that usually data integrity is the bigger concern.
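A tiny sketch of that declarative integrity, with invented tables (and assuming a users table already exists):

    CREATE TABLE affiliations (
        affiliation_id INT PRIMARY KEY,
        name           VARCHAR(100) NOT NULL
    );

    CREATE TABLE user_affiliations (
        user_id        INT NOT NULL,
        affiliation_id INT NOT NULL,
        PRIMARY KEY (user_id, affiliation_id),
        -- The DBMS now refuses orphaned rows; no application code required
        FOREIGN KEY (user_id)        REFERENCES users (user_id),
        FOREIGN KEY (affiliation_id) REFERENCES affiliations (affiliation_id)
    );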

On a P-200 with 128MB of RAM and a 10GB database (point of sale with automated reordering and the lot), fully normalized, the 100Mbit network card was the bottleneck. If you need more than that, you have the money to buy hardware that can support it.

The obvious tradeoff in denormalizing your data is that your data structure has lost meaning and flexibility. I’ve worked on projects which required normalizing unstructured or poorly structured data, and it’s often difficult or impossible to move in that direction. I’m surprised you’re at that point already - I’d normally think you’d look at optimization once you’ve had a site running for a while and analyzed actual usage data.

Regardless of the direction you’re moving, materialized views are very helpful - a materialized denormalized view can give you the performance benefits of denormalization without messing up your data, and a materialized normalized view over unstructured data can reduce some of the pain in dealing with User_Phone_Business_Prefix.

Have you looked at the sparse column feature in SQL Server 2008? That’s supposed to reduce storage size (and thus increase performance through increased page efficiency) when working with wide tables in which many columns may contain null values.
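For what it’s worth, sparse columns are just a column modifier (SQL Server 2008 and later); a rough sketch with invented columns:

    -- SPARSE makes NULLs in these optional columns nearly free to store,
    -- at the cost of slightly more expensive non-NULL values
    CREATE TABLE user_profile (
        user_id      INT PRIMARY KEY,
        display_name NVARCHAR(100) NOT NULL,
        fax_number   VARCHAR(20)  SPARSE NULL,
        pager_number VARCHAR(20)  SPARSE NULL,
        myspace_url  VARCHAR(200) SPARSE NULL
    );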

[I came across this rather old post, and couldn’t resist putting in my 2c-worth.]

Let me voice a dissenting opinion from a different angle…

Denormalisation is often necessary in modern SQL databases, but not because there is a fundamental problem with the concept of normalisation. It is because SQL databases don’t support one of the relational model’s core principles: separation of logical and physical models.

When Codd defined the relational model, he explicitly stated that it was a logical model of data, and that the means of storing relations in a data bank was orthogonal to the model itself. This meant, for example, that if a particular pair of relations was often joined, the DBMS would be free to store the pair as a single set of records, which would nonetheless continue to appear as two distinct relations to clients, with the ability to manipulate them independently of each other, but which would also be extremely fast to join (a NOOP, essentially).

Unfortunately, SQL databases almost universally ignored this, and insisted on maintaining a very close mapping between table rows and database records. There has been a recent trend towards column stores, which put paid to the fallacy that rows must be records, and provide just a tiny hint at the power that would be unleashed if DBMSs allowed DBAs and programmers to separate the logical and physical layers in whatever way they saw fit. Ironically, such databases don’t even pay lip service to the relational model, and instead we hear all this talk about “post-relational” databases. What a crock!

So, yes, sometimes you just have to denormalise. But don’t blame the relational model for this, and don’t go looking for the post-relational pot of gold at the end of the rainbow. Blame the SQL DBMSs that ignored Codd and left us all bickering and squabbling amongst each other over a dilemma that should never have existed.

And if you want to do something positive, ask DBMS vendors to make your databases more relational, not less.