Maybe Normalizing Isn't Normal

@Jeff Atwood
@As the old adage goes, normalize until it hurts, denormalize until it works.

ha. old adage. what a crock.

@James Pearce
@While I agree that normalization is the right way to teach database design as a principle (RE: SomeGuy), it would be nice to see optimization issues mentioned more often when talking about best design principles in Computer Science.

See, what you call optimization issues, I call good design. There is a difference between negligent design and avoiding premature optimization.

Seriously, how many of the commenters here have actually worked on a system with large amounts of data? A recent system of mine inserted around two billion rows per day into its larger tables. Indexes, partitions/sub-partitions, and summary tables are a given, but we also saved a small percentage of the data into duplicate tables during the initial loading process in order to retain it longer than it is kept in the main tables. Delete statements are too slow, so we drop partitions by date as the data ages out. We also don’t allow foreign key constraints on the large tables (though of course we list them on the logical data model), because we can’t afford the performance loss during inserts. I’m always interested to hear how others have dealt with performance issues; I just hope all of you out there bashing Jeff for this post have dealt with large enough systems to have a basis to comment…
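To make that partition-ageing pattern concrete, here is a minimal sketch, assuming Oracle-style range partitioning and hypothetical table and partition names (the commenter doesn’t say which platform they use):

-- Hypothetical fact table, range-partitioned by date (Oracle-style DDL).
CREATE TABLE event_log (
  event_id   NUMBER NOT NULL,
  event_date DATE   NOT NULL,
  payload    VARCHAR2(4000)
)
PARTITION BY RANGE (event_date) (
  PARTITION p_2008_07 VALUES LESS THAN (TO_DATE('2008-08-01', 'YYYY-MM-DD')),
  PARTITION p_2008_08 VALUES LESS THAN (TO_DATE('2008-09-01', 'YYYY-MM-DD'))
);

-- Ageing data out: dropping a whole partition is a metadata operation, far
-- cheaper than DELETE FROM event_log WHERE event_date < some cutoff.
ALTER TABLE event_log DROP PARTITION p_2008_07;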

Hard drive space is cheap; programmer time and running time are not.

If you’re only normalizing to avoid blank entries in a table (like the optional phone number table you’re using), there’s no point in normalizing. Really. Just let the table have a blank entry in it. How many entries do you think it will take to make a difference when hard drives are measured in TB now?

Normalizing should mainly be used to avoid duplicate data. If you are building a social networking site and you have two duplicate entries to make two people friends, then you are probably doing it wrong. Duplicate records make both the code and the day-to-day running of the system a mess, and should be avoided.
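To put the two layouts side by side, here is a minimal sketch with hypothetical table and column names: a nullable column in the user table versus a separate child table for the optional phone number.

-- Denormalized: the optional phone number is just a nullable column.
CREATE TABLE user_account (
  user_id      INT          PRIMARY KEY,
  email        VARCHAR(255) NOT NULL,
  phone_number VARCHAR(32)  NULL   -- simply blank when the user has none
);

-- Normalized alternative: a child table you must join to, even for one value.
CREATE TABLE user_phone_number (
  user_id      INT         NOT NULL REFERENCES user_account (user_id),
  phone_number VARCHAR(32) NOT NULL
);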

Let’s face it, denormalizing and normalizing are necessary evils in DB design. Looking at the first tables given: unless the client requirements allow a user to have an unlimited number of phone numbers and/or an unlimited number of usernames, you could very easily NOT normalize those two attributes into separate tables. Thus you limit your joins and increase performance.

Overall it seems that only people who haven’t really worked on databases, who are stronger in theory than in practice, beat a database to death with over-normalization. The argument could also be made that normalization increases performance, if you know your database well enough to pull only what you need and not everything and the kitchen sink.

An example would be the affiliation information: you wouldn’t need to join all the tables in that example if you are pulling just user information; you would only need to join the user affiliation table if a user is allowed multiple affiliations. Even better, if he/she isn’t allowed more than one, then you can put that in the user table too! :wink: My point is very simple: if the data follows the business requirements and the developer using the database understands the structure properly, then normalization and denormalization both come into play on a level playing field, and you should end up with a pretty decent mix of structured data and performance.
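A sketch of that trade-off, with hypothetical table and column names standing in for the post’s example: if the business rule is at most one affiliation per user, the affiliation key can live on the user row; only a many-affiliations rule forces the junction table and the extra join.

-- At most one affiliation: store affiliation_id directly on the user row.
SELECT u.user_id, u.screen_name, a.affiliation_name
FROM user_account u
LEFT JOIN affiliation a ON a.affiliation_id = u.affiliation_id
WHERE u.user_id = @user_id;

-- Many affiliations allowed: only then do you need the junction table.
SELECT a.affiliation_name
FROM user_affiliation ua
JOIN affiliation a ON a.affiliation_id = ua.affiliation_id
WHERE ua.user_id = @user_id;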

But hey, what do I know, I’m just a guy who reads blogs.

Great Post dude, love the discussion that it brought up.

Well, I haven’t touched a DB since I changed jobs last year, but this is one good reason why a DB designer should also be writing queries and optimising them. That way he has to feel the pain too.

To my mind, if your normalised table is ONLY EVER queried in a join to another table, that’s a good enough reason to denormalise.

And of course, optimisation depends on usage. If you’re updating a table as often as you’re reading it, that’s a completely different scenario to the typical write-slow, read-fast environment.

You need a normalized logical schema. This is what the DBMS should provide to the application. The DBMS should denormalize as necessary to optimize performance WITHOUT exposing those denormalizations to the application.

The problem is that DBMSes don’t do this. Sure, they give you indices and caching and materialized views, but they don’t go all the way.

Oracle has had a feature (since 8i IIRC) called query rewrite, where you can write your query against the normalised schema, and Oracle will figure out that the same query can be performed more efficiently against materialized views (if they exist) and hit those for the data instead.
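A sketch of that Oracle feature, with hypothetical view, table, and column names: a materialized view flattens the join once, and ENABLE QUERY REWRITE lets the optimizer answer queries written against the normalized tables from the view instead.

-- Materialized view over the normalized tables (hypothetical names).
CREATE MATERIALIZED VIEW mv_user_affiliation
  ENABLE QUERY REWRITE
AS
SELECT u.user_id, u.screen_name, a.affiliation_name
FROM user_account u
JOIN user_affiliation ua ON ua.user_id = u.user_id
JOIN affiliation a ON a.affiliation_id = ua.affiliation_id;

-- With query rewrite enabled, a query written against the base tables can be
-- transparently satisfied from mv_user_affiliation.
ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;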

Holy cow, just about the flame war I was expecting.

For the normalization crew: I have never seen (and I expect I never will see) a fully normalized database. Consider that the User table in the example above is not fully normalized. You have first_name and last_name.

A fully normalized database would not include first_name except as a pointer to another table. Fully normalized tables can’t contain repeated information, and you’ll have several people named James. Same thing with last_name: multiple Smiths. For that matter, you can’t have phone_number; the area codes and prefixes repeat. Zip codes are a five-digit primary plus a four-digit extension. Any information which itself represents a join has to be broken up. Even BCNF isn’t fully normalized. Do you do that? No? Okay, then you’re not operating on a normalized database. Get over it already.
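Taking that reductio literally, with hypothetical names, the name itself would become a key into its own lookup table (which is exactly why nobody designs this way):

-- Every distinct first name stored exactly once...
CREATE TABLE first_name (
  first_name_id INT         PRIMARY KEY,
  first_name    VARCHAR(64) NOT NULL UNIQUE
);

-- ...and the user row only points at it.
CREATE TABLE app_user (
  user_id       INT PRIMARY KEY,
  first_name_id INT NOT NULL REFERENCES first_name (first_name_id)
);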

@BuggyFunBunny
@The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.

No, it’s not. It’s only that here on our good ol’ terra firma. If you’re building an aeroplane that may perhaps be used in the future on some other planet, you ought to take that into account when designing the beast. The flip side is that, most likely, you’re not going to ship your plane off-planet. Likewise, the degree of normalization with which you feel comfortable is (or rather, should be) contextual, not absolute.

From a higher viewpoint, complete data normalization is a theoretical construct. Reality gets in the way of theory. Even if you went to the trouble of building out your tables fully normalized, you’re relying on the false assumption that the data itself is presented in a canonical fashion; it’s not. Jim, James, Jimmy might all be James, or they might not be. The user- or institutionally-provided data that populates your database is coming in without consistent presentation.

All that aside, the “Och, Jeff, be careful what you blog about!” commentary is well advised to be heard. Messing with your data structure is something that should be undertaken with supreme care. In practice it is several orders of magnitude easier to de-normalize a database than it is to normalize one that is a giant mess. The undo button here is buried under a mountain of pain. Be forewarned, here be dragons.

  1. The example was poor

  2. Even if the example was good, you’ve provided no context for optimisation via denormalisation.

What your post seems to be suggesting is that optimisation (specifically denormalisation in this case) is something that takes place in design. To me, optimisation is a consideration of circumstance. What value is there in attempting to second guess those circumstances? None. Particularly in the case of denormalisation, since if - god forbid - your assumptions were wrong, reverting back from denormalised to normalised is so costly.

Basically you set no scene, gave no re-design requirements - and as a result you are sending out the wrong messages to impressionable developers.

A final point: who is this post aimed at? With all due respect, anyone responsible for optimising a database to this degree will not be looking to you for advice.

I think this post is misleading at best. Sorry Jeff.

As long as each of your joins is using a proper index, the query optimizer will take care of the performance. Unless you plan on having millions of users, stick with the normalized design…
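For instance (hypothetical index names, assuming the post’s child tables): it is the foreign-key columns used in those joins that need indexing.

CREATE INDEX ix_user_phone_number_user_id ON user_phone_number (user_id);
CREATE INDEX ix_user_screen_name_user_id  ON user_screen_name (user_id);
CREATE INDEX ix_user_affiliation_user_id  ON user_affiliation (user_id);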

Never say never. I’ve done my share of denormalizing here and there - but never from the get-go.

I think the main thing here is: there probably is a situation where you must architect with denormalization from the start, because this is version 25 and all previous versions blew up the database. I think it’s a fair guess to say that most of us will probably never work on something so huge that we have to start with denormalizing in mind.

FWIW - once we hit a scalability wall, we ended up dropping a foreign key to hit the client’s performance metric (they hit us with an automated perf tool), and that table has lived without a foreign key for 8 years since. It always bothers me.
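The change being described is presumably along the lines of this sketch (hypothetical constraint and table names, SQL Server/Oracle-style syntax):

-- The constraint that was costing too much on insert...
ALTER TABLE order_detail
  ADD CONSTRAINT fk_order_detail_order
  FOREIGN KEY (order_id) REFERENCES orders (order_id);

-- ...and the one-line change that removed it.
ALTER TABLE order_detail DROP CONSTRAINT fk_order_detail_order;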

Entry has been up for a week and still no post from --CELKO–.

For those who are interested in what real database folk have to say about it (beware: they are Brits, and SQLServer zealots to boot):

http://www.simple-talk.com/community/blogs/tony_davis/archive/2008/07/21/63094.aspx

This seems like an argument between stereotype application developers and stereotype database administrators.

The stereotype system administrator would point out that social networking is store-and-forward and point-to-point. Get those parts of your design wrong, and that will be your bottleneck, folks.

For what it is worth, I would err on the side of over-normalization, whatever that means.

http://www.simple-talk.com/community/blogs/tony_davis/archive/2008/07/21/63094.aspx

Really not much content there, and what there was is suspect. Normalization can’t produce anything that is logically correct, because normalization is all about form, not semantics. Normalization produces something in the right form, and that form has certain attributes. Logically the model could be gibberish, because the semantics aren’t addressed at all.

Are you joking? Normalizing is only a bit slower?
1. The main table should contain only the user id, username, and password.
This is to ensure a fast login. (See the sketch after this list.)
2. The sub tables should link back to the main table. Since MySQL, for example, doesn’t automatically enforce foreign keys, your query gets a bit longer, but you don’t have to search through unnecessary values.
3. Denormalization is quite dangerous:
3.1. A user can go in through the back door and change the data without respecting the relationships between tables.
3.2. Querying the data and joining it back together takes longer. I’m not talking about 15 tables per join with a huge amount of data. Should we really be querying denormalized data?
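A sketch of the layout item 1 describes, with hypothetical table and column names: a narrow credentials table on the login path, with everything else pushed out to side tables.

CREATE TABLE user_login (
  user_id       INT          PRIMARY KEY,
  username      VARCHAR(64)  NOT NULL UNIQUE,
  password_hash VARCHAR(255) NOT NULL
);

CREATE TABLE user_profile (
  user_id    INT PRIMARY KEY REFERENCES user_login (user_id),
  first_name VARCHAR(64),
  last_name  VARCHAR(64)
);

-- The login query touches one narrow table and one unique index.
SELECT user_id, password_hash FROM user_login WHERE username = @username;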

…this isn’t intended as a real query…

Good. Because it won’t be taken as a real argument, either.

…queries are now blindingly simple…

Oh, really?
select * from User where affiliation_1 = @affiliation or affiliation_2 = @affiliation or affiliation_3 = @affiliation
If you offer a checkbox or select list for input into a denormalized column group, you cannot loop over an inner recordset; you have to:
affiliations.items[rs["affiliation_1"]].selected = true;
affiliations.items[rs["affiliation_2"]].selected = true;
affiliations.items[rs["affiliation_3"]].selected = true;

And where are you storing the fourth affiliation? There will be one, BTW. There is always at least one record that exceeds your expected ceiling, now or in the future.
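For contrast, a sketch of the normalized versions of those two operations, with hypothetical table names standing in for the post’s User/UserAffiliation: the UI code simply loops over the child rows, and the filter needs no OR chain.

-- Populate the checkbox list: one row per affiliation, no ceiling of three.
SELECT affiliation_id
FROM user_affiliation
WHERE user_id = @user_id;

-- Filter users by affiliation: a single indexable predicate instead of
-- OR-ing across affiliation_1 / affiliation_2 / affiliation_3.
SELECT u.*
FROM user_account u
JOIN user_affiliation ua ON ua.user_id = u.user_id
WHERE ua.affiliation_id = @affiliation;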

…and probably blindingly fast, as well.

As long as you don’t have to filter based on the denormalized column groups (which you will), or store them in additional overflow rows a year down the road when you realize that you need more fields in the group but your code is already committed to your denormalized design. Two-dimensional querying ruins performance.

Despite copious evidence that normalization rarely scales…

SO copious, in fact, that you weren’t able to cite even one of them.

Your quote is a complete non sequitur, BTW. Did indexes solve the problem, or not?

Never, never should you normalize a database out of some vague sense of duty to the ghosts of Boyce-Codd.

No, you should do it because it makes sense for queries and maintainable code.

Disks and memory are cheap and getting cheaper every nanosecond.

Ah. The argument of someone who thinks normalization is about space. You should read “Normalized data is for sissies”, or any of the thousands of other naive blog and newsgroup posts over the years by people who also didn’t know what they were talking about (not just the ones you cite), but be sure to read the comments below them.

Measure performance on your system and decide for yourself what works, free of predispositions and bias.

And foresight of maintainability, apparently.