Maybe Normalizing Isn't Normal

One of the items we're struggling with now on Stack Overflow is how to maintain near-instantaneous performance levels in a relational database as the amount of data increases. More specifically, how to scale our tagging system. Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/07/maybe-normalizing-isnt-normal.html

Just to add that a materialized view involves caching of the data.

View data isn’t cached; the execution plans are, however.

Personally I think anyone who encourages denormalization to the extent this article does hasn’t worked with databases enough. Denormalization should always be an optimization, not a design choice, and a denormalized database is not the same as a database that was never normalized to begin with.

In many cases where you think denormalization is necessary, I think materialized / indexed views will suffice.
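For instance, a SQL Server indexed view over a Stack Overflow-style tagging schema – the table and column names here are my guesses, not anything from the post:

```sql
-- Hypothetical schema: dbo.Tags(Id, Name) and dbo.PostTags(PostId, TagId).
-- SCHEMABINDING and two-part names are required for indexed views.
CREATE VIEW dbo.PostTagNames
WITH SCHEMABINDING
AS
SELECT pt.PostId, t.Name AS TagName
FROM dbo.PostTags pt
JOIN dbo.Tags t ON t.Id = pt.TagId;
GO

-- This unique clustered index is what actually materializes the view to
-- disk; from then on reads hit precomputed rows instead of re-running
-- the join, while the base tables stay normalized.
CREATE UNIQUE CLUSTERED INDEX IX_PostTagNames
ON dbo.PostTagNames (PostId, TagName);
```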

You changed the rules.

The normalized model allows any number of phone numbers, screen names, affiliations, and work-history entries per user. It also does not care about the order of those items.

The de-normalized model caps the number of each of the items listed above. It also forces an order on them.

Only one of them represents the business. Which is it?
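To make the contrast concrete with just the phone numbers (table and column names are illustrative, not from the original post):

```sql
-- Normalized: any number of phone numbers per user, no implied order.
-- (Hypothetical names throughout.)
CREATE TABLE user_phone_number (
    user_id      INT NOT NULL,           -- references the users table
    phone_number VARCHAR(20) NOT NULL,
    PRIMARY KEY (user_id, phone_number)
);

-- De-normalized: the column list caps the count at three and bakes an
-- ordering (1, 2, 3) into the schema itself.
CREATE TABLE user_flat (
    id      INT PRIMARY KEY,
    phone_1 VARCHAR(20),
    phone_2 VARCHAR(20),
    phone_3 VARCHAR(20)
);
```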

I didn’t realise that denormalisation was controversial. I was taught to do it in university.

If you are mostly doing dirty reads and do not particularly care whether the data is 100% consistent at all times, then you gain little by doing a lot of normalization. In fact, for read-heavy loads, you should denormalize as much as it makes sense to.

Most web applications are of this type and do not necessarily even require a relational database. We are just conditioned to throw a database in there every time without questioning it. That is pretty high on the brainless zombie scale.

great post. simple is beautiful…

Denormalized data makes a mess of the application code: you end up having to do all kinds of funny work to make sure that their affiliation is added to the right spot (affiliation_1? affiliation_2?), and if you remove it, then you end up backtracking and rectifying how your assignments work. And that’s assuming that you don’t accidentally run into a case where someone has 4 affiliations.
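For example (hypothetical table and column names, continuing the user_flat sketch above): removing a user’s middle affiliation in the denormalized design means shifting the survivors up a slot by hand, while in the normalized design it’s a plain DELETE:

```sql
-- Denormalized (hypothetical user_flat table): removing the value in
-- affiliation_2 means shifting affiliation_3 up so the slots stay packed.
UPDATE user_flat
SET affiliation_2 = affiliation_3,
    affiliation_3 = NULL
WHERE id = 42;

-- Normalized: the same removal is one row delete, and a fourth
-- affiliation is just one more row rather than a schema change.
DELETE FROM user_affiliation
WHERE user_id = 42 AND affiliation = 'ACM';
```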

This whole concept is physically painful for me to even consider.

Before you start denormalizing your database, you should talk to a DBA and see what solutions they might have. And step #2 would be to be smarter about pulling back data than slurping back the entire database via your ORM.

Between those two options, you should be able to get the DB performance you’re looking for without hemorrhaging functionality and extensibility.

I just spent a year having normalization hammered into my meager little mind as part of my database module and now you’re telling me that it mightn’t be necessary…

love your blog Jeff but i hate it when you do this to me. :frowning:

How about using materialized query tables or materialized views?

@Raymond Berg: I agree but I don’t think Google can resolve a link in the form of http://www.codinghorror.com/mtype/mt-comments-renamed.cgi?__mode=redid=xxxx to an actual site.

My approach is to design it all normalized at first, as it makes the design cleaner and usually easier to grok, and then denormalize where absolutely needed. As long as the system is performing well, I would prefer to keep my data normalized and then create denormalized views so I can get those huge joins out of my way.
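Something like this – the base tables stay normalized and remain the source of truth, while the view hides the join from day-to-day queries (names again illustrative):

```sql
-- A plain view presents a pre-joined, "denormalized" shape to readers
-- without duplicating any data. (Hypothetical table names.)
CREATE VIEW user_summary AS
SELECT u.id,
       u.name,
       p.phone_number,
       a.affiliation
FROM user_account u
LEFT JOIN user_phone_number p ON p.user_id = u.id
LEFT JOIN user_affiliation  a ON a.user_id = u.id;
```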

When it comes to teaching, normalized tables are the way to go. People must master the rules before they break them (and at that point nobody will have to teach them that denormalization is OK; they will have figured that out already).

Quite a horrible job naming the tables. Why the underscores in field names but not in table names? Why user_id but not user_sex? Why table names in singular? Of course this is beside the point, but lol… :smiley:

Jeff, IMO, more important than denormalization is how you design your indexes and what type of hardware you’re running on. IO is the big killer, not so much normalization. Denormalizing a table would be about the last thing I would try to boost performance. SQL Server should be able to shred those queries, even with 20 million plus records in each table.
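As a concrete example, a covering index on a hypothetical post-tag join table lets the engine answer a tag lookup from the index pages alone, which is exactly the IO saving in question:

```sql
-- With both columns in the index key, the lookup below never has to
-- touch the base table's data pages at all. (Hypothetical names.)
CREATE INDEX IX_PostTags_TagId_PostId
ON PostTags (TagId, PostId);

SELECT PostId
FROM PostTags
WHERE TagId = 7;
```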

Normalization is about design, denormalization is about optimization. They should be applied as such, i.e. anything should be normalized until denormalization is needed. As Knuth (or was it Hoare?) said: Premature optimization is the root of all evil.

And from my short glimpse at the article about the performance of tag systems, it seems they tested only on MySQL. Not exactly a database one associates with good performance on complex queries – it’s extremely fast on flat data, but to my (not entirely up-to-date) knowledge it never managed to catch up with Postgres or the commercial RDBMSs on anything reasonably complex. Not sure if that applies to single joins already, but I wouldn’t be too surprised.

Normalization might indeed not be the fastest design, but it can usually be made fast enough without denormalization, as Dare Obasanjo states in his original article, too:

> Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching.

In my LONG experience of both DB and application design, it is a bad (or rushed) application design that usually forces denormalization on a well designed database.

Several small queries over normalized data also usually outstrip one big one over denormalized data.
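Sketching that with the illustrative user schema from earlier in the thread: joining several one-to-many tables at once also multiplies rows, which the split queries avoid entirely:

```sql
-- One big join: each extra one-to-many table multiplies the result set
-- (3 phones x 2 affiliations = 6 rows to de-duplicate client-side).
SELECT u.name, p.phone_number, a.affiliation
FROM user_account u
LEFT JOIN user_phone_number p ON p.user_id = u.id
LEFT JOIN user_affiliation  a ON a.user_id = u.id
WHERE u.id = 42;

-- Several small queries: three indexed lookups, each returning exactly
-- the rows it names, stitched together by the application.
SELECT name         FROM user_account      WHERE id = 42;
SELECT phone_number FROM user_phone_number WHERE user_id = 42;
SELECT affiliation  FROM user_affiliation  WHERE user_id = 42;
```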

But, denormalization still has a home in large data warehouses.


Let’s unroll our loops next.

Just a nitpick: I’d recommend using the term Gender over Sex. Someone might identify with a gender different from their sex, and that’s the one they’re going to want to tell you.
(Obviously it’s sample data, but just for the future :slight_smile:)

It did annoy me slightly always having normalization drummed into me without any mention that it’s not always best, but I guess SomeGuy is right.