Maybe Normalizing Isn't Normal

Mecki said:
Yes, if you need all of it or if you need all of it in a single table.
Very often neither is the case.

This is spot on.
Let’s say you wanted to run a query to find all the people that worked for a specific company:
select first_name, last_name from User u, UserWorkHistory uwh where
u.user_id = uwh.user_id AND
uwh.company_name = 'Coding Horror';
(BTW I haven’t written SQL in ages so this may be grossly incorrect; I think you get what I’m trying to say, though)

Granted, I could have just as easily done:
select first_name, last_name from User where
User.work_history1 = 'Coding Horror' OR
User.work_history2 = 'Coding Horror' OR
User.work_history3 = 'Coding Horror';

But which makes more sense?
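For reference, here is a minimal sketch of the two designs being compared. The table and column names simply mirror the queries above (on some databases you would need to quote or rename a table called User, since it can be a reserved word):

-- Normalized: work history lives in a child table, one row per job
create table User (
  user_id    int primary key,
  first_name varchar(100),
  last_name  varchar(100)
);

create table UserWorkHistory (
  user_id      int not null references User(user_id),
  company_name varchar(100) not null
);

-- Denormalized: a fixed number of work-history columns on the user row
create table User_Denormalized (
  user_id       int primary key,
  first_name    varchar(100),
  last_name     varchar(100),
  work_history1 varchar(100),
  work_history2 varchar(100),
  work_history3 varchar(100)
);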

I don’t think there’s any harm in starting off normalized and scaling at some point later. Premature optimization is evil, and for all you know stackoverflow may scale just fine under the circumstances.

I recently created a denormalized table with over two hundred columns because it was needed only to generate a PDF from a long online application form. There was no need to query the data for any other reports; the table serves exactly one purpose. However, those 200-plus columns exceeded the limit on the number of columns an SSIS export will support, so now I need to exclude that table from my backups. There is also a limit on the number of tables SSIS will export, so you could run into problems with a fully normalized design as well.

Denormalized data sounds great until you try to extend your data model, at which point you will wish to high heaven for a time machine. The purpose of a relational database is NOT to give you fast performance. If you want that, use flat files and do your own heavy lifting. The purpose of a relational database is to help maintain data integrity. If your data doesn’t need integrity all that bad, you don’t need or want a relational database. If your data does need integrity, you can’t afford to make the mistake of modeling it poorly, and that means semi-obsessive normalization. Those 6 joins are a feature, not a bug.

Three quick observations:

First: You only need the six joins if you want every single attribute of information stored about the user. My guess is that about 99% of the time, you REALLY only need maybe two tables’ worth.

Second: normalization optimizes for data entry and maintenance. If your application primarily presents data rather than doing CRUD, of course you need a denormalized view. But I’d push from a normalized DB to an optimized copy, or maybe even a cube.

Third: MEMCACHED, MEMCACHED, MEMCACHED.

Over the years I’ve developed a theory that the one area where developers reveal a consistent lack of knowledge and understanding is in DB/query design. This article and the subsequent comments reinforce this theory (and to be candid, leave me a bit apoplectic).

  1. Views: terrible idea unless it’s a SQL Server 2000/2005 indexed view. Even then, you probably need to revisit your schema and queries before resorting to this measure. If you’re on MySQL, forget that views exist and use stored procedures.

  2. Surrogate keys: that’s another rotten idea, especially if you’re concerned with performance. Also a horrible idea if you ever need to migrate the database.

  3. The query in question: dude, Dare, really? And your defense of this query is that you may want to show all the numbers a given user has, as Facebook does, for example? Respectfully, please: both you and Jeff need to learn more about databases or hire someone who does, or you are headed for pain and misery as you build Stack Overflow.

For example: say my User.User_Id is 666 (I also am the son of Satan, but that’s a different subject). There are two entries in the phone number table that have the User_Id foreign key of 666: one is my office number, 402.261.0498. Then you have my mobile, 402.525.5376. So you run your query, using my User_Id in the WHERE and you get back two rows, one for each phone number, with all of my data from the User table repeated for each. What are you going to do with this result, say on my profile page? Are you only going to use the data from the first row, except when dealing with phone number data, in which case you’ll loop through each row of the result and render each phone number?

You’re going to have to do as Ariel suggests: query the distinct user data, then query the phone numbers and other information. Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data.
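A rough sketch of that two-query approach, using the example user_id above (the UserPhoneNumber table and its columns are assumed here, not taken from any real schema):

-- 1) Fetch the single-valued user data once
select first_name, last_name
from User
where user_id = 666;

-- 2) Fetch the repeating phone-number data separately
select phone_type, phone_number
from UserPhoneNumber
where user_id = 666;

Each query returns exactly the shape of data the page needs, with nothing repeated.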

Cache the generated view pages first. Then cache the data.

http://www.pragprog.com/titles/memcd/using-memcached

You have to think about your content: very infrequently will anyone be updating it; it’s all inserts. So you don’t have to worry about normalization too much.

I wonder if one factor at play here is that it’s very easy to write queries for de-normalized data, but it’s relatively hard to write a query that scales well to a large data set.

I’ve seen numerous queries written by professionals that work well for small tables (as Jeff says, everything is fast for small values of n), but completely fail when scaled up to larger tables.

In my opinion, writing good SQL queries is dark voodoo, not least because results can vary hugely depending on the underlying DBMS.

I think you should consider whether the user interface is easier to implement against the normalized or the denormalized form.
For the example in the blog post, the user interface is much easier to implement against the denormalized form, so I would prefer that one. Of course, if you can afford to make the application perfect (roughly three times the effort), then the normalized form would be better.

The biggest performance killer is the so-called physical read.

That’s a bit scary… I’d always assumed that reads from hard drives were slow… but not so slow as to cause a several-second wait when looking up a record…

I suppose in these big databases with loads of tables all joined together the sheer number of reads creates the problem.

You learn something new everyday, huh?

Jeff,

I know you probably would rather not change the site too much. But I would love to be able to rate responses based on merit. Then we could hide the long list of comments by people who didn’t read the post or who troll around looking to bolster their web traffic with insults.

I agree wholeheartedly, by the way, with this idea. I unfortunately was taught by someone who knew what normalization was supposed to mean but no more, and by a math professor who was thrust into a software role because of his one year of consulting experience in 1994. He instructs all students that databases must be in 43NF and that all logic outside of storage must happen in the application layer. No stored procedures, no denormalization, and views are sacrilegious.

Thanks again,
Raymond

You used OR? No index for you. Next!
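For what it’s worth, whether an OR across several columns can use indexes depends on the DBMS and the indexes in place; one common workaround, sketched here against the denormalized query above, is to rewrite the OR as a UNION so each branch can seek on its own index:

select first_name, last_name from User where work_history1 = 'Coding Horror'
union
select first_name, last_name from User where work_history2 = 'Coding Horror'
union
select first_name, last_name from User where work_history3 = 'Coding Horror';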

Data integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second.

Always is a dangerous word.

Well, after reading all these comments it appears that most programmers still fail miserably at reading comprehension (or just didn’t read the article completely). How long until StackOverflow.com is ready, Jeff?

What about an approach where the data is WRITTEN into normalized tables but you have a DEnormalized table for READING from? You can have triggers update the READ table when changes are made to the normalized tables, or batch process it nightly or more frequently, checking DB activity first to make sure it’s low.
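As a hedged sketch of the nightly-batch variant (the table and column names here are made up; a trigger-based version would maintain the same READ table row by row):

-- Rebuild the denormalized READ table from the normalized tables
truncate table UserReadView;

insert into UserReadView (user_id, first_name, last_name, home_phone, work_phone)
select u.user_id,
       u.first_name,
       u.last_name,
       max(case when p.phone_type = 'home' then p.phone_number end),
       max(case when p.phone_type = 'work' then p.phone_number end)
from User u
left join UserPhoneNumber p on p.user_id = u.user_id
group by u.user_id, u.first_name, u.last_name;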

No matter what, the database is the slowest part of your application. Period. Fin. Done. Kaput. The way to get near-instant execution is to have the objects in memory and then just use the database for storage (i.e., read = from memory, write = to memory + write to DB). Then if your app restarts, it’ll only be slow while pulling things from disk, but after that it’s super fast.

It has been said well by others already, but I feel strongly enough on this issue to add in my two bits:

Denormalization is the last possible step to take, yet the first to be suggested by fools.

And, yes, I just called both Atwood and Obasanjo out as fools.

It is dangerous to make that suggestion so willy-nilly, as people take posts by prominent authors as something they can rely on. Even hinting that denormalization should be anything but a last step means that these guys simply cannot be relied upon as authorities in this space.

So, for those still feeling unsure after reading through all of the above:

Views are not about perf/scale, they are about query clarity and/or abstraction of true data source structure.

Materialized/indexed views can help perf/scale, but have limitations in terms of the underlying data structure.
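For concreteness, an indexed view in SQL Server looks roughly like this (names are illustrative); the WITH SCHEMABINDING requirement and the restrictions on what the view may reference are exactly the limitations mentioned above:

create view dbo.UserPhone
with schemabinding
as
select u.user_id, u.first_name, u.last_name, p.phone_number
from dbo.[User] u
join dbo.UserPhoneNumber p on p.user_id = u.user_id;

-- Materializing the view requires a unique clustered index
create unique clustered index IX_UserPhone
on dbo.UserPhone (user_id, phone_number);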

Optimization must always be based on freshly-collected metrics, should go roughly in this order (with the order slightly adjusted based on your particular applications and databases and the issues you have previously experienced), and should be done one at a time with fresh metrics collected between each:

  • remove indexes no longer deemed necessary
  • add indexes newly deemed necessary (a minimal sketch follows this list)
  • simplify views where possible
  • optimize queries, especially to reduce scans
  • reduce application mis-use of the database (unnecessary and repeated queries are a common culprit)
  • add/change application-tier caching of database information
  • materialize/index views where appropriate
  • optimize stored procedures
  • create generated search/reporting tables
  • investigate fundamental issues with the design of the database schema (data types, relations, etc.), but not from a “hey, let’s denormalize” perspective.
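
As a minimal example of the indexing steps above, applied to the work-history query from earlier in the comments (the index name is made up):

-- A composite index so the company lookup becomes a seek rather than a
-- full table scan, with the join column carried along in the index
create index IX_UserWorkHistory_Company
on UserWorkHistory (company_name, user_id);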

You’ll notice that denormalizing your fundamental schema isn’t even on the list. It is too tempting and dangerous to put it there. True, it may be a last resort if all of the above fail, but if all of the above fail you are almost surely having the kind of problem that everyone would love to have: your application is going gangbusters, and you can afford to throw more hardware at your current model (start by increasing spindle count, first by getting transaction logs onto their own spindle(s) and then by getting indexes onto their own spindle(s)). If that still isn’t enough, then you are into Google/Amazon/etc. territory and can get some doctorate types to start designing their own persistence model for you. They’ll probably cook up something decidedly non-relational, and with much denormalization. :wink:

But if so, then why not just up and denormalize? Simple: the other possibilities need to be exhausted first, through repeated execution of the above list of steps, until you can be certain that violating the correctness guarantees of your relational database through denormalization is worth it.

Interesting blog post on denormalization patterns:

http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html

One more benefit of normalization: a variable number of fields. For example, if you denormalize the phone number field, how many phone numbers do you pick? 5? 6? Picking too few is very limiting and picking too many is a waste. In 1990, who would have guessed that a person could have 10 phone numbers associated with them? Home Phone 1, Home Phone 2, Cell Phone, Fax, Cottage, Winter Home, etc. Normalization solves this problem.
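In schema terms the normalized version is just a child table, so a user can have one number or ten without any schema change (a minimal sketch, names assumed):

create table UserPhoneNumber (
  user_id      int not null references User(user_id),
  phone_type   varchar(20) not null,  -- 'home 1', 'cell', 'fax', 'cottage', ...
  phone_number varchar(30) not null,
  primary key (user_id, phone_type)
);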

God bless ya, Jeff. As with everything in software development, a bit of critical thinking about why something is done and when tends to go a long way…

There is a simple alternative to denormalisation here: ensure that the values for a particular user_id are physically clustered. In Oracle you would create a hash cluster and then create the five tables that include user_id in that cluster. Performance for that join would be excellent. I expect that other RDBMSs have similar structures.
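A rough sketch of what that looks like in Oracle (sizes and names are made up, and only two of the five tables are shown; users and user_phone_numbers stand in for the User tables discussed above):

-- Hash cluster keyed on user_id: rows from different tables that share a
-- user_id are stored in the same blocks, so the join needs few physical reads
create cluster user_cluster (user_id number)
  size 512 hashkeys 100000;

create table users (
  user_id    number primary key,
  first_name varchar2(100),
  last_name  varchar2(100)
) cluster user_cluster (user_id);

create table user_phone_numbers (
  user_id      number not null,
  phone_number varchar2(30)
) cluster user_cluster (user_id);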

The morals of the tale:

i) know your RDBMS
ii) database-agnostic design sucks.