Maybe Normalizing Isn't Normal


#101

Sorry to contradict Greg !

Use a multi-value (a.k.a. Pick) database and you don’t have to worry about normalization ever again. They are fast too.


#102

I have had to fix a few huge databases over the years and often the problem is a bad, but normalized, database design.


#103

I just spent a year having normalization hammered into my meager little mind as part of my database module and now you’re telling me that it mightn’t be necessary…

This is why this article is wrong, Jeff. This is why you’re an idiot, in case the first statement wasn’t clear enough. You just gave an excuse to be lazy to someone who doesn’t have a clue.

You don’t normalize because of ghosts; you normalize because it doesn’t make fucking sense otherwise. A tuple is a logical statement; denormalizing means you take the risk of having contradictory or false statements in your database in order to gain performance. (That’s why it’s recommended for read-only databases, for example: the risk of having incorrect statements is much smaller.) If premature optimization is the root of all evil, denormalization is Satan’s ugly mother. It’s right there with the compiler is wrong.


#104

I have a feeling a lot of the commentors missed your point, which means they probably just skimmed over the post and read the bolded parts. True, the six join query was an exaggerated example, but still, the point stands.

There wasn’t a call to denormalize from the beginning, people. There was a call to denormalize when bneeded/b.


#105

You just need 1 table with 2 fields:

  • id (primary key)
  • data as Xml

lol !


#106

I’ve never been a database guy, but this whole issue seems like the classic problem of optimization to me. There’s almost always that trade-off between elegant readable design and optimal execution speed.

I disagreed and waged a number of arguments that it was our database that needed attention.
We first needed to tune queries and indexes, …
Consultants were called in. They declared the db design to be just
right - that the problem must have been the application.

This is precisely why you never optimize until after you profile (find objectively where the bottlenecks are). One of my pet peeves is folks who optimize their code with absolutely no evidence that there’s an execution speed problem there, or even anywhere.


#107

Ah, and you should read the comments to Dare Obasanjo’s article: there’s no six-way join; there’s a three-way join at the worst.


#108

Great point. At my previous employer, they ran into scalability issues, and denormalizing their data never entered their mind (as far as I know). Another great way to speed things up is to do more processing outside of the database. Instead of doing 6 joins in the database, have a good caching plan and do some simple joining in your application.

My article explains it better:
http://www.ytechie.com/2008/05/stored-procedure-reporting-scalability.html


#109

I’ve been working with RDBMS’s since about 1986, and currently support a very large complex (normalized) database.

I have to disagree that normalized databases can’t perform. Or that you should assume they cannot. This is just plain not true and a bad idea. You normalize, then you test, and then you refactor those places that need to perform faster. Never waste any time trying to design around performance problems that ‘may’ happen, only those that you can actually measure. Anything else is a waste of everyone’s time.

Object-Relational database technology doesn’t solve anything, the relationships between objects still have to be built into a persistence structure. All it does is hide the complexity some (which can be a good thing). Just don’t assume that because the object hides the complexity that the actual physical storage is simpler as well.

Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don’t have have to keep pulling stuff apart and putting back in new configurations of the db, update synchronization code etc. You can concentrate on the presentation and persistence APIs, and gradually refactor a much smaller set of db changes than you would if you had denormalized everything.

I totally agree with Greg, correctness comes first.


#110

I think file system performance is often overlooked when examining performance problems in medium/small db’s. Denormalization will help when you have an i/o bottleneck - to a point. However, I personally find it faster and more cost effective to upgrade my i/o subsystem than to refactor my db and app for a denormalized schema.


#111

Jeff, be careful when giving advice like this… De-normalizing is something that should only be done by people who ALREADY UNDERSTAND when to use it and how. I fear you’re giving it the air of being trendy and thus confusing the less experienced developers out there. :stuck_out_tongue:


#112

I’ve been working with RDBMS’s since about 1986, and currently support a very large complex (normalized) database.

I have to disagree that normalized databases can’t perform. Or that you should assume they cannot. This is just plain not true and a bad idea. You normalize, then you test, and then you refactor those places that need to perform faster. Never waste any time trying to design around performance problems that ‘may’ happen, only those that you can actually measure. Anything else is a waste of everyone’s time.

Object-Relational database technology doesn’t solve anything, the relationships between objects still have to be built into a persistence structure. All it does is hide the complexity some (which can be a good thing). Just don’t assume that because the object hides the complexity that the actual physical storage is simpler as well.

Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don’t have have to keep pulling stuff apart and putting back in new configurations of the db, update synchronization code etc. You can concentrate on the presentation and persistence APIs, and gradually refactor a much smaller set of db changes than you would if you had denormalized everything.

I totally agree with Greg, correctness comes first.


#113

To everyone bashing Jeff on his database design skills - he didn’t come up with that design. He clearly pointed out that the design was from Dare Obasanjo’s blog post on this topic.


#114

Some things are too slow in a normalized database. Facts are Facts. You just rout around this:

  • Keep a de-normalized replica for expensive operations (e.g. reports)
  • Cache Results for repeat queries (Memcached)
  • Partition the database for scalability (vertical or horizontal)

#115

@Julio: Hibernate, at least the Java version, actually has a concept called join fetching where it will join a related collection instead of doing a second select. So, just for the record, an ORM wouldn’t necessarily use all separate selects.


#116

Speaking from long experience, if you don’t normalize, you will have duplicates. If you don’t have data constraints, you will have invalid data. If you don’t have database relational integrity, you will have orphan child records, etc. Everybody says we rely on the application to maintain that, and it never, never does.

The question is, what’s the cost of the data integrity problems? If it’s something you can live with for performance, fine. If it’s a disaster waiting to happen, you better build your database design as tight as possible.


#117

This post seems to suggest throwing normalisation away, saying that our professors are all wrong and the real way to design databases is to just do what works, as that’s how it works in the real world.

What a horrible, cowboy attitude to DB design. I hope you don’t design any real databases.

Data integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second. The CFO / CEO might be concerned if their database is running slowly, but they will have your neck if the data inside the database is garbage. Properly normalised databases are important because they enforce data integrity, rather than relying on application developers to enforce it.

You even said this yourself, though the point seems to have been lost on you when you said it. And all those pesky data integrity problems the database used to enforce for you? Those are all your job now.

No… just, no.

Relying on the application to enforce data integrity is the surest path to hell that I know of. That is the whole freaking point of a database. Hundreds of thousands of hours of development time and testing has gone into each database product to guarantee this level of service. There are thousands of features. And you want to replace all of that time and testing with some hand rolled garbage that has received maybe 100 hours of testing if you’re really lucky? What an absurd idea. What a horrible anti-pattern.

Denormalisation is an optimisation over the normalised data. The DB should ALWAYS be normalised with denormalised components added later as scalability issues need to be addressed. The normalised part of the database is still the authoritative source of data, and the denormalised part simply aids in query execution. You can always rebuild the denormalised data from the normalised data which has had all of the data integrity rules enforced on it.

Come on. I know it’s easy to hate on professors for not being in the real world and not having frontline experience, I hear this wanky sort of rhetoric all the time, but you might be surprised to know that they do sometimes know a thing or two.


#118

BuggyFunBunny,

I was wondering how far down the comments it would be before we’d see a pontification from the anti-knucklehead, database crusader.

Best thing about your comments are they provide a us with a beautiful reference point for one (extreme) end of the application design continuum. At one end we have persistence ignorant application design, and at the other database is god.

Gee, I wonder which end you belong to? You are an extremist and as such your views need to be treated as such.


#119

Several people already hit on what I was going to say, but here goes anyway…

What is the purpose of the database? Is it to process transactions or report findings? I always take the route that I design the data model using the best practices approach and considering my requirements.

Now, I double check requirements to make sure the database will behave correctly. If 99% of what I am doing will be reporting off the tables, I will either create Materialized Views (Oracle) or Materialized Query Tables (DB2) that basically denormalized the data for quick reporting. This can be a problem if you need up to the minute dynamic reports since the MV/MQT’s need to be refreshed periodically. Currently, we are on a 2 week refresh, so this approach works well for us. The MV/MQT’s carry some resources, but memory and storage is cheap!

Now, if it is a quick app for logging or something, I will just create a flat/wide table and put everything in it. After all, this log is for me and my team… no reason to confuse the issue with 5 tables to hold each piece of the log information… Plus, this data gets cleaned out every so often and I am not concerned about instant access. Just index the table by date, and order by in the sql…

If I am going to be constantly hitting the data with updates/inserts/deletes (like in order processing or something) the normalized structure works very well… and I don’t carry the overhead of all the duplicate data associated with MV/MQT…

So, I don’t think you can make any blanket statements on normal vs. non-normal form. Like everything else in programming, it all depends on requirements and intended goals.

my $0.02


#120

Sergej, great comment.

Jeff, I remember from one of the early podcasts when you were describing the box you were deploying on that it was just a single machine, running the database and the web server, is that correct? That immediately struck me as the first thing that will have to change when you go live. It is astounding the performance difference you get by putting the database on its own box and clicking the little box that basically gives the machine to Sql Server.

There is a lot of idealism in the comments above. Lots of alternative databases, OODBMS’s, LINQ. Software developers in general have a very poor understanding of how databases work, how to tune them, how to design them. I wonder how many billions of dollars have been spent on servers in the last ten years that could have been saved by a simple index.

The design of any database that expects millions of rows in any table should be handled by an experienced DBA/Programmer. All access should be controlled via stored procedures. Never, ever, ever let a software tool or a (non-dba) software developer write your queries for you. db_reader and db_writer should not be privileges that your applications enjoy.

If you look at a query that has 30 joins and immediately think it is horrible, please analyze and tune the query first. Only the joins that require a big scan matter. Look at the plan and if you see 29 joins at 0% and 1 join at 100% of execution time, then only one of the joins is bad. Go ahead and add ten more if you need to. De-normalization is for reporting, not for OLTP.

Here’s an excellent book for really digging into how Sql Server processes queries: http://www.amazon.com/Inside-Microsoft-SQL-Server-2005/dp/0735623139 (T-Sql Querying by Ben-Gan).

Jeff, one of your comments struck me as worth a little more thought: Disks and memory are cheap and getting cheaper every nanosecond. Not everyone can afford a SAN, and if you’re running everything on one box, it’s not about how cheap the disks are, it’s about how many you can cram into the machine. You want the fastest disks you can get, which means you are limited in size, and you want RAID, which eats up more space. In a 1U machine, you get maybe 2x136Gb 15k drives, double that in a 2U. And even that leaves everything running on the same set of spindles. So you move up to NAS or DAS, but you are still limited, because if you just fill it up with slow 1Tb drives the IO will kill you. Data size matters.