Maybe Normalizing Isn't Normal

@Niyaz: Materialized views (or indexed views, depending on your RDBMS) are in fact tables, but created and maintained by your RDBMS. So a materialized view will act exactly like a de-normalized table would - except you keep your original normalized structure, and any change to the original data will propagate to the view automatically.
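A rough sketch of what that looks like as a SQL Server indexed view (the Users and UserPhoneNumbers table names here are illustrative, not taken from the post; other databases use CREATE MATERIALIZED VIEW instead):

    -- Sketch only: illustrative table names, SQL Server indexed-view syntax.
    CREATE VIEW dbo.UserContactInfo
    WITH SCHEMABINDING
    AS
        SELECT u.user_id, u.user_name, p.phone_number
        FROM dbo.Users u
        JOIN dbo.UserPhoneNumbers p ON p.user_id = u.user_id;
    GO

    -- The unique clustered index is what materializes the view to disk;
    -- from then on the engine keeps it in sync as the base tables change,
    -- so reads hit the pre-joined data while writes still go to the
    -- normalized tables.
    CREATE UNIQUE CLUSTERED INDEX IX_UserContactInfo
        ON dbo.UserContactInfo (user_id, phone_number);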

sorta on topic: What’d you use to make the diagrams? I dig

table names are plural, column names are singular.

According to Codd and Date table names should be singular, but what did they know.

I’m a little lost here.

I’d say the example is not even normalized. I mean, User Phone Number as well as User Screen Name should be merged into the User table, and they will still be in BCNF. Then you keep the rest of the tables as they are…

Why do you take Phone Number and Screen Name out of the picture? It’s not as if you may have several Phone Numbers for one user with that design… and if the user does not have any phone number then, you know, NULL values may help.

and at that point nobody will have to teach them that denormalization is OK, they will have figured that out already
Erm… Isn’t the entire point of this post that some people haven’t?

Unless you are doing a lot more reading than writing

Aren’t most – perhaps in fact all – websites in this class of application? Millions of reads, handful of writes.

you should denormalize as much as it makes sense to do so for read-heavy loads. Most web applications are of this type and do not even require a regular database necessarily

My point exactly!

View data isn’t cached; however, the execution plans are.

That buys you virtually nothing. Caching of execution plans is pretty much automatic for all modern SQL databases these days – even for raw SQL. No stored procs or even parameterization of the query is required!

http://msdn.microsoft.com/en-us/library/ms181055.aspx

When any SQL statement is executed in SQL Server 2005, the relational engine first looks through the procedure cache to verify that an existing execution plan for the same SQL statement exists. SQL Server 2005 reuses any existing plan it finds, saving the overhead of recompiling the SQL statement. If no existing execution plan exists, SQL Server 2005 generates a new execution plan for the query.

I always learnt the last rule of normalization is to de-normalize.

Also I believe large organizations use a de-normalized form for selects and a normalized form for updates and inserts, using replicated servers.

Have a look at the Scaling Databases section of http://msdn.microsoft.com/en-us/magazine/cc500561.aspx

@Ariel
Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that.
The syntax may be incorrect (I’m not sure you’re right, though), but I assume the idea is that you might want a complete list of all users and their related data. Can you give the correct answer rather than just picking holes?
At the same time, the fact that the syntax for joins is so tricky to get right and interpret on reading is another major problem with Normalised Tables.

If you need phone numbers, you get phone numbers; if you need screen names, you get those - but why would you need every possible combination of phone_number and screen_name?
What if you needed a list of everybody’s screen name and their related phone numbers?
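For what it’s worth, the duplication being argued about looks roughly like this (Users / UserScreenNames / UserPhoneNumbers are illustrative table names):

    -- Joining both child tables in one query multiplies the rows: a user
    -- with 2 screen names and 3 phone numbers comes back 6 times.
    SELECT u.user_id, s.screen_name, p.phone_number
    FROM Users u
    JOIN UserScreenNames s ON s.user_id = u.user_id
    JOIN UserPhoneNumbers p ON p.user_id = u.user_id;

    -- Querying each child table separately returns every screen name and
    -- every phone number without the cross-product duplication.
    SELECT u.user_id, s.screen_name
    FROM Users u
    JOIN UserScreenNames s ON s.user_id = u.user_id;

    SELECT u.user_id, p.phone_number
    FROM Users u
    JOIN UserPhoneNumbers p ON p.user_id = u.user_id;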

You’re simply wrong. Speeding up joins is never a reason to denormalize for the simple reason that it doesn’t! Test it - you’ll see I’m right.
Well, as the article says, there’ll be no measurable difference until you reach billions of data rows. Did you even read the article?

But, do me a favor and write real joins - not ones that give you tremendous numbers of duplicated rows.
Which you have singularly failed to provide an example of.

very easy answer:
normalize till it’s good, denormalize till it works!
That’s the rule I learned in college.

Hi Jeff,

All I can say is: Thanks!

This is the most enjoyable and valuable post in the blogosphere for months!

Why do you need to use a relational db for something like querying tags in the first place? Another approach would be simply to treat the db as a persistence layer and maintain in-memory indexes for your tags.

Feeding a term vector of tags into something like Lucene is going to be faster than any DB query and a hell of a lot easier to scale.

An observation: if you use LINQ (Linq2SQL, Entity Framework, …) you don’t have to write all those inner joins. You can use properties to access related entities. I love SQL, but LINQ allows you to work with your data at a higher level of abstraction. If needed we, the SQL oldies, can still impress the young with blazing fast queries, which will become a lost art.

‘Golden Copy’ of data should also be taken into account when designing databases; too often I have seen databases that don’t apply any normalisation have multiple copies of critical fields spread across multiple tables…

select * whatever can also be a performance drain worse than the 16 inner joins you described. select what you need and nothing more.

@Ariel
Looks like you’d get some duplication where you had multiple screen names and multiple phone numbers, but that’s a very fine edge condition. It could also be helpful because you’d be able to detect users with multiple profiles or who have provided multiple contact details.

The fact that you didn’t provide one suggests that the normalised tables are quite tricky to extract the complete data from cleanly.

In your example you’ve turned several one-to-many or many-to-many relationships into 1-to-2 and 1-to-3 relationships. These are obviously not the same thing, and there are many real-life requirements for these standard relationships (1-N and N-N).

The problems really start happening when you aren’t normalizing your tables because your professor told you to, but because you need to model these one-to-many or many-to-many relationships in your data.

And I disagree that it takes millions of rows to make this stuff relevant. I worked on a project which had a search query with 4 or 5 joins, and even with specialised indexing for the major queries being run, any FULLTEXT searching on the data (3000 rows) still took 9 seconds. One summary table later and it was down to 0.09. Just by removing the joins!

So yeah, I guess my point is that the important thing here is the cost of a JOIN, and you need to know how to deal with joins (summary tables, proper indexing, query caching, etc.) in situations where they ARE necessary.
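A minimal sketch of that kind of summary table in MySQL-flavoured SQL (the product/tag schema and names are invented for illustration, not taken from the project above):

    -- Denormalized search table; rebuilt from the normalized tables,
    -- which remain the source of truth.
    CREATE TABLE product_search_summary (
        product_id  INT PRIMARY KEY,
        search_text TEXT,
        FULLTEXT KEY ft_search (search_text)
    ) ENGINE = MyISAM;  -- older MySQL only supports FULLTEXT on MyISAM

    -- Flatten the joined data once, instead of joining on every search.
    INSERT INTO product_search_summary (product_id, search_text)
    SELECT p.id,
           CONCAT_WS(' ', p.name, c.category_name,
                     GROUP_CONCAT(t.tag_name SEPARATOR ' '))
    FROM products p
    JOIN categories c ON c.id = p.category_id
    LEFT JOIN product_tags pt ON pt.product_id = p.id
    LEFT JOIN tags t ON t.id = pt.tag_id
    GROUP BY p.id, p.name, c.category_name;

    -- The multi-join FULLTEXT query becomes a single-table lookup.
    SELECT product_id
    FROM product_search_summary
    WHERE MATCH(search_text) AGAINST ('some search terms');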

Even a modest PC by today’s standards – let’s say a dual-core box with 4 gigabytes of memory…

Holy carp, is that considered modest? Somebody please tell my company so they’ll upgrade our Pentium 4 512mb machines. That billion dollars we make each year has to be going somewhere.

As I tried to clarify in my blog post, denormalization is something that should only be attempted as an optimization when EVERYTHING else has failed.

Denormalization brings with it its own set of problems. You have to deal with the increased set of writes to the system (which increases your I/O costs), you have to make changes in multiple places when data changes (which means either taking giant locks - ugh - or accepting that there might be temporary or permanent data integrity issues), and so on.

Reading various database stories may make it sound like all the cool kids are doing it, but all the cool kids are also rolling their own file systems and implementing their own alternatives to relational databases as well. They did that because they hit scaling problems and felt they had no alternatives, but they gave up something along the way.

More on this at http://www.25hoursaday.com/weblog/2007/10/10/WhenDatabasesLieConsistencyVsAvailabilityInDistributedSystems.aspx

Goran,
So my question is: where is the performance enhancement in using views?
When something is updated in the tables, all these join statements will be needed to make the view. Right?
So for a very large application, views do not give performance improvements. Tell me if I am wrong.

There’s a whole lot of denormalisation going on with even the most normalised databases. Indexes are denormalisations; so is anything you cache.

I think the approach to take here is to denormalise for speed, but to ensure that denormalised data sits on top of normalised data and can be rebuilt automatically. This is what happens with indexes, and it’s a tried and tested methodology.
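One sketch of that principle in plain SQL: a cached answer_count kept in sync by a trigger, but always rebuildable from the normalised table (the questions/answers schema here is invented purely for illustration):

    -- Denormalized counter column on the parent table.
    ALTER TABLE questions ADD COLUMN answer_count INT NOT NULL DEFAULT 0;

    -- Keep it in sync automatically as the normalized data changes.
    CREATE TRIGGER trg_answers_insert
    AFTER INSERT ON answers
    FOR EACH ROW
        UPDATE questions
        SET answer_count = answer_count + 1
        WHERE id = NEW.question_id;

    -- And if the cached value ever drifts, rebuild it from the
    -- normalized source of truth.
    UPDATE questions q
    SET q.answer_count = (SELECT COUNT(*) FROM answers a
                          WHERE a.question_id = q.id);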

As with indexes, anything denormalised should be hidden from the main programming effort. In fact, I’d go as far as to say that it should be a function of the database. How this can be done without an over-reliance on views is difficult to imagine though.