Maybe Normalizing Isn't Normal

I’ve said it before:

As someone who has had to work with a K-12 student records system that stored absence dates in a FORTY-column denormalized group with overflow rows, and with a similar design at a financial institution, I would ask that, rather than de-normalizing, you simply punch the next person who has to interface with the database as hard as you can in the stomach. Let them know the alternative, and they will politely thank you.

This article fails to show a partial normalization, which is better than either total de-normalization or total normalization. Where is the (ideal) partially normalized schema example?

There are at least three things normalization does:

  1. Reduce duplicate data, both in-memory and on-disk.
  2. Allow many-to-many references (MxN),
    where the quantities for N and M are not known.
  3. Give each table a primary key ID that can be indexed.

Consider this de-normalization of the sample schema; a rough sketch of the resulting tables follows the three points below.

  1. No duplicate data exists.
    Suppose that the user is allowed one screen name and one phone number. Then, there is no duplication possible for either. So, merge the UserPhoneNumber and UserScreenName tables into the main User table.

  2. Many-to-many references exist.
    We must let users join as many affiliates as they wish. Keep the UserAffiliation and Affiliation tables. This reduces wasted space, since there are no empty affiliate_1 to affiliate_10 fields in the User mega-table.

  3. De-normalize only when the waste is marginal.
    a) The waste is marginal.
    UserWorkHistory can merge with the Affiliation table. The extra integer IDs and short description rows are marginal compared to a join.
    b) The indexed join is effective.
    Most databases can choose which table’s index is optimal to start a join. Suppose the Affiliation table is very short, and we want records WHERE DESCRIPTION LIKE 'Clubs%'. The database loads the small Affiliation table into memory and gets its affiliation_id keys. Then it reads only the index-matched UserWorkHistory rows that fit those keys.
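
For what it’s worth, here is a rough sketch of the partially normalized schema those three points describe. The table and column names are my own guesses at the sample schema, not lifted from the article:

  -- One row per user; the single screen name and phone number are folded
  -- in because they can never repeat. ("User" is a reserved word in some
  -- engines; rename or quote it there.)
  CREATE TABLE User (
    user_id      INTEGER PRIMARY KEY,
    name         VARCHAR(100),
    screen_name  VARCHAR(50),
    phone_number VARCHAR(20)
  );

  -- Affiliations keep their own tables because a user may have any number
  -- of them (the MxN case from point 2).
  CREATE TABLE Affiliation (
    affiliation_id INTEGER PRIMARY KEY,
    description    VARCHAR(200)
  );

  CREATE TABLE UserAffiliation (
    user_id        INTEGER REFERENCES User(user_id),
    affiliation_id INTEGER REFERENCES Affiliation(affiliation_id),
    PRIMARY KEY (user_id, affiliation_id)
  );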

@David Aldridge

In Oracle you would create a hash cluster … I expect that other RDBMS’s have similar structures.

In my experience, only Oracle and Teradata can do this. And clustering the data will absolutely fix the OP’s concerns about performance.
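
For readers who haven’t seen one, an Oracle hash cluster looks roughly like this (the table and cluster names here are illustrative, not from the post):

  -- Rows from both tables that share a user_id hash to the same blocks,
  -- so a user and its affiliations can be read together with minimal I/O.
  CREATE CLUSTER user_cluster (user_id NUMBER)
    SIZE 512 HASHKEYS 100000;

  CREATE TABLE users (
    user_id NUMBER PRIMARY KEY,
    name    VARCHAR2(100)
  ) CLUSTER user_cluster (user_id);

  CREATE TABLE user_affiliations (
    user_id        NUMBER,
    affiliation_id NUMBER
  ) CLUSTER user_cluster (user_id);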

database-agnostic design sucks

Well, logical models should be database-agnostic. But a physical design that ignores the unique features of the database engine will truly suck.

As a general rule, when you hear someone speak about the performance implications of normalization, you can rest assured they don’t know WTF they’re talking about. This particular blog post is no exception.

Oh well. At least I can count on a continued lucrative contracting career fixing the problems caused by these know-nothings and the readers that follow their advice.

I think all databases must be in at least 3NF to avoid redundancies and anomalies. Anomalies derive from the existence of redundant data and functional dependencies.

For example:

In the above case, what would you do if your USERS changed their surname? Would you write an update statement to change the surname as many times as the user had affiliates? Imagine this is Facebook: how many rows would you update at a time? Could you keep on living like this? Locking? ACID? Transaction processing comes to mind.
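
To make that concrete, here is a minimal sketch with invented table and column names. When the surname is copied into every affiliation row, one name change fans out into many locked rows; in the normalized design it is a single-row update:

  -- Denormalized: surname repeated once per affiliation row.
  UPDATE user_affiliations
  SET    surname = 'Smith-Jones'
  WHERE  user_id = 42;            -- touches (and locks) N rows

  -- Normalized: surname lives in exactly one place.
  UPDATE users
  SET    surname = 'Smith-Jones'
  WHERE  user_id = 42;            -- touches exactly one row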

I don’t think it is a good idea to denormalize just because it is slow. If it is slow, I think it is because of badly written SQL. Normalisation was not invented to make things slow, either.

It is there for data integrity and consistency. The Relational Model is a sound and proven model based on set theory and relational algebra. You can read C.J. Date’s Introduction to Database Systems to find out more about the other scary/bad things that will happen to denormalized databases. If you denormalize and write everything into one table because it is easy, you might as well use a file system and a flat file; in that case you don’t have to use a PRIMARY KEY either!

The database should not be thought of as a hard disk, although I think it is used a lot like a hard disk nowadays.

Must admit I have read very interesting posts here!

My team spent a year and a half fixing a major system that was designed to be denormalized from the get-go, because they expected to have hundreds of millions of rows of data.

The genius who designed the system created a data structure that consisted mainly of a single table with just a few columns. One column held the type of data in the row, and another column held the actual data (a long, delimited string containing multiple columns’ worth of values). In effect, he had entity tables within rows.
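
Based on that description, the table would have looked something like this (names invented for illustration):

  -- One generic table holds every "entity"; the real columns are packed
  -- into a delimited string, so the database cannot enforce types, keys,
  -- or relationships on them.
  CREATE TABLE generic_data (
    row_id    BIGINT PRIMARY KEY,
    row_type  VARCHAR(30),          -- which entity this row pretends to be
    data_blob VARCHAR(4000)         -- e.g. 'Smith|John|1972-03-05|...'
  );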

It was pretty fast, but it was a nightmare to maintain, and data integrity went out the window even faster than the blindingly fast queries. (They weren’t really that fast, anyway.)

We replaced this data stack with a normalized relational database design. Some of the queries required a lot more joins, but you know what? The new system is much faster, data integrity is back, and maintenance no longer causes nightmares.

This author needs to learn a thing or two about database design…

Seems to me that if you want to use a modest PC to store BILLIONS of records of anything, denormalizing is just going to postpone your problem. If your site is important enough to have billions of records, then it is important enough to have the proper hardware, as well as the proper design.

That being said, denormalization does have its place, and it’s an important place. But it shouldn’t be considered until you have gone through the normalization process, at least so that you will have a better understanding of your data to build upon.

It’s quite a bit easier to build a denormalized structure from a normalized one for speed, but fixing the problems that crop up from starting with a denormalized structure isn’t always so easy. But just like anything else, it depends on what it is being used for.

Disks and memory are cheap and getting cheaper every nanosecond.

vs.

Please think twice before subjecting yourself, your fellow programmers, and your users to the XML angle bracket tax. – Jeff Atwood, CodingHorror.com, May 11, 2008

Please make up your mind!

This is exactly like the discussions that arise when someone decides normalizing data is unnecessary and has invented some new way to add array data types to SQL.
Brianary on May 12, 2008 08:25 AM

Hah! I was exactly right, there!

http://www.codinghorror.com/blog/archives/001114.html

Oy. And the DB fanatics come out to play.

A few issues I have with other comments in general:

  1. Use a view
    What many people might not realize is that when using a view, you are actually running two queries - one to build the view, and one to query against the view. You might not even need 90% of what the view is returning, so you are essentially just throwing away all that data and wasting the memory/CPU resources required to process it.

And before anyone throws out the CPU/DISK SPACE/MEMORY/BANDWIDTH IS CHEAP argument, think about the multiplicative effect of running those queries when you have hundreds or thousands of people connected to your database. Query optimization plays a -huge- role at this point, and a properly-normalized database will help.

  2. Coders don’t know Databases, and DBAs don’t know Code.
    Until you have to work as both. And then you understand how absurdly close they both are, and how if you mess one side up, the other side suffers just as much, if not more.

I’ve spent my share of time sifting through bad code that implemented poor entity relationships (y’know…the R in RDBMS). Tables that have 90 columns to store information about 30 fields, of which maybe (MAYBE) four are actually getting used. On the flip side, there are also the tables that are normalized to the point where retrieving an object’s data becomes an excruciating task involving six types of joins, views being joined on views, and the like. If you’re spending more time tracking down how to connect the information than you are actually GETTING the information, you have a problem.

  3. [De]Normalization is the best solution
    We don’t use flat files for data storage very much anymore, and for good reason. It’s easier to think of objects as their various components, and to model them accordingly. That’s where B-C were coming from - your tables should reflect the entities that make up your system. However, situations arise where it isn’t the best idea to have normalized data…reporting is the classic example (a small sketch follows), but this goes back to my previous point about spending more time connecting the information than actually retrieving it.
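
To illustrate the reporting case (the table names are made up here): the normalized source answers the question with an aggregate, while a denormalized rollup table, refreshed by a batch job, turns the same report into a single indexed read.

  -- Normalized source: compute totals on demand.
  SELECT product_id, SUM(amount) AS total_sales
  FROM   sales
  GROUP  BY product_id;

  -- Denormalized reporting table: the same totals, precomputed and
  -- refreshed periodically, so reports become simple key lookups.
  CREATE TABLE product_sales_rollup (
    product_id  INTEGER PRIMARY KEY,
    total_sales DECIMAL(18,2)
  );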

Mindless ranting aside, at the end of the day, we all just want to get data where it needs to be. Do what works the best, and everything will be fine.

@ James:

I’ve spent my share of time sifting through bad code that implemented poor entity relationships (y’know…the R in RDBMS).

No, actually, that’s not what the ‘R’ in ‘RDBMS’ refers to.

Loved the article. Coming from a data warehousing background I can vouch for denormalizing.

Jeff, that post was funny.

Denormalization is just a tool to solve performance problems when all other solutions have been exhausted.

The data integrity advantages of normalization far outweigh whatever a completely denormalized design would gain. In the past I tried to create a denormalized DB that works: the spaghetti code it generated would provoke the kind of reaction your site’s logo shows, and the data was completely rotten and untrustworthy.
I normalized it and it became a breeze to maintain.

Normalization is not about disk space. It is about maintainability.

What I don’t like is when people want your data denormalized because they deem it easier. I work with a database that actually had two versions of someone’s name: the all caps version and the regular version. Please for the love of everything that’s holy don’t let this happen to your database.

Generally speaking, you should be normalizing your domain models, then your business objects, then implementing a data structure that hopefully closely matches those designs, all of which should be normalized both for data and for behavior. If you do this and THEN find performance issues, THEN try optimization procedures like caching, or any of the suggestions above. I think de-normalization is a last-resort type of thing to do. Generally speaking, there’s just too much risk of data corruption when you duplicate like that.

when using a view, you are actually running two queries - one to build the view, and one to query against the view. You might not even need 90% of what the view is returning

Nope. The query optimizer places the view’s code in a subquery within your query that uses the view. Then the optimizer optimizes the whole thing.

You can see the whole thing in action by clicking show execution plan.
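
A quick illustration (the names are mine, not from the post): define a view, then query only part of it; on the major engines the optimizer expands the view into the outer statement before planning, so the unused columns are never separately materialized.

  CREATE VIEW user_affiliation_view AS
  SELECT u.user_id, u.name, a.description
  FROM   users u
  JOIN   user_affiliations ua ON ua.user_id = u.user_id
  JOIN   affiliations a       ON a.affiliation_id = ua.affiliation_id;

  -- Planned as if the joins were written inline; compare its execution
  -- plan with the hand-written equivalent and they should match.
  SELECT name
  FROM   user_affiliation_view
  WHERE  description LIKE 'Clubs%';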

Why not just throw out the whole kit and caboodle… go the CouchDB route?
It’s too hot not to consider.

http://en.wikipedia.org/wiki/CouchDB

instead of
select user.id
you can just have
select user_id…

I can’t help but feel this is a poor excuse for bad naming conventions. Data schemas seem to be the only place left in programming where arcane and redundant Hungarian naming schemes still have a hold - and the only technical reason that anyone can give is that it saves typing?

Sometimes I feel departmental naming schemes are there to give people who know nothing a sense of knowing something: pages and pages of abbreviation rules to memorise so you can make sure that your database never has anything over 12 characters when it can in fact handle 64. He who remembers the list of rules has an edge over others even if he doesn’t understand a thing about database design. And why call a field Passport when you can call it PrsnTrvlDc?

I sometimes look at schemas and wonder if the designer really knew how to normalise or is simply claiming that his design is for increased efficiency. Especially when his design requires a rewrite to add an extra item of data which should have been stored as a row but was being stored as a column name (AnswerToQ1, AnswerToQ2, etc). Especially if the data is ambiguous and contradictory and he does not realise it. Especially if reporting is now limited to a few use cases and not anything that could be asked that we haven’t thought of yet (a strength of normalisation is that it is future-proof and caters for the inevitably unanticipated).
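
For instance (hypothetical tables): the column-per-question design needs a schema change and a code rewrite for every new question, while the row-per-answer design just gets more rows.

  -- Column-per-question: adding question 3 means ALTER TABLE plus new code.
  CREATE TABLE survey_answers_wide (
    respondent_id INTEGER PRIMARY KEY,
    AnswerToQ1    VARCHAR(200),
    AnswerToQ2    VARCHAR(200)
  );

  -- Row-per-answer: adding question 3 is just another row.
  CREATE TABLE survey_answers (
    respondent_id INTEGER,
    question_id   INTEGER,
    answer        VARCHAR(200),
    PRIMARY KEY (respondent_id, question_id)
  );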

And, how can this argument make sense when the solution is to put the table name in the field name to avoid typing the table name?

Why do people keep insisting that it is easier to type table_id than table.id? It is actually one extra keystroke over the more naturalistic scheme.

If there was a performance argument, I could accept that (although many performance arguments seem dependent on assuming one knows how the query engine internals work and hoping it will not work differently on the next service pack).

Of course, there may be good reasons I haven’t encountered yet…

The things I hate the most are ignorance and stupidity. And no matter how hard I work, I just can’t seem to rid myself of them.

Normalize to the degree where you feel comfortable with the balance between duplication and performance.

A whopping six joins?! Really?! Omg, save the whales now before it’s too late.

Seriously, six joins isn’t a big deal, and unless you have multiple millions of people networking on your site, your database shouldn’t be dying with the model you describe. Someone f*ed up somewhere, and it has nothing to do with normalization.

You’re missing the HUGE costs to de-normalized data on large dataset updates. You have to update multiple places every time data changes. For large datasets this can be disastrous, especially when you have to update objects as a single unit (or roll them all back). As these updates increase, locks increase, which then slows down everything else. You’re seriously discussing running cron jobs to clean up the mess you are creating, rather than just writing clean code the first time.

If your only course to improve performance was denormalization, then you either have a pretty rare situation, or you need new architects and/or DBAs. I’ve worked at three companies now where developers’ lack of knowledge concerning databases and data relationships brought applications to a halt. It comes repeatedly from a developer-centric view of applications, where their part is always the most important component and everything exists to serve their application. After that comes an over-reliance on data-layer generators, which cannot optimize queries as well as a good DBA. So basically the developers make cuts to save time, misuse the data model, then blame the DBA when things go wrong.

Out of curiosity, did you find your DBAs and developers often arguing over where things should go? Every time I’ve seen that behavior, it’s been because DBAs feel they cannot trust the developers to intelligently access data, either due to the current team or due to that DBA’s past experiences.

Every time I see a post like this on a blog I try to imagine it in Nuclear Engineering. Do you find yourself putting needless safeguards in your nuclear generators just because some professor told you to? They cost too much! Remove them! Don’t do things by reflex! Every engineer should just do what works for them! Don’t let standards get in the way of progress and performance!

I find it humorous that people follow a million rules, such as “table names should be plural”, as doctrine, then casually discard good design at the first hint of an actual challenge. To be fair, naming tables is fairly easy, and you rarely find that table names stump a developer and cause them to look for problems.