Maybe Normalizing Isn't Normal


Your normalized database could actually have a couple more tables. Phone Type, IM Service, Company Name, and maybe even Title could be in lookup tables, so you can find all the people at company X, or using IM service Y, etc.
You can’t allow free text there, or some will enter “ATT” and others “AT&T”. You could control it on the front end, but you’d still need a table to populate the list.
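
For illustration, a minimal sketch of such lookup tables in generic SQL (the users table and all names here are assumed, not from the comment):

    CREATE TABLE company (
        company_id INT PRIMARY KEY,
        name       VARCHAR(100) NOT NULL UNIQUE  -- one canonical spelling per company
    );

    CREATE TABLE user_affiliation (
        user_id    INT NOT NULL REFERENCES users (user_id),
        company_id INT NOT NULL REFERENCES company (company_id),
        PRIMARY KEY (user_id, company_id)
    );

    -- All the people at company X, however anyone might have typed it:
    SELECT ua.user_id
    FROM user_affiliation ua
    JOIN company c ON c.company_id = ua.company_id
    WHERE c.name = 'AT&T';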

Ask anyone who works with the family of databases called “multivalued” (such as Revelation’s OpenInsight or IBM’s U2) and they’ll tell you that you normalise when data HAS to be treated atomically – or, as so many others have said, design to third normal form, then justify denormalising. Removing external joins is one easy way of reducing web latency.

You make a good point about denormalization. I am developing a social networking app and have been fairly strict about normalizing data, and have created indexes to help performance. It is not yet a problem, but it has occurred to me that occasionally tortuous joins may eventually exact a performance hit. I’m fully prepared to abandon purity for performance when/if this happens. Thought-provoking article.

An obvious (?) solution which works for even very large sites is to store the master copy of your table in normalized form, and then use batch jobs to denormalize the data as appropriate. This means you can have multiple denormalized forms suitable for different purposes, without losing any data integrity.

Of course this precludes the use of ORM, but once you start talking about performance and scalability you’ve already done that.
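
Schematically, the batch job might look something like this (all table and column names invented for the sketch):

    -- The normalized tables remain the master copy; a scheduled job
    -- rebuilds a wide, read-only table for the hot query paths.
    TRUNCATE TABLE user_profile_flat;

    INSERT INTO user_profile_flat (user_id, user_name, company_name, phone_number)
    SELECT u.user_id, u.name, c.name, p.number
    FROM users u
    LEFT JOIN user_affiliation ua ON ua.user_id   = u.user_id
    LEFT JOIN company c           ON c.company_id = ua.company_id
    LEFT JOIN phone_number p      ON p.user_id    = u.user_id;

One user may come out as several rows here; that kind of duplication is the whole point of the flat copy, and it stays safe because it can always be rebuilt from the normalized master.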

People normalize because their professors told them to.

Never had one of those. I normalize because it’s elegant and easier for me to understand. I don’t do much work on huge systems, except for our major systems, which were already designed for us and are quite denormalized already. Of course, now I’m doing more data warehouse work, where all the rules get thrown out the window.

Using NHibernate lazy loading would allow you to just compose your object tree with the data you need at that point in time.

Say, for example, you needed to display the user’s name. There’s no need to populate the whole aggregate (Affiliation, Work History…); just load the root (User).

That should help cut down some of the joins.
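
In SQL terms, the difference is roughly this (illustrative only; not NHibernate’s actual generated SQL, and the tables are assumed):

    -- Eager: one wide join that pulls the whole aggregate up front
    SELECT u.*, a.*, w.*
    FROM users u
    LEFT JOIN affiliation a  ON a.user_id = u.user_id
    LEFT JOIN work_history w ON w.user_id = u.user_id
    WHERE u.user_id = 42;

    -- Lazy: just the root; the child queries fire only if those
    -- collections are actually touched later.
    SELECT u.user_id, u.name
    FROM users u
    WHERE u.user_id = 42;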

In the end, of course, ‘it depends’.

You’re just another knucklehead. Think of it this way: would you take seriously a guy who’s spent 20 years on the line at Tyson dismembering chickens announcing that he was really a neurosurgeon? And that he’s got the perfect way to remove that tumor that’s been giving you massive headaches? The folks who’ve spent decades developing RDBMSs and systems from same really do know more about it than you do.

That said: an RDBMS is about shared transactional data. If you really don’t care about keeping the data right all the time, then how you store it doesn’t matter. Read Codd’s papers or Date’s book. If you haven’t bothered to do either (you haven’t, right?), then why should your opinion matter?

There are reasons Codd drew his conclusions. He didn’t just make it up.

Some thoughts I’ve collected over the years.

Not only am I not aware of any resource that would advocate logical modeling of data in an unnormalized fashion, I would be against anyone reading it if it existed. The logical data model should be normalized. Normalization is the process of identifying the one best place each fact belongs.
– Craig Mullins/2007

There’s a tendency, especially among managers of smaller shops, to assume that any developer knows how to set up a database. Frankly, this perplexes me. You wouldn’t assume that any given developer knows how to code in C# or set up a Web Service, so why is it that we’re all supposed to be database pros? The end result is that too many databases are designed by people who have never even heard the term normalization, never mind developed any understanding of the various normal forms.
– Mike Gunderloy/2006

This is a great book on building databases the correct way. I highly recommend this book [Database Modeling and Design by Teorey], and if you cannot understand it you shouldn’t be building a database anyway.
– Robert C/1999 [the current 4th edition/2006 is even better]

Many of the existing formatted data systems provide users with tree-structured files or slightly more general network models of the data. Application programs developed to work with these systems tend to be logically impaired…
– E. F. Codd, A Relational Model of Data for Large Shared Data Banks/1970

[Dr.] Codd had a bunch of …fairly complicated queries, and since I’d been studying CODASYL (the language used to query navigational databases), I could imagine how those queries would have been represented in CODASYL by programs that were five pages long that would navigate through this labyrinth of pointers and stuff [XML anyone?]. Codd would sort of write them down as one-liners. …[T]hey weren’t complicated at all. I said, ‘Wow.’ This was kind of a conversion experience for me. I understood what the relational thing was about after that.
– Don Chamberlin/1995

Proper data management is the key to great architecture. Ignoring this and abstracting data access and data management away just to have a convenient programming model is … problematic. … Many of the proponents of O/R mapping that I run into (and that is a generalization and I am not trying to offend anyone – just an observation) are folks who don’t know SQL and RDBMS technology in any reasonable depth and/or often have no interest in doing so.
– Clemens Vasters/2006

The big myth perpetrated by architects who don’t really understand relational database architecture (me included early in my career) is that the more tables there are, the more complex the design will be. So, conversely, shouldn’t condensing multiple tables into a single catch-all table simplify the design? It does sound like a good idea, but at one time giving Pauly Shore the lead in a movie sounded like a good idea too.
– Louis Davidson/2007

Technologies like OODBMS sacrifice sound, application technology agnostic data management for short-term convenience (convenience for one single application, written using one particular programming language). Relational technology essentially completely replaced network or hierarchical database technology, and there were excellent reasons why that happened. We should most certainly not be reviving either of those discredited approaches by slapping on the latest buzzwords (OO, XML, etc) as window dressing. …We don’t see any future for JDO.
– Gavin King/2004 [he wrote Hibernate]

…the logic exercised by the database to maintain referential integrity is not free, but if the rdbms does not do it, it gets done in the application, where it costs the same if not more - just now the application becomes slower versus the rdbms. Whether the logic of triggers or RI is performed in the application or the database, it’s going to have similar pathlength - the only question is whether the rdbms or application can be more efficient. As more applications are written in languages that emphasize portability over performance (Java) or ease of learning over performance (Visual Basic), the benefits of putting logic that centres around data further down into the database engine become stronger.
– Blair Adamache/2003

In CS we don’t have a lot of formal models to guide us, as in engineering or other science. Much of CS is entirely ad-hoc. However we do have a sound and complete model for data storage (relational model) and hardly anyone uses it. It boggles my mind. Do people not want their programs to work predictably? [corollary: if you must interact with an Application to modify data, the data specification isn’t relational]
– Anonymous Coward/2005

The people using SQL to store data often don’t grasp that there’s a difference between expressing various facts as a set of relations and stuffing data into a database like you stuff material into a file.
– Christopher Browne/2005

Programmers will always gravitate towards viewing the data in their databases as their private bailiwick, and insist that users of the data access it through their own API. Learning SQL is certainly better than learning a hundred programmers’ different APIs.
– David Cressey/2005

It is very obvious that a lot of people have absolutely no idea what a business model is, or for that matter a database. The number of times I have seen business rule code that should be embedded in the database definition and not in the BLL (Business Logic Layer) just makes me want to scream. As soon as data rules (the backbone of a business model) are moved away from a relational-style declarative constraint, you have reduced the effectiveness, meaning and integrity of your business model/database.
– DavidM/2004

Normalizing is an attempt to optimize generally. Denormalization is an attempt to normalize for a particular application, to introduce a bias.
– Gulutzan & Pelzer/2003

If you store all your business logic within the database, then it doesn’t matter what flaky applications people write; you can guarantee the integrity of the data.
– Pretsel/2004
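
To make those last couple of quotes concrete, a small sketch of business rules expressed as declarative constraints (hypothetical schema):

    CREATE TABLE payment (
        payment_id INT PRIMARY KEY,
        account_id INT NOT NULL REFERENCES account (account_id), -- referential integrity
        amount     DECIMAL(10, 2) NOT NULL CHECK (amount > 0),   -- business rule, declared once
        status     VARCHAR(10) NOT NULL
                   CHECK (status IN ('pending', 'settled', 'void'))
    );
    -- Whatever flaky application writes to this table, these rules hold.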

Let’s unroll our loops next. (Christog)

I’ve seen that, and I must say I was blown away by how removing loops can make your application run faster. Of course, you shouldn’t jump right away and erase all your loops and ifs, but sometimes it can help, just like de-normalization.
You can read about a guy optimizing his XML parser in the book Beautiful Code. Sorry, I don’t remember who wrote it.

I run an e-commerce website with about 400,000 page views a month. I’ve been a complete purist about normalizing the database, as well as using business objects to access all data (rather than using ad hoc SQL queries). We have some pretty complicated business object hierarchies and I’ve long been waiting for there to be a performance issue (especially as we use a shared SQL Server), but the website continues to be blazingly fast. I realize that you are going to have a lot more page views than our site, but it looks like your schema is going to be simpler and you have a dedicated SQL Server, so I think you can afford to be a purist for now and not worry about denormalizing until it becomes an issue.

Good post Jeff.

The point many of the jihadist posters are missing is that the right thing may change dramatically as an application scales up.

Your high scalability point is key. As you said, “Everything is fast for small n.”

The ideals of normalized data, and even of using a RDBMS at all, fall apart at some scale, as Google and Amazon will attest.

jeff, are you awake?

It’s funny how many people just ignore your last conclusion because it didn’t suit their argument; “normalize until it hurts, denormalize until it works” sums it up very nicely.

That being said, the above example would be no reason to change the normalized DB schema, but instead to change the query (obviously). In any case, such a query would very rarely be necessary (someone already mentioned data warehousing, I think).

Ahhhhh! Don’t tell people that! Denormalization may be all well and good when you need the performance and your system is STABLE enough to support it. Doing this in a business environment is a recipe for disaster; ask anyone who has spent weeks digging through thousands of lines of legacy code, making sure to support the new and absolutely required affiliation_4. Then do the whole thing over again 3 months later when some crazy customer has five affiliations.

Traditional database design principles tell you that well-designed databases are always normalized, but I’m not so sure.

Whose traditional database design principles are you referring to? It’s been some time since I took any database courses at college, but I do remember that both the book and the instructor told us that going all the way to fifth normal form was extreme and not a good idea, because it added complexity to the queries used to get the data, which slowed things down. We were told that in most cases, third normal form will provide the best results.

The purpose of normalization is to reduce size and eliminate duplicate data, but the trade-off is that the queries used to reconstruct the data become more complex and thus take more time.
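
As a concrete example of that trade-off (hypothetical tables):

    -- Flat: the customer's city is repeated on every order and free to drift
    CREATE TABLE orders_flat (
        order_id      INT PRIMARY KEY,
        customer_name VARCHAR(100),
        customer_city VARCHAR(100),  -- duplicated per order
        order_date    DATE
    );

    -- 3NF: each fact stored once, at the cost of a join to read it back
    CREATE TABLE customer (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        city        VARCHAR(100)
    );
    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL REFERENCES customer (customer_id),
        order_date  DATE
    );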

(disclaimer: I’m talking completely about Microsoft SQL Server 2005 here.)

I’m calling shenanigans on this post, unless you put up a definitive example of where your app is running into problems with a normalized database design.

Like others have said, to get the user data you could use an indexed view. But also, as others have said, if you have proper indexes and foreign keys set up, the SQL engine should have no problem figuring it out and generating the same query plan as if you had just queried those tables directly in the first place.
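
For reference, the rough shape of an indexed view in SQL Server 2005 (table and column names here are assumed, not from the post; PhoneId is assumed to be the primary key of PhoneNumbers):

    -- The view must be schema-bound (two-part names, no outer joins)
    -- before it can be indexed.
    CREATE VIEW dbo.UserContact
    WITH SCHEMABINDING
    AS
    SELECT u.UserId, u.DisplayName, p.PhoneId, p.PhoneNumber
    FROM dbo.Users u
    JOIN dbo.PhoneNumbers p ON p.UserId = u.UserId;
    GO

    -- The unique clustered index is what actually materializes the view.
    CREATE UNIQUE CLUSTERED INDEX IX_UserContact
    ON dbo.UserContact (PhoneId);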

I’d also like to say that the example given is a poor one, seeing as you’re joining 1-to-1 tables to 1-to-n ones (phone numbers, affiliations, etc.).

Not to mention that de-normalization leads to its own problems…

PS. I name my fields with ORM in mind, even if I don’t use it. Thus, Id, not UserId or god forbid user_id…

Keep it normalized. How often do you actually need EVERYTHING about a user at once? Probably only when you are editing that user.

As others have said, use proper indexes to speed it up. De-normalized tables will haunt you for years whenever there are issues. As in an earlier post, you are being cruel to the maintenance programmer (even if that is yourself).
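
For example, a covering index lets the common read path skip the base table entirely (SQL Server syntax; column names are made up):

    -- Key on the search column; INCLUDE the columns the query selects
    CREATE NONCLUSTERED INDEX IX_Users_Email
    ON dbo.Users (Email)
    INCLUDE (DisplayName, CreatedDate);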

all of those with a data access layer step forward… not so fast jeff

I run an e-commerce website with about 400,000 page views a month

That’s very low. I’ve got a site doing twice that traffic every day, with about 30 tables. The heaviest (and most common) query works on a table of 200k records and requires a sequential scan each time, since the sort criterion is based on an average of a field in the result set that is different for each query, and the results are finally joined against 12 tables.

All running on a 2 GHz Opteron with 1 GB of RAM, and CPU usage hovers at 40% at peak times.

God bless PostgreSQL.
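
For what it’s worth, the query shape being described sounds roughly like this (a guess at the structure; the real schema isn’t shown):

    -- The sort key is an average computed from the result set itself,
    -- so no precomputed index order can serve it: hence the seq scan.
    SELECT i.item_id, AVG(r.score) AS avg_score
    FROM items i
    JOIN ratings r ON r.item_id = i.item_id
    -- ...the real query reportedly joins about a dozen more tables...
    GROUP BY i.item_id
    ORDER BY avg_score DESC;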