Maybe Normalizing Isn't Normal

Another thing you could try in the case of Stack Overflow is to reduce the data required from the user.
Some questions you can ask:

  1. Does our web application really need those phone numbers? Are we going to call our users?
  2. Is an address needed?
  3. Does gender matter?
  4. If an email address and password (and a display name, if necessary) are what really matter, why bother with all that unnecessary data?

You can add the bells and whistles later. First, Ship.

Jeff,

If you want to avoid big and complicated SQL statements, why not implement views instead of denormalizing? A good RDBMS should offer this mechanism, especially for performance and usage optimizations.

You could have a view called users and still say SELECT * FROM users, but exactly what is inside users can be changed at any time without changing the interface to the application (the users view in this example).
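A minimal sketch of what that could look like, using made-up table names loosely following the article’s example (users, user_screen_names):

    -- One view per "shape" of data the application needs; the underlying
    -- tables stay normalized and can be reorganized without touching callers.
    CREATE VIEW user_profiles AS
    SELECT u.user_id,
           u.email,
           u.display_name,
           s.screen_name
    FROM   users u
    LEFT JOIN user_screen_names s ON s.user_id = u.user_id;

    -- The application keeps a simple, stable query:
    SELECT * FROM user_profiles WHERE user_id = 42;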

One thing you’re not taking into account is a system like memcached that avoids hitting the database at all. When using such a system, it might be better to store normalized data, so that your hits on the database for cache misses are small and involve as little table locking as possible.

-Max

Before normalizing or denormalizing, think about what you need to do with the database and where you are going to do it. One example: we have a quite big database with test results, and most of the work is gathering data for reports. Normalizing versus denormalizing is an issue, and you only know which is faster if you know how you access the data.

It’s always amazed me how we spend hours normalizing databases just to spend more hours creating non-normalized views to run reports on.

I’ve always tried to normalize as much as was sensible, rather than slavishly following the supposed good practice of Boyce-Codd Normal Form.

Excellent thought-provoking article.

By the way: why do you want to create a list of users and addresses? Wanna sell them?

I’m not sure I agree with the total denormalisation in the example, Jeff. I’m (relatively) recently out of University, so I’ve still got the normalisation-is-everything rules in my mind :slight_smile: Being pragmatic, though, and as a couple of other people have noted, denormalisation is an optimisation of a fully-normalised design. I’d argue that denormalisation should be done in stages, in just the same way that normalisation is.

For example, I don’t think I’ve ever taken a production database past 3NF. Certainly the only time I’ve taken a design to 5NF was for my University database assignment. And aren’t stored procedures supposed to improve performance for this sort of query? Yes, you’re always going to take a performance hit on a query using six joins (there are some crazily-sized cartesian products going on in there), but wouldn’t a multi-layered approach help? I.e., for common or computation-intensive operations (such as viewing a user’s profile in the example above), use multiple stored procedures to extract sections of data and then another stored procedure to put them together. It could also be accomplished by using the user_id primary key to extract the relevant record from each table and then combining the records at the application level, roughly as sketched below.
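A rough, hypothetical sketch of that layered idea, in T-SQL-style syntax with invented table and procedure names:

    -- One small procedure per section of the profile...
    CREATE PROCEDURE get_user_core @user_id int AS
        SELECT email, display_name FROM users WHERE user_id = @user_id;
    GO
    CREATE PROCEDURE get_user_screen_names @user_id int AS
        SELECT screen_name FROM user_screen_names WHERE user_id = @user_id;
    GO
    -- ...and one that puts them together (or do that at the application level).
    CREATE PROCEDURE get_user_profile @user_id int AS
    BEGIN
        EXEC get_user_core @user_id;
        EXEC get_user_screen_names @user_id;  -- two small result sets, no wide join
    END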

Having said all that, I haven’t done any database programming in a good couple of years now, so I could be blowing a lot of hot air :slight_smile:

Databases are one of those things that you just can’t take a cavalier approach to. If you screw that design up, and worse still, launch it, you’re in for a world of hurt when the problems with the data come.

Normalisation means the only problem you’re going to get is scalability: your data is going to be good (presuming you were also clever/paranoid enough to get all the integrity checks in there too).

As the quote says, you can always denormalise later for speed.

If you create indexed views over the joins, the optimizer should use those automatically, and there won’t be a performance hit from keeping the data normalized.
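In SQL Server terms that looks roughly like the sketch below (table names are hypothetical; indexed views require SCHEMABINDING and come with restrictions, e.g. no outer joins):

    CREATE VIEW dbo.user_screen_name_view
    WITH SCHEMABINDING
    AS
    SELECT u.user_id, u.email, s.screen_name
    FROM   dbo.users u
    JOIN   dbo.user_screen_names s ON s.user_id = u.user_id;
    GO
    -- The unique clustered index materializes the join, so the cost is paid
    -- at write time rather than on every read.
    CREATE UNIQUE CLUSTERED INDEX ix_user_screen_name
        ON dbo.user_screen_name_view (user_id, screen_name);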

Addendum: and as others have said, there are much better solutions than denormalising your database, such as materialised views for reads, or memcached.

Quite a horrible job naming the tables. Why the underscores in field names and not in table names? Why user_id and not user_sex? Why table names in singular? Of course this is beside the point, but lol… :smiley:


Because when you’re selecting multiple id fields you want to be able to distinguish between them; for example, user_id and affiliation_id.

Instead of

    select user.id user_id, affiliation.id affiliation_id from …

you can just have

    select user_id, affiliation_id from …

That way you don’t always have to prefix the fields with the table name, and you also don’t have to alias the fields every time.

and table names are singular because each row in the table designates one specific person; one row = one user

That 6 join query is wrong, and the rest of your article follows from that incorrect join.

Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that.

If you need phone numbers, you get phone numbers; if you need screen names, you get those - but why would you need every possible combination of phone number and screen name?
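In other words (with hypothetical table names loosely following the article’s example), the difference is roughly:

    -- Joining both child tables at once multiplies rows:
    -- 3 phone numbers x 4 screen names = 12 rows for a single user.
    SELECT u.user_id, p.phone_number, s.screen_name
    FROM   users u
    JOIN   user_phone_numbers p ON p.user_id = u.user_id
    JOIN   user_screen_names  s ON s.user_id = u.user_id
    WHERE  u.user_id = 42;

    -- Fetching each list separately returns 3 + 4 rows instead:
    SELECT phone_number FROM user_phone_numbers WHERE user_id = 42;
    SELECT screen_name  FROM user_screen_names  WHERE user_id = 42;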

You’re simply wrong. Speeding up joins is never a reason to denormalize, for the simple reason that it doesn’t speed them up! Test it - you’ll see I’m right. But do me a favor and write real joins - not ones that give you tremendous numbers of duplicated rows.

And if you don’t even understand why your query is wrong, you have no business designing databases.

A valid reason to denormalize is to precalculate data, which you touch on very briefly. But you always write that data twice: once normalized (that’s the primary data), and then again as the cached/precalculated version. You should always be able to regenerate the latter from the normalized data.
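A small made-up sketch of that pattern: the normalized rows remain the source of truth, and the precalculated column can always be rebuilt from them.

    -- Primary, normalized data: one row per answer.
    INSERT INTO answers (question_id, user_id, body)
    VALUES (17, 42, 'an answer');

    -- Cached/precalculated copy, written alongside it:
    UPDATE questions SET answer_count = answer_count + 1 WHERE question_id = 17;

    -- And regenerable from the normalized data whenever needed:
    UPDATE questions
    SET    answer_count = (SELECT COUNT(*)
                           FROM   answers
                           WHERE  answers.question_id = questions.question_id);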

OK, one final reason to denormalize which you didn’t even mention: if you need to do a WHERE clause on one table but the ORDER BY on a different table, you need to denormalize, because you cannot create a combined index across both tables. (Databases with function indexes might be able to, but that’s pretty complicated.)
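For instance (hypothetical tables; the exact ALTER syntax varies by database), filtering on one table while sorting by a column from another:

    -- WHERE uses posts, ORDER BY uses users: no single index covers both,
    -- so every matching row has to be joined and then sorted.
    SELECT p.post_id, u.display_name
    FROM   posts p
    JOIN   users u ON u.user_id = p.user_id
    WHERE  p.tag = 'sql'
    ORDER  BY u.display_name;

    -- Denormalized fix: copy the sort column into posts and index the pair.
    ALTER TABLE posts ADD COLUMN author_display_name varchar(100);
    CREATE INDEX ix_posts_tag_author ON posts (tag, author_display_name);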

I’m sorry to bash you so much, but you shouldn’t write about what you have no experience with.

The approach we’ve taken at work is to keep the data normalised (well, all the new tables; we have some truly hideous legacy tables that we haven’t tidied up yet!) and then use either an in-memory cache or search tables to speed up access.

In fact our search table is one honking great big denormalised table, so there aren’t any joins at all. It is regenerated constantly by an offline task.

I’m always wary about denormalising the ‘canonical’ tables, as this leads you into a maintenance nightmare, especially when it comes to adding or removing columns.
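Roughly what that search table amounts to, sketched with invented names (CREATE TABLE ... AS SELECT syntax varies by database); an offline job rebuilds it from the canonical, normalised tables:

    -- Rebuilt periodically by a batch job; queries against it need no joins.
    CREATE TABLE user_search AS
    SELECT u.user_id,
           u.email,
           u.display_name,
           s.screen_name,
           a.affiliation_name
    FROM   users u
    LEFT JOIN user_screen_names s  ON s.user_id = u.user_id
    LEFT JOIN user_affiliations ua ON ua.user_id = u.user_id
    LEFT JOIN affiliations      a  ON a.affiliation_id = ua.affiliation_id;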

(You’re automatically measuring all the queries that flow through your software, right?)

Got an article about this? It would be a good topic.

Two keywords (already mentioned) that I want to underline: indexing and caching.

Indexing and caching aside, can somebody tell me how ‘views’ can increase the performance of a very large system?

and table names are singular because each row in the table designates one specific person; one row = one user

james: are you serious? Tables don’t have one row in them; they have many. So you make them plural (one row singular, multiple rows plural). Rule: table names are plural, column names are singular.

Also, you do not want to prefix every single column with the name of the table; what does that save you? Just type tablename.columnname instead of tablename_columnname - no need for aliasing. Sometimes you violate that rule where it might be confusing, but for the most part don’t double-prefix stuff.

Your technique of describing a denormalized database model is one of the basic principles of dimensional modelling, primarily used for data warehousing.

It’s pretty much the opposite of normalizing your model (it contains redundant data, uses lots of space, etc.). Especially if end users need to query this data for, e.g., reporting purposes, dimensional modelling is absolutely the way to go.

However, if you’re designing a normalized database, I would suggest keeping it normalized and not denormalizing at all, unless you’re designing a dimensional database structure from the beginning. Keep the two separated and don’t mix the modelling techniques, even though it might be more efficient in some cases. Making things unnecessarily complex is not something you want to do very often.
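For contrast, a tiny star-schema sketch of what dimensional modelling looks like (hypothetical warehouse tables, kept separate from the operational schema):

    -- Fact table: one narrow row per event, keyed into the dimensions.
    CREATE TABLE fact_question_views (
        date_key     int,
        user_key     int,
        question_key int,
        view_count   int
    );

    -- Dimension tables are deliberately wide and denormalized.
    CREATE TABLE dim_user (
        user_key     int PRIMARY KEY,
        display_name varchar(100),
        affiliation  varchar(100),
        country      varchar(50)
    );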

The DB nuts come out in force. Views as a performance improvement, LOL.

Even on a very small DB, those six joins will cost you. Oracle uses huge amounts of CPU for joins, whether or not you have lots of data. With any kind of load at all, it will cost you.

Just forget about using a relational DB and use Berkeley DB.

I’m starting to study object-relational databases, and I wonder if they wouldn’t be a nicer solution for you, since they remove the need to write long, complicated joins.

My studies haven’t yet gone too far on that topic, but it seems ORDBs are easier to deal with than relational ones.


Which software did you use to create those diagrams?