Normalization is not a process

Database normalization is often explained as a step-by-step process for improving a database design. This is an unfortunate misunderstanding which causes a lot of confusion for students. Instead, the normal forms should be thought of as a checklist which can be used to analyze a database design for potential problems.

Presenting normalization as a step-by-step process leads to absurd results, like this example from the Microsoft documentation, where a table is normalized into first normal form by deliberately introducing redundancies and duplicating the primary key. The only reason to introduce the redundancies in the first place is so they can be eliminated again in the next step of the normalization. Any reasonable student would conclude that normalization is a waste of time!

In the real world, database design typically starts with a conceptual model on a whiteboard or with an entity-relationship diagram. The process would then go something like this:

  1. Map each entity with its attributes to a table
  2. Select a primary key for each entity
  3. Identify one-to-many relationships and add a foreign key on the “many” side pointing to the “one” side
  4. Identify many-to-many relationships and create a junction table with foreign keys to both sides

This process will typically end up with a normalized table design.

Let's say you want a database of books and authors, with genre, publication year, and author nationality. You would immediately realize that you have two entities, author and book, with a many-to-many relationship between them (because an author can write multiple books and a book can have multiple authors). It is also obvious that genre and publication year are attributes of the book, while author nationality is an attribute of the author.
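Applying the four steps above to this example gives a schema along these lines. This is a minimal SQL sketch; the table and column names are my own, and I use the natural keys (title and author name) from the tables later in this article rather than surrogate ids:

```sql
-- Steps 1 and 2: one table per entity, each with a primary key
CREATE TABLE book (
    title TEXT PRIMARY KEY,
    genre TEXT NOT NULL,
    year  INTEGER NOT NULL
);

CREATE TABLE author (
    name        TEXT PRIMARY KEY,
    nationality TEXT NOT NULL
);

-- Step 3 does not apply: there are no one-to-many relationships here.
-- Step 4: a junction table for the many-to-many relationship,
-- with foreign keys to both sides
CREATE TABLE book_author (
    title TEXT NOT NULL REFERENCES book (title),
    name  TEXT NOT NULL REFERENCES author (name),
    PRIMARY KEY (title, name)
);
```

This is already a normalized design; no normalization steps were needed to get here.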

But normalization tutorials will typically have you start by putting all data attributes into one single table, regardless of the entity relationships. This requires introducing repeated data, sometimes (as in the Microsoft example) even duplicating primary keys. Then this crazy design is gradually normalized into more and more tables, until you end up with a reasonable design.

But all the steps except the last represent table designs which no reasonable person would have come up with in the first place.

This does not mean such bad designs cannot occur in a real-world database. We all make mistakes, and especially when a data model grows and changes incrementally over time, such errors can creep in. Therefore, a database professional should be able to recognize such problems and fix them.

So where did it begin?

Hopefully, I have made my case that normalization should not be understood as a process. But for the historically inclined, it may be interesting to know how this idea came to be in the first place.

Normalization-as-process makes sense in one specific scenario: when converting a hierarchical database model into a relational model. Hierarchical databases were popular at the time E. F. Codd invented the relational model, so it was necessary for him to show how hierarchical models could be converted to relational models without information loss.

Here is a hierarchical model of books and authors:

The Count of Monte Cristo, Fiction, 1844
    Alexandre Dumas, French
The Three Musketeers, Fiction, 1844
    Alexandre Dumas, French
The C Programming Language, Computer languages, 1978
    Brian W. Kernighan, Canadian
    Dennis M. Ritchie, American

Note that in a hierarchical model, we cannot express many-to-many relationships between records. We must select one of the directions as the parent-child structure and then repeat information for child records which occur under multiple parents. In this case, the author nationality is repeated for each book by the author. (If we selected authors as the top level instead, it would be the attributes of the book which were repeated for each author.)

The normalization process suggested by Codd extracts the nested records into separate relations and adds the primary key of the parent record as a foreign key to the child record. Assuming the author’s name is the key, the primary key of the new table would be title + author name, and the title would do double duty as a foreign key to the book table.

Book

| Title (PK)                 | Genre              | Year |
|----------------------------|--------------------|------|
| The Count of Monte Cristo  | Fiction            | 1844 |
| The Three Musketeers       | Fiction            | 1844 |
| The C Programming Language | Computer languages | 1978 |

Book Author

| Title (PK, FK)             | Author (PK)        | Author Nationality |
|----------------------------|--------------------|--------------------|
| The Count of Monte Cristo  | Alexandre Dumas    | French             |
| The Three Musketeers       | Alexandre Dumas    | French             |
| The C Programming Language | Brian W. Kernighan | Canadian           |
| The C Programming Language | Dennis M. Ritchie  | American           |
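In DDL terms, the extracted design looks something like this (again a sketch with my own table and column names, matching the tables above):

```sql
-- Flattening the hierarchy: each child record gets a composite key
-- made of the parent key plus its own key.
CREATE TABLE book (
    title TEXT PRIMARY KEY,
    genre TEXT NOT NULL,
    year  INTEGER NOT NULL
);

CREATE TABLE book_author (
    title       TEXT NOT NULL REFERENCES book (title),
    author      TEXT NOT NULL,
    nationality TEXT NOT NULL,  -- depends on author alone, not the full key
    PRIMARY KEY (title, author)
);
```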

This process leads to tables in 1NF. However, the redundancy from the hierarchical model is still there: the author nationality is still repeated once per book. The nationality depends only on the author part of the composite key, not on the whole key. In other words, we have a 2NF violation! So, we perform the next step of normalization, which leaves us with a table for books, a table for authors, and a junction table defining the many-to-many relationship between authors and books.
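The decomposition step itself can be expressed directly in SQL against the 1NF sketch above (again just an illustration; the exact CREATE TABLE ... AS SELECT and DROP COLUMN syntax varies slightly between database systems):

```sql
-- Pull the partially dependent attribute out into its own table,
-- keyed by the part of the key it actually depends on.
CREATE TABLE author AS
SELECT DISTINCT author, nationality
FROM book_author;

-- Then drop the now-redundant column from the junction table.
ALTER TABLE book_author DROP COLUMN nationality;
```

The resulting tables look like this: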

Book

| Title (PK)                 | Genre              | Year |
|----------------------------|--------------------|------|
| The Count of Monte Cristo  | Fiction            | 1844 |
| The Three Musketeers       | Fiction            | 1844 |
| The C Programming Language | Computer languages | 1978 |

Book Author

| Title (PK, FK)             | Author (PK, FK)    |
|----------------------------|--------------------|
| The Count of Monte Cristo  | Alexandre Dumas    |
| The Three Musketeers       | Alexandre Dumas    |
| The C Programming Language | Brian W. Kernighan |
| The C Programming Language | Dennis M. Ritchie  |

Author

| Author (PK)        | Author Nationality |
|--------------------|--------------------|
| Alexandre Dumas    | French             |
| Brian W. Kernighan | Canadian           |
| Dennis M. Ritchie  | American           |

This is exactly the many-to-many model we would have designed in the first place, if we did not have to conform to the limitations of a hierarchical model.

So, what went wrong? Relational databases became the standard, and hierarchical databases became niche products. The next generations of writers and teachers were not familiar with the context Codd was writing in, but they kept teaching the normal forms as a step-by-step process. Instead of using the now-obsolete hierarchical form as a starting point, they deliberately introduce errors into a relational model to motivate the step-wise normalization.