Normalization is not a process
Database normalization is often explained as a step-by-step process for improving a database design. This is an unfortunate misunderstanding that causes a lot of confusion for students. Instead, the normal forms should be thought of as a checklist that can be used to analyze a database design for potential problems.
Presenting normalization as a step-by-step process leads to absurd results, like this example from Microsoft's documentation, where a table is normalized into first normal form by deliberately introducing redundancies and duplicating the primary key. The only reason to introduce the redundancies in the first place is so they can be eliminated again in the next normalization step. Any reasonable student would conclude that normalization is a waste of time!
In the real world, database design typically starts with a conceptual model on a whiteboard or with an entity-relationship diagram. The process would then go something like this:
- Each entity with its attributes becomes a table
- Select a primary key for each entity
- Identify one-to-many relationships and add a foreign key on the “many” side pointing to the “one” side
- Identify many-to-many relationships and create a junction table with foreign keys to both sides
This process typically ends in a normalized table design.
Let's say you want a database of books and authors, including genre, publishing year, and author nationality. You would immediately recognize two entities, author and book, with a many-to-many relationship between them (an author can write multiple books, and a book can have multiple authors). It is also obvious that genre and publishing year are attributes of the book, while nationality is an attribute of the author.
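A minimal sketch of how this conceptual model maps to tables, assuming SQLite and table/column names of my own choosing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- one table per entity
CREATE TABLE author (
    name        TEXT PRIMARY KEY,  -- assuming author names are unique, for brevity
    nationality TEXT NOT NULL      -- attribute of the author
);
CREATE TABLE book (
    title TEXT PRIMARY KEY,
    genre TEXT NOT NULL,           -- attributes of the book
    year  INTEGER NOT NULL
);
-- junction table for the many-to-many relationship
CREATE TABLE book_author (
    title  TEXT NOT NULL REFERENCES book(title),
    author TEXT NOT NULL REFERENCES author(name),
    PRIMARY KEY (title, author)
);
""")
```

Each author's nationality is stored exactly once, no matter how many books the author wrote, so there is nothing left to normalize.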
But normalization tutorials typically have you start by putting all data attributes into one single table, regardless of the entity relationships. This requires introducing repeated data, sometimes (as in the Microsoft example) even duplicating primary keys. This crazy design is then gradually normalized into more and more tables, until you end up with a reasonable design.
But all the steps except the last represent table designs which no reasonable person would have come up with in the first place.
This does not mean such bad designs cannot occur in a real-world database. We all make mistakes, and especially when a data model grows and changes incrementally over time, such problems can creep in. Therefore, a database professional should be able to recognize them and mitigate them.
So where did it begin?
Hopefully, I made my case that normalization should not be understood as a process. But for the historically inclined, it may be interesting to know how this idea came to be in the first place.
Normalization-as-process makes sense in one specific scenario: converting a hierarchical database model into a relational model. Hierarchical databases were popular at the time E. F. Codd invented the relational model, so it was necessary for him to show how hierarchical models could be converted to the relational model without information loss.
Here is a hierarchical model over books and authors:
    The Count of Monte Cristo, Fiction, 1844
        Alexandre Dumas, French
    The Three Musketeers, Fiction, 1844
        Alexandre Dumas, French
    The C Programming Language, Computer languages, 1978
        Brian W. Kernighan, Canadian
        Dennis M. Ritchie, American
Note that in a hierarchical model, we cannot express many-to-many relationships between records. We must pick one direction as the parent-child structure and then repeat information for child records that occur under multiple parents. In this case the author nationality is repeated for each book by the author. (If we had selected authors as the top level instead, it would be the attributes of the book that were repeated for each author.)
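The parent-child structure can be pictured as nested records; a small illustrative sketch in Python (field names are my own) makes the repetition easy to see:

```python
# Books are the parent records; author records are nested under them.
books = [
    {"title": "The Count of Monte Cristo", "genre": "Fiction", "year": 1844,
     "authors": [{"name": "Alexandre Dumas", "nationality": "French"}]},
    {"title": "The Three Musketeers", "genre": "Fiction", "year": 1844,
     "authors": [{"name": "Alexandre Dumas", "nationality": "French"}]},
    {"title": "The C Programming Language", "genre": "Computer languages", "year": 1978,
     "authors": [{"name": "Brian W. Kernighan", "nationality": "Canadian"},
                 {"name": "Dennis M. Ritchie", "nationality": "American"}]},
]

# Dumas occurs as a child of two parent records,
# so his nationality is stored twice:
dumas_copies = sum(a["name"] == "Alexandre Dumas"
                   for b in books for a in b["authors"])
```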
The normalization process suggested by Codd extracts the nested records into separate relations and adds the primary key of the parent record as a foreign key to the child record. Assuming the author's name is the key, the primary key of the new table would be title + author name, and the title would do double duty as a foreign key to the book table.
Book
| Title (PK) | Genre | Year |
|---|---|---|
| The Count of Monte Cristo | Fiction | 1844 |
| The Three Musketeers | Fiction | 1844 |
| The C Programming Language | Computer languages | 1978 |
Book Author
| Title (PK, FK) | Author (PK) | Author Nationality |
|---|---|---|
| The Count of Monte Cristo | Alexandre Dumas | French |
| The Three Musketeers | Alexandre Dumas | French |
| The C Programming Language | Brian W. Kernighan | Canadian |
| The C Programming Language | Dennis M. Ritchie | American |
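The repeated nationality column is not just wasted space; it invites update anomalies. A small sketch (SQLite, table name my own) of what happens when an update misses one of the duplicated rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE book_author (
    title              TEXT NOT NULL,
    author             TEXT NOT NULL,
    author_nationality TEXT NOT NULL,
    PRIMARY KEY (title, author)
);
INSERT INTO book_author VALUES
    ('The Count of Monte Cristo', 'Alexandre Dumas', 'French'),
    ('The Three Musketeers',      'Alexandre Dumas', 'French');
""")

# A sloppy update changes the nationality in only one of the two rows...
con.execute("""UPDATE book_author SET author_nationality = 'Unknown'
               WHERE title = 'The Count of Monte Cristo'""")

# ...and the database now gives two answers to one question:
nationalities = con.execute(
    """SELECT DISTINCT author_nationality FROM book_author
       WHERE author = 'Alexandre Dumas'"""
).fetchall()
```

In the normalized design the nationality exists in exactly one row, so this inconsistency cannot arise.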
This process leads to tables in 1NF, but the redundancy from the hierarchical model is still there: the author nationality is still repeated once per book. In other words, we have a 2NF violation! So we perform the next step of normalization, which leaves us with a table for books, a table for authors, and a junction table defining the many-to-many relationship between authors and books.
Book
| Title (PK) | Genre | Year |
|---|---|---|
| The Count of Monte Cristo | Fiction | 1844 |
| The Three Musketeers | Fiction | 1844 |
| The C Programming Language | Computer languages | 1978 |
Book Author
| Title (PK, FK) | Author (PK, FK) |
|---|---|
| The Count of Monte Cristo | Alexandre Dumas |
| The Three Musketeers | Alexandre Dumas |
| The C Programming Language | Brian W. Kernighan |
| The C Programming Language | Dennis M. Ritchie |
Author
| Author (PK) | Author Nationality |
|---|---|
| Alexandre Dumas | French |
| Brian W. Kernighan | Canadian |
| Dennis M. Ritchie | American |
This is exactly the many-to-many model we would have designed in the first place, had we not had to conform to the limitations of a hierarchical model.
So, what went wrong? Relational databases became the standard and hierarchical databases became niche products. The next generations of writers and teachers were not familiar with the context Codd was writing in, yet they kept teaching the normal forms as a step-by-step process. Instead of using the obsolete hierarchical form as a starting point, they deliberately introduce errors into a relational model to motivate the step-wise normalization.