What does First Normal Form actually mean?
Despite being a fundamental concept in relational databases, First Normal Form (or 1NF) is often explained confusingly or downright incorrect. This post explains what 1NF really means and is useful for and debunk a number of the misunderstandings.
E. F. Codd, the inventor of the relational model, explained 1NF as: "Eliminate domains which have relations as elements". In more common database terminology, this means "Eliminate columns which can have tables as values".
Since SQL does not allow creating or using nested tables, most relational databases will be in first normal form by necessity.
Codd's 1971 paper Further normalization of data base relational model has this helpful figure:
UNNORMALIZED FORM | | eliminate domains which | have relations as elements 🡓 FIRST NORMAL FORM | | eliminate non-full dependence | of non-prime attributes on | candidate keys 🡓 SECOND NORMAL FORM | | eliminate transitive dependence | of non-prime attributes on | candidate keys 🡓 THIRD NORMAL FORM
(I have tried to recreate the ASCII-art in HTML!)
A note about terminology: The fundamental premise of the relational model is to view data as relations. A relation more or less correspond to a table (a common misunderstanding is that relation refer to the relationships between tables), an attribute correspond to a column, and a domain – the critical concept in 1NF – correspond to the data type of a column.
A relation has certain constraints compared to more vaguely defined notion of table:
- Rows must be unique. While some databases allow duplicate rows, this is not in accordance with the relational model. In practice this is ensured by defining a primary key for a table.
- Rows cannot have any intrinsic ordering. If the order of a set of rows is significant, the order have to be expressed as values (e.g., an "order" column)
- A relation has a fixed number of attributes (columns) and each column have a domain which defines what values are allowed.
So even in the unnormalized form, data must conform to certain constraints to be considered relational. But there are no constraints on domains. Since domains can contain anything, they can also contain relations. Which means an attribute can have a relation as value, and that relation can have an attribute which have a relation as value and so on.
But Codd realized the data model and query language can be significantly simplified if nested relations are eliminated. He suggests a process of normalization where nested relations are converted to stand-alone relations which references the parent row through foreign keys rather than through containment.
To a modern reader, the case of nested relations might seem a rather obscure special case. Why even spend so much attention on this? But in the context Codd was writing it was a significant question.
The dominant database model in the age before the relational model was the hierarchical database model. In the hierarchical model a record can contain sets of child records – called repeating groups or table-valued attributes.
If a hierarchical database is directly converted into a relational model, these nested record groups would directly translate to relation-values attributes. But a nested structure would severely undermine the benefits of the relational model. Therefore, Codd spends some effort justifying the "normal form" and explaining the process of normalization.
The benefits to eliminating nested relations, according to Codd:
- Relations in 1NF can be presented, stored, and exchanged in the form of two-dimensional arrays. Nested relations would require more complex data structures.
- A data language would require a complex path navigation syntax to unambiguously locate a data item in nested relations. In 1NF, all values can be identified just by relation name, attribute name and primary key.
- Nested relations couple queries tightly to the structure of relationships.
The effect on the data language is quite significant. Notably Codd introduces 1NF before he introduces relational algebra, which is the theoretical foundation for data languages like SQL. And this is the reason SQL does not support nested relations!
Which is why we don't have to worry about 1NF in modern relational databases.
All the wrong explanations
The curious phenomenon is that 1NF is explained misleadingly in so many high-profile places. As an experiment I googled for "first normal form" and found a bunch of explanations and tutorials. All the results on the first two pages of results had incorrect explanations - including the first hit on Wikipedia. [Edit: The explanation on Wikipedia has now been improved.]
The misunderstandings are probably because the original papers are not easily accessible on the internet and use terminology which is not easy to understand today. This seems to have turned 1NF explanations a game of telephone where misunderstandings replicate and evolve.
The misunderstandings fall in these major groups:
- 1NF is about eliminating strings with comma-separated values.
- 1NF is about eliminating multiple columns with same type.
- 1NF is about eliminating rows with similar values.
- 1NF is about adding primary keys.
A major source of confusion is Codds use of the word "atomic" - once, in a particular context. Unfortunately, a lot of authors have latched onto this word and ignored the context.
So far, we have discussed examples of relations which are defined on simple domains - domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on.
I'm sure if Codd had know the confusion this passage has caused, he might have used clearer language. Nevertheless, the meaning is pretty clear: Nonatomic values are relations, atomic values are anything which is not a relation, and therefore are atomic with respect to the relational operators.
Comma separated strings
Many explanations (including Wikipedia) use the example of a comma-separated string with phone numbers as an example of a 1NF violation:
The telephone number column contains multiple phone numbers in a single value. For example, the first row has two telephone numbers separated by a comma. The column values are not atomic: it can be subdivided into two numbers. This violates first normal form.
|Customer ID||First Name||Surname||Telephone Number|
|456||San||Zhang||(555) 403-1659 Ext. 53; 182-929-2929|
Make no mistake, encoding multiple values in a single string is generally bad design. But it has nothing to do with first normal form. First normal form means a column should not allow relations as values. A comma-separated string is still just a single string from the perspective of the database type system.
Each individual field should hold the smallest data element possible to facilitate easy sorting and searching. For instance, the date column can be separated into day, month and year.
But a date is not a relation (you can't add new "rows" to a date!). Codds example of normalization include dates which are treated as atomic.
Another similar misunderstanding from O'Reilly:
Data within a column is broken down into the smallest meaningful element. A Name column would be divided into three columns as First Name, Middle Name, and Last Name.
Sure, it might be useful to break names into components if you e.g., need to sort by last name. (It also opens a can of worms with regards to international names.) But this is a question of business requirements, not 1NF. As long as the Name type is not a relation it is fine.
In Codds own example of normalization he even has a "name" as a single column.
What is repeating groups?
Another a source of confusion is that Codd in one place talks about repeating groups. As mentioned before, repeating groups is a term from hierarchical databases which describes a set of records nested under a parent record. Codd defines this as equivalent to a non-simple domain, i.e. a relation-valued attribute.
Unfortunately this is sometimes misunderstood as referring to multiple columns with same type.
The site EssentialSql thinks "repeated group" is about columns with similar names:
a table should not have repeating groups of columns such as Customer1Name, Customer2Name, and Customer3Name.
Such a table would probably be an example bad design or at least an example of bad naming of columns. Especially if the number of customers could change! But there is nothing wrong with multiple columns having the same type per se, as long as they have distinct names. For example a table of marriages certificates might have columns Party1 and Party2 which would be fine.
In any case, this is unrelated to the question of 1NF because each individual columns have a simple domain.
IBM, the birthplace of the relational model, has published this explanation:
Repeating attributes, often called a repeating group, are different attributes that are inherently the same. In an entity that satisfies the requirement of first normal form, each attribute is independent and unique in its meaning and its name.
Example: Assume that an entity contains the following attributes:
This situation violates the requirement of first normal form, because JANUARY_SALARY_AMOUNT, FEBRUARY_SALARY_AMOUNT, and MARCH_SALARY_AMOUNT are essentially the same attribute, EMPLOYEE_ MONTHLY_SALARY_AMOUNT.
This definition hinge on the very vague notion of "essentially the same". I don't think the above explanation make any logical sense - January salary and February salary are obviously different things.
Repeating values across rows
A different misinterpretation refers to repeating values across rows.
The dubious but very popular Tutorialspoint have this inaccurate explanation:
The next step is ensuring that there are no repeating groups of data. Consider we have the following table (...)
|100||Sachin||36||Lower West Side||Cannon XL-200|
|100||Sachin||36||Lower West Side||Battery XL-200|
|100||Sachin||36||Lower West Side||Tripod Large|
But as per the 1NF, we need to ensure that there are no repeating groups of data. So, let us break the above table into two parts and then join them using a key ...
The above table is obviously invalid because the primary key is not unique, but "repeating groups of data" is not in itself a 1NF violation. 1NF is not concerned about values only about types. (The above table would probably be a violation of third normal form though.)
Weird normalization process
Codd shows a straightforward process to eliminating nested relations: You convert nested relations into a top-level relation and then add a foreign key referring to the "parent row" where it was extracted from. (This of course requires all tables to have a primary key - having a primary key is a necessary precondition for nomalization)
But many tutorials show a much weirder process. A typical example is StudyTonight. They start with this table:
An then they "normalize" into this table:
This is weird because you would expect roll_no to be the primary key, but apparently it isn't since we now have duplicates. Had we instead followed Codd's process and just extracted the set to a separate table, we would have two tables:
Which would be much more logical, doesn't break the primary key, and is fully normalized.
Some explanations define 1NF as having a primary key. This is a rather benign misunderstanding since it will not cause anyone to make bad design decisions. But technically it is incorrect. Having a primary key is a prerequisite for normalizing to first normal form.
Quite a lot of tutorials confuse 1NF with the definition of relations.
Don't believe everything you read on the internet. (This blog excluded of course.)
Codd, E.F (1970) A Relational Model of Data for Large Shared Data Banks.
Codd, E. F. (1971) Further Normalization of the Relational Model.