Introduction to database normalization: the first three normal forms

The goal of a relational database normalization is to achieve and improve data integrity and avoid data redundancy so to avoid possible insertion, updation or deletion anomalies. A relational database is normalized by applying a series of rules called normal forms. In this article we will discuss the first three normal forms.

In this tutorial you will learn:

What is the first normal form
What is the second normal form
What is the third normal form

Software requirements and conventions used

Software Requirements and Linux Command Line Conventions
Category	Requirements, Conventions or Software Version Used
System	Distribution independent
Software	No specific software needed
Other	None
Conventions	# – requires given linux-commands to be executed with root privileges either directly as a root user or by use of `sudo` command $ – requires given linux-commands to be executed as a regular non-privileged user

The first normal form

Suppose we have the following table we use to store information about some movies:

+----+--------------------+--------------------+------+
| id | name               | genre              | year |
+----+--------------------+--------------------+------+
|  1 | The Exorcist       | Horror             | 1973 |
|  2 | The Usual Suspects | Thriller, Neo-noir | 1995 |
|  3 | Star Wars          | Space-opera        | 1977 |
+----+--------------------+--------------------+------+

The table above, doesn’t satisfy the first normal form, why? For the first normal form to be satisfied, each column of a table must contain atomic (indivisible) data. In the second row of our table, which contains information about the “The Usual Suspects” movie, we can see that the genre column contains data which is not atomic. Two genres are actually listed: Thriller and Neo-noir. Let’s say in our representation we want to allow one movie to be associated with more than one genre; how do we solve the problem?

The first thing that comes to mind may be to add a new row in the same table, repeating the information about the movie, and just specify one genre per raw. This idea is quite horrible, since we would have a lot of redundant data (we should repeat the same movie information each time we want to associate it with a new genre!).

Another slightly better solution, would be to add a new column, so to have, for example, a genre1 and genre2 columns. This however would, among the other things, represent a limit: what if a movie should be listed under more than two genres?

A smarter way to solve this problem is to create a new table used to store genres information. Here is the “genre” table:

+----+-------------+
| id | name        |
+----+-------------+
|  1 | Horror      |
|  2 | Neo-noir    |
|  3 | Space-opera |
|  4 | Thriller    |
+----+-------------+

Now, since the one between genre and movie is a many to many relationship (a movie can be related to several genres, and a genre can be related to many different movies), to express it without data redundancy, we can use a so
called junction table:

+----------+----------+
| movie_id | genre_id |
+----------+----------+
|        1 |        1 |
|        2 |        2 |
|        2 |        4 |
|        3 |        3 |
+----------+----------+

Our junction table has the only task to express the many-to-many relationship between the two tables or entities movie and genre. It is composed by only two columns: movie_id and genre_id. The movie_id column has a foreign key constraint to the id column of the movie table, and the genre_id has a foreign key constraint to the id column of the genre table. The two columns together are used as a composite primary key, so the relationship between a movie and a genre can be expressed only once. At this point, we can remove the “genre” column from the “movie” table:

+----+--------------------+------+
| id | name               | year |
+----+--------------------+------+
|  1 | The Exorcist       | 1973 |
|  2 | The Usual Suspects | 1995 |
|  3 | Star Wars          | 1977 |
+----+--------------------+------+

The table is now in first normal form.

The second normal form

The first normal form is a prerequisite for the second: for the second normal form to be satisfied, the data must already be in first normal form and there should not be any partial dependency of secondary attributes from a subset of any candidate key.

What is a partial dependency? Let’s start by saying that in a table there could be more than one candidate key. A candidate key is one column, or a set of columns that together can be identified as unique in a table: only one of the
candidate keys, will than be chosen as the table primary key, which uniquely identifies each row.

The attributes that are part of candidate keys are defined as prime, while all the others are called secondary. For a relation to be in second normal form, there should not be any secondary attribute which is dependent on a subset
of a candidate key.

Let’s see an example. Suppose we have a table we use to store data about soccer players and their scores for each gameday for a fantasy football application, something like this:

+-----------+------------+-----------+------------+---------+-------+
| player_id | first_name | last_name | role       | gameday | score |
+-----------+------------+-----------+------------+---------+-------+
|       111 | Cordaz     | Alex      | Goalkeeper |      18 |  6.50 |
|       117 | Donnarumma | Gianluigi | Goalkeeper |      18 |  7.50 |
|       124 | Handanovic | Samir     | Goalkeeper |      18 |  7.50 |
+-----------+------------+-----------+------------+---------+-------+

Let’s take a look at this table. First of all we can see that it satisfies the first normal form, since the data in each column is atomic. The data contained in the player_id column could be used to uniquely identify a player, but
can it be used as primary key for the table? The answer is no, because a row for each player will exist for every gameday! Here we could use a composite primary key instead, made by the combination of the player_id and gameday columns, since one and only one entry can exist for that player for each gameday.

Does this table satisfies the second normal form? The answer is no, let’s see why. We previously said that each attribute that is not part of any candidate keys is called secondary and for the table to satisfy the second normal
form it must not be dependent on a subset of any candidate key, but it must depend on the candidate key as a whole.

Let’s take the role attribute, for example. It is a secondary attribute, since it is not part of any candidate key. We can say that it is functionally dependent on player_id, since if the player changes, also the associate role potentially can change; however, it is not dependent on gameday, which is the other component of the composite primary key, since even if the gameday changes the role of the player remains the same. We can say that role is functionally dependent on a subset of the composite primary key, therefore the second normal form is not satisfied.

To solve the problem we can create a separate table used to exclusively describe each player:

+-----------+------------+-----------+------------+
| player_id | first_name | last_name | role       |
+-----------+------------+-----------+------------+
|       111 | Cordaz     | Alex      | Goalkeeper |
|       117 | Donnarumma | Gianluigi | Goalkeeper |
|       124 | Handanovic | Samir     | Goalkeeper |
+-----------+------------+-----------+------------+

We can now remove those information from the score table, and make it look this way:

+-----------+---------+-------+
| player_id | gameday | score |
+-----------+---------+-------+
|       111 |      18 |  6.50 |
|       117 |      18 |  7.50 |
|       124 |      18 |  7.50 |
+-----------+---------+-------+

The second normal form is now satisfied.

The third normal form

The second normal form is a pre-requisite for the third normal form. To be in third normal form, a table must already be in second normal form, and must not contain attributes that are transitively dependent on the table primary key. What does it mean? We can say we have a transitive dependency when a secondary attribute doesn’t depend directly on the table primary key, but it has a dependency on another secondary attribute. Suppose we add two new columns to the player table above, so it does look like this:

+-----------+------------+-----------+------------+---------+-----------+
| player_id | first_name | last_name | role       | club    | club_city |
+-----------+------------+-----------+------------+---------+-----------+
|       111 | Cordaz     | Alex      | Goalkeeper | Crotone | Crotone   |
|       117 | Donnarumma | Gianluigi | Goalkeeper | Milan   | Milano    |
|       124 | Handanovic | Samir     | Goalkeeper | Inter   | Milano    |
+-----------+------------+-----------+------------+---------+-----------+

We added the club and club_city columns to the table to specify, respectively, the club associated with a player, and the city that club belongs to. Unfortunately the table now doesn’t satisfy the third normal form, why? It is quite simple: the club_city attribute doesn’t directly depend on player_id, which is the table primary key, but it has a transitive dependency on it, via another secondary attribute: club.

How to solve the problem so that the third normal form is satisfied? All we have to do is to create another table, where to record information about each club. Here is the “club” table:

+-----------+-----------+
| club_name | club_city |
+-----------+-----------+
| Crotone   | Crotone   |
| Milan     | Milano    |
| Inter     | Milano    |
+-----------+-----------+

We isolated club information in a dedicated table. As a primary key for the table, in this case, we used the club_name column. In the player table we can now remove club_city column, and add a foreign key constraint to the club column so that it references the club_name column in the club table:

+-----------+------------+-----------+------------+---------+
| player_id | first_name | last_name | role       | club    |
+-----------+------------+-----------+------------+---------+
|       111 | Cordaz     | Alex      | Goalkeeper | Crotone |
|       117 | Donnarumma | Gianluigi | Goalkeeper | Milan   |
|       124 | Handanovic | Samir     | Goalkeeper | Inter   |
+-----------+------------+-----------+------------+---------+

The third normal form is now satisfied.

Conclusions

In this tutorial we talked about the first three normal forms of a relational database and how they are used to reduce data redundancy and avoid insertion, deletion and updation anomalies. We saw what are the prerequisites of each normal form, some examples of their violations, and how to fix them. Other normal forms exist past the third, however, in the most common applications, reaching the third normal form is enough to achieve an optimal setup.

Software requirements and conventions used

The first normal form

The second normal form

The third normal form

Conclusions

Related Linux Tutorials: