Database Normalization: Avoiding Redundancy in Your Data

Sandhiya Priya
1d
268
0
0

Article

Introduction to Database Normalization

Database normalization is the process of organizing data in a relational database to reduce redundancy and improve data integrity. The primary goal of normalization is to ensure that data is stored efficiently while maintaining its consistency across the database. It involves dividing large, unnormalized tables into smaller, more manageable tables and defining relationships between them.

In this article, we’ll walk through the process of database normalization, converting an unnormalized database to 3rd Normal Form (3NF), and discuss the pros and cons of normalization.

What is Database Normalization?

Normalization is the process of removing redundancy and undesirable characteristics like:

Data redundancy (storing the same data in multiple places),
Insertion anomalies (difficulty in inserting data due to redundant information),
Update anomalies (inconsistent data due to redundant updates),
Deletion anomalies (inconsistent data after deleting a record).

The process involves several stages, or "normal forms," with each subsequent form building on the previous one:

1NF (First Normal Form): Ensures that each column contains only atomic (indivisible) values, and each record is unique.
2NF (Second Normal Form): Achieved when a table is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
3NF (Third Normal Form): Achieved when a table is in 2NF and all attributes are functionally dependent only on the primary key (eliminating transitive dependencies).

Converting an Unnormalized Database to 3NF: Practical Example

Let’s start by considering an unnormalized database and walk through the normalization steps up to 3NF.

Step 1: Unnormalized Table

Imagine we have an unnormalized table storing information about students, their courses, and professors.

Student_ID	Student_Name	Course_Name	Professor_Name	Professor_Phone	Professor_Email
1	John Doe	Math 101	Dr. Smith	123-456-7890	[email protected]
2	Jane Smith	History 101	Dr. Johnson	234-567-8901	[email protected]
1	John Doe	Science 101	Dr. Clark	345-678-9012	[email protected]
3	Alice Brown	Math 101	Dr. Smith	123-456-7890	[email protected]

This table contains several issues:

Redundant Data: The professor’s details (name, phone number, email) are repeated for each student enrolled in the course.
Inconsistent Data: If Professor Smith changes their phone number or email, we'd have to update multiple rows.

Step 2: First Normal Form (1NF)

To bring this table into 1NF, we ensure that each field contains only atomic (indivisible) values. In our case, the data is already atomic, so we don't need to make any changes. However, we must ensure that there are no repeating groups.

The table is already in 1NF since each cell contains only a single value, but we need to eliminate redundancy by separating repeating groups.

Student_ID	Student_Name	Course_Name	Professor_Name	Professor_Phone	Professor_Email
1	John Doe	Math 101	Dr. Smith	123-456-7890	[email protected]
2	Jane Smith	History 101	Dr. Johnson	234-567-8901	[email protected]
1	John Doe	Science 101	Dr. Clark	345-678-9012	[email protected]
3	Alice Brown	Math 101	Dr. Smith	123-456-7890	[email protected]

Step 3: Second Normal Form (2NF)

To achieve 2NF, we need to ensure that:

The table is in 1NF.
All non-key attributes are fully functionally dependent on the primary key.

The current table violates 2NF because the non-key attributes Professor_Name, Professor_Phone, and Professor_Email depend on the course, not the combination of Student_ID and Course_Name (the composite primary key).

We need to split this into two tables:

Students_Courses Table: This will store the relationship between students and courses.
Professors Table: This will store the professors' details.

Students_Courses Table (2NF)

Student_ID	Student_Name	Course_Name	Professor_ID
1	John Doe	Math 101	101
2	Jane Smith	History 101	102
1	John Doe	Science 101	103
3	Alice Brown	Math 101	101

Professors Table (2NF)

Professor_ID	Professor_Name	Professor_Phone	Professor_Email
101	Dr. Smith	123-456-7890	[email protected]
102	Dr. Johnson	234-567-8901	[email protected]
103	Dr. Clark	345-678-9012	[email protected]

Now, we have eliminated partial dependencies. The Professor_ID in the Students_Courses table is only related to the course, and Professor_Name, Professor_Phone, and Professor_Email are stored once in the Professors table.

Step 4: Third Normal Form (3NF)

To achieve 3NF, we need to:

The table must be in 2NF.
Remove transitive dependencies (i.e., non-key attributes depending on other non-key attributes).

In the Students_Courses Table, all non-key attributes are already dependent only on the primary key (Student_ID, Course_Name). The table is in 3NF.

However, we can improve the design further by removing the Professor_Name, Professor_Phone, and Professor_Email attributes from the Professors table if the data is too volatile and often changes. We might use an additional lookup table for Professors in case there are more details.

Summary of Tables after 3NF

Students Table

Student_ID	Student_Name
1	John Doe
2	Jane Smith
3	Alice Brown

Courses Table

Course_Name	Professor_ID
Math 101	101
History 101	102
Science 101	103

Professors Table

Professor_ID	Professor_Name	Professor_Phone	Professor_Email
101	Dr. Smith	123-456-7890	[email protected]
102	Dr. Johnson	234-567-8901	[email protected]
103	Dr. Clark	345-678-9012	[email protected]

Students_Courses Table

Student_ID	Course_Name
1	Math 101
2	History 101
1	Science 101
3	Math 101

Pros and Cons of Normalization

Pros of Normalization

Reduces Data Redundancy: By splitting data into smaller tables, normalization reduces the repetition of data. This makes the database more space-efficient.
Improved Data Integrity: It reduces the chance of data anomalies such as update, insert, and delete anomalies.
Simplified Database Structure: The schema is more modular, which makes it easier to maintain and extend.
Prevents Inconsistencies: Changes in one part of the database (e.g., professor details) automatically reflect everywhere it’s used, reducing the likelihood of inconsistencies.

Cons of Normalization

Complex Queries: With normalized databases, you may need to join multiple tables to retrieve the required data. This can make queries more complex and potentially slower.
Performance Issues: If a normalized database requires many joins to fetch data, this can lead to slower query performance, especially for large datasets.
Overhead of Multiple Tables: Having many tables may increase the complexity of the database schema and can make understanding the database more difficult for developers and administrators.

Conclusion

Database normalization is an essential concept for ensuring data integrity and reducing redundancy. By dividing large tables into smaller, related tables, we not only make the database more efficient but also ensure that data remains consistent and reliable.

In our example, we walked through the process of normalizing an unnormalized table to 3NF, resulting in more efficient and manageable data storage.

While normalization offers significant advantages in terms of data integrity and storage efficiency, it comes with trade-offs in query complexity and performance. As with most design decisions, it’s essential to balance the benefits of normalization with the specific requirements of your application and database.