
Denormalization in the database

Introduction

We saw how to design a database and how to normalize its tables. The main purpose of a clean design and of normalization is to reduce redundancy and keep the data in the database consistent. So we have multiple tables in the database, related to one another through referential integrity, and when we need related data we join the relevant tables to get the records. This works well and quickly as long as the database is small and holds relatively few records.

In the real world, however, databases are very large and contain many records. While normalizing the tables we may not think about the volume of data; we focus only on having a perfect design with minimal redundancy. But what happens when the data grows? Because of normalization, even the smallest request has to pull data from multiple tables, and the cost of such a query increases dramatically, since several tables must be joined to get the data. These tables are not small, either: they contain huge amounts of data, and every tiny query has to traverse a table until it finds the record (although this depends on the file organization method).

In our STUDENT database, for example, we moved the address into a separate ADDRESS table with columns such as door number, street, city, state, and ZIP code. Whenever we need to display a student's address on a report, we have to join with that ADDRESS table: the query pulls the student's details from the STUDENT table, and once it has those records it fetches the address from the ADDRESS table. Two lookups, and therefore two disk I/Os, take place for a single retrieval, so the performance cost is higher.
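As a sketch, assuming STUDENT is keyed by STD_ID and ADDRESS references it (all column names here are illustrative), the two-step retrieval can equivalently be written as one join query:

SELECT s.STD_ID, s.NAME, a.DOOR_NO, a.STREET, a.CITY, a.STATE, a.ZIP FROM STUDENT s, ADDRESS a WHERE s.STD_ID = a.STD_ID;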

Instead, imagine what happens when we keep the address in the STUDENT table itself. The second lookup above is no longer required: the first retrieval already gives the student's details along with the address. This saves the time of the second lookup and speeds up retrieval. But what happens here to redundancy and 3NF?
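Before answering that, here is what the denormalized retrieval would look like, with the same illustrative column names as above:

SELECT s.STD_ID, s.NAME, s.DOOR_NO, s.STREET, s.CITY, s.STATE, s.ZIP FROM STUDENT s;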

As a rule of thumb of database design, we should normalize the data so that there is no redundancy. But this normalization increases the cost of retrieval. When good design and database performance are weighed against each other, performance wins: anyone who accesses the database is more interested in quick and accurate results than in how the database is designed. So if introducing redundancy into a table increases query performance, we can ignore 3NF.

This process is known as denormalization. In it, normalized tables are combined again, reintroducing redundancy into a table in order to improve query performance. Denormalization does not apply in all cases; it all depends on the data. Hence this task is performed after the tables have been designed and the data inserted into them. It also depends on which redundant column we are adding back to a table and how frequently that column is updated. The basic criteria for denormalization are:

  • It should reduce the frequency of joins between the tables and thus speed up the query. If two or more tables are joined frequently to query the data and that join cost is high, we can combine them into one table. After combining the tables, however, the data should still be correct; there should be no unwanted/unnecessary duplicate records. In our example above, after denormalizing STUDENT and ADDRESS, every student should still have the correct address; it should not result in a wrong address for any student.
  • In most cases, joining tables involves a full table scan to retrieve the data. So when the tables are huge, we can consider denormalization.
  • The column should not be updated frequently. If a column is updated often, the update cost rises even though the retrieval cost drops, and the database would be constantly busy updating; if it is updated rarely, the database can bear the cost. In our case above, the address is updated rarely (the frequency with which a student changes homes is comparatively low). Also, the column being added back to the table should be small: adding huge columns back again hurts performance.
  • The developer should have very good knowledge of the data when denormalizing it. He should be thoroughly familiar with the factors above: the frequency of joins/hits, the update frequency, the column and table sizes, and so on.

Denormalization is not just recombining columns to create redundant data. Denormalization can be any technique that improves the performance of normalized tables.

A few of the denormalization methods are discussed below.

  1. Add redundant columns
  2. Add derived columns
  3. Collapsing the tables
  4. Snapshots
  5. VARRAYS
  6. Materialized views

Add redundant columns

In this method, only the redundant column that is frequently used in joins is added to the main table. The other table remains unchanged.

For example, consider the EMPLOYEE and DEPT tables. Suppose we need to create a report showing the employee details and their department name. Here we have to join EMPLOYEE with DEPT to get the department name.

SELECT e.EMP_ID, e.EMP_NAME, e.ADDRESS, d.DEPT_NAME FROM EMPLOYEE e, DEPT d WHERE e.DEPT_ID = d.DEPT_ID;

However, joining the huge EMPLOYEE and DEPT tables affects the performance of the query, and we cannot simply merge DEPT into EMPLOYEE: we still need a separate DEPT table holding many details besides the ID and name. In this case, we can add a redundant DEPT_NAME column to EMPLOYEE to avoid the join with DEPT and thus increase performance.
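A minimal sketch of introducing and populating the redundant column (the VARCHAR2 size is an assumption):

ALTER TABLE EMPLOYEE ADD DEPT_NAME VARCHAR2(50);
UPDATE EMPLOYEE e SET e.DEPT_NAME = (SELECT d.DEPT_NAME FROM DEPT d WHERE d.DEPT_ID = e.DEPT_ID);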


SELECT e.EMP_ID, e.EMP_NAME, e.ADDRESS, e.DEPT_NAME FROM EMPLOYEE e;
Now we no longer need to join with DEPT to get the department name. However, the DEPT_NAME data is now stored redundantly.

Add derived columns

Suppose we have a STUDENT table with student details such as ID, name, address, and course, and another table MARKS holding their marks in different subjects. We need to prepare a report for each student showing their details, total marks, and grade. In this case, we query the STUDENT table, join it with the MARKS table to calculate the total marks across subjects, and derive the grade from that total in the same select query. The result is then printed on the report.

SELECT std.STD_ID, std.NAME, std.ADDRESS, t.TOTAL,
       CASE WHEN t.TOTAL >= 80 THEN 'A'
            WHEN t.TOTAL >= 60 AND t.TOTAL < 80 THEN 'B'
            ELSE 'C'
       END AS GRADE
FROM STUDENT std,
     (SELECT STD_ID, SUM(MARK) AS TOTAL FROM MARKS GROUP BY STD_ID) t
WHERE std.STD_ID = t.STD_ID;
The above query runs against each student's records to calculate the total and the grade. Imagine how many students there are and how often this query has to pull the data and do the calculations. Instead, what if we store the total and the grade in the STUDENT table itself? That removes both the join and the calculation at retrieval time. Once all the marks have been inserted into the MARKS table, we can calculate the total and grade for each student and update those columns in STUDENT (we can use a trigger on MARKS to update STUDENT as soon as the marks are inserted, as sketched after the next query). Now, when we need to generate the report, we simply issue a SELECT on the STUDENT table and print the result.
SELECT std.STD_ID, std.NAME, std.ADDRESS, std.TOTAL, std.GRADE FROM STUDENT std;
This makes the query simpler and faster.
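The trigger mentioned above might look like the following; a minimal Oracle-style sketch, assuming MARKS has columns STD_ID and MARK, STUDENT has TOTAL and GRADE columns, and the grade thresholds from the earlier query (the trigger name is illustrative):

CREATE OR REPLACE TRIGGER trg_marks_total
AFTER INSERT ON MARKS
FOR EACH ROW
BEGIN
  -- Fold the newly inserted mark into the running total and re-derive the grade.
  UPDATE STUDENT std
     SET std.TOTAL = NVL(std.TOTAL, 0) + :NEW.MARK,
         std.GRADE = CASE WHEN NVL(std.TOTAL, 0) + :NEW.MARK >= 80 THEN 'A'
                          WHEN NVL(std.TOTAL, 0) + :NEW.MARK >= 60 THEN 'B'
                          ELSE 'C'
                     END
   WHERE std.STD_ID = :NEW.STD_ID;
END;
/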

Collapsing the tables

We have already seen this method in the examples above. It combines frequently joined tables into a single table to reduce the joins between them, which increases the performance of the retrieval query. Merging the tables can introduce redundancy into the combined table, but this is ignored as long as it does not affect the meaning of the other records in the table.

For example, after denormalizing STUDENT and ADDRESS, every student should still have the correct address; the merge should not result in a wrong address for any student.
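One way to collapse the two tables is to materialize the join into a new table; a sketch using the illustrative column names from earlier (STUDENT_ADDRESS is a hypothetical name):

CREATE TABLE STUDENT_ADDRESS AS
  SELECT s.STD_ID, s.NAME, a.DOOR_NO, a.STREET, a.CITY, a.STATE, a.ZIP
  FROM STUDENT s, ADDRESS a
  WHERE s.STD_ID = a.STD_ID;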

In addition to collapsing tables, we can duplicate a table or even split it if that improves query performance. However, duplication and splitting are not denormalization methods.

Snapshots

This is one of the earliest ways of creating data redundancy. With this method, database tables are duplicated and stored on different database servers, and the copies are refreshed at specific intervals to keep the tables consistent across servers. Users in different locations can then access the server nearest to them and fetch the data quickly, without touching tables on remote servers. This makes access faster.
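Oracle, for example, supported this directly; a sketch using its legacy CREATE SNAPSHOT syntax, assuming a database link named remote_db pointing to the master server:

CREATE SNAPSHOT emp_snapshot
  REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1   -- re-copy once a day
  AS SELECT * FROM EMPLOYEE@remote_db;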

VARRAYS

This method creates tables with VARRAY (varying array) columns that store a repeating group of values in a single row. The VARRAY approach violates the condition of 1NF: according to 1NF each column value should be atomic, but here multiple values of the same kind are stored in one record.

Consider the example of STUDENT and MARKS, and suppose the MARKS table holds marks in 3 subjects for each student. After applying 1NF, the MARKS table has the following structure.
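A sketch of that structure, with illustrative column names and sample values, one row per student per subject:

STD_ID | SUBJECT   | MARK
-------+-----------+-----
100    | Maths     | 70
100    | Physics   | 80
100    | Chemistry | 75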

To see all the marks of a particular student here, the MARKS table has to be accessed three times. However, if we use a VARRAY, the table changes as follows.
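A minimal Oracle-style sketch of the VARRAY version (the type name, column names, and sample values are illustrative):

CREATE TYPE marks_list AS VARRAY(3) OF NUMBER;
/
CREATE TABLE MARKS (
  STD_ID NUMBER,
  MARKS  marks_list   -- all three subject marks stored in one row
);

-- One row now holds a student's marks for all subjects:
INSERT INTO MARKS VALUES (100, marks_list(70, 80, 75));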

Now we can fetch all of a student's marks in a single access. This reduces the time it takes to get each student's marks.

Materialized views

Materialized views are similar to tables: all the columns and derived values are precomputed and stored. If a query matches the query that defines a materialized view, it is answered from the materialized view instead. Since the view already contains all the columns resulting from the join as well as the precalculated values, nothing has to be recalculated, which reduces the time required for the query.

Consider the same total and grade calculation example above.

SELECT std.STD_ID, std.NAME, std.ADDRESS, t.TOTAL,
       CASE WHEN t.TOTAL >= 80 THEN 'A'
            WHEN t.TOTAL >= 60 AND t.TOTAL < 80 THEN 'B'
            ELSE 'C'
       END AS GRADE
FROM STUDENT std,
     (SELECT STD_ID, SUM(MARK) AS TOTAL FROM MARKS GROUP BY STD_ID) t
WHERE std.STD_ID = t.STD_ID;
What if we create a materialized view for the query above? It helps a lot: there is no need to update the STUDENT table with the total and grade every time we insert marks. Once all the marks are in place, creating a materialized view stores all the data needed for the report. So when we need to generate the report, we query this materialized view just as we would query the STUDENT table.
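A sketch in Oracle syntax (the view name is illustrative):

CREATE MATERIALIZED VIEW student_report_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  AS SELECT std.STD_ID, std.NAME, std.ADDRESS, t.TOTAL,
            CASE WHEN t.TOTAL >= 80 THEN 'A'
                 WHEN t.TOTAL >= 60 AND t.TOTAL < 80 THEN 'B'
                 ELSE 'C'
            END AS GRADE
     FROM STUDENT std,
          (SELECT STD_ID, SUM(MARK) AS TOTAL FROM MARKS GROUP BY STD_ID) t
     WHERE std.STD_ID = t.STD_ID;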

The only problem with a materialized view is that, like other views, it does not update itself when the underlying table data changes. We need to refresh it explicitly to get correct data from it.
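In Oracle, for instance, that explicit refresh can be done through the DBMS_MVIEW package, using the view name assumed above:

EXEC DBMS_MVIEW.REFRESH('student_report_mv');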

Benefits of denormalization

  • Minimizes the number of joins between tables
  • It reduces the number of foreign keys and indexes. This saves storage space and reduces the time spent on data manipulation.
  • When aggregated columns are used to denormalize, the calculations are performed at data-manipulation time, not at retrieval time. That is, if we use total marks as the denormalized column, the total is calculated and updated when the related data (for example, the student's details and marks) is inserted. So when we query the STUDENT table for a student's details and grade, we do not need to calculate the total. This saves retrieval time.
  • It reduces the number of tables in the database. As the number of tables grows, the number of joins grows, the storage space grows, and so on.

Disadvantages of denormalization

  • Although it supports faster retrieval, it makes data manipulation slower. If the column is updated frequently, updates become slow.
  • If the requirement changes, we need to reanalyze the data and the tables to understand the performance. Denormalization is therefore specific to the requirement or application at hand.
  • The complexity of the code and the number of tables depend on the requirement/application; denormalization can enlarge or shrink the tables. There is a chance that the code becomes more complex because of the redundancy in the tables. Hence a thorough analysis of the requirements, queries, data, and so on is required.