Handling Duplication In SQL

In this article we will learn how to handle the duplication.

Duplicate data is a very big and serious problem for developers as well as the system. It occupies extra space and decreases the response of the query. To overcome this problem we have two methodologies. The first is to select distinct data from the table but it is a temporary solution, not a permanent solution. This approach is not efficient and also creates a headache for developers to select always distinct data.

Example



Figure 1: Table 1

  1. Select  
  2. distinct *  
  3. from  
  4. Employee_Detail  
The output will be a distinct set of data as in the following:


Figure 2: Table 2

But as I explained previously, that is not a good approach. So now we will try the second methodology in which we delete (remove) all duplicate data from the table. This approach is best because it removes duplicate data permanently from the table. We aren't required to always select the distinct data from the table.

There are various methods for removing duplicate data that should be used depending upon the data of the table and our requirements. Let us understand some important methods of removing the data from the table.

Method 1: Use of Temp Table

In this method we use a temp table. In this method the following procedure will be used. 
  1. Select the distinct data from the original table and insert it into the temp table.
  2. Delete data from the original table
  3. Insert the distinct data from the temp table into the original table and remove the temp table

Example



Figure 3: Table 3

Query

  1. /* Insert Data into temp Table */  
  2. Select  
  3. Distinct * into #Temp_tab  
  4. from  
  5. Employee_Detail  
  6. /* Delete Data from Table */  
  7. Delete from  
  8. Employee_Detail  
  9. /* Insert data from Temp table into Employee_Detail Table */  
  10. insert into Employee_Detail  
  11. select  
  12. *  
  13. from  
  14. #Temp_tab  
  15. /* Drop Temp Table */  
  16. Drop  
  17. Table #Temp_tab  
  18. select  
  19. *  
  20. from  
  21. Employee_Detail  
Output



Figure 4: output

There are some problems with this method. It requires a temp table and insertion and deletion of data many timews and takes more time and the complexity is very high.

Method 2

This method is useful when a table contains at least one unique field. Let us see an example for a better understanding of this approach.


Figure 5: Table 4

Syntax
  1. delete T1  
  2. from  
  3. MyTable T1,  
  4. MyTable T2  
  5. where  
  6. T1.dupField = T2.dupField  
  7. and T1.uniqueField > T2.uniqueField  
Query
  1. Delete Table1  
  2. from  
  3. Employee_Detail Table1,  
  4. Employee_Detail Table2  
  5. Where  
  6. Table1.Emp_name = Table2.Emp_Name  
  7. and Table1.Emp_City = Table2.Emp_City  
  8. and Table1.Emp_Age = Table2.Emp_Age  
  9. and Table1.Emp_Salary = Table2.Emp_Salary  
  10. and Table1.Emp_Id > Table2.Emp_Id  
Output



Figure 6: output 

Method 3: Using Identity Column

This method is useful when a table doesn't contain a unique field. In this method the following procedure will be used:
  1. Create a identity column in table.
  2. Use method 2 for deletion
  3. Remove identity column from table


Example



Figure 7: Table 5

Query

This query is actually done in two parts. In other words this query is a set of two querires.
In the first query we create an Identity Column in the table.
  1. /* Create a Identity Column */
  2. Alter table
  3. Employee_Detail
  4. Add
  5. Emp_Id int identity(1, 1)

Output

 
After successfully execution of this query we execute the second query as in the following:
  1. /* Delete Data */  
  2. Delete Table1  
  3. from  
  4. Employee_Detail Table1,  
  5. Employee_Detail Table2  
  6. Where  
  7. Table1.Emp_name = Table2.Emp_Name  
  8. and Table1.Emp_City = Table2.Emp_City  
  9. and Table1.Emp_Age = Table2.Emp_Age  
  10. and Table1.Emp_Salary = Table2.Emp_Salary  
  11. and Table1.Emp_Id > Table2.Emp_Id  
  12. /* Remove Identity Column */  
  13. Alter table  
  14. Employee_Detail  
  15. Drop  
  16. column Emp_Id  
  17. Select  
  18. *  
  19. from  
  20. Employee_Detail  
Output



Figure 8: output 

Method 4: Using ROW_NUMBER()

In this method we use a Row_Number. We generate a row number with respect to each set of duplicate values then one the behalf of these row numbers we do a deletion operation. We use a Common Table Expression (CTE) in this query.



Figure 9: Table 6

Query

1. First generate a Row_Number with respect to each set of the same values.
  1. WITH TempEmp (  
  2. Emp_Name, Emp_Age, Emp_Salary, Emp_City,  
  3. Dupli_Col  
  4. AS (  
  5. SELECT  
  6. *,  
  7. ROW_NUMBER() OVER(  
  8. PARTITION by Emp_Name,  
  9. Emp_Age,  
  10. Emp_Salary,  
  11. Emp_City  
  12. ORDER BY  
  13. Emp_Name  
  14. AS Dupli_Col  
  15. FROM  
  16. Employee_Detail  
  17. )  
  18. Select  
  19. *  
  20. FROM  
  21. TempEmp  


 Figure 10: Table 7

Now the table will look as in the following.

2. Now do a delete operation on the table.
  1. Delete from  
  2. TempEmp  
  3. where  
  4. Dupli_Col > 1  
  5. select  
  6. *  
  7. from  
  8. Employee_Detail 
 
 

The complete query of this method is:

WITH TempEmp ( Emp_Name, Emp_Age, Emp_Salary, Emp_City, Dupli_Col )
AS
( SELECT *, ROW_NUMBER() OVER( PARTITION by Emp_Name, Emp_Age, Emp_Salary, Emp_City ORDER BY Emp_Name ) AS Dupli_Col
FROM
Employee_Detail
)

Delete from TempEmp
where
Dupli_Col > 1

select * from Employee_Detail 

Output


Figure 11: Table 8

Method 5: Using Sub query

In this method we use ranking. In this method we add an Identity column then we find a rank for each row. If the rank for a row is greater than one, in other words it is a duplicate row, then we will delete this row from the table.

Example



Figure 12: Table 9

Query

This query is also divided into two parts like method 3. In the first query we create an Identity Column in the table.
  1. /* Create A Indetity Column */  
  2. Alter Table  
  3. Employee_Detail  
  4. ADD  
  5. Del int identity(1, 1) 


            
 
After successfully execution of this query we execute second query that is following .
 
  1. /* Delete Data from table */  
  2. delete from  
  3. Employee_Detail  
  4. where  
  5. Del in (  
  6. select  
  7. Del  
  8. from  
  9. (  
  10. select  
  11. *,  
  12. RANK() OVER(  
  13. PARTITION by Emp_Name,  
  14. Emp_Age,  
  15. Emp_Salary,  
  16. Emp_City  
  17. ORDER BY  
  18. Del Desc  
  19. ) rank  
  20. From  
  21. Employee_Detail  
  22. ) Tbl  
  23. where  
  24. rank > 1  
  25. )  
  26. /* Remove Identity Column */  
  27. alter table  
  28. Employee_Detail  
  29. drop  
  30. column Del  
Output


Figure 13: Output

So we can see that there are many methods for handling duplicate data in SQL. Which method to use depends upon conditions and requirements.