Handling Duplication In SQL

Article

Duplicate data is a very big and serious problem for developers as well as the system. It occupies extra space and decreases the response of the query. To overcome this problem we have two methodologies. The first is to select distinct data from the table but it is a temporary solution, not a permanent solution. This approach is not efficient and also creates a headache for developers to select always distinct data.

Example

Figure 1: Table 1

Select
distinct *
from
Employee_Detail

The output will be a distinct set of data as in the following:

Figure 2: Table 2

But as I explained previously, that is not a good approach. So now we will try the second methodology in which we delete (remove) all duplicate data from the table. This approach is best because it removes duplicate data permanently from the table. We aren't required to always select the distinct data from the table.

There are various methods for removing duplicate data that should be used depending upon the data of the table and our requirements. Let us understand some important methods of removing the data from the table.

Method 1: Use of Temp Table

In this method we use a temp table. In this method the following procedure will be used.

Select the distinct data from the original table and insert it into the temp table.
Delete data from the original table
Insert the distinct data from the temp table into the original table and remove the temp table

Example

Figure 3: Table 3

Query

/* Insert Data into temp Table */
Select
Distinct * into #Temp_tab
from
Employee_Detail
/* Delete Data from Table */
Delete from
Employee_Detail
/* Insert data from Temp table into Employee_Detail Table */
insert into Employee_Detail
select
*
from
#Temp_tab
/* Drop Temp Table */
Drop
Table #Temp_tab
select
*
from
Employee_Detail

Output

Figure 4: output

There are some problems with this method. It requires a temp table and insertion and deletion of data many timews and takes more time and the complexity is very high.

Method 2

This method is useful when a table contains at least one unique field. Let us see an example for a better understanding of this approach.

Figure 5: Table 4

Syntax

delete T1
from
MyTable T1,
MyTable T2
where
T1.dupField = T2.dupField
and T1.uniqueField > T2.uniqueField

Query

Delete Table1
from
Employee_Detail Table1,
Employee_Detail Table2
Where
Table1.Emp_name = Table2.Emp_Name
and Table1.Emp_City = Table2.Emp_City
and Table1.Emp_Age = Table2.Emp_Age
and Table1.Emp_Salary = Table2.Emp_Salary
and Table1.Emp_Id > Table2.Emp_Id

Output

Figure 6: output

Method 3: Using Identity Column

This method is useful when a table doesn't contain a unique field. In this method the following procedure will be used:

Create a identity column in table.
Use method 2 for deletion
Remove identity column from table

Example

Figure 7: Table 5

Query

This query is actually done in two parts. In other words this query is a set of two querires.

In the first query we create an Identity Column in the table.

/* Create a Identity Column */
Alter table
Employee_Detail
Add
Emp_Id int identity(1, 1)

Output

After successfully execution of this query we execute the second query as in the following:

/* Delete Data */
Delete Table1
from
Employee_Detail Table1,
Employee_Detail Table2
Where
Table1.Emp_name = Table2.Emp_Name
and Table1.Emp_City = Table2.Emp_City
and Table1.Emp_Age = Table2.Emp_Age
and Table1.Emp_Salary = Table2.Emp_Salary
and Table1.Emp_Id > Table2.Emp_Id
/* Remove Identity Column */
Alter table
Employee_Detail
Drop
column Emp_Id
Select
*
from
Employee_Detail

Output

Figure 8: output

Method 4: Using ROW_NUMBER()

In this method we use a Row_Number. We generate a row number with respect to each set of duplicate values then one the behalf of these row numbers we do a deletion operation. We use a Common Table Expression (CTE) in this query.

Figure 9: Table 6

Query

1. First generate a Row_Number with respect to each set of the same values.

WITH TempEmp (
Emp_Name, Emp_Age, Emp_Salary, Emp_City,
Dupli_Col
) AS (
SELECT
*,
ROW_NUMBER() OVER(
PARTITION by Emp_Name,
Emp_Age,
Emp_Salary,
Emp_City
ORDER BY
Emp_Name
) AS Dupli_Col
FROM
Employee_Detail
)
Select
*
FROM
TempEmp

Figure 10: Table 7

Now the table will look as in the following.

2. Now do a delete operation on the table.

Delete from
TempEmp
where
Dupli_Col > 1
select
*
from
Employee_Detail

The complete query of this method is:

WITH TempEmp ( Emp_Name, Emp_Age, Emp_Salary, Emp_City, Dupli_Col )
AS
( SELECT *, ROW_NUMBER() OVER( PARTITION by Emp_Name, Emp_Age, Emp_Salary, Emp_City ORDER BY Emp_Name ) AS Dupli_Col
FROM
Employee_Detail
)

Delete from TempEmp
where
Dupli_Col > 1

select * from Employee_Detail

Output

Figure 11: Table 8

Method 5: Using Sub query

In this method we use ranking. In this method we add an Identity column then we find a rank for each row. If the rank for a row is greater than one, in other words it is a duplicate row, then we will delete this row from the table.

Example

Figure 12: Table 9

Query

This query is also divided into two parts like method 3. In the first query we create an Identity Column in the table.

/* Create A Indetity Column */
Alter Table
Employee_Detail
ADD
Del int identity(1, 1)

After successfully execution of this query we execute second query that is following .

/* Delete Data from table */
delete from
Employee_Detail
where
Del in (
select
Del
from
(
select
*,
RANK() OVER(
PARTITION by Emp_Name,
Emp_Age,
Emp_Salary,
Emp_City
ORDER BY
Del Desc
) rank
From
Employee_Detail
) Tbl
where
rank > 1
)
/* Remove Identity Column */
alter table
Employee_Detail
drop
column Del

Output

Figure 13: Output

So we can see that there are many methods for handling duplicate data in SQL. Which method to use depends upon conditions and requirements.