5 Quick Ways To Delete Duplicate Records From Database Which Every Developer Must Know

Yogi S
1y
74.2k
0
7

Article

When you are working with Database-Driven applications, you should definitely know how to remove duplicated records from your database table. Duplicate records not only take a massive amount of space but also make the database query execution process slower. Therefore, every developer should know how to delete duplicate records.

I have decided to write this tutorial that tells 5 quick ways to delete duplicate records from a Database table. I hope you will enjoy reading and learning from it.

The Database Table which contains the duplicate records

I have a table called ‘Student’ that contains some duplicate records which need to be deleted. This table looks like:

delete duplicate records from SQL Server

It has 4 columns which are,

Id – a primary column of type ‘int’ and Identity[1,1]
Name – varchar(50) and NOT NULL
Age – int and NOT NULL
Standard – int and NOT NULL

You can create this table by running the following script in your SQL Server Database.

SET ANSI_NULLS ON    
GO    
    
SET QUOTED_IDENTIFIER ON    
GO    
    
SET ANSI_PADDING ON    
GO    
    
CREATE TABLE [dbo].[Student](    
    [Id] [int] IDENTITY(1,1) NOT NULL,    
    [Name] [varchar](50) NOT NULL,    
    [Age] [int] NOT NULL,    
    [Standard] [int] NOT NULL,    
 CONSTRAINT [PK_Student] PRIMARY KEY CLUSTERED     
(    
    [Id] ASC    
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]    
) ON [PRIMARY]    
    
GO    
    
SET ANSI_PADDING OFF    
GO

Now I add some duplicate records to this table by running the following insert scripts.

SET IDENTITY_INSERT [dbo].[Student] ON     
    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (1, N'John', 12, 5)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (2, N'Ram', 11, 4)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (3, N'Ram', 11, 4)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (4, N'Anna', 8, 3)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (5, N'Maya', 9, 4)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (6, N'Mary', 10, 5)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (7, N'Mary', 10, 5)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (8, N'Mary', 10, 6)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (9, N'Robin', 11, 5)    
GO    
INSERT [dbo].[Student] ([Id], [Name], [Age], [Standard]) VALUES (10, N'Donald', 12, 7)    
GO    
SET IDENTITY_INSERT [dbo].[Student] OFF    
GO

Now open the table's data and see the duplicate records as shown by the below image:

The duplicate records are the students with the below names:

Ram with id 2 and 3.
Mary with id 6, 7

Note that the student with name Mary and Id 5 is not a duplicate of other Mary records (id 6 and 7) because her standard is different than other Mary records.

My Database table is ready, now let us start with 5 ways to delete duplicate records from this table:

Way 1: Using Group By

SQL Group By statement groups the results by one or more columns and is used by with aggregate functions (COUNT, MAX, MIN, SUM, AVG).

So I create an SQL Query using Group By and Min statements which will return ‘Non-Duplicate’ records. This query is:

SELECT MIN(Id) AS 'Id', Name, Age, Standard     
FROM Student     
GROUP BY Name, Age, Standard

Non-Duplicate records Query SQL Server

You can see in the above image that the students with id 3 and 7 are missing since they are the duplicate records.

Now, I can use the Delete statement in my query to delete these records. The query is.

DELETE FROM STUDENT WHERE     
Id NOT IN(SELECT MIN(Id) AS 'Id' FROM Student GROUP BY Name, Age, Standard)

After running this query, the table duplicate records get deleted as shown in the below image.

Way 2: Using Distinct Statement

If you want to use Distinct Statement which returns Non-Duplicate rows then you will have to create a new table with an exact definition like your previous table. Then insert all ‘Non-Duplicate’ rows from the previous table to the new table.

I created a new table called Student1 and added all the distinct records to it from my previous table called ‘Student’ using the below query:

Select  Distinct * into Student1 from Student;

Now I can drop my ‘Student’ table and rename ‘Student1’ to ‘Student’. The below 2 queries do this work for me:

DROP Table Student  
EXEC sp_rename 'Student1', 'Student'

Way 3: Using Programming Language

You can also delete the duplicate records with the help of some programming language like C#, JavaScript, PHP, etc.

Let me take the help of Entity Framework Core which is an OR/M(Object-Relational Mapping) framework written in C# language. Entity Framework Core is widely used in ASP.NET Core database programming.

In Entity Framework Core, the code to delete the duplicate records is given below:

using (var context = new MyContext())    
{    
    var students = context.Student.GroupBy(s => new { s.Name, s.Age, s.Standard }).SelectMany(grp => grp.Skip(1)); ;    
    context.Student.RemoveRange(students);    
    context.SaveChanges();    
}

I have used GroupBy() function of Entity Framework Core to find out all the duplicate records based on 3 columns - 'Name, Age, Standard'.

Next I use the RemoveRange() method to delete these duplicate records.

Way4: Using Common Table Expressions (CTE)

Common table expression (CTE) is a temporary named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. I use CTE to with ‘PARTITION BY’ statement to delete duplicate records like this:

WITH myTemp    
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Name, Age, Standard ORDER BY Id) RN FROM Student)    
DELETE FROM myTemp WHERE RN > 1;

Notice the Row_Number() function which provides consecutive numbering of the rows in the result by the order selected in the OVER clause. It will assign the value 1 for the first row and increase the number for the subsequent rows.

I have put the 3 columns of the Student’s table in the Partition By statement. When you specify a column or set of columns with Partition By statement then it will divide the result set into record partitions and then finally ranking functions are applied to each record partition separately and the rank will restart from 1 for each record partition separately.

Way5: Using Self Join

Self Join which is an Inner Join on the same table can also be used to remove duplicate records. Here you create the self-join by making use of 2 things which are given below,

In the ‘on’ condition you check the duplicate columns.
In the ‘where’ condition you compare the ‘primary key’ columns of the tables.

So, the below Self Join query will give me the duplicate columns of the ‘Student’ table:

Select * From Student s Inner Join Student s1     
On s.Name=s1.Name And s.Age=s1.Age And s.Standard=s1.Standard     
Where s.Id>s1.Id

So all I have to do is just put a Delete statement in order to delete the duplicate records.

Delete Student where Id in(    
Select s.Id From Student s Inner Join Student s1     
On s.Name=s1.Name And s.Age=s1.Age And s.Standard=s1.Standard     
Where s.Id>s1.Id)

Conclusion

In this tutorial, you learned 5 quick ways to delete your duplicate records from a database table. You can choose any deletion method based on your liking. Don’t forget to like this tutorial and share it with your friends.

Do read my other related database tutorial on C# Corner which is How To Restore Northwind Database In SQL Server In One Minute