Efficient SQL Query to Remove Duplicates with ROW_NUMBER

Using ROW_NUMBER() and PARTITION BY (Preferred Approach)

The ROW_NUMBER() function assigns a unique row number to each record within a partition (group). We can use this to identify and delete duplicates while keeping only the required data.

Query Syntax

WITH CTE AS (
    SELECT 
        *, 
        ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS row_num
    FROM table_name
    WHERE condition  -- Apply filtering condition here
) 
DELETE FROM CTE 
WHERE row_num > 1;

Example

Consider a Customer table with duplicate entries based on Email.

ID Name Email City
1 John [email protected] NY
2 Jane [email protected] LA
3 John [email protected] NY
4 Sam [email protected] TX

Removing Duplicates While Keeping the First Entry.

;WITH CTE AS (
    SELECT ID 
    FROM (
        SELECT ID, 
               ROW_NUMBER() OVER (PARTITION BY NAME, Email, City ORDER BY ID) AS RN  
        FROM Customers 
        WHERE City = 'NY'  -- Only NY state filtering condition
    ) AS sub
    WHERE RN > 1
)
DELETE FROM Customers 
WHERE ID IN (SELECT ID FROM CTE);

Explanation of the Query

  • Identifies Duplicates
    • ROW_NUMBER() OVER (PARTITION BY Name, Email, City ORDER BY ID)
    • Assign a row number (RN) for each duplicate group, keeping the first record (RN = 1).
  • Filters Out Duplicates (RN > 1): Only marks duplicate records where City = 'NY'.
  • Deletes Duplicate Records: Deletes only IDs from the CTE that have RN > 1
  • This syntax will be useful when we are joining more tables and deleting duplicates from one specific table.

Please comment on your ideas, and follow for more.