Finding Duplicates Among Multiple Columns in SQL

Here I show how to find duplicates and their frequency among multiple columns using the GROUP BY clause.

Suppose you have data in which you need to find the duplicates in a single column (for example to find common names in a list of names) or in multiple columns (for example to find all the persons who have same name and age but reside at a different address). In these situations, we could do that easily by applying "GROUP BY" and "HAVING" clauses. Let us see how to use these SQL clauses.

SQL1.jpg

Finding Duplicates in a single column

Let us consider we need to find the street numbers that are repeated in the preceding table. From the preceding table we can see the street numbers: 604 and 538 are the repeated ones.

SELECT STREET_NUM
FROM ADDRESS_TABLE
GROUP BY STREET_NUM
HAVING COUNT(*) > 1

The output is:

SQL2.jpg

Finding Duplicates combination spanning in multiple columns


Let us consider we need to find the addresses that are identical except only by their City. For this, we select the collection of such fields for which we seek the duplicates. In this case, we form a collection having all the address fields except City because we want identical addresses that differ only by their city field. So the collection will be:

"STREET_NUM,PREFIX,STREET_NAME,SUFFIX,STREET_TYPE,STATE,POSTAL_CODE,COUNTRY"

SELECT

      STREET_NUM,

      PREFIX,

      STREET_NAME,

      SUFFIX,

      STREET_TYPE,

      STATE,

POSTAL_CODE,

COUNTRY

 

FROM ADDRESS_TABLE

GROUP BY

      STREET_NUM,

      PREFIX,

      STREET_NAME,

      SUFFIX,

      STREET_TYPE,

      STATE,

POSTAL_CODE,

COUNTRY

HAVING COUNT(*) > 1

The above query will give output as:

SQL3.jpg

Note

The fields in SELECT clause must be matched with fields in GROUP BY clause otherwise we will get an error like "Column 'ADDRESS_TABLE.CITY' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY
clause." for the following SQL query:

SELECT STREET_NUM, PREFIX,STREET_NAME,SUFFIX,STREET_TYPE,CITY,STATE, POSTAL_CODE,COUNTRY

FROM ADDRESS_TABLE

GROUP BY STREET_NUM, PREFIX,STREET_NAME,SUFFIX,STREET_TYPE,STATE, POSTAL_CODE,COUNTRY

HAVING COUNT(*) > 1
 

Finding Duplicates combination spanning in multiple columns and their frequency

Let us consider we need to find the addresses that are identical except only by their City and as well as their frequency. In the following query, we just add COUNT(*) that gives the count of the group of columns that we put in GROUP BY clause.
 

SELECT COUNT(*)AS DUPLICATES_COUNT,

      STREET_NUM,

      PREFIX,

      STREET_NAME,

      SUFFIX,

      STREET_TYPE,

      STATE,

POSTAL_CODE,

COUNTRY

FROM ADDRESS_TABLE

GROUP BY

      STREET_NUM,

      PREFIX,

      STREET_NAME,

      SUFFIX,

      STREET_TYPE,

      STATE,

POSTAL_CODE,

COUNTRY

HAVING COUNT(*) > 1


The above query will give output as:

SQL4.jpg

So, we summarize the following points:

  • We can find the occurrence of one/many fields in the data if we group them by GROUP BY clause and qualify them by HAVING clause further.

  • The fields in SELECT clause must be matched with fields in GROUP BY clause.