How to remove duplicate rows from a given table, when there is a unique key based on more than one column.And you have to keep the latest record based on the modified date/ create date

2.1K    Asked by JacobRutherford in SQL Server , Asked on Nov 5, 2019
Answered by Rachit Gupta

Solution 1: Using Row_number and CTE:

If there is a unique combination based on 3-4 columns so we should use a row_number function with the partition by clause and find the number for each row. Post that, we can use this statement in CTE (common table expression). Post that we can delete the data from CTE.


CREATE TABLE [dbo].[Employ](

             [EmployID] [int] NULL,

             [EmployName] [varchar](30) NULL,

             [EmploySalary] [int] NULL,

             [Created_date] [datetime] NULL,

             [modified_date] [datetime] NULL

) ON [PRIMARY]

GO


ALTER TABLE [dbo].[Employ] ADD DEFAULT (getdate()) FOR [Created_date]

GO


insert into Employ

select '1','emp9','2090',getdate()-10,getdate()-1 union

select '2','emp10','2140',getdate()-10,getdate()-7 union

select '2','emp10','2140',getdate()-10,getdate()-3 union

select '3','emp11','2190',getdate()-10,null union

select '3','emp11','2190',getdate()-10,getdate()-1 union

select '4','emp12','2240',getdate()-10,null union

select '5','emp13','2090',getdate()-10,getdate()-7 union

select '6','emp14','2140',getdate()-10,getdate()-3 union

select '6','emp14','2140',getdate()-10,getdate()-1 union

select '7','emp15','2190',getdate()-10,null union

select '7','emp15','2190',getdate()-10,getdate()-3 union

select '7','emp15','2190',getdate()-10,getdate()-1 union

select '8','emp1','2020',getdate()-10,null union

select '9','emp2','2070',getdate()-10,null union

select '10','emp3','2120',getdate()-10,null union


select '10','emp3','2120',getdate()-10,getdate()-1 union

select '11','emp4','2170',getdate()-10,null union

select '11','emp4','2170',getdate()-10,getdate()-1 union

select '12','emp5','2220',getdate()-10,getdate()-1 union

select '13','emp6','2270',getdate()-10,getdate()-1 union

select '14','emp7','2320',getdate()-10,getdate()-1 union

select '15','emp8','2370',getdate()-10,getdate()-1


---find the duplicate rows from the CTE, and find the duplicate based on first modified and then create date

with emp_dup(EmployID,EmployName,R)

as (

select EmployID,EmployName,ROW_NUMBER() over( partition by EmployID, EmployName order by isnull (modified_date,created_date) desc )r

from Employ)

delete from emp_dup where r>1


CREATE TABLE [dbo].[Employ_DISTINCT](

             [EmployID] [int] NULL,

             [EmployName] [varchar](30) NULL,

             [EmploySalary] [int] NULL,

             [Created_date] [datetime] NULL,

             [modified_date] [datetime] NULL

) ON [PRIMARY]

GO

ALTER TABLE [dbo].[Employ_DISTINCT] ADD DEFAULT (getdate()) FOR [Created_date]

GO


Solution 2: inserting unique records in a new table (this solution can be applied for small tables only and having large amount of data(more than 30%) as duplicate as in this case the delete will take more time than insert)

If there is a unique combination based on 3-4 columns so we should use a row_number function with the partition by clause and find the number then. And insert the rows with row_number =1 in the other new table. Post that the old table can be dropped and new table can be rename to the original table name


Insert into [dbo].[Employ_DISTINCT]

select EmployID, EmployName, EmploySalary, modified_date, created_date from

(select b.EmployID,b.EmployName, b.EmploySalary, b.modified_date,b.created_date,

ROW_NUMBER() over( partition by b.EmployID, b.EmployName order by isnull (modified_date,created_date) desc) R

from Employ b

)a on a.EmployID=b.EmployID and a.EmployName=b.EmployName)HH where R=1


Your Answer

Interviews

Parent Categories