Thursday, July 26, 2007

Deleting duplicate rows in mysql

I am posting this as a public service to myself on how to delete duplicate rows from a mysql table.
In my case I have a unique column id on each row. And I have an indexed column, call it match, that I use to determine duplication.
In short so I can find it next time what I do is

delete t2 from table1 as t1, table2 as t2 where t1.match = t2.match and t2.id > t1.id;


This works and takes about 2-4 minutes on a table with about 280K rows and 35K duplicates.

I've only tested this with isam tables, I know this can be done other ways and can be gleaned from the mysql documentation, but this way, when I can remember it, seems simplest.

4 comments:

tofindit said...

Hi, Google search brought me here. Could you please change to table as t1, table as t2---we have only one table here. Also, this keeps the oldest of the duplicate data. If we want to keep the newest row, the condition can be changed to t2.id < t1.id (hope I'm not missing something)

Thanks for sharing!

Mark Gullett said...

Good stuff. It avoids the use of temporary tables that many people like to use.

Sugam said...

good post but does not work for me. I have 56k rows with 300 duplicates.
When i run this query, it keeps on processing for 70-80 mins and i finally have to shutdown my server.

Any idea why this is happening?

Phil said...

Would be nice if you would actually explain your code for us non-experts.