Hi all,

I'm looking into the management of duplicate records in MySQL, but this case is a little bit more specific. A record is only duplicated if column A and column B have the same value. So a record is only a duplicate if two specific columns have the same values in more than one record.

Table (excerpt):

| id | userid | courseid | 
|  1 |     23 |        3 |
|  2 |     23 |        3 |
|  3 |     28 |        5 |
|  4 |     28 |        5 |
|  5 |     23 |        5 |
|  6 |     23 |        3 |
|  7 |     29 |        3 |

Duplicate rows above that would be deleted by the query = id's 2, 4 and 6 (as I would like to keep one of the duplicate records, in this case, the record with the lowest id value.

Figuring out a solution:

After playing around and doing a little bit of research, this seemed to work a treat:

ALTER IGNORE TABLE table ADD UNIQUE INDEX tmpindex (userid,courseid);
ALTER TABLE table DROP INDEX tmpindex;

My only concern is that this probably isn't the fastest or safest method of achieving the same result, this query could be executed dozens of times an hour or thousands of times by a rouge user (I'll add a control/restriction eventually). Does anyone have any feedback or suggestions? I would truly appreciate a bit more wisdom.

Another thought, would it be more sensible to check for an existing record before inserting duplicates in the first place? There could be 100's of inserts from one submission so a foreach to check if a record exists before inserting it just wouldn't be sensible imo - but I'm no DBA guru...

Thanks in advance :)

Michael

Member Avatar for diafol

I'd suggest that you make a compound PK out of the two columns and not have an 'id' field. But as you're already on this path,

DELETE t1 FROM tablename t1, tablename t2 WHERE t1.id > t2.id AND t1.userid = t2.userid AND t1.courseid = t2.courseid

Not tested!

Before testing - ensure that you create a duplicate table and do your testing on that!

I'll be honest and say this is the first time I've head of a compound key... I'm open to suggestions! (The id column is easily deleted).

DELETE t1 FROM tablename t1, tablename t2 WHERE t1.id > t2.id AND t1.userid = t2.userid AND t1.courseid = t2.courseid

Is it fair to say that the use of t1 and t2 here has totally confused me :/

Sorry for derping Diafol and thanks :)

Member Avatar for diafol

Ok well the delete query I offered will only work with the id field - as I thought that this is what you had. I suggested the compound key as this is the easier alternative to stopping duplicate entries in the first place.

If you try to add a duplicate entry it will fail as primary keys are unique (as you know).

t1 and t2 are just aliases since the query relies upon the fact that we have to place the same tablename in twice. Without the aliases the sql would be ambiguous to say the least, as you can't compare something to its 'direct self'.

Sorted, thanks very much!

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.