View Full Version : Complicated (sort-of) duplicates query

03-20-2007, 05:29 PM
Hey All,

I've been struggling for a couple of days now with a duplicates issue which leaves me... well -- just plain stumped:confused:

In a relationship table "contacts_orders" (id,contact_id,orders_id,date,deleted) it is acceptable and even desirable
for there to be TWO records with the SAME orders_id, each having a DISTINCT contact_id (1 sellers_agent_id, 1 buyers_agent_id)
and also for there to be many instances of a single contact_id having been recorded for distinct orders(unique orders_id's.)

On the occasion where there are duplicates -- that is, the SAME contact_id recorded MORE THAN ONCE for a given orders_id --

Something like ...

DELETE FROM contacts_orders GROUP BY orders_id, contact_id HAVING count(*)>1

03-20-2007, 06:34 PM
You can use those two columns as your primary key, or you can simply create a unique index based on those two columns, and avoid duplicates by keeping them from existing in the first place.

To clean up your table as it is you can't use a GROUP BY on a DELETE query, so just write a little script that selects the offending rows and then deletes them. Is it really that important that you do it all in one query?

03-20-2007, 06:57 PM
The query below I wrote will show all records with duplicate contact_ids and orders_ids. Then you'll have to select the method to choose which record to delete (like maybe keeping the record with the lowest id number and deleting the others). You won't be able to do it in one query. This will require a nice little script to complete this.

To select duplicate records:

SELECT tbl1.id, tbl1.contact_id, tbl1.orders_id
FROM contacts_orders tbl1
WHERE tbl1.contact_id IN
SELECT tbl2.contact_id
FROM contacts_orders tbl2
GROUP BY tbl2.contact_id, tbl2.orders_id
HAVING Count(*) > 1
AND tbl2.orders_id = tbl1.orders_id
ORDER BY tbl1.contact_id, tbl1.orders_id;

03-20-2007, 07:23 PM
The following is something else to try (tested - seemed like an interesting exercise.)

Because you cannot DELETE anything in an outer query that is also contained in the inner query, you can use a TEMPORARY table to hold information about the rows that are duplicates.

This will form the temp table -

$query = "CREATE TEMPORARY TABLE new_tbl SELECT contact_id, orders_id
FROM contacts_orders GROUP BY contact_id, orders_id HAVING COUNT(*) > 1";The following will delete one of the duplicate rows each time it is executed. You need to limit the number of rows being deleted to one so that you don't delete all the matching rows.

$query = "DELETE FROM contacts_orders
WHERE (contact_id, orders_id) IN (SELECT contact_id, orders_id FROM new_tbl)
ORDER BY contact_id, orders_id, date LIMIT 1";By executing both of these queries in a loop (I just kept refreshing the page) until the number of affected rows is zero, you can delete all the duplicates.

You need to modify the ORDER BY clause, as already pointed out by TheShaner, to cause the desired row within each duplicate set to be deleted.

Edit: I am assuming that you have already modified the code that inserts these so that it checkes for an existing contact_id, orders_id pair and does not insert a duplicate.

03-20-2007, 07:34 PM
simplest method:

ALTER IGNORE TABLE contacts_orders
ADD UNIQUE INDEX dropthislater (contact_id,orders_id)

this will delete any rows where contact_id and orders_id are the same, leaving one of each of those rows behind. You don't get to choose which rows are kept though. If you have two rows the same but an id column with two different values, with this method you don't choose which id value is kept.

then you can leave that index in place so it doesn't allow that situation to happen again, or drop the index.

03-20-2007, 08:41 PM
If thhis is'nt the poster-child for a WELL RESPONDED thread.
Much to digest. Thanks one and all!

While I think your suggestion of a temp_table looks pretty short & sweet (was headed that way - in a really ugly fashion -- myself) your post-script assertion that the offending code should be rectified is rendered moot (along with the need for tmp_table and accompanying php script) -- by Fumigator and guelphdad's suggestion of creating a unique index based on the two id columns.
To wit, even Fumigator didn't get it entirely right on the first try -- no need to clean up table - and I was going to have ask how to index coulmns in tandem in a follow-up post ;)

guelphdad answered your question for both of us: No, it really is not that important to do it all in one query, just to do it effeciently... like so...
ALTER IGNORE TABLE contacts_orders
ADD UNIQUE INDEX dropthislater (contact_id,orders_id):thumbsup:

I didn't know I could index columns in tandem like that!! COOL!!!
Thanks guelphdad!