...

View Full Version : Delete Duplicates



mharrison
02-18-2013, 02:29 AM
I'm running the following query to gather a list of duplicates based on 2 fields....lnktxt and cattxt:

SELECT P1.lnktxt, P1.cattxt, P1.dsptxt
FROM TABLE AS P1,
( SELECT lnktxt, cattxt, dsptxt, COUNT(*) AS howmany
FROM TABLE
GROUP BY lnktxt, cattxt, dsptxt
HAVING howmany > 1 ) AS P2
WHERE P1.lnktxt = P2.lnktxt AND P1.cattxt = P2.cattxt

It is displaying fine, and in PHPMyAdmin when I check the boxes to delete the individual records, it is not releasing the P1.lnktxt and the SQL server is erroring out. Is there a better method to gather and display a list of duplicates?
I don't want them to be automatically deleted, I would like to review the records first and select which one to delete.

Old Pedant
02-18-2013, 05:38 PM
If you don't have something that uniquely identifies each record, you can't select which one to delete. In other words, if you don't have a PRIMARY KEY in your (very badly named) TABLE then you can't do this easily.

If you do have a primary key, just SELECT it along with the other fields and *probably* PhpMyAdmin will be able to do it.

mharrison
02-18-2013, 07:43 PM
If you don't have something that uniquely identifies each record, you can't select which one to delete. In other words, if you don't have a PRIMARY KEY in your (very badly named) TABLE then you can't do this easily.

If you do have a primary key, just SELECT it along with the other fields and *probably* PhpMyAdmin will be able to do it.


Well, I used TABLE just to mask the table name...so yes, I would suppose my ruse would be a badly named table.

The reason I asked the question is I recently switched hosts and the previous hosts PHPMyAdmin allowed me to use the exact query (with the correct table name) with no problem, but the new hosts phpmyadmin does not.

And no, I do not have a primary key defined as I was not the original creator of the database and now that I have 23k+ records, I was never sure of the implications of trying to define one now, and especially because there isn't a field that can be truly unique...a link in the lnktxt field would have the same dsptxt field info, but could have a multiple entries, each with a unique cattxt....I would almost have to create a brand new field and populate it with data just to be able to have a primary key.

Old Pedant
02-18-2013, 09:39 PM
No no...it's easy!

Just do this:


ALTER TABLE tablename ADD uniqueid INT AUTO_INCREMENT PRIMARY KEY;

And that will not only add the field, make it a primary key, but it will also RIGHT THEN AND THERE populate *ALL* the existing records with the field!

Presto.

(If you want to, after removing the duplicates you could then remove the column, but I wouldn't bother.)

mharrison
02-19-2013, 03:29 AM
Ok, so now my table has a primary key. When I use any simple query that counts the 2 fields I want to reference for duplicates, it will show me that the record has 2 or more matches, but it does not let me delete one of them.

Example:
SELECT lnktxt, cattxt, COUNT(*) c FROM tablenam GROUP BY lnktxt HAVING c > 1;

Old Pedant
02-19-2013, 04:25 AM
You *MUST* also SELECT the primary key field!

If you do that, in PHPMyAdmin, it should then be able to delete the records based on that primary key.

I don't use PHPMyAdmin, but I could certainly build a web page that would allow this in under an hour, for example.



SELECT P1.uniqueid, P1.lnktxt, P1.cattxt, P1.dsptxt
FROM TABLE AS P1,
( SELECT lnktxt, cattxt, dsptxt, COUNT(*) AS howmany
FROM TABLE
GROUP BY lnktxt, cattxt, dsptxt
HAVING howmany > 1 ) AS P2
WHERE P1.lnktxt = P2.lnktxt AND P1.cattxt = P2.cattxt AND P1.dsptxt = P2.dsptxt
ORDER BY lnktxt, cattxt, dsptxt

And if you are going to count two records as identical only if all THREE fields agree (lnktxt, cattxt, dsptxt -- which is the meaning of your GROUP BY and HAVING there), then you really should also include those in the WHERE clause, as shown.

If you don't care about whether dsptxt is duplicated or not, then don't use it in the subquery.

Also, if you don't specify an ORDER BY clause, there is no guarantee that identical (except for uniqueid) records will be even close together in the output.

mharrison
02-19-2013, 05:57 AM
Even with the modifications to the original query you listed, the problem I am having is not being able to find the records, it's being able to delete them. PHPMyAdmin will show me the records and I can select the records I want to delete, but when it writes a query to delete the records, it errors out because it is still trying to use P1.lnktxt and so on.

Perhaps I should just document the uniqueid's and then write my own query to delete them that way.

My frustration is that with the version of phpmyadmin on my old webhost, I could delete the records by selecting which ones I wanted to delete right from phpmyadmin, but with my new webhost, I cannot as it errors out.

Old Pedant
02-19-2013, 08:35 PM
Well, like I said. Write your own code to do this. It's trivial. I don't use PHP, but it will be something as simple as this:


<form action="myDelete.php" method="post">
<table border="1" cellpadding="3">
<tr>
<th>DELETE</th><th>Lnktxt</th><th>Cattxt</th><th>Dsptxt</th>
</tr>
<?php
... make your db connection ...
$sql = "SELECT P1.uniqueid, P1.lnktxt, P1.cattxt, P1.dsptxt
FROM TABLE AS P1,
( SELECT lnktxt, cattxt, dsptxt, COUNT(*) AS howmany
FROM TABLE
GROUP BY lnktxt, cattxt, dsptxt
HAVING howmany > 1 ) AS P2
WHERE P1.lnktxt = P2.lnktxt AND P1.cattxt = P2.cattxt AND P1.dsptxt = P2.dsptxt
ORDER BY lnktxt, cattxt, dsptxt ";

$result = mysql_query($sql);

while ( row = mysql_fetch_assoc($result) )
{
$id = row["uniqueid"];
echo '<tr><td><input type="checkbox" name="delete[]" value="' .$id . '" />' . $id . "</td>\n";
echo "<td>...the other fields...</td></tr>\n";
}
?>
</table>
<input type="submit" value="delete checked rows">
</form>

Now, as I said, I don't use PHP, but I *THINK* your "myDelete.php" page would be as simple as this:


<?php
...make db connection ...
$sql = "DELETE FROM TABLE
WHERE uniqueid IN (" . implode(",",$_POST["delete"]) . ")";
mysql_query($sql);
?>

mharrison
02-22-2013, 07:27 PM
The php code only errors out, but I'm going to approach this from a different angle.

I have 3 columns I want to deal with here....
lnktxt
cattxt
datetxt

I want to find all records that are the same in lnktxt that also have the same value in cattxt. I then want to delete the record whose datetxt record is older (ie... delete 2012-11-12 rather than 2013-01-25)

I can find the records but I am not sure how I would write the delete statement to be able to distinguish the date criteria, if it can even be done.

Old Pedant
02-22-2013, 08:07 PM
** SIGH **

Why didn't you mention the DATETXT field *long* ago??

Okay, now we only have one problem: Why is that field called dateTXT? That would seem to indicate to me that you are (foolishly) storing the date as a text (e.g., VARCHAR) value, rather than as a TIMESTAMP or DATETIME value.

Is that so?

If it is, then were you at least consistent in the format of the dates you stored in it?

Can you select, say, 10 records and show what they all have in that field?

On the other hand, if that field really is a TIMESTAMP or DATETIME field, and only the name is misleading, then we are ready to go.

mharrison
02-22-2013, 08:20 PM
It is a DATE value field...I didn't name it and haven't bother to change it yet due to other concerns with getting the website fully functional...

Old Pedant
02-22-2013, 10:07 PM
Perfect then.

And trivial:


DELETE t1
FROM tablename AS t1, tablename AS t2
WHERE t1.lnktxt = t2.lnktxt
AND t1.cattxt = t2.cattxt
AND t1.datetxt < t2.datetxt

ALSO: If you happen to have 3 (or more!) records with the same lnktxt and cattxt values, that will delete *all* of them except the one with the latest datetxt field.

mharrison
02-22-2013, 10:31 PM
Perfect then.

And trivial:


DELETE t1
FROM tablename AS t1, tablename AS t2
WHERE t1.lnktxt = t2.lnktxt
AND t1.cattxt = t2.cattxt
AND t1.datetxt < t2.datetxt

ALSO: If you happen to have 3 (or more!) records with the same lnktxt and cattxt values, that will delete *all* of them except the one with the latest datetxt field.

That works wonderfully! One final question on this mission....is there any way to modify it so if it runs across an instance where all 3 fields are the same, it can drop all but one?

Old Pedant
02-22-2013, 11:11 PM
Yes, if you added that AUTO_INCREMENT field.

Say you named it "uniqueid".

Then:


DELETE t1
FROM tablename AS t1, tablename AS t2
WHERE t1.lnktxt = t2.lnktxt
AND t1.cattxt = t2.cattxt
AND t1.datetxt = t2.datetxt
AND t1.uniqueid < t2.uniqueid

But run the other query first, so that you get rid of the ones that *DO* differ by date. And only then run this one as a sort of cleanup.



EZ Archive Ads Plugin for vBulletin Copyright 2006 Computer Help Forum