View Full Version : SQL Duplicates

LRBangin

03-05-2009, 07:37 PM

Hello,

I am building a 'waterfall' type report where I need to report duplicates based on multiple criteria.

For example, suppose I can concatenate 5 columns to form a key. I would want to report the total number of duplicates with this concatenated key.

That is easy, I can do that.

Now suppose 3 of those 5 columns form another key where I need to also report duplicates. There will not be a one to one relationship between either key.

I am trying to solve this within one SQL statement, and avoid any procedurel logic. I am confident I can come up with some way to hack this together, but I would prefer to keep it all within SQL.

If anyone has ever created such a report, please share a solution. Thanks.

Example:

Col1 Col2 Col3 Col4 Col5 Col6

A 1 6 X Y Z <-- full duplicate

A 1 6 X Y Z <-- full duplicate

A 1 6 $ $ $ <-- partial duplicate

B 2 7 X Y J <-- full duplicate

B 2 7 X Y B <-- full duplicate

B 2 7 $ $ $ <-- partial duplicate

C 4 7 X Y Z

C 4 8 X Y Z

D 5 8 X Y Z

E 5 8 X Y Z

Report Partial Duplicate Full Duplicate Total Duplicate

A 1 6 1 2 3

B 2 7 1 2 3

C 4 7 0 0 0

C 4 8 0 0 0

D 5 8 0 0 0

E 5 8 0 0 0

Old Pedant

03-06-2009, 12:10 AM

Can you please put [ code ] ... [ /code ] tags around your tables so we can see what you are after. Especially around the second table. I'm not sure what column name your various values are supposed to go with for that one.

Old Pedant

03-06-2009, 12:12 AM

Also, you need to define "partial duplicate".

I can see that

A 1 6 X Y Z

A 1 6 L M N

are partial duplicates. But what about

A 1 6 X Y Z

B 2 3 X Y Z

???? that also has 3 field matches.

Or even

A 1 6 X Y Z

B 1 5 X Q Z

again, 3 matches.

Hmmm????

LRBangin

03-06-2009, 03:56 AM

Such a rookie mistake. Here goes a better attempt.

Col 1-5 would be a 'full' key. Col 1-3 would be a 'partial' key. I am trying to devise a way to do this report within one SQL statement (or maybe two) and avoid anything procedural.

I don't have anything against using Perl or some other language to collate this all together, I'm just trying to build on some SQL knowledge. Thanks.

Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |

-----|------|------|------|------|------|

A | 1 | 6 | X | Y | 1 | <-- full duplicate

A | 1 | 6 | X | Y | 2 | <-- full duplicate

A | 1 | 6 | $ | $ | $ | <-- partial duplicate

B | 2 | 7 | X | Y | J | <-- full duplicate

B | 2 | 7 | X | Y | B | <-- full duplicate

B | 2 | 7 | $ | $ | $ | <-- partial duplicate

C | 4 | 7 | X | Y | Z |

C | 4 | 8 | X | Y | Z |

D | 5 | 8 | X | Y | Z |

E | 5 | 8 | X | Y | Z |

Col1 | Col2 | Col3 | Col4| Col5|Partial Dup Count | Full Dup Count | Total Dup Count|

-----|------|------|-----|-----|------------------|----------------|----------------|

A | 1 | 6 | X | Y | 1 | 2 | 3 |

B | 2 | 7 | X | Y | 1 | 2 | 3 |

C | 4 | 7 | X | Y | 0 | 0 | 0 |

C | 4 | 8 | X | Y | 0 | 0 | 0 |

D | 5 | 8 | X | Y | 0 | 0 | 0 |

E | 5 | 8 | X | Y | 0 | 0 | 0 |

Old Pedant

03-06-2009, 08:16 AM

I still don't quite see it...

Consider: Your third row with Col1=A and your third row with Col1=B could *ALSO* be considered "master" rows.

I don't know how you would avoid getting a report that looks like this:

Col1 | Col2 | Col3 | Col4| Col5|Partial Dup Count | Full Dup Count | Total Dup Count|

-----|------|------|-----|-----|------------------|----------------|----------------|

A | 1 | 6 | X | Y | 1 | 2 | 3 |

A | 1 | 6 | $ | $ | 2 | 0 | 2 |

B | 2 | 7 | X | Y | 1 | 2 | 3 |

B | 2 | 7 | $ | $ | 2 | 0 | 2 |

etc.

Either those records with those counts are going to appear or you need *ANOTHER* rule to exclude them. (For example, "don't report records that have no full dupes at all and that have already been counted as a partial dupe in some other record"??? though that rule is a toughie, I think.)

riwan

03-09-2009, 11:09 AM

This should solve your problem

select tblb.*,

sum(if(concat(tbla.Col1,tbla.Col2,tbla.Col3)=concat(tblb.Col1,tblb.Col2,tblb.Col3) and concat(tbla.Col1,tbla.Col2,tbla.Col3,tbla.Col4,tbla.Col5)<>concat(tblb.Col1,tblb.Col2,tblb.Col3,tblb.Col4,tblb.Col5),1,0)) as `Full Dup Count`,

sum(if(concat(tbla.Col1,tbla.Col2,tbla.Col3,tbla.Col4,tbla.Col5)=concat(tblb.Col1,tblb.Col2,tblb.Col 3,tblb.Col4,tblb.Col5),1,0)) as `Full Dup Count`,

sum(if(concat(tbla.Col1,tbla.Col2,tbla.Col3)=concat(tblb.Col1,tblb.Col2,tblb.Col3),1,0)) as `Total Dup Count` from tbla

left join (select Col1, Col2, Col3, Col4, Col5 from tbla group by Col1,Col2,Col3) tblb

on (tbla.Col1=tblb.Col1 and tbla.Col2=tblb.Col2 and tbla.Col3=tblb.Col3)

group by tblb.Col1,tblb.Col2,tblb.Col3,tblb.Col4,tblb.Col5

tbla = your original table

The Full Dup Count and Total Dup Count should be checked if its 1 then print 0

Old Pedant

03-10-2009, 01:16 AM

It can be coded simpler than that, just using a UNION.

But that still doesn't answer my question/objection about the "extra" rows with partial-macth-only counts when those rows have already been counted in the full-match counts.

riwan

03-10-2009, 04:29 PM

I thought its already clear in his post what he wanted.

He already showed the wanted result.

which would be grouping the same 3 first col.

At least the sql query I write above already omitted the extra rows that you said should be showing

Powered by vBulletin® Version 4.2.2 Copyright © 2015 vBulletin Solutions, Inc. All rights reserved.