...

View Full Version : SQL Duplicates



LRBangin
03-05-2009, 06:37 PM
Hello,
I am building a 'waterfall' type report where I need to report duplicates based on multiple criteria.

For example, suppose I can concatenate 5 columns to form a key. I would want to report the total number of duplicates with this concatenated key.

That is easy, I can do that.

Now suppose 3 of those 5 columns form another key where I need to also report duplicates. There will not be a one to one relationship between either key.

I am trying to solve this within one SQL statement, and avoid any procedurel logic. I am confident I can come up with some way to hack this together, but I would prefer to keep it all within SQL.

If anyone has ever created such a report, please share a solution. Thanks.

Example:

Col1 Col2 Col3 Col4 Col5 Col6
A 1 6 X Y Z <-- full duplicate
A 1 6 X Y Z <-- full duplicate
A 1 6 $ $ $ <-- partial duplicate
B 2 7 X Y J <-- full duplicate
B 2 7 X Y B <-- full duplicate
B 2 7 $ $ $ <-- partial duplicate
C 4 7 X Y Z
C 4 8 X Y Z
D 5 8 X Y Z
E 5 8 X Y Z


Report Partial Duplicate Full Duplicate Total Duplicate
A 1 6 1 2 3
B 2 7 1 2 3
C 4 7 0 0 0
C 4 8 0 0 0
D 5 8 0 0 0
E 5 8 0 0 0

Old Pedant
03-05-2009, 11:10 PM
Can you please put [ code ] ... [ /code ] tags around your tables so we can see what you are after. Especially around the second table. I'm not sure what column name your various values are supposed to go with for that one.

Old Pedant
03-05-2009, 11:12 PM
Also, you need to define "partial duplicate".

I can see that

A 1 6 X Y Z
A 1 6 L M N

are partial duplicates. But what about

A 1 6 X Y Z
B 2 3 X Y Z

???? that also has 3 field matches.

Or even

A 1 6 X Y Z
B 1 5 X Q Z

again, 3 matches.

Hmmm????

LRBangin
03-06-2009, 02:56 AM
Such a rookie mistake. Here goes a better attempt.

Col 1-5 would be a 'full' key. Col 1-3 would be a 'partial' key. I am trying to devise a way to do this report within one SQL statement (or maybe two) and avoid anything procedural.

I don't have anything against using Perl or some other language to collate this all together, I'm just trying to build on some SQL knowledge. Thanks.



Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
-----|------|------|------|------|------|
A | 1 | 6 | X | Y | 1 | <-- full duplicate
A | 1 | 6 | X | Y | 2 | <-- full duplicate
A | 1 | 6 | $ | $ | $ | <-- partial duplicate
B | 2 | 7 | X | Y | J | <-- full duplicate
B | 2 | 7 | X | Y | B | <-- full duplicate
B | 2 | 7 | $ | $ | $ | <-- partial duplicate
C | 4 | 7 | X | Y | Z |
C | 4 | 8 | X | Y | Z |
D | 5 | 8 | X | Y | Z |
E | 5 | 8 | X | Y | Z |


Col1 | Col2 | Col3 | Col4| Col5|Partial Dup Count | Full Dup Count | Total Dup Count|
-----|------|------|-----|-----|------------------|----------------|----------------|
A | 1 | 6 | X | Y | 1 | 2 | 3 |
B | 2 | 7 | X | Y | 1 | 2 | 3 |
C | 4 | 7 | X | Y | 0 | 0 | 0 |
C | 4 | 8 | X | Y | 0 | 0 | 0 |
D | 5 | 8 | X | Y | 0 | 0 | 0 |
E | 5 | 8 | X | Y | 0 | 0 | 0 |

Old Pedant
03-06-2009, 07:16 AM
I still don't quite see it...

Consider: Your third row with Col1=A and your third row with Col1=B could *ALSO* be considered "master" rows.

I don't know how you would avoid getting a report that looks like this:


Col1 | Col2 | Col3 | Col4| Col5|Partial Dup Count | Full Dup Count | Total Dup Count|
-----|------|------|-----|-----|------------------|----------------|----------------|
A | 1 | 6 | X | Y | 1 | 2 | 3 |
A | 1 | 6 | $ | $ | 2 | 0 | 2 |
B | 2 | 7 | X | Y | 1 | 2 | 3 |
B | 2 | 7 | $ | $ | 2 | 0 | 2 |
etc.


Either those records with those counts are going to appear or you need *ANOTHER* rule to exclude them. (For example, "don't report records that have no full dupes at all and that have already been counted as a partial dupe in some other record"??? though that rule is a toughie, I think.)

riwan
03-09-2009, 10:09 AM
This should solve your problem



select tblb.*,
sum(if(concat(tbla.Col1,tbla.Col2,tbla.Col3)=concat(tblb.Col1,tblb.Col2,tblb.Col3) and concat(tbla.Col1,tbla.Col2,tbla.Col3,tbla.Col4,tbla.Col5)<>concat(tblb.Col1,tblb.Col2,tblb.Col3,tblb.Col4,tblb.Col5),1,0)) as `Full Dup Count`,
sum(if(concat(tbla.Col1,tbla.Col2,tbla.Col3,tbla.Col4,tbla.Col5)=concat(tblb.Col1,tblb.Col2,tblb.Col 3,tblb.Col4,tblb.Col5),1,0)) as `Full Dup Count`,
sum(if(concat(tbla.Col1,tbla.Col2,tbla.Col3)=concat(tblb.Col1,tblb.Col2,tblb.Col3),1,0)) as `Total Dup Count` from tbla
left join (select Col1, Col2, Col3, Col4, Col5 from tbla group by Col1,Col2,Col3) tblb
on (tbla.Col1=tblb.Col1 and tbla.Col2=tblb.Col2 and tbla.Col3=tblb.Col3)
group by tblb.Col1,tblb.Col2,tblb.Col3,tblb.Col4,tblb.Col5


tbla = your original table

The Full Dup Count and Total Dup Count should be checked if its 1 then print 0

Old Pedant
03-10-2009, 12:16 AM
It can be coded simpler than that, just using a UNION.

But that still doesn't answer my question/objection about the "extra" rows with partial-macth-only counts when those rows have already been counted in the full-match counts.

riwan
03-10-2009, 03:29 PM
I thought its already clear in his post what he wanted.
He already showed the wanted result.
which would be grouping the same 3 first col.
At least the sql query I write above already omitted the extra rows that you said should be showing



EZ Archive Ads Plugin for vBulletin Copyright 2006 Computer Help Forum