PDA

View Full Version : Need a distinct column within a complex multiple join


MattyUK
04-26-2005, 06:56 PM
Hi

I have found similar posts but none with an answer that I can use (or perhaps I mean, that I understand). Hopefully my post doesn't breach any rules.

I have a complex multiple join in a database with multiple tables. It is a relational database and some the tables simple contain indexes of links from two other tables.

The SQL I am using is:


SELECT DISTINCT tblPeoples.*,tblImages.filename,tblProducts.localID,
tblProducts.productID,tblProducts.name,tblProducts.active
FROM tblPeoples,tblImages,tblProducts,tblProductImageLinks,tblHeadLetterLinks
WHERE tblPeoples.peopleID IN (528,531)
AND tblPeoples.peopleID=tblHeadLetterLinks.peopleID
AND tblHeadLetterLinks.imageID=tblImages.imageID
AND tblImages.imageID=tblProductImageLinks.imageID
AND tblProductImageLinks.productID=tblProducts.productID
AND tblProducts.active=1
AND tblProducts.productID!=599


Obviously the hardcoded values will come from php values in the final page.

The output from the above SQL is:

peopleID | firstName | Surname | keywords | filename | localID | productID | name | active
528 | Victoria Adams | | afile.jpg | 1745 | 777 | Spice Girls 4 | 1
531 | Emma Bunton | | afile.jpg | 1745 | 777 | Spice Girls 4 | 1


I do not want the 2nd row returned. I need to use DISTINCT on the productID column only.

Trouble is DISTINCT doesn't work on columns only rows.

I have heard of left and right joins but don't understand what these are, could they be used here?

Now rather than post data and structures from several tables I will try to summarise the problem using a much cut down example.

Table: tblName
name | nameID
-------------
alan | 1
bob | 2
jim | 3


Table: tblAnyText
anytext | nameID
-------------
anytext1 | 1
anytext2 | 2
anytext3 | 2
anytext4 | 3


SELECT tblName.nameID,tblName.name,tblAnyText.anytext
WHERE tblName.nameID = tblAnyText.nameID

produces:


nameID | name | anytext
-----------------------
1 | alan | anytext1
2 | bob | anytext2
2 | bob | anytext3
3 | jim | anytext4

and if I use:

SELECT DISTINCT tblName.nameID,tblName.name,tblAnyText.anytext
WHERE tblName.nameID = tblAnyText.nameID

produces:


nameID | name | anytext
-----------------------
1 | alan | anytext1
2 | bob | anytext2
2 | bob | anytext3
3 | jim | anytext4

How can I get DISTINCT to work on a single column?
I do not want/need bob's name returned a second time. I want to return only DISTINCT nameID's?

Can it be done in a single MySQL query?

So there we have it. It certainly is confusing me. I am hoping someone can help.

MattyUK

Fataqui
04-26-2005, 08:33 PM
I sort of understand what your trying to do, but am confused about how you are doing it!


SELECT DISTINCT tblPeoples.*,tblImages.filename,tblProducts.localID,
tblProducts.productID,tblProducts.name,tblProducts.active
FROM tblPeoples,tblImages,tblProducts,tblProductImageLinks,tblHeadLetterLinks
WHERE tblPeoples.peopleID IN (528,531)
AND tblPeoples.peopleID=tblHeadLetterLinks.peopleID
AND tblHeadLetterLinks.imageID=tblImages.imageID
AND tblImages.imageID=tblProductImageLinks.imageID
AND tblProductImageLinks.productID=tblProducts.productID
AND tblProducts.active=1
AND tblProducts.productID!=599


Above is asking a lot of questions that should be part of the join so you limit what you really want the database to look at and return. Anything after the WHERE clause should contain your question, not contain how you want the question answered. It's more work for the database that way.

Can you show you database layout so I can see what you have indexed, and the relationships so I can write you simple query that will use the least resources.


Fataqui

MattyUK
04-26-2005, 09:26 PM
Hi Fataqui

Thanks for your reply. What you are saying makes sense. However I am unsure how to accomplish it differently.

Is this what you are after?

<>
tblHeadLetterLinks:
imageID int(11)
peopleID int(100)
HeadLetter varchar(100)
HeadLetterLinkID int(11)

Keyname Type Cardinality Action Field
PRIMARY PRIMARY 2621 HeadLetterLinkID
HeadLetterLinkID UNIQUE 2621 HeadLetterLinkID
HeadLetterLinkID_2 INDEX 2621 HeadLetterLinkID
imageID INDEX 2621 imageID
peopleID INDEX 2621 peopleID
<>

<>
tblPeoples:
peopleID int(11)
firstName varchar(100)
Surname varchar(100)
keywords varchar(255)

Keyname Type Cardinality Action Field
PRIMARY PRIMARY 1239 peopleID
peopleID INDEX 1239 peopleID
<>

<>
tblImages:
imageID int(11)
filename varchar(100)
description varchar(100)
supplierid int(11)

Keyname Type Cardinality Action Field
PRIMARY PRIMARY 941 imageID
imageID INDEX 941 imageID
<>

<>
tblProducts:
productID int(11)
localID varchar(100)
name varchar(200)
description text
cost double
date varchar(100)
active tinyint(1)
highlight tinyint(1)

Keyname Type Cardinality Action Field
PRIMARY PRIMARY 901 productID
localID UNIQUE 901 localID
productID INDEX 901 productID
<>

<>
tblProductImageLinks:
productID int(11)
imageID int(100)
ProductImageLinkID int(11)

Keyname Type Cardinality Action Field
PRIMARY PRIMARY 635 ProductImageLinkID
productID INDEX 635 productID
imageID INDEX 635 imageID
<>


Let me know if you need more info. I'd also be interested in an brief explanation of your final code since then I can learn and help others as well. Thanks again for all your efforts.

MattyUK

Kiwi
04-27-2005, 05:30 AM
DISTINCT restricts the results to a unique combination of fields. In this case, you're focusing on the wrong field: two records are produced in the result set because you have two PeopleIDs returned for one ProductID. Without knowing where this duplicate actually occurs, I've assumed it's in tblHeadLetterLinks. What you have to do is force this table to only return a single record for each ProductID. Fortunately, this is quite easy:

SELECT A.*, B.filename, C.localID, C.productID, C.name, C.active
FROM tblPeoples AS A, tblImages AS B, tblProducts AS C, tblProductImageLinks AS D,
AND B.imageID = D.imageID
AND D.productID = C.productID
AND C.active=1
AND A.peopleID =
SELECT (E.peopleID
FROM tblHeadLetterLinks AS E
WHERE E.imageID = B.imageID
ORDER BY RAND()
LIMIT 1)

This will return a single random peopleID from tblHeadLetterLinks for each product matching the other criteria, which joins in tblPeoples. I've stripped out the logically un-necessary criteria. These can be added back in as necessary to restrict your result set: but you're generally better to get the more general result set working, then look at additional restrictive criteria. these should start with the most general (eg restricting product IDs) and move to the more specific.

If the two records are returned from another table, then you simple have to move the sub-select to restrict the records returned for the correct table.

If you want the same result returned each time the query is run, then change the order by to 'ORDER BY E.peopleID'.

Finally, I'm unsure of the functional purpose of this query but it strikes me as slightly strange. If this is run frequently, then it would be worth revisiting the data structures so this sort of inquiry is easier to manage. In particular, if a person is associated with a product, then one would expect a direct link between the product and the person.

MattyUK
04-27-2005, 02:02 PM
Hi Kiwi

I'm learning here.
AND A.peopleID =
SELECT (E.peopleID
FROM tblHeadLetterLinks AS E
WHERE E.imageID = B.imageID
ORDER BY RAND()
LIMIT 1)
I had no idea you could do that. That will be very useful for me in other things as well. Thank you.

First time I've seen AS and A. used in such ways too. I'm impressed.

Anyway back to the problem.

You are right, the duplicate occurs in tblHeadLetterLinks. This is because one product can have multiple people assigned to it, and one person can be assigned to multiple products. During each product they are assigned to they can have different headletters assigned, particular to that product.

Think of a product1 with 3 people in, labeled A>name1, B>name2 C>name3. then have another product2 with 2 people in it, labeled A>name3 B>name1.

I have the task of finding related products to the first one, based on the people assigned to it.

Plain english description in case it helps.
I need to look at each of the people assigned to product1 and identify product2 as related because it contains the same person (name1). However it comes up twice since when I get to name3 from product1 it also returns product2 as a related person. But since the rows are different DISTINCT doesn't work. I need product2 to be returned only once. Not twice. Was hoping to do via a single SQL query rather than process by php at this point.

The query would be run on a high traffic page so I do need to keep it as resource friendly as possible. I had heard that sub queries were resource hungry. But I am not sure.

It is a challenge I do realise. With this extra info have you/or anyone any ideas how it can be accomplished.

I shall try playing with your last posted SQL and see how it fits in.

Thanks for your efforts so far.

MattyUK

Kiwi
04-28-2005, 02:07 AM
The basic problem I see is that you are joining two logically related tables via a third table (images). Having to go through the images tables to find which people are associated with which products means that this is going to be a complicated solution – no matter how many SQL tricks you pull.

So far, you seem to have three major data entities: product, person, image. There are various associations between these, but I’m not sure on the exact relationships and dependencies. So far, it seems that there is a many-to-many relationship between products and images and a many-to-many relationship between images and people. There may be a fourth entity – headletter – but this is only used to join images and people in this example. This means the only way a person can be associated with a product is if they are associated with an image that is associated with that product.

The basic solution is to create a direct association between product and person. Assuming this logical structure accurately reflects your business structure, then there are no easy answers.

In a real database, I’d create a view; unfortunately mysql can’t do this. A view is effectively a query that can be treated as a table (most queries in Access are, in fact, views). They can be stored and optimised by a DBA to be more efficient than a sub-query. In mySql, a view can be mimicked by using sub-queries, but this is not a particularly good solution.

You can create a join (or xref) table. That means any changes to the product, image, headletter or person tables would be checked to see if it’s creating a new relationship between product and person and update this table. Again, in a real database, this would be done with triggers; in mySql, you have to do it in the application layer.

Either way would produce a table (or view) that would simply contain ProductID and PersonID. From there, your solution becomes very easy:

SELECT DISTINCT A.ProductID
FROM personproduct AS A, personproduct AS B
WHERE A.ProductID <> <theproduct>
AND A.PersonID = B.PersonID
AND B.ProductID = <theproduct>;


(Another aside: if you want to preserve resources…
tblHeadLetterLinks.PeopleID is defined as INT(100) – given that there haven’t been that many people ever born on the planet, this seems somewhat excessive. INT(10) would allow you to store up to 10 billion people – which would allow for everyone currently alive and some population growth.).

MattyUK
04-28-2005, 02:32 AM
Hi Kiwi

Yup, the current structure accurately reflects the business structure and you are right. I will need to consider a new table (ProductID and PersonID) to make this work well.

How to create it from exisiting data and integrate it into the site admin system so that it is updated will require some additional thought.


(Another aside: if you want to preserve resources…
tblHeadLetterLinks.PeopleID is defined as INT(100) – given that there haven’t been that many people ever born on the planet, this seems somewhat excessive. INT(10) would allow you to store up to 10 billion people – which would allow for everyone currently alive and some population growth.).

Opps, that was an oversight. I'll correct. Thanks.

MattyUK

Kiwi
04-28-2005, 07:34 AM
The temp table will look something like this:

CREATE TABLE ProductPerson (
ProductID INT(11) NOT NULL,
PersonID INT(11) NOT NULL,
FOREIGN KEY (ProductID) REFERENCES tblProducts(ProductID),
FOREIGN KEY (PersonID) REFERENCES tblPeoples(PersonID),
PRIMARY KEY (productID, PersonID)
);

The basic INSERT will look something like this:

INSERT INTO ProductPerson (ProductID,PersonID)
VALUES (SELECT DISTINCT C.ProductID, A.PersonID
FROM tblPeoples AS A, tblImages AS B, tblProducts AS C, tblProductImageLinks AS D, tblHeadLetterLinks AS E
WHERE B.imageID = D.imageID
AND D.productID = C.productID
AND C.active=1
AND A.peopleID = E.peopleID
AND E.imageID = B.imageID)

This will insert all unique product/person combinations into the xref table. You could add a TRUNCATE to the start of this and run it as a batch process each day.

The select I've used there also forms the basis of a view or a subselect, if you chose to follow either of those paths.

I'm not 100% certain that you don't get problems when using aliases with an INSERT ... SELECT - you might want to check that out. I think mySQL defaults to running ON DUPLICATE IGNORE - which means it simply won't insert duplicate lines in the table. Finally, it might be with adding indexes to the individual fields in the xref table.

MattyUK
04-28-2005, 11:49 AM
Hi Kiwi

Thanks again.

There is a little there I have not come across before so I will need to read up on it first. Otherwise I won't learn from it.

I think I would prefer to run the query whenever a product is updated/added and would probably work on it this way.

You have given me a very good head start. Thank you.

MattyUK