...

View Full Version : Finding two most recent records for each ID

sferson
04-28-2009, 07:33 PM
I have a table with test scores in it. Let's say it's designed something like this:
SubID - VisDate - Score1 - Score2 - Score3 - Score 4

There are often multiple records for each SubID. I want to know how to pull out the most recent two records (based on VisDate) for each unique SubID.

Can someone help?

Old Pedant
04-28-2009, 08:47 PM
Well, one way to do it:

SELECT MAIN.*
FROM table AS MAIN,
( SELECT subid, MAX(visdate) AS oneDate
FROM table
GROUP BY subid
UNION
SELECT T.subid, MAX(T.visdate) AS oneDate
FROM table AS T
( SELECT subid, MAX(visdate) AS topDate
FROM table
GROUP BY subid ) AS X
WHERE T.subid = X.subid AND T.visdate < X.topDate
) AS U
WHERE MAIN.subid = U.subid AND MAIN.visdate = U.oneDate
ORDER BY MAIN.subid, MAIN.visdate DESC

You see it? The UNION gets the latest date for each subid unioned with the next latest date for each subid. So then we JOIN all that back to the main table to get the rest of the info.

Fumigator
04-28-2009, 10:35 PM
You can GROUP BY SubID and order by VisDate Desc. You'll want to do this in a subquery so you can select everything from the table for each row.

SELECT *
FROM your_table
WHERE SubID IN (SELECT
FROM your_table as subQ
GROUP BY subQ.SubID
ORDER BY subQ.VisDate desc
LIMIT 2)

Keep in mind you shouldn't select anything

Old Pedant
04-28-2009, 10:45 PM
Ummmm...I don't think that works.

If you just look at the sub-SELECT:

SELECT subQ.SubID
FROM your_table as subQ
GROUP BY subQ.SubID
ORDER BY subQ.VisDate desc
LIMIT 2

(Fumigator missed the needed field, shown there in red)
that will ONLY return *TWO* records *TOTAL*. It does NOT return two per subid.

I have seen people pass in the outer value to the inner SELECT:

SELECT T.*
FROM your_table AS T
WHERE SubID IN (SELECT subQ.SubID
FROM your_table as subQ
WHERE subQ.SubID = T.SubID
GROUP BY subQ.SubID
ORDER BY subQ.VisDate desc
LIMIT 2)

Dunno if that works in MYSQL or not. As I read the specs, it doesn't have to work. But it might.

sferson
04-28-2009, 10:59 PM
Thank you so much for your help!

I think I'm understanding the theory here (for both suggestions that have been posted) but I'm getting errors.

For Old Pendant's code I get:

Msg 156, Level 15, State 1, Line 11
Incorrect syntax near the keyword 'SELECT'.
Msg 102, Level 15, State 1, Line 13
Incorrect syntax near ')'.

And for Fumigator's code I get:

Msg 156, Level 15, State 1, Line 6
Incorrect syntax near the keyword 'FROM'.

(If you actually want to use the line numbers, you should subtract two. It's just 'Use my_database' and then a blank line before the code.)

Old Pedant
04-29-2009, 05:49 AM
You should try to run the SQL code in a MySQL utility, even from the command line if need be, and see if you can get a better diagnostic. Too often, server-side languages such as PHP obscure the real problems.

And I believe I showed you why you get the syntax error in Fumigator's code. He was missing a field in the SELECT.

I do see one typo in my code: A missing comma in the "FROM table AS T" line. But again, if you still get an error after changing it, you should use a MySQL tool to get better diagnostics. No idea if that's the only error. I'm just typing it all in off the top of my head, with no way to debug it.

SELECT MAIN.*
FROM table AS MAIN,
( SELECT subid, MAX(visdate) AS oneDate
FROM table
GROUP BY subid
UNION
SELECT T.subid, MAX(T.visdate) AS oneDate
FROM table AS T,
( SELECT subid, MAX(visdate) AS topDate
FROM table
GROUP BY subid ) AS X
WHERE T.subid = X.subid AND T.visdate < X.topDate
) AS U
WHERE MAIN.subid = U.subid AND MAIN.visdate = U.oneDate
ORDER BY MAIN.subid, MAIN.visdate DESC

Fumigator
04-29-2009, 03:32 PM
Yeap sorry I was in the middle of my post/solution and got sidetracked and had to leave really quick so I just posted the reply thinking you could work out any minor errors and fill in the blanks. I hope the concept came across though, that's the imporant thing.

04-30-2009, 03:25 PM
SELECT yt.subid,
yt.visdate,
score1,
score2,
score3,
score4
FROM
yourtable yt
INNER JOIN
yourtable yt2
ON
yt.subid = yt2.subid
AND
yt.visdate <= yt2.visdate
GROUP BY
yt.subid,
yt.visdate
HAVING
Count(*) <= 2
ORDER BY
yt.subid,
yt.visdate DESC

by the way you need to redesign your table if it really looks similar to that. any time you find yourself with repeating columns like your scores columns, your data isn't normalized.

Old Pedant
04-30-2009, 07:53 PM

HOW does that query work, Guelph???

AHHHH...Never mind. Figured it out.

WOW! Very clever.

If others want to figure it out, try this variation on it:

SELECT yt.subid,
yt.visdate,
count(*) as what
FROM
yourtable yt
INNER JOIN
yourtable yt2
ON
yt.subid = yt2.subid
AND
yt.visdate <= yt2.visdate
GROUP BY
yt.subid,
yt.visdate

ORDER BY
yt.subid,
yt.visdate DESC;

Note that I purposely left out the HAVING to make it clearer. And added in a display of COUNT(*) so you can see why it all works.

Now I'm curious: Is this more efficient than the method I proposed??? Does it depend on which fields are indexed?? The HAVING clause there can't be cheap, because it means that the SELECT has to create results for *ALL* possible subid/visdate combinations and then eliminate all those where the COUNT(*) is too big. I'd guess this one depends a *LOT* on how smart the query engine is.

I found another error in my query. A missing GROUP BY:

SELECT MAIN.*
FROM yourtable AS MAIN,
( SELECT subid, MAX(visdate) AS oneDate
FROM yourtable
GROUP BY subid
UNION
SELECT T.subid, MAX(T.visdate) AS oneDate
FROM yourtable AS T,
( SELECT subid, MAX(visdate) AS topDate
FROM yourtable
GROUP BY subid ) AS X
WHERE T.subid = X.subid AND T.visdate < X.topDate
GROUP BY T.subid
) AS U
WHERE MAIN.subid = U.subid AND MAIN.visdate = U.oneDate
ORDER BY MAIN.subid, MAIN.visdate DESC

But yes, my query gives the same results as Guelph's.

sferson
05-05-2009, 07:26 PM
Thank you all!
I got pulled away from the project I need this for for a few days. Now that I am working on it again, I'll try these suggestions.

(BTW, guelphdad, thanks for looking out for my data. But the Score 1 -- Score 2 -- etc. was just to replace the actual name of the variables. It's a battery of different tests, so every score would be from a unique test.)

Old Pedant
05-05-2009, 07:39 PM
I'm not convinced that GuelphDad isn't still right about the table design.

What happens next week/month/year/decade when somebody decides to drop one of those tests from the battery and add three more???

With your DB design, you have to go in an modify the database. UGH! Ugly!

If, instead, you had a separate SCORES table then adding/removing scores requires no change at all in the DB schema.

Yes, yes, I know you can say with 100% certainty that the set of tests will never change. That's what all those people who had only two slots in their tables for SAT scores said for what, 30 years?

It's all about NORMALIZATION and you won't be the last person to think "that doesn't apply to me."

sferson
05-05-2009, 08:01 PM
Well, I guess I don't belong here because a) None of your suggestions worked for me and b) I don't see a problem with the table. What do you mean a separate Scores table? This IS the Scores table.

sferson
05-05-2009, 08:09 PM
Nevermind, while I was off doing something else it dawned on my what you were talking about. Well, I didn't build this table. Someone else did. When we redesign the system, I'll suggest a change.

sferson
05-05-2009, 08:27 PM
I'm not convinced that GuelphDad isn't still right about the table design.

What happens next week/month/year/decade when somebody decides to drop one of those tests from the battery and add three more???

With your DB design, you have to go in an modify the database. UGH! Ugly!

If, instead, you had a separate SCORES table then adding/removing scores requires no change at all in the DB schema.

Yes, yes, I know you can say with 100% certainty that the set of tests will never change. That's what all those people who had only two slots in their tables for SAT scores said for what, 30 years?

It's all about NORMALIZATION and you won't be the last person to think "that doesn't apply to me."

By the way, I think you could have found a nicer way to present this. I got pretty offended at the idea that I am actually thinking "that doesn't apply to me" or 'Gee. The tests we do will never change so they're wrong.' If what someone says makes sense, I am willing to listen. I just wasn't thinking about it. I was focusing on trying to get some SQL code to work for me, not focusing on redesigning the table (and we are currently working on a system redesign anyway, so please don't tell me we need to do it NOW, because we are doing it NOW). So before going off on a rant about how someone is ignoring your advice and just being defiant, perhaps you should make your argument without the accusations.

05-05-2009, 08:44 PM
Now I'm curious: Is this more efficient than the method I proposed??? Does it depend on which fields are indexed?? The HAVING clause there can't be cheap, because it means that the SELECT has to create results for *ALL* possible subid/visdate combinations and then eliminate all those where the COUNT(*) is too big. I'd guess this one depends a *LOT* on how smart the query engine is.

Yours is fine, but the one with the HAVING CLAUSE will scale better. What if the OP had asked for top 5 scores for each player for instance. Only difference in mine is changing the having clause.

Fumigator
05-05-2009, 09:43 PM
By the way, I think you could have found a nicer way to present this. I got pretty offended at the idea that I am actually thinking "that doesn't apply to me" or 'Gee. The tests we do will never change so they're wrong.' If what someone says makes sense, I am willing to listen. I just wasn't thinking about it. I was focusing on trying to get some SQL code to work for me, not focusing on redesigning the table (and we are currently working on a system redesign anyway, so please don't tell me we need to do it NOW, because we are doing it NOW). So before going off on a rant about how someone is ignoring your advice and just being defiant, perhaps you should make your argument without the accusations.

Sferson... don't take it too personally (http://dictionary.reference.com/browse/pedant) ;)

Old Pedant
05-06-2009, 08:01 PM
I apologize for any perceived insult or slam at you personally.

Those were "generic" remarks. The same kind that I read when I was starting out in DB design. Guidelines. General rules. NOT a demand, at all.

Heck, there *ARE* times when it is good to violate normalization. Some of the most important tables are purposely non-normalized, just to improve performance. In fact, I'm building a "caching" system right now where I grab data from a bunch of normalized tables and cache it in faster-to-fetch form.

And notice that I did *NOT* say categorically that the design was wrong.

I quote myself: "I'm not convinced that GuelphDad isn't still right..." I really did purposely word it that way, just because I was trying to say that you might well have the right design under your particular circumstances. The rest of what I said was only to justify my opinion, to say WHY I wasn't convinced.

So you aren't the first to take me to task for how I say things. I really believe that if we'd had this conversation face-to-face that you would find I was more just questioning than admonishing, and that I would quickly agree that there are always circumstances that call for other answers.

Again, apologies. If I ever do it to you again, feel free to slap me upside the head to remind me.

Ang\$t
05-06-2009, 08:29 PM
Hey Guys
I have an almost identical problem but I only need the most recent not the 2 most recent for each id. I am working in m\$ Sequel Server and have gotten the original solution from Old Pendant to work with a couple tweeks but not sure how to modify it to return just the most recent. Thanks in advance
A

Old Pedant
05-07-2009, 12:25 AM
Returning just the most recent is easy, comparatively.

If you mean ONE record, total, it's just:

SELECT TOP 1 * FROM table ORDER BY datefield DESC

If you mean you want most recent PER NAME (to pick an example) you could either use my solution without the UNION to the second table or GuelphDad's, just changing the 2 to a 1.

I'm pretty sure my solution will be more efficient, for only one per name, but either should work.

Let's say your table has these fields:

Table: Posts
postedby : int, FK to posters table
whenposted : datetime
subject : text
post : text

Table: Posters
posterid : int, PK
name : text

And you wanted the most recent post for each person, showing their name and the subject of the post, in alphabetic order.

SO:

SELECT Posters.Name, Posts.whenPosted, Posts.subject
FROM Posters,
Posts,
( SELECT postedby, MAX(whenPosted) AS lastpost
FROM Posts
GROUP BY postedby ) AS PLast
WHERE Posts.postedby = Posters.posterid
AND Posts.postedby = PLast.postedby
AND Posts.whenPosted = PLast.lastpost
ORDER BY Posters.Name

The only way that would have a problem would be if a single given poster had two posts at exactly the same time. Assuming that your WHENPOSTED field is accurate to the second (or better, in most systems!), this should never happen.

Ang\$t
05-07-2009, 02:25 PM
Thanks for such a quick reply Old Pedant. I need to return results from two tables so a Find-Replace got your example to work just fine. Still trying to figure out exactly what it is doing but it does what I need it to. Thanks again.

05-07-2009, 02:41 PM
Old Pedant, don't forget you can't use TOP with mysql!

Old Pedant
05-07-2009, 09:26 PM
Yes, but Ang\$t stated "I am working in m\$ Sequel [sic] Server."

Probably should have asked to have the subthread moved to the "other databases" forum?

05-08-2009, 03:14 AM
yeah I missed that. probably could split off but the answer is there now. no worries.

SQLforGirls
12-15-2009, 07:51 PM
I know this thread is old, but I needed it, found it, used it, and figured I'd enter my feedback for posterity.

First, THANK YOU for having this discussion. I found what I needed so quickly.

I am in a good position to test the efficiency of the methods suggested by both Old Pedant and Guelph. In my testing, Old Pedant wins.

I have a table of transactions. 21 million transactions from about 4 million unique users. Both methods provided identical (and from what I can tell so far accurate) results. I am using MSSQL 2005 (I know it's the wrong forum, but comparable, all things considered), selecting user_id and visit_dt into a #temp table

Guelph's: (10-15 minutes depending on server activity)
SELECT yt.user_id,
yt.visit_dt
into #recent_2_visits
FROM
visits yt
INNER JOIN visits yt2 ON yt.user_id = yt2.user_id
AND yt.visit_dt <= yt2.visit_dt
GROUP BY
yt.user_id,
yt.visit_dt
having count(1) <= 2
ORDER BY
yt.user_id,
yt.visit_dt DESC
-- 6,925,306 in 10 min 35 sec

Old Pedant's: (40-60 seconds consistently!!)
SELECT MAIN.user_id, Main.visit_dt
into #recent_2_visits_2
FROM visits AS MAIN,
( SELECT profile_id, MAX(visit_dt) AS oneDate
FROM visits
GROUP BY user_id
UNION
SELECT T.user_id, MAX(T.visit_dt) AS oneDate
FROM visits AS T,
( SELECT user_id, MAX(visit_dt) AS topDate
FROM visits
GROUP BY user_id ) AS X
WHERE T.user_id = X.user_id AND T.visit_dt < X.topDate
GROUP BY T.usere_id
) AS U
WHERE MAIN.user_id = U.user_id AND MAIN.visit_dt = U.oneDate
ORDER BY MAIN.user_id, MAIN.visit_dt DESC
-- 6,925,306 in 40 sec

Thanks again!!
-SfG

Old Pedant
12-15-2009, 08:37 PM
Well, I thought mine would be more efficient, but never thought it would be that much so. Very interesting.

I was so impressed by the cleverness of Guelph's answer that I was ready to concede to it just on that basis. <grin/>