...

View Full Version : Generic query question



arekanderu
03-28-2009, 03:08 AM
Hello,

I was looking at a friends code and came up with a problem. We solved it by changing the structure a bit and simplified the query but I want to share with you the original implementation and give me your opinions how the best query would be.

Two Tables:



CREATE TABLE `stuff` (
`stuff_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`link` VARCHAR( 220 ) NOT NULL
) ENGINE = MYISAM




CREATE TABLE `votes` (
`vote_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`stuff_id` INT( 10 ) NOT NULL ,
`count` INT( 5 ) NOT NULL DEFAULT '1',
`date` DATETIME NULL,
INDEX ( `stuff_id` )
) ENGINE = MYISAM


Data:



INSERT INTO `stuff` (
`stuff_id` ,
`link`
)
VALUES (
NULL , 'http://www.whatever1.com'
), (
NULL , 'http://www.whatever2.com'
);




INSERT INTO `votes` (
`vote_id` ,
`stuff_id` ,
`count`
)
VALUES ( NULL , '1', '1' ), (NULL , '2', '1'), (NULL , '2', '1');


OK. So, the flow of the whole thing went like this. He had a list of links at the page and every time a user clicked on one of them, he INSERTED a NEW vote at the votes table with a default count value = 1 and date = now(). Please note here that the count is always '1' and no UPDATES are occuring; only INSERTS.

He then had another page which he requested from the database to find the link from 'stuff' which had the maximum value from the SUM of all votes which matched the same stuff_id. So, the query is the following:



SELECT link, SUM(`count`) as totalCount
FROM stuff
INNER JOIN votes
USING(stuff_id)
GROUP BY stuff_id
ORDER BY totalCount DESC
LIMIT 0,1


Please ignore the ultimate stupidity which surrounds the whole thing and focus on the query itself. The question is, is there a way to change the query (and without changing the tables structure) in order to get better performance? If you have a look at the EXPLAIN you will see that it's very very very bad!

Old Pedant
03-28-2009, 05:46 AM
Hmmm...am I allowed to specify indexes, if they aren't already there???

Because if I could put an index (duplicates allowed) on votes.stuff_id, and since the value of Sum(`count`) is clearly the same as the value of COUNT(*), then one would hope that the query engine could do the COUNT(*) from right in the index.

So:


SELECT link, COUNT(*) AS cnt
FROM stuff, votes
WHERE stuff.stuff_id = votes.stuff_id
GROUP BY votes.stuff_id
ORDER BY cnt DESC
LIMIT 1


If that doesn't force MySQL to use the index for doing the COUNT(*), then you could do it via an inner SELECT, presumably:


SELECT link, X.cnt
FROM stuff, (
SELECT stuff_id, COUNT(*) AS cnt
FROM votes GROUP BY sutff_id ) AS X
WHERE stuff.stuff_id = X.stuff_id
ORDER BY X.cnt DESC
LIMIT 1


Dunno how smart the MySQL engine gets about stuff like that.

Only works (if it does) because the count field is always 1.

Old Pedant
03-28-2009, 05:52 AM
I do have to ask:

In your INSERTs, (a) WHY are you specifying the autoincrement field, at all? (b) WHY are you using STRINGS for the value of INT fields???

I would have done:


INSERT INTO stuff ( `link`)
VALUES ( 'http://www.whatever1.com' ),
( 'http://www.whatever2.com' );

INSERT INTO votes ( stuff_id, `count` )
VALUES (1,1), (2,1), (2,1);


The code looks like a MySQL DUMP output, but surely mysqldump wouldn't use text strings fro integer values??

Curiosity, only.

arekanderu
03-28-2009, 12:26 PM
Thanks for your time Old Pedant. The insert i copy pasted them from phpMyAdmin (which were automatically produced) as you correctly said.

I also tried the alternative you suggest (yes you are allowed to add indexes by the way) but I wasn't sure if it was better anyway.

If you have a look at the EXPLAIN of your query it produces this:



id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 2 Using filesort
1 PRIMARY stuff eq_ref PRIMARY PRIMARY 4 X.stuff_id 1
2 DERIVED votes index NULL stuff_id 4 NULL 3 Using index


And the original query produces this:



d select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE stuff ALL PRIMARY NULL NULL NULL 2 Using temporary; Using filesort
1 SIMPLE votes ALL stuff_id NULL NULL NULL 3 Using where; Using join buffer


Can you please share me your opinion?

Old Pedant
03-28-2009, 06:48 PM
I don't use PHP. I have to say I don't think much of phpMyAdmin if it puts those apostrophes around numeric values in the dump. Tch. Luckily, MySQL is flexible enough to do the text-to-int conversion, but it shouldn't have to.

********
Which one of my two queries is that the EXPLAIN for???

It *looks* like the second one, because of the <derived2>.

Would be interested in seeing EXPLAIN for both versions.

Don't have enough experience with MySQL performance tuning to be sure, but the docs for EXPLAIN *clearly* say that the WORST possible "type" you can find in a row is "ALL", and the original query has not one but *TWO* of those.

My version indeed produces one ALL, but remember that this is on the <derived2> table, which I assume is the inner SELECT in my second version. And that derived table has far far fewer rows in it than the original table.

Gut feeling: My version is faster. Did you benchmark it???

Still be interesting to see what my other query does.

And be curious to see an actual benchmark.

arekanderu
03-28-2009, 08:12 PM
OK, i'll show you the EXPLAIN for all the queries:

This is for the query i wrote at my original post:



id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE stuff ALL PRIMARY NULL NULL NULL 2 Using temporary; Using filesort
1 SIMPLE votes ALL stuff_id NULL NULL NULL 3 Using where; Using join buffer


This query:



SELECT link, COUNT(*) AS cnt
FROM stuff, votes
WHERE stuff.stuff_id = votes.stuff_id
GROUP BY votes.stuff_id
ORDER BY cnt DESC
LIMIT 1


produces this:



id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE stuff ALL PRIMARY NULL NULL NULL 2 Using temporary; Using filesort
1 SIMPLE votes ref stuff_id stuff_id 4 testing.stuff.stuff_id 2 Using index


This query:



SELECT link, X.cnt
FROM stuff, (
SELECT stuff_id, COUNT(*) AS cnt
FROM votes GROUP BY sutff_id ) AS X
WHERE stuff.stuff_id = X.stuff_id
ORDER BY X.cnt DESC
LIMIT 1


produces this EXPLAIN:



id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 2 Using filesort
1 PRIMARY stuff eq_ref PRIMARY PRIMARY 4 X.stuff_id 1
2 DERIVED votes index NULL stuff_id 4 NULL 3 Using index


The original query i wrote and the 2nd one from the ones you wrote seem to have much bigger execution time than the one with COUNT.

So you were right and the one with COUNT is indeed faster. Can you please be a little bit more analytic why this is happening in order to be sure that I've understood?
Thank you for your time by the way

Old Pedant
03-28-2009, 08:39 PM
I'm a little surprised that the times for my two queries weren't identical or nearly so, but that's probalby due to my SQL Server experience. I think SQL Server would have optimized either to match the other.

Anyway...

The reason using COUNT(*) is faster when you use an index on votes.stuff_id is simple: In order to get the value of COUNT(*) for *EACH* of those separate votes.stuff_id values, it does *NOT* need to look in the actual votes table, at all! It only needs to open up the INDEX for votes.stuff_id. And the counting is much much faster, in any case, because the index will be in stuff_id order.

So instead of having to go through records such as


id :: stuff_id :: count
1 :: 173 :: 1
2 :: 921 :: 1
3 :: 173 :: 1
4 :: 801 :: 1
5 :: 173 :: 1
... etc. ...

where the stuff_id values are in random order (meaning that it not only has to scan all the records, it has to indeed create a temp internal table that is organized by stuff_id and that holds those SUM values), it simply has to scan an index that looks like


stuff_id :: id [would actually be internal pointer, but id works for demo]
173 :: 1
173 :: 3
173 :: 5
801 :: 4
921 :: 2
...

You can see that converting *THAT* into


stuff_id :: count(*)
173 :: 3
801 :: 1
921 :: 1
...

can be done lightning fast. Yes, it's a full scan of the index. But building the output temp internal table is about as simple a process as it could be.

If the query optimizer is really really smart (as I think SQL Server could be, but dunno about MySQL), it would realize that it only needs *ONE* result, so it wouldn't even need to build a temp internal table!

That is, the pseudo code to find the top count might be as simple as:


var topCount = 0;
var topValue = null;
var priorValue = null;
var count = 0;
for ( i = 0; i < index.length; ++i )
{
var curValue = index[i];
if ( curValue != priorValue )
{
if ( count > topCount )
{
topCount = count;
topValue = priorValue;
}
priorValue = curValue;
count = 0;
}
}
if ( count > topCount )
{
topCount = count;
topValue = priorValue;
}

And topValue gives you the stuff_id you need to join to the other table.

I doubt seriously that MySQL optimizes things that far for this kind of special case, but the point is that it could.

arekanderu
03-30-2009, 06:23 PM
Thank you for your detailed explanation, it was very helpful.

You queries btw where almost identical in speed; I made a mistake at my previous benchmark :)



EZ Archive Ads Plugin for vBulletin Copyright 2006 Computer Help Forum