Go Back   CodingForums.com > :: Server side development > MySQL

Before you post, read our: Rules & Posting Guidelines

Reply
 
Thread Tools Rate Thread
Enjoy an ad free experience by logging in. Not a member yet? Register.
Old 03-28-2009, 03:08 AM   PM User | #1
arekanderu
New Coder

 
Join Date: Feb 2007
Posts: 38
Thanks: 1
Thanked 0 Times in 0 Posts
arekanderu is on a distinguished road
Generic query question

Hello,

I was looking at a friends code and came up with a problem. We solved it by changing the structure a bit and simplified the query but I want to share with you the original implementation and give me your opinions how the best query would be.

Two Tables:

Code:
 CREATE TABLE `stuff` (
`stuff_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`link` VARCHAR( 220 ) NOT NULL
) ENGINE = MYISAM
Code:
 CREATE TABLE `votes` (
`vote_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`stuff_id` INT( 10 ) NOT NULL ,
`count` INT( 5 ) NOT NULL DEFAULT '1',
 `date` DATETIME NULL,
INDEX ( `stuff_id` )
) ENGINE = MYISAM
Data:

Code:
INSERT INTO `stuff` (
`stuff_id` ,
`link`
)
VALUES (
NULL , 'http://www.whatever1.com'
), (
NULL , 'http://www.whatever2.com'
);
Code:
INSERT INTO `votes` (
`vote_id` ,
`stuff_id` ,
`count`
)
VALUES ( NULL , '1', '1' ), (NULL , '2', '1'), (NULL , '2', '1');
OK. So, the flow of the whole thing went like this. He had a list of links at the page and every time a user clicked on one of them, he INSERTED a NEW vote at the votes table with a default count value = 1 and date = now(). Please note here that the count is always '1' and no UPDATES are occuring; only INSERTS.

He then had another page which he requested from the database to find the link from 'stuff' which had the maximum value from the SUM of all votes which matched the same stuff_id. So, the query is the following:

Code:
SELECT link, SUM(`count`) as totalCount
FROM stuff
INNER JOIN votes
USING(stuff_id)
GROUP BY stuff_id
ORDER BY totalCount DESC
LIMIT 0,1
Please ignore the ultimate stupidity which surrounds the whole thing and focus on the query itself. The question is, is there a way to change the query (and without changing the tables structure) in order to get better performance? If you have a look at the EXPLAIN you will see that it's very very very bad!
arekanderu is offline   Reply With Quote
Old 03-28-2009, 05:46 AM   PM User | #2
Old Pedant
Supreme Master coder!

 
Old Pedant's Avatar
 
Join Date: Feb 2009
Posts: 23,162
Thanks: 59
Thanked 3,992 Times in 3,961 Posts
Old Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to all
Hmmm...am I allowed to specify indexes, if they aren't already there???

Because if I could put an index (duplicates allowed) on votes.stuff_id, and since the value of Sum(`count`) is clearly the same as the value of COUNT(*), then one would hope that the query engine could do the COUNT(*) from right in the index.

So:
Code:
SELECT link, COUNT(*) AS cnt
FROM stuff, votes 
WHERE stuff.stuff_id = votes.stuff_id
GROUP BY votes.stuff_id
ORDER BY cnt DESC
LIMIT 1
If that doesn't force MySQL to use the index for doing the COUNT(*), then you could do it via an inner SELECT, presumably:
Code:
SELECT link, X.cnt
FROM stuff, (
    SELECT stuff_id, COUNT(*) AS cnt
    FROM votes GROUP BY sutff_id ) AS X
WHERE stuff.stuff_id = X.stuff_id
ORDER BY X.cnt DESC
LIMIT 1
Dunno how smart the MySQL engine gets about stuff like that.

Only works (if it does) because the count field is always 1.
Old Pedant is offline   Reply With Quote
Old 03-28-2009, 05:52 AM   PM User | #3
Old Pedant
Supreme Master coder!

 
Old Pedant's Avatar
 
Join Date: Feb 2009
Posts: 23,162
Thanks: 59
Thanked 3,992 Times in 3,961 Posts
Old Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to all
I do have to ask:

In your INSERTs, (a) WHY are you specifying the autoincrement field, at all? (b) WHY are you using STRINGS for the value of INT fields???

I would have done:
Code:
INSERT INTO stuff ( `link`)
VALUES ( 'http://www.whatever1.com' ), 
       ( 'http://www.whatever2.com' );

INSERT INTO votes ( stuff_id, `count` )
VALUES (1,1), (2,1), (2,1);
The code looks like a MySQL DUMP output, but surely mysqldump wouldn't use text strings fro integer values??

Curiosity, only.
Old Pedant is offline   Reply With Quote
Old 03-28-2009, 12:26 PM   PM User | #4
arekanderu
New Coder

 
Join Date: Feb 2007
Posts: 38
Thanks: 1
Thanked 0 Times in 0 Posts
arekanderu is on a distinguished road
Thanks for your time Old Pedant. The insert i copy pasted them from phpMyAdmin (which were automatically produced) as you correctly said.

I also tried the alternative you suggest (yes you are allowed to add indexes by the way) but I wasn't sure if it was better anyway.

If you have a look at the EXPLAIN of your query it produces this:

Code:
id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	PRIMARY 	<derived2> 	ALL 	NULL 	NULL 	NULL 	NULL 	2 	Using filesort
1 	PRIMARY 	stuff 	eq_ref 	PRIMARY 	PRIMARY 	4 	X.stuff_id 	1 	 
2 	DERIVED 	votes 	index 	NULL 	stuff_id 	4 	NULL 	3 	Using index
And the original query produces this:

Code:
d 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	stuff 	ALL 	PRIMARY 	NULL 	NULL 	NULL 	2 	Using temporary; Using filesort
1 	SIMPLE 	votes 	ALL 	stuff_id 	NULL 	NULL 	NULL 	3 	Using where; Using join buffer
Can you please share me your opinion?

Last edited by arekanderu; 03-28-2009 at 12:29 PM..
arekanderu is offline   Reply With Quote
Old 03-28-2009, 06:48 PM   PM User | #5
Old Pedant
Supreme Master coder!

 
Old Pedant's Avatar
 
Join Date: Feb 2009
Posts: 23,162
Thanks: 59
Thanked 3,992 Times in 3,961 Posts
Old Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to all
I don't use PHP. I have to say I don't think much of phpMyAdmin if it puts those apostrophes around numeric values in the dump. Tch. Luckily, MySQL is flexible enough to do the text-to-int conversion, but it shouldn't have to.

********
Which one of my two queries is that the EXPLAIN for???

It *looks* like the second one, because of the <derived2>.

Would be interested in seeing EXPLAIN for both versions.

Don't have enough experience with MySQL performance tuning to be sure, but the docs for EXPLAIN *clearly* say that the WORST possible "type" you can find in a row is "ALL", and the original query has not one but *TWO* of those.

My version indeed produces one ALL, but remember that this is on the <derived2> table, which I assume is the inner SELECT in my second version. And that derived table has far far fewer rows in it than the original table.

Gut feeling: My version is faster. Did you benchmark it???

Still be interesting to see what my other query does.

And be curious to see an actual benchmark.
Old Pedant is offline   Reply With Quote
Old 03-28-2009, 08:12 PM   PM User | #6
arekanderu
New Coder

 
Join Date: Feb 2007
Posts: 38
Thanks: 1
Thanked 0 Times in 0 Posts
arekanderu is on a distinguished road
OK, i'll show you the EXPLAIN for all the queries:

This is for the query i wrote at my original post:

Code:
id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	stuff 	ALL 	PRIMARY 	NULL 	NULL 	NULL 	2 	Using temporary; Using filesort
1 	SIMPLE 	votes 	ALL 	stuff_id 	NULL 	NULL 	NULL 	3 	Using where; Using join buffer
This query:

Code:
SELECT link, COUNT(*) AS cnt
FROM stuff, votes 
WHERE stuff.stuff_id = votes.stuff_id
GROUP BY votes.stuff_id
ORDER BY cnt DESC
LIMIT 1
produces this:

Code:
id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	stuff 	ALL 	PRIMARY 	NULL 	NULL 	NULL 	2 	Using temporary; Using filesort
1 	SIMPLE 	votes 	ref 	stuff_id 	stuff_id 	4 	testing.stuff.stuff_id 	2 	Using index
This query:

Code:
SELECT link, X.cnt
FROM stuff, (
    SELECT stuff_id, COUNT(*) AS cnt
    FROM votes GROUP BY sutff_id ) AS X
WHERE stuff.stuff_id = X.stuff_id
ORDER BY X.cnt DESC
LIMIT 1
produces this EXPLAIN:

Code:
id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	PRIMARY 	<derived2> 	ALL 	NULL 	NULL 	NULL 	NULL 	2 	Using filesort
1 	PRIMARY 	stuff 	eq_ref 	PRIMARY 	PRIMARY 	4 	X.stuff_id 	1 	 
2 	DERIVED 	votes 	index 	NULL 	stuff_id 	4 	NULL 	3 	Using index
The original query i wrote and the 2nd one from the ones you wrote seem to have much bigger execution time than the one with COUNT.

So you were right and the one with COUNT is indeed faster. Can you please be a little bit more analytic why this is happening in order to be sure that I've understood?
Thank you for your time by the way

Last edited by arekanderu; 03-28-2009 at 08:32 PM..
arekanderu is offline   Reply With Quote
Old 03-28-2009, 08:39 PM   PM User | #7
Old Pedant
Supreme Master coder!

 
Old Pedant's Avatar
 
Join Date: Feb 2009
Posts: 23,162
Thanks: 59
Thanked 3,992 Times in 3,961 Posts
Old Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to allOld Pedant is a name known to all
I'm a little surprised that the times for my two queries weren't identical or nearly so, but that's probalby due to my SQL Server experience. I think SQL Server would have optimized either to match the other.

Anyway...

The reason using COUNT(*) is faster when you use an index on votes.stuff_id is simple: In order to get the value of COUNT(*) for *EACH* of those separate votes.stuff_id values, it does *NOT* need to look in the actual votes table, at all! It only needs to open up the INDEX for votes.stuff_id. And the counting is much much faster, in any case, because the index will be in stuff_id order.

So instead of having to go through records such as
Code:
id :: stuff_id :: count
1  ::   173  :: 1
2  ::   921  :: 1
3  ::   173  :: 1
4  ::   801  :: 1
5  ::   173  :: 1
... etc. ...
where the stuff_id values are in random order (meaning that it not only has to scan all the records, it has to indeed create a temp internal table that is organized by stuff_id and that holds those SUM values), it simply has to scan an index that looks like
Code:
stuff_id :: id [would actually be internal pointer, but id works for demo]
173   ::  1
173   ::  3
173   ::  5
801   ::  4
921   ::  2
...
You can see that converting *THAT* into
Code:
stuff_id :: count(*)
173   ::  3
801   ::  1
921   ::  1
...
can be done lightning fast. Yes, it's a full scan of the index. But building the output temp internal table is about as simple a process as it could be.

If the query optimizer is really really smart (as I think SQL Server could be, but dunno about MySQL), it would realize that it only needs *ONE* result, so it wouldn't even need to build a temp internal table!

That is, the pseudo code to find the top count might be as simple as:
Code:
var topCount = 0;
var topValue = null;
var priorValue = null;
var count = 0;
for ( i = 0; i < index.length; ++i )
{
     var curValue = index[i];
     if ( curValue != priorValue )
     {
         if ( count > topCount ) 
         {
              topCount = count;
              topValue = priorValue;
         }
         priorValue = curValue;
         count = 0;
    }
}
if ( count > topCount ) 
{
     topCount = count;
     topValue = priorValue;
}
And topValue gives you the stuff_id you need to join to the other table.

I doubt seriously that MySQL optimizes things that far for this kind of special case, but the point is that it could.
Old Pedant is offline   Reply With Quote
Users who have thanked Old Pedant for this post:
arekanderu (03-30-2009)
Old 03-30-2009, 06:23 PM   PM User | #8
arekanderu
New Coder

 
Join Date: Feb 2007
Posts: 38
Thanks: 1
Thanked 0 Times in 0 Posts
arekanderu is on a distinguished road
Thank you for your detailed explanation, it was very helpful.

You queries btw where almost identical in speed; I made a mistake at my previous benchmark
arekanderu is offline   Reply With Quote
Reply

Bookmarks

Jump To Top of Thread


Thread Tools
Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +1. The time now is 06:20 PM.


Advertisement
Log in to turn off these ads.