Hello and welcome to our community! Is this your first visit?
Register
Enjoy an ad free experience by logging in. Not a member yet? Register.
Results 1 to 8 of 8
  1. #1
    New Coder
    Join Date
    Feb 2007
    Posts
    38
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Generic query question

    Hello,

    I was looking at a friends code and came up with a problem. We solved it by changing the structure a bit and simplified the query but I want to share with you the original implementation and give me your opinions how the best query would be.

    Two Tables:

    Code:
     CREATE TABLE `stuff` (
    `stuff_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
    `link` VARCHAR( 220 ) NOT NULL
    ) ENGINE = MYISAM
    Code:
     CREATE TABLE `votes` (
    `vote_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
    `stuff_id` INT( 10 ) NOT NULL ,
    `count` INT( 5 ) NOT NULL DEFAULT '1',
     `date` DATETIME NULL,
    INDEX ( `stuff_id` )
    ) ENGINE = MYISAM
    Data:

    Code:
    INSERT INTO `stuff` (
    `stuff_id` ,
    `link`
    )
    VALUES (
    NULL , 'http://www.whatever1.com'
    ), (
    NULL , 'http://www.whatever2.com'
    );
    Code:
    INSERT INTO `votes` (
    `vote_id` ,
    `stuff_id` ,
    `count`
    )
    VALUES ( NULL , '1', '1' ), (NULL , '2', '1'), (NULL , '2', '1');
    OK. So, the flow of the whole thing went like this. He had a list of links at the page and every time a user clicked on one of them, he INSERTED a NEW vote at the votes table with a default count value = 1 and date = now(). Please note here that the count is always '1' and no UPDATES are occuring; only INSERTS.

    He then had another page which he requested from the database to find the link from 'stuff' which had the maximum value from the SUM of all votes which matched the same stuff_id. So, the query is the following:

    Code:
    SELECT link, SUM(`count`) as totalCount
    FROM stuff
    INNER JOIN votes
    USING(stuff_id)
    GROUP BY stuff_id
    ORDER BY totalCount DESC
    LIMIT 0,1
    Please ignore the ultimate stupidity which surrounds the whole thing and focus on the query itself. The question is, is there a way to change the query (and without changing the tables structure) in order to get better performance? If you have a look at the EXPLAIN you will see that it's very very very bad!

  • #2
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    25,538
    Thanks
    77
    Thanked 4,381 Times in 4,346 Posts
    Hmmm...am I allowed to specify indexes, if they aren't already there???

    Because if I could put an index (duplicates allowed) on votes.stuff_id, and since the value of Sum(`count`) is clearly the same as the value of COUNT(*), then one would hope that the query engine could do the COUNT(*) from right in the index.

    So:
    Code:
    SELECT link, COUNT(*) AS cnt
    FROM stuff, votes 
    WHERE stuff.stuff_id = votes.stuff_id
    GROUP BY votes.stuff_id
    ORDER BY cnt DESC
    LIMIT 1
    If that doesn't force MySQL to use the index for doing the COUNT(*), then you could do it via an inner SELECT, presumably:
    Code:
    SELECT link, X.cnt
    FROM stuff, (
        SELECT stuff_id, COUNT(*) AS cnt
        FROM votes GROUP BY sutff_id ) AS X
    WHERE stuff.stuff_id = X.stuff_id
    ORDER BY X.cnt DESC
    LIMIT 1
    Dunno how smart the MySQL engine gets about stuff like that.

    Only works (if it does) because the count field is always 1.

  • #3
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    25,538
    Thanks
    77
    Thanked 4,381 Times in 4,346 Posts
    I do have to ask:

    In your INSERTs, (a) WHY are you specifying the autoincrement field, at all? (b) WHY are you using STRINGS for the value of INT fields???

    I would have done:
    Code:
    INSERT INTO stuff ( `link`)
    VALUES ( 'http://www.whatever1.com' ), 
           ( 'http://www.whatever2.com' );
    
    INSERT INTO votes ( stuff_id, `count` )
    VALUES (1,1), (2,1), (2,1);
    The code looks like a MySQL DUMP output, but surely mysqldump wouldn't use text strings fro integer values??

    Curiosity, only.

  • #4
    New Coder
    Join Date
    Feb 2007
    Posts
    38
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Thanks for your time Old Pedant. The insert i copy pasted them from phpMyAdmin (which were automatically produced) as you correctly said.

    I also tried the alternative you suggest (yes you are allowed to add indexes by the way) but I wasn't sure if it was better anyway.

    If you have a look at the EXPLAIN of your query it produces this:

    Code:
    id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
    1 	PRIMARY 	<derived2> 	ALL 	NULL 	NULL 	NULL 	NULL 	2 	Using filesort
    1 	PRIMARY 	stuff 	eq_ref 	PRIMARY 	PRIMARY 	4 	X.stuff_id 	1 	 
    2 	DERIVED 	votes 	index 	NULL 	stuff_id 	4 	NULL 	3 	Using index
    And the original query produces this:

    Code:
    d 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
    1 	SIMPLE 	stuff 	ALL 	PRIMARY 	NULL 	NULL 	NULL 	2 	Using temporary; Using filesort
    1 	SIMPLE 	votes 	ALL 	stuff_id 	NULL 	NULL 	NULL 	3 	Using where; Using join buffer
    Can you please share me your opinion?
    Last edited by arekanderu; 03-28-2009 at 12:29 PM.

  • #5
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    25,538
    Thanks
    77
    Thanked 4,381 Times in 4,346 Posts
    I don't use PHP. I have to say I don't think much of phpMyAdmin if it puts those apostrophes around numeric values in the dump. Tch. Luckily, MySQL is flexible enough to do the text-to-int conversion, but it shouldn't have to.

    ********
    Which one of my two queries is that the EXPLAIN for???

    It *looks* like the second one, because of the <derived2>.

    Would be interested in seeing EXPLAIN for both versions.

    Don't have enough experience with MySQL performance tuning to be sure, but the docs for EXPLAIN *clearly* say that the WORST possible "type" you can find in a row is "ALL", and the original query has not one but *TWO* of those.

    My version indeed produces one ALL, but remember that this is on the <derived2> table, which I assume is the inner SELECT in my second version. And that derived table has far far fewer rows in it than the original table.

    Gut feeling: My version is faster. Did you benchmark it???

    Still be interesting to see what my other query does.

    And be curious to see an actual benchmark.

  • #6
    New Coder
    Join Date
    Feb 2007
    Posts
    38
    Thanks
    1
    Thanked 0 Times in 0 Posts
    OK, i'll show you the EXPLAIN for all the queries:

    This is for the query i wrote at my original post:

    Code:
    id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
    1 	SIMPLE 	stuff 	ALL 	PRIMARY 	NULL 	NULL 	NULL 	2 	Using temporary; Using filesort
    1 	SIMPLE 	votes 	ALL 	stuff_id 	NULL 	NULL 	NULL 	3 	Using where; Using join buffer
    This query:

    Code:
    SELECT link, COUNT(*) AS cnt
    FROM stuff, votes 
    WHERE stuff.stuff_id = votes.stuff_id
    GROUP BY votes.stuff_id
    ORDER BY cnt DESC
    LIMIT 1
    produces this:

    Code:
    id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
    1 	SIMPLE 	stuff 	ALL 	PRIMARY 	NULL 	NULL 	NULL 	2 	Using temporary; Using filesort
    1 	SIMPLE 	votes 	ref 	stuff_id 	stuff_id 	4 	testing.stuff.stuff_id 	2 	Using index
    This query:

    Code:
    SELECT link, X.cnt
    FROM stuff, (
        SELECT stuff_id, COUNT(*) AS cnt
        FROM votes GROUP BY sutff_id ) AS X
    WHERE stuff.stuff_id = X.stuff_id
    ORDER BY X.cnt DESC
    LIMIT 1
    produces this EXPLAIN:

    Code:
    id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
    1 	PRIMARY 	<derived2> 	ALL 	NULL 	NULL 	NULL 	NULL 	2 	Using filesort
    1 	PRIMARY 	stuff 	eq_ref 	PRIMARY 	PRIMARY 	4 	X.stuff_id 	1 	 
    2 	DERIVED 	votes 	index 	NULL 	stuff_id 	4 	NULL 	3 	Using index
    The original query i wrote and the 2nd one from the ones you wrote seem to have much bigger execution time than the one with COUNT.

    So you were right and the one with COUNT is indeed faster. Can you please be a little bit more analytic why this is happening in order to be sure that I've understood?
    Thank you for your time by the way
    Last edited by arekanderu; 03-28-2009 at 08:32 PM.

  • #7
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    25,538
    Thanks
    77
    Thanked 4,381 Times in 4,346 Posts
    I'm a little surprised that the times for my two queries weren't identical or nearly so, but that's probalby due to my SQL Server experience. I think SQL Server would have optimized either to match the other.

    Anyway...

    The reason using COUNT(*) is faster when you use an index on votes.stuff_id is simple: In order to get the value of COUNT(*) for *EACH* of those separate votes.stuff_id values, it does *NOT* need to look in the actual votes table, at all! It only needs to open up the INDEX for votes.stuff_id. And the counting is much much faster, in any case, because the index will be in stuff_id order.

    So instead of having to go through records such as
    Code:
    id :: stuff_id :: count
    1  ::   173  :: 1
    2  ::   921  :: 1
    3  ::   173  :: 1
    4  ::   801  :: 1
    5  ::   173  :: 1
    ... etc. ...
    where the stuff_id values are in random order (meaning that it not only has to scan all the records, it has to indeed create a temp internal table that is organized by stuff_id and that holds those SUM values), it simply has to scan an index that looks like
    Code:
    stuff_id :: id [would actually be internal pointer, but id works for demo]
    173   ::  1
    173   ::  3
    173   ::  5
    801   ::  4
    921   ::  2
    ...
    You can see that converting *THAT* into
    Code:
    stuff_id :: count(*)
    173   ::  3
    801   ::  1
    921   ::  1
    ...
    can be done lightning fast. Yes, it's a full scan of the index. But building the output temp internal table is about as simple a process as it could be.

    If the query optimizer is really really smart (as I think SQL Server could be, but dunno about MySQL), it would realize that it only needs *ONE* result, so it wouldn't even need to build a temp internal table!

    That is, the pseudo code to find the top count might be as simple as:
    Code:
    var topCount = 0;
    var topValue = null;
    var priorValue = null;
    var count = 0;
    for ( i = 0; i < index.length; ++i )
    {
         var curValue = index[i];
         if ( curValue != priorValue )
         {
             if ( count > topCount ) 
             {
                  topCount = count;
                  topValue = priorValue;
             }
             priorValue = curValue;
             count = 0;
        }
    }
    if ( count > topCount ) 
    {
         topCount = count;
         topValue = priorValue;
    }
    And topValue gives you the stuff_id you need to join to the other table.

    I doubt seriously that MySQL optimizes things that far for this kind of special case, but the point is that it could.

  • Users who have thanked Old Pedant for this post:

    arekanderu (03-30-2009)

  • #8
    New Coder
    Join Date
    Feb 2007
    Posts
    38
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Thank you for your detailed explanation, it was very helpful.

    You queries btw where almost identical in speed; I made a mistake at my previous benchmark


  •  

    Posting Permissions

    • You may not post new threads
    • You may not post replies
    • You may not post attachments
    • You may not edit your posts
    •