mysql subquery for sum with limit using filesort

Question

jeffsbaker 0 Newbie Poster

13 Years Ago

Does anyone know why the following subquery is using filesort and scanning all rows in the table according to EXPLAIN? I am trying to get the total of the top ten scores and the result is fine.

EXPLAIN SELECT SUM( score ) AS top10_total
FROM (
SELECT score
FROM answers
WHERE question_id = '4'
ORDER BY score DESC 
LIMIT 10
) AS subquery

id  select_type  table      type  possible_keys  key         key_len  ref  rows  Extra  
1   PRIMARY      <derived2> ALL   NULL           NULL         NULL    NULL 9   
2   DERIVED      answers    ALL   question_id    question_id  5            547   Using filesort

Is there any way to get the same result (a sum of the top 10 votes) without the query scanning all rows in the table?

Thanks

limit math mysql query

3 Contributors
11 Replies
1K Views
4 Days Discussion Span
Latest Post 13 Years Ago Latest Post by tesuji

rch1231 169 Posting Shark

13 Years Ago

Hello,

This link to the mysql dev site will explain where the filesort is coming from:

http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html

As far as a way to get the results without querying the whole table, if you have an field with the number of wrong answers or with 100-score then you could build an index on question_id and that field and then limit 10 and it should only hit those 10. Sounds good anyway......

rch1231 169 Posting Shark

13 Years Ago

If you have an id field on each record then you could use:

#
SELECT SUM( score ) AS top10_total
FROM answers
WHERE record_id in 
(select record_id from answers 
where question_id=4
order by score DESC)
limit 10

That will do a query first to get only the answers for question 4 and put them in order by descending score, then select the first 10 and sum them

tesuji 135 Master Poster

13 Years Ago

Let me try to explain what happen when your query executes:

To filter out all rows having '4' key question_id is used. 547 rows are affected. Next step is sorting the result set in descending order on column score. Obviously there is no Index/key on this column which could support this sort. Therefore all rows in resultset must be explicitly sorted by tablesort. Finally the sorted resultset is skrinked to ten rows having the highest score values in descending order. This shrinked resultset is returned in from clause where the sum() is calculated from.

Explicit tablesort carried out by a fast sorting method like mod. quicksort or mergesort must always be done in the context of group-by or order-by clauses if no convenient index or key is available which would speed up sorting considerably.

Using a subselect is completely correct for the sum() should be done over the 10 highest score values. If you write select any_aggregate_function(...) from answers limit 10, then any_aggregate_function(...) is always carried out on ALL rows of table answers. Limit 10 is useless in this case.

You may try count(*): select count(*) from answers where question_id = '4'; What is the result?

-- tesu

Edited 13 Years Ago by tesuji because: n/a

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

jeffsbaker 0 Newbie Poster · Answer 1 · 2010-08-11T23:28:21+00:00

Hello,
This link to the mysql dev site will explain where the filesort is coming from:
http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
As far as a way to get the results without querying the whole table, if you have an field with the number of wrong answers or with 100-score then you could build an index on question_id and that field and then limit 10 and it should only hit those 10. Sounds good anyway......

Thanks for the resource. I have actually read that page before, but it does not seem to have a solution for my particular problem. Actually the table is not a list of right and wrong answers. It is more like a poll. I'm trying to get the top 10 voted for answers (the score for those answers) added together to do a percentage.

There already is an index on question_id and EXPLAIN says that it is using that index. But it also says it is scanning 547 rows. The funny thing is if I just do the subquery by itself like so:

EXPLAIN SELECT SUM( score ) AS top10_total
FROM answers
WHERE question_id = '4'
ORDER BY score DESC
LIMIT 10

then EXPLAIN says it is only scanning 14 rows. But of course the top10_total is wrong because it is summing all rows of question_id instead of limiting it to 10. That is why I had to make it a subquery. Any other suggestions?

jeffsbaker 0 Newbie Poster · Answer 2 · 2010-08-14T02:55:28+00:00

If you have an id field on each record then you could use:
#
SELECT SUM( score ) AS top10_total
FROM answers
WHERE record_id in 
(select record_id from answers 
where question_id=4
order by score DESC)
limit 10
That will do a query first to get only the answers for question 4 and put them in order by descending score, then select the first 10 and sum them

Yes, I do have an id field on each record. But I tried your query above and it gave me the sum of all scores for question_id=4, not just the top 10.

jeffsbaker 0 Newbie Poster · Answer 3 · 2010-08-14T03:04:24+00:00

Let me try to explain what happen when your query executes:
To filter out all rows having '4' key question_id is used. 547 rows are affected.

But I don't see why it is affecting 547 rows if question_id is indexed and there are only 14 entries with question_id of '4'.

You may try count(*): select count(*) from answers where question_id = '4'; What is the result?
-- tesu

The result to the above select count(*) is 14. But I'm trying to get the sum of the score of the top 10 answers without doing a full table scan.

Thanks for the help. Let me know if you think of anything else.

Jeff

tesuji 135 Master Poster · Answer 4 · 2010-08-14T03:42:55+00:00

Supposing that the calculated sum(score) were correct, you already wrote that the result were fine and you might have already computed this sum manually, and further, If there are only 14 rows having question_id = '4' as you just reported, then the ORDER BY score can only include these 14 rows and no any further rows.

If all 547 were sorted in decending order, the sum would have been plain wrong, except all score values are equal. That means that the tablesort needed for sorting only considered these 14 rows and not the indicated 547 rows. Thus, for some reason EXPLAIN reports wrong row number.

This is why I tried to explain the steps your query is executed in particular and why I asked how many rows fulfill question_id = '4'.

Addition: You can proof this immediately: Look up the maximum score values for question_id other than '4'. If you find one which is greater than any of the 14 score values for question_id = '4' AND sum(score) is correct, then you have got the evidence that only those 14 row were sorted by tablesort.

-- tesu

jeffsbaker 0 Newbie Poster · Answer 5 · 2010-08-14T04:35:41+00:00

If all 547 were sorted in decending order, the sum would have been plain wrong, except all score values are equal. That means that the tablesort needed for sorting only considered these 14 rows and not the indicated 547 rows. Thus, for some reason EXPLAIN reports wrong row number.
-- tesu

Yes, I am positive that the results are correct. The SUM statement with the subquery in my first post gives me the correct results for any question. It SUMs the top 10 scores.

My concern is this: How do I know that EXPLAIN is reporting the wrong row number? In another couple of months this table will have hundreds of thousands of records. I can't have MySQL do a table scan for each poll just to get the SUM of the top ten scores. But I NEED to get the SUM of the top ten scores for each poll often.

I understand what you are saying, that mysql couldn't be sorting all 547 rows because then the SUM would be wrong. But Mysql EXPLAIN is stating that it is scanning all 547 rows. How do I know that Mysql isn't going through all 547 rows (as it states it is) and just picking the 14 scores with question_id of '4' as if there is no index on 'question_id'.

I'm sorry if I am not making my concern clear. I'm trying my best.

Jeff

tesuji 135 Master Poster · Answer 6 · 2010-08-14T05:25:26+00:00

I understand your concern completely. I know by extensive experiences that one can hardly trust that database system besides its considerable lack of basic functionality of ANSI SQL standards. I have never met another such error-prone database.

The only way to figure out whether tablesort is carried out on only those rows constrained by WHERE-condition is doing tests. You may automatically generate test data and carrying out your query on increasing numbers of rows measuring their execution time.

To get expressive results, your table must be correctely defined, especially there must be primary and foreign keys as well as most likely an index on column question_id. Just a day ago I did a similar test on a Sybase database where I have comprehensive test tools for investigating SQL statements. However, it isn´t that difficult and costly to write a simple test case for your individuell problem. I can support you. So you can post your exact select statement you want to investigate.

-- tesu

jeffsbaker 0 Newbie Poster · Answer 7 · 2010-08-14T23:36:51+00:00

I understand your concern completely. I know by extensive experiences that one can hardly trust that database system besides its considerable lack of basic functionality of ANSI SQL standards. I have never met another such error-prone database.
The only way to figure out whether tablesort is carried out on only those rows constrained by WHERE-condition is doing tests. You may automatically generate test data and carrying out your query on increasing numbers of rows measuring their execution time.

Thanks for your help. I was just hoping someone already came across this problem and had run tests on it so that I didn't have to populate a test table and run tests on it and then try to interpret the results.

I guess I'll have to make a table and fill it with a million random records. But I'm not sure which routine I should use. I have looked up a few on the web, but most of them seem pretty large and overly complicated. Do you have a suggestion for quickly and easily populating a test table with random data. Here is my table:

CREATE TABLE IF NOT EXISTS answers
( id INT NOT NULL AUTO_INCREMENT, 
PRIMARY KEY(id), 
question_id INT,		
answer VARCHAR(50),		
score INT,
INDEX(question_id))

I'm not sure, but I suppose with a million records that question_id could be a random number between 1 and 1000. And then answer could be random letters about 10 characters long. Score should probably be a random number between 1 and 5000.

Thanks for any help you have.

tesuji 135 Master Poster · Answer 8 · 2010-08-15T05:24:57+00:00

Hi,

EXPLAIN always shows the number of all rows if filesort is required for GROUP BY and ORDER BY clauses. I have examined over ten queries on a larger mysql database. No matter how the where clause looks like, EXPLAIN always quotes the number of all rows plus something more (see table below)! The result by EXPLAIN cannot considered to be a true execution plan.

I generated a test program using your select statement and table definition on Sybase database system. The execution times for number of rows from range 100,000 up to 1,000,000 were between 0.23 and 1.89 sec.

Porting this PSM and C program to mysql is rather impossible for there are too much limitations, for example creation of tables or indexes within a user defined function are not allowed for mysql always associate an implicit commit with DDL, and as a result, the function will be aborted. Also subselect (from clause) will not be carried out properly.

So I wrote a little PSM function for mysql to generating random data in the style of test tools on Sybase. I regarded the data ranges you quoted. However, the analysing and statistical methods can't be implemented on mysql, just fancy it doesn't have any output functionality for user defined functions! So one can't send any result to user screen. I can only show you the execution times on mysql database, here is the result:

/*
reps     rows     insert (sec)  select (sec)   drop+create (sec)  rows by EXPLAIN
--------------------------------------------------------------------------------------
 100000   100000   37.703       0.422 (0.000)  0.034              100239
 500000   500000  202.234       1,397 (0.004)  0.093              510269
1000000  1000000  417.276       2.142 (0.016)  0.185              1000225*/

Carry out your select sum(score) query lasts 2.142 seconds for one million rows. For each test mysql server was shut down and restarted to prevent distortion through memory cached data. (0.016) is the time in seconds if the query is executed a second time what proves that data caching works fine on mysql at least.

I thing that 2.142 sec for one million rows is a good result for your planned queries. By contrast with sorting these rows on column adate (added by myself) just lasts 2.78 seconds where filesort now really had to use all 1000225 rows (there are only 1000000).

Important: Only index (question_id) is essential for your specific query !

-- tesu