What are some ways to improve MySQL performance on queries against large tables that include the HAVING clause? It's my understanding that anything in there doesn't benefit from table indexes.

I'm not very familiar with HAVING, but my understanding is that it is used to filter results after a GROUP BY operation, so I can't imagine that indexes would improve performance other than on the original SELECT. Using WHERE would return rows based on one or more criteria, and would benefit from indexing, but HAVING, as I understand it, is performed after the selection and grouping.

Yes, that's true. But for those instances where we're left in circumstances where we have to perform HAVING on large tables, is there anything we can take into consideration for performance? Even if we must do large table scans, are there any tips on the most efficient way of doing them?

Perhaps a smarter someone on this forum can help you look at the execution plan of your query and suggest some improvements.

I have a friend who spent the better part of a career doing SQL. I wrote up your question and sent it off to him. Just for sh!ts and giggles, he decided to feed it to ChatGPT first. He said that what he got back was what he would have written if he spent a lot of time researching. Here is what ChatGPT said...

Yes, you're correct that efficient indexing plays a crucial role in optimizing the SELECT/WHERE part of the query. However, when it comes to improving the efficiency of the HAVING part, there are several strategies you can employ:

Indexing: Just like with the WHERE clause, appropriate indexing can improve the efficiency of the HAVING clause. If the columns used in the HAVING clause are frequently filtered on, consider creating indexes on those columns. However, be cautious with indexing as it comes with overhead and can affect write performance.

Optimize the Query: Ensure that your query is optimized and written in a way that allows the database engine to execute it efficiently. Avoid unnecessary joins, subqueries, or complex expressions in the HAVING clause that can slow down the query processing.

Aggregate Functions: If possible, try to use more efficient aggregate functions in your HAVING clause. Some aggregate functions might be more computationally expensive than others. For example, SUM() might be more efficient than COUNT() in certain scenarios.

Limit the Result Set: Reduce the number of rows processed by the HAVING clause by applying more selective conditions in the WHERE clause. This can help in reducing the computational load on the database engine during the HAVING phase.

Materialized Views: In some cases, you can pre-compute the aggregates and store them in materialized views. This can significantly speed up queries that involve aggregations, including those using the HAVING clause. However, this approach comes with maintenance overhead and considerations about data freshness.

Partitioning: If your dataset is large, consider partitioning your tables based on certain criteria (e.g., date ranges). This can help in narrowing down the data that needs to be processed by the HAVING clause, thus improving efficiency.

Database Statistics: Ensure that your database statistics are up-to-date. The query optimizer relies on statistics to generate efficient execution plans. Outdated statistics can lead to suboptimal query performance, including queries involving the HAVING clause.

Hardware Resources: Sometimes, improving hardware resources like CPU, memory, or storage can also have a positive impact on the performance of queries involving the HAVING clause, especially for complex aggregations on large datasets.

By applying these strategies, you can improve the efficiency of the HAVING clause and optimize the overall performance of your SQL queries.

He also said it was scary.

commented: Keep It Clear: Do not post spun or AI-generated content -8

Everything there is very generic and, dare I say, robotic, and doesn't really answer my question as to whether any MySQL experts out there have come up with any creative tips they'd like to share from their own experience. I would have been much more interested in what your friend had to say from his firsthand experience than in what ChatGPT has to say.

Either way, the AI missed one big one: when faced with a HAVING clause, before all else, see if you can convert it to a WHERE clause, which is what I successfully ended up doing this morning.
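For illustration, here's a minimal sketch of the kind of conversion I mean, against a hypothetical topics table (this is not my actual query):

-- The filter references a plain grouped column, not an aggregate, so
-- it doesn't actually need to wait until after grouping:
SELECT forum_id, COUNT(*) AS topic_count
FROM topics
GROUP BY forum_id
HAVING forum_id > 100;

-- Moved into WHERE, the same filter can use an index on forum_id and
-- discards rows before any grouping work is done:
SELECT forum_id, COUNT(*) AS topic_count
FROM topics
WHERE forum_id > 100
GROUP BY forum_id;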

Clearly you missed the part where my friend said, it's the same answer he would have given. In other words, the answer generated by ChatGPT was vetted by an expert.

I do not understand your response in this case. If I were to post a code project here, and that project happened to contain a ChatGPT generated function, would that be a no-no in your books?

If I had taken the ChatGPT reply, reworded, then posted it, would that have been OK? Or would that be plagiarism? What if I posted my reworded version AND credited it as a rewording of a ChatGPT generated answer? Then that just makes me look like an idiot.

Where is the line?

see if you can convert it to a WHERE clause

I made a foolish assumption that the HAVING was something that depended on the GROUPing and thus could not be converted to a WHERE.

As for my expert, I also assumed that if he could have been more specific he would have been. That would have required more info on your specific situation. May I suggest that next time, you post your query? Changing the HAVING to a WHERE might have been obvious on seeing it.

commented: It did depend on GROUP. I spent hours and hours rewriting the query to be able to remove it. +34

Clearly you missed the part where my friend said, it's the same answer he would have given.

No, you didn't say it's the same answer he would have given. You said it's what he would have written if he spent a lot of time researching. There's a big difference between asking an industry expert for their personal opinion based on their own expertise and experience, and asking them to simply sign off on auto-generated research that was collated from across random sources on the Internet.

If I were to post a code project here, and that project happened to contain a ChatGPT generated function, would that be a no-no in your books?

There's a big difference between using AI as a tool, and just copying and pasting an entire article of AI generated garbage and posting it here in the forums. Especially after you just posted a week ago about how Google is going hard against websites that are churning out pages and pages of AI generated articles that don't contribute any additional value to the web, and then you went and posted one such article yourself.

I'll give you another example. I've been going to the doctor a lot lately.

After each appointment, there is a long form summary of everything that was said, test results, interpretation of the test results, and next steps, all generally typed up by a medical student or resident. At the bottom it has my doctor's signature saying something along the lines of, "I agree with these findings." To my dismay, on multiple occasions I've found blatant errors, some of which could impact the treatment prescribed (e.g. in one case it said I had no heart arrhythmias, while that is very much not the case.) As someone who hangs on every word my doctor says, my doctor "signing off" is very, very different than asking him directly for his professional opinion on a treatment plan.

I asked someone who is an acknowledged expert (not just by me). He agreed with the generated answer. If I had instead written him that advice, letting him believe it was mine, he would have given the same response. In other words, it was most definitely not simply AI generated garbage.

If I asked ChatGPT and my doctor what to do for a particular pain and they both replied with "Take two Aspirin, get a good night's sleep, and call me in the morning if it still hurts", would you consider the advice from your doctor to be garbage simply because the AI gave the same response?

What your expert friend would have written, had he been a forum member here, is different than what the AI wrote. Even if the overall sentiment is the same, the actual typed words coming from a human are different than those coming from an AI.

While he agreed with the generated answer, it's not what he would have answered himself upon being asked for his professional opinion.

commented: Now you are just splitting hairs -3
commented: Hence the quote and not posted as an answer... +0

Now you are just splitting hairs

I don't think so, but let's just agree to disagree.

commented: Works for me :-) +0

And so the queen bee has spoken again, no matter the input or consideration from others. IMHO Reverend Jim was not out of line here; he merely posted a scenario which could have helped... I truly saw/hoped for a turnaround on the site, but it seems to be governed by a "my way or the highway" mindset. Through some years I have really tried to make a comeback here (yes, I am very highly rated on other sites; ..Jim can elaborate), but with the loss of so many other major players/members stating the same facts for the same reasons, I think it is time to greet all and thank you for your years of being a digital friend. I know this is not the place to post this, but maybe being in the open/public will create the realization that there is more to this site than a controlling owner, like maybe its members and contributors...

To the relevant messages here: it was posted openly in various threads that AI has a role to play on the site, even promoted by some senior members, and for ..Jim to then be shot down like a war enemy does not play well with others, I am sure.

I am honestly sorry that I stepped on some toes here; maybe that is what is needed right now, as most posts here are so irrelevant to what was started so many good years ago. I bid you all a blessed future, from my heart. I am, however, out from here.

commented: Sorry to see you go. +15

I see it very differently than you see it.

I consider Jim a real world friend of mine, as is his son, Adam, who I had actually invited to my wedding.

I didn't bash Jim with my moderator, admin, or forum owner hat on. He contributed a post where I disagree with his viewpoint. He disagrees with mine. We had a public debate about it. I took the time to elaborately explain my perspective and thought process. I voted down one of his posts, just as he voted down one of mine. In the end, we both agreed to disagree. Everyone was respectful. We "bashed" viewpoints, and never each other. There were no personal attacks. (After all, we're friends debating a controversial topic, and I would hope that's the way Jim took it too.) Isn't that the epitome of a healthy conversation in an online discussion forum?

If the forum was indeed "governed by a 'my way or the highway' mindset," then I would moderate posts by members that disagree with my perspective, or contain these types of discussions to behind-the-scenes forums.

As a side note, I can give my own opinions as to the demise of DaniWeb, but I'll spare you ;)

We "bashed" viewpoints, and never each other

As it should be. I wish this applied to the real world on a broader scale (we all know who I am talking about).

I would hope that's the way Jim took it too

Absolutely. Dani is a respected friend and I always appreciate her viewpoint whether I agree or disagree with it. And I have been convinced to change my mind from time to time. That's the great thing about informed opinions.

I realise this has been marked as solved, but I wanted to make it known that the HAVING clause runs on the returned result set of your query, which, as you say, has no indexes on it as it is just a temporarily created result set - but it has the benefit of allowing you to do some simple post-processing on the result set. (This is for MariaDB at least.)

You can see this by making an alias from a column name, like SELECT sum(points) total FROM ... HAVING points > 10 (this will error because points doesn't exist in the temp result set, only total).
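A minimal sketch of that behaviour, using a hypothetical scores(player, points) table:

-- The alias is part of the grouped result set, so HAVING can see it:
SELECT player, SUM(points) AS total
FROM scores
GROUP BY player
HAVING total > 10;

-- As described above, HAVING points > 10 here would error instead,
-- because points is not a column of the temporary grouped result.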

I only use HAVING for botch jobs, when you just want some complicated data filtered but don't want to spend the time restructuring the sub-queries to return it in the right format, or when it just isn't a problem that the query takes 5 minutes to run.

Also note that LEFT JOIN (SELECT * FROM tbl) tbl2 messes up indexing too, as that subquery loses its indexes. You need to rework the query so it joins the table directly, like LEFT JOIN tbl AS tbl2 ON tbl1.idx = tbl2.idx AND tbl2.abc = 2.

Then you could add a composite index on tbl covering the columns (idx, abc) so it can quickly filter those rows out and return in 0.0001s, as in the sketch below.
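A rough sketch of that rewrite, keeping the made-up table names from above:

-- Before: the derived table materializes tbl without any indexes, so
-- the join conditions can't use them.
SELECT tbl1.*
FROM tbl1
LEFT JOIN (SELECT * FROM tbl) tbl2
       ON tbl1.idx = tbl2.idx AND tbl2.abc = 2;

-- After: join the base table directly and give it a composite index
-- matching both join conditions.
CREATE INDEX idx_tbl_idx_abc ON tbl (idx, abc);

SELECT tbl1.*
FROM tbl1
LEFT JOIN tbl AS tbl2
       ON tbl1.idx = tbl2.idx AND tbl2.abc = 2;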

commented: Thank you for bringing this topic back on track! +34

SELECT sum(points) total FROM ... HAVING points > 10

I don't have experience with MariaDB, but in MySQL, something like that will work as long as I do SELECT sum(points) AS total FROM ... HAVING total > 10. Is that what you were getting at?

As far as what I was trying to accomplish that provoked this question, I was working on a HAVING clause that was filtering recommended topics by a bunch of OR clauses. Nearly all of the filters could be accomplished in WHERE, but there were two that couldn't (specifically, the ones that looked at whether the end-user had posted in the topic, or had unread posts in the topic). At the time I was using subqueries in the SELECT clause, hence the need for HAVING. I switched to using JOINs, and was then able to use WHERE. And that's how I shaved nearly 50ms off of https://www.daniweb.com/programming/4 for logged-in members.
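Heavily simplified, the shape of the change looked something like this (hypothetical table and column names, not the real query):

-- Before: a correlated subquery in the SELECT list produces a computed
-- column, which can only be filtered after the fact via HAVING (a
-- MySQL extension that lets HAVING reference select-list aliases).
SELECT t.topic_id,
       (SELECT COUNT(*)
          FROM posts p
         WHERE p.topic_id = t.topic_id
           AND p.member_id = 123) AS my_post_count
  FROM topics t
HAVING my_post_count > 0;

-- After: the same test expressed as a JOIN, so the filter can live in
-- WHERE and make use of the indexes on posts.
SELECT DISTINCT t.topic_id
  FROM topics t
  JOIN posts p
    ON p.topic_id = t.topic_id
 WHERE p.member_id = 123;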

Now, as far as whether going down that rabbit hole that day was worth the cost of losing AndreRet, I'd have to give a resounding no.

I have to admit that 95% of my work from 1995 to 2008 was back end stuff where I didn't have to worry about things like that. Digital plumbing and monitoring. The other 5% was single user apps. Not counting the 20% which was pointless meetings. So if you don't mind explaining, I'm curious as to why 50ms would even be noticeable. I'm not asking just to be picky.

I'm too exhausted for an in-depth explanation right now, but 90% of web development is optimizing for performance. The average DaniWeb page takes anywhere from 80ms up to 250ms to load the HTML (when dealing with low network latency), depending on the type of page, so shaving 50ms off of that is a huge win.

Here are some articles that can explain it in greater depth:

The HTML page must be downloaded in its entirety before the web browser can begin loading anything else (CSS, Javascript, images, etc.) and start rendering the page. CWV (Core Web Vitals) dictates that the entire page must be fully loaded, meaning CSS files downloaded and rendering the HTML, JS files downloaded and executed, etc., in 2s or less. That means the faster we can get that HTML over the wire to the user's browser, the sooner we can start doing any of those things.

And, with a server-side language, we have to interpret it to generate the HTML code before we can even start sending it over the wire. That means all PHP interpreted, SQL queries executed, etc. Everything we need to build the HTML.

I suppose I am looking at it in terms of practicality. There might be several reasons to optimize:

  1. It saves resources on the hosting platform
  2. It improves the user experience
  3. It provides personal satisfaction

Point #1 would save you money if the savings were significant.

Point #2 would apply if the improvement was noticeable, but I doubt anyone could seriously comment, "I think this page rendered 50ms faster than it used to," especially considering all the other things that affect timing, for example, the current load on my ISP's servers, anything else running on my computer or home network, etc.

As for point #3, personal satisfaction is a big deal, but I could not have used that at my office to justify the time spent on improving a process by that little (which is why I used to sneak those changes in under the radar).

I can see how saving 50ms in a process that took 80ms to 250ms is a big deal. I just wonder if it is a big deal that is perceptually visible to the user.

I don't have experience with MariaDB, but in MySQL, something like that will work as long as I do SELECT sum(points) AS total FROM ... HAVING total > 10. Is that what you were getting at?

Kind of. MariaDB is a fork of MySQL from around 2009 or so, back in the MySQL 5.* days; MySQL's creator continued developing MariaDB while Oracle took over MySQL. That's why they are so similar: in 2009 they were the same!

That said, I meant that the HAVING statement is like dumping your query result into a temp table and then running another query on it afterwards.
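As a rough mental model (not necessarily what the engine literally does in every case), reusing the hypothetical scores table from earlier:

-- This query...
SELECT player, SUM(points) AS total
FROM scores
GROUP BY player
HAVING total > 10;

-- ...behaves roughly like materializing the grouped result first and
-- then filtering it in a second pass:
SELECT *
FROM (SELECT player, SUM(points) AS total
        FROM scores
       GROUP BY player) AS tmp
WHERE tmp.total > 10;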

As far as what I was trying to accomplish that provoked this question, I was working on a HAVING clause that was filtering recommended topics by a bunch of OR clauses. Nearly all of the filters could be accomplished in WHERE, but there were two that couldn't (specifically, the ones that looked at whether the end-user had posted in the topic, or had unread posts in the topic). At the time I was using subqueries in the SELECT clause, hence the need for HAVING. I switched to using JOINs, and was then able to use WHERE. And that's how I shaved nearly 50ms off of https://www.daniweb.com/programming/4 for logged-in members.

I know that, in my experience, ORs are very bad for speed. You can get away with a few, but they can get very bad when they stop the whole query from making use of one of the main indexes; you can often speed it up by moving the ORs into the HAVING. Subqueries are similar, depending on the size of the result set they return: the subquery's result has no index, but if it filters on the main query's used index, it can still be fast. Best is to use a join with no subquery if possible, especially at the speeds you are looking at.
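A hypothetical sketch of the OR tip (invented column names):

-- ORs in the WHERE can stop the optimizer from using the main index:
SELECT topic_id, is_sticky, is_unread
  FROM topics
 WHERE forum_id = 4
   AND (is_sticky = 1 OR is_unread = 1);

-- Keeping the WHERE trivially indexable and pushing the ORs into
-- HAVING (MySQL lets HAVING reference select-list columns) post-
-- filters the much smaller result set instead:
SELECT topic_id, is_sticky, is_unread
  FROM topics
 WHERE forum_id = 4
HAVING is_sticky = 1 OR is_unread = 1;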

Another option is to use AJAX to load the data after the page has loaded. I moved to AJAX & Javascript websites like 6 or 7 years ago - let the DOM load and use Javascript to put the data in after it has already rendered. I found ReactJS recently and that seems insanely fast for making pages more responsive. I'm attempting a NodeJS backend & frontend, but I'm still just learning despite using JS for around 17 years.

I am currently looking at storing all data in a Redis DB updated periodically from MariaDB, so the backend pushes updates instantly over SSE (server-sent events) and the React app keeps up to date without needing to wait for data over the network.

My line of work might be a bit different since I mostly work on intranets now rather than the internet, so I don't have to pay too much attention to internet security.

Another option is to switch your database server to an M.2 SSD if you haven't already; those things are insanely fast. I remember the speed jump on my MySQL server back circa 2008-2010, when going from an HDD to an SSD cut queries down from 4 minutes to 40-50 seconds. That's about 130 MB/s to 400 MB/s, plus random access speed increases (HDD is very bad at a 6.3ms average access time; SSD is 0.14ms) - and now an M.2 SSD gets up to 6 GB/s with faster access latency too.

The actual benchmarks I did on SSD vs M.2 SSD were read: 447 MB/s vs 5.9 GB/s, write: 460 MB/s vs 4.1 GB/s, access time: 0.14ms vs 0.06ms.

I haven't tried a database on one yet, but based on the above stats it should be at least 3x faster.

Point #2 would apply if the improvement was noticeable, but I doubt anyone could seriously comment, "I think this page rendered 50ms faster than it used to," especially considering all the other things that affect timing, for example, the current load on my ISP's servers, anything else running on my computer or home network, etc.

Remember, we aren't shaving 50ms off of something that took 80ms to 250ms. That's how long it typically takes to generate the HTML from the PHP code and send it over the wire (i.e. including network latency). When I said I was able to shave off 50ms, we're talking strictly about the time to generate the HTML from the PHP code, which I would guesstimate is overall like 60ms-80ms (although I've never benchmarked it). So a huge win for an afternoon's worth of work.

I can see how saving 50ms in a process that took 80ms to 250ms is a big deal. I just wonder if it is a big deal that is perceptually visible to the user.

Well, when you consider that people tend to not even stick around if the HTML takes more than ~400ms to retrieve, then yeah :)

As far as saving resources on the hosting platform, while that's true, the other thing to consider here is the almighty SEO. Every website has a certain amount of "crawl budget" that is allocated to the domain based on its popularity, ranking, incoming links, etc. Crawl budget is essentially how many resources Google will spend in crawling the site. With sites as large as DaniWeb, it's an uphill battle having enough crawl budget to entice Google to crawl deeper and deeper until it's able to crawl (and, hopefully, index) all the pages of the site you want it to. Typically Google will give up when it runs out of crawl budget. One way to optimize crawl budget is by making pages load as fast as you possibly can. If you can bring page download time down from 100ms to 50ms, you can double the number of pages of your website that Google is going to crawl each day.

I know that, in my experience, ORs are very bad for speed. You can get away with a few, but they can get very bad when they stop the whole query from making use of one of the main indexes; you can often speed it up by moving the ORs into the HAVING.

Thank you for that tip! I never thought of that, but it totally makes sense!!

Another option is to use AJAX to load the data after the page has loaded. I moved to AJAX & Javascript websites like 6 or 7 years ago - let the DOM load and use Javascript to put the data in after it has already rendered.

We do that with DaniWeb Connect business cards e.g. https://www.daniweb.com/connect/users/view/1 because figuring out the matching is super resource intensive.

I am currently looking at storing all data in a Redis DB updated periodically from MariaDB, so the backend pushes updates instantly over SSE (server-sent events) and the React app keeps up to date without needing to wait for data over the network.

DaniWeb uses Redis for a handful of things here, but most certainly not as sophisticated as storing the majority of our database and updating in realtime. (We also use Memcached, but I like Redis for the combination of performance and persistence.)

Another option is to switch your database server to an M.2 SSD if you haven't already; those things are insanely fast

That's super duper crazy! Our database servers definitely use SSDs, but I had never heard of M.2 SSDs. I guess I'll have to talk to blud (our sysadmin) about it ;)

I'm a little late to the party; however, I want to share my experience learning MySQL in the past day. I set up a server and a database, and wrote a C++ program to connect to it. It measures the times for three SELECTs: the whole product list, one using HAVING, and one using WHERE. I also tested reversing the order.

Query execution time: 0.0002336 seconds
WHERE Query execution time: 0.0002878 seconds
HAVING Query execution time: 0.0002313 seconds

Query execution time: 0.0001929 seconds
HAVING Query execution time: 0.0002674 seconds
WHERE Query execution time: 0.0004905 seconds

#include <iostream>
#include <mysql_driver.h>
#include <mysql_connection.h>
#include <cppconn/driver.h>
#include <cppconn/connection.h>
#include <cppconn/statement.h>
#include <cppconn/resultset.h>
#include <cppconn/exception.h>
#include <chrono>
#include <Windows.h>
using namespace std;
#pragma comment(lib, "libcrypto.lib")
#pragma comment(lib, "libssl.lib")
#pragma comment(lib, "mysqlcppconn.lib") // Legacy JDBC-style Connector/C++ API, which is what this code uses

sql::mysql::MySQL_Driver* driver;
sql::Connection* con;
sql::Statement* stmt;
int ct = 0;

int main() {
    while (ct != 5) {
        try {
            driver = sql::mysql::get_mysql_driver_instance();

            con = driver->connect("tcp://127.0.0.1:3306", "root", "MySQLDani");
            con->setSchema("electronics");

            // Create a statement
            stmt = con->createStatement();

            // Note: each timing below includes fetching and printing the
            // rows, not just the server-side query execution.
            auto start_time0 = std::chrono::high_resolution_clock::now();
            // First query: the whole product list
            stmt->execute("SELECT * FROM Products");

            sql::ResultSet* result = stmt->getResultSet();

            while (result->next()) {
                int id = result->getInt("ProductID");
                string name = result->getString("ProductName");
                string price = result->getString("Price");
                cout << id << " " << name << " " << price << endl;
            }
            auto end_time0 = std::chrono::high_resolution_clock::now();
            delete result; // result sets are owned by the caller

            std::chrono::duration<double> elapsed_seconds0 = end_time0 - start_time0;
            std::cout << "Query execution time: " << elapsed_seconds0.count() << " seconds\n";

            auto start_time1 = std::chrono::high_resolution_clock::now();
            // Second query: filter with WHERE (rows discarded before grouping)
            stmt->execute("SELECT ProductName, AVG(Price) AS AvgPrice "
                "FROM electronics.Products "
                "WHERE Price > 4.00 "
                "GROUP BY ProductName");

            sql::ResultSet* result1 = stmt->getResultSet();

            while (result1->next()) {
                string name = result1->getString("ProductName");
                double avgPrice = result1->getDouble("AvgPrice");
                cout << name << " " << avgPrice << endl;
            }
            auto end_time1 = std::chrono::high_resolution_clock::now();
            delete result1;

            std::chrono::duration<double> elapsed_seconds1 = end_time1 - start_time1;
            std::cout << "WHERE Query execution time: " << elapsed_seconds1.count() << " seconds\n";

            auto start_time2 = std::chrono::high_resolution_clock::now();
            // Third query: filter with HAVING (rows discarded after grouping).
            // With one row per product name, both filters return the same rows here.
            stmt->execute("SELECT ProductName, AVG(Price) AS AvgPrice "
                "FROM electronics.Products "
                "GROUP BY ProductName "
                "HAVING AVG(Price) > 4.00");

            sql::ResultSet* result2 = stmt->getResultSet();

            while (result2->next()) {
                string name = result2->getString("ProductName");
                double avgPrice = result2->getDouble("AvgPrice");
                cout << name << " " << avgPrice << endl;
            }
            auto end_time2 = std::chrono::high_resolution_clock::now();
            delete result2;

            std::chrono::duration<double> elapsed_seconds2 = end_time2 - start_time2;
            std::cout << "HAVING Query execution time: " << elapsed_seconds2.count() << " seconds\n";
            Sleep(1000);

            delete stmt; // free the statement and connection each pass
            delete con;
        }
        catch (sql::SQLException& e) {
            std::cerr << "SQL Exception: " << e.what() << std::endl;
        }
        ct++;
    }
    system("pause");
    return 0;
}



1 capacitors 2.50
2 resistors 4.50
3 rectifiers 7.50
4 diodes 10.00
5 ICs 25.00
Query execution time: 0.0013856 seconds
resistors 4.5
rectifiers 7.5
diodes 10
ICs 25
WHERE Query execution time: 0.0011957 seconds
resistors 4.5
rectifiers 7.5
diodes 10
ICs 25
HAVING Query execution time: 0.0010317 seconds

Benchmarking your queries is the way to go.
