As a software developer, one of the most common problems you’ll face when working with large datasets is optimizing slow database queries. Query performance can severely impact your application’s speed and responsiveness, which is especially noticeable when users expect real-time data or reports.

In this blog post, I’ll walk you through the real-life optimization of a MySQL query that initially took over 4 hours to execute. By following a series of structured steps, I transformed this query into one that now runs in just 4 minutes. This guide will help any software developer working with large databases understand how to effectively optimize slow queries and implement these strategies in their own projects.


The Problem: Slow Sales Report Query

I was working on an e-commerce platform where we needed to generate detailed sales reports based on product categories, regions, and sales channels. The report was pulling data from three large tables:

  • sales: Contains millions of records detailing transactions.
  • products: Contains information about each product, including category and ID.
  • regions: Stores information about different regions for each sale.

The query I wrote for generating the report looked something like this:

SELECT
    p.category_id,
    r.region_id,
    SUM(s.sale_amount) AS total_sales,
    COUNT(s.sale_id) AS total_transactions
FROM
    sales s
JOIN
    products p ON s.product_id = p.product_id
JOIN
    regions r ON s.region_id = r.region_id
WHERE
    s.sale_date >= CURDATE() - INTERVAL 1 YEAR
GROUP BY
    p.category_id, r.region_id
ORDER BY
    total_sales DESC;

While the query seemed simple, it took over 4 hours to complete due to the size of the sales table and the complexity of the joins and aggregations. This led to performance bottlenecks, timeouts, and delayed reports.


Step-by-Step Solution: How I Optimized the Query

Step 1: Analyzing the Query Execution Plan

The first thing I did was run the query with EXPLAIN to analyze the query execution plan. This gave me valuable insights into how MySQL was executing the query and which parts were causing delays.
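
In practice this just means prefixing the report query with EXPLAIN. The columns to watch in the output are type (ALL means a full table scan), key (the index MySQL chose, if any), and rows (how many rows it expects to examine):

-- Prefix the report query with EXPLAIN to see how MySQL plans to execute it
EXPLAIN
SELECT
    p.category_id,
    r.region_id,
    SUM(s.sale_amount) AS total_sales,
    COUNT(s.sale_id) AS total_transactions
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN regions r ON s.region_id = r.region_id
WHERE s.sale_date >= CURDATE() - INTERVAL 1 YEAR
GROUP BY p.category_id, r.region_id
ORDER BY total_sales DESC;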

Key findings from the query execution plan:

  • The query was performing full table scans on the sales table because there was no index on the sale_date column.
  • The JOINs between the sales, products, and regions tables were inefficient. MySQL was scanning large tables even before applying filters, leading to high processing times.
  • Aggregations (SUM and COUNT) were happening on a huge dataset, meaning that MySQL had to process and group many unnecessary rows.

Step 2: Adding the Right Indexes

The first major change was ensuring that the relevant columns were indexed for faster lookups. Indexes improve the speed of searching and joining large datasets.

I added the following indexes:

  1. Composite Index on sales table:
    • Columns: sale_date, product_id, region_id
    • Reason: This index allows MySQL to efficiently filter records based on date and quickly find related product and region data.
  2. Indexes on products and regions tables:
    • category_id in the products table
    • region_id in the regions table
    • Reason: These indexes speed up the JOIN operations between sales, products, and regions.

-- Create a composite index for better filtering and join performance
CREATE INDEX idx_sales_date_product_region ON sales(sale_date, product_id, region_id);
-- Index on category_id for the products table
CREATE INDEX idx_products_category ON products(category_id);
-- Index on region_id for the regions table
CREATE INDEX idx_regions_region_id ON regions(region_id);

Adding these indexes allowed MySQL to quickly filter records and perform joins more efficiently, reducing the number of rows it had to process.
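
To sanity-check that the new indexes were in place and actually being picked up, I could inspect the index metadata and re-run the execution plan. A minimal check, using the index names from above:

-- List the indexes defined on the sales table; the new composite index should appear here
SHOW INDEX FROM sales;

-- Re-running the EXPLAIN from Step 1 should now report idx_sales_date_product_region
-- in the possible_keys and (ideally) key columns for sales, instead of a full table scan (type = ALL)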

Step 3: Rewriting the Query to Reduce Data Early

Once the indexes were in place, I turned my attention to rewriting the query for better performance. The key here was reducing the dataset as early as possible in the query execution.

  • Subqueries and CTEs (Common Table Expressions): By using a CTE, I could pre-filter the sales data to only include transactions from the last year. This reduced the number of rows that needed to be processed in the join and aggregation stages.

Here’s the optimized query:

WITH filtered_sales AS (
    SELECT 
        sale_id, 
        product_id, 
        region_id, 
        sale_amount
    FROM
        sales
    WHERE
        sale_date >= CURDATE() - INTERVAL 1 YEAR
)
SELECT
    p.category_id,
    r.region_id,
    SUM(fs.sale_amount) AS total_sales,
    COUNT(fs.sale_id) AS total_transactions
FROM
    filtered_sales fs
JOIN
    products p ON fs.product_id = p.product_id
JOIN
    regions r ON fs.region_id = r.region_id
GROUP BY
    p.category_id, r.region_id
ORDER BY
    total_sales DESC;

This query first filters the sales data (via the CTE) to only include relevant records from the last 12 months. Then, it performs the join and aggregation operations on the already filtered data, significantly reducing the amount of work MySQL needs to do.

Step 4: Partitioning Large Tables for Better Performance

The sales table was growing quickly, and as a result, queries that needed to scan the entire table were getting slower. To optimize this, I implemented partitioning on the sales table based on the sale_date column.

Partitioning divides a table into smaller, more manageable chunks, allowing MySQL to scan only relevant partitions when filtering by date.

Here’s how I partitioned the sales table:

-- Note: MySQL requires the partitioning column (sale_date) to be part of every
-- unique key on the table, including the primary key, before this ALTER will run.
ALTER TABLE sales
PARTITION BY RANGE (YEAR(sale_date)) (
    PARTITION p2023 VALUES LESS THAN (2024),  -- 2023 and earlier
    PARTITION p2024 VALUES LESS THAN (2025)
    -- New yearly partitions must be added as time goes on (or a MAXVALUE catch-all partition)
);

With this scheme in place, MySQL only has to scan the partitions that can contain the past year's sales rather than the entire table, which speeds up query execution.
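
To confirm that pruning is actually happening, EXPLAIN reports which partitions a query will touch. A quick, illustrative check (the COUNT(*) probe below is not part of the report; the partitions column appears by default in recent MySQL versions, while older versions need EXPLAIN PARTITIONS):

-- The partitions column of the output should list only the partitions
-- that can hold the last year's rows, not every partition in the table
EXPLAIN
SELECT COUNT(*)
FROM sales
WHERE sale_date >= CURDATE() - INTERVAL 1 YEAR;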

Step 5: Caching for Frequently Run Reports

For reports that were run frequently, such as daily or weekly sales summaries, I added query caching at the database level and used Redis caching in the application layer. Caching ensures that the same query doesn’t need to be executed repeatedly, saving computational resources and improving response times.

  • MySQL Query Cache: For reports that didn't require real-time data, I enabled the server-side query cache (available in MySQL 5.7 and earlier) so results were stored for a short period; a minimal configuration sketch follows this list.
  • Redis Caching: For even faster access, I implemented Redis to store the query results for reports that were frequently requested.
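
On the MySQL side, this is roughly what the query cache setup looked like. Note that the query cache only exists up to MySQL 5.7 (it was removed in 8.0), and the 64 MB size is just an illustrative value:

-- MySQL 5.7 and earlier only: the query cache was removed in MySQL 8.0.
-- The server is started with query_cache_type = 1 (set in my.cnf); the size
-- can then be adjusted at runtime. 64 MB here is an illustrative value.
SET GLOBAL query_cache_size = 67108864;

-- Check that the cache is active and watch its hit/insert statistics over time
SHOW VARIABLES LIKE 'query_cache%';
SHOW STATUS LIKE 'Qcache%';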

Step 6: Database Configuration Tuning

Finally, I worked with the database administrator to ensure that MySQL was properly tuned for optimal performance:

  • Increased the InnoDB buffer pool size so that more data and index pages could be cached in memory (a sketch of this change follows the list).
  • Adjusted thread and I/O concurrency settings so that large queries could make better use of multiple CPU cores.
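
As an illustration of the buffer pool change (the 8 GB figure is only an example; the usual rule of thumb is to give the buffer pool a large share of RAM on a dedicated database server):

-- Check the current buffer pool size (in bytes)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Resize it online (supported from MySQL 5.7.5 onward); 8 GB is an illustrative value.
-- For the change to survive a restart, also set innodb_buffer_pool_size in my.cnf.
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;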

The Results:

After applying all of these optimizations, I re-ran the query, and the results were remarkable:

  • The query that took over 4 hours to execute now completed in just 4 minutes.
  • The system experienced significantly reduced load during the report generation process, allowing other parts of the application to perform without delay.

Key Takeaways for Developers:

  1. Proper Indexing: Ensure that your database tables are indexed on columns that are frequently used in WHERE, JOIN, and ORDER BY clauses.
  2. Reduce Data Early: Filter your data as early as possible to reduce the number of rows that need to be processed during joins and aggregations.
  3. CTEs and Subqueries: Use Common Table Expressions and subqueries to break down complex queries and make them more manageable.
  4. Table Partitioning: For large datasets, partition your tables to speed up queries by reducing the amount of data MySQL needs to scan.
  5. Caching: Use caching strategies at the database or application level for frequently accessed data to reduce query execution time.
  6. Database Tuning: Adjust your MySQL configuration to optimize memory usage and query execution.

Conclusion:

Optimizing database queries is a crucial skill for any software developer working with large datasets. By applying indexing, query restructuring, partitioning, and caching techniques, I was able to transform a slow and inefficient query into one that runs quickly and efficiently. If you’re dealing with slow queries in your project, following these steps can drastically improve performance and ensure that your application runs smoothly at scale.
