
How to Debug Slow PostgreSQL Queries: A Practical Walkthrough

D. Keogh

Developer & creator of pgref.dev · 12 min read

performance · explain-analyze · indexes

The query was taking 14 seconds. It should have taken under 100 milliseconds.

The setup

The application was a multi-tenant SaaS. The slow query was a dashboard summary: total orders, revenue, and average order value per customer for the last 90 days. The tables:

  • orders — 4.1 million rows, with customer_id, created_at, total_amount, status
  • customers — 312,000 rows, with id, account_id, created_at
  • order_items — 18 million rows, with order_id, product_id, quantity, unit_price

SELECT
  c.id AS customer_id,
  c.email,
  COUNT(o.id) AS order_count,
  SUM(o.total_amount) AS revenue,
  AVG(o.total_amount) AS avg_order_value
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
  AND o.created_at > NOW() - INTERVAL '90 days'
  AND o.status != 'cancelled'
WHERE c.account_id = $1
GROUP BY c.id, c.email
ORDER BY revenue DESC NULLS LAST;

Looked reasonable. Ran it. 14.2 seconds.

Step 1 — EXPLAIN ANALYZE

First thing, always:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ...;  -- the full query

The important parts from the output:

Hash Join  (cost=12483.21..89432.18 rows=89201 width=72)
           (actual time=312.441..13891.223 rows=89201 loops=1)
  Hash Cond: (o.customer_id = c.id)
  Buffers: shared hit=4821 read=128432
  ->  Seq Scan on orders  (cost=0.00..71234.18 rows=1823401 width=28)
                          (actual time=0.019..8234.112 rows=1823401 loops=1)
        Filter: ((status <> 'cancelled') AND (created_at > ...))
        Rows Removed by Filter: 2276599
        Buffers: shared hit=891 read=124882
  ->  Seq Scan on customers  (cost=0.00..8234.11 rows=18321 width=44)
                             (actual time=0.012..231.443 rows=18321 loops=1)
        Filter: (account_id = '550e8400...'::uuid)
        Rows Removed by Filter: 293679

Two sequential scans. The orders table scan was reading 124,882 blocks from disk (read=124882) and taking 8 seconds on its own. How to read these numbers:

  • cost=X..Y — planner estimate. First number is startup cost, second is total.
  • actual time=X..Y — wall clock milliseconds.
  • Buffers: shared hit=N read=N — hit means the data was already in the shared buffer cache; read means it had to come from disk (or the OS cache). High read values mean I/O is the bottleneck.
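The same hit-versus-read distinction is tracked per table, so you can check which tables are doing the most disk I/O across the whole workload, not just in one query. A quick sketch using the standard statistics view:

```sql
-- Cache hit ratio per table: tables with a low hit_pct have a
-- working set that doesn't fit in shared_buffers.
SELECT
  relname,
  heap_blks_hit,
  heap_blks_read,
  round(heap_blks_hit::numeric
        / nullif(heap_blks_hit + heap_blks_read, 0) * 100, 2) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```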

Problem clear: no index on orders.customer_id or orders.created_at.

Step 2 — Finding the missing index

Before adding indexes blindly, check what's already there:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';

Result: a primary key on id, an index on status. Nothing on customer_id or created_at. Also check sequential scan frequency:

SELECT
  relname AS table,
  seq_scan,
  seq_tup_read,
  idx_scan
FROM pg_stat_user_tables
WHERE relname = 'orders';

seq_scan was 847. This table was being sequentially scanned almost a thousand times since the last stats reset.
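The same view also works as a database-wide sweep when you don't yet know which table is the problem. Ranking by tuples read via sequential scans surfaces missing-index candidates quickly:

```sql
-- Tables ranked by rows read through sequential scans since the
-- last stats reset. High seq_tup_read combined with a low
-- idx_scan count is the classic missing-index signature.
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
```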

CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders (customer_id, created_at DESC)
WHERE status != 'cancelled';

The CONCURRENTLY flag builds the index without blocking writes — essential on a production table, though the build takes longer and can't run inside a transaction block. The partial WHERE condition makes the index smaller and faster, and the planner will use it because our query has that exact filter.
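One caveat worth checking after a concurrent build: if it fails partway (for example, a deadlock or a duplicate-key violation), PostgreSQL leaves behind an invalid index that takes up space but is never used by the planner. A quick way to verify the build succeeded and see the index's size:

```sql
-- A failed CONCURRENTLY build leaves indisvalid = false; such an
-- index must be dropped and rebuilt.
SELECT i.relname AS index_name,
       idx.indisvalid,
       pg_size_pretty(pg_relation_size(i.oid)) AS size
FROM pg_index idx
JOIN pg_class i ON i.oid = idx.indexrelid
WHERE i.relname = 'idx_orders_customer_created';
```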

After the index:

Index Scan using idx_orders_customer_created on orders
  (cost=0.56..8234.21 rows=89201 width=28)
  (actual time=0.148..412.221 rows=89201 loops=1)
  Buffers: shared hit=92341 read=1823

Disk reads dropped from 124,882 to 1,823. Time dropped to 412ms. Better — but still too slow for a page-load query.

Step 3 — The customers table scan

The EXPLAIN output showed the customers table scan was filtering 293,679 rows down to 18,321 — scanning the whole table to find 6% of it.

CREATE INDEX CONCURRENTLY idx_customers_account
ON customers (account_id);

After that index, the customers scan became a direct index scan. Full query: 38ms.

Step 4 — Table bloat

While in there, I checked table bloat:

SELECT
  relname AS table,
  n_dead_tup,
  n_live_tup,
  round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
  last_vacuum,
  last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers')
ORDER BY n_dead_tup DESC;

The orders table had 890,000 dead tuples — about 22% dead. Last autovacuum was 9 days ago. Not the primary cause of the slowness, but contributing to index bloat and making buffer reads less efficient.
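When autovacuum falls this far behind on a large, frequently updated table, the per-table scale factor is usually the lever. The default triggers a vacuum only after 20% of the table is dead — 800,000+ rows on a 4-million-row table. The specific values below are illustrative, not a recommendation:

```sql
-- Default autovacuum_vacuum_scale_factor is 0.2 (20% dead tuples
-- before vacuum triggers). Lowering it per-table makes vacuums
-- smaller and more frequent. 0.02 / 0.01 are example values —
-- tune against your own write rate.
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);
```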

VACUUM ANALYZE orders;

This reclaims space from dead tuples and updates the planner statistics. After running it, the planner's row estimates got more accurate and the query dropped another 8ms.

Final state

  • Before: 14.2 seconds
  • After index on orders: 412ms
  • After index on customers: 38ms
  • After VACUUM: 30ms

Total debugging time: about 35 minutes. The fix itself took 8 minutes to run on production with no downtime.

The lesson isn't "always index everything." It's: read the EXPLAIN output, look at actual vs estimated rows, check buffer hits vs reads, and look at pg_stat_user_tables before assuming you need to rewrite the query. Usually the query is fine — the data access path isn't.

The pg_stat views documented on pgref.dev are worth knowing. pg_stat_user_tables, pg_stat_statements, and pg_indexes are the three I open first on any performance problem.
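This walkthrough started from a known-slow query; pg_stat_statements is how you find those in the first place. It's an extension, so it has to be enabled first, and the column names below assume PostgreSQL 13+, where total_time was split into total_exec_time:

```sql
-- Requires: shared_preload_libraries = 'pg_stat_statements'
-- and CREATE EXTENSION pg_stat_statements;
SELECT
  round(total_exec_time::numeric, 1) AS total_ms,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Sorting by total_exec_time rather than mean catches the moderately slow query that runs thousands of times a day, which often matters more than the occasional outlier.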


D. Keogh

Developer & creator of pgref.dev

I built pgref.dev out of frustration with navigating the official PostgreSQL docs under pressure. It started as a personal reference and grew into a site covering 400+ functions across PostgreSQL, SQLite, and MariaDB. I write here about the problems I actually run into — not textbook examples, but the things that cost me real hours.