Introduction: When PostgreSQL Statistics Quietly Break Your Performance
Some of the slowest PostgreSQL databases I’ve tuned didn’t suffer from bad hardware or terrible SQL; they were held back by quietly broken PostgreSQL statistics. The planner was making perfectly logical decisions based on completely wrong assumptions about row counts, data distribution, and correlation.
When that happens, you see symptoms like sudden slowdowns after a data load, query plans that ignore obvious indexes, or hash joins exploding into multi-gigabyte memory users. In my own troubleshooting work, I’ve learned to suspect statistics early—especially when performance changes without any code changes.
In this article I’ll walk through seven common ways PostgreSQL statistics go bad, how that translates into slow queries, and the practical checks and fixes I rely on in real systems. By the end, you’ll know what to look for, how to confirm statistics problems, and how to keep them from creeping back.
1. Relying on Default ANALYZE Frequency for Rapidly Changing Tables
On almost every busy PostgreSQL system I’ve tuned, at least one high-churn table was running with completely misleading PostgreSQL statistics because the default autovacuum and autoanalyze settings just couldn’t keep up. The data was changing minute by minute, but the statistics still described how the table looked hours ago.
How infrequent ANALYZE breaks row count estimates
By default, PostgreSQL only triggers autoanalyze after a certain number of row inserts, updates, or deletes. On rapidly changing tables, that means the planner might estimate 10,000 matching rows when there are really 10 million—or the other way around. In my experience, this usually leads to:
- Sequential scans where an index scan would be far cheaper.
- Nested loop joins that should have been hash or merge joins.
- Join orders that explode intermediate result sizes unexpectedly.
Those choices are not “bad” in theory; they’re rational decisions based on stale statistics. I’ve seen simple lookup queries jump from milliseconds to several seconds right after a large batch load, purely because the planner was still trusting outdated row count estimates.
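A quick way to check whether a table's statistics have fallen behind is the pg_stat_user_tables view, which tracks modifications since the last analyze. The view and columns below are standard PostgreSQL, but the table name is a placeholder for your own hot table:

```sql
-- How stale are the statistics on a hot table?
-- n_mod_since_analyze counts rows changed since the last ANALYZE.
SELECT relname,
       n_mod_since_analyze,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'event_log';
```

A large n_mod_since_analyze relative to the table's size, combined with an old last_autoanalyze timestamp, is a strong hint that the planner is working from stale numbers.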
When to tune per-table autovacuum and autoanalyze
For hot tables—think event logs, orders, or session tables—I’ve had much better results by giving them custom autovacuum settings so ANALYZE runs more aggressively. A typical approach that worked well for me is to lower the autovacuum_analyze_threshold and autovacuum_analyze_scale_factor on just those tables, not globally for the whole cluster.
Here’s an example of tightening autoanalyze on a single high-churn table:
ALTER TABLE public.event_log
SET (autovacuum_analyze_threshold = 50,
autovacuum_analyze_scale_factor = 0.01);
After changes like this, I always watch EXPLAIN output to confirm that row count estimates are closer to reality and that the planner is choosing better indexes and join strategies. If you’re unsure how to pick sane thresholds for your workload, it can be helpful to study a deeper guide on tuning autovacuum and autoanalyze behavior: VACUUM – PostgreSQL Documentation.
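When checking, I compare the planner's estimated rows against the actual rows on the same plan node. The query below is a hypothetical lookup against the event_log table from the example; the columns are assumptions, not part of any real schema:

```sql
-- Compare "rows=..." (planner estimate) with "actual ... rows=..." in the output
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM public.event_log
WHERE account_id = 42
  AND created_at >= now() - interval '1 hour';
```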
2. Missing Extended Statistics on Correlated Columns
Some of the worst PostgreSQL statistics mistakes I’ve seen weren’t about staleness at all, but about correlation. The planner assumed columns were independent when, in reality, they were tightly linked—like country and state, or status and closed_at. Without extended statistics, PostgreSQL misestimates how many rows match multi-column filters, and that cascades into bad join choices, wrong index usage, and bloated hash tables.
How missing extended statistics skew cardinality estimates
By default, PostgreSQL statistics are mostly single-column: it tracks things like NULL fraction and most common values for each column in isolation. When you run a query with predicates on multiple correlated columns, the planner assumes they’re independent and multiplies selectivities.
In practice, I’ve seen this go wrong in patterns like:
- Geographic data: country_code and state_code or region.
- Time-bounded states: status = 'closed' and closed_at IS NOT NULL.
- Type + subtype: order_type and payment_method, where only certain combinations make sense.
Without extended statistics, the planner might believe a multi-column filter matches 0.1% of the table when it’s actually hitting 20%, or vice versa. I once debugged a reporting query where every predicate made sense, but PostgreSQL insisted on a nested loop with an index on the “wrong” side of the join because it massively underestimated the filtered row count due to correlated filters.
Teaching the planner about correlated columns
From PostgreSQL 10 onward, you can teach the planner about correlated columns using extended statistics. In my experience, the simplest and most impactful step is to define ndistinct or dependencies statistics on the column groups that show up together in WHERE clauses and joins.
Here’s a pattern I use regularly for correlated geographic columns:
-- Tell PostgreSQL these columns are correlated
CREATE STATISTICS stats_users_country_state (ndistinct, dependencies)
ON country_code, state_code
FROM public.users;

-- Recompute statistics so the planner can use them
ANALYZE public.users;
After creating these extended statistics, I rerun EXPLAIN (ANALYZE) on the affected queries to compare estimated vs actual row counts. When they line up much more closely, index selection and join strategies usually improve immediately, without touching the SQL itself.
If you’re not sure which statistics kinds to use, or how PostgreSQL uses ndistinct, dependencies, and mcv internally, it’s worth reading a focused guide on choosing and interpreting extended statistics types in PostgreSQL: 14.2. Statistics Used by the Planner – PostgreSQL Documentation.
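On PostgreSQL 12 and newer, you can also confirm that the extended statistics were actually built by querying the pg_stats_ext view (the object name matches the earlier CREATE STATISTICS example):

```sql
-- Verify the extended statistics object exists and has been populated
SELECT statistics_name,
       attnames,
       kinds,
       n_distinct,
       dependencies
FROM pg_stats_ext
WHERE tablename = 'users';
```

If n_distinct and dependencies come back NULL, ANALYZE hasn't run on the table since the statistics object was created.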
3. Low Default Statistics Targets on Skewed or Wide Distributions
PostgreSQL’s default_statistics_target (100 out of the box) is a reasonable compromise for many workloads, but on tables with heavily skewed values or very wide distributions, I’ve seen it miss the real shape of the data completely. The result is PostgreSQL statistics that under-represent hot values and long tails, so the planner has no idea that a “rare” filter actually hits half the table, or that a popular key will explode a hash join.
How low statistics targets hide skew and long tails
With a low target, PostgreSQL builds relatively small histograms and most-common-values lists. On skewed data—like a status column where one value dominates, or an account_id where a few tenants generate most of the traffic—that limited sample can smooth over the very patterns that drive performance problems. I’ve traced more than one slow query back to this: the planner estimated a tiny result set from a “selective” predicate, then chose a nested loop and index plan that was completely overwhelmed by reality.
Raising per-column statistics targets where it matters
Instead of cranking up the global default_statistics_target, I’ve had the best results raising the per-column statistics target only on the columns that truly need more detail—typically hot foreign keys, tenant IDs, or status flags that show extreme skew. That gives the planner richer PostgreSQL statistics without bloating ANALYZE time for the whole schema.
Here’s how I usually tune a single problematic column:
-- Increase statistics detail for a skewed column
ALTER TABLE public.orders
ALTER COLUMN customer_id SET STATISTICS 500;

-- Refresh statistics so the new target is applied
ANALYZE public.orders;
After this kind of change, I always compare EXPLAIN (ANALYZE) before and after. When the estimates for hot predicates get closer to actual row counts, join types and index usage typically improve on their own—no query rewrite required.
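To see which columns already carry a custom target, I check pg_attribute. The catalog and column names below are standard; the table name is illustrative. A value of -1 (or NULL on PostgreSQL 17 and newer) means the column falls back to default_statistics_target:

```sql
-- List per-column statistics targets for a table
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'public.orders'::regclass
  AND attnum > 0
  AND NOT attisdropped;
```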
4. Forgetting to ANALYZE After Bulk Loads and Large Migrations
Some of the ugliest slowdowns I’ve seen happened right after a successful bulk import or migration. The data was in place, indexes were built, but PostgreSQL statistics still reflected the old, tiny tables. The planner was making decisions as if there were thousands of rows when there were now tens or hundreds of millions.
Why bulk operations poison existing statistics
Bulk loads, backfills, and large UPDATE sweeps change a massive portion of a table in one go. Autovacuum and autoanalyze are not designed to react instantly to that kind of step change, so for a while your statistics are completely disconnected from reality. I’ve seen this lead to full-table scans on fresh, well-indexed tables and catastrophic join choices in nightly reporting jobs that run right after ETL.
One thing I learned the hard way was that even perfect index design can’t save you if the planner believes the table is still small or uniformly distributed when it’s not.
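A quick sanity check after a bulk load is to compare the planner's notion of the table size (reltuples in pg_class) against the real row count; a large gap means the statistics predate the load. The table name here is illustrative:

```sql
-- Planner's stale row estimate vs. reality after a bulk load
SELECT (SELECT reltuples::bigint
        FROM pg_class
        WHERE oid = 'public.orders'::regclass) AS planner_estimate,
       (SELECT count(*) FROM public.orders)    AS actual_rows;
```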
Targeted ANALYZE strategies after big changes
These days, I treat ANALYZE as part of the deployment checklist for any large data change. For big imports or migrations, I run an explicit analyze on the affected tables—and sometimes only the key columns—to get the planner back on track quickly.
-- Refresh statistics for heavily modified tables
ANALYZE public.orders;
ANALYZE public.customers;

-- Or, if time is tight, focus on the most important columns
ANALYZE public.orders (customer_id, status, created_at);
On high-traffic systems, I’ll often schedule this analyze step during a low-traffic window or immediately after the bulk job finishes. Watching EXPLAIN (ANALYZE) before and after is usually enough to confirm that row count estimates, join orders, and index usage have snapped back into line.
5. Misleading PostgreSQL Statistics on Partitioned Tables
Partitioned tables are fantastic for big time-series or multi-tenant workloads, but I’ve also seen them produce some of the strangest query plans when PostgreSQL statistics weren’t in sync across partitions. The planner ends up mixing global assumptions with per-partition stats, and if some partitions are stale or unanalyzed, it can badly misjudge how much data a query will actually touch.
How partition-level stats go out of sync
Each child partition keeps its own statistics, and the partitioned parent only gets inheritance-level statistics when you run ANALYZE on it explicitly—autovacuum does not process the parent table itself. In practice, a few common issues crop up:
- New partitions created for recent dates never get analyzed before they receive heavy traffic.
- Older partitions are tiny, newer ones are huge, but the parent-level stats suggest they’re all similar.
- Data distribution shifts over time, so the most common values differ wildly between partitions.
I’ve seen this lead to queries that should hit a single recent partition but get planned as if they might touch a large fraction of the whole table, resulting in bad join orders and overly cautious index choices.
Keeping partitioned table statistics accurate
What’s worked best for me is treating partition stats as a first-class maintenance concern. That means:
- Running ANALYZE explicitly on new partitions right after creation or initial load.
- Scheduling targeted analyze jobs for the “hot” recent partitions more often than for older, rarely touched ones.
- Watching EXPLAIN (ANALYZE) to compare estimated vs actual row counts per partition, especially on time-bounded queries.
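To find partitions that have never been analyzed, or not recently, I join pg_inherits with pg_stat_all_tables. The catalogs and columns are standard; the parent table name is a placeholder:

```sql
-- Partitions of public.events and when each was last analyzed
SELECT c.relname AS partition,
       s.last_analyze,
       s.last_autoanalyze
FROM pg_inherits i
JOIN pg_class c           ON c.oid = i.inhrelid
JOIN pg_stat_all_tables s ON s.relid = c.oid
WHERE i.inhparent = 'public.events'::regclass
ORDER BY c.relname;
```

Partitions showing NULL in both timestamp columns have never been analyzed and are the first candidates for an explicit ANALYZE.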
Here’s a simple pattern I use to refresh statistics on the latest partitions after a nightly load:
-- Analyze the most recent partitions explicitly
ANALYZE public.events_2026_02;
ANALYZE public.events_2026_03;

-- Optionally, analyze only critical columns if time is tight
ANALYZE public.events_2026_03 (account_id, created_at, event_type);
Once I wired this kind of targeted analyze into maintenance scripts, the wild swings in estimated row counts on partitioned tables disappeared, and the planner started producing much more stable, predictable plans. For a deeper dive into best practices around statistics and partitioned planning, it can be helpful to consult a dedicated resource on managing statistics for PostgreSQL partitioned tables: 5.12. Table Partitioning – PostgreSQL Official Documentation.
6. Misreading pg_stats and pg_statistic When Debugging Slow Queries
When I’m chasing a bad query plan, I almost always end up in pg_stats or pg_statistic. Early on, I made things worse by misreading those PostgreSQL statistics and convincing myself the planner was “wrong” when it was simply following the numbers I hadn’t understood. Learning how to read these catalogs properly turned slow-query debugging from guesswork into something much more systematic.
Key fields in pg_stats and what they really mean
In day-to-day work, I rarely query pg_statistic directly; I use pg_stats, which is a more readable view. The main columns I lean on are:
- n_distinct – Estimated number of distinct values. Negative values mean “a fraction of the table”; for example, -0.1 means roughly 10% of the row count. If this is way off for a join key, you’ll see join misestimates.
- null_frac – Fraction of rows that are NULL. This affects selectivity for predicates like IS NULL or IS NOT NULL.
- most_common_vals / most_common_freqs – Which values are “popular” and how often they appear. On skewed columns, I check that my hot values really show up here.
- histogram_bounds – Sampled distribution across the remaining values. This is what the planner uses when a predicate doesn’t hit a most-common value.
When a query is slow, I’ll compare these numbers to what I know about the data. If, for example, n_distinct says there are only 100 distinct customer_id values on a table that clearly has millions, I know the planner is estimating join sizes on fantasy data.
Here’s a small query I’ve used many times while debugging:
SELECT
schemaname,
tablename,
attname,
n_distinct,
null_frac,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'
AND tablename = 'orders'
AND attname IN ('customer_id', 'status');
After running this, I’ll put it side by side with EXPLAIN (ANALYZE) to see whether the planner’s row estimates line up with what these statistics suggest. If they’re both wrong in the same direction, it’s a stats problem, not a planner bug.
A simple workflow for using statistics to debug misestimates
Over time I’ve settled into a repeatable process for using PostgreSQL statistics catalogs when a plan looks suspicious:
- Capture EXPLAIN (ANALYZE) for the slow query, and note where estimated vs actual row counts diverge badly (often on joins or filters).
- Identify the key columns on those nodes—usually join keys and filter columns.
- Inspect pg_stats for those columns and tables, checking n_distinct, null_frac, and most-common lists for obvious nonsense or staleness.
- Check recency by looking at when the table was last analyzed (via pg_stat_all_tables) and whether there were bulk changes since then.
- Adjust and verify: run ANALYZE, tweak statistics targets, or add extended statistics as needed, then rerun EXPLAIN (ANALYZE) to confirm the estimates improved.
Once I started using this loop consistently, diagnosing misestimation stopped being a guessing game. Instead of just “adding indexes and hoping,” I could see exactly how the PostgreSQL statistics were shaping the planner’s decisions and fix the root cause.
7. Assuming PostgreSQL Statistics Are the Only Performance Problem
Once I learned how powerful PostgreSQL statistics are, I swung too far the other way and tried to solve every slow query with ANALYZE and statistics_target tweaks. That helped a lot, but it also hid other real problems: bad indexes, underpowered hardware, or simply inefficient SQL.
Statistics are one piece of the puzzle
When a plan looks wrong, I now treat stats as just one possible culprit alongside:
- Index design – Missing composite indexes, wrong column order, or indexes that don’t match common predicates.
- Query shape – OR-heavy predicates, unnecessary DISTINCTs, or complex CTEs can all push the planner into weaker plans.
- Configuration and hardware – Too-small work_mem, poor I/O, or CPU pressure can turn a reasonable plan into a slow one.
- Bloat and vacuum – Heavily bloated tables or indexes increase I/O even if the row estimates are perfect.
These days, when I debug performance, I start with EXPLAIN (ANALYZE) and then deliberately check: are the PostgreSQL statistics wrong, or is the plan simply exposing a deeper design or infrastructure issue? Keeping that distinction clear has saved me from over-tuning stats while the real bottleneck sat elsewhere.
Conclusion: A Practical Checklist for Healthy PostgreSQL Statistics
When I started treating PostgreSQL statistics as a core part of performance tuning, slow-query work became much more predictable. The goal isn’t perfection; it’s to avoid the big, easy-to-make mistakes that push the planner into obviously bad choices.
Here’s the checklist I now keep in the back of my mind:
- Keep autovacuum and autoanalyze healthy – Don’t disable them casually; tune thresholds instead.
- Run ANALYZE after bulk loads and migrations – Especially on fresh tables and heavily updated ones.
- Raise statistics_target selectively – Focus on skewed columns, hot join keys, and important predicates.
- Give partitioned tables extra care – Analyze new and hot partitions explicitly; don’t rely only on parent stats.
- Use pg_stats to validate assumptions – Check n_distinct, null_frac, and most_common_vals against your mental model of the data.
- Compare estimates to reality – Use EXPLAIN (ANALYZE) to spot big misestimates and confirm your fixes.
- Remember stats aren’t everything – Revisit index design, query shape, configuration, and bloat as part of the same investigation.
If you bake these habits into your regular maintenance and review workflows, PostgreSQL statistics stop being a mysterious source of random slowdowns and instead become one of your most reliable tuning tools.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.





