Introduction: Why Columnar Storage Compression Fails in Practice
Columnar storage compression promises huge wins for modern OLAP systems: smaller files, lower memory footprints, and dramatically faster scans. On paper, it looks like a free performance upgrade. But in my experience tuning real-world warehouses, those gains often evaporate once workloads hit production.
The problem isn’t the idea of columnar storage compression; it’s the subtle design mistakes around how it’s applied. Poorly chosen encodings, badly clustered data, or compression that ignores access patterns can turn a theoretically optimal layout into something slower than a plain row store. I’ve seen teams proudly achieve 10x smaller data files, only to discover query latencies are worse because CPUs are burning cycles decompressing the wrong data.
This article breaks down the top mistakes I keep encountering with columnar storage compression and shows how to avoid design choices that silently erase your expected performance gains.
1. Treating Columnar Storage Compression as a Generic “On/Off” Toggle
When I first worked with columnar storage compression, I made the same mistake I now see on a lot of teams: treating compression as a single checkbox to flip on at the table level. In practice, compression should be a per-column, per-encoding decision, guided by data distribution and query patterns, not a blanket policy.
Why a One-Size-Fits-All Switch Backfires
Different columns benefit from very different strategies. Highly repetitive low-cardinality dimensions (like status codes) thrive with dictionary or run-length encoding, while wide, high-cardinality text fields might be better left lightly compressed or even uncompressed. If everything is forced through the same codec, you often get:
- Unnecessary CPU burn on columns that are frequently scanned but don’t compress well.
- Poor scan speeds because the engine spends more time decompressing than filtering.
- Misleading size savings that don’t translate into end-to-end query performance.
In my experience, the biggest wins came when I started classifying columns (hot vs. cold, dimension vs. metric, point lookup vs. scan-heavy) and then tuning compression at that level instead of relying on defaults.
Designing Compression Per Column, Not Per Table
A simple mental model that has worked well for me is:
- Access-heavy numeric metrics: prefer light, CPU-cheap compression (e.g., simple encoding) to keep scans fast.
- Static or rarely queried columns: push for more aggressive compression to save storage and memory.
- High-cardinality text: benchmark different codecs and don’t be afraid to skip compression if CPU dominates.
Here’s a conceptual example of how I might configure per-column compression in a warehouse that supports explicit encodings:
CREATE TABLE events (
event_id BIGINT ENCODING RAW, -- hot, frequently joined key
event_type VARCHAR ENCODING DICT, -- low-cardinality dimension
user_id BIGINT ENCODING DELTA, -- sorted, good for delta encoding
created_at TIMESTAMP ENCODING RLE, -- often clustered, runs compress well
payload_json VARCHAR ENCODING LZO -- compressible but CPU-aware codec
);
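One way to keep these per-column decisions consistent across hundreds of columns is a small heuristic over cheap statistics. The sketch below is purely illustrative — the thresholds and encoding labels are my own assumptions, not any engine's real API — but it captures the classification I do by hand: low cardinality suggests dictionary codes, mostly-sorted integers suggest deltas, and everything else stays cheap to scan.

```python
def pick_encoding(values):
    """Suggest a columnar encoding from two cheap statistics.

    Illustrative heuristic only: the thresholds and encoding names
    are assumptions, not any real engine's API.
    """
    n = len(values)
    cardinality_ratio = len(set(values)) / n

    if cardinality_ratio < 0.01:
        return "DICT"       # few distinct values: dictionary codes win
    if all(isinstance(v, int) for v in values):
        sorted_frac = sum(b >= a for a, b in zip(values, values[1:])) / (n - 1)
        if sorted_frac > 0.95:
            return "DELTA"  # mostly increasing: deltas stay small
    return "RAW"            # high-cardinality, unordered: keep scans CPU-cheap

print(pick_encoding(["ok"] * 990 + ["error"] * 10))  # DICT
print(pick_encoding(list(range(100_000))))           # DELTA
```

In practice I'd feed this kind of heuristic from the engine's own column statistics rather than raw values, but the decision structure is the same.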
By thinking this way, columnar storage compression becomes a design tool rather than a magic switch. That shift alone can turn a system that looks good in storage metrics into one that actually feels fast under real analytical workloads.
For readers who want a deeper dive into how different compression encodings impact OLAP query performance, I recommend looking for a detailed guide on columnar encoding strategies in modern data warehouses: A Data Engineer’s Guide to Columnar Storage.
2. Overcompressing Hot Columns and Creating CPU Bottlenecks
One thing I learned the hard way with columnar storage compression is that it’s very easy to “win” on disk usage and completely lose on CPU. The classic trap is overcompressing hot columns—the ones every dashboard and ad-hoc query touches—so much that the cost of repeatedly decompressing them dwarfs any I/O savings.
When Compression Shifts the Bottleneck from I/O to CPU
In modern vectorized engines, data moves through tight CPU loops in batches. If those loops have to invoke a heavy codec on every batch for a hot column, your scans stall on decompression instead of disk. I’ve seen this when teams enable the strongest codec globally and suddenly observe:
- High CPU utilization with little improvement in end-to-end latency.
- Reduced throughput under concurrency because cores are busy inflating data instead of evaluating predicates.
- Poor cache behavior as complex decompression logic thrashes instruction and data caches.
In practice, the columns targeted in every WHERE clause, GROUP BY, or JOIN key should favor lightweight encodings, even if that means accepting slightly larger storage footprints.
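The CPU-vs-size tradeoff is easy to demonstrate with a general-purpose codec. The sketch below uses Python's zlib as a stand-in (the exact numbers are workload-dependent): compressing the same synthetic column at a light and a heavy effort level shows the heavy level producing a smaller payload at a higher CPU cost — the price a hot column pays on every scan.

```python
import time
import zlib

# Synthetic "hot column": repetitive but not trivially compressible
column = b"".join(f"user_{i % 5000:06d},".encode() for i in range(200_000))

for level in (1, 9):  # light vs. heavy compression effort
    start = time.perf_counter()
    compressed = zlib.compress(column, level)
    elapsed = time.perf_counter() - start
    print(f"level={level} size={len(compressed)} time={elapsed:.3f}s")
    # The round-trip must be lossless regardless of level
    assert zlib.decompress(compressed) == column
```

On real data the gap between levels varies widely, which is exactly why hot columns deserve their own benchmark rather than a global default.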
Choosing Lighter Encodings for Hot Paths
My rule of thumb is simple: if a column ranks among the ten most frequently referenced in your queries, benchmark a lighter encoding first. For example, here’s a conceptual sketch of how I’d lean away from heavyweight codecs on hot analytics columns:
-- Pseudocode-style DDL focusing on hot vs cold columns
CREATE TABLE page_views (
view_id BIGINT ENCODING RAW, -- hot, scanned constantly
user_id BIGINT ENCODING DELTA, -- hot, filters & joins
url_id INT ENCODING DICT, -- hot, low-cardinality
country VARCHAR ENCODING DICT, -- hot, group-bys
user_agent VARCHAR ENCODING LZO, -- colder, heavier codec ok
raw_payload VARCHAR ENCODING ZSTD -- cold, maximize compression
);
Once I started profiling “CPU cost per GB scanned” instead of just “GB on disk,” it became obvious which hot columns needed lighter treatment. That small adjustment alone often recovers the performance teams expected from columnar storage compression in the first place. For a deeper exploration of how vectorized query engines interact with compression choices, it’s worth reading more about compression-aware vectorized execution: Composable Data Management: An Execution Overview | Proceedings of the VLDB Endowment.
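That metric is straightforward to derive from counters most engines already expose. A minimal sketch — the input fields here are hypothetical profiler output, not a real engine's schema:

```python
def cpu_cost_per_gb(samples):
    """Rank columns by CPU seconds spent per GB scanned.

    `samples` is hypothetical profiler output: one dict per column
    with cumulative cpu_seconds and bytes_scanned.
    """
    GB = 1024 ** 3
    ranked = [
        (s["column"], s["cpu_seconds"] / (s["bytes_scanned"] / GB))
        for s in samples
        if s["bytes_scanned"] > 0
    ]
    return sorted(ranked, key=lambda r: r[1], reverse=True)

profile = [
    {"column": "raw_payload", "cpu_seconds": 1.2, "bytes_scanned": 40 * 1024**3},
    {"column": "user_id", "cpu_seconds": 9.5, "bytes_scanned": 8 * 1024**3},
]
# user_id burns far more CPU per GB scanned: a candidate for a lighter encoding
print(cpu_cost_per_gb(profile))
```

Columns that float to the top of this ranking despite being small on disk are usually the ones paying a decompression tax on every query.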
3. Ignoring Encoding Strategies Beyond Basic Compression
In my experience, the teams that get the most out of columnar storage compression don’t obsess over which general-purpose codec to use; they obsess over encodings. Focusing only on ZSTD vs. Snappy vs. GZIP misses the real power of column formats: how values are represented before they ever reach a codec.
Why Encodings Matter More Than the Codec Logo
Encodings like dictionary, run-length encoding (RLE), delta, and bit-packing reshape your data into patterns that are both smaller and faster to process. When I moved a log analytics workload from “Snappy everywhere” to carefully chosen encodings, I saw bigger query gains from encoding changes than from swapping codecs.
- Dictionary encoding replaces repeated strings or IDs with small integer codes, shrinking data and boosting CPU cache efficiency.
- Run-length encoding (RLE) collapses long runs of the same value, which is common on sorted or clustered columns.
- Delta encoding stores differences between consecutive values, ideal for monotonic timestamps or IDs.
- Bit-packing squeezes small integers into minimal bits (e.g., 0–15 into 4 bits), dense and CPU-friendly for vectorized scans.
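The effect of each of these encodings is easy to see on toy data. A small illustration (simplified representations, not any format's actual wire layout):

```python
# Toy demonstration of semantic encodings on a sorted, repetitive column
values = [100] * 6 + [101] * 3 + [105] * 4   # e.g., a clustered dimension

# Run-length encoding: (value, run_length) pairs
rle = []
for v in values:
    if rle and rle[-1][0] == v:
        rle[-1][1] += 1
    else:
        rle.append([v, 1])
print(rle)  # [[100, 6], [101, 3], [105, 4]]

# Delta encoding: first value plus consecutive differences
deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
print(deltas[:4])  # [100, 0, 0, 0] — small numbers bit-pack into few bits

# Dictionary encoding: distinct values -> small integer codes
dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
codes = [dictionary[v] for v in values]
print(dictionary, codes[:4])  # {100: 0, 101: 1, 105: 2} [0, 0, 0, 0]

# Bit-packing: codes 0-2 fit in 2 bits instead of 64
bits_needed = max(codes).bit_length()
print(f"{bits_needed} bits per code vs 64 for a raw BIGINT")
```

Notice how the encodings compose: dictionary codes and deltas are small integers, which is precisely what bit-packing and a general-purpose codec then exploit.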
The key insight is that these encodings are semantic: they exploit properties of the column itself. The codec (ZSTD, Snappy, etc.) then works on an already compact, regular representation and often becomes almost a secondary concern.
Practical Examples of Encoding-Aware Design
When I design a schema now, I start by asking: “What’s the smallest and most CPU-friendly way to represent this column?” Only after that do I think about which codec to layer on top. Here’s a simplified example of how I might lay out encodings for an analytics table:
CREATE TABLE orders (
order_id BIGINT ENCODING DELTA, -- mostly increasing
customer_id INT ENCODING BITPACK, -- bounded ID range
status VARCHAR ENCODING DICT, -- few distinct values
country VARCHAR ENCODING DICT, -- classic dimension
created_at TIMESTAMP ENCODING DELTA, -- time-series
ship_date TIMESTAMP ENCODING DELTA, -- correlated with created_at
total_amount DECIMAL(12,2) ENCODING BITPACK, -- numeric facts stored as scaled integers
notes VARCHAR ENCODING LZO -- free text, best-effort
);
For columns like status and country, dictionary encoding not only compresses well, it speeds up filters and group-bys because the engine can operate on compact integer codes. For time-series timestamps, delta encoding plus optional bit-packing usually beats throwing a heavyweight codec at raw 64-bit values.
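The speed-up from dictionary codes comes down to comparing small integers instead of strings. A toy sketch of the idea (a simplification, not any engine's internals):

```python
from collections import Counter

# Dictionary-encoded country column: one lookup table, integer codes per row
dictionary = ["CA", "US", "DE"]
codes = [0, 1, 1, 2, 0, 1, 0]  # row values as indices into `dictionary`

# Filter "country = 'US'": translate the literal once, then compare ints
target = dictionary.index("US")
matching_rows = [i for i, c in enumerate(codes) if c == target]
print(matching_rows)  # [1, 2, 5]

# Group-by runs entirely on codes, decoding only once per group at the end
group_counts = Counter(codes)
print({dictionary[c]: n for c, n in group_counts.items()})  # {'CA': 3, 'US': 3, 'DE': 1}
```

The string comparison happens exactly once, at translation time; every per-row operation after that touches only compact integers that fit in CPU caches.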
Once I started treating encodings as my “first line of compression” and codecs as an optional second layer, columnar storage compression stopped feeling like a black box and started behaving predictably under real workloads.
4. Designing Columnar Storage Compression Without Realistic Workload Benchmarks
The worst columnar storage compression configs I’ve seen all had the same root cause: they were tuned against clean synthetic benchmarks instead of the ugly, mixed OLAP workloads that actually run in production. I’ve made this mistake myself—optimizing for a single wide table scan, then watching performance collapse once real dashboards, joins, and concurrent users hit the system.
Why Synthetic or Micro Benchmarks Lie
Micro-benchmarks usually focus on a narrow scenario: scan one table, run a single predicate, maybe aggregate a column. Real workloads combine:
- Multiple joins across differently compressed tables.
- Filters on hot columns plus occasional probes into wide text fields.
- Concurrent users triggering overlapping scans and cache pressure.
If you only benchmark a single-threaded full scan, you’ll naturally pick the strongest codec and most aggressive encodings. Under concurrency, that same choice can explode CPU usage and thrash caches. In my experience, compression that looks “perfect” in isolation often crumbles once BI tools start firing dozens of similar-but-not-identical queries.
Building Compression Benchmarks That Resemble Reality
What’s worked well for me is to capture production-like query mixes and replay them against alternative compression profiles. That means testing:
- A blend of heavy dashboards, ad-hoc queries, and background batch jobs.
- Realistic concurrency (e.g., 10–50 parallel sessions, not just one).
- Both steady-state behavior and bursts, such as morning traffic spikes.
Even a simple replay script that runs your top 20 queries against two different compression configs can be revealing. For example, you might maintain two schemas, facts_fast and facts_small, and run a basic harness:
# Pseudo-benchmark loop comparing two compression profiles
for q in queries/*.sql; do
  echo "Running $q on facts_fast" >> results.log
  time psql -f "$q" -v schema=facts_fast >> results.log
  echo "Running $q on facts_small" >> results.log
  time psql -f "$q" -v schema=facts_small >> results.log
  echo "---" >> results.log
done
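To layer realistic concurrency on top of a serial replay, a thread-pool harness is often enough. The sketch below uses a stubbed query function as a stand-in for a real database call; the session counts mirror the 10–50 parallel users mentioned above:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    """Stand-in for executing one query; swap in a real DB call."""
    time.sleep(0.01)  # simulate query latency
    return sql

def replay(queries, sessions):
    """Replay a query mix at a fixed concurrency and report throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        list(pool.map(run_query, queries))
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed  # queries per second

mix = ["SELECT ..."] * 100  # your top queries, repeated to a realistic volume
for sessions in (1, 10, 50):
    print(f"{sessions:>2} sessions: {replay(mix, sessions):.0f} qps")
```

Running the same harness against each compression profile gives you the “queries per second at realistic concurrency” number that single-scan benchmarks hide.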
Once I started measuring “queries per second at realistic concurrency” instead of “fastest single scan,” my compression choices changed dramatically—usually toward lighter encodings on hot columns and more targeted heavy compression on cold data. For more depth, it’s worth exploring best practices around workload-driven benchmarking for analytical databases: DSB: A Decision Support Benchmark for Workload-Driven and Traditional Database Systems – Microsoft Research.
5. Forgetting Operational Concerns: Backfill, Vacuum, and Schema Evolution
Even when columnar storage compression looks perfect on day one, I’ve watched it quietly degrade over months because nobody planned for the boring operational stuff: backfills, compaction, vacuum, and schema changes. On long-lived OLAP systems, these processes often matter more than the initial compression settings.
How Backfills and Maintenance Drift Ruin Compression
Backfills are one of the biggest silent killers. When you trickle historical data into a system designed for append-only, time-ordered loads, you break clustering and value locality—the very things your encodings depend on. I’ve seen beautifully compressed partitions turn into a patchwork of tiny fragments after repeated backfills and partial rewrites.
- Out-of-order inserts destroy run-length and delta patterns, reducing their effectiveness.
- Frequent small updates create lots of versions that need vacuum or compaction to reclaim space.
- Skipped maintenance windows leave bloated storage files that scan slower and compress worse.
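The damage from out-of-order backfills is easy to quantify: count the runs an RLE encoder would see before and after ordering breaks down. A toy illustration:

```python
import random

def count_runs(values):
    """Number of (value, length) pairs an RLE encoder would emit."""
    return 1 + sum(b != a for a, b in zip(values, values[1:]))

# Time-ordered load: long runs of the same day, ideal for RLE
ordered = [day for day in range(10) for _ in range(1000)]
print(count_runs(ordered))  # 10 runs for 10,000 rows

# Same rows after unordered backfills interleave the days
random.seed(0)  # deterministic for the example
shuffled = ordered[:]
random.shuffle(shuffled)
print(count_runs(shuffled))  # thousands of runs: RLE gains mostly gone
```

The data is identical in both cases; only its physical order changed, and with it most of the compression.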
In my own systems, the moment I started scheduling regular compaction/vacuum jobs with compression in mind, storage stabilized and query performance stopped drifting downward over time.
Schema Evolution and Compression-Friendly Layouts
Schema evolution is another area where compression quietly suffers. Adding columns in the middle of wide tables, changing types, or altering encodings without a plan can lead to mixed layouts and partially compressed blocks that are harder for the engine to optimize.
- New rarely-used columns added to hot tables can inflate row width and hurt cache efficiency.
- Type changes (e.g., VARCHAR to BIGINT) can leave legacy segments still using the old, less efficient encoding.
- Re-encoding migrations done piecemeal may mix old and new encodings within the same partition.
What’s worked well for me is treating schema evolution as a planned migration, not an ad-hoc DDL tweak. For example, I’ll often:
- Stage new schemas (e.g., facts_v2) with better encodings.
- Backfill in large, ordered batches to preserve clustering.
- Run a controlled vacuum/compaction cycle before cutting traffic over.
Once I started tying operational playbooks directly to compression goals—backfill strategy, maintenance cadence, and schema evolution policies—columnar storage compression behaved consistently instead of decaying into a slow, bloated mess over time.
Conclusion: Designing Columnar Storage Compression Like a System Engineer
When I treat columnar storage compression as a system design problem—not a checkbox—the results are consistently better: faster scans, predictable CPU use, and stable performance over time. The common mistakes all stem from ignoring context: real workloads, hot vs. cold data, encodings, and long-term operations.
Here’s the compact checklist I now use when tuning:
- Per-column mindset: Avoid a global on/off switch; choose encodings and codecs per column.
- Protect hot columns: Prefer lighter encodings and codecs on heavily scanned/filter/join columns.
- Exploit encodings first: Use dictionary, RLE, delta, and bit-packing before reaching for heavier codecs.
- Benchmark real workloads: Replay production-like query mixes with realistic concurrency, not just micro-tests.
- Plan operations: Align backfills, vacuum/compaction, and schema evolution with your compression strategy.
If you approach columnar storage compression with that checklist and a system engineer’s curiosity, you’ll rarely need to guess which settings make your warehouse both small and fast.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.
