Introduction: Why Vectorized Query Execution Still Goes Wrong
Vectorized query execution looks straightforward on paper: process batches of values at a time, keep the CPU hot, and watch performance climb. Yet in my experience working on execution engines, this is exactly where subtle, expensive bugs and regressions tend to hide. The core problem isn’t the idea of vectorization itself, but how it interacts with branch-heavy logic, memory layout, pipelining, and modern CPUs.
Even experienced designers still repeat the same vectorized query execution mistakes: mixing row-wise assumptions into columnar paths, breaking cache locality with poorly sized batches, or bolting vectorization onto an already-fragmented pipeline. I’ve seen teams nail microbenchmarks but regress badly on real workloads because of these issues. In the rest of this article, I’ll walk through the top five pitfalls I keep encountering, and what has actually worked for me to avoid them in production systems.
1. Treating Vectorized Query Execution as a Drop-In Replacement for Iterators
When I first helped move an engine from a classic volcano iterator model to a vectorized design, the biggest trap was assuming we could just batch up tuples and keep the same control flow. On the surface, replacing next() with nextBatch() feels natural, but this mindset is one of the most common vectorized query execution mistakes I still see.
The iterator model is fundamentally row-oriented: each operator pulls one tuple at a time, often with lots of virtual calls and branches. Vectorized execution, by contrast, is about pushing big, uniform chunks of columnar data through tight loops that the CPU can predict and optimize. If I preserve per-row control decisions, deep call stacks, and stateful operators while merely operating on small batches, I usually end up with more overhead and almost none of the SIMD or cache benefits.
Here’s a minimal illustration in C-style pseudocode of the wrong mental model versus a more appropriate one:
// Naive "drop-in" style: iterator thinking with batches
int nextBatch(RowBatch *batch) {
    int produced = 0;
    while (produced < BATCH_SIZE) {
        Row r;
        if (!child_next(&r)) break;        // per-row call
        if (eval_predicate(&r)) {          // per-row branching
            batch->rows[produced++] = r;   // copy row
        }
    }
    return produced;
}
// More vector-native thinking: operate per column in tight loops
int nextBatch(VectorBatch *vb) {
    int n = child_nextBatch(vb);           // pull columns at once
    for (int i = 0; i < n; i++) {
        vb->selection[i] = eval_predicate_col(vb, i);
    }
    compact_with_selection(vb);            // apply predicate in bulk
    return vb->size;
}
In my experience, engines that just retrofit batches into iterator-style operators suffer from:
- Excess virtual dispatch, killing instruction cache locality.
- Branch-heavy per-row logic inside what should be vector-friendly loops.
- Limited SIMD usage because expressions aren’t designed for column-wise evaluation.
Instead of treating vectorization as a drop-in, I’ve had far better results by redesigning operators around columnar data flow, explicit selection vectors, and bulk expression evaluation. That usually means flattening control paths, minimizing per-batch state transitions, and letting whole vectors ride a single tight loop. Without that shift in mental model, vectorized query execution often underperforms a carefully tuned volcano engine, despite appearing more modern on paper.
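To make the selection-vector idea concrete, here is a minimal, runnable Python sketch (plain lists standing in for columnar buffers, and names of my own invention rather than any particular engine's API) of column-wise predicate evaluation followed by bulk compaction:

```python
# Minimal sketch of column-wise predicate evaluation with a selection vector.
# Plain Python lists stand in for columnar buffers; a real engine would use
# contiguous typed arrays so the loop can actually vectorize.

def eval_predicate_col(values, threshold):
    # One tight pass over the whole column: no per-row virtual calls.
    return [v > threshold for v in values]

def compact_with_selection(mask):
    # Turn the boolean mask into a selection vector of live indices.
    return [i for i, keep in enumerate(mask) if keep]

col = [5, 12, 7, 30, 1, 18]
mask = eval_predicate_col(col, 10)
sel = compact_with_selection(mask)
print(sel)                     # indices of rows passing the predicate
print([col[i] for i in sel])   # the surviving values
```

The point is structural: the predicate touches the column in one uninterrupted loop, and the filter's effect is recorded once as a selection vector rather than re-decided per row downstream.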
2. Using the Wrong Batch Size and Ignoring Cache Behavior
One of the most damaging vectorized query execution mistakes I see is treating batch size as a random constant instead of a cache-conscious design choice. When I first started tuning vectorized operators, I was surprised how often a single change in batch size swung performance by 2–3×, even though the algorithm was identical.
If batches are too small, fixed overhead dominates: function calls, bookkeeping, and branch checks all eat into the gains of vectorization. If batches are too large, the working set for an operator spills beyond L1/L2 cache, and suddenly those beautiful columnar scans turn into cache-miss storms. Things get worse when a batch touches many wide columns; a “reasonable” row count can still explode the per-batch memory footprint.
What has worked for me in practice is to size batches so that the hot columns for a pipeline stage fit comfortably in L2, then validate that with profiling, not guesswork. I also pay close attention to memory layout: tightly packed, column-major vectors with simple selection masks usually keep the cache happier than scattered buffers and indirection-heavy structures. Here’s a small sketch of the kind of sizing logic I’ve used in prototypes:
# Pseudo-calculation for batch sizing
L2_KB=1024
HOT_COLS=4
BYTES_PER_VALUE=8
TARGET_PCT=50   # leave headroom for other state (shell arithmetic is integer-only)
max_rows=$(( (L2_KB * 1024 * TARGET_PCT / 100) / (HOT_COLS * BYTES_PER_VALUE) ))
echo "Suggested batch size: $max_rows rows"
In my own engines, the sweet spot is rarely the same across workloads, so I treat batch size as a tunable with sensible defaults, guided by cache sizes and typical column widths. Ignoring these details is a reliable way to end up with vectorized code that looks right but benchmarks poorly.
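The same sizing rule reads naturally as a small, testable helper; the cache parameters below are illustrative defaults of mine, not measurements from any specific CPU:

```python
def suggest_batch_rows(l2_bytes, hot_cols, bytes_per_value, target_fraction=0.5):
    # Budget a fraction of L2 for the hot columns of one batch, leaving
    # headroom for hash tables, selection vectors, and other operator state.
    budget = int(l2_bytes * target_fraction)
    return budget // (hot_cols * bytes_per_value)

# Example: 1 MiB L2, four hot 8-byte columns, half the cache budgeted.
print(suggest_batch_rows(1024 * 1024, 4, 8))
```

Treating this as a starting point and then profiling, rather than hard-coding a "nice" power of two, is the part that actually matters.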
3. Ignoring Pipelining and Materializing Every Operator Output
Another pattern I keep seeing in vectorized query execution mistakes is the urge to materialize after every operator. When I first wired up a vector engine, it felt “clean” to have each node produce a full, owned batch, but that design quietly killed both cache locality and end-to-end latency.
In a healthy pipeline, vectors flow through several operators before they’re written out: a scan feeds a filter, which feeds a projection, which might feed a simple aggregation. As long as the same batch is reused and transformed in place (or with light selection vectors), the working set stays small and hot in cache. Materializing at each step forces repeated writes and reads to memory, bloats bandwidth, and increases pressure on allocators and garbage collectors.
Here’s a simplified sketch of the contrast in C-style pseudocode:
// Over-materialized style: every operator allocates and copies
VectorBatch *scan_out = scan_table();
VectorBatch *filt_out = apply_filter(scan_out);
VectorBatch *proj_out = apply_projection(filt_out);
VectorBatch *agg_out  = aggregate(proj_out);

// Pipelined style: transform a shared batch in-place
while (scan_next(scan_batch)) {
    apply_filter_inplace(scan_batch);      // update selection mask
    apply_projection_inplace(scan_batch);  // derive new columns
    aggregate_incremental(scan_batch);     // update aggregate state
}
In my own systems, moving to in-place pipelining often produced bigger wins than any micro-optimization of a single operator. It also simplified backpressure and scheduling, because I was pushing fewer, more meaningful batches through the graph.
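As a toy but runnable illustration of the pipelined style (my own simplified names, with Python lists standing in for vectors), each batch flows through filter, projection, and an incremental aggregate without any full intermediate result ever being materialized:

```python
def pipeline_sum(batches, threshold):
    # Aggregate state survives across batches; each batch is filtered and
    # projected in one pass, never copied out as a whole intermediate table.
    total = 0
    for batch in batches:
        # Filter: build a selection vector over this batch only.
        sel = [i for i, v in enumerate(batch) if v > threshold]
        for i in sel:
            total += batch[i] * 2   # projection (v * 2) feeds the aggregate directly
    return total

batches = [[1, 20, 3], [40, 5], [6, 70]]
print(pipeline_sum(batches, 10))   # only 20, 40, 70 survive the filter
```

The working set here is one batch plus a few counters, which is exactly what keeps the hot loop resident in cache in a real engine.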
4. Underestimating Null, Filter, and Selection Vector Handling
Once the basic vector operators are in place, nulls and selection vectors are where many vectorized query execution mistakes start to surface. In my experience, teams either bolt on null handling as an afterthought, or they over-complicate it until performance collapses under branches and mask shuffling.
A solid vector engine usually tracks three things per column: the values buffer, a null bitmap, and a selection vector that says which positions in the batch are actually active. If any of those get out of sync, you end up with nasty correctness bugs: predicates skipping null semantics, joins probing using stale selections, or aggregates counting filtered-out rows.
What has worked best for me is to treat the selection vector as the single source of truth for which indices are live, and make every operator either respect it or explicitly rebuild it. Null handling then becomes a matter of combining bitmasks instead of sprinkling if (is_null) checks everywhere. Here’s a small C-style sketch of that idea:
void and_predicate_with_nulls(Col *val, Bitmap *is_null,
                              SelVector *sel, Bitmap *out_mask) {
    // For every active index, compute (expr && !NULL)
    for (int k = 0; k < sel->count; k++) {
        int i = sel->idx[k];
        bool active = !bitmap_get(is_null, i) && eval_expr(val, i);
        bitmap_set(out_mask, i, active);
    }
}

void rebuild_selection_from_mask(Bitmap *mask, SelVector *sel) {
    sel->count = 0;
    for (int i = 0; i < BATCH_SIZE; i++) {
        if (bitmap_get(mask, i)) {
            sel->idx[sel->count++] = i;
        }
    }
}
When I started centralizing all filtering through selection vectors like this, a lot of edge-case bugs simply disappeared, and I could reason about CPU cost more clearly. The key is consistency: every operator must understand how nulls and selections propagate, or the whole pipeline becomes a minefield.
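The same two routines translate almost line for line into runnable Python (lists playing the role of bitmaps and selection vectors, and the predicate passed in as a stand-in for expression evaluation):

```python
def and_predicate_with_nulls(values, is_null, sel, predicate):
    # For every active index, compute (predicate && !NULL) into a mask;
    # inactive indices stay False and so stay filtered out.
    mask = [False] * len(values)
    for i in sel:
        mask[i] = (not is_null[i]) and predicate(values[i])
    return mask

def rebuild_selection_from_mask(mask):
    # The mask becomes the new single source of truth for live indices.
    return [i for i, live in enumerate(mask) if live]

values  = [10, 99, 3, 42]
is_null = [False, True, False, False]
sel     = [0, 1, 3]                  # index 2 was already filtered upstream
mask = and_predicate_with_nulls(values, is_null, sel, lambda v: v > 5)
print(rebuild_selection_from_mask(mask))   # nulls and non-matches drop out
```

Note how index 1 disappears because it is null even though no explicit if (is_null) check appears in any downstream operator; that is the payoff of centralizing the logic here.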
5. Forgetting Cost Models and Adaptive Strategies for Vectorized Plans
After moving to a vector engine, I’ve seen teams proudly keep their old iterator-era cost model and then wonder why the new system still picks terrible plans. This is one of the more subtle vectorized query execution mistakes: the physical execution changes dramatically, but the planner still thinks in terms of per-row costs and branch-heavy operators.
Vectorized execution shifts the trade-offs. Operators with expensive per-row logic may become cheap when amortized over a batch, while operators that trigger random I/O or scatter/gather patterns can become disproportionately expensive. Hash joins, for example, can benefit enormously from vectorized probing, but only if the cost model understands how batch size, cache fit, and SIMD-friendly predicates affect CPU time.
In my own engines, I’ve had to introduce distinct cost parameters for vectorized paths (e.g., cpu_per_batch, cpu_per_vector_op, cache miss penalties) and then calibrate them with microbenchmarks. I also found it valuable to add simple adaptive feedback: measure actual batch processing time at runtime, then feed that back to inform future decisions about join strategy or batch sizing.
Here’s a sketch of the kind of separation I like to keep between iterator and vector costs:
class CostModel:
    def estimate_iterator_filter(self, rows, sel_frac):
        # legacy row-wise estimate
        return rows * (self.cpu_per_row_filter + self.branch_mispredict_penalty)

    def estimate_vector_filter(self, rows, sel_frac, batch_size):
        batches = (rows + batch_size - 1) // batch_size
        return (batches * self.cpu_per_batch_overhead +
                rows * self.cpu_per_vector_predicate)

    def choose_filter_mode(self, rows, sel_frac, batch_size):
        it_cost = self.estimate_iterator_filter(rows, sel_frac)
        vec_cost = self.estimate_vector_filter(rows, sel_frac, batch_size)
        return "vector" if vec_cost < it_cost else "iterator"
Once I started modeling vector-specific costs like this, the planner stopped favoring plans that looked good on paper but thrashed caches in practice. Pairing that with lightweight runtime feedback closes the loop, so vectorized execution isn’t just an implementation detail but a first-class part of planning and adaptation.
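One lightweight way to close that feedback loop, sketched here with parameter names of my own invention, is to let an exponentially weighted moving average of observed per-batch time recalibrate the planner's constant over the life of the process:

```python
class AdaptiveVectorCost:
    # Hypothetical feedback loop: observed batch timings recalibrate the
    # planner's per-batch cost constant via an exponential moving average.
    def __init__(self, initial_cost_per_batch, alpha=0.2):
        self.cost_per_batch = initial_cost_per_batch
        self.alpha = alpha   # weight given to each new observation

    def observe(self, measured_batch_time):
        self.cost_per_batch = ((1 - self.alpha) * self.cost_per_batch +
                               self.alpha * measured_batch_time)

    def estimate(self, num_batches):
        return num_batches * self.cost_per_batch

model = AdaptiveVectorCost(initial_cost_per_batch=100.0)
for t in [50.0, 50.0, 50.0]:   # runtime keeps measuring cheaper batches
    model.observe(t)
print(round(model.cost_per_batch, 1))   # drifts from 100 toward 50
```

The smoothing keeps one noisy measurement from whipsawing plan choices, while persistent drift (a colder cache, wider rows) still gets absorbed within a handful of batches.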
Conclusion: Designing Vectorized Engines That Age Well
From what I’ve seen in real systems, vectorized query execution mistakes rarely come from bad intentions; they come from treating vectorization as a cosmetic refactor instead of a different execution model. Naively ported iterators, cache-blind batch sizing, over-materialized pipelines, sloppy null and selection handling, and iterator-era cost models all chip away at the gains you hoped for.
When I review a vectorized engine, I use a simple checklist:
- Does the engine truly think in columns and batches, or just wrap iterators?
- Are batch sizes and layouts tuned to CPU caches, not just “nice” powers of two?
- Are operators pipelined, with minimal intermediate materialization?
- Is there a single, consistent story for nulls, filters, and selection vectors?
- Does the optimizer have vector-aware cost models and adaptive feedback?
Engines that answer “yes” to those questions tend to age well: they’re easier to tune, more predictable under new workloads, and ready to take advantage of the next generation of hardware without another ground-up rewrite.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.