
Designing a Vectorized Query Execution Engine: Modern Tradeoffs and Patterns

Introduction: Why Vectorized Query Execution Engines Are Dominating Modern Databases

When I started working with analytical databases, most engines I touched were still built around classic iterator-based execution: the familiar Volcano-style model that pulls one tuple at a time through a chain of operators. It was elegant and composable, but on modern CPUs it always felt like I was driving a sports car in first gear. As hardware evolved, that gap between theoretical peak performance and real query throughput became impossible to ignore.

A modern vectorized query execution engine flips the core assumption of the iterator model. Instead of calling next() for a single tuple, operators process whole batches (vectors) of columnar data at once. That small conceptual shift unlocks a stack of hardware-friendly optimizations: better CPU cache usage, efficient SIMD (Single Instruction, Multiple Data) execution, fewer virtual function calls, and less branching overhead. In my experience, once you profile both models side by side on real analytical workloads, the gains from vectorization are hard to walk away from.

This article dives into how to design a vectorized query execution engine with a focus on practical tradeoffs and patterns I’ve seen work in production systems. I’ll walk through the execution model, batch and memory layout decisions, operator interfaces, expression evaluation strategies, and how to balance raw performance against maintainability. I won’t rehash textbook database theory; instead, I’ll zoom in on the real-world engineering decisions you have to make when taking a vectorized engine from prototype to something you’d trust with serious workloads.

Iterator vs Vectorized Query Execution Engine: Core Model Comparison

When I first profiled a classic iterator engine against a vectorized query execution engine on the same workload, the surprise wasn’t just that vectorization was faster; it was how much of the time in the iterator model vanished into function calls, branches, and cache misses. To understand why, it helps to contrast the core execution models side by side: how they move data, how they schedule work, and how they interact with the CPU.

Control Flow: next()-per-tuple vs batch-at-a-time

The traditional Volcano-style iterator model exposes a simple interface: every operator implements a next() method that returns one tuple at a time. Parent operators repeatedly call next() on their children, pulling tuples up the pipeline. The control flow is deeply nested and very call-heavy.

In a vectorized query execution engine, operators instead process a whole batch of rows per call. The parent asks the child for a vector of, say, 1k or 4k rows, processes them in a tight loop, and then asks for the next batch. Control flow becomes much flatter and more predictable.

// Simplified contrast of execution styles

// Volcano-style iterator (row-at-a-time)
Tuple *next(Operator *op) {
    // Loop (rather than recursing per non-matching row) until we find a match or hit EOF
    for (;;) {
        Tuple *t = next(op->child);
        if (!t) return NULL;
        if (predicate(t)) return t;
    }
}

// Vectorized engine (batch-at-a-time)
size_t next_batch(Operator *op, Batch *out) {
    size_t n = next_batch(op->child, out);
    if (n == 0) return 0;

    // apply predicate over a selection vector
    out->nrows = filter_with_selection_vector(out, n);
    return out->nrows;
}

In my experience, this shift has three very practical effects:

  • Far fewer virtual calls: a vectorized engine may execute millions of rows with only thousands of next_batch calls.
  • Less branching: predicate and projection logic move into tight inner loops instead of living around the call boundary.
  • Better scheduling: the engine can choose batch sizes that balance latency and throughput for the workload.

One thing I learned the hard way was that even well-optimized iterator chains spend an alarming amount of time bouncing between call frames; vectorization changes that profile completely.

Data Flow and Layout: Tuples vs Columns

Iterator engines are typically tuple-centric. Each next() call returns a full row, often as a struct or a tuple object. This works naturally with row-oriented storage but fights modern CPUs: each operator touches a few fields per row, pulling in cache lines full of unused data.

A vectorized query execution engine usually pairs batch processing with columnar layout. Instead of a batch of row objects, operators receive a set of column vectors (one per attribute) and, often, a selection vector indicating which row indices in the batch are currently active.

  • Row-oriented iterator model: good for OLTP-style point queries, but poor cache efficiency for analytical scans.
  • Columnar vectorized model: operators touch only the columns they need, which packs hot data tightly in memory.

In my own designs, the moment I switched to column vectors with a shared selection vector, expression evaluation became dramatically simpler: most operators just iterate over an index list and apply the same operation to contiguous arrays.
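
To make that concrete, here is a minimal sketch (with hypothetical names) of what “iterate over an index list and apply the same operation to contiguous arrays” looks like for a simple projection:

// Sketch: add a constant to an int32 column, visiting only the rows named
// by the selection vector. A NULL 'sel' means every row in the batch is active.
void project_add_const_int32(const int32_t *in, int32_t addend,
                             const uint16_t *sel, uint16_t count,
                             int32_t *out) {
    for (uint16_t i = 0; i < count; i++) {
        uint16_t idx = sel ? sel[i] : i;
        out[idx] = in[idx] + addend;
    }
}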

CPU Utilization and SIMD Opportunities

The main performance win of a vectorized query execution engine is how naturally it aligns with the CPU’s SIMD units and cache hierarchy. Row-at-a-time execution tends to be branch-heavy, with irregular access patterns that confuse the branch predictor and thrash the cache. SIMD is hard to exploit effectively when every tuple goes through its own code path.

With vectors of contiguous values, the compiler and hand-written kernels can leverage SIMD instructions to process multiple elements per CPU instruction. For example, applying a simple filter over a batch can be written as a tight loop that the compiler autovectorizes, or as an explicit SIMD kernel.

// Vector-style filter kernel over a column
size_t filter_gt_int32(const int32_t *col, size_t n,
                       int32_t threshold, uint16_t *sel_out) {
    size_t out_count = 0;
    for (size_t i = 0; i < n; i++) {
        // branch-light, SIMD-friendly pattern
        int32_t v = col[i];
        if (v > threshold) {
            sel_out[out_count++] = (uint16_t)i;
        }
    }
    return out_count;
}

From what I’ve seen in real systems, this pattern gives three clear wins over a traditional iterator engine:

  • SIMD-friendly loops: operators operate on contiguous arrays, making explicit or implicit vector instructions practical.
  • Cache locality: only the relevant columns are touched, so fewer cache lines are fetched.
  • Amortized overhead: loop setup and branch costs are spread over hundreds or thousands of rows per batch.

The tradeoff, of course, is that vectorization adds complexity around batch management and memory layout. The rest of this article digs into those design choices in more depth—how to pick batch sizes, how to structure operator interfaces, and where vectorization pays off the most in a modern analytical engine. For a deeper conceptual background on columnar execution and vectorization patterns, see Partition-based SIMD Processing and its Application to Columnar Database Systems.

CPU and Memory Realities Driving Vectorized Execution

Every time I’ve tuned a query engine, the same theme comes back: the bottleneck is almost never the raw ALU speed; it’s everything around it. A vectorized query execution engine is really just an admission that our CPUs, caches, and memory subsystems behave in specific, predictable ways, and that an engine that aligns with those realities will win almost every time.

Why Pipelines and Branch Predictors Hate Tuple-at-a-Time

Modern CPUs are deeply pipelined and speculative. They execute instructions out of order, try to guess your branches, and keep many operations in flight. That machinery shines when you give it long, predictable loops. Traditional tuple-at-a-time engines do the opposite: they stitch together lots of tiny next() calls, virtual dispatches, and branch-heavy control flow around each tuple.

In profiling sessions, I’ve seen this manifest as:

  • Frequent branch mispredictions: every if (predicate(tuple)) per row is an opportunity to flush the pipeline.
  • Instruction cache thrash: many small functions, often spread across the codebase, get exercised in a hot loop.
  • Limited instruction-level parallelism (ILP): work is too granular for the CPU to effectively overlap operations.

By contrast, a vectorized query execution engine pushes most per-row logic into tight loops over arrays. The branches become more predictable (“most of this column satisfies the predicate” or “almost none does”), and the hot path collapses into a few kernels that sit comfortably in the instruction cache.

// Pseudo-kernel for a vectorized comparison, friendly to CPU pipelines
size_t compare_eq_int32(const int32_t *left,
                        const int32_t *right,
                        size_t n, uint16_t *sel_out) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        // Single predictable branch; compiler may even turn this into
        // branchless code using vector instructions.
        if (left[i] == right[i]) {
            sel_out[out++] = (uint16_t)i;
        }
    }
    return out;
}

In my experience, once the hot operators are written this way, the CPU pipeline finally gets the long, predictable streams it’s been designed for.

SIMD and the Case for Operating on Vectors

SIMD (Single Instruction, Multiple Data) is the other big hardware lever. Wide registers (e.g., AVX2, AVX-512) let a single instruction operate on multiple values at once. Tuple-at-a-time execution makes it painful to exploit SIMD: each tuple may follow different code paths, and extracting just the relevant values into vectors incurs overhead.

A vectorized query execution engine, on the other hand, is built around contiguous columns and batches, which are essentially tailor-made for SIMD. Compilers can often autovectorize simple loops, and where they can’t, it’s straightforward to drop in explicit intrinsics for the truly hot paths.

  • Same operation, many values: filters, projections, arithmetic, and many joins naturally map to SIMD.
  • Columnar layout: values of the same type are stored contiguously, ideal for loading into vector registers.
  • Fewer control hazards: minimal branching within the inner loop lets SIMD pipelines stay full.

One thing I learned early is that you don’t need to hand-write SIMD for every operator. Just structuring the engine around vectors and columnar data unlocks enough autovectorization to get big wins; then you selectively optimize the top few kernels where explicit SIMD pays off.
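
For the handful of kernels where autovectorization isn’t enough, an explicit intrinsic version is worth having. Below is a sketch of an AVX2 variant of the earlier greater-than filter; it assumes an x86-64 target with AVX2 and, for brevity, that n is a multiple of 8 (a production kernel would add a scalar tail loop):

#include <immintrin.h>

// Hypothetical AVX2 filter kernel: 8 int32 comparisons per instruction,
// then a bit-scan loop that emits matching row indices into the selection vector.
size_t filter_gt_int32_avx2(const int32_t *col, size_t n,
                            int32_t threshold, uint16_t *sel_out) {
    size_t out_count = 0;
    __m256i thresh = _mm256_set1_epi32(threshold);

    for (size_t i = 0; i < n; i += 8) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(col + i));
        __m256i gt = _mm256_cmpgt_epi32(v, thresh);
        // One bit per 32-bit lane; bit set where col[i + lane] > threshold
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(gt));
        while (mask) {
            int lane = __builtin_ctz(mask);   // lowest matching lane (GCC/Clang builtin)
            sel_out[out_count++] = (uint16_t)(i + lane);
            mask &= mask - 1;                 // clear that lane's bit
        }
    }
    return out_count;
}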

Caches, Memory Bandwidth, and Why Columnar Batches Win

Even with SIMD, CPU cores routinely sit idle waiting for data. Cache hierarchies (L1/L2/L3) and main memory bandwidth are often the real limiting factors. Tuple-at-a-time engines pull full rows through the pipeline, which means dragging in entire cache lines even if you only need a couple of columns.

A vectorized query execution engine typically uses a columnar or columnar-like layout with batches sized to fit comfortably in cache. That design addresses multiple hardware pain points:

  • Spatial locality: operators touch contiguous arrays of a single column, maximizing each cache line’s usefulness.
  • Temporal locality within a batch: the same data is reused across multiple operators before being evicted.
  • Controlled working set: batch size lets you tune how much data an operator keeps hot in cache.

Here’s a simplified sketch of what a batch might look like in memory:

typedef struct {
    uint16_t *sel;      // selection vector (indices into columns)
    size_t    count;    // number of active rows

    // Pointers to column data; each column is a contiguous array
    int32_t  *col_a;    // e.g., order_id
    double   *col_b;    // e.g., amount
    int32_t  *col_c;    // e.g., customer_id
} Batch;

In practice, I’ve found that right-sizing the batch is one of the most impactful tuning knobs. Too small and you lose amortization benefits; too large and you blow the cache and thrash memory bandwidth. The sweet spot often lands in the low thousands of rows for typical analytical workloads.
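
A quick back-of-the-envelope check makes the tradeoff concrete. For the three-column batch sketched above, 4096 rows of touched column data come to roughly 64 KB, plus about 8 KB for the selection vector: comfortably inside a typical L2 cache but far too large for L1.

#include <stdint.h>
#include <stdio.h>

// Illustrative working-set math for the Batch layout sketched above.
int main(void) {
    size_t rows      = 4096;
    size_t col_bytes = rows * (sizeof(int32_t) + sizeof(double) + sizeof(int32_t));
    size_t sel_bytes = rows * sizeof(uint16_t);
    printf("columns: %zu KB, selection vector: %zu KB\n",
           col_bytes / 1024, sel_bytes / 1024);   // prints 64 KB and 8 KB
    return 0;
}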

This hardware-aware view is what really motivates the design of a vectorized query execution engine. We’re not just changing the API from next() to next_batch(); we’re aligning execution with deep pipelines, SIMD lanes, multi-level caches, and finite memory bandwidth. The rest of the design – operator interfaces, expression evaluators, join and aggregation strategies – flows from that alignment. For a broader hardware-focused perspective on how CPUs and memory hierarchies shape database engine design, see Database Internals: Working with CPUs – ScyllaDB.

Designing the Vectorized Query Execution Interface

When I started building my first vectorized query execution engine, I underestimated how much the core interface decisions would shape everything else: performance, debuggability, and even how easy it was to add new operators. In practice, three pieces matter most: how a batch is represented, how operators expose their APIs, and how state is managed across calls. Get these right and the rest of the engine design falls into place.

Batch Representation: Columns, Selection Vectors, and Nulls

The batch is the basic currency of a vectorized engine. I like to think of it as a small in-memory columnar segment: a fixed-size set of rows, with one vector per column and metadata that lets operators manipulate subsets efficiently.

In my experience, a practical batch representation usually includes:

  • Column vectors: contiguous arrays, one per logical column in the pipeline schema.
  • Selection vector: an optional array of row indices indicating which slots are currently active.
  • Null bitmap or mask: a compact representation of nullability per column.
  • Row count: the number of active rows in the batch (after filters, joins, etc.).

// Simplified C-style batch layout for a vectorized engine

typedef struct {
    void    *data;      // pointer to column data
    uint8_t *nullmask;  // bitmask: 1 = null, 0 = not null (optional)
} ColumnVector;

typedef struct {
    uint16_t    *sel;       // selection vector, or NULL if all rows active
    uint16_t     count;     // number of active rows
    uint16_t     capacity;  // max rows per batch

    ColumnVector *cols;     // array of column vectors
    uint16_t      ncols;    // number of columns
} Batch;

One thing I learned the hard way is that committing early to a clear null representation saves endless bugs later. A per-column bitmap plus a shared selection vector has worked best for me; it keeps the hot paths simple and lets operators avoid allocating and freeing per-row objects.

Operator APIs: next_batch and Push vs Pull

On top of the batch abstraction, each operator in a vectorized query execution engine needs a minimal, predictable interface. I’ve consistently found a pull-based API with a next_batch method to be the most straightforward: parents ask children for the next batch, process it, and repeat until EOF.

A simple interface in C-like pseudocode might look like this:

typedef struct Operator Operator;

typedef struct {
    // Pull API: parent asks child for the next batch of results.
    // Returns 0 rows to signal end-of-stream.
    uint16_t (*next_batch)(Operator *op, Batch *out);

    // Optional lifecycle hooks: allocate and release per-operator state,
    // e.g., join hash tables or group-by buckets
    void (*open)(Operator *op);
    void (*close)(Operator *op);
} OperatorVTable;

struct Operator {
    OperatorVTable *vtable;
    void           *state;    // operator-specific state
};

The engine’s execution loop then becomes:

uint16_t execute_root(Operator *root, Batch *out) {
    // Called repeatedly until it returns 0
    return root->vtable->next_batch(root, out);
}

I’ve experimented with push-based models as well, where child operators push batches upstream. Those can shine for pipelining and backpressure, but they’re more complex to reason about. For most analytical workloads, a pull-based next_batch API strikes a good balance between simplicity and performance, especially when combined with well-chosen batch sizes.
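
As a usage sketch, wiring a filter over a scan and draining it through execute_root looks like this; make_scan_op and make_filter_op are hypothetical constructors (a real engine would build the tree from a physical plan), and init_batch is the same assumed helper used elsewhere in this article:

void run_query_example(void) {
    Operator *scan   = make_scan_op("orders");                        // hypothetical
    Operator *filter = make_filter_op(scan, /*column*/ 0, /*threshold*/ 100);

    Batch out;
    init_batch(&out, 4096);

    filter->vtable->open(filter);
    while (execute_root(filter, &out) > 0) {
        // Consume out.count rows: write to the client, feed another pipeline, ...
    }
    filter->vtable->close(filter);
}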

State Management: Per-Operator Context and Lifecycles

Unlike stateless row operators, vectorized operators often maintain significant internal state: buffered batches, hash tables, aggregation buckets, or partially materialized results. Designing a clean state model is critical if you want your vectorized query execution engine to stay maintainable.

In my own implementations, I treat each operator as owning a single opaque state struct, with explicit open and close hooks:

typedef struct {
    Operator  base;

    // Example: state for a simple filter operator
    Operator *child;
    int32_t   threshold;   // compared against an int32 column below
} FilterState;

uint16_t filter_next_batch(Operator *op, Batch *out) {
    FilterState *st = (FilterState *)op->state;

    // Pull a child batch
    uint16_t n = st->child->vtable->next_batch(st->child, out);
    if (n == 0) {
        out->count = 0;
        return 0;
    }

    // Apply predicate using selection vector
    uint16_t *sel = out->sel;
    if (!sel) {
        // initialize selection vector if absent
        sel = out->sel = malloc(sizeof(uint16_t) * out->capacity);
    }

    int32_t *col = (int32_t *)out->cols[0].data; // assume column 0
    uint16_t out_count = 0;
    for (uint16_t i = 0; i < n; i++) {
        if (col[i] >= st->threshold) {
            sel[out_count++] = i;
        }
    }
    out->count = out_count;
    return out_count;
}

This pattern gives me a few concrete benefits:

  • Isolation: each operator’s state lives in a single struct, so it’s easy to inspect or extend.
  • Lifecycle clarity: memory ownership and cleanup are tied to open/close rather than scattered across the engine.
  • Optimization hooks: it’s trivial to add operator-specific caches, precomputed structures, or adaptive behavior.

In practice, once the batch, operator API, and state management patterns are solid, adding new operators in a vectorized query execution engine becomes a mechanical exercise. That’s usually when performance work can focus on a handful of hot kernels instead of wrestling with ad-hoc interfaces or leaky state.


Pipelining in a Vectorized Query Execution Engine

Once the basic batch and operator interfaces are in place, the next big design question is: how do we actually wire operators together and keep the hardware busy? In my experience, pipelining is where a vectorized query execution engine either feels crisp and efficient or sluggish and over-synchronized. The key is to understand how to form pipelines, where pipeline breakers live, and how to map the whole plan to parallel workers.

Forming Pipelines from the Query Plan

Conceptually, a pipeline is a maximal sequence of operators that can run in a streaming, pull-based fashion: scans feed filters, which feed projections, which feed simple joins or aggregations, all with batches flowing straight through. The query planner produces a logical plan; the executor’s job is to cut that plan into pipelines based on what can be streamed and what requires materialization.

In a vectorized query execution engine, I usually treat a pipeline as:

  • One or more sources: table or index scans that produce initial batches.
  • A chain of pipelineable operators: filters, projections, some joins, simple aggregations.
  • Termination at a breaker or sink: a hash table, sort, or final materialization step.

In code, a pipeline is often just a root operator plus metadata:

typedef struct {
    Operator *root;      // root operator of this pipeline
    Operator **sources;  // list of source operators (scans)
    size_t    num_sources;
} Pipeline;

void run_pipeline(Pipeline *p) {
    Batch batch;
    init_batch(&batch, DEFAULT_CAPACITY);

    while (true) {
        uint16_t n = p->root->vtable->next_batch(p->root, &batch);
        if (n == 0) break;
        // Sink side-effects here: write result, send to next stage, etc.
    }
}

When I map a plan to pipelines, I try to keep each pipeline as long as possible while respecting places that need to see all input (aggregations, sorts), because longer pipelines usually mean better cache reuse and fewer batch materializations.

Pipeline Breakers: Hash Joins, Sorts, and Aggregations

Even in a vectorized engine, not every operator can stream fully. Some need to buffer large portions of input before producing output; these are the classic pipeline breakers. Recognizing and handling them cleanly is critical for predictable performance and memory usage.

Common pipeline breakers I encounter are:

  • Hash joins (build side): must consume the entire build input to construct the hash table.
  • Global aggregations: need to see all rows to finalize aggregates (e.g., SUM, COUNT without GROUP BY).
  • Sorts and top-k: require materializing and ordering a full input stream.

In a vectorized query execution engine, the typical pattern is to split the plan around these breakers:

  • Upstream pipeline(s) feed the breaker and materialize into an internal structure.
  • The breaker then acts as a source operator for a downstream pipeline, producing batches from its internal state.

// Example: hash join as a pipeline breaker on the build side

typedef struct {
    Operator  base;
    Operator *build_input;   // upstream pipeline root
    Operator *probe_input;   // downstream pipeline source

    HashTable *ht;           // build-side hash table
    bool       built;
} HashJoinState;

void hash_join_open(Operator *op) {
    HashJoinState *st = (HashJoinState *)op->state;
    if (!st->built) {
        Batch batch;
        init_batch(&batch, DEFAULT_CAPACITY);
        // Consume complete build input
        while (true) {
            uint16_t n = st->build_input->vtable->next_batch(st->build_input, &batch);
            if (n == 0) break;
            hash_table_build(st->ht, &batch);
        }
        st->built = true;
    }
}

One practical lesson I’ve learned: being explicit about which operators are breakers simplifies scheduling enormously. Once breakers are identified up front, you can plan the execution order and memory budget per pipeline instead of improvising at runtime.
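
In code, that usually boils down to a small classification helper the scheduler consults while cutting the plan; op_kind and the OP_* tags here are hypothetical stand-ins for whatever operator metadata the engine exposes:

bool is_pipeline_breaker(const Operator *op) {
    // Breakers must consume (part of) their input before producing output,
    // so the scheduler runs and materializes upstream pipelines into them first.
    switch (op_kind(op)) {
    case OP_HASH_JOIN_BUILD:
    case OP_SORT:
    case OP_GLOBAL_AGG:
        return true;
    default:
        return false;
    }
}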

Parallel Execution and Scheduling Pipelines

Vectorization and pipelining get you only so far on a single core; to use modern hardware well, a vectorized query execution engine must run pipelines in parallel. The main design decisions here are: how to split work across threads and how to coordinate them safely.

The pattern that’s worked best for me is:

  • Horizontal partitioning of sources: each worker gets a slice of the table or index scan (by range or block).
  • One pipeline instance per worker: each worker thread runs the same pipeline logic on its partition.
  • Shared or partitioned state where needed: e.g., local aggregation hashes merged at the end.

typedef struct {
    Pipeline *pipeline;
    ScanSpec  scan_spec;   // defines this worker's data partition
} WorkerTask;

void *worker_run(void *arg) {
    WorkerTask *task = (WorkerTask *)arg;
    Pipeline   *p    = task->pipeline;

    // Configure the scan source for this worker's partition
    configure_scan_partition(p->sources[0], &task->scan_spec);

    Batch batch;
    init_batch(&batch, DEFAULT_CAPACITY);

    while (true) {
        uint16_t n = p->root->vtable->next_batch(p->root, &batch);
        if (n == 0) break;
        // Thread-local sink or partial aggregation here
    }
    return NULL;
}
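
To show how those workers get launched, here is a minimal sketch using POSIX threads; make_worker_task is a hypothetical helper that clones the pipeline and assigns each worker its scan partition:

#include <pthread.h>

void run_pipeline_parallel(Pipeline *proto, size_t num_workers) {
    pthread_t   threads[num_workers];
    WorkerTask *tasks[num_workers];

    for (size_t i = 0; i < num_workers; i++) {
        tasks[i] = make_worker_task(proto, i, num_workers);  // assumed helper
        pthread_create(&threads[i], NULL, worker_run, tasks[i]);
    }
    for (size_t i = 0; i < num_workers; i++) {
        pthread_join(threads[i], NULL);
        // Merge this worker's thread-local sink or partial aggregates here
    }
}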

In practice, I also keep these additional points in mind when designing parallel scheduling:

  • Avoid fine-grained locks: prefer per-thread state and a merge phase to shared structures.
  • Batch-aware load balancing: workers should handle roughly similar numbers of batches to avoid stragglers.
  • Backpressure between pipelines: if a downstream breaker is slow, upstream workers must not overproduce.

One thing I learned the hard way was that naive parallelism (e.g., one thread per operator) tends to oversubscribe the CPU and thrash caches. Designing around pipeline-level parallelism, with well-partitioned sources and mostly thread-local state, has consistently given me the best throughput for a vectorized query execution engine.

Implementing Core Vectorized Operators: Scans, Filters, Joins, and Aggregates

When I started moving from iterator-based engines to a fully vectorized query execution engine, I found that the mental model for core operators had to change. Instead of thinking in terms of per-tuple callbacks, I began to think in terms of batch kernels: tight loops over column vectors, with selection vectors and null masks defining the active rows. Scans, filters, joins, and aggregates all follow this pattern, but each uses the vectorized model a bit differently.

Vectorized Scans: Feeding the Pipeline Efficiently

In an iterator engine, a scan typically returns one tuple per next() call, often decoding storage formats on demand. In a vectorized query execution engine, the scan’s job is to produce well-formed batches of column vectors, already decoded and laid out for downstream operators.

In my experience, a good vectorized scan operator should:

  • Read data in block-sized chunks that align with storage (e.g., pages or column chunks).
  • Populate column vectors directly from on-disk or in-memory formats, minimizing copying.
  • Optionally push down simple predicates to avoid filling vectors with rows that will be discarded immediately.

typedef struct {
    Operator  base;
    StorageIt *it;           // iterator over storage blocks

    // Column metadata and decoding state
    ColumnDesc *cols;
    uint16_t    ncols;
} ScanState;

uint16_t scan_next_batch(Operator *op, Batch *out) {
    ScanState *st = (ScanState *)op->state;

    if (!storage_it_has_next(st->it)) {
        out->count = 0;
        return 0;
    }

    StorageBlock blk = storage_it_next(st->it);
    uint16_t n = decode_block_into_batch(&blk, st->cols, out);
    out->sel   = NULL;     // all rows active
    out->count = n;
    return n;
}

One thing I’ve learned is that scans are where you win or lose a lot of performance in a vectorized query execution engine. If your scan can produce column vectors that are already filter-friendly and SIMD-ready, the rest of the pipeline becomes much easier to optimize.

Vectorized Filters and Projections: Working with Selection Vectors

Filters and projections are where the vectorized model really shines. Instead of allocating or copying rows, they manipulate a selection vector over existing column vectors. This keeps memory traffic low and makes SIMD-friendly loops straightforward.

A typical pattern I use is:

  • Child operator fills a batch with contiguous column data (and possibly an existing selection vector).
  • Filter evaluates a predicate over the active row indices and builds a new selection vector.
  • Projection just remaps or derives columns without touching row-level structures.

typedef struct {
    Operator  base;
    Operator *child;
    int32_t   threshold;
} FilterState;

uint16_t filter_next_batch(Operator *op, Batch *out) {
    FilterState *st = (FilterState *)op->state;

    // Pull input from child
    uint16_t n = st->child->vtable->next_batch(st->child, out);
    if (n == 0) {
        out->count = 0;
        return 0;
    }

    uint16_t *in_sel  = out->sel;
    uint16_t *out_sel = ensure_selection_vector(out); // allocate if NULL

    int32_t *col = (int32_t *)out->cols[0].data; // e.g., filter on column 0

    uint16_t out_count = 0;
    if (in_sel) {
        // Respect existing selection vector
        for (uint16_t i = 0; i < n; i++) {
            uint16_t idx = in_sel[i];
            if (col[idx] >= st->threshold) {
                out_sel[out_count++] = idx;
            }
        }
    } else {
        for (uint16_t i = 0; i < n; i++) {
            if (col[i] >= st->threshold) {
                out_sel[out_count++] = i;
            }
        }
    }

    out->sel   = out_sel;
    out->count = out_count;
    return out_count;
}

Compared to iterator-based filters, there’s no tuple construction, no object allocation, and no per-row virtual dispatch. In my own benchmarking, even naive vectorized filters like the one above routinely outperform highly tuned tuple-at-a-time equivalents, especially once the compiler autovectorizes the hot loops.

Vectorized Joins and Aggregates: Hash Tables over Batches

Joins and aggregates are the operators where vectorization often feels the most foreign if you come from a Volcano-style background. Instead of incrementally updating state per tuple, you build and probe hash tables in bulk, using batches and selection vectors to drive the process.

For hash joins, I typically split the implementation into:

  • Build phase: consume all build-side batches and insert keys (and payloads) into a hash table.
  • Probe phase: for each probe batch, compute hash keys for the active rows, probe in bulk, and produce a result batch with joined columns.

typedef struct {
    Operator  base;
    Operator *build_input;
    Operator *probe_input;

    HashTable *ht;
    bool       built;
} HashJoinState;

static void hash_join_build(HashJoinState *st) {
    Batch batch;
    init_batch(&batch, DEFAULT_CAPACITY);

    while (true) {
        uint16_t n = st->build_input->vtable->next_batch(st->build_input, &batch);
        if (n == 0) break;
        hash_table_build_batch(st->ht, &batch);
    }
    st->built = true;
}

uint16_t hash_join_next_batch(Operator *op, Batch *out) {
    HashJoinState *st = (HashJoinState *)op->state;
    if (!st->built) {
        hash_join_build(st);
    }

    Batch probe;
    init_batch_like(&probe, out);

    // Keep pulling probe-side batches until we emit output rows; a probe
    // batch with zero matches must not be mistaken for end-of-stream.
    while (true) {
        uint16_t n = st->probe_input->vtable->next_batch(st->probe_input, &probe);
        if (n == 0) {
            out->count = 0;
            return 0;
        }

        // Vectorized probe kernel: compute hashes, look up matches, write to out
        uint16_t out_count = hash_table_probe_batch(st->ht, &probe, out);
        if (out_count > 0) {
            out->count = out_count;
            return out_count;
        }
    }
}

Aggregates follow a similar bulk pattern. Rather than updating aggregates row by row, a vectorized aggregation kernel processes whole batches at a time, often with per-thread partial hash tables that are merged later. This is especially powerful when combined with SIMD-friendly accumulation loops over numeric columns.

For example, a simple global SUM in a vectorized query execution engine might look like this:

typedef struct {
    Operator  base;
    Operator *child;
    double    sum;
    bool      done;
} SumAggState;

uint16_t sum_agg_next_batch(Operator *op, Batch *out) {
    SumAggState *st = (SumAggState *)op->state;
    if (st->done) {
        out->count = 0;
        return 0;
    }

    Batch batch;
    init_batch(&batch, DEFAULT_CAPACITY);

    while (true) {
        uint16_t n = st->child->vtable->next_batch(st->child, &batch);
        if (n == 0) break;

        uint16_t *sel = batch.sel;
        double   *col = (double *)batch.cols[0].data;

        if (sel) {
            for (uint16_t i = 0; i < n; i++) {
                st->sum += col[sel[i]];
            }
        } else {
            for (uint16_t i = 0; i < n; i++) {
                st->sum += col[i];
            }
        }
    }

    // Emit one-row result
    out->count = 1;
    out->sel   = NULL;
    double *dst = (double *)out->cols[0].data;
    dst[0] = st->sum;

    st->done = true;
    return 1;
}

Compared to iterator-based implementations, these vectorized operators trade per-row housekeeping for batch-oriented kernels and explicit state. In my experience, once you embrace that shift and build a handful of robust kernels for scans, filters, joins, and aggregates, the bulk of a vectorized query execution engine falls into place. For more advanced join and aggregation strategies in columnar, SIMD-heavy systems, see ClickHouse – Lightning Fast Analytics for Everyone.

Null Handling, Complex Types, and UDFs in a Vectorized Engine

After I had the basic operators in place for my vectorized query execution engine, the real-world friction showed up around three things: SQL null semantics, nested/complex types, and user-defined functions. It’s tempting to bolt these on the way you would in an iterator engine, but doing so usually destroys the tight loops you worked so hard to build. The trick is to keep their representations vector-friendly so vectorization and SIMD can still shine.

Null Semantics with Bitmaps and Propagation Rules

Row-at-a-time engines often hide null checks inside accessor methods; vectorized engines can’t afford that overhead. In my experience, the most practical pattern is:

  • Per-column null bitmap: 1 bit per row, stored contiguously.
  • Selection vector first, nulls second: filter down to active rows before touching null state.
  • Operator-specific null rules: for arithmetic, comparisons, predicates, etc.

// Example: vectorized addition with null-aware semantics
void add_int32_column(const Batch *in_a,
                      const Batch *in_b,
                      Batch *out,
                      uint16_t count,
                      const uint16_t *sel) {
    int32_t  *a    = (int32_t *)in_a->cols[0].data;
    int32_t  *b    = (int32_t *)in_b->cols[0].data;
    int32_t  *dst  = (int32_t *)out->cols[0].data;
    uint8_t  *na   = in_a->cols[0].nullmask;
    uint8_t  *nb   = in_b->cols[0].nullmask;
    uint8_t  *nout = out->cols[0].nullmask;

    for (uint16_t i = 0; i < count; i++) {
        uint16_t idx = sel ? sel[i] : i;
        bool is_null = is_null_bit(na, idx) || is_null_bit(nb, idx);
        set_null_bit(nout, idx, is_null);
        if (!is_null) {
            dst[idx] = a[idx] + b[idx];
        }
    }
}

One thing I learned the hard way is that mixing null checks into every arithmetic expression inline makes the code unreadable and slow. Centralizing bitmap operations into small helpers and keeping the main loops branch-light has been far more maintainable for me.
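
The helpers referenced above are tiny; a minimal version, assuming one bit per row with a set bit meaning null (and a NULL mask pointer meaning “no nulls”), might look like this:

static inline bool is_null_bit(const uint8_t *mask, uint16_t idx) {
    return mask && ((mask[idx >> 3] >> (idx & 7)) & 1);
}

static inline void set_null_bit(uint8_t *mask, uint16_t idx, bool is_null) {
    if (!mask) return;
    if (is_null) mask[idx >> 3] |=  (uint8_t)(1u << (idx & 7));
    else         mask[idx >> 3] &= ~(uint8_t)(1u << (idx & 7));
}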

Complex and Nested Types: Offsets, Child Vectors, and Indirection

Arrays, structs, and maps don’t fit neatly into a single flat column, but abandoning vectorization for them is a losing strategy. Instead, I treat complex types as composite column vectors:

  • Structs: one child vector per field, all sharing the same selection vector.
  • Arrays/Lists: an offsets vector plus a values vector (and maybe a lengths vector).
  • Maps: encoded as two aligned arrays (keys and values) with offsets.

typedef struct {
    // offsets[i]..offsets[i+1]-1 is the slice in 'values' for row i
    uint32_t *offsets;
    void     *values;      // contiguous storage for all elements
    uint8_t  *nullmask;    // nulls per row, not per element
} ListVector;

In practice, I try to keep the outer operator logic vectorized over rows, and then handle per-list or per-struct work in inner loops or specialized kernels. That way, the main pipeline still runs on batches, with only localized complexity in the complex-type operators.
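
As a small example of that split, a per-row kernel over a list column stays fully batch-driven; only the offset arithmetic knows the type is nested:

// Sketch: compute the element count of a list column for each active row.
// Assumes offsets has one more entry than there are rows, per the layout above.
void list_length_per_row(const ListVector *lv, uint16_t count,
                         const uint16_t *sel, int32_t *out_len) {
    for (uint16_t i = 0; i < count; i++) {
        uint16_t row = sel ? sel[i] : i;
        out_len[row] = (int32_t)(lv->offsets[row + 1] - lv->offsets[row]);
    }
}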

UDFs: Keeping Extensibility without Killing the Hot Path

UDFs are where I’ve seen many vectorized designs fall back to row-at-a-time behavior. If every row calls into a black-box function pointer, you lose SIMD, branch predictability, and most of your gains. To avoid that, I use a few guidelines:

  • Batch-oriented UDF API: require UDF authors to implement a function that operates over column vectors + selection vector.
  • Type-specialized entry points: generate different UDF wrappers per concrete type to avoid dynamic type checks in the hot loop.
  • Null-aware contracts: UDFs get explicit null bitmaps or pre-filtered non-null selections.

typedef void (*UdfVectorFn)(const ColumnVector *args,
                            uint16_t nargs,
                            uint16_t count,
                            const uint16_t *sel,
                            ColumnVector *result);

typedef struct {
    Operator    base;
    Operator   *child;
    UdfVectorFn fn;
} UdfState;

uint16_t udf_next_batch(Operator *op, Batch *out) {
    UdfState *st = (UdfState *)op->state;

    uint16_t n = st->child->vtable->next_batch(st->child, out);
    if (n == 0) return 0;

    ColumnVector *args = out->cols;  // assume input columns are UDF args
    ColumnVector *res  = &out->cols[out->ncols - 1];

    st->fn(args, out->ncols - 1, n, out->sel, res);
    return n;
}
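
Under that contract, a UDF body is just another kernel. Here is a hypothetical example that squares a double column and propagates input nulls, using the bitmap helpers from the null-handling section:

void udf_square_double(const ColumnVector *args, uint16_t nargs,
                       uint16_t count, const uint16_t *sel,
                       ColumnVector *result) {
    (void)nargs;  // single-argument UDF in this sketch
    const double *in  = (const double *)args[0].data;
    double       *out = (double *)result->data;

    for (uint16_t i = 0; i < count; i++) {
        uint16_t idx = sel ? sel[i] : i;
        out[idx] = in[idx] * in[idx];
        if (result->nullmask) {
            set_null_bit(result->nullmask, idx,
                         is_null_bit(args[0].nullmask, idx));
        }
    }
}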

In my experience, once UDFs are batch-oriented, they behave like any other operator kernel: you can profile them, vectorize them, and even inline or JIT them if the engine supports it. That’s a huge difference from opaque row-based callbacks. For more on performance-conscious UDF designs and null semantics in analytical databases, see The UDFBench Benchmark for General-purpose UDF Queries.

Profiling and Benchmarking a Vectorized Query Execution Engine

Once I had a working vectorized query execution engine, I quickly realized that intuition alone wasn’t enough to tune it. The performance story only really came together when I combined proper profiling with a mix of realistic workloads and tight microbenchmarks around the hottest kernels.

Building a Benchmark Suite: End-to-End and Micro-Level

I’ve had the best results by separating benchmarks into two categories:

  • End-to-end workloads: star-schema analytics, filtered scans, group-bys, and joins that resemble real dashboards or reports. These validate that pipelines and schedulers behave under realistic data distributions and skew.
  • Microbenchmarks: isolated tests for filters, hash joins, aggregations, and scans on synthetic data. These expose regressions and help compare alternative kernel implementations.

A simple C-style driver for a microbenchmark around a filter kernel might look like this:

void bench_filter_operator(Operator *filter, size_t iters) {
    Batch batch;
    init_batch(&batch, DEFAULT_CAPACITY);

    uint64_t total_rows = 0;
    uint64_t t0 = now_ns();

    for (size_t i = 0; i < iters; i++) {
        while (true) {
            uint16_t n = filter->vtable->next_batch(filter, &batch);
            if (n == 0) break;
            total_rows += n;
        }
        // reset input for next iteration (omitted)
    }

    uint64_t t1 = now_ns();
    double seconds = (t1 - t0) / 1e9;
    double mrps = (double)total_rows / 1e6 / seconds;
    printf("Filter throughput: %.2f M rows/s\n", mrps);
}
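
The now_ns helper above is assumed; on POSIX systems a monotonic-clock version is enough for this kind of throughput measurement:

#include <time.h>

// Monotonic wall-clock in nanoseconds; unaffected by system clock adjustments.
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}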

In my experience, this kind of focused loop makes it obvious when a seemingly harmless change cuts throughput in half.

Hotspot Discovery with CPU Profilers and Hardware Counters

Vectorized engines tend to have fewer, fatter hot paths, which is great for tools like perf, VTune, or your platform’s sampling profiler. When I profile, I look for:

  • Top symbols: the specific kernels (filter, join probe, aggregate) that dominate CPU time.
  • Branch mispredictions and I-cache misses: signs that control flow or inlining is hurting the pipeline.
  • LLC misses and memory bandwidth: indicators that batches are too large or access patterns are scattered.

# Example: profiling a query execution binary with Linux perf
perf record -g -- ./run_query --plan benchmark_plan.json
perf report --sort=dso,symbol

One thing I learned the hard way is that microbenchmarks can lie if the data is too clean. I now always test with varied selectivities, skewed group keys, and mixed null distributions to see how robust an operator is under real-world conditions.

Tuning Batch Sizes, Pipelines, and Parallelism

Once hotspots are clear, tuning usually comes down to a few levers:

  • Batch size: too small and overhead dominates; too large and you thrash caches. I typically sweep across powers of two (e.g., 512–8192 rows) and measure throughput.
  • Pipeline shape: merging or splitting pipelines can change cache locality and the impact of pipeline breakers.
  • Degree of parallelism (DOP): increasing threads helps until you saturate memory bandwidth or hit lock contention.

// Sketch: experimenting with batch size
for (uint16_t cap = 512; cap <= 8192; cap *= 2) {
    set_global_batch_capacity(cap);
    double mrps = run_benchmark_query();
    printf("Batch %u: %.2f M rows/s\n", cap, mrps);
}

In my own tuning sessions, plotting throughput vs. batch size or DOP has often revealed a surprisingly narrow sweet spot. Once that’s dialed in, further gains usually come from tightening the inner loops of just a handful of vectorized kernels. For a deeper discussion of profiling techniques and hardware-counter-driven tuning in analytical databases, see Rethinking MIMD-SIMD Interplay for Analytical Query Processing in In-Memory Database Engines.


Coexisting with Legacy Iterator Engines and Hybrid Architectures

The first time I tried to drop a new vectorized query execution engine into an existing system, I assumed I could just flip a switch and replace the Volcano-style iterator engine. That didn’t last long. In practice, hybrid architectures are the norm: you phase vectorization in, interoperate with legacy operators, and keep safe fallbacks for tricky features and edge cases.

Bridging Operators: Row-to-Vector and Vector-to-Row Adapters

The core building block of a hybrid architecture is the adapter that lets row and vector worlds talk to each other. I usually introduce two special operators:

  • Row-to-vector adapter (R2V): pulls tuples from a legacy iterator and packs them into batches (column vectors + selection vector).
  • Vector-to-row adapter (V2R): pulls batches from a vectorized subtree and exposes them as row-at-a-time next() calls.

A simplified V2R wrapper around a vectorized subtree might look like this:

typedef struct {
    // Legacy iterator interface
    bool (*next)(void *self, Tuple *out);

    // Wrapped vectorized subtree
    Operator *vec_root;
    Batch     batch;
    uint16_t  cursor;   // current row within batch
} VecToRowIter;

bool v2r_next(void *self, Tuple *out) {
    VecToRowIter *st = (VecToRowIter *)self;

    while (true) {
        if (st->cursor < st->batch.count) {
            uint16_t idx = st->batch.sel ? st->batch.sel[st->cursor] : st->cursor;
            materialize_tuple_from_batch(&st->batch, idx, out);
            st->cursor++;
            return true;
        }
        uint16_t n = st->vec_root->vtable->next_batch(st->vec_root, &st->batch);
        if (n == 0) return false; // EOF
        st->cursor = 0;
    }
}
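
The row-to-vector direction is the mirror image: pull tuples from the legacy iterator until the batch is full or the input is exhausted. A sketch, with append_tuple_to_batch as an assumed helper that scatters one tuple’s fields into the column vectors:

typedef struct {
    bool (*row_next)(void *it, Tuple *out);  // legacy iterator callback
    void  *row_iter;                         // legacy iterator state
} RowToVecState;

uint16_t r2v_next_batch(Operator *op, Batch *out) {
    RowToVecState *st = (RowToVecState *)op->state;
    Tuple tuple;
    uint16_t n = 0;

    while (n < out->capacity && st->row_next(st->row_iter, &tuple)) {
        append_tuple_to_batch(out, n, &tuple);  // assumed helper
        n++;
    }
    out->sel   = NULL;   // every row in this batch is active
    out->count = n;
    return n;
}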

In my experience, these adapters are where most of the overhead in a hybrid system lives. I try hard to keep vectorized islands as large as possible so I cross the bridge only a few times per plan.

Gradual Migration: Operator-by-Operator Vectorization

Rather than rewriting the entire executor, I’ve had more success migrating operator by operator. The typical path looks like this:

  • Start with scan + filter + projection, since they’re relatively self-contained and easy to vectorize.
  • Add joins and aggregations, initially behind feature flags or query hints.
  • Progressively replace hot logical operators in the planner with vectorized physical counterparts.

At the planner level, this often means attaching capabilities to each physical operator implementation—e.g., “vectorized capable” vs. “iterator only”—and letting the planner choose the best combination while respecting adapters where necessary.

typedef enum { ENGINE_ITERATOR, ENGINE_VECTORIZED } EngineKind;

typedef struct {
    LogicalOp  *logical;
    EngineKind  engine;
} PhysicalOpChoice;

PhysicalOpChoice choose_physical(LogicalOp *lop) {
    if (supports_vectorized(lop)) {
        return (PhysicalOpChoice){ lop, ENGINE_VECTORIZED };
    } else {
        return (PhysicalOpChoice){ lop, ENGINE_ITERATOR };
    }
}

What’s worked well for me is to roll out vectorization for a subset of queries (or tenants), monitor regressions carefully, and then expand coverage as confidence grows.

Fallback Strategies and Capability Detection

Even with a mature vectorized query execution engine, there will be cases where falling back to the iterator engine is the right call: exotic UDFs, rare complex types, or experimental features. I usually rely on two mechanisms:

  • Static capability checks: at planning time, detect whether all operators in a subtree support vectorization. If not, keep that subtree purely iterator-based.
  • Runtime guards: if a vectorized operator encounters unsupported data shape or configuration, it can raise a “not supported” signal that triggers replanning or a retry with iterator operators.

bool plan_can_be_vectorized(const PlanNode *node) {
    if (!node) return true;
    if (!node_supports_vector(node)) return false;
    for (size_t i = 0; i < node->nchildren; i++) {
        if (!plan_can_be_vectorized(node->children[i])) return false;
    }
    return true;
}

One lesson I’ve learned is that a clean fallback path is liberating: it lets you aggressively optimize the vectorized path without having to support every odd corner case on day one. Over time, as the vectorized query execution engine gets richer, you can shrink the iterator-only zones until they’re effectively legacy compatibility shims.

Conclusion and Design Checklist for Your Vectorized Query Execution Engine

Looking back over the engines I’ve worked on, the successful ones all treated vectorization as a coherent design, not just a faster inner loop. Batches, column layouts, pipelines, and profiling strategy all had to line up. When they did, the gains over a traditional iterator engine were not subtle—they changed what workloads were feasible on a single box.

Here’s the concise checklist I keep in mind when (re)designing a vectorized query execution engine:

  • Batch and memory model: fixed-size batches, columnar layout, alignment for SIMD, and clear ownership/lifetime rules.
  • Operator contracts: pull-based next_batch API, selection vectors, null bitmaps, and predictable error/EOF semantics.
  • Pipelining & scheduling: explicit pipeline breakers, pipeline-level parallelism, and partitioned state instead of fine-grained locks.
  • Core operators: vectorized scans, filters, joins, and aggregates implemented as tight, branch-light kernels.
  • Nulls & complex types: bitmap-based null handling, offset/child-vector patterns for nested types, and batch-oriented UDF APIs.
  • Profiling & tuning: dedicated microbenchmarks, realistic end-to-end workloads, hardware-counter analysis, and systematic batch-size/DOP sweeps.
  • Hybrid coexistence: cheap row↔vector adapters, planner-level capability detection, and clean fallbacks to legacy iterator operators.

In my experience, if you can tick these boxes—and keep profiling as a first-class activity—you’re in a good position to evolve an iterator-bound system into a robust, high-throughput vectorized query execution engine without losing debuggability or correctness along the way.
