Skip to content
Home » All Posts » How to Build a Robust PostgreSQL Logical Replication Slot Pipeline

How to Build a Robust PostgreSQL Logical Replication Slot Pipeline

Introduction: Why PostgreSQL Logical Replication Slots Matter for CDC

When I started building change data capture (CDC) pipelines on PostgreSQL, the hardest part wasn’t decoding the data itself – it was making the pipeline reliable. Keeping track of exactly which changes had been processed, surviving outages, and avoiding data loss or duplication quickly became the real challenge. That’s where PostgreSQL logical replication slots come in.

Logical replication slots are PostgreSQL’s built-in mechanism for guaranteeing that every committed change in the WAL (Write-Ahead Log) is available to a logical consumer until it has been safely processed. Instead of tailing WAL files in a best-effort way, a replication slot pins the required WAL segments and exposes decoded changes in a consistent, ordered stream. In a CDC setup, that stream becomes the backbone of how events flow from PostgreSQL into downstream systems like Kafka, data warehouses, or microservices.

On top of the slot, logical decoding plugins translate raw WAL records into meaningful change events (for example, INSERT, UPDATE, DELETE with row data). The pipeline then needs its own form of offset tracking – storing how far it has read in the slot, mapping that position to downstream acknowledgements, and deciding when it is safe to advance or restart. In my experience, the tight coupling between PostgreSQL logical replication slots, WAL decoding, and robust offset tracking is what separates a toy CDC prototype from a production-grade replication pipeline that can handle crashes, backpressure, and schema changes without losing or duplicating data.

Prerequisites and Assumptions for Your PostgreSQL Logical Replication Pipeline

Before wiring up PostgreSQL logical replication slots for CDC, I always make sure the basics are nailed down, otherwise debugging becomes painful later. I assume you’re running PostgreSQL 10 or newer (ideally 13+), with wal_level = logical enabled, sufficient max_replication_slots, and network access to connect using a replication-capable user. That user needs REPLICATION privileges (or superuser), plus normal SELECT access on the tables you want to capture.

On the tooling side, you’ll need a logical decoding plugin such as pgoutput (built-in) or wal2json, and a client that can speak the replication protocol and manage offsets (for example, a custom service or a CDC framework). When I bring a new environment online, I test with a lightweight script that opens a logical replication slot, reads a few changes, and then cleanly drops the slot so I know permissions, WAL config, and plugin loading all behave as expected. If you’re unsure about which decoding plugin fits your use case, Comparison of test_decoding and pglogical plugins in Amazon Aurora PostgreSQL for data migration using AWS DMS can help you line that up before you commit to a pipeline design.

Understanding PostgreSQL Logical Replication Slots and WAL Decoding

When I first dug into PostgreSQL logical replication slots, the feature only really clicked once I understood how WAL, slots, decoding plugins, and LSN-based offsets fit together as one pipeline. In this section, I’ll walk through that mental model so you can reason about correctness instead of treating the replication stream as a black box.

WAL and LSNs: The Foundation of the Stream

Everything starts with PostgreSQL’s Write-Ahead Log (WAL). Every change to data is recorded in WAL as a sequence of low-level records. Each record lives at a specific Log Sequence Number (LSN), which is effectively a byte offset in the WAL stream. You can think of the WAL as an append-only log of all committed changes, and the LSN as the precise bookmark into that log.

For CDC, LSNs are gold: they give you a stable, totally ordered position that you can safely store and later resume from. In my experience, any serious CDC implementation ends up treating LSNs as the canonical offset, even if it also maintains extra downstream offsets (like Kafka partitions or batch IDs).

What PostgreSQL Logical Replication Slots Actually Do

A logical replication slot is a server-side handle that tracks how far a logical consumer has progressed through the WAL. Once a slot is created, PostgreSQL will not discard WAL segments that might still be needed by that slot. This is what makes PostgreSQL logical replication slots so powerful for CDC: the server guarantees that changes remain available until you confirm they’re processed.

Internally, the slot holds a restart LSN and a confirmed flush LSN. As your consumer reads changes and acknowledges them, the slot position advances. If your consumer crashes and reconnects using the same slot, PostgreSQL resumes streaming from the last confirmed LSN, not from the current tip of WAL. One thing I learned the hard way was to monitor for slots that never advance; they can silently hold onto WAL and bloat disk if the consumer is misbehaving.

Logical Decoding and Output Plugins

Raw WAL records are not directly useful to most CDC pipelines. That’s where logical decoding and output plugins come into play. A decoding plugin takes WAL changes and emits a higher-level representation: tables, operations (INSERT/UPDATE/DELETE), and row data. Plugins like pgoutput or wal2json sit between the WAL and your client, turning binary WAL into structured change events.

From the client’s perspective, you subscribe to a slot with a chosen decoding plugin, then read a stream of decoded messages. Each message is still anchored to an LSN, but now you see something like “UPDATE public.users SET email = … WHERE id = 42” as JSON or a protocol-specific format. When I prototype a new pipeline, I often start with a simple decoding plugin and a minimal consumer, just to inspect the exact payloads and verify they contain everything my downstream systems need.

Mapping Offsets to LSNs in Your CDC Pipeline

The last piece of the puzzle is how your pipeline tracks progress. The core rule I follow is: treat the WAL LSN as the source of truth. Your application reads decoded changes, processes or forwards them, then persists the “last safely processed LSN” somewhere durable (for example, a metadata table or config store). That value is your CDC offset.

When you reconnect to PostgreSQL, you reuse the same logical replication slot and resume from the correct LSN, or you create a new slot starting at a known LSN if you are doing more advanced management. Downstream offsets (Kafka offsets, message IDs, batch markers) should be linked back to the LSN that produced them. In my experience, this simple mapping – LSN → downstream offset(s) – is what keeps replays, retries, and backfills predictable. If you’re looking to dig deeper into the replication protocol details and how LSN acknowledgements are exchanged, PostgreSQL logical replication protocol and LSN acknowledgements documentation is a great next step.

Configuring PostgreSQL for Logical Replication Slots

Before I ever create PostgreSQL logical replication slots in a new environment, I lock down the server configuration. A few lines in postgresql.conf and pg_hba.conf determine whether your CDC pipeline will be reliable or a constant source of WAL bloat and connection errors.

Core postgresql.conf Settings for Logical Replication

At minimum, you need logical WAL, enough slots, and sensible retention. On most clusters I manage, I start with something like this:

# postgresql.conf
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
max_replication_workers = 10
max_logical_replication_workers = 8
wal_keep_size = '1GB'          # optional safety net

wal_level = logical is non-negotiable for logical decoding. max_replication_slots should comfortably exceed your expected number of CDC consumers (including test slots). In my experience, it’s better to start a bit high and monitor usage than to debug mysterious slot-creation failures in production.

Network Access and Permissions

Next, you need a role with replication rights and proper host-based authentication. I usually create a dedicated user for CDC and grant it only what it needs:

CREATE ROLE cdc_user WITH LOGIN REPLICATION PASSWORD 'strong_password';
GRANT CONNECT ON DATABASE mydb TO cdc_user;
GRANT USAGE ON SCHEMA public TO cdc_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO cdc_user;

Then I add an entry to pg_hba.conf to allow replication connections:

# pg_hba.conf
host    replication    cdc_user    10.0.0.0/24    md5

One thing I learned early on is to test this with a simple psql replication connection before wiring in any CDC framework; that isolates permission and networking issues from application bugs.

Enabling and Managing Output Plugins

PostgreSQL ships with pgoutput, which is enough for many pipelines, but plugins like wal2json or pgoutput-based extensions often give a nicer payload. On managed services, these are sometimes pre-installed; on self-managed clusters, you may need to install the package and restart the server so the shared library is discoverable.

Once a plugin is available, you can verify it by creating a test slot:

SELECT *
FROM pg_create_logical_replication_slot('test_slot', 'pgoutput');

If this fails, fix it before you build any higher-level pipeline logic. When I’m choosing a plugin for a new project, I usually compare captured fields, performance characteristics, and compatibility with my target Postgres version; PostgreSQL Logical Decoding Plugins: A Developer’s Guide is a good lens to evaluate those trade-offs in more detail.

Creating and Inspecting PostgreSQL Logical Replication Slots

Once the server is configured, the next step is to actually create and manage PostgreSQL logical replication slots. I like to treat this as a repeatable checklist: create a slot, verify its state, test streaming, and clean it up when I’m done. That habit has saved me from a lot of orphaned slots and unexpected WAL growth.

Creating Logical Replication Slots (SQL and psql)

You can create a logical replication slot directly in SQL using pg_create_logical_replication_slot. Here’s a simple example using the built-in pgoutput plugin:

SELECT *
FROM pg_create_logical_replication_slot('cdc_main_slot', 'pgoutput');

This returns the slot name and its initial LSN. For JSON-style payloads, you might choose wal2json instead:

SELECT *
FROM pg_create_logical_replication_slot('cdc_main_slot', 'wal2json');

In my own workflows, I often create slots from a migration or bootstrap script so that slot names are part of the deployment contract, not something created ad hoc from a random shell.

Inspecting Slot State and WAL Impact

After creating a slot, I always confirm its state using pg_replication_slots:

SELECT slot_name,
       plugin,
       slot_type,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       wal_status
FROM pg_replication_slots
WHERE slot_type = 'logical';

active tells you whether a client is currently connected to the slot. restart_lsn shows where PostgreSQL must retain WAL from, which is critical for understanding how much WAL is pinned. If I see a restart LSN that isn’t advancing over time, that’s a red flag that the consumer might be stuck or misbehaving.

To keep an eye on WAL size, I sometimes cross-check with this query:

SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag
FROM pg_replication_slots
WHERE slot_name = 'cdc_main_slot';

This gives a quick sense of how far behind a slot is in terms of WAL bytes.

Reading From a Slot Programmatically

For CDC pipelines, the real test is streaming changes from the slot. While production systems should use the replication protocol, I often start with a small script to sanity-check that events flow as expected. Here’s a minimal Python example using psycopg2 to peek at decoded changes via pg_logical_slot_get_changes (useful for tests, not for high-throughput production):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=cdc_user")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute(
        "SELECT lsn, data FROM pg_logical_slot_get_changes(%s, NULL, 10)",
        ("cdc_main_slot",),
    )
    for lsn, data in cur.fetchall():
        print(f"LSN={lsn} EVENT={data}")

When I first wire up a new slot, I’ll run something like this while doing a few INSERT/UPDATE/DELETE operations to verify that the payloads contain the columns I expect and that LSNs advance.

Safely Dropping and Cleaning Up Slots

Logical replication slots are not “free” – if you forget about them, they keep WAL around forever. I make it a rule that any temporary slot created for testing or backfills must be explicitly dropped afterward:

SELECT pg_drop_replication_slot('cdc_main_slot');

If the slot is actively in use, this will fail; you must first ensure no client is connected. For pipelines that create ephemeral slots (for example, per-tenant or per-backfill), I build slot deletion into the workflow so I don’t rely on manual cleanup. In my experience, routinely inspecting pg_replication_slots and pruning unused entries is one of the simplest ways to avoid nasty surprises with disk usage and WAL bloat in production.

Implementing Durable Offset Tracking with LSNs

In every serious CDC project I’ve worked on, durable offset tracking ended up being the difference between a demo and a production-grade PostgreSQL logical replication slots pipeline. If you treat LSNs as first-class offsets and persist them carefully, you can reason clearly about exactly-once vs at-least-once behavior instead of hoping replay logic “just works.”

Designing an Offset Store Around LSNs

The core pattern I rely on is simple: maintain a small, strongly consistent metadata store keyed by stream or consumer name, with the last processed LSN. I often keep this inside Postgres itself for simplicity and transactional guarantees:

CREATE TABLE IF NOT EXISTS cdc_offsets (
    stream_name   text PRIMARY KEY,
    last_lsn      pg_lsn NOT NULL,
    updated_at    timestamptz NOT NULL DEFAULT now()
);

Each replication consumer (for example, a specific logical slot feeding a particular pipeline) gets a unique stream_name. In my experience, naming streams explicitly – like orders_to_kafka or billing_warehouse_loader – makes later debugging much easier.

At-Least-Once Semantics with Idempotent Sinks

At-least-once delivery is usually the easiest and most robust to achieve, especially when your sink (like Kafka or an idempotent upsert process) can tolerate duplicates. The pattern I use looks like this in pseudocode:

  • Read a batch of changes from the replication stream up to LSN L.
  • Write them to the downstream system (queue, log, etc.).
  • Once the batch is fully written, update cdc_offsets.last_lsn = L in a transaction.
  • Send replication feedback to Postgres with flush_lsn = L.

The key is that the offset update happens after the downstream write completes. On restart, your consumer can read last_lsn and safely re-read from that point or just beyond it, knowing that duplicates are acceptable because the sink is idempotent. One thing I learned the hard way was to keep offset updates lightweight and frequent; large gaps between stored LSNs make crash recovery more painful.

Toward Exactly-Once: Coordinating LSNs with Downstream State

Exactly-once semantics require tighter coordination between LSN persistence and downstream side effects. In practice, what has worked best for me is to co-locate offset updates with the sink transaction whenever possible:

  • Database sink: wrap applying changes and updating the offset row in the same database transaction.
  • Queue or log sink: store the LSN together with a durable commit marker (for example, in a separate metadata topic or table) that is written atomically from the consumer’s point of view.

Here’s a sketch of the first pattern when applying events into another Postgres schema:

BEGIN;
  -- Apply decoded change (simplified example)
  INSERT INTO warehouse.orders (...) VALUES (...)
  ON CONFLICT (id) DO UPDATE SET ...;

  -- Persist new offset
  INSERT INTO cdc_offsets (stream_name, last_lsn)
  VALUES ('orders_warehouse', :lsn)
  ON CONFLICT (stream_name)
  DO UPDATE SET last_lsn = EXCLUDED.last_lsn, updated_at = now();
COMMIT;

If the transaction commits, both the target row and the last LSN are durable. If it rolls back, Postgres logical replication will simply resend those changes on the next run. In my experience, this pattern gets you very close to exactly-once behavior as long as your decoding and application logic is deterministic.

Recovery, Replays, and Monitoring Offset Health

Offset tracking isn’t complete without a clear recovery story. Whenever I design one of these pipelines, I plan for three scenarios from day one:

  • Normal restart: read the last LSN from cdc_offsets and resume streaming. The server-side slot’s confirmed LSN should match or lag slightly behind.
  • Intentional replay: manually set last_lsn back to an earlier value to reprocess a window of data; ensure your sink can handle duplicates or has conflict detection.
  • Slot re-creation: if a slot is lost, create a new one starting from the stored LSN (or a safe snapshot), then let the consumer catch up.

For visibility, I monitor three things together: the stored last_lsn in the offset table, the slot’s confirmed_flush_lsn in pg_replication_slots, and an approximate end-of-WAL LSN like pg_current_wal_lsn(). Large gaps between any of these usually mean something is stuck. If you want more patterns and trade-offs around designing offset stores, Debezium connector for PostgreSQL :: Debezium Documentation is a good resource to internalize before you commit to a particular scheme.

Troubleshooting Common PostgreSQL Logical Replication Slot Issues

Even with good design, PostgreSQL logical replication slots occasionally misbehave. When I’m on call, I’ve found that having a short, repeatable checklist for the usual failure modes turns a scary incident into a straightforward debugging session.

1. WAL Bloat and Disk Filling Up

The most common issue I see is WAL bloat: disk usage grows because a slot’s restart_lsn doesn’t advance. My first step is always to inspect pg_replication_slots:

SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

If I see a large wal_lag and active = false, I assume the consumer is either down or stuck. My usual remediation steps are:

  • Restart or fix the consumer so it reconnects and starts acknowledging LSNs.
  • For abandoned test or backfill slots, pg_drop_replication_slot to release WAL.
  • As an emergency measure, coordinate with stakeholders to drop or recreate a problematic production slot if disk pressure is critical.

One thing I learned early on is to resist the urge to “just delete everything” under pressure; identify the exact slot causing the bloat before taking action.

2. Slots Not Advancing or Stuck Consumers

Sometimes a slot is active, but the confirmed_flush_lsn barely moves. In my experience, this usually points to a consumer that is receiving changes but not sending feedback, or is blocked downstream. I check three things:

  • Application logs for errors during event processing or offset persistence.
  • Downstream systems (queues, Kafka, target DB) for backpressure or quota errors.
  • Whether the consumer code is actually calling send_feedback(flush_lsn=…) after successfully processing changes.

A quick sanity check is to temporarily reduce the batch size, or log the LSN you plan to acknowledge; if logs show progress but confirmed_flush_lsn doesn’t move, your feedback path is broken.

3. Connection, Permission, and Configuration Problems

When a pipeline fails to start at all, it’s almost always a configuration or permission issue. My checklist looks like this:

  • Confirm wal_level = logical, and that PostgreSQL has been restarted since the change.
  • Verify the role has REPLICATION privileges and proper pg_hba.conf entries for host replication.
  • Ensure max_replication_slots and max_wal_senders are high enough; if slot creation fails with capacity errors, these are the first knobs I check.
  • Check that the output plugin (for example, pgoutput or wal2json) is installed and loadable; try creating a simple test slot via SQL to confirm.
SELECT *
FROM pg_create_logical_replication_slot('troubleshoot_test', 'pgoutput');

If this test fails, I fix the server-side configuration before touching any application code. In my experience, isolating the layers like this – first Postgres, then the output plugin, then the consumer – makes troubleshooting much faster and less stressful.

Conclusion and Next Steps for Your PostgreSQL Logical Replication Pipeline

Putting everything together, a robust pipeline around PostgreSQL logical replication slots comes down to a few disciplined habits: configure the server correctly, create and name slots deliberately, stream changes via the replication protocol, and treat LSNs as first-class offsets with durable tracking. In my experience, teams that bake in monitoring and cleanup from day one almost never get surprised by WAL bloat or mysterious data gaps later.

Operational Best Practices to Keep

The practices I keep coming back to are simple but powerful:

  • Automate slot lifecycle: creation, health checks, and safe teardown for temp or backfill slots.
  • Persist offsets carefully and acknowledge LSNs only after downstream work is durable.
  • Monitor slot lag, WAL size, and consumer health together, with clear ownership for each slot.
  • Regularly test recovery by restarting consumers and verifying replay behavior.

Advanced Directions: Sharding, Multi-Tenant, and Beyond

Once the basics are solid, there are plenty of interesting next steps. I’ve had good results using per-tenant or per-domain slots in multi-tenant systems, combined with a strong naming scheme and centralized offset store. For sharded environments, you can mirror the same pattern across multiple Postgres instances, then aggregate streams into Kafka or a data lake. From there, adding schema evolution handling, replay tooling, and automated backfills turns your logical replication setup into a full-fledged CDC platform rather than “just” a replication hook.

Join the conversation

Your email address will not be published. Required fields are marked *