
Case Study: Making Spring Batch Partitioned Steps Truly Restartable

Introduction

When I joined this project, the team was already heavily invested in Spring Batch partitioned steps to push millions of records through nightly. On paper it looked great: parallelism, shorter SLAs, and clean separation of work. In practice, one missing piece kept biting us — reliable Spring Batch partitioned steps restartability.

Any transient failure (a node dying, a network hiccup, a bad record) meant we had to choose between rerunning the whole job or hacking around the framework to reprocess just the failed partitions. Both options were costly: longer batch windows, duplicated processing, and tense late-night releases. I’d seen restartability work well with simple chunk steps before, so it was frustrating to watch it fall apart once partitioning entered the picture.

This case study walks through how we untangled that gap: what went wrong with our original design, which Spring Batch features we had underused or misunderstood, and how we ended up with partitioned steps that could be stopped and restarted predictably, even under heavy load. My goal is to share the concrete patterns and configuration changes that finally made our partitioned jobs behave like good citizens in a production environment.

Background & Context: High-Volume ETL with Spring Batch Partitioned Steps

By the time I was asked to look at this pipeline, we were processing 15–20 million records every night from multiple upstream systems. The job was a classic ETL: pull from a mix of relational sources, normalize, apply a stack of business rules, and load into a reporting database before business hours. A single-threaded Spring Batch job simply couldn’t finish inside our shrinking batch window.

We chose Spring Batch partitioned steps to slice the workload horizontally. Each partition owned a distinct range of IDs (or shard key), processed via a chunk-oriented step, and ran on a pool of worker threads. On our busiest days, we ran dozens of partitions in parallel across several JVMs, which, from a throughput perspective, worked extremely well.

Where things got tricky was Spring Batch partitioned steps restartability. The moment we distributed work across many partitions and threads, tracking which slices had completed, which had failed, and which were in an indeterminate state became much more subtle. I’d seen teams assume that if a single non-partitioned step was restartable, simply wrapping it in a partitioned step would inherit that property automatically. In our case, that assumption led directly to orphaned partitions, double-processing, and manual database fixes whenever a job failed mid-run.

The Problem: Non-Restartable Partitioned Steps and Data Gaps

The real trouble with our setup showed up the first time a node died halfway through a run. In the job repository, Spring Batch had marked the master step as failed, but the details of which partitions had actually completed were murky. Some partitions had written all their data but never updated their execution status; others had failed midway, leaving half-loaded ranges in the target tables. When we tried to restart the job, Spring Batch treated some of those partitions as new work, some as already completed, and some as if they had never existed.

In practice, this meant we kept swinging between two bad options: rerun all partitions and accept duplicate processing, or surgically mark certain step executions as complete in the repository and pray we didn’t miss anything. I remember one night where a single failed partition left a 200k-row hole in the reporting tables, while another partition ran twice and inflated metrics for a key customer. That’s when it became obvious that Spring Batch partitioned steps restartability wasn’t just a nice-to-have; it was the difference between trusting our data and constantly second-guessing every batch run.

Constraints & Goals for Spring Batch Partitioned Steps Restartability

When I started fixing restartability, I didn’t have the luxury of redesigning the whole platform. We had tight overnight SLAs, no appetite for new infrastructure, and a strong requirement to stick with the existing Spring Batch stack and database. Any changes had to be incremental, low-risk, and deployable within a couple of sprints.

Given those constraints, we defined clear goals for Spring Batch partitioned steps restartability:

  • Deterministic restarts: on failure, only incomplete partitions should re-run; completed ones must never be reprocessed.
  • No data gaps or duplicates: the combination of partitioning and chunk configuration must guarantee full coverage of the input domain without overlaps.
  • Operational simplicity: support team members should be able to restart jobs using standard Spring Batch tooling, with no manual database edits or custom scripts.

My personal success criterion was simple: if a job failed at 3 a.m., the on-call could hit restart once, go back to bed, and trust the data in the morning.

Approach & Strategy: Designing for Idempotent, Restartable Partitions

To make Spring Batch partitioned steps restartability predictable, I treated each partition like a mini-job that must be safely re-runnable on its own. That pushed me toward three core principles: clearly defined partition keys, strictly idempotent writes, and disciplined use of the ExecutionContext so Spring Batch could reliably resume where it left off.

Designing Stable Partition Keys

The first thing I fixed was how we defined partitions. Instead of ad-hoc ranges that changed from run to run, I moved to stable, non-overlapping slices based on a deterministic key (for us, an ID range). Each partition was identified by a stable name (partition-0, partition-1, and so on) and carried minId/maxId in its ExecutionContext. That way, on a restart, Spring Batch could re-create the exact same slices and understand which ones had already completed.

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class RangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;
    private final int partitionCount;

    public RangePartitioner(long minId, long maxId, int partitionCount) {
        this.minId = minId;
        this.maxId = maxId;
        this.partitionCount = partitionCount;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // gridSize is deliberately ignored: the partition count is fixed so
        // that a restart always recreates exactly the same slices.
        Map<String, ExecutionContext> result = new HashMap<>();

        long targetSize = (maxId - minId + 1) / partitionCount;
        long start = minId;

        for (int i = 0; i < partitionCount; i++) {
            long end = (i == partitionCount - 1) ? maxId : start + targetSize - 1;

            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", start);
            context.putLong("maxId", end);
            context.putInt("partitionIndex", i);

            result.put("partition-" + i, context);
            start = end + 1;
        }
        return result;
    }
}

In my experience, this kind of stable partitioning removes a whole class of off-by-one bugs and makes it much easier to reason about gaps when something goes wrong.
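That claim is easy to check mechanically: for any bounds, the split above must tile [minId, maxId] with no gaps or overlaps, and must produce the same slices on every run. Here is a minimal, framework-free sketch of that check (plain long ranges stand in for the Spring types; names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitCheck {

    // Same splitting rule as the partitioner: equal slices, remainder to the last.
    static List<long[]> split(long minId, long maxId, int partitionCount) {
        List<long[]> ranges = new ArrayList<>();
        long targetSize = (maxId - minId + 1) / partitionCount;
        long start = minId;
        for (int i = 0; i < partitionCount; i++) {
            long end = (i == partitionCount - 1) ? maxId : start + targetSize - 1;
            ranges.add(new long[] {start, end});
            start = end + 1;
        }
        return ranges;
    }

    // Verifies full coverage of [minId, maxId] with no gaps or overlaps.
    static boolean tiles(List<long[]> ranges, long minId, long maxId) {
        long expectedStart = minId;
        for (long[] r : ranges) {
            if (r[0] != expectedStart || r[1] < r[0]) {
                return false;
            }
            expectedStart = r[1] + 1;
        }
        return expectedStart == maxId + 1;
    }

    public static void main(String[] args) {
        // Two runs over the same bounds produce identical slices (determinism),
        // and the slices tile the whole ID domain.
        List<long[]> first = split(1, 1_000_003, 8);
        List<long[]> second = split(1, 1_000_003, 8);
        for (int i = 0; i < first.size(); i++) {
            if (first.get(i)[0] != second.get(i)[0] || first.get(i)[1] != second.get(i)[1]) {
                throw new AssertionError("non-deterministic split");
            }
        }
        if (!tiles(first, 1, 1_000_003)) {
            throw new AssertionError("gap or overlap in partition ranges");
        }
        System.out.println("partitions=" + first.size() + " coverage=ok");
    }
}
```

A property check like this also makes a good unit test for whatever partitioner you actually ship.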

Making Writes Idempotent

The second pillar was idempotency: if a partition re-ran, it must not corrupt data. I shifted all writes to upsert-style operations keyed by business identifiers and added a version or checksum column where necessary. On the Spring Batch side, that meant using a writer that would update existing rows instead of blindly inserting, so reruns became safe by design. This also let me avoid elaborate compensating logic when a partition failed after partially committing its chunk.
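The property we were buying can be stated precisely: writing the same chunk twice must leave the target in the same state as writing it once. Here is a framework-free sketch of that convergence check, with a Map standing in for the target table (the Item record and names are illustrative, not from our codebase):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UpsertWriter {

    record Item(long id, String value) {}

    // Target "table": business ID -> value. A real writer issues MERGE /
    // ON CONFLICT SQL; the convergence property being demonstrated is the same.
    private final Map<Long, String> target = new LinkedHashMap<>();

    // Upsert keyed by business ID: insert if absent, overwrite if present.
    void write(List<Item> chunk) {
        for (Item item : chunk) {
            target.put(item.id(), item.value());
        }
    }

    Map<Long, String> snapshot() {
        return Map.copyOf(target);
    }

    public static void main(String[] args) {
        List<Item> chunk = List.of(new Item(1, "a"), new Item(2, "b"));

        UpsertWriter once = new UpsertWriter();
        once.write(chunk);

        UpsertWriter twice = new UpsertWriter();
        twice.write(chunk);
        twice.write(chunk); // simulated partition re-run

        // Re-running the same partition converges to the identical final state.
        if (!once.snapshot().equals(twice.snapshot())) {
            throw new AssertionError("writer is not idempotent");
        }
        System.out.println("idempotent=true rows=" + twice.snapshot().size());
    }
}
```

The same assertion, pointed at a real database, is worth keeping as an integration test for every writer in the job.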

Leveraging ExecutionContext for Safe Restarts

Next, I focused on how the ExecutionContext was used inside each partitioned step. Rather than relying only on the high-level step status, I ensured that critical state (current page, last processed ID, or cursor position) was explicitly stored and updated. That way, when Spring Batch restarted a failed partition, the reader could resume from the last known safe point instead of starting from the beginning or guessing based on timestamps.

One thing I learned the hard way was to keep the context small and deterministic: only store state that you’re actually going to use on restart, and make sure it’s written frequently enough that a crash won’t force you to replay huge chunks of data.
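Stripped of the framework, the contract looks like this: checkpoint the last safely processed ID at chunk boundaries, and on open, resume just past it. In this sketch a plain Map stands in for Spring's ExecutionContext, and the lastProcessedId key is illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class ResumableRangeReader {

    private final long minId;
    private final long maxId;
    private long nextId;

    ResumableRangeReader(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    // Mirrors ItemStream.open(ExecutionContext): resume after the checkpoint
    // if one exists, otherwise start at the beginning of the partition.
    void open(Map<String, Object> context) {
        Long last = (Long) context.get("lastProcessedId");
        nextId = (last != null) ? last + 1 : minId;
    }

    // Mirrors ItemStream.update(ExecutionContext): called at chunk boundaries,
    // so a crash replays at most one chunk of work.
    void update(Map<String, Object> context) {
        context.put("lastProcessedId", nextId - 1);
    }

    Long read() {
        return (nextId <= maxId) ? nextId++ : null;
    }

    public static void main(String[] args) {
        Map<String, Object> context = new HashMap<>();

        // First attempt: read 3 of 10 IDs, checkpoint, then "crash".
        ResumableRangeReader first = new ResumableRangeReader(1, 10);
        first.open(context);
        for (int i = 0; i < 3; i++) {
            first.read();
        }
        first.update(context);

        // Restart: a fresh reader resumes from ID 4, not from 1.
        ResumableRangeReader restarted = new ResumableRangeReader(1, 10);
        restarted.open(context);
        System.out.println("resumedAt=" + restarted.read());
    }
}
```

Spring Batch's built-in readers implement exactly this open/update pattern for you, provided they have a stable name and saveState enabled.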

Putting It All Together in the Step Configuration

Finally, I wired these ideas into the partitioned step definition itself. The master step created stable partitions, each worker step used the same reader/writer configured for idempotency, and the job repository handled the rest. With this setup, a failed run could be restarted via the normal Spring Batch mechanisms, and only incomplete partitions would be picked up again.

@Bean
public Step workerStep(StepBuilderFactory stepBuilderFactory,
                       ItemReader<MyItem> reader,
                       ItemProcessor<MyItem, MyItem> processor,
                       ItemWriter<MyItem> writer) {
    return stepBuilderFactory.get("workerStep")
            .<MyItem, MyItem>chunk(500)
            .reader(reader)
            .processor(processor)
            .writer(writer) // implements idempotent upsert logic
            .faultTolerant()
            .retry(TransientDataAccessException.class) // retryLimit alone retries nothing
            .retryLimit(3)
            .build();
}

@Bean
public Step partitionedStep(StepBuilderFactory stepBuilderFactory,
                            RangePartitioner partitioner,
                            Step workerStep,
                            TaskExecutor taskExecutor) {
    return stepBuilderFactory.get("partitionedStep")
            .partitioner("workerStep", partitioner)
            .step(workerStep)
            .taskExecutor(taskExecutor)
            .build();
}

Once I had these pieces in place, restarts became boring in the best possible way: the job repository knew exactly which partitions had failed, idempotent writers made reruns safe, and the ExecutionContext ensured we didn’t lose our position. From an operations perspective, that was the turning point where our Spring Batch partitioned steps started behaving like a robust, restartable ETL backbone instead of a nightly fire drill.


For teams looking to deepen this approach, I’d also recommend exploring patterns for resilient batch processing and idempotent writes in distributed systems, such as the dbt Community Forum thread “On the limits of incrementality”.

Implementation: From Fragile Partitions to Reliable Restarts

Once the strategy was clear, I focused on turning it into concrete Spring Batch configuration. My goal was that anyone familiar with Spring Batch could read the job config and immediately see how Spring Batch partitioned steps restartability was enforced end to end.

1. Partitioning Logic that Survives Restarts

I started by replacing our ad-hoc partitioning with a deterministic, range-based Partitioner. This ensured that a restart would always recreate identical partitions for the same input window. Each partition carried its minId, maxId, and an index in the ExecutionContext, which made debugging much easier.

@Component
public class StableRangePartitioner implements Partitioner {

    @Value("${job.minId}")
    private long minId;

    @Value("${job.maxId}")
    private long maxId;

    @Value("${job.partition.count:8}")
    private int partitionCount;

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> result = new LinkedHashMap<>();

        long targetSize = (maxId - minId + 1) / partitionCount;
        long start = minId;

        for (int i = 0; i < partitionCount; i++) {
            long end = (i == partitionCount - 1) ? maxId : start + targetSize - 1;

            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", start);
            context.putLong("maxId", end);
            context.putInt("partitionIndex", i);

            result.put("partition-" + i, context);
            start = end + 1;
        }
        return result;
    }
}

In my experience, using configuration properties for minId/maxId also made it trivial to rerun just a subset of data when we needed targeted reprocessing.

2. Reader & Writer Configured for Safe Re-Runs

Next, I wired the reader to filter strictly by the partition’s ID range so each slice was self-contained and non-overlapping. The writer implemented idempotent upserts, keyed by a business ID, so replaying a partition would converge to the same final state instead of creating duplicates.

@Bean
@StepScope
public JdbcPagingItemReader<MyItem> partitionedReader(DataSource dataSource,
                                                      @Value("#{stepExecutionContext[minId]}") long minId,
                                                      @Value("#{stepExecutionContext[maxId]}") long maxId) throws Exception {
    JdbcPagingItemReader<MyItem> reader = new JdbcPagingItemReader<>();
    reader.setDataSource(dataSource);
    reader.setPageSize(500);
    reader.setRowMapper(new MyItemRowMapper());
    // A stable name keys the reader's saved position in the ExecutionContext,
    // which is what lets a restarted partition resume mid-range.
    reader.setName("partitionedReader");

    SqlPagingQueryProviderFactoryBean qp = new SqlPagingQueryProviderFactoryBean();
    qp.setDataSource(dataSource);
    qp.setSelectClause("select *");
    qp.setFromClause("from source_table");
    qp.setWhereClause("where id between :minId and :maxId");
    qp.setSortKey("id");

    // getObject() is declared to throw Exception, hence the method signature.
    reader.setQueryProvider(qp.getObject());
    Map<String, Object> params = new HashMap<>();
    params.put("minId", minId);
    params.put("maxId", maxId);
    reader.setParameterValues(params);
    return reader;
}

@Bean
public ItemWriter<MyItem> idempotentWriter(JdbcTemplate jdbcTemplate) {
    return items -> {
        for (MyItem item : items) {
            jdbcTemplate.update(
                "merge into target_table t using (select ? as id, ? as value from dual) s " +
                "on (t.id = s.id) " +
                "when matched then update set t.value = s.value " +
                "when not matched then insert (id, value) values (s.id, s.value)",
                item.getId(), item.getValue()
            );
        }
    };
}

I’ve found that once the writer is truly idempotent, the fear of restarts largely disappears, because you know “run it again” can’t make things worse.

3. Wiring the Partitioned Step for Controlled Concurrency

With the partitioner and step-scoped reader in place, I defined the worker and master steps. The master step used a TaskExecutor to parallelize partitions but kept concurrency under control so we didn’t overwhelm the database or job repository.

@Bean
public Step workerStep(StepBuilderFactory steps,
                       ItemReader<MyItem> partitionedReader,
                       ItemProcessor<MyItem, MyItem> processor,
                       ItemWriter<MyItem> idempotentWriter) {
    return steps.get("workerStep")
            .<MyItem, MyItem>chunk(500)
            .reader(partitionedReader)
            .processor(processor)
            .writer(idempotentWriter)
            .faultTolerant()
            .retry(TransientDataAccessException.class) // retryLimit alone retries nothing
            .retryLimit(3)
            .build();
}

@Bean
public TaskExecutor partitionTaskExecutor() {
    SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("partition-");
    executor.setConcurrencyLimit(8);
    return executor;
}

@Bean
public Step partitionedStep(StepBuilderFactory steps,
                            StableRangePartitioner partitioner,
                            Step workerStep,
                            TaskExecutor partitionTaskExecutor) {
    return steps.get("partitionedStep")
            .partitioner("workerStep", partitioner)
            .step(workerStep)
            .taskExecutor(partitionTaskExecutor)
            .build();
}

From there, restarts were handled naturally by Spring Batch: only partitions with non-completed executions were rescheduled, and because the partitions were stable, there was no ambiguity.
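Conceptually, what the job repository does on restart reduces to a simple selection rule over the previous run's partition statuses. A simplified, framework-free model of that rule (statuses and partition names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RestartSelection {

    enum Status { COMPLETED, FAILED, STARTED } // STARTED = worker died mid-flight

    // Only partitions whose last execution did not complete are re-run;
    // COMPLETED partitions are never rescheduled.
    static List<String> partitionsToRerun(Map<String, Status> lastRun) {
        return lastRun.entrySet().stream()
                .filter(e -> e.getValue() != Status.COMPLETED)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, Status> lastRun = new LinkedHashMap<>();
        lastRun.put("partition-0", Status.COMPLETED);
        lastRun.put("partition-1", Status.FAILED);
        lastRun.put("partition-2", Status.STARTED);
        lastRun.put("partition-3", Status.COMPLETED);

        // Only the failed and orphaned partitions come back.
        System.out.println("rerun=" + partitionsToRerun(lastRun));
    }
}
```

The whole design above exists to make this rule safe to apply: stable names mean the statuses refer to the same slices, and idempotent writes mean re-running a STARTED partition cannot corrupt data.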

4. Monitoring, Metrics, and Operational Checks

The last piece was making the behavior visible. I added simple metrics and logs keyed by partitionIndex and ID range so the on-call could quickly answer three questions after a failure: which partitions ran, which failed, and whether there were any obvious gaps.

@Component
public class PartitionLoggingListener extends StepExecutionListenerSupport {

    private static final Logger log = LoggerFactory.getLogger(PartitionLoggingListener.class);

    @Override
    public void beforeStep(StepExecution stepExecution) {
        ExecutionContext ctx = stepExecution.getExecutionContext();
        log.info("Starting partition index={}, minId={}, maxId={}",
                ctx.getInt("partitionIndex"),
                ctx.getLong("minId"),
                ctx.getLong("maxId"));
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        ExecutionContext ctx = stepExecution.getExecutionContext();
        log.info("Finished partition index={}, readCount={}, writeCount={}, status={}",
                ctx.getInt("partitionIndex"),
                stepExecution.getReadCount(),
                stepExecution.getWriteCount(),
                stepExecution.getStatus());
        return stepExecution.getExitStatus();
    }
}

Once these logs and basic metrics were in place, it became much easier to prove that every ID range had been processed exactly once, even after restarts. That visibility was what finally convinced the wider team that our Spring Batch partitioned steps restartability story was solid.


If you want to go further, it’s worth exploring best practices for monitoring and alerting around Spring Batch jobs in production, such as the Medium article “Spring Batch — Monitoring and Metrics” by Bayonne Sensei.

Results: More Reliable Spring Batch Partitioned Steps and Faster Recovery

After rolling out the new design, the impact on Spring Batch partitioned steps restartability showed up almost immediately in our overnight runs. We didn’t change data volumes or hardware, but the way failures behaved was completely different.

Fewer Incidents and Predictable Restarts

Operationally, the number of “urgent” batch incidents dropped sharply. When something did fail—a database blip or a node restart—the on-call could simply rerun the job and know that only the incomplete partitions would execute. In my experience, that shift from “debug every failure” to “restart with confidence” was the biggest quality-of-life improvement for the team.

Faster Recovery and Shorter Batch Windows

Because we no longer had to rerun all partitions, recovery time shrank dramatically. If 2 out of 16 partitions failed, we reprocessed just those slices, which often cut recovery from hours to minutes. The job consistently finished within the nightly window, even on heavy days, and we stopped padding the schedule to compensate for unpredictable reruns.

Trustworthy Data and Less Manual Intervention

Most importantly, data quality incidents related to the batch pipeline essentially disappeared. The combination of deterministic partitioning and idempotent writes eliminated the data gaps and duplicates we used to chase every week. Manual database fixes became rare, and audit questions were much easier to answer because we could show which partitions ran, when, and with what outcome. For teams in similar situations, it can be worth comparing these kinds of improvements against broader ETL reliability guidance, such as dbt Labs’ “ETL Pipeline best practices for reliable data workflows”.

What Didn’t Work: Dead Ends in Spring Batch Partitioned Steps Restartability

Before I landed on the final design for Spring Batch partitioned steps restartability, I burned time on a few approaches that looked promising on paper but failed badly in production.

Naive Retries and Coarse-Grained Restarts

My first instinct was to lean on Spring Batch’s fault-tolerance: crank up retryLimit, add retryable exceptions, and simply rerun the whole job when things went wrong. That helped with transient errors but did nothing for partial commits. If a partition failed halfway, a full-job restart meant reprocessing every partition, which reintroduced duplicates and extended the batch window. Coarse-grained checkpoints (for example, relying only on step completion status) also backfired: a partition that had written thousands of rows but failed at the last chunk was still flagged as “failed,” leaving no clean way to resume without overprocessing.

Non-Idempotent Updates and Timestamp-Based Logic

Another dead end was trying to “outsmart” restarts with timestamp filters and delete-then-insert patterns. I briefly experimented with deleting target rows for a partition before reloading them, but that made failures more dangerous—if the job died after the delete and before the insert, we were left with data gaps. Timestamp-based logic wasn’t reliable either: clock skew and late-arriving records made it easy to miss or double-count data. What I eventually learned is that without truly idempotent writes and stable partition keys, every clever workaround just shifted the risk somewhere else instead of removing it.
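The delete-then-insert failure mode is easy to reproduce in miniature: if the job dies between the two phases, the partition's rows are simply gone, whereas an upsert never passes through an empty state. A framework-free sketch (names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class DeleteThenInsertRisk {

    // Delete-then-insert reload that "crashes" between the two phases.
    static void reloadWithCrash(Map<Long, String> target, long from, long to) {
        for (long id = from; id <= to; id++) {
            target.remove(id); // phase 1: delete the partition's rows
        }
        throw new IllegalStateException("crash before reload"); // phase 2 never runs
    }

    public static void main(String[] args) {
        Map<Long, String> target = new HashMap<>();
        for (long id = 1; id <= 5; id++) {
            target.put(id, "row-" + id);
        }

        try {
            reloadWithCrash(target, 1, 5);
        } catch (IllegalStateException expected) {
            // the job died mid-reload
        }

        // The partition's data is now a hole in the reporting tables.
        System.out.println("rowsLeft=" + target.size());
    }
}
```

Wrapping the delete and reload in one transaction can mask this for small partitions, but at our volumes the transactions were too large to be practical, which is why we moved to upserts instead.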

Lessons Learned & Recommendations for Spring Batch Partitioned Steps Restartability

Looking back at this project, the biggest surprise for me was how much reliability came from a few disciplined choices rather than fancy frameworks. If you’re aiming to improve Spring Batch partitioned steps restartability, these are the practices I’d prioritize.

Prioritize Stable Partitioning and Idempotent Writes

First, make your partitioning boring and predictable. Use deterministic ranges or keys that don’t change between runs, and store them in the ExecutionContext. Then, obsess over idempotency: treat every writer as if it will be called multiple times for the same data. Upserts keyed by a business ID, version columns, and conflict-safe SQL patterns are worth the effort. In my experience, once these two pieces are in place, most restart problems become operational rather than architectural.

Lean on ExecutionContext, but Keep It Focused

Second, use the ExecutionContext intentionally. Persist only the state you truly need to restart a partition (ranges, cursors, last processed ID), and keep it small and deterministic. I’ve found that overly clever context state—like semi-derived flags or large serialized objects—tends to rot and break under edge cases. Simple, explicit context keys are easier to inspect, test, and reason about when a partition fails at 3 a.m.


Test Restarts Like a First-Class Scenario

Finally, don’t treat restarts as an afterthought in QA. I now make a point of designing tests that deliberately kill workers mid-run: crash the JVM, force database timeouts, or stop the job halfway and then restart. Watching how partitions behave under these conditions surfaced issues I never saw in happy-path tests. For teams starting out, it’s helpful to compare your own restart test cases against the “Restarting Jobs” section of the Spring Batch reference documentation.
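A cheap way to make restart testing concrete, before involving real infrastructure, is a unit-level harness that kills processing partway through and then asserts that a restart converges to the same final state as an uninterrupted run. A framework-free sketch of that shape (the crash point, checkpoint key, and data are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class CrashRestartHarness {

    // Processes IDs [from..to] into the target map, checkpointing after each
    // item; throws after crashAfter items to simulate a mid-partition failure.
    static void run(long from, long to, Map<Long, String> target,
                    Map<String, Object> context, int crashAfter) {
        Long last = (Long) context.get("lastProcessedId");
        long start = (last != null) ? last + 1 : from;
        int processed = 0;
        for (long id = start; id <= to; id++) {
            if (processed == crashAfter) {
                throw new IllegalStateException("simulated crash at id=" + id);
            }
            target.put(id, "row-" + id); // idempotent upsert
            context.put("lastProcessedId", id); // checkpoint
            processed++;
        }
    }

    public static void main(String[] args) {
        // Clean run, no crash.
        Map<Long, String> clean = new HashMap<>();
        run(1, 8, clean, new HashMap<>(), Integer.MAX_VALUE);

        // Crashed run, then restart with the same context.
        Map<Long, String> crashed = new HashMap<>();
        Map<String, Object> context = new HashMap<>();
        try {
            run(1, 8, crashed, context, 3);
        } catch (IllegalStateException expected) {
            // worker died mid-partition
        }
        run(1, 8, crashed, context, Integer.MAX_VALUE);

        // Restarted state must equal the uninterrupted state.
        if (!clean.equals(crashed)) {
            throw new AssertionError("restart did not converge");
        }
        System.out.println("converged=true rows=" + crashed.size());
    }
}
```

The same convergence assertion, applied to a real job launched twice with the same job parameters, is the integration test I’d want in every partitioned pipeline.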

Conclusion / Key Takeaways

Making Spring Batch partitioned steps restartability a first-class goal completely changed how our nightly jobs behaved. Instead of fragile, all-or-nothing reruns, we ended up with predictable, partition-level restarts that operations could trust.

The practices that really moved the needle were simple but strict: stable partition keys, idempotent writers, and focused use of the ExecutionContext to capture exactly what each partition needs to resume. Combined with controlled concurrency and basic partition-level monitoring, those choices turned restarts from a risky maneuver into a routine button press. If I were starting a new Spring Batch project tomorrow, I’d bake these patterns in from day one rather than trying to retrofit them under production pressure.
