Scaling a Financial Reconciliation Pipeline With Serverless
We built an event-driven reconciliation pipeline on AWS using Step Functions, Lambda and DynamoDB. It worked great at low to moderate volumes, but large batch files exposed two bottlenecks: Lambda’s 15-minute timeout and DynamoDB hot partition keys. We addressed them with a hybrid execution pattern (Lambda for small files and ECS Fargate for large files) and deterministic write sharding (plus a bounded fan-out read path for rollups).
Context
In card and payments systems, authorization is real-time, but settlement often arrives later in batch files. This gap is why reconciliation is frequently file-driven: Ingest what the network reports, compare it with internal records and investigate differences.
Our pipeline is asynchronous. We don’t need every file processed immediately, but we do need steady progress without throttling spirals, retry amplification or queue buildup that starves other workloads.
At a high level, the flow looked like this:
- Files land in S3 (via SFTP sync/file drops)
- An event is emitted (EventBridge) and queued (SQS)
- Step Functions orchestrates parsing, normalization, matching and rollups
- DynamoDB stores transaction-level records and reconciliation state
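One common way to wire SQS to Step Functions in this kind of flow is a small consumer Lambda that unwraps the event and starts an execution per file. The sketch below shows that shape in Go; the state machine ARN variable, struct fields and JSON shapes are illustrative assumptions, not our exact code.

```go
package main

// Hypothetical SQS consumer: unwraps an EventBridge S3 "Object Created" event
// carried in the SQS message body and starts one Step Functions execution per file.

import (
	"context"
	"encoding/json"
	"fmt"
	"os"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sfn"
)

// Minimal view of the EventBridge detail for an S3 object-created event (illustrative).
type s3ObjectCreated struct {
	Detail struct {
		Bucket struct {
			Name string `json:"name"`
		} `json:"bucket"`
		Object struct {
			Key  string `json:"key"`
			Size int64  `json:"size"`
		} `json:"object"`
	} `json:"detail"`
}

func handler(ctx context.Context, ev events.SQSEvent) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := sfn.NewFromConfig(cfg)

	for _, msg := range ev.Records {
		var oc s3ObjectCreated
		if err := json.Unmarshal([]byte(msg.Body), &oc); err != nil {
			return fmt.Errorf("unexpected message shape: %w", err)
		}
		// Carry bucket, key and size in the execution input so later states
		// (including the Choice state described below) can route on size.
		input, _ := json.Marshal(map[string]any{
			"bucket":    oc.Detail.Bucket.Name,
			"key":       oc.Detail.Object.Key,
			"sizeBytes": oc.Detail.Object.Size,
		})
		_, err := client.StartExecution(ctx, &sfn.StartExecutionInput{
			StateMachineArn: aws.String(os.Getenv("RECON_STATE_MACHINE_ARN")), // hypothetical env var
			Input:           aws.String(string(input)),
		})
		if err != nil {
			return err
		}
	}
	return nil
}

func main() { lambda.Start(handler) }
```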
For small programs, it was a very comfortable setup: Low ops burden, low idle cost and easy parallelism.
What Broke at Higher Volume
When we onboarded larger programs, volume increased to ~1 million transactions/day and often arrived in large batch files. Two issues showed up consistently.
1. Lambda’s 15-Minute Ceiling
The heavy work is not a single query. It’s parsing fixed-width files, normalizing records, applying matching logic and writing state in a way that’s safe under retries.
For a file with tens of thousands of records, Lambda was fine. For a file with hundreds of thousands of records, processing time could exceed Lambda’s max runtime (15 minutes).
What that looked like in practice:
- A Step Functions task would time out and fail.
- Retries would fire based on our retry policy.
- Without careful checkpointing, retries could reprocess large portions of the file, increasing load on downstream systems and extending backlog.
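One mitigation is to checkpoint progress per chunk so a retried execution can skip work that already finished. A minimal sketch of that idea, assuming a hypothetical ReconCheckpoints table and leaving the per-chunk work behind a callback; the chunk writes themselves must still be idempotent:

```go
package recon

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// Record is a placeholder for a parsed settlement record.
type Record struct{ TransactionID string }

// ProcessFile walks the file in chunks and skips any chunk a previous attempt
// already completed, so a Step Functions retry does not redo the whole file.
func ProcessFile(ctx context.Context, db *dynamodb.Client, fileID string, chunks [][]Record,
	process func(context.Context, []Record) error) error {
	for i, chunk := range chunks {
		done, err := chunkDone(ctx, db, fileID, i)
		if err != nil {
			return err
		}
		if done {
			continue // a previous attempt already finished this chunk
		}
		if err := process(ctx, chunk); err != nil {
			return err
		}
		if err := markChunkDone(ctx, db, fileID, i); err != nil {
			return err
		}
	}
	return nil
}

func checkpointKey(fileID string, i int) map[string]types.AttributeValue {
	return map[string]types.AttributeValue{
		"PK": &types.AttributeValueMemberS{Value: "FILE#" + fileID},
		"SK": &types.AttributeValueMemberS{Value: fmt.Sprintf("CHUNK#%06d", i)},
	}
}

func chunkDone(ctx context.Context, db *dynamodb.Client, fileID string, i int) (bool, error) {
	out, err := db.GetItem(ctx, &dynamodb.GetItemInput{
		TableName:      aws.String("ReconCheckpoints"), // hypothetical table
		Key:            checkpointKey(fileID, i),
		ConsistentRead: aws.Bool(true),
	})
	if err != nil {
		return false, err
	}
	return len(out.Item) > 0, nil
}

func markChunkDone(ctx context.Context, db *dynamodb.Client, fileID string, i int) error {
	_, err := db.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String("ReconCheckpoints"),
		Item:      checkpointKey(fileID, i),
	})
	return err
}
```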
The issue was not ‘serverless is bad’. The issue was that we had outliers that didn’t fit Lambda’s runtime model.
2. DynamoDB Hot Partition Keys
Our original table design matched our primary query pattern: “Give me all transactions for a program on a file date.” A straightforward key looked like:
- PK: ProgramID#FileDate
- SK: TransactionID
That’s a reasonable model, but it concentrates writes and reads when a single program produces a high volume on a single date. During ingestion bursts, we saw throttling even when overall table capacity was not the problem; the distribution was.
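Concretely, every record for one program on one file date shared a single partition key, so a large file funnels its entire ingestion burst into one key. A hypothetical helper mirroring that schema:

```go
package recon

// Original (pre-sharding) key construction: one partition key per program/date,
// so a 500k-record file sends 500k writes at the same partition key.
func basePK(programID, fileDate string) string {
	return programID + "#" + fileDate // e.g. "PROG123#2024-03-01"
}

func baseSK(transactionID string) string {
	return transactionID
}
```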
Why Didn’t GSI Solve It?
We tried moving the access pattern to a global secondary index (GSI) and querying the index instead.
It didn’t remove the ingestion bottleneck. Index maintenance is part of the write path. If the GSI can’t keep up (because of its own capacity limits or uneven key distribution), DynamoDB can apply back-pressure and base-table writes can be throttled by the index. In other words, you generally can’t ‘index your way out’ of a hot write pattern. You have to fix write distribution at the partition key level.
The Fix
We didn’t replace the architecture. We made two targeted changes:
- Route outlier files to longer-running compute
- Distribute writes across shard key values
1. Hybrid Execution With a Choice State: Lambda + ECS Fargate
The simplest observation was that most files were small, and a small minority were very large. Treating them the same caused the large ones to dominate failure modes.
We introduced a Step Functions Choice state that routes based on file size (in some cases, expected record count):
- If file size is below a threshold: Process with Lambda
- If file size is above the threshold: Process with ECS Fargate
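The routing rule itself is simple; in the state machine it is a Choice state comparing a numeric size field in the execution input. Expressed in Go (the threshold below is illustrative, not our production value):

```go
package recon

// Illustrative routing rule. In the state machine this is a Choice state comparing
// a size field in the input against a threshold; the same rule is easy to keep in Go
// for tests or for a preflight step that also estimates record counts.
const largeFileBytes int64 = 256 << 20 // 256 MiB, illustrative threshold

func route(sizeBytes int64) string {
	if sizeBytes >= largeFileBytes {
		return "fargate" // long-running parse, no 15-minute ceiling
	}
	return "lambda" // common case: small file, low overhead
}
```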
We kept a single Go codebase that could run in both environments. Lambda handled the common case with low overhead. Fargate handled long-running work without fighting the Lambda timeout.
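A single binary can serve both targets by picking its entrypoint at startup. A rough sketch, assuming the shared processing logic lives in a run function, using the standard AWS_LAMBDA_FUNCTION_NAME environment variable to detect the Lambda runtime, and with the container’s input variable names as hypothetical placeholders:

```go
package main

import (
	"context"
	"os"

	"github.com/aws/aws-lambda-go/lambda"
)

// Input is the shape both entrypoints share (fields illustrative).
type Input struct {
	Bucket    string `json:"bucket"`
	Key       string `json:"key"`
	SizeBytes int64  `json:"sizeBytes"`
}

// run holds the shared parsing/normalization/matching logic (omitted here).
func run(ctx context.Context, in Input) error {
	// ... stream the file from S3, normalize, match, write state ...
	return nil
}

func main() {
	if os.Getenv("AWS_LAMBDA_FUNCTION_NAME") != "" {
		// Lambda entrypoint: small files routed here by the Choice state.
		lambda.Start(func(ctx context.Context, in Input) error { return run(ctx, in) })
		return
	}
	// Fargate entrypoint: large files; the task receives its input via container
	// environment overrides set by the Step Functions ECS integration.
	in := Input{
		Bucket: os.Getenv("INPUT_BUCKET"), // hypothetical variable names
		Key:    os.Getenv("INPUT_KEY"),
	}
	if err := run(context.Background(), in); err != nil {
		os.Exit(1)
	}
}
```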
This also made it easier to set different operational controls:
- Different concurrency limits for Lambda versus Fargate
- Different retry strategies for ‘small’ versus ‘large’ work
- Ability to tune CPU/memory for the heavy parser without overprovisioning the common case
2. Deterministic Write Sharding to Avoid Hot Keys
To fix the DynamoDB skew, we changed the primary key strategy from ProgramID#FileDate to ProgramID#FileDate#ShardN.
Shard assignment is deterministic. We derived ShardN from a stable attribute (TransactionID) using a hash/modulo.
Example:
- ShardN = hash(TransactionID) % 64
- PK = ProgramID#FileDate#ShardN
- SK = TransactionID
This spreads writes across multiple shard key values, so one program/date doesn’t collapse all traffic into a single partition key.
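A minimal Go sketch of the shard derivation, using FNV-1a over the transaction ID; the hash choice is illustrative, since any stable hash works:

```go
package recon

import (
	"fmt"
	"hash/fnv"
)

const shardCount = 64 // illustrative; see the note on shard count below

// shardFor deterministically maps a transaction to a shard, so retries and
// re-runs always land the same record on the same partition key.
func shardFor(transactionID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(transactionID))
	return h.Sum32() % shardCount
}

// shardedPK builds the new partition key: ProgramID#FileDate#ShardN.
func shardedPK(programID, fileDate, transactionID string) string {
	return fmt.Sprintf("%s#%s#%d", programID, fileDate, shardFor(transactionID))
}
```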
A note on shard count: There’s no magic number. More shards improve write distribution but increase read fan-out when you need a full-day view. We picked a shard count that reduced throttling in peak ingestion while keeping rollup reads manageable.
Read Path After Sharding: Bounded Fan-Out + Rollups
After sharding, “fetch all transactions for a program/date” becomes a fan-out across shards.
Doing that naively (pulling all raw items from every shard) is expensive. Instead, we used a two-step approach:
- Per-Shard Summaries: Each shard worker computes a small summary (counts, sums and whatever buckets are needed for reconciliation).
- Gather and Finalize the Daily Rollup: A reducer combines shard summaries into a single rollup keyed by (ProgramID, FileDate) and compares internal totals with settlement totals.
This keeps the read path focused on aggregations, not raw record movement.
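A reducer sketch, with hypothetical summary fields, to show that the daily view only ever moves aggregates rather than raw records:

```go
package recon

// ShardSummary is what each shard worker emits (fields illustrative).
type ShardSummary struct {
	Shard        int
	TxnCount     int64
	AmountCents  int64
	MatchedCount int64
}

// DailyRollup is the reconciliation view keyed by (ProgramID, FileDate).
type DailyRollup struct {
	ProgramID   string
	FileDate    string
	TxnCount    int64
	AmountCents int64
	Unmatched   int64
}

// Reduce combines per-shard summaries and compares the result against the
// settlement total reported by the network for the same program/date.
func Reduce(programID, fileDate string, shards []ShardSummary, settlementCents int64) (DailyRollup, bool) {
	r := DailyRollup{ProgramID: programID, FileDate: fileDate}
	for _, s := range shards {
		r.TxnCount += s.TxnCount
		r.AmountCents += s.AmountCents
		r.Unmatched += s.TxnCount - s.MatchedCount
	}
	balanced := r.AmountCents == settlementCents && r.Unmatched == 0
	return r, balanced
}
```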
A Few Operational Practices That Mattered
Retries: Backoff Plus Jitter
Throttling and transient failures are expected during bursty ingestion. The goal is to avoid synchronized retries that amplify load.
We used exponential backoff and jitter; for batch APIs, we retried only unprocessed items rather than replaying everything.
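For DynamoDB batch writes, the loop below sketches that shape: full-jitter backoff between attempts, and only the unprocessed items go back in. Error handling is trimmed and the caps are illustrative.

```go
package recon

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// batchWriteWithBackoff retries only the UnprocessedItems DynamoDB returns,
// sleeping with exponential backoff plus full jitter between attempts.
func batchWriteWithBackoff(ctx context.Context, db *dynamodb.Client,
	items map[string][]types.WriteRequest) error {

	const (
		maxAttempts = 8
		baseDelay   = 50 * time.Millisecond
		maxDelay    = 5 * time.Second
	)

	pending := items
	for attempt := 0; len(pending) > 0; attempt++ {
		if attempt >= maxAttempts {
			return fmt.Errorf("giving up with %d tables still pending", len(pending))
		}
		out, err := db.BatchWriteItem(ctx, &dynamodb.BatchWriteItemInput{RequestItems: pending})
		if err != nil {
			return err // treated as non-retryable here for brevity; real code inspects the error
		}
		pending = out.UnprocessedItems
		if len(pending) == 0 {
			break
		}
		// Full jitter: sleep a random duration in [0, min(maxDelay, baseDelay*2^attempt)).
		ceiling := baseDelay << attempt
		if ceiling > maxDelay {
			ceiling = maxDelay
		}
		select {
		case <-time.After(time.Duration(rand.Int63n(int64(ceiling)))):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```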
Correctness: Idempotency Everywhere It Matters
Backoff protects throughput. Idempotency protects correctness.
Anywhere a retry could double-count or advance state twice, we used stable idempotency keys and conditional writes. This mattered most in rollups and any workflow step that produces a durable side effect.
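For the rollup write, one sketch of the conditional-write idea: store a hash of the inputs alongside the rollup and only overwrite when that hash changes, so re-running the reducer with identical inputs is a no-op. Table and attribute names are hypothetical, and the DailyRollup/ShardSummary types come from the reducer sketch above.

```go
package recon

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"strconv"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// writeRollup persists the daily rollup only if the item does not exist yet or
// its inputs changed. Re-running with the same shard summaries changes nothing.
func writeRollup(ctx context.Context, db *dynamodb.Client, r DailyRollup, inputs []ShardSummary) error {
	raw, _ := json.Marshal(inputs) // good enough for a sketch; real code canonicalizes the inputs
	sum := sha256.Sum256(raw)
	inputHash := hex.EncodeToString(sum[:])

	_, err := db.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String("ReconRollups"), // hypothetical table
		Item: map[string]types.AttributeValue{
			"PK":          &types.AttributeValueMemberS{Value: r.ProgramID + "#" + r.FileDate},
			"SK":          &types.AttributeValueMemberS{Value: "ROLLUP"},
			"TxnCount":    &types.AttributeValueMemberN{Value: strconv.FormatInt(r.TxnCount, 10)},
			"AmountCents": &types.AttributeValueMemberN{Value: strconv.FormatInt(r.AmountCents, 10)},
			"InputHash":   &types.AttributeValueMemberS{Value: inputHash},
		},
		// Write only if this is the first rollup or the inputs actually changed.
		ConditionExpression: aws.String("attribute_not_exists(PK) OR InputHash <> :h"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":h": &types.AttributeValueMemberS{Value: inputHash},
		},
	})
	var ccf *types.ConditionalCheckFailedException
	if errors.As(err, &ccf) {
		return nil // identical inputs: the existing rollup already reflects them
	}
	return err
}
```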
What We Monitored After the Change
After we rolled out the hybrid routing and sharding, we monitored a small set of signals to confirm the system was behaving under burst load:
- DynamoDB throttling and retry rates, on the base table and any indexes
- Shard skew, i.e. a small number of shards staying consistently hotter than the rest
- End-to-end pipeline health via queue depth and DLQ volume during spikes
- Rollup idempotency, by validating that re-runs did not change totals unless inputs actually changed
Closing
The main lesson for me was that a ‘natural’ key can be great for queries and still fail under spiky batch ingestion. Deterministic sharding fixed the write distribution problem, and a bounded fan-out rollup approach kept the daily reconciliation view intact. If you are processing large batch files asynchronously, the combination that tends to hold up is routing outliers to compute that matches the runtime profile (Lambda plus containers is a pragmatic mix), fixing key distribution at the write path, capping concurrency intentionally, using backoff with jitter and making reducers and rollups idempotent.


