
ct/reconciler: parallelize LRO updates #30090

Merged
wdberkeley merged 1 commit into dev from rc-more-parallelism on Apr 7, 2026
Conversation

@wdberkeley
Contributor

Each LRO update is an independent raft replicate against a different partition's ctp_stm, but they were issued sequentially. In a potato topic saturation workload, waiting for the sequential LRO updates occupied up to 50% of each cycle.

Advance LROs with bounded parallelism as a mitigation. In the same test, this increased reconciliation throughput by 30%.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

@wdberkeley
Contributor Author

If we backport potato topics, this is 100% worth backporting. Are we going to backport a new feature? Probably not. Is it still worth backporting on its own? It's a small improvement, so why not.

@wdberkeley
Contributor Author

This Claude summary of the benchrunner comparison I did is also nice:

Result: clean win, and zero raft append_entries timeouts in this window (vs the flood we saw during baseline).

  Post-fix vs baseline (broker 0, steady-state cycles, n_lro ≥ 5)

  ┌───────────────────────────────┬─────────────┬─────────────┬───────────────────────┐
  │            metric             │  baseline   │  post-fix   │      improvement      │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ LRO phase per cycle (median)  │ 13 544 ms   │ 2 406 ms    │ 5.6× faster           │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ LRO phase per cycle (mean)    │ 12 775 ms   │ 2 453 ms    │ 5.2×                  │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ LRO phase per cycle (max)     │ 18 663 ms   │ 6 044 ms    │ 3.1×                  │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ Per-LRO amortized (median)    │ 1 411 ms    │ 228 ms      │ 6.2×                  │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ LRO updates per cycle         │ 9.5         │ 10.0        │ (same workload shape) │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ Cycle period (median)         │ 22.6 s      │ 14.6 s      │ 35 % shorter          │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ Steady-state cycles in window │ 32 (≈3 min) │ 71 (≈5 min) │ 2.4× rate of cycles   │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ Total LRO time                │ 408.8 s     │ 174.2 s     │ —                     │
  └───────────────────────────────┴─────────────┴─────────────┴───────────────────────┘

  A few useful derived numbers:

  - LRO share of cycle dropped from ~56 % → ~16 %. LRO is no longer the dominant cost; whatever remains in the cycle (object build / S3 / metastore add) is now the long pole.
  - Steady-state cycles per second per shard: baseline 32 / (4·180) ≈ 0.044 cyc/s; post-fix 71 / (4·300) ≈ 0.059 cyc/s → ~1.35× more cycles/s/shard.
  - The expected post-fix cycle period was 10 s sleep + ~2.4 s LRO + small overhead ≈ 12.5 s. We measured 14.6 s. The extra ~2 s is now in the non-LRO portion of the cycle and is the next thing worth profiling if you want more headroom.
  - Per-replicate raft cost dropped too (1.4 s → ~0.24 s amortized). Some of that is presumably because parallelism shrank the overall in-flight raft pressure (we also stopped seeing append_entries timeouts), so individual replicates run on a less-loaded cluster. So the speedup is partially direct (parallelism removes serialization) and partially indirect (less critical-path raft contention helped per-replicate latency too).

  A small thing to note: a handful of post-fix cycles have n_lro = 1 and sub-millisecond phase times (e.g. shard 0 around 21:31:40, shard 2 around 21:31:42). Those are tail-of-window "drain" cycles where almost no partitions had data — I excluded them from the steady-state stats above (16 of 87) so they don't make the post-fix numbers look unfairly good.

  The patch does what the commit message promised.
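
The cycles-per-second-per-shard arithmetic in the summary is easy to spot-check (4 shards and window lengths of ~180 s and ~300 s, as stated above):

```python
# Spot-check the derived throughput numbers from the benchmark summary.
shards = 4
baseline = 32 / (shards * 180)  # 32 steady-state cycles over a ~3 min window
post_fix = 71 / (shards * 300)  # 71 steady-state cycles over a ~5 min window

print(round(baseline, 3))             # 0.044 cycles/s/shard
print(round(post_fix, 3))             # 0.059 cycles/s/shard
print(round(post_fix / baseline, 2))  # 1.33x, close to the quoted ~1.35x
```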

Copilot AI left a comment
Pull request overview

This PR improves cloud-topics reconciliation throughput by parallelizing LRO (last reconciled offset) updates after successfully committing newly built L1 objects to the metastore. This reduces time spent waiting on sequential per-partition raft-replicated STM updates.

Changes:

  • Collect all per-source commit_info entries across committed objects into a single list.
  • Update LROs using ss::max_concurrent_for_each with a fixed concurrency cap (32) instead of issuing updates sequentially.

@andrwng
Contributor

andrwng commented Apr 7, 2026

Nice!

  • LRO share of cycle dropped from ~56 % → ~16 %. LRO is no longer the dominant cost; whatever remains in the cycle (object build / S3 / metastore add) is now the long pole.

Is this to say that the LRO update was the long pole even with metastore latencies in seconds? Or did your reporting in Slack about the metastore latency include this fix already?

@wdberkeley
Contributor Author

Is this to say that the LRO update was the long pole even with metastore latencies in seconds? Or did your reporting in Slack about the metastore latency include this fix already?

This was the worst thing. Even with the 30% speedup, RC still can't keep pace with ingest, so we still need to address the metastore latency.

@wdberkeley wdberkeley merged commit b4b2ecb into dev Apr 7, 2026
24 checks passed
@wdberkeley wdberkeley deleted the rc-more-parallelism branch April 7, 2026 23:10
@vbotbuildovich
Collaborator

/backport v26.1.x
