
ct/reconciler: parallelize LRO updates #30090

Merged
wdberkeley merged 1 commit into dev from rc-more-parallelism on Apr 7, 2026
Conversation

@wdberkeley
Contributor

Each LRO update is an independent raft replicate against a different partition's ctp_stm, but they were issued sequentially. In a potato topic saturation workload, waiting for the sequential LRO updates occupied up to 50% of each cycle.

Advance LROs with bounded parallelism as a mitigation. In the same test, this increased reconciliation throughput by 30%.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

@wdberkeley
Contributor Author

If we backport potato topics, this is 100% worth backporting. Are we going to backport a new feature? Probably not. Is it still worth backporting on its own? It's a small improvement, so why not.

@wdberkeley
Contributor Author

This Claude summary of the benchrunner comparison I did is also nice:

Result: clean win, and zero raft append_entries timeouts in this window (vs the flood we saw during baseline).

  Post-fix vs baseline (broker 0, steady-state cycles, n_lro ≥ 5)

  ┌───────────────────────────────┬─────────────┬─────────────┬───────────────────────┐
  │            metric             │  baseline   │  post-fix   │      improvement      │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ LRO phase per cycle (median)  │ 13 544 ms   │ 2 406 ms    │ 5.6× faster           │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ LRO phase per cycle (mean)    │ 12 775 ms   │ 2 453 ms    │ 5.2×                  │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ LRO phase per cycle (max)     │ 18 663 ms   │ 6 044 ms    │ 3.1×                  │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ Per-LRO amortized (median)    │ 1 411 ms    │ 228 ms      │ 6.2×                  │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ LRO updates per cycle         │ 9.5         │ 10.0        │ (same workload shape) │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ Cycle period (median)         │ 22.6 s      │ 14.6 s      │ 35 % shorter          │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ Steady-state cycles in window │ 32 (≈3 min) │ 71 (≈5 min) │ 2.4× rate of cycles   │
  ├───────────────────────────────┼─────────────┼─────────────┼───────────────────────┤
  │ Total LRO time                │ 408.8 s     │ 174.2 s     │ —                     │
  └───────────────────────────────┴─────────────┴─────────────┴───────────────────────┘

  A few useful derived numbers:

  - LRO share of cycle dropped from ~56 % → ~16 %. LRO is no longer the dominant cost; whatever remains in the cycle (object build / S3 / metastore add) is now the long pole.
  - Steady-state cycles per second per shard: baseline 32 / (4·180) ≈ 0.044 cyc/s; post-fix 71 / (4·300) ≈ 0.059 cyc/s → ~1.35× more cycles/s/shard.
  - The expected post-fix cycle period was 10 s sleep + ~2.4 s LRO + small overhead ≈ 12.5 s. We measured 14.6 s. The extra ~2 s is now in the non-LRO portion of the cycle and is the next thing worth profiling if you want more headroom.
  - Per-replicate raft cost dropped too (1.4 s → ~0.24 s amortized). Some of that is presumably because parallelism shrank the overall in-flight raft pressure (we also stopped seeing append_entries timeouts), so individual replicates run on a less-loaded cluster. So the speedup is partially direct (parallelism removes serialization) and partially indirect (less critical-path raft contention helped per-replicate latency too).

  A small thing to note: a handful of post-fix cycles have n_lro = 1 and sub-millisecond phase times (e.g. shard 0 around 21:31:40, shard 2 around 21:31:42). Those are tail-of-window "drain" cycles where almost no partitions had data — I excluded them from the steady-state stats above (16 of 87) so they don't make the post-fix numbers look unfairly good.

  The patch does what the commit message promised.
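
The cycles-per-second-per-shard arithmetic in the summary is easy to spot-check (4 shards and window lengths of ~180 s and ~300 s, as stated above):

```python
# Spot-check the derived throughput numbers from the benchmark summary.
shards = 4
baseline = 32 / (shards * 180)  # 32 steady-state cycles over a ~3 min window
post_fix = 71 / (shards * 300)  # 71 steady-state cycles over a ~5 min window

print(round(baseline, 3))             # 0.044 cycles/s/shard
print(round(post_fix, 3))             # 0.059 cycles/s/shard
print(round(post_fix / baseline, 2))  # 1.33x, close to the quoted ~1.35x
```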

Copilot AI left a comment
Pull request overview

This PR improves cloud-topics reconciliation throughput by parallelizing LRO (last reconciled offset) updates after successfully committing newly built L1 objects to the metastore. This reduces time spent waiting on sequential per-partition raft-replicated STM updates.

Changes:

  • Collect all per-source commit_info entries across committed objects into a single list.
  • Update LROs using ss::max_concurrent_for_each with a fixed concurrency cap (32) instead of issuing updates sequentially.

@andrwng
Contributor

andrwng commented Apr 7, 2026

Nice!

  • LRO share of cycle dropped from ~56 % → ~16 %. LRO is no longer the dominant cost; whatever remains in the cycle (object build / S3 / metastore add) is now the long pole.

Is this to say that the LRO update was the long pole even with metastore latencies in seconds? Or did your reporting in Slack about the metastore latency include this fix already?

@wdberkeley
Contributor Author

Is this to say that the LRO update was the long pole even with metastore latencies in seconds? Or did your reporting in Slack about the metastore latency include this fix already?

This was the worst thing. Even with the 30% speedup, RC still can't keep pace with ingest, so we still need to address the metastore latency.

@wdberkeley wdberkeley merged commit b4b2ecb into dev Apr 7, 2026
24 checks passed
@wdberkeley wdberkeley deleted the rc-more-parallelism branch April 7, 2026 23:10
@vbotbuildovich
Collaborator

/backport v26.1.x
