Cloud Topics: Make L0 GC tests deterministic #30066

Open
oleiman wants to merge 2 commits into dev from ct/noticket/make-gc-tests-deterministic

@oleiman
Member

@oleiman oleiman commented Apr 3, 2026

manual_clock and so on

Saw exactly one timing flake in CI and that was enough for me.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

@oleiman oleiman self-assigned this Apr 3, 2026
@oleiman oleiman added the claude-review label (adding this label to a PR triggers a workflow to review the code using Claude) Apr 3, 2026
Copilot AI review requested due to automatic review settings April 3, 2026 05:46
@claude

claude bot commented Apr 3, 2026

PR Review: Make GC tests deterministic

Summary

Good improvement. Switching from real-clock sleeps and Eventually/RPTEST_REQUIRE_EVENTUALLY_CORO polling to ss::manual_clock + tick/tick_until eliminates the primary source of flakiness in these tests. The approach is sound — the level_zero_gc_t template is already parameterized on Clock, so this is a natural use of the existing design.

What looks good

  • Zero-jitter function for manual_clock tests eliminates non-determinism from the jitter path
  • Explicit EXPECT_EQ assertions after tick_until are better than the old Eventually which returned a testing::AssertionResult that could be silently ignored
  • Removal of the Eventually helper — good cleanup, the tick-based approach is strictly better
  • Comment-trimming — removing obvious comments that restated what the code does
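The manual-clock approach described above can be sketched without Seastar. This is a minimal illustration only: `fake_manual_clock`, `periodic_collector`, and `demo_rounds` are hypothetical names invented for this sketch, and the `tick`/`tick_until` helpers here are assumptions about the general shape of such helpers, not the code in this PR.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for a manually advanced clock (not Seastar's implementation).
// Time only moves when advance() is called.
struct fake_manual_clock {
    using duration = std::chrono::milliseconds;
    using time_point = std::chrono::time_point<fake_manual_clock, duration>;
    static inline time_point current{};
    static time_point now() { return current; }
    static void advance(duration d) { current += d; }
};

// Toy periodic worker parameterized on Clock, standing in for a clock-agnostic GC loop.
template <typename Clock>
class periodic_collector {
public:
    explicit periodic_collector(typename Clock::duration interval)
      : _interval(interval), _next(Clock::now() + interval) {}

    // Runs one collection round for every deadline that has already passed.
    void poll() {
        while (Clock::now() >= _next) {
            ++_rounds;
            _next += _interval;
        }
    }
    int rounds() const { return _rounds; }

private:
    typename Clock::duration _interval;
    typename Clock::time_point _next;
    int _rounds = 0;
};

// Advance the clock by one step and let the worker observe the new time.
template <typename Clock, typename Worker>
void tick(Worker& w, typename Clock::duration step) {
    Clock::advance(step);
    w.poll();
}

// Tick until pred holds or max_ticks is exhausted; returns whether pred held.
template <typename Clock, typename Worker, typename Pred>
bool tick_until(Worker& w, typename Clock::duration step, Pred pred, int max_ticks) {
    for (int i = 0; i < max_ticks; ++i) {
        if (pred()) {
            return true;
        }
        tick<Clock>(w, step);
    }
    return pred();
}

// Drives the collector to exactly three rounds; returns -1 if that never happens.
inline int demo_rounds() {
    using namespace std::chrono_literals;
    periodic_collector<fake_manual_clock> gc(100ms);
    const bool ok = tick_until<fake_manual_clock>(
      gc, 10ms, [&gc] { return gc.rounds() == 3; }, 1000);
    return ok ? gc.rounds() : -1;
}
```

Because no real time passes, the number of ticks needed to reach a given state is the same on every run, which is the property the real-clock sleeps could not provide.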

Issues raised (see inline comments)

  1. tick_until fails silently on timeout (nit): When max_ticks is exhausted, the function just falls through. Adding ADD_FAILURE() at the end would make timeout failures immediately obvious in test output rather than requiring the reader to infer it from the subsequent EXPECT_EQ.

  2. Code duplication of tick/tick_until: These helpers are copy-pasted across 4 test fixtures (3 in level_zero_gc_tests.cc, 1 coroutine variant in level_zero_gc_mt_test.cc). Consider extracting to a shared utility.

  3. LevelZeroGCScaleOutTest::tick still uses real seastar::sleep: The mock's I/O cost simulation uses lowres_clock-based seastar::sleep, so these tests aren't fully deterministic yet. The comment explains the situation well, but it's worth considering whether the mock could also be templated on the clock type in a follow-up.
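Items 1 and 2 could be addressed together by a single shared helper that reports its own timeout. The sketch below is a plain C++ illustration under assumptions: the `advance` and `on_timeout` callbacks and `demo_timeout` are hypothetical, and in a GoogleTest fixture the `on_timeout` callback would typically wrap `ADD_FAILURE()` rather than push to a vector.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical shared helper. `advance` performs one tick (advance the clock,
// drain pending tasks); `on_timeout` is invoked if the predicate never held.
inline bool tick_until(
  const std::function<void()>& advance,
  const std::function<bool()>& pred,
  int max_ticks,
  const std::function<void(const std::string&)>& on_timeout) {
    for (int i = 0; i < max_ticks; ++i) {
        if (pred()) {
            return true;
        }
        advance();
    }
    if (pred()) {
        return true;
    }
    // Report the timeout here instead of falling through silently.
    on_timeout(
      "tick_until: predicate not satisfied after " + std::to_string(max_ticks)
      + " ticks");
    return false;
}

// Demo: collects the failures recorded for a predicate that is never satisfied.
inline std::vector<std::string> demo_timeout() {
    std::vector<std::string> failures;
    int ticks = 0;
    tick_until(
      [&ticks] { ++ticks; },
      [] { return false; },
      5,
      [&failures](const std::string& msg) { failures.push_back(msg); });
    return failures;
}
```

With the timeout reported at the point where it happens, a later `EXPECT_EQ` failure no longer has to be read as an implied timeout.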

Verdict

The PR achieves its stated goal for the non-ScaleOut tests. The ScaleOut tests are improved but retain some real-time dependency. No correctness or safety concerns. The suggestions above are nits/improvements — none are blockers.

Contributor

Copilot AI left a comment

Pull request overview

This PR updates Level Zero GC unit/integration tests to run deterministically by driving the GC with ss::manual_clock and replacing real-time “eventually” polling with explicit clock advancement + task-queue draining.

Changes:

  • Instantiate GC under test as level_zero_gc_t<ss::manual_clock> (with zero jitter) to control scheduling deterministically.
  • Replace Eventually(...)/real sleeps with tick() and tick_until(...) helpers that advance the manual clock and drain Seastar task queues.
  • Update multi-shard GC tests to use the same manual-clock-driven approach.
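The zero-jitter injection mentioned in the first bullet can be illustrated in isolation. This is a sketch under assumptions: `retry_scheduler`, `jitter_fn`, `random_jitter`, and `zero_jitter` are hypothetical names for this example and do not reflect the actual `level_zero_gc_t` interface.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <memory>
#include <random>
#include <utility>

using ms = std::chrono::milliseconds;
// Hypothetical jitter hook: given an upper bound, return the jitter to add.
using jitter_fn = std::function<ms(ms)>;

// Toy scheduler whose delay is base + jitter; the jitter source is injected.
class retry_scheduler {
public:
    retry_scheduler(ms base, ms max_jitter, jitter_fn jitter)
      : _base(base), _max_jitter(max_jitter), _jitter(std::move(jitter)) {}

    ms next_delay() { return _base + _jitter(_max_jitter); }

private:
    ms _base;
    ms _max_jitter;
    jitter_fn _jitter;
};

// Production-style jitter: uniformly random in [0, max].
inline jitter_fn random_jitter(unsigned seed) {
    auto rng = std::make_shared<std::mt19937>(seed);
    return [rng](ms max) {
        std::uniform_int_distribution<ms::rep> dist(0, max.count());
        return ms{dist(*rng)};
    };
}

// Test-style jitter: always zero, so every delay is exactly the base interval.
inline jitter_fn zero_jitter() {
    return [](ms) { return ms{0}; };
}
```

With `zero_jitter()` injected, a test that advances a manual clock by exactly the base interval knows the next deadline has been reached, with no slack needed for a random offset.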

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| src/v/cloud_topics/level_zero/gc/tests/level_zero_gc_tests.cc | Switches single-shard GC tests to manual clock + tick helpers; adjusts scale-out tests with a special tick to account for mock I/O delays. |
| src/v/cloud_topics/level_zero/gc/tests/level_zero_gc_mt_test.cc | Switches sharded/multi-thread GC tests to manual clock + tick helpers and injects zero jitter. |

@oleiman oleiman marked this pull request as draft April 3, 2026 06:31
@oleiman oleiman removed the claude-review label Apr 3, 2026
@oleiman oleiman force-pushed the ct/noticket/make-gc-tests-deterministic branch from 8d7c1a3 to 77548e0 Compare April 3, 2026 06:32
@oleiman oleiman requested a review from Copilot April 3, 2026 06:34
@oleiman oleiman changed the title Ct/noticket/make gc tests deterministic Cloud Topics: Make L0 GC tests deterministic Apr 3, 2026
@oleiman oleiman marked this pull request as ready for review April 3, 2026 06:37
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

@oleiman oleiman force-pushed the ct/noticket/make-gc-tests-deterministic branch from 77548e0 to cee5668 Compare April 3, 2026 07:10
@oleiman
Member Author

oleiman commented Apr 3, 2026

CI Failures: ducktape-release ducktape-debug didn't run

@oleiman
Member Author

oleiman commented Apr 3, 2026

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase

@vbotbuildovich
Collaborator

Retry command for Build#82743

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/upgrade_test.py::RedpandaInstallerTest.test_install_by_line

@oleiman
Member Author

oleiman commented Apr 3, 2026

/ci-repeat 1

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 3, 2026

Retry command for Build#82744

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/upgrade_test.py::RedpandaInstallerTest.test_install_by_line

@vbotbuildovich
Collaborator

CI test results

test results on build#82744
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| FLAKY(PASS) | ControllerLogLimitMirrorMakerTests | test_mirror_maker_with_limits | null | integration | https://buildkite.com/redpanda/redpanda/builds/82744#019d5440-f13f-4350-901b-77997354c68d | 19/21 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0323, p0=0.4810, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3917, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerLogLimitMirrorMakerTests&test_method=test_mirror_maker_with_limits |
| FLAKY(PASS) | SimpleEndToEndTest | test_relaxed_acks | {"write_caching": false} | integration | https://buildkite.com/redpanda/redpanda/builds/82744#019d5440-f145-4cdd-93eb-3fdb5cc1a5c8 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0034, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SimpleEndToEndTest&test_method=test_relaxed_acks |
| FLAKY(PASS) | RecreateTopicMetadataTest | test_recreated_topic_metadata_are_valid | {"replication_factor": 5} | integration | https://buildkite.com/redpanda/redpanda/builds/82744#019d5441-d384-4738-95e6-17a4cc767dcb | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RecreateTopicMetadataTest&test_method=test_recreated_topic_metadata_are_valid |
| FAIL | RedpandaInstallerTest | test_install_by_line | null | integration | https://buildkite.com/redpanda/redpanda/builds/82744#019d5440-f142-4502-9a37-01545fd29d11 | 0/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaInstallerTest&test_method=test_install_by_line |
| FAIL | RedpandaInstallerTest | test_install_by_line | null | integration | https://buildkite.com/redpanda/redpanda/builds/82744#019d5441-d386-46bf-984a-8e167280d63b | 0/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaInstallerTest&test_method=test_install_by_line |
| FLAKY(PASS) | WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/82744#019d5440-f143-4b59-9fa2-340f9eae40e3 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0749, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.2083, p1=0.0967, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |

@oleiman oleiman requested review from dotnwat and wdberkeley April 3, 2026 19:59
@oleiman
Member Author

oleiman commented Apr 3, 2026

/ci-repeat 1
skip-redpanda-build
skip-units
tests/rptest/tests/upgrade_test.py::RedpandaInstallerTest.test_install_by_line

3 participants