Skip to content

Avoid reusing db_stress listeners across open retries#14765

Closed
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:2026_05_20_oncall_T272201927
Closed

Avoid reusing db_stress listeners across open retries#14765
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:2026_05_20_oncall_T272201927

Conversation

@xingbowang
Copy link
Copy Markdown
Contributor

@xingbowang xingbowang commented May 20, 2026

Summary

  • Rebuild db_stress event listeners through a shared InitializeListenersForOpen() helper.
  • Reinitialize listeners before retrying DB::Open() after injected open or open-compaction failures.

Context

  • A stress test failed with "Concurrent compaction of SST file detected".
  • Root cause: StressTest::Open() built DbStressListener once before its DB::Open() retry loop. With open fault injection, a DB::Open() attempt can create a DBImpl, schedule background compaction, and then fail during late open work such as persisting OPTIONS. During teardown of that failed DBImpl, DBImpl's shutdown flag can suppress later compaction callbacks, leaving listener-local compaction bookkeeping stale.
  • Diagnosis: Sandcastle DB LOGs showed file 16821 flushed, then a failed open attempt with an injected read error and a background compaction picking 16821. The crash was in DbStressListener::OnCompactionBegin, so this was stale db_stress listener state across open attempts rather than DBImpl allowing a real concurrent compaction of the same SST.
  • Fix: factor listener construction into InitializeListenersForOpen() and call it before each DB::Open() attempt, including the retry path after open/open-compaction failure. Each DBImpl open attempt now gets fresh listener state.
  • Verification: make clean; make db_stress -j192; make check-sources; git diff --check.

Test Plan

  • make clean
  • make db_stress -j192
  • make check-sources
  • git diff --check

@meta-cla meta-cla Bot added the CLA Signed label May 20, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

⚠️ clang-tidy: 1 warning(s) on changed lines

Completed in 68.6s.

Summary by check

Check Count
concurrency-mt-unsafe 1
Total 1

Details

db_stress_tool/db_stress_test_base.cc (1 warning(s))
db_stress_tool/db_stress_test_base.cc:149:7: warning: function is not thread safe [concurrency-mt-unsafe]

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 84045ee


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 84045ee


Summary

Clean, well-motivated refactoring of db_stress listener initialization into a reusable helper, correctly addressing stale listener state across DB::Open() retries. The change is test-tool-only and low risk.

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. FLAGS_db vs db_path (local variable) inconsistency — db_stress_test_base.cc:120
  • Issue: The original inline code at line 3988 constructs DbStressListener with db_path (a local const auto& db_path = GetDbPath() at line 3823). The new InitializeListenersForOpen() helper uses FLAGS_db directly. While GetDbPath() returns FLAGS_db today (line 68), this creates a coupling: if GetDbPath() is ever changed to return something other than FLAGS_db (e.g., a modified path), the helper would silently diverge from intent.
  • Suggested fix: Pass db_path as a parameter to InitializeListenersForOpen(), or call GetDbPath() inside the helper, to keep the indirection consistent with the rest of Open().

🟢 LOW / NIT

L1. Listener lifetime overlap with failed DBImpl — db_stress_test_base.cc:4121
  • Issue: When DB::Open() fails, the failed DBImpl may still hold shared_ptr references to the old listeners (via options_.listeners). The next InitializeListenersForOpen() call clears options_.listeners, dropping the StressTest-side references. The actual listener objects are destroyed only when the failed DBImpl fully destructs. This is safe because shared_ptr handles the lifetime correctly, but worth noting for future maintainers that old and new listener objects may briefly coexist.
L2. Method visibility — db_stress_test_base.h
  • Issue: The new method is declared under private: but RegisterAdditionalListeners() (which it calls) is protected virtual. This is fine for the current usage since InitializeListenersForOpen is only called from within StressTest::Open(), but if a subclass ever needs to customize the full listener initialization (not just add listeners), it would need to be made protected or virtual.
L3. No NotifyShuttingDown before re-initialization on retry — db_stress_test_base.cc:4121
  • Issue: On the retry path, InitializeListenersForOpen is called without first calling NotifyListenerShuttingDown() on the previous listeners. This is likely fine because the failed DBImpl handles its own shutdown, and the old listeners are about to be dropped. But the pattern in CleanUp() (line 110-113) does call NotifyListenerShuttingDown() before dropping listeners. The asymmetry is acceptable here since the DB open failed and normal shutdown notifications would have been suppressed.

Cross-Component Analysis

Context Affected? Notes
MultiOpsTxnsStressTest Yes, safe RegisterAdditionalListeners() is virtual and overridden to add MultiOpsTxnsStressListener. The helper calls it correctly, so fresh listeners are created for that subclass too.
Transaction DB path No The retry loop with InitializeListenersForOpen is inside the !FLAGS_use_txn branch. Transaction DB opens don't use fault injection retries.
ReadOnly DB No ReadOnly opens go through the same while(true) loop and benefit from the fix.

Positive Observations

  • The fix correctly identifies the root cause: stale per-DBImpl listener state carried across open retries.
  • Extracting InitializeListenersForOpen reduces code duplication and makes the retry path clearly match the initial path.
  • The options_.listeners.clear() at the top of the helper ensures no accumulation of listeners across retries.
  • Good commit message with detailed diagnosis of the original failure.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Root cause: StressTest::Open() built DbStressListener once before its DB::Open() retry loop. With open fault injection, a DB::Open() attempt can create a DBImpl, schedule background compaction, and then fail during late open work such as persisting OPTIONS. During teardown of that failed DBImpl, DBImpl's shutdown flag can suppress later compaction callbacks, leaving listener-local compaction bookkeeping stale.

Diagnosis: Sandcastle DB LOGs showed file 16821 flushed, then a failed open attempt with an injected read error and a background compaction picking 16821. The crash was in DbStressListener::OnCompactionBegin, so this was stale db_stress listener state across open attempts rather than DBImpl allowing a real concurrent compaction of the same SST.

Fix: factor listener construction into InitializeListenersForOpen() and call it before each DB::Open() attempt, including the retry path after open/open-compaction failure. Each DBImpl open attempt now gets fresh listener state.

Verification: make clean; make db_stress -j192; make check-sources; git diff --check.
@xingbowang xingbowang force-pushed the 2026_05_20_oncall_T272201927 branch from 84045ee to 3edfee1 Compare May 21, 2026 15:52
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 21, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D105969381.

@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 3edfee1


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 3edfee1


Summary

Clean, well-motivated fix for a real stress-test crash. The refactoring is correct and minimal. The InitializeListenersForOpen helper is semantically equivalent to the original inline code and is properly placed in the retry path.

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

None.

🟢 LOW / NIT

L1. Second retry path also reinitializes listeners before continue -- confirm placement is optimal — db_stress_test_base.cc:4131
  • Issue: The new InitializeListenersForOpen call in the retry path is placed just before continue, which means fresh listeners are created and then the loop immediately does another DB::Open(). This is correct. The call happens after the fault injection cleanup (deleting files, dropping unsynced data) but before the next DB::Open(). This ordering is fine since listener state is independent of filesystem state.
  • Suggested fix: No fix needed -- just a note for future maintainers.
L2. InitializeListenersForOpen uses GetDbPath() instead of local db_pathdb_stress_test_base.cc:133
  • Issue: The original inline code used the local const auto& db_path = GetDbPath(). The new helper calls GetDbPath() directly. These are functionally identical (GetDbPath() returns FLAGS_db). This is actually slightly better since it avoids coupling to a local variable.
  • Suggested fix: No fix needed.
L3. Declaration in private section — db_stress_test_base.h:80
  • Issue: InitializeListenersForOpen is private while it calls RegisterAdditionalListeners() which is protected virtual. This is fine in C++ and private is the correct access level since no subclass should call this helper independently.
  • Suggested fix: No change needed.

Cross-Component Analysis

Context Applies? Assessment
TransactionDB path NO Retry loop is inside !FLAGS_use_txn branch. Unaffected.
MultiOpsTxnsStressTest YES RegisterAdditionalListeners() correctly called inside the helper, so multi-ops txn listener also refreshed on retry.
BlobDB path YES Both BlobDB and regular DB::Open share options_.listeners. Correct.
ReadOnly open YES DB::OpenForReadOnly also uses options_.listeners. Correct.

Positive Observations

  • Excellent root cause analysis in the PR description.
  • Minimal, focused refactoring -- extracts exactly the right code.
  • No behavioral change on the happy path.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 21, 2026

@xingbowang merged this pull request in e42af37.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant