Narrow a race in BlobFile::UnlinkSstFile #14730
✅ clang-tidy: No findings on changed lines. Completed in 189.1s.
🟡 Codex Code Review: auto-triggered after CI passed, reviewing commit 9d2026c.
❌ Codex review failed before producing findings.
ℹ️ About this response
Generated by Codex CLI.
✅ Claude Code Review: auto-triggered after CI passed, reviewing commit 9d2026c.
Summary
This PR fixes a real race condition where
High-severity findings (2):
Findings
🔴 HIGH
H1. Pre-existing race can cause silent data loss
| Context | Does code execute? | Assumptions hold? | Action needed? |
|---|---|---|---|
| GC disabled | NO (uses BlobDBListener) | N/A | Safe |
| CompactFiles (manual) | YES | Same race possible | Best-effort handles it |
| TTL blob files | NO (separate obsolescence path) | N/A | Safe |
| DB::Open | NO (uses NoLock variant) | N/A | Safe |
| ReadOnly DB | NO (no writes) | N/A | Safe |
Assumption Stress Test:
- Claim: "Unlink miss is benign": TRUE for the unlink itself (SST was never in the set). FALSE for the downstream effect: `MarkUnreferencedBlobFilesObsolete` can now prematurely obsolete the blob file.
- Claim: "Reordering fixes the ordering": TRUE within a single listener callback. FALSE across concurrent listener callbacks on different threads.
- Claim: "Consistent with existing graceful handling": TRUE. `UnlinkSstFromBlobFile` already handles missing blob files gracefully. Making `UnlinkSstFile` similarly graceful is architecturally consistent.
Positive Observations
- The PR correctly identifies a real race condition that causes crashes under ASAN/UBSAN.
- The best-effort pattern for `UnlinkSstFile` is consistent with the existing `UnlinkSstFromBlobFile` pattern for missing blob files.
- The listener reordering is a good improvement that ensures GC bookkeeping runs before DB property queries, reducing the race window.
- The two new tests are well-designed: one verifies ordering, one verifies graceful handling of the race.
- The `LinkSstFile` assert does NOT have a symmetric race risk because SST file numbers are unique and atomically allocated.
- Using gtest assertions in sync point callbacks is an established pattern in RocksDB's test suite (widely used in transaction tests and elsewhere).
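The observation about `LinkSstFile` rests on file numbers never repeating, even when several threads request them at once. A minimal sketch of that property follows, assuming a counter backed by `std::atomic`. `FileNumberAllocator` and `CountDistinct` are hypothetical names used for illustration only and are not taken from the RocksDB source.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical allocator (not RocksDB code): each call to Next() returns a
// value that no other call has returned, regardless of the calling thread.
class FileNumberAllocator {
 public:
  uint64_t Next() { return next_.fetch_add(1, std::memory_order_relaxed); }

 private:
  std::atomic<uint64_t> next_{1};
};

// Draws `per_thread` numbers on each of `threads` threads and returns how
// many distinct values were handed out in total.
size_t CountDistinct(FileNumberAllocator& alloc, int threads, int per_thread) {
  std::mutex mu;
  std::set<uint64_t> seen;
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&alloc, &mu, &seen, per_thread] {
      std::vector<uint64_t> local;
      for (int i = 0; i < per_thread; ++i) {
        local.push_back(alloc.Next());
      }
      // Merge under a lock; the set itself is not thread-safe.
      std::lock_guard<std::mutex> lock(mu);
      seen.insert(local.begin(), local.end());
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return seen.size();
}
```

If every draw is distinct, inserting a freshly allocated number into a per-blob-file set cannot collide with an existing entry, which is why a uniqueness assert on the link side is not exposed to the same race as the unlink side.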
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
- `/claude-review [context]`: Request a code review
- `/claude-query <question>`: Ask about the PR or codebase
🟡 Codex Code Review: requested by @mszeszko-meta.
❌ Codex review failed before producing findings.
Error reported by the run:
2026-05-13T00:47:27.585185Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
Diff to Review (only fragments survive in the captured output):
- utilities/blob_db/blob_db_impl.cc: hunk near BlobDBImpl::InitializeBlobFileToSstMapping and BlobDBImpl::UpdateLiveSSTSize
- class BlobDBListenerGC : public BlobDBListener: hunk @@ -51,17 +52,23 @@ around OnCompactionCompleted, adding #include "rocksdb/listener.h"
- utilities/blob_db/blob_db_test.cc: adds TEST_F(BlobDBTest, ProcessFlushJobInfoBeforeUpdateLiveSSTSize) and TEST_F(BlobDBTest, UnlinkSstBeforeFlushLink) before TEST_F(BlobDBTest, ShutdownWait)
Motivation
Make `BlobFile::UnlinkSstFile` best-effort to fix a race between flush listener notification and compaction scheduling. When `NotifyOnFlushCompleted` releases the DB mutex to call the listener, there is a window where a compaction can start on the flushed SST before `ProcessFlushJobInfo` links it to its blob file. Under ASAN/UBSAN slowdown this window widens enough for `ProcessCompactionJobInfo` to attempt unlinking an SST that was never linked, crashing on the assert. The fix makes `UnlinkSstFile` return a bool and log a warning on miss, consistent with `UnlinkSstFromBlobFile`, which already handles missing blob files gracefully.
Test Plan
`./blob_db_test --gtest_filter="*UnlinkSstBeforeFlushLink*"`: deterministic repro that hooks the flush listener before `ProcessFlushJobInfo` and fires `ProcessCompactionJobInfo` on the unlinked SST; crashes without the fix, passes with it.
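The best-effort unlink described in the Motivation can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code from the PR: `BlobFileModel` and `UnlinkAndReport` are hypothetical names, the set of linked SST numbers is a plain `std::unordered_set`, and the warning goes to `std::cerr` rather than to the DB's info log.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <unordered_set>

// Hypothetical, simplified stand-in for a blob file's SST bookkeeping.
// Only the link/unlink contract is modeled here.
class BlobFileModel {
 public:
  // Linking still asserts: a given SST number is expected to be linked once.
  void LinkSstFile(uint64_t sst_file_number) {
    const bool inserted = linked_ssts_.insert(sst_file_number).second;
    assert(inserted);
    (void)inserted;  // Avoid an unused-variable warning in NDEBUG builds.
  }

  // Best-effort unlink: returns false on a miss instead of asserting.
  bool UnlinkSstFile(uint64_t sst_file_number) {
    return linked_ssts_.erase(sst_file_number) == 1;
  }

  bool HasLinkedSsts() const { return !linked_ssts_.empty(); }

 private:
  std::unordered_set<uint64_t> linked_ssts_;
};

// Caller-side handling: a miss is reported as a warning and otherwise ignored.
bool UnlinkAndReport(BlobFileModel& blob_file, uint64_t sst_file_number) {
  const bool unlinked = blob_file.UnlinkSstFile(sst_file_number);
  if (!unlinked) {
    std::cerr << "warning: SST " << sst_file_number
              << " was not linked to this blob file; skipping unlink\n";
  }
  return unlinked;
}
```

In this model, an unlink that arrives before the corresponding link returns `false` and logs a warning, while the ordinary link-then-unlink sequence returns `true` and leaves the set empty.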