Skip to content

Narrow a race in BlobFile::UnlinkSstFile#14730

Draft
mszeszko-meta wants to merge 1 commit into
facebook:mainfrom
mszeszko-meta:fix-unlink-sst-file
Draft

Narrow a race in BlobFile::UnlinkSstFile#14730
mszeszko-meta wants to merge 1 commit into
facebook:mainfrom
mszeszko-meta:fix-unlink-sst-file

Conversation

@mszeszko-meta
Copy link
Copy Markdown
Contributor

Motivation

Make BlobFile::UnlinkSstFile best-effort to fix a race between flush listener notification and compaction scheduling. When NotifyOnFlushCompleted releases the DB mutex to call the listener, there is a window where a compaction can start on the flushed SST before ProcessFlushJobInfo links it to its blob file. Under ASAN/UBSAN slowdown this window widens enough for ProcessCompactionJobInfo to attempt unlinking an SST that was never linked, crashing on the assert. The fix makes UnlinkSstFile return a bool and log a warning on miss, consistent with UnlinkSstFromBlobFile which already handles missing blob files gracefully.

Test Plan

./blob_db_test --gtest_filter="*UnlinkSstBeforeFlushLink*" — deterministic repro: hooks the flush listener before ProcessFlushJobInfo and fires ProcessCompactionJobInfo on the unlinked SST; crashes without the fix, passes with it.

@meta-cla meta-cla Bot added the CLA Signed label May 11, 2026
@github-actions
Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 189.1s.

@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 9d2026c


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 9d2026c


Summary

This PR fixes a real race condition where UnlinkSstFile asserts when a compaction processes an SST before the flush listener has linked it. The fix is sound for the immediate crash, but the underlying race has a deeper consequence: when GC rewrites blobs during compaction, the blob file can be prematurely marked obsolete and later deleted, causing silent data loss. The reordering mitigates this but does not eliminate it.

High-severity findings (2):

  • [blob_db_impl.cc:577] Pre-existing race: if ProcessCompactionJobInfo runs before ProcessFlushJobInfo and GC rewrites blobs, the source blob file can be prematurely marked obsolete (empty linked_sst_files_ + stale flush_sequence_), leading to data loss when the blob file is later deleted. The PR converts the crash to a silent warning, which makes this scenario non-crashing but still data-losing.
  • [blob_db_listener.h:55] The reordering narrows the race window but does not eliminate it. A compaction that completes on a different thread between DB mutex release and flush listener execution can still trigger the race.
Full review (click to expand)

Findings

🔴 HIGH

H1. Pre-existing race can cause silent data loss — blob_db_impl.cc:577
  • Issue: When a compaction with GC completes before ProcessFlushJobInfo links the flushed SST, ProcessCompactionJobInfo silently misses the unlink (with the fix) and MarkUnreferencedBlobFilesObsolete may mark the source blob file obsolete because (a) linked_sst_files_ is empty (SST was never linked) and (b) flush_sequence_ hasn't been updated yet by the pending ProcessFlushJobInfo. Later, ProcessFlushJobInfo links the SST to the now-obsolete blob file, but DeleteObsoleteFiles only checks snapshot visibility, not linked_sst_files_, so the blob file can be deleted from disk while live SSTs still reference it.
  • Root cause: The race window between version install and listener notification is a fundamental RocksDB design choice. The PR correctly fixes the assert crash, but the deeper data consistency issue (premature blob file obsolescence) is pre-existing and not fully addressed.
  • Mitigating factors: (1) The race requires a compaction to pick up a brand-new SST, complete entirely, AND fire its listener before the flush listener runs — extremely unlikely in production but reproducible under ASAN/UBSAN. (2) The GC must rewrite blobs so the compaction output references a different blob file. (3) The MarkUnreferencedBlobFilesObsoleteImpl iteration stops at the first non-obsoletable file, limiting exposure to only the oldest blob files.
  • Suggested fix: Consider adding a guard in LinkSstToBlobFileImpl that skips linking to an already-obsolete blob file, or add a check in DeleteObsoleteFiles to verify linked_sst_files_ is empty before deletion. Alternatively, ensure the flush_sequence_ check in MarkBlobFileObsoleteIfNeeded accounts for pending flushes.
H2. Race window narrowed but not eliminated — blob_db_listener.h:55
  • Issue: The listener reordering ensures ProcessFlushJobInfo runs before UpdateLiveSSTSize within the same listener callback, but the race is between TWO DIFFERENT listener callbacks (flush vs compaction) on different threads. The reordering does not prevent a compaction listener from running before the flush listener starts.
  • Root cause: Both listeners fire after DB mutex release. Thread scheduling determines which runs first.
  • Suggested fix: The best-effort unlink (the other part of the fix) is the correct approach for handling this race. The reordering is still valuable for the ProcessFlushJobInfoBeforeUpdateLiveSSTSize ordering guarantee but should not be relied upon to prevent the cross-listener race.

🟡 MEDIUM

M1. Test UnlinkSstBeforeFlushLink lacks explicit assertions on final state — blob_db_test.cc:1766
  • Issue: The test simulates the race by injecting a compaction via sync point callback, but only verifies the process doesn't crash. It doesn't verify the final blob file state (e.g., checking linked SST count, blob file obsolescence status).
  • Suggested fix: Add assertions after the flush completes to verify the blob file's linked_sst_files_ and obsolescence state are correct.
M2. LinkSstFile has no guard against linking to an obsolete blob file — blob_file.h:127
  • Issue: LinkSstToBlobFileImpl does not check blob_file->Obsolete() before linking. In the race scenario (H1), this allows an SST to be linked to a blob file that's already in the obsolete_files_ list.
  • Suggested fix: Add an obsolete check in LinkSstToBlobFileImpl and log a warning if the blob file is already obsolete.

🟢 LOW / NIT

L1. sync_point.h included in production header — blob_db_listener.h:11
  • Issue: test_util/sync_point.h is now included in a production header file. While the macros compile to nothing in release builds (NDEBUG), this is less common for header files than .cc files.
  • Mitigating factor: Other production files (including some headers) already include this header. This follows established patterns.
L2. Comment explaining race tolerance in UnlinkSstFileblob_file.h:133
  • Issue: The function's doc comment explains the return value but doesn't explain WHY a miss is expected (the race condition). Future maintainers may not understand why this is best-effort.
  • Suggested fix: Add a brief comment noting the race between flush/compaction listeners.
L3. Test setup duplication — blob_db_test.cc:1722+
  • Issue: Both new tests have identical BlobDBOptions/Options setup. Could extract a helper.
  • Mitigating factor: This is consistent with existing test style in the file.

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
GC disabled NO (uses BlobDBListener) N/A Safe
CompactFiles (manual) YES Same race possible Best-effort handles it
TTL blob files NO (separate obsolescence path) N/A Safe
DB::Open NO (uses NoLock variant) N/A Safe
ReadOnly DB NO (no writes) N/A Safe

Assumption Stress Test:

  • Claim: "Unlink miss is benign" — TRUE for the unlink itself (SST was never in the set). FALSE for the downstream effect: MarkUnreferencedBlobFilesObsolete can now prematurely obsolete the blob file.
  • Claim: "Reordering fixes the ordering" — TRUE within a single listener callback. FALSE across concurrent listener callbacks on different threads.
  • Claim: "Consistent with existing graceful handling" — TRUE. UnlinkSstFromBlobFile already handles missing blob files gracefully. Making UnlinkSstFile similarly graceful is architecturally consistent.

Positive Observations

  • The PR correctly identifies a real race condition that causes crashes under ASAN/UBSAN.
  • The best-effort pattern for UnlinkSstFile is consistent with the existing UnlinkSstFromBlobFile pattern for missing blob files.
  • The listener reordering is a good improvement that ensures GC bookkeeping runs before DB property queries, reducing the race window.
  • The two new tests are well-designed: one verifies ordering, one verifies graceful handling of the race.
  • The LinkSstFile assert does NOT have a symmetric race risk because SST file numbers are unique and atomically allocated.
  • Using gtest assertions in sync point callbacks is an established pattern in RocksDB's test suite (widely used in transaction tests and elsewhere).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@mszeszko-meta mszeszko-meta changed the title Fix a race in BlobFile::UnlinkSstFile Narrow a race in BlobFile::UnlinkSstFile May 11, 2026
@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Requested by @mszeszko-meta


Codex review failed before producing findings.

├── serialization-reviewer     (general-purpose agent)
│   └── test-reviewer              (general-purpose agent)
│
├── Phase 4: Debate (agents message each other)
│   ├── correctness ↔ invariant-adversary, serialization
│   ├── cross-component ↔ correctness, caller-audit
│   ├── invariant-adversary ↔ correctness, cross-component
│   ├── caller-audit ↔ invariant-adversary, cross-component
│   ├── performance ↔ api, cross-component
│   ├── api ↔ correctness, performance
│   ├── serialization ↔ correctness, invariant-adversary
│   ├── test-coverage ↔ serialization, caller-audit
│   └── design ↔ cross-component, invariant-adversary
│

Agent Communication Protocol

During Initial Review (Phase 3)

  • Each agent writes findings to its own file
  • Each agent sends a summary message to team lead when done
  • Agents do NOT communicate with each other yet

During Debate (Phase 4)

  • Team lead sends each agent a message with instructions to:
    1. Read the assigned agents' findings files
    2. Send critique messages to those agents
    3. Respond to incoming critiques
    4. Update their own findings file with revisions
  • Each critique message should include:
    • Which finding they're addressing (e.g., "Correctness F1")
    • Whether they AGREE, DISAGREE, or want to REFINE
    • Their reasoning with code evidence
    • Suggested severity adjustment (if any)

Message format example

To: correctness-reviewer
Re: Your Finding F1

AGREE/DISAGREE/REFINE - [reasoning with code evidence].
[Suggested severity adjustment if any.]

Review Anti-Patterns

These recurring failure modes lead to missed bugs. Each is detailed in the
referenced section; this table is a quick-reference checklist.

Anti-Pattern Fix Reference
Return-Value Tunnel Vision Trace callee MUTATIONS, not just returns Section 2c
Default-Configuration Bias Enumerate all execution contexts Section 2h
Assert-as-Proof Try to BREAK every assert Section 2i
Write-Path-Only Analysis Trace data readers, not just writers Section 2g
Confirmation-Seeking Research Use adversarial prompts ("find where X fails") Invariant Adversary agent
Data-Flow vs Control-Flow Confusion Separate who CALLS from who READS the data Section 2g

Pull Request Information

  • Title: Narrow a race in BlobFile::UnlinkSstFile
  • Author: mszeszko-meta
  • Changes: 4 files changed, 97 insertions(+), 10 deletions(-)

PR Description

Motivation

Make BlobFile::UnlinkSstFile best-effort to fix a race between flush listener notification and compaction scheduling. When NotifyOnFlushCompleted releases the DB mutex to call the listener, there is a window where a compaction can start on the flushed SST before ProcessFlushJobInfo links it to its blob file. Under ASAN/UBSAN slowdown this window widens enough for ProcessCompactionJobInfo to attempt unlinking an SST that was never linked, crashing on the assert. The fix makes UnlinkSstFile return a bool and log a warning on miss, consistent with UnlinkSstFromBlobFile which already handles missing blob files gracefully.

Test Plan

./blob_db_test --gtest_filter="*UnlinkSstBeforeFlushLink*" — deterministic repro: hooks the flush listener before ProcessFlushJobInfo and fires ProcessCompactionJobInfo on the unlinked SST; crashes without the fix, passes with it.


Diff to Review

diff --git a/utilities/blob_db/blob_db_impl.cc b/utilities/blob_db/blob_db_impl.cc
index 677fa60df..bd5d3b979 100644
--- a/utilities/blob_db/blob_db_impl.cc
+++ b/utilities/blob_db/blob_db_impl.cc
@@ -460,14 +460,22 @@ void BlobDBImpl::UnlinkSstFromBlobFile(uint64_t sst_file_number,
BlobFile* const blob_file = it->second.get();
assert(blob_file);

  • bool unlinked;
    {
    WriteLock file_lock(&blob_file->mutex_);
  • blob_file->UnlinkSstFile(sst_file_number);
  • unlinked = blob_file->UnlinkSstFile(sst_file_number);
    }
  • ROCKS_LOG_INFO(db_options_.info_log,
  •             "Blob file %" PRIu64 " unlinked from SST file %" PRIu64,
    
  •             blob_file_number, sst_file_number);
    
  • if (unlinked) {
  • ROCKS_LOG_INFO(db_options_.info_log,
  •               "Blob file %" PRIu64 " unlinked from SST file %" PRIu64,
    
  •               blob_file_number, sst_file_number);
    
  • } else {
  • ROCKS_LOG_WARN(db_options_.info_log,
  •               "SST file %" PRIu64 " not found in blob file %" PRIu64
    
  •               " while trying to unlink",
    
  •               sst_file_number, blob_file_number);
    
  • }
    }

void BlobDBImpl::InitializeBlobFileToSstMapping(
@@ -1184,6 +1192,7 @@ void BlobDBImpl::GetCompactionContext(BlobCompactionContext* context,
}

void BlobDBImpl::UpdateLiveSSTSize() {

  • TEST_SYNC_POINT("BlobDBImpl::UpdateLiveSSTSize:Begin");
    uint64_t live_sst_size = 0;
    bool ok = GetIntProperty(DB::Properties::kLiveSstFilesSize, &live_sst_size);
    if (ok) {
    diff --git a/utilities/blob_db/blob_db_listener.h b/utilities/blob_db/blob_db_listener.h
    index 822f71363..c9eeb59bd 100644
    --- a/utilities/blob_db/blob_db_listener.h
    +++ b/utilities/blob_db/blob_db_listener.h
    @@ -8,6 +8,7 @@
    #include

#include "rocksdb/listener.h"
+#include "test_util/sync_point.h"
#include "util/mutexlock.h"
#include "utilities/blob_db/blob_db_impl.h"

@@ -51,17 +52,23 @@ class BlobDBListenerGC : public BlobDBListener {
const char* Name() const override { return kClassName(); }
static const char* kClassName() { return "BlobDBListenerGC"; }
void OnFlushCompleted(DB* db, const FlushJobInfo& info) override {

  • BlobDBListener::OnFlushCompleted(db, info);
  • assert(blob_db_impl_);
  • TEST_SYNC_POINT_CALLBACK("BlobDBListenerGC::OnFlushCompleted:BeforeProcess",
  •                         const_cast<FlushJobInfo*>(&info));
    
  • // Keep GC bookkeeping ahead of any DB property queries in the base
  • // listener; later compactions can rely on these mappings.
    blob_db_impl_->ProcessFlushJobInfo(info);
  • BlobDBListener::OnFlushCompleted(db, info);
    }

void OnCompactionCompleted(DB* db, const CompactionJobInfo& info) override {

  • BlobDBListener::OnCompactionCompleted(db, info);
  • assert(blob_db_impl_);
  • // Keep GC bookkeeping ahead of any DB property queries in the base
  • // listener; later compactions can rely on these mappings.
    blob_db_impl_->ProcessCompactionJobInfo(info);
  • BlobDBListener::OnCompactionCompleted(db, info);
    }
    };

diff --git a/utilities/blob_db/blob_db_test.cc b/utilities/blob_db/blob_db_test.cc
index 5d3674f09..4538dc735 100644
--- a/utilities/blob_db/blob_db_test.cc
+++ b/utilities/blob_db/blob_db_test.cc
@@ -6,6 +6,7 @@
#include "utilities/blob_db/blob_db.h"

#include
+#include
#include
#include
#include
@@ -1719,6 +1720,72 @@ TEST_F(BlobDBTest, MaintainBlobFileToSstMapping) {
}
}

+TEST_F(BlobDBTest, ProcessFlushJobInfoBeforeUpdateLiveSSTSize) {

  • BlobDBOptions bdb_options;
  • bdb_options.enable_garbage_collection = true;
  • bdb_options.disable_background_tasks = true;
  • Options options;
  • options.disable_auto_compactions = true;
  • Open(bdb_options, options);
  • ASSERT_OK(Put("key", "value"));
  • std::atomic checked{false};
  • SyncPoint::GetInstance()->SetCallBack(
  •  "BlobDBImpl::UpdateLiveSSTSize:Begin", [&](void* /* arg */) {
    
  •    if (checked.exchange(true)) {
    
  •      return;
    
  •    }
    
  •    auto blob_files = blob_db_impl()->TEST_GetBlobFiles();
    
  •    EXPECT_EQ(blob_files.size(), 1);
    
  •    if (blob_files.size() == 1) {
    
  •      EXPECT_EQ(blob_files[0]->GetLinkedSstFiles().size(), 1);
    
  •    }
    
  •  });
    
  • SyncPoint::GetInstance()->EnableProcessing();
  • ASSERT_OK(blob_db_->Flush(FlushOptions()));
  • SyncPoint::GetInstance()->DisableProcessing();
  • ASSERT_TRUE(checked.load());
    +}

+TEST_F(BlobDBTest, UnlinkSstBeforeFlushLink) {

  • BlobDBOptions bdb_options;
  • bdb_options.enable_garbage_collection = true;
  • bdb_options.disable_background_tasks = true;
  • Options options;
  • options.disable_auto_compactions = true;
  • Open(bdb_options, options);
  • ASSERT_OK(Put("key", "value"));
  • SyncPoint::GetInstance()->SetCallBack(
  •  "BlobDBListenerGC::OnFlushCompleted:BeforeProcess", [&](void* arg) {
    
  •    auto* info = static_cast<FlushJobInfo*>(arg);
    
  •    if (info->oldest_blob_file_number == kInvalidBlobFileNumber) {
    
  •      return;
    
  •    }
    
  •    // Simulate a compaction that consumed this SST before the flush
    
  •    // listener had a chance to link it. This is the race window that
    
  •    // exists between version install and listener notification.
    
  •    CompactionJobInfo cinfo{};
    
  •    cinfo.input_file_infos.emplace_back(CompactionFileInfo{
    
  •        0, info->file_number, info->oldest_blob_file_number});
    
  •    cinfo.output_file_infos.emplace_back(CompactionFileInfo{
    
  •        1, info->file_number + 1, info->oldest_blob_file_number});
    
  •    blob_db_impl()->TEST_ProcessCompactionJobInfo(cinfo);
    
  •  });
    
  • SyncPoint::GetInstance()->EnableProcessing();
  • ASSERT_OK(blob_db_->Flush(FlushOptions()));
  • SyncPoint::GetInstance()->DisableProcessing();
    +}

TEST_F(BlobDBTest, ShutdownWait) {
BlobDBOptions bdb_options;
bdb_options.ttl_range_secs = 100;
diff --git a/utilities/blob_db/blob_file.h b/utilities/blob_db/blob_file.h
index 4110234d0..63f9e5a0f 100644
--- a/utilities/blob_db/blob_file.h
+++ b/utilities/blob_db/blob_file.h
@@ -130,10 +130,14 @@ class BlobFile {
}

// Unlink an SST file whose oldest blob file reference points to this file.

  • void UnlinkSstFile(uint64_t sst_file_number) {
  • // Returns true if the SST was found and unlinked, false otherwise.
  • bool UnlinkSstFile(uint64_t sst_file_number) {
    auto it = linked_sst_files_.find(sst_file_number);
  • assert(it != linked_sst_files_.end());
  • if (it == linked_sst_files_.end()) {
  •  return false;
    
  • }
    linked_sst_files_.erase(it);
  • return true;
    }

// the following functions are atomic, and don't need

2026-05-13T00:47:27.585185Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
2026-05-13T00:47:27.814683Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
2026-05-13T00:47:28.245986Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 2/5
2026-05-13T00:47:28.914217Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 3/5
2026-05-13T00:47:29.921329Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 4/5
2026-05-13T00:47:31.782629Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 5/5
2026-05-13T00:47:35.140221Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 1/5
ERROR: Reconnecting... 2/5
ERROR: Reconnecting... 3/5
ERROR: Reconnecting... 4/5
ERROR: Reconnecting... 5/5
ERROR: unexpected status 401 Unauthorized: Missing bearer or basic authentication in header, url: https://api.openai.com/v1/responses, cf-ray: 9fada73d6848317e-IAD, request id: req_77560145bede45b3808f8c65098d4acd
ERROR: unexpected status 401 Unauthorized: Missing bearer or basic authentication in header, url: https://api.openai.com/v1/responses, cf-ray: 9fada73d6848317e-IAD, request id: req_77560145bede45b3808f8c65098d4acd


---

<details>
<summary>ℹ️ About this response</summary>

Generated by Codex CLI.
Review methodology: `claude_md/code_review.md`

**Limitations:**
- Codex may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions

**Commands:**
- `/codex-review [context]` — Request a code review
- `/codex-query <question>` — Ask about the PR or codebase
</details>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant