Fix table cache entry leak in VersionBuilder::UnrefFile on failed LogAndApply (#14720)#14720
Fix table cache entry leak in VersionBuilder::UnrefFile on failed LogAndApply (#14720)#14720mszeszko-meta wants to merge 1 commit into
Conversation
|
@mszeszko-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99759696. |
✅ clang-tidy: No findings on changed linesCompleted in 72.4s. |
b0806be to
32a8813
Compare
…AndApply (facebook#14720) Summary: ### Context `VersionSet::LogAndApply` loads table handlers before the `MANIFEST` update is durable. If that `MANIFEST` update later fails, newly added files can be discarded by `VersionBuilder` without ever being installed in a `Version`. `VersionBuilder::UnrefFile` released the `FileMetaData` `pinned_reader` handle, but releasing that handle only dropped the reference; it did not erase the cache key, so an orphaned table-cache entry could survive for a file that is neither live nor quarantined. When metadata read fault injection prevents the obsolete-file scan from cleaning up the orphan, DB close can trip `TEST_VerifyNoObsoleteFilesCached` with File N is not live nor quarantined. ### Fix After releasing a pinned_reader for a `FileMetaData` whose refcount reaches zero, explicitly evict that file number from the table cache. The new `VersionBuilderDBTest.UnrefFileEvictsPinnedTableReader` regression test exercises this directly by loading a real SST through `VersionBuilder::LoadTableHandlers`, destroying the builder, and asserting that the table cache no longer contains the file. With the eviction removed, the test fails deterministically with `leaked_cache_entry`=`true`. Differential Revision: D99759696
32a8813 to
d089ff2
Compare
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit d089ff2 SummaryClean, well-targeted bug fix that prevents table cache entry leaks when a VersionBuilder is destroyed without installing files into a Version (e.g., failed LogAndApply). The fix is safe because refs=0 guarantees the file was never installed, making cache eviction always correct. The regression test is thorough and deterministic. High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🔴 HIGHNone. 🟡 MEDIUMNone. 🟢 LOW / NITL1. Narrow race window between Release() and Evict() —
|
| Context | Executes? | Safe? | Reasoning |
|---|---|---|---|
| Normal SaveTo() path | No (refs > 0) | N/A | Eviction not reached |
| Failed LogAndApply | Yes (refs = 0) | Yes | File never installed in Version |
| CreateOrReplaceSavePoint | Eventual (last copy) | Yes | Correct ref counting prevents premature eviction |
| table_cache_ = nullptr | Guarded | Yes | Null check prevents crash |
| Concurrent LogAndApply | No collision | Yes | File numbers are unique (atomic counter) |
| WritePreparedTxnDB | Standard path | Yes | No special VersionBuilder usage |
| Secondary/ReadOnly DB | Standard path | Yes | Same VersionBuilder patterns |
Assumption stress-test results:
- "refs=0 means never installed" — HOLDS: SaveTo() → AddFile() increments refs; if SaveTo fails, files were never installed. Version destruction in LogAndApply failure happens while VersionBuilder still holds refs.
- "Eviction won't remove needed entries" — HOLDS: File numbers are unique (atomic allocation), and refs=0 files are by definition not referenced by any live Version.
- "Cache::Erase is safe with active handles" — HOLDS: LRU cache Erase removes from hash table but preserves entries for existing handle holders.
Positive Observations
- Minimal, surgical fix: Only 9 lines of production code, directly addressing the root cause.
- Correct operation ordering: Release pinned handle before evicting from cache prevents use-after-free.
- Follows established patterns:
TableCache::Evict()is used in similar cleanup scenarios elsewhere (db_impl_files.cc). - Thorough regression test: Creates real SST, exercises actual cache pinning, and verifies eviction deterministically. Proper cleanup on failure path (manual eviction + file deletion before assertion).
- Good defensive null check:
if (table_cache_ != nullptr)is more defensive than the existing assert-only pattern.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…AndApply (facebook#14720) Summary: ### Context `VersionSet::LogAndApply` loads table handlers before the `MANIFEST` update is durable. If that `MANIFEST` update later fails, newly added files can be discarded by `VersionBuilder` without ever being installed in a `Version`. `VersionBuilder::UnrefFile` released the `FileMetaData` `pinned_reader` handle, but releasing that handle only dropped the reference; it did not erase the cache key, so an orphaned table-cache entry could survive for a file that is neither live nor quarantined. When metadata read fault injection prevents the obsolete-file scan from cleaning up the orphan, DB close can trip `TEST_VerifyNoObsoleteFilesCached` with File N is not live nor quarantined. ### Fix After releasing a pinned_reader for a `FileMetaData` whose refcount reaches zero, explicitly evict that file number from the table cache. The new `VersionBuilderDBTest.FailedLogAndApplyEvictsTableCacheEntry` regression test exercises the realistic production path: it creates a real SST, calls `LogAndApply` through the DB's own `VersionSet`, injects a MANIFEST sync failure via the `AfterSyncManifest` sync point, and asserts that the table cache no longer contains the file after `ProcessManifestWrites` destroys the `VersionBuilder`. With the eviction removed, the test fails deterministically with `leaked_cache_entry=true`. Differential Revision: D99759696
d089ff2 to
3314ad1
Compare
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 3314ad1 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 3314ad1 SummarySolid, low-risk fix for a real table cache entry leak on failed High-severity findings (0): Full review (click to expand)Findings🔴 HIGHNone. 🟡 MEDIUMM1. Unconditional eviction is broader than necessary —
|
| Concern | Verdict | Reasoning |
|---|---|---|
| Copy constructor double-evict | Safe | Rep(const Rep&) calls RefFile (bumping refs to 2). When the first copy is destroyed, refs goes 2→1 (no eviction). When the second copy is destroyed, refs goes 1→0 (eviction fires correctly — no Rep holds the file). |
| Base version cache entry conflict | Safe | File numbers are strictly monotonic (NewFileNumber uses fetch_add(1) in version_set.h). A newly added file will always have a unique number. ApplyFileAddition rejects duplicate file numbers. The VersionBuilder's FileMetaData is separate from the base version's. |
| ApplyFileDeletion premature eviction | Safe (no-op) | For files added then deleted in the same builder, if LoadTableHandlers was never called for them (no pinned reader), Cache::Erase is a no-op. If it was called, the file was only referenced by this builder, so eviction is correct. |
| Recovery path (VersionEditHandler) | Safe | During recovery, UnrefFile in the destructor won't drop refs to 0 for files installed in the Version (refs > 1). For files that are added then deleted during replay, eviction of orphaned cache entries is correct. |
| ReadOnly / SecondaryInstance | N/A | These contexts don't trigger the write path that creates VersionBuilders for LogAndApply. |
| WritePreparedTxn / WriteUnpreparedTxn | Safe | Same VersionBuilder lifecycle; no additional interaction. |
| Performance: unnecessary Erase calls | Negligible | Cache::Erase for a non-existent key is O(1) hash lookup + shard lock. This runs only when refs reaches 0, which is rare in normal operation (files are installed in Versions). |
Positive Observations
- The fix correctly addresses a real resource leak that could cause
TEST_VerifyNoObsoleteFilesCachedassertions during DB close under fault injection. - The
table_cache_ != nullptrguard correctly handles the many existing unit tests that passnullptrfor the table cache. - The test uses sync points rather than sleep, avoiding flakiness.
- The test follows established patterns from similar tests (e.g.,
LeakedTableCacheEntryOnFlushInstallFailureindb_flush_test.cc). Cache::Eraseis idempotent and safe for concurrent access, so even ifPurgeObsoleteFileslater tries to erase the same entry, there's no issue.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Summary:
Context
VersionSet::LogAndApplyloads table handlers before theMANIFESTupdate is durable. If thatMANIFESTupdate later fails, newly added files can be discarded byVersionBuilderwithout ever being installed in aVersion.VersionBuilder::UnrefFilereleased theFileMetaDatapinned_readerhandle, but releasing that handle only dropped the reference; it did not erase the cache key, so an orphaned table-cache entry could survive for a file that is neither live nor quarantined. When metadata read fault injection prevents the obsolete-file scan from cleaning up the orphan, DB close can tripTEST_VerifyNoObsoleteFilesCachedwith File N is not live nor quarantined.Fix
After releasing a pinned_reader for a
FileMetaDatawhose refcount reaches zero, explicitly evict that file number from the table cache. The newVersionBuilderDBTest.FailedLogAndApplyEvictsTableCacheEntryregression test exercises the realistic production path: it creates a real SST, callsLogAndApplythrough the DB's ownVersionSet, injects a MANIFEST sync failure via theAfterSyncManifestsync point, and asserts that the table cache no longer contains the file afterProcessManifestWritesdestroys theVersionBuilder. With the eviction removed, the test fails deterministically withleaked_cache_entry=true.Differential Revision: D99759696