Skip to content

Add iterator key/value pinning APIs#14761

Open
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:kv_pinning
Open

Add iterator key/value pinning APIs#14761
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:kv_pinning

Conversation

@xingbowang
Copy link
Copy Markdown
Contributor

Summary

  • Add PinnableKeyValue, PinnableKeyValueBatch, Iterator::PinCurrent(), and Iterator::AppendPinnedCurrent() so callers can keep the current key/value valid after iterator movement.
  • Plumb an internal PinCurrentKeyValue() hook through DB, forward, merging, level, and wrapper iterators to block-based table iterators.
  • Retain block-cache entries for pinned data-block values, track pinned cache charge and copied bytes, dedupe batch leases for rows from the same data block, and fall back to copying when pinning is unavailable.
  • Document and enforce the initial support boundary: forward iteration over plain kTypeValue rows only; reverse iteration, merge results, timestamps, blobs, wide-column entities, timed puts, and other value types return NotSupported.
  • Fix PinnableSlice move construction so moved pinned slices can be reset and reused safely.

Test Plan

  • git diff --check upstream/main...kv_pinning
  • timeout 60s ./db_test2 --gtest_filter='DBTest2.IteratorPinCurrentDeltaEncodedKeyAndPinnedValue:DBTest2.IteratorPinAPIsPinNonDeltaEncodedKeyAndValue:DBTest2.IteratorPinCurrentFallsBackToCopyWithoutBlockCache:DBTest2.IteratorAppendPinnedCurrentDedupesSameBlockLease:DBTest2.IteratorAppendPinnedCurrentTracksDistinctBlockUsage:DBTest2.IteratorAppendPinnedCurrentFallsBackToCopyWithoutBlockCache:DBTest2.IteratorPinCurrentClearsOutputOnReverseIterationNotSupported:DBTest2.IteratorPinAPIsHandleReverseIterationSafely:DBTest2.IteratorAppendPinnedCurrentResetReleasesAndReusesBatch:DBTest2.IteratorPinAPIsRejectNullOutputPointers:DBTest2.IteratorPinAPIsRejectBlobRowsAfterPrepareValue:DBTest2.IteratorPinAPIsRejectWideColumnEntityRows:DBTest2.IteratorPinAPIsRejectUserDefinedTimestampRows'
  • timeout 60s ./db_iter_test --gtest_filter='DBIteratorTest.DBIteratorPinAPIsRejectTimedPutRows'
  • timeout 60s ./slice_test --gtest_filter='PinnableSliceTest.Move'

@meta-cla meta-cla Bot added the CLA Signed label May 19, 2026
@github-actions
Copy link
Copy Markdown

⚠️ clang-tidy: 1 warning(s) on changed lines

Completed in 593.0s.

Summary by check

Check Count
performance-noexcept-move-constructor 1
Total 1

Details

util/slice.cc (1 warning(s))
util/slice.cc:325:16: warning: move constructors should be marked noexcept [performance-noexcept-move-constructor]

@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit f115033


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit f115033


Summary

Well-structured PR that adds iterator key/value pinning APIs with correct cache reference counting, proper fallback-to-copy semantics, and good test coverage for the supported use cases. The PinnableSlice move constructor fix is necessary and correct. The main gaps are in missing test coverage for some edge cases and a potential issue with PinCurrent resetting caller-owned output on NotSupported.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. PinCurrent resets output before checking support -- db/db_iter.cc:244
  • Issue: PinCurrent() calls out->Reset() at the top of the method, then calls PrepareForPinCurrent() which may return NotSupported. This means if a caller has a populated PinnableKeyValue and calls PinCurrent on an unsupported row (e.g., reverse iteration), their previously-stored data is lost even though the operation fails.
  • Root cause: The reset happens unconditionally before validation.
  • Suggested fix: Move out->Reset() to after PrepareForPinCurrent() succeeds, or document that PinCurrent always clears output regardless of success/failure. The current test IteratorPinCurrentClearsOutputOnReverseIterationNotSupported explicitly tests this behavior, suggesting it is intentional. If intentional, add a doc comment to the API noting this contract.
M2. AppendPinnedCurrent does NOT reset batch on NotSupported (asymmetry) -- db/db_iter.cc:281
  • Issue: Unlike PinCurrent, AppendPinnedCurrent does not reset the batch on NotSupported. This is the correct append semantic (don't destroy existing entries), but the asymmetry with PinCurrent (which does reset) could confuse callers.
  • Root cause: Design choice, but not documented.
  • Suggested fix: Add explicit documentation to both APIs explaining the reset-on-error semantics. The tests cover this (IteratorPinAPIsHandleReverseIterationSafely verifies batch is unchanged), so the behavior is intentional.
M3. Missing #include for <unordered_set> in public header -- include/rocksdb/iterator.h
  • Issue: The public header iterator.h now includes <unordered_set>, <utility>, and <vector>. These are heavyweight standard library headers that will increase compile time for every translation unit that includes iterator.h (which is virtually every RocksDB user).
  • Root cause: PinnableKeyValueBatch uses std::unordered_set<const void*> internally for dedupe.
  • Suggested fix: Consider using the PIMPL idiom or moving PinnableKeyValueBatch internals to a separate header, or at minimum moving cleanup_dedupe_tokens_ to a forward-declared implementation type. Alternatively, since this is a private member, a sorted std::vector<const void*> (with std::lower_bound) would be smaller and avoid the <unordered_set> include for typical small batch sizes.
M4. HistoryTrimmingIterator delegates without validity check -- db/history_trimming_iterator.h:85
  • Issue: HistoryTrimmingIterator::PinCurrentKeyValue() delegates directly to input_->PinCurrentKeyValue(out) without checking Valid() or asserting validity first, unlike other wrapper iterators (ClippingIterator, BlobCountingIterator, LevelIterator) which check validity before delegating.
  • Root cause: Inconsistent pattern across wrapper iterators.
  • Suggested fix: Add assert(Valid()) for consistency with other wrappers, or rely on the underlying iterator's own validity check. Since callers (DBIter) already check validity, this is not a correctness bug, but an inconsistency.
M5. PinnableKeyValue and PinnableKeyValueBatch defined in public header but have friend class DBIter -- include/rocksdb/iterator.h
  • Issue: PinnableKeyValue grants friend access to DBIter, tying the public API to an internal implementation class. This makes it harder for third-party iterator implementations to populate PinnableKeyValue.
  • Root cause: The mutating methods (PinKey, CopyKey, PinValue, CopyValue, SetCleanup, SetPinnedBlockCacheUsage) are private.
  • Suggested fix: Consider making these methods protected or providing a builder/factory pattern, so that custom Iterator subclasses can also implement PinCurrent. Alternatively, document that only RocksDB's built-in iterators support this API.

🟢 LOW / NIT

L1. PinnedIterKeyValue::Reset() lifetime model undocumented -- table/internal_iterator.h
  • Issue: PinnedIterKeyValue::Reset() calls cleanup.Reset() which correctly decrements the shared reference. The Slice members use .clear() which just zeroes the pointer/size. This is correct since data lifetime is managed by the cleanup, but a comment would help future maintainers.
  • Suggested fix: Add a brief comment to PinnedIterKeyValue::Reset().
L2. No db_bench support -- PR description
  • Issue: Per CLAUDE.md guidelines, performance-related features should be supported in db_bench. No db_bench flags are added.
  • Suggested fix: Consider adding a db_bench flag for future performance regression tracking.
L3. Test helper generates zero-padded keys for ordering -- db/db_test2.cc
  • Issue: Minor nit on PinnedKVTestKey("prefix" + std::to_string(100000 + index)) -- this works correctly for lexicographic ordering. No action needed.

Cross-Component Analysis

Context Executes? Safe? Notes
WritePreparedTxnDB YES YES PinCurrent only reads already-validated position
ReadOnly DB YES YES Same iterator stack
User-defined timestamps Rejected N/A timestamp_size_ != 0 check
BlobDB Rejected N/A ikey_.type != kTypeValue check
Wide-column entities Rejected N/A Type check rejects
Merge results Rejected N/A current_entry_is_merged_ check
Memory-mapped files Fallback to copy YES No cache handle -> copies
Prefix seek YES YES No prefix-specific behavior

Cache::Ref() correctness: Cache::Ref() correctly bumps reference count independently of pinned_iters_mgr_. The cleanup callback calls Cache::Release(). This is the same pattern as PinnableSlice in Get(). No double-release risk.

Transaction visibility: PrepareForPinCurrent() only succeeds on already-visible keys (DBIter has applied read_callback_ checks). No bypass possible.

Positive Observations

  • Clean layering of internal PinnedIterKeyValue vs public PinnableKeyValue
  • Correct fallback to copy when pinning unavailable
  • Clever same-block deduplication via cache handle pointer identity
  • Comprehensive test coverage across all rejection cases
  • Necessary and correct PinnableSlice move constructor fix

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant