Add use_direct_io_for_compaction_reads option by zaidoon1 · Pull Request #14743 · facebook/rocksdb

zaidoon1 · 2026-05-14T21:53:26Z

Adds a new DBOption use_direct_io_for_compaction_reads (default false). When on, compaction-input SST files are opened with O_DIRECT so the sequential read-once data from compaction doesn't pollute the OS page cache and evict the hot user-read working set. User reads keep going through the buffered fast path. This protects user-read tail latency on write-heavy workloads without forcing user reads onto the existing global use_direct_reads knob (which pays in throughput and P50 — see the bench below).

The interesting bit is that just flipping the FileOptions returned by FileSystem::OptimizeForCompactionTableRead doesn't actually trigger O_DIRECT at the kernel level. The TableCache (and FileMetaData::pinned_reader) is already holding buffered handles opened at flush time or at DB::Open via LoadTableHandlers. When compaction asks for an iterator, it gets back the cached buffered handle and the kernel never sees the O_DIRECT flag.

So this PR also adds a small bypass path:

TableCache::FindTable / NewIterator learn a open_ephemeral_table_reader mode. When set, the pinned-reader fast path and the shared cache are skipped, GetTableReader is called directly with the caller's FileOptions, and ownership of the freshly opened TableReader is handed back via a unique_ptr. The iterator takes ownership via RegisterCleanup and frees the reader on destruction.
VersionSet::MakeInputIterator and LevelIterator plumb the flag through both L0 and L1+ compaction-input paths.
CompactionJob::ProcessKeyValueCompaction turns the bypass on when use_direct_io_for_compaction_reads is set, the global use_direct_reads is off, and OptimizeForCompactionTableRead produced use_direct_reads=true in the compaction-read FileOptions.

The option is opt-in: when off, nothing changes for existing users. When on, only the compaction-input opens take the bypass path; user reads keep hitting the TableCache and the buffered fast path normally.

There's also a small db_bench helper in the same PR: a new --bgwriter_num flag that lets the writer thread in readwhilewriting (and the other "while writing" variants) spread its puts across [0, bgwriter_num) instead of [0, num). Without this the readers and writer share a key range and you can't have both a hot read subset and meaningful compaction work — this lets you have both.

Benchmark

Setup: Ubuntu 24.04 (kernel 7.0.5, OrbStack Linux VM on Apple Silicon), 14 vCPUs, virtio-blk disk, ext4. MGLRU disabled (echo 0 > /sys/kernel/mm/lru_gen/enabled) so the kernel uses the classic active/inactive LRU. 14 GB DB (3.5M keys × 4 KB values), no compression. Each measurement run is pinned to a 1 GB cgroup via systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0. Page cache is dropped between configs. db_bench is Release build.

Workload: readwhilewriting for 120s. 4 reader threads doing random reads over a hot key subset, plus 1 writer thread spreading overwrites across the full 3.5M-key keyspace (via --bgwriter_num=3500000) throttled at 200 MB/s, so there's continuous compaction running while the readers go.

The size of the hot reader subset relative to available page cache controls how visible the optimization is. The Cassandra blog (Lightfoot 2026) documented the same thing: biggest wins when the hot set is big enough to actually compete for cache, smaller wins when the hot set trivially fits, neutral when the hot set is way bigger than cache. So I ran two hot-set sizes.

Small hot set: ~30 MB (~3% of the 1 GB cgroup) — N=5 iterations, mean (CV)

--num=7500. The hot set is small enough that the page cache holds it without much trouble even under compaction, so the wins here are real but on the modest side.

Config	Throughput (ops/s)	Read P50 (µs)	Read P99 (µs)	Read P99.9 (µs)	Read P99.99 (µs)
buffered (default)	233,477 (8.2%)	16.09	82.24	721.0	2,102.5
direct_compaction_writes_only (existing knob alone)	287,405 (2.8%) — +23.1%	13.00 (−19.2%)	66.77 (−18.8%)	553.9 (−23.2%)	1,787.6 (−15.0%)
direct_compaction_read_only (new knob alone)	250,669 (2.4%) — +7.4%	14.16 (−12.0%)	102.99 (+25.2%)	689.8 (−4.3%)	1,801.3 (−14.3%)
direct_compaction_read_write (new + existing, recommended)	277,920 (3.3%) — +19.0%	12.99 (−19.3%)	84.23 (+2.4%)	613.4 (−14.9%)	1,738.2 (−17.3%)
use_direct_reads=true (existing global) + write-side	249,014 (2.5%) — +6.7%	15.95 (−0.9%)	68.78 (−16.4%)	450.8 (−37.5%)	1,814.5 (−13.7%)

CV is 2.4–3.3% on the optimized configs (8.2% on buffered), so the deltas are real. With a hot set this small, the existing use_direct_io_for_flush_and_compaction knob is already doing most of the work — the new flag's main extra contribution here is P99.99 (combined wins it by ~2 points vs writes-only-alone). Worth noting: the new flag alone (without the existing write-side flag) improves P99.99 but regresses P99 by 25% on this small-hot-set workload, because direct compaction reads lose kernel readahead and compaction-output writes are still hitting the page cache. That regression goes away once you combine with the existing write-side flag, or once the hot set is bigger (see next table). So if you're using just one knob, use the existing one. If you're using this PR's flag, pair it with use_direct_io_for_flush_and_compaction=true.

Larger hot set: ~400 MB (~40% of cache) — N=5 iterations, mean (CV)

--num=100000. This is the case the Cassandra blog calls out — hot set big enough to actually fight compaction for cache. Their analogous setup (1M hot partitions, ~33% hot/cache) reported 1.93× p99 improvement. Numbers here are the headline:

Config	Throughput (ops/s)	Read P50 (µs)	Read P99 (µs)	Read P99.9 (µs)	Read P99.99 (µs)
buffered (default)	68,959 (7.7%)	44.81	541.22	2,225.2	11,334.5
direct_compaction_writes_only (existing knob alone)	73,973 (10.3%) — +7.3%	42.22 (−5.8%)	456.27 (−15.7%)	2,016.9 (−9.4%)	9,190.0 (−18.9%)
direct_compaction_read_only (new knob alone)	84,337 (2.3%) — +22.3%	38.66 (−13.7%)	386.97 (−28.5%)	1,644.8 (−26.1%)	4,837.9 (−57.3%, 2.34×)
direct_compaction_read_write (new + existing, recommended)	104,923 (8.4%) — +52.2%	34.26 (−23.5%)	290.97 (−46.2%)	1,143.4 (−48.6%)	3,080.3 (−72.8%, 3.68×)
use_direct_reads=true (existing global) + write-side	71,598 (9.1%) — +3.8%	51.33 (+14.5%)	297.91 (−45.0%)	1,663.6 (−25.2%)	6,530.0 (−42.4%)

Combined config gets a 3.68× p99.99 win, 1.86× p99, p50 down 23%, throughput up 52%. Same shape as the Cassandra blog's 1.93× p99 result — the improvement just lands at deeper percentiles for us because RocksDB's baseline data path is roughly 40× faster than Cassandra's (their buffered p99 was 35 ms, ours is 0.54 ms), so the cache-miss tail is further out.

A few things worth calling out from this table:

The new flag is doing real work on top of the existing write-side flag here, not just shifting things around. Combined throughput is +42% over direct_compaction_writes_only alone, and combined p99.99 is 3× better. The existing knob alone gives a fairly modest +7% throughput / -19% p99.99 in this case — there's a clear gap that the new flag fills.
The new flag alone (no existing write-side flag) is also a real improvement here: +22% throughput, p99.99 down 57%. The P99 regression we saw in the small-hot-set case is gone, because the cache-protection effect now dominates the lost-readahead cost.
use_direct_reads=true (the existing global flag) actually regresses P50 by 14.5% in this workload — taking user reads off the page cache hurts you when the hot data could have been cached. It also gets the worst throughput of any direct config. It's not an equivalent way to get these gains.

`compaction_readahead_size` matters when this flag is on

Direct I/O bypasses kernel readahead, so RocksDB's own DBOptions::compaction_readahead_size becomes the only prefetch the iterator has. The default of 2 MB is enough and real users will get it automatically. But db_bench's --compaction_readahead_size CLI default is 0, which defeats prefetch and makes direct compaction look slower than it actually is. If you're reproducing the numbers above, pass --compaction_readahead_size=2097152 (or larger).

Summary

Recommended production config is use_direct_io_for_compaction_reads=true + use_direct_io_for_flush_and_compaction=true. Strongest configuration at every percentile and throughput in both benches.
The new flag is the read-side counterpart to use_direct_io_for_flush_and_compaction, which handles compaction-write cache pollution. They address different sources of pollution and compose. The gap between "combined" and "writes-only-alone" is 17 percentage points on p99.99 in the small-hot-set bench and 54 points in the larger one, so the new flag is contributing real value, especially as the hot set grows.
The new flag alone is also a real improvement when the hot set is big enough to compete with cache (+22% throughput, 2.34× p99.99 in the larger-hot-set bench). On a very small hot set it improves p99.99 but regresses p99, so pairing with the existing write-side flag is safer.
The benefit is workload-dependent. Small hot sets get modest tail-latency wins. Hot sets sized to actually compete for cache get the big multi-percentile wins shown above. Hot sets bigger than cache (not benched here but covered in the Cassandra blog) see no change either way — every read misses regardless.

Reproducing

Any Linux host (or a Linux VM on macOS via OrbStack / Multipass / lima):

sudo apt-get install -y build-essential clang cmake git pkg-config \
  libgflags-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev libzstd-dev

cmake -DCMAKE_BUILD_TYPE=Release -DPORTABLE=1 -DWITH_GFLAGS=1 -DWITH_TESTS=0 ..
make -j db_bench

echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled

Build the source DB once, unrestricted memory:

./db_bench --benchmarks=fillrandom,compact,waitforcompaction,stats \
  --db=/path/to/source_db --num=3500000 --key_size=16 --value_size=4096 \
  --write_buffer_size=16777216 --target_file_size_base=16777216 \
  --max_background_jobs=4 --compression_type=none --cache_size=4194304 \
  --max_bytes_for_level_base=67108864 --disable_wal=1 --sync=0

For each config, copy source_db -> scratch_db, run sync && echo 3 > /proc/sys/vm/drop_caches, then:

sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \
  ./db_bench --use_existing_db=1 \
    --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \
    --threads=5 --duration=120 --statistics=true --histogram=1 \
    --num=7500 --bgwriter_num=3500000 \
    --key_size=16 --value_size=4096 \
    --write_buffer_size=16777216 --target_file_size_base=16777216 \
    --max_background_jobs=4 --compression_type=none \
    --cache_size=4194304 --open_files=200 \
    --skip_stats_update_on_db_open=true \
    --max_bytes_for_level_base=67108864 \
    --benchmark_write_rate_limit=209715200 \
    --compaction_readahead_size=2097152 \
    --rate_limiter_bytes_per_sec=0 \
    --use_direct_reads={true|false} \
    --use_direct_io_for_compaction_reads={true|false} \
    --use_direct_io_for_flush_and_compaction={true|false}

For the larger hot-set table, change --num=7500 to --num=100000.

The five configs in the tables:

buffered: all three flags false.
direct_compaction_writes_only: use_direct_io_for_flush_and_compaction=true, the other two false. This is what users have today without this PR.
direct_compaction_read_only: use_direct_io_for_compaction_reads=true, the other two false.
direct_compaction_read_write: use_direct_io_for_compaction_reads=true, use_direct_io_for_flush_and_compaction=true, use_direct_reads=false. Recommended.
direct_all: use_direct_reads=true, use_direct_io_for_flush_and_compaction=true, use_direct_io_for_compaction_reads=false.

github-actions · 2026-05-14T22:08:19Z

⚠️ clang-tidy: 4 warning(s) on changed lines

Completed in 801.2s.

Summary by check

Check	Count
`bugprone-unused-return-value`	1
`readability-braces-around-statements`	2
`readability-isolate-declaration`	1
Total	4

Details

db/db_compaction_test.cc (3 warning(s))

db/db_compaction_test.cc:6979:39: warning: statement should be inside braces [readability-braces-around-statements]
db/db_compaction_test.cc:6988:35: warning: statement should be inside braces [readability-braces-around-statements]
db/db_compaction_test.cc:6992:3: warning: multiple declarations in a single statement reduces readability [readability-isolate-declaration]

db/table_cache.cc (1 warning(s))

db/table_cache.cc:489:11: warning: the value returned by this function should not be disregarded; neglecting it may lead to errors [bugprone-unused-return-value]

github-actions · 2026-05-14T22:51:26Z

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 7931580

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-05-14T23:07:49Z

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 7931580

Summary

Well-designed feature addressing a real performance need (avoiding page-cache pollution from compaction scans). The core bypass mechanism in TableCache is sound, and the PosixFileSystem fix is a valuable correction. Main concerns are around OptimizeForBlobFileRead scope creep affecting backup operations, and a minor inaccuracy in the release note.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. `OptimizeForBlobFileRead` scope creep — `env/file_system.cc:181`, `env/env.cc:1190`

Issue: The PR changes OptimizeForBlobFileRead to set use_direct_reads = db_options.use_direct_reads || db_options.use_direct_reads_for_compaction. This function is called from utilities/backup/backup_engine.cc at 3 call sites (lines 1568, 3049, 3084) for blob file copying during backup. When use_direct_reads_for_compaction=true and use_direct_reads=false, backup blob file reads will unexpectedly use O_DIRECT.
Root cause: The option name says "for compaction" but OptimizeForBlobFileRead is used by non-compaction code paths. Blob files read during compaction go through BlobFileCache which uses general file_options_, not OptimizeForBlobFileRead — so the change doesn't even help the intended compaction case.
Suggested fix: Either (a) don't change OptimizeForBlobFileRead — it's not needed for the stated goal, or (b) rename the option to use_direct_reads_for_background if broader scope is intentional. At minimum, document that the flag affects backup blob reads.

M2. Release note inaccuracy — `unreleased_history/new_features/use_direct_reads_for_compaction.md`

Issue: The release note says "RocksDB issues compaction (and flush) background reads through O_DIRECT." The PR does not affect flush reads — flush produces new SST files, it doesn't read existing ones. The direct I/O is only for compaction-input reads.
Root cause: Copy-paste from use_direct_io_for_flush_and_compaction description.
Suggested fix: Remove "(and flush)" from the release note.

M3. `prefetch_index_and_filter_in_cache=true` for ephemeral readers — `db/table_cache.cc:332`

Issue: The bypass path in FindTable calls GetTableReader with prefetch_index_and_filter_in_cache=true (inherited from the default in NewIterator). Since the ephemeral reader is never placed in the cache, prefetching into the block cache is wasteful — it loads index/filter blocks into the block cache from O_DIRECT reads that were meant to bypass caching.
Root cause: The prefetch_index_and_filter_in_cache parameter was designed for cached readers. The bypass path doesn't override it.
Suggested fix: Pass prefetch_index_and_filter_in_cache=false when bypass_cache_for_scan=true, or set it to false in the FindTable bypass path.

🟢 LOW / NIT

L1. PosixFileSystem behavioral change is a pre-existing bug fix — `env/fs_posix.cc:950-969`

Issue: The current PosixFileSystem::OptimizeForCompactionTableRead override does NOT call the base class FileSystem::OptimizeForCompactionTableRead. This means it never sets use_direct_reads from db_options.use_direct_reads. The PR fixes this by calling the base class first. This is a correct bug fix but is orthogonal to the new feature.
Root cause: Original PosixFileSystem override was written only for the readahead clamping fix (compaction_readahead_size doesn't work when it is larger than max_sectors_kb #12038) and forgot to call the base class.
Suggested fix: Consider noting this as a separate bug fix in the commit message. The fix itself is correct and preserves behavior when both flags are false (since use_direct_reads=false || false=false).

L2. Redundant third condition in `bypass_cache_for_scan` — `db/compaction/compaction_job.cc`

Issue: The condition file_options_for_read_.use_direct_reads in bypass_cache_for_scan = use_direct_reads_for_compaction && !use_direct_reads && file_options_for_read_.use_direct_reads is technically redundant — when use_direct_reads_for_compaction=true, OptimizeForCompactionTableRead will always set file_options_for_read_.use_direct_reads=true.
Root cause: Defensive programming — guards against custom FileSystem overrides that might not call the base class.
Suggested fix: This is actually a reasonable safety check. Keep it, but consider adding a comment explaining it's a defensive check against custom FS implementations.

L3. db_bench `--bgwriter_num` bundled with feature PR — `tools/db_bench_tool.cc`

Issue: The --bgwriter_num flag for db_bench is a useful benchmarking tool but is a separate improvement unrelated to the core feature.
Suggested fix: Consider splitting into a separate commit or PR for cleaner history. Not a blocker.

L4. `table_reader_ptr` in bypass mode — `db/table_cache.cc:354-356`

Issue: When bypass_cache_for_scan=true, table_reader_ptr is set to the ephemeral reader. Callers that store this pointer must not use it after the iterator is destroyed. In the compaction path, table_reader_ptr is always nullptr, so this is safe in practice.
Suggested fix: No action needed — all bypass-mode callers pass nullptr for table_reader_ptr.

Cross-Component Analysis

Context	Affected?	Assessment
WritePreparedTxnDB	Yes (same compaction path)	Safe — bypass is per-iterator, no shared state
FIFO/Universal compaction	Yes (same MakeInputIterator)	Safe — all compaction types share the path
ReadOnly/Secondary DB	No compaction	Not affected
CompactionServiceCompactionJob	Inherits privately from CompactionJob	Safe — inherits CreateInputIterator
BackupEngine blob reads	Yes via OptimizeForBlobFileRead	Unintended — see M1
ForwardIterator/LevelIterator	Default false	Safe — only compaction sets bypass=true
Concurrent compactions on same file	Both open ephemeral readers	Safe — each has independent reader/FD
Mixed O_DIRECT + buffered on same file	Kernel concern	Acceptable — compaction is sequential one-pass; file is dropped after

Assumption stress-test results:

"Ephemeral reader cleanup is safe" — CONFIRMED. Range tombstone iterators hold shared_ptr<FragmentedRangeTombstoneList>, not raw pointers into the TableReader. The InternalKeyComparator reference is to the CFD-level comparator which outlives the compaction. No use-after-free.
"Default false preserves existing behavior" — CONFIRMED. When both flags are false, use_direct_reads stays false through all code paths. The PosixFileSystem base-class call is a correct fix for a pre-existing bug.
"Mixing O_DIRECT and buffered is safe in practice" — ACCEPTABLE. The kernel documentation warns against it, but the pattern (sequential one-pass scan that ends with file deletion) avoids the problematic sustained mixed-access scenario. Well-documented in options.h.

Positive Observations

The bypass mechanism is cleanly scoped — the fresh_table_reader_owner unique_ptr in FindTable and RegisterCleanup in NewIterator provide clear ownership semantics.
The PosixFileSystem fix to call the base class is a valuable correction that ensures custom readahead clamping doesn't silently override the direct-reads decision.
Comprehensive test coverage: end-to-end test with both kernel-level (O_DIRECT sync point) and plumbing-level (FileOptions) probes, plus a negative test for the off case.
Proper crash test integration with mmap and direct IO sanitization.
Good option validation (mmap + direct_reads_for_compaction rejected at Open time).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

github-actions · 2026-05-14T23:41:59Z

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit d268198

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-05-15T01:55:27Z

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 176a7c2

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-05-15T02:11:21Z

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 176a7c2

Summary

Well-designed, opt-in feature that addresses a real performance need (preventing compaction reads from polluting the OS page cache). The implementation is architecturally sound, following existing RocksDB patterns. Most infrastructure (options, serialization, validation, stress test, db_bench) is complete and correct.

High-severity findings (2):

[db/table_cache.cc:384] Ephemeral TableReader leaked if NewIterator returns an error after FindTable succeeds in bypass mode — ephemeral_reader goes out of scope without transferring ownership to a cleanup.
[db/table_cache.cc:346-393] Range tombstone iterators created from an ephemeral TableReader (via table_reader->NewRangeTombstoneIterator) are owned by TruncatedRangeDelIterator/range_del_agg which may outlive the result iterator that owns the ephemeral reader via RegisterCleanup, causing use-after-free.

Full review (click to expand)

Findings

🔴 HIGH

H1. Ephemeral TableReader leak on iterator-creation failure — `db/table_cache.cc:384`

Issue: In the bypass path, FindTable creates an ephemeral TableReader owned by ephemeral_reader (a unique_ptr on the stack of NewIterator). If FindTable succeeds but the subsequent table_reader->NewIterator() call fails or s becomes not-OK in the range-deletion processing block (lines 358–395), the error path at line 401–403 creates an ErrorInternalIterator but never transfers the ephemeral reader to it. ephemeral_reader is destroyed when the function returns, but this is actually benign for the error case since the reader is cleaned up by the destructor. However, if FindTable succeeds, result is created, range-del processing then sets s to not-OK (line 380), the code falls through to line 398 where handle is nullptr (bypass), then at line 401 result (the valid iterator that has the cleanup registered) is replaced by NewErrorInternalIterator — but result was already created and the cleanup was already registered on it. The old result is leaked (raw pointer, no delete), and with it the registered cleanup that would have freed the ephemeral reader.
Root cause: Raw pointer result is overwritten at line 403 without deleting the previous iterator that already had the ephemeral-reader cleanup registered on it.
Suggested fix: Before overwriting result with the error iterator, delete the old result (which will trigger its cleanup chain and free the ephemeral reader), or restructure the error handling to avoid the overwrite. This is an existing pattern issue amplified by the bypass path.

H2. Range tombstone iterator lifetime vs ephemeral TableReader — `db/table_cache.cc:358-395`

Issue: When bypass_cache_for_scan is active, the ephemeral TableReader is owned by result via RegisterCleanup. But range_del_iter (line 359) and range_del_agg (line 374) receive FragmentedRangeTombstoneIterator objects created from the same TableReader. These range-tombstone iterators hold internal pointers into the TableReader's data structures. In the compaction merging iterator, range-tombstone iterators may be accessed after the file iterator (result) that owns the ephemeral reader is destroyed during LevelIterator file transitions. If the LevelIterator destroys the old file iterator (and thus the ephemeral reader) before the CompactionMergingIterator finishes with that file's range tombstones, a use-after-free occurs.
Root cause: With cached readers, the reader stays alive in the cache even after the iterator is destroyed. With ephemeral readers, the reader's lifetime is strictly tied to the iterator. Range-tombstone iterators returned via range_del_iter are stored separately (in LevelIterator's tombstone tracking) and may outlive the file iterator.
Suggested fix: Either (a) have the range-tombstone iterator share ownership of the ephemeral reader (e.g., via shared_ptr), (b) ensure the ephemeral reader lifetime extends to cover all tombstone iterators derived from it, or (c) for the L0 path in MakeInputIterator, the range tombstone iters and file iters are passed together to NewCompactionMergingIterator which manages their lifetimes jointly — verify this is safe. For the L1+ LevelIterator path, the range_tombstone_iter_ pointer is updated atomically during InitFileIterator, and the old tombstone iter is only freed after the merging iterator is done with it. This needs careful verification: examine CompactionMergingIterator to confirm it never holds a stale tombstone iterator after the owning file iterator is destroyed.

🟡 MEDIUM

M1. Scope creep: `bgwriter_num` flag in db_bench — `tools/db_bench_tool.cc:268-276,7944-7948`

Issue: The --bgwriter_num flag is a useful benchmarking enhancement but is functionally independent of the use_direct_reads_for_compaction option. Bundling it in the same PR makes the diff larger and review harder.
Suggested fix: Split into a separate PR.

M2. Backup engine inherits changed `OptimizeForCompactionTableRead` behavior — `env/file_system.cc:170`

Issue: OptimizeForCompactionTableRead is called by the backup engine (backup_engine.cc) for copying SST files. When use_direct_reads_for_compaction=true, backup engine SST reads will also get use_direct_reads=true in their FileOptions. This is likely benign (backup reads are sequential, like compaction) but is an unintended side effect.
Suggested fix: Document this interaction. If undesirable, the backup engine could use its own optimization method.

M3. Missing test coverage for L1+ LevelIterator bypass path — `db/db_compaction_test.cc`

Issue: UseDirectReadsForCompactionEndToEnd creates only 2 L0 files that compact to L1. This exercises only the L0 direct-NewIterator path in MakeInputIterator, not the LevelIterator path used for L1+ inputs. The LevelIterator bypass path (where bypass_cache_for_scan_ is stored and used in InitFileIterator) is untested.
Suggested fix: Add a test that populates data into L1 and L2, then triggers an L1→L2 compaction to exercise the LevelIterator bypass.

M4. Sync-point callback counters not atomic — `db/db_compaction_test.cc:6695-6707`

Issue: observed_run_starts, observed_odirect_opens, observed_callbacks are plain int variables incremented from sync-point callbacks that run on compaction threads, while assertions run on the test thread. This is a data race (undefined behavior per C++ standard). In practice, compaction is serial in this test due to disable_auto_compactions + CompactRange, but if any internal parallelism triggers these callbacks concurrently, the test becomes flaky.
Suggested fix: Use std::atomic<int> for all counters incremented from sync-point callbacks.

M5. `NO_FILE_OPENS` statistic inflated for bypass path — `db/table_cache.cc:127`

Issue: GetTableReader records RecordTick(ioptions_.stats, NO_FILE_OPENS) on every call. In normal operation, this is called once per cache miss. With bypass, it's called for every compaction file access. This inflates the rocksdb.no.file.opens statistic, potentially confusing monitoring systems.
Suggested fix: This is technically correct (each call does open a file), but the changed semantics should be documented. Consider adding a separate COMPACTION_DIRECT_FILE_OPENS counter.

🟢 LOW / NIT

L1. Three-way bypass condition could be simplified — `db/compaction/compaction_job.cc:1547-1549`

Issue: bypass_cache_for_scan = use_direct_reads_for_compaction && !use_direct_reads && file_options_for_read_.use_direct_reads — the third term is always true when the first two are true (since OptimizeForCompactionTableRead ORs the two flags). The third check is a defensive guard against custom FileSystem overrides that don't call the base class, which is reasonable but could use a comment explaining the rationale.
Suggested fix: Add a brief comment explaining the third condition serves as a guard for custom FileSystem implementations.

L2. Option documentation mentions mixed O_DIRECT access — `include/rocksdb/options.h:1147-1157`

Issue: The doc comment correctly notes the Linux open(2) advisory against mixed O_DIRECT and buffered I/O. The explanation of why it's safe in practice is good. Minor nit: could mention that posix_fadvise(POSIX_FADV_DONTNEED) is an alternative approach that avoids mixed-mode access entirely.
Suggested fix: Optional: mention alternatives in the documentation.

L3. `RandomInitDBOptions` may generate invalid combinations — `test_util/testutil.cc:304`

Issue: use_direct_reads_for_compaction is randomized independently of allow_mmap_reads, which could generate invalid option combinations. db_crashtest.py sanitizes these, but direct use of RandomInitDBOptions in other tests might hit the mmap + direct_compaction_reads validation error.
Suggested fix: This follows the existing pattern (same issue exists for use_direct_reads + allow_mmap_reads). No action required beyond awareness.

Cross-Component Analysis

Context	Executes?	Assumptions hold?	Action needed?
WritePreparedTxnDB	YES (same compaction path)	YES	Safe
ReadOnly DB	NO (no compactions)	N/A	Safe
CompactionService	YES (inherits CompactionJob)	YES	Safe — options propagated via serialization
User-defined timestamps	YES	YES	No special handling needed
BlobDB	Separate path (`OptimizeForBlobFileRead`)	YES	Intentionally excluded — correct
FIFO/Universal compaction	YES (all use `MakeInputIterator`)	YES	Safe
Backup engine	YES (`OptimizeForCompactionTableRead` called)	Partially	See M2 above
Old snapshots	YES	YES	No impact on snapshot semantics
Concurrent writers	YES	YES	Bypass path is independent per compaction
Prefix seek	N/A (compaction uses total-order)	YES	Safe

Positive Observations

Option infrastructure is thorough: Options registration, serialization round-trip, dump logging, C API, db_bench, db_stress, crashtest integration, and validation are all present and follow existing patterns precisely.
PosixFileSystem correctly calls base class: The override now calls FileSystem::OptimizeForCompactionTableRead first, then applies the Linux-specific readahead clamping on the post-base values. This is the right fix for the inheritance pattern.
Blob file reads intentionally excluded: The test in UseDirectReadsForCompactionOptionMechanics explicitly verifies that OptimizeForBlobFileRead does NOT enable direct reads for the new flag. Good isolation.
End-to-end test validates kernel-level O_DIRECT: The NewRandomAccessFile:O_DIRECT sync point verifies the actual OS-level flag, not just in-memory FileOptions.
Clean bypass activation logic: The bypass_cache_for_scan flag is only computed in one place (CreateInputIterator) and threaded down consistently through MakeInputIterator → LevelIterator → NewIterator → FindTable.
Defensive validation: mmap_reads + use_direct_reads_for_compaction is properly rejected at DB::Open with a clear error message.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

github-actions · 2026-05-15T09:44:13Z

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit c87bff7

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-05-15T09:57:54Z

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit c87bff7

Summary

Well-designed, opt-in feature with correct lifetime management and clean plumbing through the TableCache → MakeInputIterator → LevelIterator → CompactionJob stack. The bypass-cache approach is the right solution given RocksDB's architectural constraints. The PosixFileSystem override fix is essential and correctly implemented. No high-severity issues found after thorough multi-agent review and debate.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Missing Java JNI Bindings — `java/rocksjni/options.cc`

Issue: The existing use_direct_reads option has full Java JNI support. The new use_direct_reads_for_compaction option does not include equivalent Java bindings.
Root cause: Incomplete API surface coverage across language bindings.
Suggested fix: Add JNI getter/setter following the exact pattern used by use_direct_reads. Can be done in a follow-up PR.

M2. `bgwriter_num` db_bench Addition Is Unrelated — `tools/db_bench_tool.cc`

Issue: The new --bgwriter_num flag is functionally unrelated to use_direct_reads_for_compaction. Bundling it makes the change harder to review and bisect.
Suggested fix: Extract bgwriter_num into a separate PR.

M3. `RandomInitDBOptions` May Create Invalid Configs — `test_util/testutil.cc`

Issue: RandomInitDBOptions sets use_direct_reads_for_compaction = rnd->Uniform(2) independently of allow_mmap_reads. If both are 1, ValidateOptions rejects the combination. The db_crashtest.py sanitization handles this correctly, but RandomInitDBOptions does not.
Suggested fix: Either add if (db_opt->allow_mmap_reads) db_opt->use_direct_reads_for_compaction = false; or rely on ValidateOptions rejection (acceptable since random tests should handle Status::NotSupported()).

🟢 LOW / NIT

L1. Redundant Option Combination Not Rejected — `db/db_impl/db_impl_open.cc`

Issue: use_direct_reads=true + use_direct_reads_for_compaction=true is allowed but the compaction flag is a no-op. Harmless but potentially confusing. The option doc already notes "has no effect if use_direct_reads is also true."

L2. BlobDB Intentionally Excluded — `env/file_system.cc`

Issue: OptimizeForBlobFileRead doesn't include the new flag. This is intentional and verified by tests. No action needed.

Cross-Component Analysis

Context	Affected?	Verified Safe?
Compaction (L0 + L1+)	YES	YES — ephemeral readers with proper cleanup
CompactionService	NO	YES — doesn't call MakeInputIterator
ReadOnly DB	NO	YES — no compaction
WritePreparedTxnDB	YES (same path)	YES — orthogonal
User reads	NO	YES — default param prevents bypass
BlobDB	Excluded	YES — intentional

Debated Findings (Resolved)

"Range tombstone destruction order bug" → FALSE POSITIVE. FragmentedRangeTombstoneIterator holds shared_ptr<FragmentedRangeTombstoneList> which keeps tombstone data alive independently of the TableReader. Destruction order in CompactionMergingIterator is irrelevant.

"Block cache pollution negates feature" → FALSE POSITIVE. The feature addresses OS page cache pollution (via O_DIRECT), not block cache. Compaction already sets fill_cache = false (compaction_job.cc:1465).

Positive Observations

Excellent deferred ownership transfer pattern for ephemeral readers
PosixFileSystem fix correctly chains base class call before Linux readahead clamping
Defensive third clause in bypass condition protects against custom FileSystems
Comprehensive test coverage (3 e2e tests + option mechanics + stress test integration)
Zero behavioral change when flag is off (default)

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

Adds a new db_bench flag --bgwriter_num (default 0). When > 0, the background writer thread used by readwhilewriting / readwhilemerging / multireadwhilewriting writes random keys over [0, bgwriter_num) instead of [0, num). This lets the reader thread group operate on a small hot subset (set by --num) while the writer spreads its puts across a much larger keyspace, which is what drives continuous flushes and compaction. Without it readers and the writer share a single keyspace and you can't have both a hot read subset and meaningful compaction work at the same time. Default of 0 preserves existing behavior (writer uses --num).

Introduces a new DBOption `use_direct_io_for_compaction_reads` (default false) that lets users route compaction background reads through O_DIRECT while keeping user reads on the buffered/page-cache path. Sequential compaction reads otherwise pollute the OS page cache with read-once data that evicts the hot user-read working set; bypassing the cache for those reads protects user-read tail latency on write-heavy workloads without forcing users onto the global `use_direct_reads` path (which slows user reads dramatically). A naive implementation that only flipped the FileOptions returned by `OptimizeForCompactionTableRead` does not actually trigger the OS-level O_DIRECT open, because the TableCache (and FileMetaData::pinned_reader) already holds long-lived buffered handles opened at flush time or at DB::Open via LoadTableHandlers. Compaction would silently reuse those cached buffered handles and the kernel would never see the O_DIRECT flag. The fix opens ephemeral O_DIRECT TableReaders for the lifetime of the compaction scan, separate from the cache: * TableCache::FindTable / NewIterator learn an `open_ephemeral_table_reader` mode. When set, the pinned-reader fast path and the shared cache are skipped, GetTableReader is called directly with the caller's FileOptions, and ownership of the freshly opened TableReader is handed back to the caller. The iterator takes ownership via RegisterCleanup and frees the reader on destruction. In this mode the bypass also forces skip_filters=true and prefetch_index_and_filter_in_cache=false: compaction reads every key in the file (filters are useless) and inserting the ephemeral reader's index/filter blocks into the shared block cache would partially defeat the page-cache-protection goal of the option. * VersionSet::MakeInputIterator and LevelIterator plumb the flag through both the L0 and L1+ compaction-input paths. * CompactionJob::ProcessKeyValueCompaction enables the flag exactly when `use_direct_io_for_compaction_reads` is on, the global `use_direct_reads` is off, and `OptimizeForCompactionTableRead` actually produced `use_direct_reads=true` in the compaction-read FileOptions. End-to-end tests in db_compaction_test.cc use the existing `NewRandomAccessFile:O_DIRECT` sync point in env/fs_posix.cc to assert that the kernel-level open really happens for compaction inputs when the flag is set, and never fires when the flag is off. There are on-path tests (L0->L1 via MakeInputIterator), L1+ tests with range tombstones (LevelIterator path), and a concurrent-read-vs-compaction stress test for use under TSAN/ASAN. The negative ("off stays buffered") test runs on all platforms; the others are scoped to platforms with the O_DIRECT path. The new option follows the existing add_option.md checklist: it is registered in ImmutableDBOptions for serialization, surfaced through the C API, exposed in db_bench / db_stress / db_crashtest.py, randomized in RandomInitDBOptions, validated against allow_mmap_reads at Open time, and documented in unreleased_history (new_features + behavior_changes, the latter for the FileSystem::OptimizeForCompactionTableRead contract change). Java JNI is left for a follow-up. The same underlying hook (FileSystem::OptimizeForCompactionTableRead) is also called by BackupEngine when copying SST files, so backups will also route through O_DIRECT when the new flag is true. This is intentional -- backups exhibit the same sequential-scan / cache- pollution pattern as compaction inputs -- and is documented in the option's Doxygen and the FileSystem interface. Blob-file reads are not affected by the new flag (BlobFileCache holds its own cached handles); this is documented as a known limitation with a follow-up to extend the ephemeral-reader mechanism to BlobFileCache. Benchmark results ================= Setup: Ubuntu 24.04 (kernel 7.0.5 OrbStack Linux VM on Apple Silicon), 14 vCPUs, virtio-blk disk. MGLRU disabled (echo 0 > /sys/kernel/mm/lru_gen/enabled). 14 GB DB (3.5M keys * 4 KB values), no compression. Each measurement run pinned to a 1 GB cgroup via `systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0`, so DB-to-cache ratio is ~14x. Page cache dropped between configs. Workload: readwhilewriting for 180 s, 4 reader threads on a hot 2,000-key subset (~8 MB, ~3% of cache) + 1 writer thread spreading overwrites across the full 3.5M-key keyspace (via `--bgwriter_num=3500000`), throttled at 100 MB/s. Compaction ran at ~500 MB/s read/write during the buffered run, ~400 MB/s with direct compaction. Each run was 3 minutes long; "buffered" is the existing default. | Config | Throughput | Read P50 | Read P99 | Read P99.9 | Read P99.99 | |-------------------------------------------|-----------------|---------------|---------------|----------------|----------------| | buffered (default) | 406 K ops/s | 7.34 us | 79.11 us | 533.14 us | 1647.79 us | | direct_compaction_read_write | **464 K ops/s** | **6.37 us** | **71.64 us** | **468.28 us** | **1363.91 us** | | | (+14%) | (-13%) | (-9%) | (-12%) | (-17%) | | direct_compaction_read_only | 421 K ops/s | 6.99 us | 88.95 us | 504.32 us | 1456.75 us | | | (+4%) | (-5%) | (+13%) | (-5%) | (-12%) | | use_direct_reads = true (existing global) | 442 K ops/s | 7.37 us | 50.82 us | 472.23 us | 1626.77 us | | | (+9%) | (0%) | (-36%) | (-11%) | (-1%) | The recommended production configuration is `use_direct_io_for_compaction_reads = true` together with `use_direct_io_for_flush_and_compaction = true` ("direct reads + writes for compaction"). It wins on every metric simultaneously: throughput up 14%, every read percentile from P50 to P99.99 down 9 to 17%. The existing global `use_direct_reads = true` flag does help P99 specifically but at a noticeable throughput cost and is no better at P99.99; the new compaction-only path is strictly better for the write-heavy workloads it is designed for. Higher DB-to-cache ratios (the Cassandra blog at https://lightfoot.dev/direct-i-o-for-cassandra-compaction-cutting-p99-read-latency-by-5x/ reports ~5x P99 improvement at a 43x ratio) should widen the gap further; the 14x ratio used above is what fit in a single laptop's disk budget. Repro recipe ============ Setup: - Install OrbStack on macOS or use any Linux host - On macOS: orb create -t ubuntu rocksdb-bench - Inside the Linux machine: apt-get install -y build-essential clang cmake git pkg-config \ libgflags-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev \ libzstd-dev rsync cmake -DCMAKE_BUILD_TYPE=Release -DPORTABLE=1 -DWITH_GFLAGS=1 \ -DWITH_TESTS=0 .. && make -j db_bench Build the source DB (once, unrestricted memory): ./db_bench --benchmarks=fillrandom,compact,waitforcompaction,stats \ --db=/path/to/source_db --num=3500000 --key_size=16 \ --value_size=4096 --write_buffer_size=16777216 \ --target_file_size_base=16777216 --max_background_jobs=4 \ --compression_type=none --cache_size=4194304 \ --max_bytes_for_level_base=67108864 --disable_wal=1 --sync=0 Per-config measurement (copy source_db -> scratch_db first, then drop_caches, then run under cgroup): sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \ ./db_bench --use_existing_db=1 \ --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \ --threads=5 --duration=180 --statistics=true --histogram=1 \ --num=2000 --bgwriter_num=3500000 \ --key_size=16 --value_size=4096 \ --write_buffer_size=16777216 --target_file_size_base=16777216 \ --max_background_jobs=4 --compression_type=none \ --cache_size=4194304 --open_files=200 \ --skip_stats_update_on_db_open=true \ --max_bytes_for_level_base=67108864 \ --benchmark_write_rate_limit=104857600 \ --rate_limiter_bytes_per_sec=0 \ --use_direct_reads={true|false} \ --use_direct_io_for_compaction_reads={true|false} \ --use_direct_io_for_flush_and_compaction={true|false} Disable MGLRU first so the kernel uses the classic active/inactive LRU: echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled

github-actions · 2026-05-25T03:03:02Z

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 5adce3a

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-05-25T03:23:59Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 5adce3a

Summary

Well-engineered PR that adds a useful use_direct_io_for_compaction_reads option with thorough testing, proper option plumbing, stress test integration, and a real PosixFileSystem bug fix. The ephemeral TableReader bypass design is sound and the ownership/cleanup logic is careful and correct.

High-severity findings (1):

[table_cache.cc:237-259] Ephemeral readers bypass the OS page cache (O_DIRECT) but still insert data blocks into the shared RocksDB block cache during iteration. Document this limitation or consider setting no_block_cache for the ephemeral path.

Full review (click to expand)

Findings

🔴 HIGH

H1. Ephemeral readers still use the shared block cache -- `table_cache.cc:237-259`

Issue: The ephemeral TableReader bypass correctly skips the TableCache (table-level cache) and sets prefetch_index_and_filter_in_cache=false and skip_filters=true. However, during iteration the ephemeral reader will still insert data blocks into the shared block cache via the normal BlockBasedTableIterator path, because BlockBasedTableOptions::no_block_cache is not set.
Root cause: The PR targets OS page-cache pollution (via O_DIRECT), not RocksDB block-cache pollution. These are separate caches. The PR description and benchmarks focus on OS page cache, which O_DIRECT correctly addresses.
Suggested fix: This is arguably by-design: the RocksDB block cache uses LRU/HCC eviction with priority hints, so compaction data blocks are naturally low-priority. However, the limitation should be documented, or no_block_cache could be considered for the ephemeral path if block cache pollution is also a concern. At minimum, add a comment in the bypass path noting that data blocks still go through the block cache.

🟡 MEDIUM

M1. `bgwriter_num` is unrelated to the core feature -- `tools/db_bench_tool.cc:268-275`

Issue: The --bgwriter_num flag for db_bench is a useful benchmarking helper but is a separate feature from use_direct_io_for_compaction_reads. Bundling it in the same PR makes the change larger and harder to review/bisect.
Suggested fix: Consider splitting into a separate small PR, or document the coupling clearly in the commit message. (The PR description does explain it, so this is minor.)

M2. `RandomInitDBOptions` can generate invalid combinations -- `test_util/testutil.cc:304`

Issue: use_direct_io_for_compaction_reads = rnd->Uniform(2) can randomly combine with allow_mmap_reads = true, producing an invalid option combination that would fail at DB::Open. This is consistent with the existing pattern (use_direct_reads has the same issue), and the db_crashtest.py sanitizer handles it, but callers of RandomInitDBOptions outside of db_crashtest may hit failures.
Suggested fix: Document this or add a post-randomization sanitization helper. Low priority since it matches existing behavior.

M3. BackupEngine side effect may surprise users -- `env/file_system.cc:167-199`

Issue: OptimizeForCompactionTableRead is also called by BackupEngine (backup_engine.cc:1560, 3052, 3087) when copying SST files. Enabling use_direct_io_for_compaction_reads will route backup SST reads through O_DIRECT. The PR documents this as intentional, but users may not expect a "compaction reads" option to affect backups.
Suggested fix: The documentation in options.h already notes this. Consider mentioning it in the release note as well.

🟢 LOW / NIT

L1. Statistics inflation from ephemeral opens -- `table_cache.cc:237-259`

Issue: Each ephemeral reader creation increments NO_FILE_OPENS and TABLE_OPEN_IO_MICROS. During a large L1->L2 compaction, this could significantly inflate these metrics vs. the cached-handle path.
Suggested fix: Consider adding a separate counter (e.g., COMPACTION_EPHEMERAL_FILE_OPENS) so users can distinguish compaction-bypass opens from normal opens, or document the expected metric change.

L2. Lambda in RegisterCleanup vs function pointer -- `table_cache.cc:466-470`

Issue: The cleanup uses a captureless lambda [](void* arg1, void*) { delete static_cast<TableReader*>(arg1); }. While a captureless lambda is convertible to a function pointer and works correctly, the existing codebase pattern uses cache_.RegisterReleaseAsCleanup() or static functions. A named static function would be more consistent.
Suggested fix: Optional style nit. The lambda works correctly.

L3. Option comment length -- `include/rocksdb/options.h:1135-1200`

Issue: The option comment is very thorough (65 lines), which is excellent for documentation but significantly longer than the comments on sibling options (use_direct_reads is 2 lines, use_direct_io_for_flush_and_compaction is 2 lines).
Suggested fix: The detail is valuable; no change needed. Just noting the asymmetry.

L4. PosixFileSystem now calls base impl -- `env/fs_posix.cc:950-969`

Issue: The old PosixFileSystem::OptimizeForCompactionTableRead override did NOT call the base implementation, which meant use_direct_reads was silently ignored for compaction reads on POSIX. The PR fixes this by calling FileSystem::OptimizeForCompactionTableRead first. This is a bug fix that improves correctness even without the new option.
Suggested fix: This fix is correct. Consider mentioning it explicitly in the release notes as a separate behavior change.

Cross-Component Analysis

Context	Does code execute?	Assumptions hold?	Action needed?
WritePreparedTxnDB	YES (compaction runs)	YES (no read_callback in compaction)	Safe
ReadOnly DB	NO (no compaction)	N/A	Safe
CompactionService	YES (inherits CreateInputIterator)	YES	Safe, auto-inherited
User-defined timestamps	YES	YES (standard iterator path)	Safe
MemPurge	NO (memtable-to-memtable)	N/A	Safe
BlobDB	Excluded by design	N/A	Documented TODO
Concurrent compactions	YES (multiple jobs)	YES (each has own ephemeral readers)	FD usage may increase
FIFO / Universal compaction	YES	YES (same MakeInputIterator path)	Safe

Verified Non-Issues (disproven during debate)

Range tombstone use-after-free: Multiple agents flagged FragmentedRangeTombstoneIterator holding a pointer to rep_->internal_comparator (range_tombstone_fragmenter.h:319). After deep analysis: in LevelIterator, ClearRangeTombstoneIter() destroys the old range tombstone iterator BEFORE SetFileIterator() destroys the old file iterator (and thus the ephemeral reader) at version_set.cc:1770. For L0 files, both the file iterator and range-del aggregator live for the full compaction lifetime. The tombstone data itself uses shared_ptr<FragmentedRangeTombstoneList>. Not a bug.
table_filter resource leak: When table_filter rejects, result = NewEmptyInternalIterator which inherits from Cleanable via InternalIteratorBase and supports RegisterCleanup. The condition s.ok() && ephemeral_reader && result != nullptr is satisfied since the empty iterator is non-null. Not a leak.
no_io + fresh_table_reader_owner contract violation: The PR handles this with both assert(!no_io) and a defensive Status::Incomplete return in release builds (table_cache.cc:231-237). Properly handled.
Three-clause condition logic error: When use_direct_reads=true, all reads are already O_DIRECT, so skipping ephemeral readers (!db_options_.use_direct_reads clause) is correct. When a custom FS strips the flag, skipping ephemeral readers avoids wasteful buffered re-opens. Correct logic.
Error handling assertion removal: The old assert(result == nullptr) was actually incorrect -- range-del processing CAN set s to non-OK after result is assigned (via new_range_del_iter->status() at table_cache.cc:380). The PR's explicit cleanup of result in the error path is a bug fix.

Positive Observations

Thorough testing: Four compaction tests covering the off-path, end-to-end O_DIRECT, LevelIterator with tombstones, and concurrent stress.
Clean option plumbing: The option is properly wired through ImmutableDBOptions, serialization, C API, db_bench, db_stress, and crash tests.
Defensive coding: The assert(!no_io) + runtime check, the OOM-safe cleanup ordering, and the detailed inline comments show careful engineering.
PosixFileSystem bug fix: The fix to call the base implementation is a meaningful correctness improvement.
Well-written documentation: The option comment in options.h is exemplary -- it covers scope, limitations, platform considerations, and usage guidance.
Error path bug fix: The new cleanup of non-null result on error fixes a latent bug in the existing code.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

meta-cla Bot added the CLA Signed label May 14, 2026

zaidoon1 force-pushed the use-direct-reads-for-compaction branch 2 times, most recently from 8b08b12 to 7931580 Compare May 14, 2026 22:07

zaidoon1 force-pushed the use-direct-reads-for-compaction branch 2 times, most recently from 7892251 to d268198 Compare May 14, 2026 23:25

zaidoon1 force-pushed the use-direct-reads-for-compaction branch 2 times, most recently from d5bb307 to 176a7c2 Compare May 15, 2026 01:38

zaidoon1 force-pushed the use-direct-reads-for-compaction branch 2 times, most recently from 9a8d933 to c87bff7 Compare May 15, 2026 08:52

zaidoon1 force-pushed the use-direct-reads-for-compaction branch from c87bff7 to c6ec228 Compare May 25, 2026 02:09

zaidoon1 changed the title ~~Add use_direct_reads_for_compaction option~~ Add use_direct_io_for_compaction_reads option May 25, 2026

zaidoon1 force-pushed the use-direct-reads-for-compaction branch from c6ec228 to 64f9849 Compare May 25, 2026 02:14

zaidoon1 force-pushed the use-direct-reads-for-compaction branch from 64f9849 to 5adce3a Compare May 25, 2026 02:22

Conversation

zaidoon1 commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark

Small hot set: ~30 MB (~3% of the 1 GB cgroup) — N=5 iterations, mean (CV)

Larger hot set: ~400 MB (~40% of cache) — N=5 iterations, mean (CV)

compaction_readahead_size matters when this flag is on

Summary

Reproducing

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ clang-tidy: 4 warning(s) on changed lines

Summary by check

Details

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Code Review - OBSOLETE

🟡 Codex Code Review

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review - OBSOLETE

✅ Claude Code Review

Summary

Findings

🟡 MEDIUM

M1. OptimizeForBlobFileRead scope creep — env/file_system.cc:181, env/env.cc:1190

M2. Release note inaccuracy — unreleased_history/new_features/use_direct_reads_for_compaction.md

M3. prefetch_index_and_filter_in_cache=true for ephemeral readers — db/table_cache.cc:332

🟢 LOW / NIT

L1. PosixFileSystem behavioral change is a pre-existing bug fix — env/fs_posix.cc:950-969

L2. Redundant third condition in bypass_cache_for_scan — db/compaction/compaction_job.cc

L3. db_bench --bgwriter_num bundled with feature PR — tools/db_bench_tool.cc

L4. table_reader_ptr in bypass mode — db/table_cache.cc:354-356

Cross-Component Analysis

Positive Observations

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Code Review - OBSOLETE

🟡 Codex Code Review

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Code Review - OBSOLETE

🟡 Codex Code Review

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review - OBSOLETE

✅ Claude Code Review

Summary

Findings

🔴 HIGH

H1. Ephemeral TableReader leak on iterator-creation failure — db/table_cache.cc:384

H2. Range tombstone iterator lifetime vs ephemeral TableReader — db/table_cache.cc:358-395

🟡 MEDIUM

M1. Scope creep: bgwriter_num flag in db_bench — tools/db_bench_tool.cc:268-276,7944-7948

M2. Backup engine inherits changed OptimizeForCompactionTableRead behavior — env/file_system.cc:170

M3. Missing test coverage for L1+ LevelIterator bypass path — db/db_compaction_test.cc

M4. Sync-point callback counters not atomic — db/db_compaction_test.cc:6695-6707

M5. NO_FILE_OPENS statistic inflated for bypass path — db/table_cache.cc:127

🟢 LOW / NIT

L1. Three-way bypass condition could be simplified — db/compaction/compaction_job.cc:1547-1549

L2. Option documentation mentions mixed O_DIRECT access — include/rocksdb/options.h:1147-1157

L3. RandomInitDBOptions may generate invalid combinations — test_util/testutil.cc:304

Cross-Component Analysis

Positive Observations

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Code Review - OBSOLETE

🟡 Codex Code Review

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review - OBSOLETE

✅ Claude Code Review

zaidoon1 commented May 14, 2026 •

edited

Loading

`compaction_readahead_size` matters when this flag is on

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

M1. `OptimizeForBlobFileRead` scope creep — `env/file_system.cc:181`, `env/env.cc:1190`

M2. Release note inaccuracy — `unreleased_history/new_features/use_direct_reads_for_compaction.md`

M3. `prefetch_index_and_filter_in_cache=true` for ephemeral readers — `db/table_cache.cc:332`

L1. PosixFileSystem behavioral change is a pre-existing bug fix — `env/fs_posix.cc:950-969`

L2. Redundant third condition in `bypass_cache_for_scan` — `db/compaction/compaction_job.cc`

L3. db_bench `--bgwriter_num` bundled with feature PR — `tools/db_bench_tool.cc`

L4. `table_reader_ptr` in bypass mode — `db/table_cache.cc:354-356`

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading

H1. Ephemeral TableReader leak on iterator-creation failure — `db/table_cache.cc:384`

H2. Range tombstone iterator lifetime vs ephemeral TableReader — `db/table_cache.cc:358-395`

M1. Scope creep: `bgwriter_num` flag in db_bench — `tools/db_bench_tool.cc:268-276,7944-7948`

M2. Backup engine inherits changed `OptimizeForCompactionTableRead` behavior — `env/file_system.cc:170`

M3. Missing test coverage for L1+ LevelIterator bypass path — `db/db_compaction_test.cc`

M4. Sync-point callback counters not atomic — `db/db_compaction_test.cc:6695-6707`

M5. `NO_FILE_OPENS` statistic inflated for bypass path — `db/table_cache.cc:127`

L1. Three-way bypass condition could be simplified — `db/compaction/compaction_job.cc:1547-1549`

L2. Option documentation mentions mixed O_DIRECT access — `include/rocksdb/options.h:1147-1157`

L3. `RandomInitDBOptions` may generate invalid combinations — `test_util/testutil.cc:304`

github-actions Bot commented May 15, 2026 •

edited

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading

M1. Missing Java JNI Bindings — `java/rocksjni/options.cc`

M2. `bgwriter_num` db_bench Addition Is Unrelated — `tools/db_bench_tool.cc`

M3. `RandomInitDBOptions` May Create Invalid Configs — `test_util/testutil.cc`

L1. Redundant Option Combination Not Rejected — `db/db_impl/db_impl_open.cc`

L2. BlobDB Intentionally Excluded — `env/file_system.cc`

H1. Ephemeral readers still use the shared block cache -- `table_cache.cc:237-259`

M1. `bgwriter_num` is unrelated to the core feature -- `tools/db_bench_tool.cc:268-275`

M2. `RandomInitDBOptions` can generate invalid combinations -- `test_util/testutil.cc:304`

M3. BackupEngine side effect may surprise users -- `env/file_system.cc:167-199`

L1. Statistics inflation from ephemeral opens -- `table_cache.cc:237-259`

L2. Lambda in RegisterCleanup vs function pointer -- `table_cache.cc:466-470`

L3. Option comment length -- `include/rocksdb/options.h:1135-1200`

L4. PosixFileSystem now calls base impl -- `env/fs_posix.cc:950-969`