Add use_direct_io_for_compaction_reads option#14743
| Check | Count |
|---|---|
| bugprone-unused-return-value | 1 |
| readability-braces-around-statements | 2 |
| readability-isolate-declaration | 1 |
| Total | 4 |
Details
db/db_compaction_test.cc (3 warning(s))
db/db_compaction_test.cc:6979:39: warning: statement should be inside braces [readability-braces-around-statements]
db/db_compaction_test.cc:6988:35: warning: statement should be inside braces [readability-braces-around-statements]
db/db_compaction_test.cc:6992:3: warning: multiple declarations in a single statement reduces readability [readability-isolate-declaration]
db/table_cache.cc (1 warning(s))
db/table_cache.cc:489:11: warning: the value returned by this function should not be disregarded; neglecting it may lead to errors [bugprone-unused-return-value]
Claude Code Review - OBSOLETE (superseded by a newer AI review)
✅ Claude Code Review
Auto-triggered after CI passed, reviewing commit 7931580
Summary
Well-designed feature addressing a real performance need (avoiding page-cache pollution from compaction scans). The core bypass mechanism in TableCache is sound, and the PosixFileSystem fix is a valuable correction. Main concerns are around
High-severity findings (0): No high-severity findings.
Findings
🟡 MEDIUM
M1.
| Context | Affected? | Assessment |
|---|---|---|
| WritePreparedTxnDB | Yes (same compaction path) | Safe — bypass is per-iterator, no shared state |
| FIFO/Universal compaction | Yes (same MakeInputIterator) | Safe — all compaction types share the path |
| ReadOnly/Secondary DB | No compaction | Not affected |
| CompactionServiceCompactionJob | Inherits privately from CompactionJob | Safe — inherits CreateInputIterator |
| BackupEngine blob reads | Yes via OptimizeForBlobFileRead | Unintended — see M1 |
| ForwardIterator/LevelIterator | Default false | Safe — only compaction sets bypass=true |
| Concurrent compactions on same file | Both open ephemeral readers | Safe — each has independent reader/FD |
| Mixed O_DIRECT + buffered on same file | Kernel concern | Acceptable — compaction is sequential one-pass; file is dropped after |
Assumption stress-test results:
- "Ephemeral reader cleanup is safe": CONFIRMED. Range tombstone iterators hold `shared_ptr<FragmentedRangeTombstoneList>`, not raw pointers into the TableReader. The `InternalKeyComparator` reference is to the CFD-level comparator, which outlives the compaction. No use-after-free.
- "Default false preserves existing behavior": CONFIRMED. When both flags are false, `use_direct_reads` stays false through all code paths. The PosixFileSystem base-class call is a correct fix for a pre-existing bug.
- "Mixing O_DIRECT and buffered is safe in practice": ACCEPTABLE. The kernel documentation warns against it, but the pattern (sequential one-pass scan that ends with file deletion) avoids the problematic sustained mixed-access scenario. Well-documented in options.h.
Positive Observations
- The bypass mechanism is cleanly scoped: the `fresh_table_reader_owner` unique_ptr in `FindTable` and `RegisterCleanup` in `NewIterator` provide clear ownership semantics.
- The PosixFileSystem fix to call the base class is a valuable correction that ensures custom readahead clamping doesn't silently override the direct-reads decision.
- Comprehensive test coverage: end-to-end test with both kernel-level (O_DIRECT sync point) and plumbing-level (FileOptions) probes, plus a negative test for the off case.
- Proper crash test integration with mmap and direct IO sanitization.
- Good option validation (mmap + direct_reads_for_compaction rejected at Open time).
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
- `/claude-review [context]`: Request a code review
- `/claude-query <question>`: Ask about the PR or codebase
Claude Code Review - OBSOLETE (superseded by a newer AI review)
✅ Claude Code Review
Auto-triggered after CI passed, reviewing commit 176a7c2
Summary
Well-designed, opt-in feature that addresses a real performance need (preventing compaction reads from polluting the OS page cache). The implementation is architecturally sound, following existing RocksDB patterns. Most infrastructure (options, serialization, validation, stress test, db_bench) is complete and correct.
High-severity findings (2):
Findings
🔴 HIGH
H1. Ephemeral TableReader leak on iterator-creation failure
| Context | Executes? | Assumptions hold? | Action needed? |
|---|---|---|---|
| WritePreparedTxnDB | YES (same compaction path) | YES | Safe |
| ReadOnly DB | NO (no compactions) | N/A | Safe |
| CompactionService | YES (inherits CompactionJob) | YES | Safe — options propagated via serialization |
| User-defined timestamps | YES | YES | No special handling needed |
| BlobDB | Separate path (OptimizeForBlobFileRead) | YES | Intentionally excluded (correct) |
| FIFO/Universal compaction | YES (all use MakeInputIterator) | YES | Safe |
| Backup engine | YES (OptimizeForCompactionTableRead called) | Partially | See M2 above |
| Old snapshots | YES | YES | No impact on snapshot semantics |
| Concurrent writers | YES | YES | Bypass path is independent per compaction |
| Prefix seek | N/A (compaction uses total-order) | YES | Safe |
Positive Observations
- Option infrastructure is thorough: Options registration, serialization round-trip, dump logging, C API, db_bench, db_stress, crashtest integration, and validation are all present and follow existing patterns precisely.
- PosixFileSystem correctly calls base class: The override now calls `FileSystem::OptimizeForCompactionTableRead` first, then applies the Linux-specific readahead clamping on the post-base values. This is the right fix for the inheritance pattern.
- Blob file reads intentionally excluded: The test in `UseDirectReadsForCompactionOptionMechanics` explicitly verifies that `OptimizeForBlobFileRead` does NOT enable direct reads for the new flag. Good isolation.
- End-to-end test validates kernel-level O_DIRECT: The `NewRandomAccessFile:O_DIRECT` sync point verifies the actual OS-level flag, not just in-memory FileOptions.
- Clean bypass activation logic: The `bypass_cache_for_scan` flag is only computed in one place (`CreateInputIterator`) and threaded down consistently through `MakeInputIterator` → `LevelIterator` → `NewIterator` → `FindTable`.
- Defensive validation: `mmap_reads + use_direct_reads_for_compaction` is properly rejected at DB::Open with a clear error message.
Claude Code Review - OBSOLETE (superseded by a newer AI review)
✅ Claude Code Review
Auto-triggered after CI passed, reviewing commit c87bff7
Summary
Well-designed, opt-in feature with correct lifetime management and clean plumbing through the TableCache → MakeInputIterator → LevelIterator → CompactionJob stack. The bypass-cache approach is the right solution given RocksDB's architectural constraints. The PosixFileSystem override fix is essential and correctly implemented. No high-severity issues found after thorough multi-agent review and debate.
High-severity findings (0):
Findings
🟡 MEDIUM
M1. Missing Java JNI Bindings
| Context | Affected? | Verified Safe? |
|---|---|---|
| Compaction (L0 + L1+) | YES | YES — ephemeral readers with proper cleanup |
| CompactionService | NO | YES — doesn't call MakeInputIterator |
| ReadOnly DB | NO | YES — no compaction |
| WritePreparedTxnDB | YES (same path) | YES — orthogonal |
| User reads | NO | YES — default param prevents bypass |
| BlobDB | Excluded | YES — intentional |
Debated Findings (Resolved)
"Range tombstone destruction order bug" → FALSE POSITIVE. FragmentedRangeTombstoneIterator holds shared_ptr<FragmentedRangeTombstoneList> which keeps tombstone data alive independently of the TableReader. Destruction order in CompactionMergingIterator is irrelevant.
"Block cache pollution negates feature" → FALSE POSITIVE. The feature addresses OS page cache pollution (via O_DIRECT), not block cache. Compaction already sets fill_cache = false (compaction_job.cc:1465).
Positive Observations
- Excellent deferred ownership transfer pattern for ephemeral readers
- PosixFileSystem fix correctly chains base class call before Linux readahead clamping
- Defensive third clause in bypass condition protects against custom FileSystems
- Comprehensive test coverage (3 e2e tests + option mechanics + stress test integration)
- Zero behavioral change when flag is off (default)
Adds a new db_bench flag --bgwriter_num (default 0). When > 0, the background writer thread used by readwhilewriting / readwhilemerging / multireadwhilewriting writes random keys over [0, bgwriter_num) instead of [0, num). This lets the reader thread group operate on a small hot subset (set by --num) while the writer spreads its puts across a much larger keyspace, which is what drives continuous flushes and compaction. Without it readers and the writer share a single keyspace and you can't have both a hot read subset and meaningful compaction work at the same time. Default of 0 preserves existing behavior (writer uses --num).
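The key-range selection described above can be sketched in a few lines. This is only an illustration of the documented semantics (0 falls back to `--num`); `WriterKeyRange` and `PickKey` are hypothetical helper names, not functions from db_bench.

```cpp
#include <cstdint>

// Hypothetical helper (not db_bench code): size of the key range the
// background writer draws from. bgwriter_num == 0 means "use num",
// matching the flag's documented default.
constexpr std::int64_t WriterKeyRange(std::int64_t num,
                                      std::int64_t bgwriter_num) {
  return bgwriter_num > 0 ? bgwriter_num : num;
}

// Hypothetical helper: maps a raw random value onto a key index in
// [0, range). Assumes range > 0.
constexpr std::int64_t PickKey(std::uint64_t random_value,
                               std::int64_t range) {
  return static_cast<std::int64_t>(random_value %
                                   static_cast<std::uint64_t>(range));
}
```

With `--num=2000 --bgwriter_num=3500000`, readers would stay inside the first 2,000 keys while `WriterKeyRange(2000, 3500000)` gives the writer the full 3.5M-key range.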
Introduces a new DBOption `use_direct_io_for_compaction_reads` (default
false) that lets users route compaction background reads
through O_DIRECT while keeping user reads on the buffered/page-cache
path. Sequential compaction reads otherwise pollute the OS page cache
with read-once data that evicts the hot user-read working set;
bypassing the cache for those reads protects user-read tail latency on
write-heavy workloads without forcing users onto the global
`use_direct_reads` path (which slows user reads dramatically).
A naive implementation that only flipped the FileOptions returned by
`OptimizeForCompactionTableRead` does not actually trigger the
OS-level O_DIRECT open, because the TableCache (and
FileMetaData::pinned_reader) already holds long-lived buffered
handles opened at flush time or at DB::Open via LoadTableHandlers.
Compaction would silently reuse those cached buffered handles and the
kernel would never see the O_DIRECT flag.
The fix opens ephemeral O_DIRECT TableReaders for the lifetime of the
compaction scan, separate from the cache:
* TableCache::FindTable / NewIterator learn an
`open_ephemeral_table_reader` mode. When set, the pinned-reader
fast path and the shared cache are skipped, GetTableReader is
called directly with the caller's FileOptions, and ownership of
the freshly opened TableReader is handed back to the caller. The
iterator takes ownership via RegisterCleanup and frees the reader
on destruction. In this mode the bypass also forces
skip_filters=true and prefetch_index_and_filter_in_cache=false:
compaction reads every key in the file (filters are useless) and
inserting the ephemeral reader's index/filter blocks into the
shared block cache would partially defeat the page-cache-protection
goal of the option.
* VersionSet::MakeInputIterator and LevelIterator plumb the flag
through both the L0 and L1+ compaction-input paths.
* CompactionJob::ProcessKeyValueCompaction enables the flag exactly
when `use_direct_io_for_compaction_reads` is on, the global
`use_direct_reads` is off, and `OptimizeForCompactionTableRead`
actually produced `use_direct_reads=true` in the
compaction-read FileOptions.
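The three-clause enabling condition in the last bullet can be written out as a small predicate. This is a minimal sketch under stated assumptions: `DbOptionsSketch`, `FileOptionsSketch`, and `ShouldOpenEphemeralReaders` are hypothetical stand-ins, reduced to the fields the condition reads, and are not the RocksDB types.

```cpp
// Hypothetical stand-in for the DB-level options involved.
struct DbOptionsSketch {
  bool use_direct_reads;                    // existing global flag
  bool use_direct_io_for_compaction_reads;  // the new opt-in flag
};

// Hypothetical stand-in for the FileOptions produced by
// OptimizeForCompactionTableRead.
struct FileOptionsSketch {
  bool use_direct_reads;
};

// True only when all three clauses hold:
//  1. the new option is on,
//  2. the global flag is off (otherwise every read is already O_DIRECT),
//  3. the FileSystem hook actually asked for direct reads.
constexpr bool ShouldOpenEphemeralReaders(
    const DbOptionsSketch& db, const FileOptionsSketch& compaction_read) {
  return db.use_direct_io_for_compaction_reads && !db.use_direct_reads &&
         compaction_read.use_direct_reads;
}
```

The third clause is what keeps a custom FileSystem that clears the flag from triggering pointless buffered re-opens.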
End-to-end tests in db_compaction_test.cc use the existing
`NewRandomAccessFile:O_DIRECT` sync point in env/fs_posix.cc to assert
that the kernel-level open really happens for compaction inputs when
the flag is set, and never fires when the flag is off. There are
on-path tests (L0->L1 via MakeInputIterator), L1+ tests with range
tombstones (LevelIterator path), and a concurrent-read-vs-compaction
stress test for use under TSAN/ASAN. The negative ("off stays
buffered") test runs on all platforms; the others are scoped to
platforms with the O_DIRECT path.
The new option follows the existing add_option.md checklist: it is
registered in ImmutableDBOptions for serialization, surfaced through
the C API, exposed in db_bench / db_stress / db_crashtest.py,
randomized in RandomInitDBOptions, validated against allow_mmap_reads
at Open time, and documented in unreleased_history (new_features +
behavior_changes, the latter for the
FileSystem::OptimizeForCompactionTableRead contract change). Java JNI
is left for a follow-up.
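The Open-time validation against `allow_mmap_reads` mentioned above might look roughly like the following. This is a sketch only: `OpenCheck`, `OpenOptionsSketch`, and `ValidateDirectIoOptions` are hypothetical names, and the exact status the PR returns is an assumption (RocksDB reports such failures through `Status`, not an enum).

```cpp
// Hypothetical result codes standing in for a RocksDB Status.
enum class OpenCheck { kOk, kNotSupported };

// Hypothetical reduced option set, limited to the flags being checked.
struct OpenOptionsSketch {
  bool allow_mmap_reads;
  bool use_direct_reads;
  bool use_direct_io_for_compaction_reads;
};

// Rejects memory-mapped reads combined with either direct-read flag,
// since an mmap'ed file cannot also be read through O_DIRECT.
constexpr OpenCheck ValidateDirectIoOptions(const OpenOptionsSketch& o) {
  return (o.allow_mmap_reads &&
          (o.use_direct_reads || o.use_direct_io_for_compaction_reads))
             ? OpenCheck::kNotSupported
             : OpenCheck::kOk;
}
```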
The same underlying hook (FileSystem::OptimizeForCompactionTableRead)
is also called by BackupEngine when copying SST files, so backups
will also route through O_DIRECT when the new flag is true. This is
intentional -- backups exhibit the same sequential-scan / cache-
pollution pattern as compaction inputs -- and is documented in the
option's Doxygen and the FileSystem interface. Blob-file reads are
not affected by the new flag (BlobFileCache holds its own cached
handles); this is documented as a known limitation with a follow-up
to extend the ephemeral-reader mechanism to BlobFileCache.
Benchmark results
=================
Setup: Ubuntu 24.04 (kernel 7.0.5 OrbStack Linux VM on Apple Silicon),
14 vCPUs, virtio-blk disk. MGLRU disabled (echo 0 >
/sys/kernel/mm/lru_gen/enabled). 14 GB DB (3.5M keys * 4 KB values),
no compression. Each measurement run pinned to a 1 GB cgroup
via `systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0`, so
DB-to-cache ratio is ~14x. Page cache dropped between configs.
Workload: readwhilewriting for 180 s, 4 reader threads on a hot
2,000-key subset (~8 MB, under 1% of cache) + 1 writer thread spreading
overwrites across the full 3.5M-key keyspace
(via `--bgwriter_num=3500000`), throttled at 100 MB/s. Compaction
ran at ~500 MB/s read/write during the buffered run, ~400 MB/s with
direct compaction.
Each run was 3 minutes long; "buffered" is the existing default.
| Config | Throughput | Read P50 | Read P99 | Read P99.9 | Read P99.99 |
|-------------------------------------------|-----------------|---------------|---------------|----------------|----------------|
| buffered (default) | 406 K ops/s | 7.34 us | 79.11 us | 533.14 us | 1647.79 us |
| direct_compaction_read_write | **464 K ops/s** | **6.37 us** | **71.64 us** | **468.28 us** | **1363.91 us** |
| | (+14%) | (-13%) | (-9%) | (-12%) | (-17%) |
| direct_compaction_read_only | 421 K ops/s | 6.99 us | 88.95 us | 504.32 us | 1456.75 us |
| | (+4%) | (-5%) | (+13%) | (-5%) | (-12%) |
| use_direct_reads = true (existing global) | 442 K ops/s | 7.37 us | 50.82 us | 472.23 us | 1626.77 us |
| | (+9%) | (0%) | (-36%) | (-11%) | (-1%) |
The recommended production configuration is
`use_direct_io_for_compaction_reads = true` together with
`use_direct_io_for_flush_and_compaction = true` ("direct reads + writes
for compaction"). It beats the buffered default on every metric:
throughput up 14%, every read percentile from P50 to P99.99 down 9 to
17%. The existing global `use_direct_reads = true` flag helps P99 the
most (-36%) but leaves P50 flat, gains less throughput (+9% vs +14%),
and barely moves P99.99 (-1%). The new compaction-only path is the
better fit for the write-heavy workloads it is designed for.
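For readers checking the percentages in the table, each delta is a plain relative change against the buffered row. A minimal sketch, where `PercentChange` is a hypothetical helper and the inputs are copied from the table above:

```cpp
// Relative change of `value` against `baseline`, in percent.
// Positive means higher than baseline, negative means lower.
constexpr double PercentChange(double baseline, double value) {
  return (value - baseline) / baseline * 100.0;
}

// Inputs taken from the benchmark table (buffered vs
// direct_compaction_read_write).
constexpr double kThroughputDelta = PercentChange(406.0, 464.0);   // K ops/s
constexpr double kP9999Delta = PercentChange(1647.79, 1363.91);    // us
```

Throughput is better when the delta is positive, while latency percentiles are better when it is negative.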
Higher DB-to-cache ratios (the Cassandra blog at
https://lightfoot.dev/direct-i-o-for-cassandra-compaction-cutting-p99-read-latency-by-5x/
reports ~5x P99 improvement at a 43x ratio) should widen the gap
further; the 14x ratio used above is what fit in a single laptop's
disk budget.
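The DB-to-cache ratio quoted above follows from the setup numbers. The sketch below recomputes it from value bytes only; it ignores key and SST metadata overhead and treats the 1G cgroup limit as 2^30 bytes, both of which are assumptions. `DataBytes` and `RatioToCache` are hypothetical helper names.

```cpp
#include <cstdint>

// Payload size: number of keys times value size (key and metadata
// overhead deliberately ignored in this sketch).
constexpr std::uint64_t DataBytes(std::uint64_t num_keys,
                                  std::uint64_t value_size) {
  return num_keys * value_size;
}

// How many times larger the data is than the memory available for caching.
constexpr double RatioToCache(std::uint64_t data_bytes,
                              std::uint64_t cache_bytes) {
  return static_cast<double>(data_bytes) / static_cast<double>(cache_bytes);
}

constexpr std::uint64_t kCgroupBytes = 1ULL << 30;            // MemoryMax=1G
constexpr std::uint64_t kDbBytes = DataBytes(3500000, 4096);  // full keyspace
constexpr std::uint64_t kHotBytes = DataBytes(2000, 4096);    // --num=2000
```

Under these assumptions `RatioToCache(kDbBytes, kCgroupBytes)` comes out a little above 13, in line with the roughly 14x figure.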
Repro recipe
============
Setup:
  - Install OrbStack on macOS or use any Linux host
  - On macOS: orb create -t ubuntu rocksdb-bench
  - Inside the Linux machine:
      apt-get install -y build-essential clang cmake git pkg-config \
        libgflags-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev \
        libzstd-dev rsync
      cmake -DCMAKE_BUILD_TYPE=Release -DPORTABLE=1 -DWITH_GFLAGS=1 \
        -DWITH_TESTS=0 .. && make -j db_bench

Build the source DB (once, unrestricted memory):
    ./db_bench --benchmarks=fillrandom,compact,waitforcompaction,stats \
      --db=/path/to/source_db --num=3500000 --key_size=16 \
      --value_size=4096 --write_buffer_size=16777216 \
      --target_file_size_base=16777216 --max_background_jobs=4 \
      --compression_type=none --cache_size=4194304 \
      --max_bytes_for_level_base=67108864 --disable_wal=1 --sync=0

Per-config measurement (copy source_db -> scratch_db first, then
drop_caches, then run under cgroup):
    sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \
      ./db_bench --use_existing_db=1 \
      --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \
      --threads=5 --duration=180 --statistics=true --histogram=1 \
      --num=2000 --bgwriter_num=3500000 \
      --key_size=16 --value_size=4096 \
      --write_buffer_size=16777216 --target_file_size_base=16777216 \
      --max_background_jobs=4 --compression_type=none \
      --cache_size=4194304 --open_files=200 \
      --skip_stats_update_on_db_open=true \
      --max_bytes_for_level_base=67108864 \
      --benchmark_write_rate_limit=104857600 \
      --rate_limiter_bytes_per_sec=0 \
      --use_direct_reads={true|false} \
      --use_direct_io_for_compaction_reads={true|false} \
      --use_direct_io_for_flush_and_compaction={true|false}

Disable MGLRU first so the kernel uses the classic active/inactive LRU:
    echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled
✅ Claude Code Review
Auto-triggered after CI passed, reviewing commit 5adce3a
Summary
Well-engineered PR that adds a useful
High-severity findings (1):
Findings
🔴 HIGH
H1. Ephemeral readers still use the shared block cache
| Context | Does code execute? | Assumptions hold? | Action needed? |
|---|---|---|---|
| WritePreparedTxnDB | YES (compaction runs) | YES (no read_callback in compaction) | Safe |
| ReadOnly DB | NO (no compaction) | N/A | Safe |
| CompactionService | YES (inherits CreateInputIterator) | YES | Safe, auto-inherited |
| User-defined timestamps | YES | YES (standard iterator path) | Safe |
| MemPurge | NO (memtable-to-memtable) | N/A | Safe |
| BlobDB | Excluded by design | N/A | Documented TODO |
| Concurrent compactions | YES (multiple jobs) | YES (each has own ephemeral readers) | FD usage may increase |
| FIFO / Universal compaction | YES | YES (same MakeInputIterator path) | Safe |
Verified Non-Issues (disproven during debate)
- Range tombstone use-after-free: Multiple agents flagged `FragmentedRangeTombstoneIterator` holding a pointer to `rep_->internal_comparator` (range_tombstone_fragmenter.h:319). After deep analysis: in `LevelIterator`, `ClearRangeTombstoneIter()` destroys the old range tombstone iterator BEFORE `SetFileIterator()` destroys the old file iterator (and thus the ephemeral reader) at version_set.cc:1770. For L0 files, both the file iterator and range-del aggregator live for the full compaction lifetime. The tombstone data itself uses `shared_ptr<FragmentedRangeTombstoneList>`. Not a bug.
- table_filter resource leak: When `table_filter` rejects, `result = NewEmptyInternalIterator`, which inherits from `Cleanable` via `InternalIteratorBase` and supports `RegisterCleanup`. The condition `s.ok() && ephemeral_reader && result != nullptr` is satisfied since the empty iterator is non-null. Not a leak.
- no_io + fresh_table_reader_owner contract violation: The PR handles this with both `assert(!no_io)` and a defensive `Status::Incomplete` return in release builds (table_cache.cc:231-237). Properly handled.
- Three-clause condition logic error: When `use_direct_reads=true`, all reads are already O_DIRECT, so skipping ephemeral readers (the `!db_options_.use_direct_reads` clause) is correct. When a custom FS strips the flag, skipping ephemeral readers avoids wasteful buffered re-opens. Correct logic.
- Error handling assertion removal: The old `assert(result == nullptr)` was actually incorrect: range-del processing CAN set `s` to non-OK after `result` is assigned (via `new_range_del_iter->status()` at table_cache.cc:380). The PR's explicit cleanup of `result` in the error path is a bug fix.
Positive Observations
- Thorough testing: Four compaction tests covering the off-path, end-to-end O_DIRECT, LevelIterator with tombstones, and concurrent stress.
- Clean option plumbing: The option is properly wired through ImmutableDBOptions, serialization, C API, db_bench, db_stress, and crash tests.
- Defensive coding: The `assert(!no_io)` + runtime check, the OOM-safe cleanup ordering, and the detailed inline comments show careful engineering.
- PosixFileSystem bug fix: The fix to call the base implementation is a meaningful correctness improvement.
- Well-written documentation: The option comment in `options.h` is exemplary: it covers scope, limitations, platform considerations, and usage guidance.
- Error path bug fix: The new cleanup of non-null `result` on error fixes a latent bug in the existing code.
Adds a new DBOption `use_direct_io_for_compaction_reads` (default false). When on, compaction-input SST files are opened with `O_DIRECT` so the sequential read-once data from compaction doesn't pollute the OS page cache and evict the hot user-read working set. User reads keep going through the buffered fast path. This protects user-read tail latency on write-heavy workloads without forcing user reads onto the existing global `use_direct_reads` knob (which pays in throughput and P50; see the bench below).

The interesting bit is that just flipping the FileOptions returned by `FileSystem::OptimizeForCompactionTableRead` doesn't actually trigger `O_DIRECT` at the kernel level. The TableCache (and `FileMetaData::pinned_reader`) is already holding buffered handles opened at flush time or at `DB::Open` via `LoadTableHandlers`. When compaction asks for an iterator, it gets back the cached buffered handle and the kernel never sees the `O_DIRECT` flag.

So this PR also adds a small bypass path:

- `TableCache::FindTable` / `NewIterator` learn an `open_ephemeral_table_reader` mode. When set, the pinned-reader fast path and the shared cache are skipped, `GetTableReader` is called directly with the caller's FileOptions, and ownership of the freshly opened TableReader is handed back via a `unique_ptr`. The iterator takes ownership via `RegisterCleanup` and frees the reader on destruction.
- `VersionSet::MakeInputIterator` and `LevelIterator` plumb the flag through both L0 and L1+ compaction-input paths.
- `CompactionJob::ProcessKeyValueCompaction` turns the bypass on when `use_direct_io_for_compaction_reads` is set, the global `use_direct_reads` is off, and `OptimizeForCompactionTableRead` produced `use_direct_reads=true` in the compaction-read FileOptions.

The option is opt-in: when off, nothing changes for existing users. When on, only the compaction-input opens take the bypass path; user reads keep hitting the TableCache and the buffered fast path normally.
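The ownership handoff in the bypass path (a freshly opened reader returned through a `unique_ptr`, then tied to the iterator's lifetime by a cleanup callback) can be illustrated with a self-contained sketch. Everything here is a hypothetical stand-in: `ReaderSketch`, `IteratorSketch`, and `NewEphemeralIterator` are not RocksDB classes, and this `RegisterCleanup` takes a `std::function`, which is a simplification of RocksDB's `Cleanable::RegisterCleanup` (function pointer plus two `void*` arguments).

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a table reader; counts live instances so the
// lifetime is observable.
struct ReaderSketch {
  static int live;
  ReaderSketch() { ++live; }
  ~ReaderSketch() { --live; }
  ReaderSketch(const ReaderSketch&) = delete;
  ReaderSketch& operator=(const ReaderSketch&) = delete;
};
int ReaderSketch::live = 0;

// Hypothetical stand-in for an iterator that runs its registered cleanups
// when it is destroyed.
class IteratorSketch {
 public:
  IteratorSketch() = default;
  IteratorSketch(const IteratorSketch&) = delete;
  IteratorSketch& operator=(const IteratorSketch&) = delete;
  ~IteratorSketch() {
    for (auto& fn : cleanups_) fn();
  }
  void RegisterCleanup(std::function<void()> fn) {
    cleanups_.push_back(std::move(fn));
  }

 private:
  std::vector<std::function<void()>> cleanups_;
};

// Opens a fresh reader (no cache involved) and ties its lifetime to the
// returned iterator.
std::unique_ptr<IteratorSketch> NewEphemeralIterator() {
  std::unique_ptr<ReaderSketch> reader(new ReaderSketch());
  std::unique_ptr<IteratorSketch> iter(new IteratorSketch());
  ReaderSketch* raw = reader.get();
  // Register first, release second: if registration throws, the
  // unique_ptr still owns the reader and frees it.
  iter->RegisterCleanup([raw]() { delete raw; });
  reader.release();
  return iter;
}
```

Destroying the returned iterator runs the cleanup and frees the reader, so no reader outlives the scan that opened it.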
There's also a small db_bench helper in the same PR: a new `--bgwriter_num` flag that lets the writer thread in `readwhilewriting` (and the other "while writing" variants) spread its puts across `[0, bgwriter_num)` instead of `[0, num)`. Without this the readers and writer share a key range and you can't have both a hot read subset and meaningful compaction work; this lets you have both.

Benchmark

Setup: Ubuntu 24.04 (kernel 7.0.5, OrbStack Linux VM on Apple Silicon), 14 vCPUs, virtio-blk disk, ext4. MGLRU disabled (`echo 0 > /sys/kernel/mm/lru_gen/enabled`) so the kernel uses the classic active/inactive LRU. 14 GB DB (3.5M keys × 4 KB values), no compression. Each measurement run is pinned to a 1 GB cgroup via `systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0`. Page cache is dropped between configs. db_bench is a Release build.

Workload: `readwhilewriting` for 120s. 4 reader threads doing random reads over a hot key subset, plus 1 writer thread spreading overwrites across the full 3.5M-key keyspace (via `--bgwriter_num=3500000`) throttled at 200 MB/s, so there's continuous compaction running while the readers go.

The size of the hot reader subset relative to available page cache controls how visible the optimization is. The Cassandra blog (Lightfoot 2026) documented the same thing: biggest wins when the hot set is big enough to actually compete for cache, smaller wins when the hot set trivially fits, neutral when the hot set is way bigger than cache. So I ran two hot-set sizes.

Small hot set: ~30 MB (~3% of the 1 GB cgroup), N=5 iterations, mean (CV)

`--num=7500`. The hot set is small enough that the page cache holds it without much trouble even under compaction, so the wins here are real but on the modest side.

CV is 2.4–3.3% on the optimized configs (8.2% on buffered), so the deltas are real. With a hot set this small, the existing `use_direct_io_for_flush_and_compaction` knob is already doing most of the work. The new flag's main extra contribution here is P99.99 (combined wins it by ~2 points vs writes-only-alone). Worth noting: the new flag alone (without the existing write-side flag) improves P99.99 but regresses P99 by 25% on this small-hot-set workload, because direct compaction reads lose kernel readahead and compaction-output writes are still hitting the page cache. That regression goes away once you combine with the existing write-side flag, or once the hot set is bigger (see next table). So if you're using just one knob, use the existing one. If you're using this PR's flag, pair it with `use_direct_io_for_flush_and_compaction=true`.

Larger hot set: ~400 MB (~40% of cache), N=5 iterations, mean (CV)

`--num=100000`. This is the case the Cassandra blog calls out: a hot set big enough to actually fight compaction for cache. Their analogous setup (1M hot partitions, ~33% hot/cache) reported 1.93× p99 improvement. Numbers here are the headline:

Combined config gets a 3.68× p99.99 win, 1.86× p99, p50 down 23%, throughput up 52%. Same shape as the Cassandra blog's 1.93× p99 result. The improvement just lands at deeper percentiles for us because RocksDB's baseline data path is roughly 65× faster than Cassandra's (their buffered p99 was 35 ms, ours is 0.54 ms), so the cache-miss tail is further out.

A few things worth calling out from this table:

- `direct_compaction_writes_only` alone, and combined p99.99 is 3× better. The existing knob alone gives a fairly modest +7% throughput / -19% p99.99 in this case, so there's a clear gap that the new flag fills.
- `use_direct_reads=true` (the existing global flag) actually regresses P50 by 14.5% in this workload: taking user reads off the page cache hurts you when the hot data could have been cached. It also gets the worst throughput of any direct config. It's not an equivalent way to get these gains.

`compaction_readahead_size` matters when this flag is on

Direct I/O bypasses kernel readahead, so RocksDB's own `DBOptions::compaction_readahead_size` becomes the only prefetch the iterator has. The default of 2 MB is enough and real users will get it automatically. But `db_bench`'s `--compaction_readahead_size` CLI default is 0, which defeats prefetch and makes direct compaction look slower than it actually is. If you're reproducing the numbers above, pass `--compaction_readahead_size=2097152` (or larger).

Summary

- `use_direct_io_for_compaction_reads=true` + `use_direct_io_for_flush_and_compaction=true`. Strongest configuration at every percentile and throughput in both benches.
- `use_direct_io_for_flush_and_compaction`, which handles compaction-write cache pollution. They address different sources of pollution and compose. The gap between "combined" and "writes-only-alone" is 17 percentage points on p99.99 in the small-hot-set bench and 54 points in the larger one, so the new flag is contributing real value, especially as the hot set grows.

Reproducing
Any Linux host (or a Linux VM on macOS via OrbStack / Multipass / lima):

Build the source DB once, unrestricted memory:

For each config, copy `source_db -> scratch_db`, run `sync && echo 3 > /proc/sys/vm/drop_caches`, then:

    sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \
      ./db_bench --use_existing_db=1 \
      --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \
      --threads=5 --duration=120 --statistics=true --histogram=1 \
      --num=7500 --bgwriter_num=3500000 \
      --key_size=16 --value_size=4096 \
      --write_buffer_size=16777216 --target_file_size_base=16777216 \
      --max_background_jobs=4 --compression_type=none \
      --cache_size=4194304 --open_files=200 \
      --skip_stats_update_on_db_open=true \
      --max_bytes_for_level_base=67108864 \
      --benchmark_write_rate_limit=209715200 \
      --compaction_readahead_size=2097152 \
      --rate_limiter_bytes_per_sec=0 \
      --use_direct_reads={true|false} \
      --use_direct_io_for_compaction_reads={true|false} \
      --use_direct_io_for_flush_and_compaction={true|false}

For the larger hot-set table, change `--num=7500` to `--num=100000`.

The five configs in the tables:

- `buffered`: all three flags false.
- `direct_compaction_writes_only`: `use_direct_io_for_flush_and_compaction=true`, the other two false. This is what users have today without this PR.
- `direct_compaction_read_only`: `use_direct_io_for_compaction_reads=true`, the other two false.
- `direct_compaction_read_write`: `use_direct_io_for_compaction_reads=true`, `use_direct_io_for_flush_and_compaction=true`, `use_direct_reads=false`. Recommended.
- `direct_all`: `use_direct_reads=true`, `use_direct_io_for_flush_and_compaction=true`, `use_direct_io_for_compaction_reads=false`.