Add opt-in coroutines feature for multi-level async MultiGet (#218)
Merged
Adds a new cargo feature, `coroutines`, that compiles RocksDB with
USE_COROUTINES=1 + USE_FOLLY=1 and links against folly. When enabled
together with `io-uring` and ReadOptions::set_async_io(true), MultiGet
issues parallel io_uring reads across SST files in different LSM
levels, not just within a single level. Per the RocksDB Asynchronous
IO blog post, on remote/slow storage this drops MultiGet latency from
~775 to ~508 us/op (~34% reduction) at the cost of ~6-15% extra CPU.
Linux only. The feature panics at build time on other targets.
Changes:
librocksdb-sys/Cargo.toml, Cargo.toml
New `coroutines` feature in both crates.
librocksdb-sys/build.rs
- validate_coroutines_target() rejects non-Linux targets early.
- coroutines_compile_config() sets USE_COROUTINES/USE_FOLLY/
FOLLY_NO_CONFIG/HAVE_CXX11_ATOMIC, adds -fcoroutines on GCC,
silences folly-induced warnings, and adds include paths for
folly + its 8 dependencies.
- coroutines_link_config() emits the cargo:rustc-link-* directives
for folly, boost (7 components), double-conversion, libevent,
libsodium, fmt, glog, and gflags. Glog and gflags are linked
dynamically because folly's getdeps does not produce static
archives for them, with rpath entries embedded so the final
binary can find them without LD_LIBRARY_PATH.
- Link config runs in main() (not inside build_rocksdb) so it also
applies when ROCKSDB_LIB_DIR points at an externally-built
librocksdb compiled with USE_COROUTINES.
- Dependency directories are resolved via glob since folly's
getdeps install layout uses commit-hash-suffixed dir names.
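
A minimal sketch of the early target check described above, kept free of
side effects so it can be unit-tested. The function name mirrors the one
in build.rs, but the body, the error wording, and the
`coroutine_defines` helper are assumptions for illustration. A build
script would read the target from `CARGO_CFG_TARGET_OS` and panic on
`Err`.

```rust
/// Rejects every target except Linux.
/// `target_os` is the value a build script reads from `CARGO_CFG_TARGET_OS`.
fn validate_coroutines_target(target_os: &str) -> Result<(), String> {
    if target_os == "linux" {
        Ok(())
    } else {
        Err(format!(
            "the `coroutines` feature is Linux only; target os is `{}`",
            target_os
        ))
    }
}

/// Hypothetical helper: the preprocessor macros this change says are set
/// for a coroutine build. Only the names come from the description; how
/// they are passed to the C++ compiler is not shown here.
fn coroutine_defines() -> Vec<&'static str> {
    vec![
        "USE_COROUTINES",
        "USE_FOLLY",
        "FOLLY_NO_CONFIG",
        "HAVE_CXX11_ATOMIC",
    ]
}
```

Returning a `Result` instead of panicking inside the check keeps the
decision in one place (main), which matches the later fix that removes
the duplicate inner call.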
scripts/build_folly.sh
Helper that wraps RocksDB's own `make build_folly` target, which
invokes folly's getdeps.py to build folly + 8 dependencies at the
commit pinned by librocksdb-sys/rocksdb/folly.mk:FOLLY_COMMIT_HASH.
Prints the install path to set ROCKSDB_FOLLY_INSTALL_PATH to.
src/lib.rs
Adds built_with_coroutines() runtime helper that returns
cfg!(feature = "coroutines"). Documented caveat: if linking
against an externally-built librocksdb via ROCKSDB_LIB_DIR, this
may not match what that library was actually compiled with.
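
For reference, the helper is small enough to show in full. The first
function follows the description above; `multiget_mode_description` is a
hypothetical caller added only to illustrate the logging/diagnostics use
case.

```rust
/// Reports whether this crate was compiled with the `coroutines` cargo
/// feature. This reflects the crate's own feature flag only; it cannot
/// see how an externally-built librocksdb was compiled.
pub fn built_with_coroutines() -> bool {
    cfg!(feature = "coroutines")
}

/// Hypothetical helper: a one-line summary suitable for a startup log.
pub fn multiget_mode_description() -> &'static str {
    if built_with_coroutines() {
        "coroutines enabled: async MultiGet may read across LSM levels in parallel"
    } else {
        "coroutines disabled: async MultiGet uses the non-coroutine path"
    }
}
```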
tests/test_coroutines.rs
Three smoke tests that run in both feature configurations:
- built_with_coroutines() matches the feature flag.
- async_io=true MultiGet across multiple LSM levels returns
results identical to a loop of single Gets.
- same for a same-level batch.
.github/workflows/coroutines.yml
Ubuntu CI job that caches folly keyed on FOLLY_COMMIT_HASH, then
builds and runs tests with --features coroutines,io-uring.
README.md
New "Async MultiGet with C++20 Coroutines" section under Advanced
Features with the Meta benchmark table, build prerequisites, and
runtime constraints (dynamic glog/gflags, mt_static incompatible).
The optimize_multiget_for_io ReadOption is not yet exposed at the Rust
level - that depends on facebook/rocksdb#14752 merging and a release
being cut. For coroutine builds the C++ default of true is the right
choice for most workloads anyway.
Five fixes responding to review comments on the previous commit:

1. Drop `cargo:rustc-link-arg=-Wl,-rpath,...` for glog and gflags
   (rust-lang/cargo#9554): these directives only apply to artifacts of
   the crate that emits them, not to downstream binaries. Embedding
   rpath that only covers our own test binaries was misleading - CI
   tests would pass while user binaries would fail at startup with
   "libglog.so: cannot open shared object file". Instead, expose the
   discovered lib directories as `cargo:folly_glog_libdir` and
   `cargo:folly_gflags_libdir` (accessible to downstream build scripts
   as `DEP_ROCKSDB_FOLLY_GLOG_LIBDIR` and
   `DEP_ROCKSDB_FOLLY_GFLAGS_LIBDIR`) and document the
   LD_LIBRARY_PATH / rpath / system-install options in the README.
2. `resolve_folly_dep`: panic with a clear message when multiple
   matching directories exist (typically a stale install from a prior
   FOLLY_COMMIT_HASH mixed with the current one). Previously we picked
   the first glob entry non-deterministically, which could silently
   link the wrong version.
3. `lib_or_lib64`: add a comment documenting the assumption that
   exactly one of `lib/` or `lib64/` holds a given dependency, derived
   from the folly version at the pinned FOLLY_COMMIT_HASH.
4. Rewrite `multi_get_async_io_matches_serial_get` to explicitly build
   and verify a multi-level LSM layout via `compact_range_opt` with
   `CompactOptions::set_target_level`, then assert via
   `rocksdb.num-files-at-levelN` that data actually spans multiple
   levels before running the MultiGet. The previous version called
   `compact_range(None, None)` which collapses everything to the
   bottom level, so the multi-level dispatch path was never exercised
   despite the test name suggesting otherwise.
5. Rename `built_with_coroutines_matches_feature_flag` to
   `built_with_coroutines_helper_is_callable` and drop the tautology
   (`built_with_coroutines()` is `cfg!(feature = "coroutines")`, so
   asserting equality between the two tested nothing). Verify
   call-stability instead and document why the function is still
   useful (single source of truth for logging/diagnostics).

Also add a 90-minute timeout-minutes to the CI job so a cold-cache
folly build cannot exceed GHA's default 6-hour limit silently.
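
A sketch of how a downstream binary crate's build.rs might consume the
two exported directories from fix 1. The `DEP_ROCKSDB_*` variable names
and the `-Wl,-rpath,` argument come from this commit; the function names
and the split into a pure helper plus an env-reading wrapper are
assumptions made so the logic can be tested without cargo.

```rust
use std::env;

/// Turns whichever folly libdir variables are set into
/// `cargo:rustc-link-arg` lines that embed an rpath.
/// `lookup` stands in for `std::env::var` so the logic is testable.
fn rpath_directives(lookup: impl Fn(&str) -> Option<String>) -> Vec<String> {
    [
        "DEP_ROCKSDB_FOLLY_GLOG_LIBDIR",
        "DEP_ROCKSDB_FOLLY_GFLAGS_LIBDIR",
    ]
    .iter()
    .filter_map(|key| lookup(*key))
    .map(|dir| format!("cargo:rustc-link-arg=-Wl,-rpath,{}", dir))
    .collect()
}

/// What the `main` of a binary crate's build.rs would call.
/// Prints nothing when the variables are absent (feature disabled).
pub fn emit_rpath_from_env() {
    for line in rpath_directives(|key| env::var(key).ok()) {
        println!("{}", line);
    }
}
```

Because `rustc-link-arg` only affects artifacts of the crate that emits
it, this belongs in the crate that produces the shipped binary.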
Investigated and fixed all 10 issues raised in the second review:

#1 - CI cache key omitted OS image version. Pin runs-on to
ubuntu-24.04 (was ubuntu-latest) and include the image name in the
cache key. Folly is built against the host's glibc and libstdc++; a
silent ubuntu-latest rollover with a cache hit would produce binaries
with mismatched ABI vs the cargo build step.

#2 - getdeps.py default scratch dir was outside the cached path.
Investigated buildopts.py:setup_build_options upstream. Confirmed:
with no --scratch-path argument and no Facebook-internal mkscratch,
getdeps falls back to /tmp/fbcode_builder_getdeps-<munged-cwd>. This
is outside librocksdb-sys/rocksdb/third-party/folly, so my original
cache covered only source - on a cache hit, show-inst-dir would return
a /tmp/... path that doesn't exist on disk. Fixed by rewriting
build_folly.sh to bypass 'make build_folly' and call getdeps.py
directly with --scratch-path=<workspace>/librocksdb-sys/folly-build/.
Replicates the few things make build_folly did:
- Clone folly + reset to FOLLY_COMMIT_HASH
- Apply two upstream-required perl patches (idempotent)
- Run getdeps with CXXFLAGS=-DHAVE_CXX11_ATOMIC and GETDEPS_USE_WGET=1
- patchelf libglog.so to embed libgflags rpath (matches folly.mk).
CI now caches both the scratch dir AND the folly source checkout under
a path that's stable and inside the workspace.

#3 - freebsd early-return interaction with coroutines_link_config.
Practically blocked by validate_coroutines_target panicking earlier,
but the comment claimed the link config 'also applies when
ROCKSDB_LIB_DIR points at an externally-built librocksdb'. Updated the
comment to acknowledge that this branch only handles Linux today and
that relaxing the target validation would require revisiting.

#4 - Test compaction race. With disable_auto_compactions=true and L0
trigger=64, no background work runs concurrently, so the original test
was probably safe. Added wait_for_compact between every
put_batch/compact_range_opt phase as defense-in-depth so the layout is
settled before level_layout() queries it. Cheap: each wait is a no-op
with no scheduled work.

#5 - lib_or_lib64 only checked directory existence, not file presence.
Replaced with libdir_containing(prefix, lib_name), which probes for
lib<name>.{so,a}* in each candidate dir via glob and panics with a
clear error if neither contains the library. Catches folly's habit of
producing an empty lib64/ on Debian-family distros (or vice versa) at
config time instead of via a confusing 'cannot find -l<x>' from the
linker later.

#6 - README rpath snippet could mislead users into adding it in a
library crate. Added explicit note: 'this must live in the crate that
produces the binary you're shipping (a [[bin]] target), NOT in an
intermediate library crate' along with the reason (rustc-link-arg
doesn't propagate through transitive library dependencies, the same
problem we documented above).

#7 - built_with_coroutines doc made an unverified 'still works for
scans' claim. That claim is probably true per RocksDB's blog but isn't
tested by this PR. Trimmed to just the MultiGet behavior the doc can
actually substantiate.

#8 - Boost component list lacked a comment about which folly version
it matches. Added a comment pointing at folly.mk's PLATFORM_LDFLAGS
and noting how to react when a future FOLLY_COMMIT_HASH bump
invalidates a component ('cannot find -lboost_<x>' at link time
signals to trim).

#9 - validate_coroutines_target was called twice (in main() and again
inside coroutines_compile_config). Removed the inner call and
documented the precondition. The outer call in main() runs first so
the inner call was redundant.

#10 - built_with_coroutines_helper_is_callable was a no-op. Replaced
with built_with_coroutines_matches_feature_flag. Yes the assertion is
currently tautological (the function literally is cfg!(feature)), but
acknowledged in the doc - the test catches a refactor regression
if/when we later wire the value to a runtime symbol (e.g. after
upstream rocksdb#14752 merges and exposes
rocksdb_compiled_with_coroutines).
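
The file-presence probe from #5 can be sketched with the standard
library alone. This is an approximation under stated assumptions: the
real build.rs is described as using glob and panicking, while this
version scans with `std::fs::read_dir`, matches on the file-name prefix,
and returns a `Result`. The `dir_has_library` helper is hypothetical.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// True if `dir` holds a file whose name starts with `lib<name>.so` or
/// `lib<name>.a`, approximating the glob `lib<name>.{so,a}*`.
fn dir_has_library(dir: &Path, lib_name: &str) -> bool {
    let shared = format!("lib{}.so", lib_name);
    let archive = format!("lib{}.a", lib_name);
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        Err(_) => return false, // missing or unreadable directory
    };
    entries.filter_map(|entry| entry.ok()).any(|entry| {
        let file_name = entry.file_name();
        let file_name = file_name.to_string_lossy();
        file_name.starts_with(&shared) || file_name.starts_with(&archive)
    })
}

/// Returns `<prefix>/lib` or `<prefix>/lib64`, whichever actually
/// contains the library, so an empty sibling directory is never picked.
fn libdir_containing(prefix: &Path, lib_name: &str) -> Result<PathBuf, String> {
    for candidate in ["lib", "lib64"].iter() {
        let dir = prefix.join(candidate);
        if dir_has_library(&dir, lib_name) {
            return Ok(dir);
        }
    }
    Err(format!(
        "neither lib/ nor lib64/ under {} contains lib{}.so* or lib{}.a*",
        prefix.display(),
        lib_name,
        lib_name
    ))
}
```

Failing here, at configuration time, replaces the later and less
readable 'cannot find -l<x>' linker error.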
The pinned folly commit references symbols from liburing 2.6
(`IORING_CQE_F_BUF_MORE`, `IOU_PBUF_RING_INC`,
`io_uring_buf_ring_head`) and 2.7+ (the entire `io_uring_zcrx_*`
zero-copy receive API in
`folly/io/async/IoUringZeroCopyBufferPool.cpp`). Ubuntu 24.04 LTS only
ships liburing 2.5 via apt, causing folly to fail at compile time with
errors like "struct io_uring_zcrx_rq has incomplete type".

Two changes:

1. CI workflow: run the whole job inside an `ubuntu:25.10` Docker
   container. Ubuntu 25.10 ships liburing 2.11 via apt, comfortably
   above the required 2.7. Avoids any manual build step in CI.
   `runs-on` stays `ubuntu-24.04` (the host); `container:` makes every
   step run inside the newer image. Containers start minimal, so a
   bootstrap step installs git + curl + ca-certificates + sudo before
   `actions/checkout` and `setup-rust-toolchain` need them. Cache key
   suffix bumped from `-v2` to `-v3` and the embedded image name
   updated to `ubuntu-25.10` so prior caches (built against 24.04's
   liburing 2.5) are invalidated.
2. `scripts/build_folly.sh`: keep a from-source liburing fallback for
   local users on older distros. It now checks the system liburing
   version via pkg-config and only builds 2.9 from source if the
   system version is < 2.7. On Ubuntu 25.10+ (which is what CI uses)
   the check passes and the from-source build is skipped. When the
   build does run, the resulting headers/libs are exported via
   PKG_CONFIG_PATH/CPATH/LIBRARY_PATH/LD_LIBRARY_PATH so both folly's
   CMake and rust-rocksdb's `io-uring` pkg-config lookup pick it up.

Also fixes a cosmetic CI warning: `save-if` is not a valid input for
`actions-rust-lang/setup-rust-toolchain@v1` (the right name is
`cache-save-if`); this was a copy-paste error from the original
workflow and produced a warning on every run.
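
The version gate in change 2 is implemented in shell, but the comparison
itself is worth spelling out because a plain string comparison would
rank "2.11" below "2.7". A Rust sketch of the same decision, with
hypothetical function names; treating a missing or unparseable version
as "build from source" is an assumption, not something the script is
documented to do.

```rust
/// Parses "2.11" or "2.5" into (major, minor); a bare "2" means (2, 0).
/// Returns None when a component is not a number.
fn parse_major_minor(version: &str) -> Option<(u32, u32)> {
    let mut parts = version.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next().unwrap_or("0").parse().ok()?;
    Some((major, minor))
}

/// Decides whether liburing must be built from source.
/// `system_version` is what `pkg-config --modversion liburing` printed,
/// or None if pkg-config did not find the library.
fn needs_source_build(system_version: Option<&str>) -> bool {
    const REQUIRED: (u32, u32) = (2, 7);
    match system_version.and_then(parse_major_minor) {
        // Tuples compare component by component, so (2, 11) > (2, 7).
        Some(found) => found < REQUIRED,
        // Assumption: no usable system liburing means we must build it.
        None => true,
    }
}
```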
`build_folly.sh` (mirroring RocksDB's folly.mk) sets GETDEPS_USE_WGET=1, which makes folly's getdeps download sources via wget instead of Python's built-in urllib. RocksDB uses this because some mirrors are unreliable with urllib's default handling, and the shipping fallback mirror script also assumes wget. ubuntu:25.10 minimal does not ship wget. Result: folly's first download attempt (boost-1.83.0.tar.gz) fails with `[Errno 2] No such file or directory: 'wget'` and getdeps retries five times before giving up. Add wget to the apt install list. Also add wget to the local prereq check in build_folly.sh so users on minimal hosts see a clear error before getdeps does.
GCC 15 defaults to `-std=gnu23` for C, which makes empty
parameter lists `()` mean "no arguments" instead of the pre-C23
"unspecified arguments" semantic. Folly's pinned libunwind
(f081cf4...) was written under the older rule and its test files
contain code like:
return func(s); // func declared as void *(*func)()
which gcc-15 rejects with "too many arguments to function 'func';
expected 0, have 1". The libunwind library itself builds OK but
the in-tree tests folly's getdeps tries to build do not.
Install gcc-14 and g++-14 alongside the default gcc-15, then point
the cc/gcc/c++/g++ alternatives at gcc-14 for the rest of the job.
gcc-14 defaults to `-std=gnu17` where this is still permitted.
gcc-14 and gcc-15 share the same libstdc++ ABI on Ubuntu so the
subsequent cargo build (which links folly's static archives into
the test binaries) is unaffected by the switch.
folly's getdeps tree uses two naming conventions:
- The *project being built* (folly itself) installs to
`<install_root>/folly` with no suffix.
- The project's *dependencies* install to
`<install_root>/<dep>-<hash>` where the hash captures manifest+ctx.
My `resolve_folly_dep` only globbed for the hashed pattern, so
`coroutines_compile_config("folly", ...)` and the matching
`coroutines_link_config` call both failed with:
thread 'main' panicked at librocksdb-sys/build.rs:785:15:
could not find `folly-*` under .../folly-build/installed;
did scripts/build_folly.sh finish successfully?
even though folly itself had built and installed correctly to
`.../folly-build/installed/folly/lib/libfolly*.a`.
Fix: probe for the unsuffixed dir first; fall back to globbing
`<name>-*` for the dependency case. Error message updated to
mention both shapes.
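
A sketch of the two-shape lookup, combined with the earlier
multiple-match check. The function name matches build.rs, but the body
is an assumption: it scans with `std::fs::read_dir` rather than glob and
returns a `Result` where the real code panics. Note the prefix match is
a simplification and would also accept a directory such as
`<name>-extra-<hash>`.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Finds the install directory for `name` under a getdeps install root.
fn resolve_folly_dep(install_root: &Path, name: &str) -> Result<PathBuf, String> {
    // Shape 1: the project being built installs to `<install_root>/<name>`.
    let exact = install_root.join(name);
    if exact.is_dir() {
        return Ok(exact);
    }

    // Shape 2: dependencies install to `<install_root>/<name>-<hash>`.
    let prefix = format!("{}-", name);
    let mut matches: Vec<PathBuf> = fs::read_dir(install_root)
        .map_err(|e| format!("cannot read {}: {}", install_root.display(), e))?
        .filter_map(|entry| entry.ok())
        .map(|entry| entry.path())
        .filter(|path| {
            path.is_dir()
                && path
                    .file_name()
                    .map_or(false, |n| n.to_string_lossy().starts_with(&prefix))
        })
        .collect();
    matches.sort(); // deterministic order for the error message

    match matches.len() {
        0 => Err(format!(
            "could not find `{}` or `{}-*` under {}",
            name,
            name,
            install_root.display()
        )),
        1 => Ok(matches.remove(0)),
        // Usually a stale install from an older FOLLY_COMMIT_HASH.
        _ => Err(format!(
            "multiple `{}-*` directories under {}: {:?}",
            name,
            install_root.display(),
            matches
        )),
    }
}
```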
cargo build --release --features coroutines,io-uring now succeeds
(folly compiled, linked into the test binary), but cargo nextest fails
immediately at the "list tests" step:

    target/release/deps/rust_rocksdb-...: error while loading shared
    libraries: libglog.so.0: cannot open shared object file: No such
    file or directory

This is the exact rpath-doesn't-propagate situation the README's
"Runtime constraints" section documents: cargo:rustc-link-arg from
librocksdb-sys doesn't apply to downstream test binaries (per
rust-lang/cargo#9554), and folly's getdeps only produces glog/gflags
as .so files (no static archives). The nextest-spawned test binary
needs to find them at runtime.

Set LD_LIBRARY_PATH to <install_root>/glog-<hash>/lib(64) and
<install_root>/gflags-<hash>/lib(64) in $GITHUB_ENV right after the
folly install path step, so every subsequent step (cargo build,
nextest, doc tests) inherits it. This matches the first option the
README recommends to downstream users for the same problem.
Update the "Async MultiGet with C++20 Coroutines" section to reflect
what we actually learned getting the CI green on this branch. The
prior version had several inaccuracies and an overgeneralized
performance claim.
Performance section
-------------------
- Stop presenting the 1292/775/508 us/op numbers as if they apply
uniformly. They came from Meta's internal warm-storage flash
(ws.flash.ftw3preprod1), which has ~100-1000x the per-read latency
of a modern local NVMe.
- Drop the prior "the gain shrinks on local NVMe" sentence as an
unverified extrapolation, and explicitly note that Meta has NOT
published a local-NVMe equivalent. Frame the underlying reasoning
(async_io hides per-read latency; less latency to hide means less
to save) as reasoning, not measurement.
- Give a workload heuristic for when this is worth turning on:
remote/network storage, or many SST files spanning multiple LSM
levels per MultiGet batch, or both.
Prerequisites section
---------------------
- Document the actual liburing version requirement (>= 2.7) and what
each common distro ships. Note that build_folly.sh auto-builds
liburing 2.9 from source when the system version is too old (so
Ubuntu 24.04 LTS users don't have to do anything special, while
Ubuntu 25.10+ uses the apt-shipped liburing 2.11).
- Document the GCC 15 incompatibility. The pinned folly libunwind
uses K&R-style empty parameter lists that gcc-15 rejects under its
default -std=gnu23. Tell users to use gcc-14, clang, or any GCC
<= 14. Point them at the CI workflow as the working example.
- Replace the vague "folly + 8 transitive deps" line with the actual
apt package list the build needs (build-essential, cmake,
ninja-build, double-conversion-dev, libssl-dev, liburing-dev,
patchelf, wget, etc.). wget specifically is required because
RocksDB's folly.mk sets GETDEPS_USE_WGET=1.
- Fix the description of scripts/build_folly.sh. It used to wrap
RocksDB's `make build_folly`, but the script in this branch
invokes getdeps.py directly with --scratch-path so install
artifacts land at a predictable workspace-relative location
(librocksdb-sys/folly-build/installed/), not in a /tmp scratch
dir.
Runtime constraints section
---------------------------
- Reword the LD_LIBRARY_PATH option with an actual concrete glob
example (`ls -d ""/glog-*/lib* | head -1`)
instead of the prior <hash> placeholder that users would have had
to manually fill in.
- Point at the CI workflow as the canonical worked example for
setting LD_LIBRARY_PATH, since the workflow's "Export
ROCKSDB_FOLLY_INSTALL_PATH and LD_LIBRARY_PATH" step does exactly
this (we hit this exact runtime failure in CI and added that step
to fix it).
- Note the folly install is large (~2 GB) and slow to rebuild, and
reference the workflow's cache key example.
Other
-----
- Soften the "Experimental" tag with a more honest qualifier ("builds
and tests in CI but has not been exercised on production
workloads from this crate") rather than just the unqualified
asterisk.
Three corrections:

1. Performance table row was wrong about which config gets 775 us/op.
   The original wording said 'single-level parallel reads (works
   without this feature)' which is false - 775 us/op requires the
   coroutines feature AND async_io=true AND
   optimize_multiget_for_io=false. Without coroutines, both parallel
   paths in version_set.cc are short-circuited by the
   !using_coroutines() check, and MultiGet falls back to
   one-file-at-a-time. Now the table labels each row with the exact
   ReadOptions combo.
2. optimize_multiget_for_io paragraph now explains the flag as a
   CPU/latency knob within the coroutine-enabled space rather than an
   absent setter. Both 'on' (multi-level) and 'off' (within-level
   only) need the coroutines feature compiled in; the flag chooses
   which coroutine path runs. Setting it to false still gives a ~40%
   latency reduction (1292->775 us/op) at lower CPU than the
   multi-level path. Notes that the setter goes in once
   facebook/rocksdb#14752 merges.
3. Replace 'Meta' / 'Meta's' with 'RocksDB team' /
   'remote/warm-storage flash' throughout. The benchmark is the
   RocksDB team's, the project's affiliation isn't relevant to the
   section.
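
The relative gains follow directly from the three quoted us/op figures.
A small worked check, assuming 1292 us/op is the one-file-at-a-time
baseline, 775 us/op the within-level coroutine path, and 508 us/op the
multi-level path; the function name is hypothetical.

```rust
/// Percentage latency reduction going from `before_us` to `after_us`.
fn reduction_percent(before_us: f64, after_us: f64) -> f64 {
    (before_us - after_us) / before_us * 100.0
}
```

With these inputs, `reduction_percent(1292.0, 775.0)` is about 40.0,
`reduction_percent(775.0, 508.0)` is about 34.5, and
`reduction_percent(1292.0, 508.0)` is about 60.7. These describe the
benchmark's remote-storage setup only and say nothing about local NVMe.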
Adds a `coroutines` cargo feature. When enabled, RocksDB is compiled with `USE_COROUTINES=1` and linked against folly. Calling `ReadOptions::set_async_io(true)` on a `MultiGet` then issues parallel io_uring reads across SST files in multiple LSM levels, which can lower MultiGet latency on slow storage.

Meta's published benchmark is on remote/network flash. They haven't published numbers for local NVMe; the gain there is probably smaller, but I haven't measured it.

Linux only. Needs liburing >= 2.7 (Ubuntu 25.10+ via apt; older distros are handled by the script, which builds liburing from source). Needs gcc <= 14 or clang: gcc-15 breaks folly's pinned libunwind.
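Usage: run `scripts/build_folly.sh`, set `ROCKSDB_FOLLY_INSTALL_PATH` to the install path it prints, then build with `--features coroutines,io-uring`.

The folly build takes ~20-30 minutes the first time. CI caches it.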
Caveats:
- glog and gflags are built only as `.so` files. The final binary needs them at runtime via `LD_LIBRARY_PATH` or an rpath in the binary crate's `build.rs`. README has details.
- Incompatible with `mt_static` (folly precludes a fully static build).
- `optimize_multiget_for_io` can't be set from Rust yet. It defaults to true, which is the right choice for coroutine builds. facebook/rocksdb#14752 ("Expose ReadOptions::optimize_multiget_for_io in the C API") adds the C API; once that merges and we bump the submodule, the setter goes in here.

Full details in the README's "Async MultiGet with C++20 Coroutines" section.