Skip to content

Add multi-DB stress testing support (--num_dbs flag) (#14749)#14749

Open
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:export-D104959942
Open

Add multi-DB stress testing support (--num_dbs flag) (#14749)#14749
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:export-D104959942

Conversation

@hx235
Copy link
Copy Markdown
Contributor

@hx235 hx235 commented May 16, 2026

Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:

  • --num_dbs flag (default 1) with guards for not-yet-supported features
  • [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
  • Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
  • Fault injection log path includes GetDbLabel() for per-DB uniqueness
  • Secondary paths are ephemeral, generated directly by C++
  • flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942

@meta-cla meta-cla Bot added the CLA Signed label May 16, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 16, 2026

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104959942.

@hx235 hx235 changed the title Add multi-DB stress testing support (--num_dbs flag) [CI only] Add multi-DB stress testing support (--num_dbs flag) May 16, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 16, 2026

❌ clang-tidy: 1 error(s) and 7 warning(s) on changed lines

Completed in 151.3s.

Summary by check

Check Count
clang-diagnostic-error 1
concurrency-mt-unsafe 4
cppcoreguidelines-avoid-non-const-global-variables 3
Total 8

Details

db_stress_tool/db_stress_common.cc (2 warning(s))
db_stress_tool/db_stress_common.cc:25:56: warning: variable 'wbm' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_common.cc:26:49: warning: variable 'rate_limiter' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_stat.h (1 error(s))
db_stress_tool/db_stress_stat.h:186:5: error: unknown type name 'MutexLock' [clang-diagnostic-error]
db_stress_tool/db_stress_tool.cc (5 warning(s))
db_stress_tool/db_stress_tool.cc:41:5: warning: variable 'fault_fs_for_crash_report' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_tool.cc:406:5: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:412:5: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:432:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:441:9: warning: function is not thread safe [concurrency-mt-unsafe]

@meta-codesync meta-codesync Bot changed the title [CI only] Add multi-DB stress testing support (--num_dbs flag) Add multi-DB stress testing support (--num_dbs flag) May 20, 2026
@hx235 hx235 force-pushed the export-D104959942 branch from 6be55c3 to c855385 Compare May 20, 2026 05:51
@meta-codesync meta-codesync Bot changed the title Add multi-DB stress testing support (--num_dbs flag) Add multi-DB stress testing support (--num_dbs flag) (#14749) May 22, 2026
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from c855385 to 0086484 Compare May 22, 2026 00:31
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from 0086484 to cb54e9d Compare May 22, 2026 00:38
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from cb54e9d to 49193e7 Compare May 22, 2026 00:43
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch 2 times, most recently from 4de8a83 to b072a0e Compare May 22, 2026 00:56
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from b072a0e to 723fe00 Compare May 22, 2026 02:12
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from 723fe00 to 19717b4 Compare May 22, 2026 02:41
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from 19717b4 to e775a94 Compare May 22, 2026 02:42
@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit e775a94


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit e775a94


Summary

Clean, well-structured PR that adds multi-DB stress testing with proper resource sharing and isolation. The design is sound -- per-DB paths, shared cache/WBM/rate_limiter, per-DB fault injection, and clear phase separation in the orchestration code. Most incompatible features are correctly guarded.

High-severity findings (1):

  • [db_stress_stat.h:184] flockfile/funlockfile are POSIX-only; breaks Windows/MSVC builds.
Full review (click to expand)

Findings

🔴 HIGH

H1. flockfile/funlockfile not portable to Windows -- db_stress_stat.h:184
  • Issue: The PR adds flockfile(stdout) and funlockfile(stdout) to Stats::Report() to prevent multi-DB output interleaving. These are POSIX functions and do not exist on Windows/MSVC.
  • Suggested fix: Use a static port::Mutex instead:
    static port::Mutex report_mutex;
    MutexLock l(&report_mutex);

🟡 MEDIUM

M1. Default --db path incompatible with --num_dbs > 1 when running db_stress directly -- db_stress_tool.cc:394-403
  • Issue: When FLAGS_db is empty and num_dbs > 1, the single default path causes a size mismatch error. Users must provide comma-separated paths.
  • Suggested fix: Auto-generate subdirectory paths in C++ (like Python does), or document this clearly.
M2. SyncPoint singleton destruction order -- db_stress_shared_state.h:83-91
  • Issue: SharedState::~SharedState() calls ClearAllCallBacks() globally. Currently safe because all threads are joined first, but fragile if destruction order changes. Add a comment documenting this invariant.
M3. No unit tests for num_dbs > 1 -- tools/db_crashtest_test.py
  • Issue: No tests exercise multi-DB paths (path generation, flag validation, cleanup, finalize_and_sanitize guards).
  • Suggested fix: Add tests for get_subdirectory_paths(), ValidateNumDbsFlags, and finalize_and_sanitize with num_dbs > 1.

🟢 LOW / NIT

L1. Dead else if branch in InitializeOptionsGeneral -- db_stress_test_base.cc:5278
  • Issue: The fallback else if (FLAGS_rate_limiter_bytes_per_sec > 0) creating a per-DB rate limiter is unreachable when the global rate_limiter is pre-created.
L2. manifest_path construction style inconsistency -- db_stress_test_base.cc:4434,4492
  • Issue: Mix of .append("/").append(f) and + "/CURRENT" concatenation styles.

Cross-Component Analysis

Shared Resource Thread-Safe? Guarded?
block_cache Yes N/A (safe to share)
WriteBufferManager Yes N/A (safe to share)
RateLimiter Yes N/A (safe to share)
CompressedSecondaryCache Yes Blocked with num_dbs>1
PoolSizeChangeThread N/A Blocked with num_dbs>1
SyncPoint (debug) Fragile Safe via destruction order

Positive Observations

  • Clean 4-phase orchestration design (create -> init -> run -> cleanup)
  • Proper shared resource factoring (cache, WBM, rate limiter)
  • Good feature incompatibility guards in both C++ and Python
  • Signal-safe crash callback with vector of fault_fs pointers
  • [db_N] output labeling for easy per-DB filtering

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from e775a94 to d9613c5 Compare May 22, 2026 06:14
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from d9613c5 to a5ee2a5 Compare May 22, 2026 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant