[Store] Support storage hierarchy with offload-on-evict mode#1899
hnts03-moreh wants to merge 5 commits into kvcache-ai:main from hnts03/support-storage-hierarchy
Conversation
Code Review
This pull request introduces an "offload-on-evict" mechanism that defers data offloading to local disk until eviction time, aimed at reducing disk I/O for hot data. The implementation includes new configuration flags, logic within the BatchEvict function to handle deferred offloading or force-eviction, and corresponding unit and integration tests. Review feedback highlights a critical data integrity issue where objects might be lost if the offloading queue is full, suggests replacing magic numbers with named constants, and recommends several performance optimizations regarding logging, timestamp retrieval, and replica management during eviction.
```cpp
}
// PushOffloadingQueue failed — force-evict as fallback
return metadata.size * evict_replicas(metadata);
```
This line causes unconditional force-eviction (and thus data loss) if PushOffloadingQueue fails (e.g., due to KEYS_ULTRA_LIMIT), even when offload_force_evict_ is set to false. To preserve data integrity in the default mode, eviction should only proceed if the object is already being offloaded (tracked in offloading_tasks) or if force-evict is explicitly enabled.
```cpp
if (shard->offloading_tasks.count(key) > 0 || offload_force_evict_) {
    return metadata.size * evict_replicas(metadata);
}
return 0;
```
That was a real oversight. The fallback now force-evicts only when MOONCAKE_OFFLOAD_FORCE_EVICT=1 is explicitly set. Otherwise it returns 0 (skip), letting the outer eviction loop retry on the next cycle after the offload queue drains. Updated the corresponding test case (ComboB_FallbackWhenNoLocalDiskSegment → ComboB_NoFallbackWithoutForceEvict) to verify the new data-preserving default.
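The data-preserving fallback described above can be sketched as follows. This is a minimal stand-in, not the actual implementation: `FallbackOnQueueFull` and its simplified `Metadata` type are hypothetical names; the real logic lives inside `MasterService::BatchEvict` and operates on shard state.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct Metadata { uint64_t size; };  // simplified stand-in

// Evict only when the key's data is safe elsewhere (an offload task is
// already in flight) or when force-evict is explicitly enabled.
// Returns bytes freed; 0 means "skip, retry next eviction cycle".
uint64_t FallbackOnQueueFull(
    const std::string& key, const Metadata& metadata,
    const std::unordered_map<std::string, int>& offloading_tasks,
    bool offload_force_evict,
    uint64_t (*evict_replicas)(const Metadata&)) {
    if (offloading_tasks.count(key) > 0 || offload_force_evict) {
        return metadata.size * evict_replicas(metadata);
    }
    return 0;  // data-preserving default: leave the object in DRAM
}
```

Returning 0 is what makes the default safe: the object stays in DRAM until the offload queue has room again, trading a slower eviction cycle for zero data loss.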
```cpp
long offload_queued_this_cycle = 0;
long offload_deferred_count = 0;
const long offload_cap =
    offload_on_evict_ ? static_cast<long>(offloading_queue_limit_ * 0.5)
```
Extracted to constexpr double kOffloadCapRatio = 0.5 with a comment explaining the intent. Promoting it to a runtime-configurable parameter is tracked as a follow-up.
```cpp
LOG(WARNING) << "[EVICT] Offload cap reached, force-evicting key="
             << key << " without disk offload";
```
This warning log is inside a loop that iterates over all objects in a shard. If a large number of objects reach the offload cap simultaneously, this could lead to log flooding and impact performance. Consider using LOG_EVERY_N(WARNING, 100) or summarizing the force-evictions at the end of the cycle.
Moved the per-key warning out of the loop. Force-eviction counts are now aggregated and reported as a single summary LOG(WARNING) at the end of the BatchEvict cycle.
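The aggregation pattern can be sketched like this. `EvictCycleStats` is a hypothetical name for illustration; the real code emits the summary through glog's `LOG(WARNING)` rather than returning a string.

```cpp
#include <cstddef>
#include <string>

// One counter per BatchEvict cycle; a single summary line replaces the
// per-key warning that could flood the log under heavy eviction pressure.
struct EvictCycleStats {
    std::size_t force_evicted = 0;

    void OnForceEvict() { ++force_evicted; }  // called inside the shard loop

    // Called once at the end of the cycle; empty string means nothing to log.
    std::string Summary() const {
        if (force_evicted == 0) return "";
        return "[EVICT] " + std::to_string(force_evicted) +
               " keys force-evicted without disk offload (offload cap reached)";
    }
};
```

Compared to `LOG_EVERY_N(WARNING, 100)`, a single end-of-cycle summary also makes the total per cycle visible in one line.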
```cpp
return r.is_memory_replica() && r.is_completed() &&
       r.get_refcnt() == 0;
},
[this, &key, &shard, &queued](Replica& replica) {
```
For efficiency, the visitor should stop attempting to queue replicas once one has been successfully queued for the current key. Since VisitReplicas does not support early exit, adding a check at the beginning of the lambda is recommended.
```cpp
[this, &key, &shard, &queued, &now](Replica& replica) {
    if (queued) return;
    auto result = PushOffloadingQueue(key, replica);
```
Added if (queued) return; at the top of the visitor lambda.
```cpp
replica.inc_refcnt();
shard->offloading_tasks.emplace(
    key, OffloadingTask{replica.id(),
                        std::chrono::system_clock::now()});
```
Now captures the outer `now` variable instead of calling `system_clock::now()` inside the lambda.
```cpp
if (queued) {
    offload_queued_this_cycle++;
    offload_deferred_count++;
    return 0; // No memory freed yet
}
```
When replica_num > 1, pinning one replica for offload is sufficient to preserve the data. The remaining memory replicas (which still have refcnt == 0) can and should be evicted immediately to reclaim memory. Returning 0 here defers their eviction to the next cycle, which is less efficient.
```diff
 if (queued) {
     offload_queued_this_cycle++;
     offload_deferred_count++;
-    return 0; // No memory freed yet
+    return metadata.size * evict_replicas(metadata);
 }
```
After pinning one replica for offload (refcnt=1), the remaining refcnt==0 MEMORY replicas are now evicted in the same cycle. This is safe because evict_replicas only touches replicas with refcnt == 0, and the pinned one survives. For the typical replica_num=1 case this is a no-op (the single replica is pinned), but it avoids wasting a cycle when replica_num > 1.
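The pin-one, evict-the-rest behavior can be modeled as below. These are simplified stand-in types (`Rep`, `PinOneEvictRest` are hypothetical names), not the actual `Replica` class or `evict_replicas` helper.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a MEMORY replica: only refcnt and an evicted flag matter here.
struct Rep { int refcnt = 0; bool evicted = false; };

// Pin one replica for offload (refcnt = 1), then free every remaining
// refcnt == 0 replica in the same cycle. Returns bytes freed.
uint64_t PinOneEvictRest(std::vector<Rep>& replicas, uint64_t object_size) {
    if (replicas.empty()) return 0;
    replicas[0].refcnt = 1;  // pinned until the offload task completes
    uint64_t freed = 0;
    for (auto& r : replicas) {
        if (r.refcnt == 0) {  // eviction only touches unreferenced replicas
            r.evicted = true;
            freed += object_size;
        }
    }
    return freed;
}
```

For `replica_num = 1` the single replica is the pinned one, so nothing is freed in this cycle, matching the "no-op for the typical case" note above.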
Address 6 items from gemini-code-assist review on PR kvcache-ai#1899:

1. Extract magic number 0.5 to constexpr kOffloadCapRatio (medium)
2. Move per-key force-evict WARNING out of loop; aggregate at cycle end to prevent log flooding (medium)
3. Add early exit in VisitReplicas lambda after first successful queue; no need to visit remaining replicas for the same key (medium)
4. Replace repeated std::chrono::system_clock::now() with captured `now` variable from BatchEvict scope (medium)
5. After pinning one replica for offload, also evict remaining refcnt==0 MEMORY replicas immediately instead of deferring to the next cycle; safe because the pinned replica has refcnt=1 (medium)
6. Fix unconditional force-eviction on PushOffloadingQueue failure: now only force-evicts when MOONCAKE_OFFLOAD_FORCE_EVICT=1 is set; default is skip (data-preserving). Prevents silent data loss when offload infra is unavailable. (HIGH)

Test update: rename ComboB_FallbackWhenNoLocalDiskSegment to ComboB_NoFallbackWithoutForceEvict and flip the assertion to verify the new data-preserving default (EXPECT_LE instead of EXPECT_GT). 9/9 unit tests + 9/9 existing eviction regression tests pass.

Signed-off-by: hnts03-moreh <[email protected]>
```cpp
LOG(INFO) << "Offload-on-evict mode enabled: DRAM offload to "
             "LOCAL_DISK will occur at eviction time instead of "
             "PutEnd";
const char* force_env = std::getenv("MOONCAKE_OFFLOAD_FORCE_EVICT");
```
The parameters for the master should be passed via main in master.cpp, while only the client-side parameters should be configured through environment variables.
@ykwd
Thanks for your review!
I've migrated both parameters from environment variables to proper MasterConfig fields.
Before:

```
MOONCAKE_OFFLOAD_ON_EVICT=1 MOONCAKE_OFFLOAD_FORCE_EVICT=1 mooncake_master ...
```

After (JSON config):

```json
{
    "offload_on_evict": true,
    "offload_force_evict": false
}
```

After (CLI flags):

```
mooncake_master --offload_on_evict=true --offload_force_evict=false
```
Thanks!
```cmake
add_store_test(serializer_test serializer_test.cpp)
add_store_test(embedded_snapshot_catalog_store_test
    ha/snapshot/catalog/backends/embedded/embedded_snapshot_catalog_store_test.cpp)
add_store_test(
```
No need to format CMakeLists.txt.
@zhangzuo21
Rebased on latest main! The only intentional change is adding add_store_test(offload_on_evict_test ...) (1 line). The remaining formatting diffs are from the cmake-format pre-commit hook automatically reformatting existing lines. I can't suppress the hook without --no-verify.
Defer LOCAL_DISK offload from PutEnd to BatchEvict time, enabling a proper DRAM → LOCAL_DISK storage hierarchy. When MOONCAKE_OFFLOAD_ON_EVICT=1, objects are only offloaded to disk when DRAM eviction pressure triggers, avoiding unnecessary disk I/O for hot data.

Key changes:
- Skip PushOffloadingQueue in PutEnd when offload-on-evict is active
- BatchEvict: queue eviction candidates for disk offload via try_evict_or_offload before erasing MEMORY replicas
- Objects with existing LOCAL_DISK replicas are evicted immediately
- refcnt protects MEMORY data until offload completes
- Optional MOONCAKE_OFFLOAD_FORCE_EVICT=1 for cap-based forced eviction under extreme DRAM pressure (default OFF, preserves data)

Env vars:
- MOONCAKE_OFFLOAD_ON_EVICT=1: enable offload-on-evict (default off)
- MOONCAKE_OFFLOAD_FORCE_EVICT=1: allow data-lossy eviction when offload cap is exceeded (default off, requires OFFLOAD_ON_EVICT=1)

Signed-off-by: hnts03-moreh <[email protected]>
9 test cases covering all 4 environment variable combinations:
- A (default): offload at PutEnd, eviction works normally
- B (ON_EVICT=1): PutEnd skips offload queue, eviction triggers offload, fallback when no LocalDiskSegment mounted
- C (ON_EVICT=1 + FORCE_EVICT=1): PutEnd skips, eviction works with cap
- D (FORCE_EVICT=1 only): treated as default, FORCE_EVICT ignored
Tests validate: offload queue state after PutEnd, watermark-based
eviction under DRAM pressure, fallback behavior when PushOffloadingQueue
fails, and backward compatibility when env vars are unset.
Signed-off-by: hnts03-moreh <[email protected]>
Verifies the core behavioral difference between the default enable_ssd_offload path and the offload-on-evict mode: small workloads below the eviction watermark must not produce LOCAL_DISK replicas.

Under default behavior, every PutEnd queues the key for synchronous offload via the FileStorage heartbeat. Under offload-on-evict, the queue stays empty until DRAM pressure triggers BatchEvict, which is the whole point of the feature (avoid redundant disk I/O for hot data). A zero disk-replica count doubles as a guard that the master was actually started with MOONCAKE_OFFLOAD_ON_EVICT=1.

End-to-end overflow→evict→offload→read contract is covered by the existing test_ssd_offload_in_evict.py (legacy synchronous offload) and by the verify_offload_on_eviction scenario in the kv-cache-offload-benchmark project.

Test style follows the existing test_ssd_offload_in_evict.py and test_batch_replica_clear.py conventions: unittest.TestCase, setUpClass for shared store, env-var driven configuration.

Signed-off-by: hnts03-moreh <[email protected]>
Replace MOONCAKE_OFFLOAD_ON_EVICT and MOONCAKE_OFFLOAD_FORCE_EVICT environment variables with proper MasterConfig fields, as requested by @ykwd. Master-side parameters should be passed via config/CLI, not environment variables.

Changes:
- MasterConfig: add offload_on_evict, offload_force_evict fields
- master.cpp: add DEFINE_bool + GetBool for JSON config and CLI flags
- MasterService: read from config instead of std::getenv()
- All config propagation paths updated (5 config classes)
- C++ GTest: remove ScopedEnv, use config fields directly (9 tests pass)
- Python test: update docstrings/error messages

Configuration (JSON): "offload_on_evict": true, "offload_force_evict": false
CLI: mooncake_master --offload_on_evict=true

Signed-off-by: Geonwoo Shin <[email protected]>
Force-pushed from bd633f6 to 3b5b3cc.
Description
Add opt-in offload-on-evict mode to Mooncake Store, which defers LOCAL_DISK offload from `PutEnd` time to `BatchEvict` time, forming a proper DRAM → LOCAL_DISK storage hierarchy.

Today (`enable_ssd_offload=true`): every `PutEnd` synchronously queues the key for offload, so the FileStorage heartbeat writes every put to disk, even hot keys that never leave DRAM. Disk I/O is incurred up front regardless of access pattern.

With this change (`MOONCAKE_OFFLOAD_ON_EVICT=1`): keys stay in DRAM until watermark pressure causes `BatchEvict` to select them. `BatchEvict` queues the selected MEMORY replicas for offload (pinning them via refcnt), the heartbeat completes the disk write, and the next eviction cycle frees the MEMORY replica via the existing `has_local_disk_replica` shortcut. Hot keys that are never evicted never touch disk.

Semantics summary:
- `MOONCAKE_OFFLOAD_ON_EVICT=1`: defers offload from `PutEnd` to `BatchEvict`; requires `enable_offload=true` in master config
- `MOONCAKE_OFFLOAD_FORCE_EVICT=1`: allows data-lossy eviction when the offload cap is exceeded; requires `MOONCAKE_OFFLOAD_ON_EVICT=1`; default OFF (data-preserving mode)

Neither env var is read through the JSON config schema; they are read once in the `MasterService` constructor via `std::getenv`. This keeps the surface area minimal for an experimental flag. If the feature is adopted as the default, promoting it to a proper config field is straightforward.

Backward compatibility: when neither env var is set, behavior is unchanged; the `PutEnd` offload block and the `BatchEvict` eviction paths both take their original fast path. All 16 existing eviction / offload tests pass unchanged.

Scope: LOCAL_DISK replicas (FileStorage-based) only. DISK replicas (`root_fs_dir` path, created by `Client::PutToLocalFile` at `PutEnd`) are written client-side, outside the Master's offload queue, and are not affected by this change. Deferring them is possible but requires client-side changes and is left as future work.

Module: `mooncake-store`

Type of Change
How Has This Been Tested?
New tests in this PR
`mooncake-store/tests/offload_on_evict_test.cpp`, 9 cases across all four env-var combinations:
- A (default): `PutEnd` still queues offload; existing eviction works.
- B (`ON_EVICT=1`): `PutEnd` skips the queue; eviction queues offload; `PushOffloadingQueue` failure falls back to direct eviction.
- C (`ON_EVICT=1` + `FORCE_EVICT=1`): `PutEnd` skips; eviction still makes progress under the force-evict cap.
- D (`FORCE_EVICT=1` only): treated as if unset; requires `ON_EVICT=1` to take effect.

`mooncake-wheel/tests/test_offload_on_eviction.py`, `test_small_workload_no_disk_replica`: asserts that a workload below the eviction watermark produces zero LOCAL_DISK replicas. This is the behavioral difference from the default `enable_ssd_offload` path (which would produce disk replicas for every put).
Run on `hnts03/support-storage-hierarchy` after the changes:
- `eviction_strategy_test`: 4/4 PASS
- `master_service_test --gtest_filter='*Evict*'`: 8/8 PASS
- `master_service_ssd_test --gtest_filter='*Evict*'`: 4/4 PASS

No behavioral change when the new env vars are unset.
End-to-end empirical verification
Validated against a running `mooncake_master` with the Python client binding (`enable_ssd_offload=true`, `MOONCAKE_OFFLOAD_ON_EVICT=1`): the master log shows the new startup line (`Offload-on-evict mode enabled`) and the new diagnostic (`[EVICT] N objects deferred for disk offload`).

Checklist

- Ran `./scripts/code_format.sh` before submitting.