[Store] Support storage hierarchy with offload-on-evict mode by hnts03-moreh · Pull Request #1899 · kvcache-ai/Mooncake

hnts03-moreh · 2026-04-16T01:06:20Z

Description

Add opt-in offload-on-evict mode to Mooncake Store, which defers LOCAL_DISK offload from PutEnd time to BatchEvict time, forming a proper DRAM→LOCAL_DISK storage hierarchy.

Today (enable_ssd_offload=true): every PutEnd synchronously queues the key for offload, so the FileStorage heartbeat writes every put to disk — even hot keys that never leave DRAM. Disk I/O is incurred up front regardless of access pattern.

With this change (MOONCAKE_OFFLOAD_ON_EVICT=1): keys stay in DRAM only until watermark pressure causes BatchEvict to select them. BatchEvict queues the selected MEMORY replicas for offload (pinning them via refcnt), the heartbeat completes the disk write, and the next eviction cycle frees the MEMORY replica via the existing has_local_disk_replica shortcut. Hot keys that are never evicted never touch disk.

Semantics summary:

env var	effect
`MOONCAKE_OFFLOAD_ON_EVICT=1`	defer offload from `PutEnd` to `BatchEvict`; requires `enable_offload=true` in master config
`MOONCAKE_OFFLOAD_FORCE_EVICT=1`	allow data-lossy force-eviction when the per-cycle offload cap is exceeded; requires `MOONCAKE_OFFLOAD_ON_EVICT=1`; default OFF (data-preserving mode)

Neither env var is read through the JSON config schema — they are read once in the MasterService constructor via std::getenv. This keeps the surface area minimal for an experimental flag. If the feature is adopted as the default, promoting it to a proper config field is straightforward.

Backward compatibility: when neither env var is set, behavior is unchanged — the PutEnd offload block and the BatchEvict eviction paths both take their original fast path. All 16 existing eviction / offload tests pass unchanged.

Scope: LOCAL_DISK replicas (FileStorage-based) only. DISK replicas (root_fs_dir path, created by Client::PutToLocalFile at PutEnd) are written client-side, outside the Master's offload queue, and are not affected by this change. Deferring them is possible but requires client-side changes and is left as future work.

Module

Type of Change

How Has This Been Tested?

New tests in this PR

C++ unit tests — mooncake-store/tests/offload_on_evict_test.cpp, 9 cases across all four env-var combinations:
- A (neither set): PutEnd still queues offload; existing eviction works.
- B (ON_EVICT=1): PutEnd skips the queue; eviction queues offload; PushOffloadingQueue failure falls back to direct eviction.
- C (ON_EVICT=1 + FORCE_EVICT=1): PutEnd skips; eviction still makes progress under the force-evict cap.
- D (FORCE_EVICT=1 only): treated as if unset — requires ON_EVICT=1 to take effect.
Python-binding integration test — mooncake-wheel/tests/test_offload_on_eviction.py, test_small_workload_no_disk_replica: asserts that a workload below the eviction watermark produces zero LOCAL_DISK replicas. This is the behavioral difference from the default enable_ssd_offload path (which would produce disk replicas for every put).

Existing tests (regression)

Run on hnts03/support-storage-hierarchy after the changes:

eviction_strategy_test: 4/4 PASS
master_service_test --gtest_filter='*Evict*': 8/8 PASS
master_service_ssd_test --gtest_filter='*Evict*': 4/4 PASS

No behavioral change when the new env vars are unset.

End-to-end empirical verification

Validated against a running mooncake_master with the Python client binding (enable_ssd_offload=true, MOONCAKE_OFFLOAD_ON_EVICT=1):

Workload: 50 MB DRAM segment, 100 keys × 1 MB (2x overflow), heartbeat interval = 1 s.
Result: 96 / 100 keys offloaded to LOCAL_DISK via the eviction path; 100 / 100 keys readable bit-exact afterwards.
Master log confirmed feature activation (Offload-on-evict mode enabled) and our new diagnostic ([EVICT] N objects deferred for disk offload).

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.

gemini-code-assist

Code Review

This pull request introduces an "offload-on-evict" mechanism that defers data offloading to local disk until eviction time, aimed at reducing disk I/O for hot data. The implementation includes new configuration flags, logic within the BatchEvict function to handle deferred offloading or force-eviction, and corresponding unit and integration tests. Review feedback highlights a critical data integrity issue where objects might be lost if the offloading queue is full, suggests replacing magic numbers with named constants, and recommends several performance optimizations regarding logging, timestamp retrieval, and replica management during eviction.

gemini-code-assist · 2026-04-16T01:11:09Z

+        }
+
+        // PushOffloadingQueue failed — force-evict as fallback
+        return metadata.size * evict_replicas(metadata);


This line causes unconditional force-eviction (and thus data loss) if PushOffloadingQueue fails (e.g., due to KEYS_ULTRA_LIMIT), even when offload_force_evict_ is set to false. To preserve data integrity in the default mode, eviction should only proceed if the object is already being offloaded (tracked in offloading_tasks) or if force-evict is explicitly enabled.

if (shard->offloading_tasks.count(key) > 0 || offload_force_evict_) { return metadata.size * evict_replicas(metadata); } return 0;

It was a real oversight. The fallback now only force-evicts when MOONCAKE_OFFLOAD_FORCE_EVICT=1 is explicitly set. Otherwise it returns 0 (skip), letting the outer eviction loop retry on the next cycle after the offload queue drains. Updated the corresponding test case (ComboB_FallbackWhenNoLocalDiskSegment→ ComboB_NoFallbackWithoutForceEvict) to verify the new data-preserving default.

gemini-code-assist · 2026-04-16T01:11:09Z

+    long offload_queued_this_cycle = 0;
+    long offload_deferred_count = 0;
+    const long offload_cap =
+        offload_on_evict_ ? static_cast<long>(offloading_queue_limit_ * 0.5)


The multiplier 0.5 is a magic number. It would be better to define this as a named constant (e.g., kOffloadCapRatio) or make it a configurable parameter in MasterServiceConfig to improve maintainability and clarity.

Extracted to constexpr double kOffloadCapRatio = 0.5 with a comment explaining the intent. Promoting it to a runtime-configurable parameter is tracked as a follow-up.

gemini-code-assist · 2026-04-16T01:11:09Z

+            LOG(WARNING) << "[EVICT] Offload cap reached, force-evicting key="
+                         << key << " without disk offload";


This warning log is inside a loop that iterates over all objects in a shard. If a large number of objects reach the offload cap simultaneously, this could lead to log flooding and impact performance. Consider using LOG_EVERY_N(WARNING, 100) or summarizing the force-evictions at the end of the cycle.

Moved the per-key warning out of the loop. Force-eviction counts are now aggregated and reported as a single summary LOG(WARNING) at the end of the BatchEvict cycle.

gemini-code-assist · 2026-04-16T01:11:09Z

+                return r.is_memory_replica() && r.is_completed() &&
+                       r.get_refcnt() == 0;
+            },
+            [this, &key, &shard, &queued](Replica& replica) {


For efficiency, the visitor should stop attempting to queue replicas once one has been successfully queued for the current key. Since VisitReplicas does not support early exit, adding a check at the beginning of the lambda is recommended.

[this, &key, &shard, &queued, &now](Replica& replica) { if (queued) return; auto result = PushOffloadingQueue(key, replica);

Added if (queued) return; at the top of the visitor lambda.

gemini-code-assist · 2026-04-16T01:11:09Z

+                    replica.inc_refcnt();
+                    shard->offloading_tasks.emplace(
+                        key, OffloadingTask{replica.id(),
+                                            std::chrono::system_clock::now()});


Repeatedly calling std::chrono::system_clock::now() inside a loop can be expensive. It is better to use the now variable captured at the beginning of the BatchEvict function (line 3466).

Suggested change

std::chrono::system_clock::now()});

now});

Now captures the outer now variable instead of calling system_clock::now() inside the lambda

gemini-code-assist · 2026-04-16T01:11:09Z

+        if (queued) {
+            offload_queued_this_cycle++;
+            offload_deferred_count++;
+            return 0;  // No memory freed yet
+        }


When replica_num > 1, pinning one replica for offload is sufficient to preserve the data. The remaining memory replicas (which still have refcnt == 0) can and should be evicted immediately to reclaim memory. Returning 0 here defers their eviction to the next cycle, which is less efficient.

Suggested change

if (queued) {

offload_queued_this_cycle++;

offload_deferred_count++;

return 0; // No memory freed yet

}

if (queued) {

offload_queued_this_cycle++;

offload_deferred_count++;

return metadata.size * evict_replicas(metadata);

}

After pinning one replica for offload (refcnt=1), the remaining refcnt==0 MEMORY replicas are now evicted in the same cycle. This is safe because evict_replicas only touches replicas with refcnt == 0, and the pinned one survives. For the typical replica_num=1 case this is a no-op (the single replica is pinned), but it avoids wasting a cycle when replica_num > 1.

codecov-commenter · 2026-04-16T01:36:15Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 92.08861% with 25 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
mooncake-store/src/master_service.cpp	80.24%	16 Missing ⚠️
mooncake-store/include/master_config.h	50.00%	4 Missing ⚠️
mooncake-store/src/master.cpp	0.00%	4 Missing ⚠️
mooncake-store/tests/offload_on_evict_test.cpp	99.55%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Address 6 items from gemini-code-assist review on PR kvcache-ai#1899: 1. Extract magic number 0.5 to constexpr kOffloadCapRatio (medium) 2. Move per-key force-evict WARNING out of loop; aggregate at cycle end to prevent log flooding (medium) 3. Add early exit in VisitReplicas lambda after first successful queue — no need to visit remaining replicas for same key (medium) 4. Replace repeated std::chrono::system_clock::now() with captured `now` variable from BatchEvict scope (medium) 5. After pinning one replica for offload, also evict remaining refcnt==0 MEMORY replicas immediately instead of deferring to next cycle — safe because pinned replica has refcnt=1 (medium) 6. Fix unconditional force-eviction on PushOffloadingQueue failure: now only force-evicts when MOONCAKE_OFFLOAD_FORCE_EVICT=1 is set; default is skip (data-preserving). Prevents silent data loss when offload infra is unavailable. (HIGH) Test update: rename ComboB_FallbackWhenNoLocalDiskSegment to ComboB_NoFallbackWithoutForceEvict and flip assertion to verify the new data-preserving default (EXPECT_LE instead of EXPECT_GT). 9/9 unit tests + 9/9 existing eviction regression tests pass. Signed-off-by: hnts03-moreh <[email protected]>

ykwd · 2026-04-16T12:18:54Z

+            LOG(INFO) << "Offload-on-evict mode enabled: DRAM offload to "
+                         "LOCAL_DISK will occur at eviction time instead of "
+                         "PutEnd";
+            const char* force_env = std::getenv("MOONCAKE_OFFLOAD_FORCE_EVICT");


The parameters for the master should be passed via main in master.cpp, while only the client-side parameters should be configured through environment variables.

@ykwd
Thanks for your review!
I've migrated both parameters from environment variables to proper MasterConfig fields.

Before:

MOONCAKE_OFFLOAD_ON_EVICT=1 MOONCAKE_OFFLOAD_FORCE_EVICT=1 mooncake_master ...

After (JSON config):

{ "offload_on_evict": true, "offload_force_evict": false }

After (CLI flags):

mooncake_master --offload_on_evict=true --offload_force_evict=false

Thanks!

zhangzuo21 · 2026-04-19T09:50:51Z

 add_store_test(serializer_test serializer_test.cpp)
-add_store_test(embedded_snapshot_catalog_store_test
-               ha/snapshot/catalog/backends/embedded/embedded_snapshot_catalog_store_test.cpp)
+add_store_test(


No need to format CMakeLists.txt.

@zhangzuo21
Rebased on latest main! The only intentional change is adding add_store_test(offload_on_evict_test ...) (1 line). The remaining formatting diffs are from the cmake-format pre-commit hook automatically reformatting existing lines. I can't suppress the hook without --no-verify.

Defer LOCAL_DISK offload from PutEnd to BatchEvict time, enabling a proper DRAM → LOCAL_DISK storage hierarchy. When MOONCAKE_OFFLOAD_ON_EVICT=1, objects are only offloaded to disk when DRAM eviction pressure triggers, avoiding unnecessary disk I/O for hot data. Key changes: - Skip PushOffloadingQueue in PutEnd when offload-on-evict is active - BatchEvict: queue eviction candidates for disk offload via try_evict_or_offload before erasing MEMORY replicas - Objects with existing LOCAL_DISK replicas are evicted immediately - refcnt protects MEMORY data until offload completes - Optional MOONCAKE_OFFLOAD_FORCE_EVICT=1 for cap-based forced eviction under extreme DRAM pressure (default OFF, preserves data) Env vars: - MOONCAKE_OFFLOAD_ON_EVICT=1: enable offload-on-evict (default off) - MOONCAKE_OFFLOAD_FORCE_EVICT=1: allow data-lossy eviction when offload cap is exceeded (default off, requires OFFLOAD_ON_EVICT=1) Signed-off-by: hnts03-moreh <[email protected]>

9 test cases covering all 4 environment variable combinations: A (default): offload at PutEnd, eviction works normally B (ON_EVICT=1): PutEnd skips offload queue, eviction triggers offload, fallback when no LocalDiskSegment mounted C (ON_EVICT=1 + FORCE_EVICT=1): PutEnd skips, eviction works with cap D (FORCE_EVICT=1 only): treated as default, FORCE_EVICT ignored Tests validate: offload queue state after PutEnd, watermark-based eviction under DRAM pressure, fallback behavior when PushOffloadingQueue fails, and backward compatibility when env vars are unset. Signed-off-by: hnts03-moreh <[email protected]>

Verifies the core behavioral difference between the default enable_ssd_offload path and the offload-on-evict mode: small workloads below the eviction watermark must not produce LOCAL_DISK replicas. Under default behavior, every PutEnd queues the key for synchronous offload via the FileStorage heartbeat. Under offload-on-evict, the queue stays empty until DRAM pressure triggers BatchEvict — which is the whole point of the feature (avoid redundant disk I/O for hot data). A zero disk-replica count doubles as a guard that the master was actually started with MOONCAKE_OFFLOAD_ON_EVICT=1. End-to-end overflow→evict→offload→read contract is covered by the existing test_ssd_offload_in_evict.py (legacy synchronous offload) and by the verify_offload_on_eviction scenario in the kv-cache-offload-benchmark project. Test style follows the existing test_ssd_offload_in_evict.py and test_batch_replica_clear.py conventions: unittest.TestCase, setUpClass for shared store, env-var driven configuration. Signed-off-by: hnts03-moreh <[email protected]>

Address 6 items from gemini-code-assist review on PR kvcache-ai#1899: 1. Extract magic number 0.5 to constexpr kOffloadCapRatio (medium) 2. Move per-key force-evict WARNING out of loop; aggregate at cycle end to prevent log flooding (medium) 3. Add early exit in VisitReplicas lambda after first successful queue — no need to visit remaining replicas for same key (medium) 4. Replace repeated std::chrono::system_clock::now() with captured `now` variable from BatchEvict scope (medium) 5. After pinning one replica for offload, also evict remaining refcnt==0 MEMORY replicas immediately instead of deferring to next cycle — safe because pinned replica has refcnt=1 (medium) 6. Fix unconditional force-eviction on PushOffloadingQueue failure: now only force-evicts when MOONCAKE_OFFLOAD_FORCE_EVICT=1 is set; default is skip (data-preserving). Prevents silent data loss when offload infra is unavailable. (HIGH) Test update: rename ComboB_FallbackWhenNoLocalDiskSegment to ComboB_NoFallbackWithoutForceEvict and flip assertion to verify the new data-preserving default (EXPECT_LE instead of EXPECT_GT). 9/9 unit tests + 9/9 existing eviction regression tests pass. Signed-off-by: hnts03-moreh <[email protected]>

@ykwd

Replace MOONCAKE_OFFLOAD_ON_EVICT and MOONCAKE_OFFLOAD_FORCE_EVICT environment variables with proper MasterConfig fields, as requested by @ykwd. Master-side parameters should be passed via config/CLI, not environment variables. Changes: - MasterConfig: add offload_on_evict, offload_force_evict fields - master.cpp: add DEFINE_bool + GetBool for JSON config and CLI flags - MasterService: read from config instead of std::getenv() - All config propagation paths updated (5 config classes) - C++ GTest: remove ScopedEnv, use config fields directly (9 tests pass) - Python test: update docstrings/error messages Configuration (JSON): "offload_on_evict": true, "offload_force_evict": false CLI: mooncake_master --offload_on_evict=true Signed-off-by: Geonwoo Shin <[email protected]>

hnts03-moreh requested review from ShangmingCai, XucSh, YiXR, stmatengss and ykwd as code owners April 16, 2026 01:06

github-actions Bot added run-ci Store Installation labels Apr 16, 2026

gemini-code-assist Bot reviewed Apr 16, 2026

View reviewed changes

ykwd requested review from maheshrbapatu and zhangzuo21 April 16, 2026 12:16

ykwd reviewed Apr 16, 2026

View reviewed changes

zhangzuo21 reviewed Apr 19, 2026

View reviewed changes

hnts03-moreh added 5 commits April 20, 2026 05:09

hnts03-moreh force-pushed the hnts03/support-storage-hierarchy branch from bd633f6 to 3b5b3cc Compare April 20, 2026 05:11

		LOG(WARNING) << "[EVICT] Offload cap reached, force-evicting key="
		<< key << " without disk offload";

Conversation

hnts03-moreh commented Apr 16, 2026

Description

Module

Type of Change

How Has This Been Tested?

New tests in this PR

Existing tests (regression)

End-to-end empirical verification

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Apr 16, 2026 •

edited

Loading