Skip to content

[Store] Support storage hierarchy with offload-on-evict mode#1899

Open
hnts03-moreh wants to merge 5 commits intokvcache-ai:mainfrom
moreh-dev:hnts03/support-storage-hierarchy
Open

[Store] Support storage hierarchy with offload-on-evict mode#1899
hnts03-moreh wants to merge 5 commits intokvcache-ai:mainfrom
moreh-dev:hnts03/support-storage-hierarchy

Conversation

@hnts03-moreh
Copy link
Copy Markdown
Contributor

Description

Add opt-in offload-on-evict mode to Mooncake Store, which defers LOCAL_DISK offload from PutEnd time to BatchEvict time, forming a proper DRAM→LOCAL_DISK storage hierarchy.

Today (enable_ssd_offload=true): every PutEnd synchronously queues the key for offload, so the FileStorage heartbeat writes every put to disk — even hot keys that never leave DRAM. Disk I/O is incurred up front regardless of access pattern.

With this change (MOONCAKE_OFFLOAD_ON_EVICT=1): keys stay in DRAM only until watermark pressure causes BatchEvict to select them. BatchEvict queues the selected MEMORY replicas for offload (pinning them via refcnt), the heartbeat completes the disk write, and the next eviction cycle frees the MEMORY replica via the existing has_local_disk_replica shortcut. Hot keys that are never evicted never touch disk.

Semantics summary:

env var effect
MOONCAKE_OFFLOAD_ON_EVICT=1 defer offload from PutEnd to BatchEvict; requires enable_offload=true in master config
MOONCAKE_OFFLOAD_FORCE_EVICT=1 allow data-lossy force-eviction when the per-cycle offload cap is exceeded; requires MOONCAKE_OFFLOAD_ON_EVICT=1; default OFF (data-preserving mode)

Neither env var is read through the JSON config schema — they are read once in the MasterService constructor via std::getenv. This keeps the surface area minimal for an experimental flag. If the feature is adopted as the default, promoting it to a proper config field is straightforward.

Backward compatibility: when neither env var is set, behavior is unchanged — the PutEnd offload block and the BatchEvict eviction paths both take their original fast path. All 16 existing eviction / offload tests pass unchanged.

Scope: LOCAL_DISK replicas (FileStorage-based) only. DISK replicas (root_fs_dir path, created by Client::PutToLocalFile at PutEnd) are written client-side, outside the Master's offload queue, and are not affected by this change. Deferring them is possible but requires client-side changes and is left as future work.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

New tests in this PR

  • C++ unit testsmooncake-store/tests/offload_on_evict_test.cpp, 9 cases across all four env-var combinations:
    • A (neither set): PutEnd still queues offload; existing eviction works.
    • B (ON_EVICT=1): PutEnd skips the queue; eviction queues offload; PushOffloadingQueue failure falls back to direct eviction.
    • C (ON_EVICT=1 + FORCE_EVICT=1): PutEnd skips; eviction still makes progress under the force-evict cap.
    • D (FORCE_EVICT=1 only): treated as if unset — requires ON_EVICT=1 to take effect.
  • Python-binding integration testmooncake-wheel/tests/test_offload_on_eviction.py, test_small_workload_no_disk_replica: asserts that a workload below the eviction watermark produces zero LOCAL_DISK replicas. This is the behavioral difference from the default enable_ssd_offload path (which would produce disk replicas for every put).

Existing tests (regression)

Run on hnts03/support-storage-hierarchy after the changes:

  • eviction_strategy_test: 4/4 PASS
  • master_service_test --gtest_filter='*Evict*': 8/8 PASS
  • master_service_ssd_test --gtest_filter='*Evict*': 4/4 PASS

No behavioral change when the new env vars are unset.

End-to-end empirical verification

Validated against a running mooncake_master with the Python client binding (enable_ssd_offload=true, MOONCAKE_OFFLOAD_ON_EVICT=1):

  • Workload: 50 MB DRAM segment, 100 keys × 1 MB (2x overflow), heartbeat interval = 1 s.
  • Result: 96 / 100 keys offloaded to LOCAL_DISK via the eviction path; 100 / 100 keys readable bit-exact afterwards.
  • Master log confirmed feature activation (Offload-on-evict mode enabled) and our new diagnostic ([EVICT] N objects deferred for disk offload).

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an "offload-on-evict" mechanism that defers data offloading to local disk until eviction time, aimed at reducing disk I/O for hot data. The implementation includes new configuration flags, logic within the BatchEvict function to handle deferred offloading or force-eviction, and corresponding unit and integration tests. Review feedback highlights a critical data integrity issue where objects might be lost if the offloading queue is full, suggests replacing magic numbers with named constants, and recommends several performance optimizations regarding logging, timestamp retrieval, and replica management during eviction.

Comment thread mooncake-store/src/master_service.cpp Outdated
}

// PushOffloadingQueue failed — force-evict as fallback
return metadata.size * evict_replicas(metadata);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This line causes unconditional force-eviction (and thus data loss) if PushOffloadingQueue fails (e.g., due to KEYS_ULTRA_LIMIT), even when offload_force_evict_ is set to false. To preserve data integrity in the default mode, eviction should only proceed if the object is already being offloaded (tracked in offloading_tasks) or if force-evict is explicitly enabled.

        if (shard->offloading_tasks.count(key) > 0 || offload_force_evict_) {
            return metadata.size * evict_replicas(metadata);
        }
        return 0;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a real oversight. The fallback now only force-evicts when MOONCAKE_OFFLOAD_FORCE_EVICT=1 is explicitly set. Otherwise it returns 0 (skip), letting the outer eviction loop retry on the next cycle after the offload queue drains. Updated the corresponding test case (ComboB_FallbackWhenNoLocalDiskSegmentComboB_NoFallbackWithoutForceEvict) to verify the new data-preserving default.

Comment thread mooncake-store/src/master_service.cpp Outdated
long offload_queued_this_cycle = 0;
long offload_deferred_count = 0;
const long offload_cap =
offload_on_evict_ ? static_cast<long>(offloading_queue_limit_ * 0.5)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The multiplier 0.5 is a magic number. It would be better to define this as a named constant (e.g., kOffloadCapRatio) or make it a configurable parameter in MasterServiceConfig to improve maintainability and clarity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted to constexpr double kOffloadCapRatio = 0.5 with a comment explaining the intent. Promoting it to a runtime-configurable parameter is tracked as a follow-up.

Comment thread mooncake-store/src/master_service.cpp Outdated
Comment on lines +3516 to +3517
LOG(WARNING) << "[EVICT] Offload cap reached, force-evicting key="
<< key << " without disk offload";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This warning log is inside a loop that iterates over all objects in a shard. If a large number of objects reach the offload cap simultaneously, this could lead to log flooding and impact performance. Consider using LOG_EVERY_N(WARNING, 100) or summarizing the force-evictions at the end of the cycle.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the per-key warning out of the loop. Force-eviction counts are now aggregated and reported as a single summary LOG(WARNING) at the end of the BatchEvict cycle.

Comment thread mooncake-store/src/master_service.cpp Outdated
return r.is_memory_replica() && r.is_completed() &&
r.get_refcnt() == 0;
},
[this, &key, &shard, &queued](Replica& replica) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For efficiency, the visitor should stop attempting to queue replicas once one has been successfully queued for the current key. Since VisitReplicas does not support early exit, adding a check at the beginning of the lambda is recommended.

            [this, &key, &shard, &queued, &now](Replica& replica) {
                if (queued) return;
                auto result = PushOffloadingQueue(key, replica);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added if (queued) return; at the top of the visitor lambda.

Comment thread mooncake-store/src/master_service.cpp Outdated
replica.inc_refcnt();
shard->offloading_tasks.emplace(
key, OffloadingTask{replica.id(),
std::chrono::system_clock::now()});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Repeatedly calling std::chrono::system_clock::now() inside a loop can be expensive. It is better to use the now variable captured at the beginning of the BatchEvict function (line 3466).

Suggested change
std::chrono::system_clock::now()});
now});

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now captures the outer now variable instead of calling system_clock::now() inside the lambda

Comment on lines +3539 to +3543
if (queued) {
offload_queued_this_cycle++;
offload_deferred_count++;
return 0; // No memory freed yet
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When replica_num > 1, pinning one replica for offload is sufficient to preserve the data. The remaining memory replicas (which still have refcnt == 0) can and should be evicted immediately to reclaim memory. Returning 0 here defers their eviction to the next cycle, which is less efficient.

Suggested change
if (queued) {
offload_queued_this_cycle++;
offload_deferred_count++;
return 0; // No memory freed yet
}
if (queued) {
offload_queued_this_cycle++;
offload_deferred_count++;
return metadata.size * evict_replicas(metadata);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After pinning one replica for offload (refcnt=1), the remaining refcnt==0 MEMORY replicas are now evicted in the same cycle. This is safe because evict_replicas only touches replicas with refcnt == 0, and the pinned one survives. For the typical replica_num=1 case this is a no-op (the single replica is pinned), but it avoids wasting a cycle when replica_num > 1.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 16, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 92.08861% with 25 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/src/master_service.cpp 80.24% 16 Missing ⚠️
mooncake-store/include/master_config.h 50.00% 4 Missing ⚠️
mooncake-store/src/master.cpp 0.00% 4 Missing ⚠️
mooncake-store/tests/offload_on_evict_test.cpp 99.55% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

hnts03-moreh added a commit to moreh-dev/Mooncake that referenced this pull request Apr 16, 2026
Address 6 items from gemini-code-assist review on PR kvcache-ai#1899:

1. Extract magic number 0.5 to constexpr kOffloadCapRatio (medium)
2. Move per-key force-evict WARNING out of loop; aggregate at cycle
   end to prevent log flooding (medium)
3. Add early exit in VisitReplicas lambda after first successful
   queue — no need to visit remaining replicas for same key (medium)
4. Replace repeated std::chrono::system_clock::now() with captured
   `now` variable from BatchEvict scope (medium)
5. After pinning one replica for offload, also evict remaining
   refcnt==0 MEMORY replicas immediately instead of deferring to
   next cycle — safe because pinned replica has refcnt=1 (medium)
6. Fix unconditional force-eviction on PushOffloadingQueue failure:
   now only force-evicts when MOONCAKE_OFFLOAD_FORCE_EVICT=1 is
   set; default is skip (data-preserving). Prevents silent data
   loss when offload infra is unavailable. (HIGH)

Test update: rename ComboB_FallbackWhenNoLocalDiskSegment to
ComboB_NoFallbackWithoutForceEvict and flip assertion to verify
the new data-preserving default (EXPECT_LE instead of EXPECT_GT).

9/9 unit tests + 9/9 existing eviction regression tests pass.

Signed-off-by: hnts03-moreh <[email protected]>
Comment thread mooncake-store/src/master_service.cpp Outdated
LOG(INFO) << "Offload-on-evict mode enabled: DRAM offload to "
"LOCAL_DISK will occur at eviction time instead of "
"PutEnd";
const char* force_env = std::getenv("MOONCAKE_OFFLOAD_FORCE_EVICT");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameters for the master should be passed via main in master.cpp, while only the client-side parameters should be configured through environment variables.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ykwd
Thanks for your review!
I've migrated both parameters from environment variables to proper MasterConfig fields.

Before:

MOONCAKE_OFFLOAD_ON_EVICT=1 MOONCAKE_OFFLOAD_FORCE_EVICT=1 mooncake_master ...

After (JSON config):

{
  "offload_on_evict": true,
  "offload_force_evict": false
}

After (CLI flags):

mooncake_master --offload_on_evict=true --offload_force_evict=false

Thanks!

add_store_test(serializer_test serializer_test.cpp)
add_store_test(embedded_snapshot_catalog_store_test
ha/snapshot/catalog/backends/embedded/embedded_snapshot_catalog_store_test.cpp)
add_store_test(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to format CMakeLists.txt.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangzuo21
Rebased on latest main! The only intentional change is adding add_store_test(offload_on_evict_test ...) (1 line). The remaining formatting diffs are from the cmake-format pre-commit hook automatically reformatting existing lines. I can't suppress the hook without --no-verify.

Defer LOCAL_DISK offload from PutEnd to BatchEvict time, enabling a
proper DRAM → LOCAL_DISK storage hierarchy. When MOONCAKE_OFFLOAD_ON_EVICT=1,
objects are only offloaded to disk when DRAM eviction pressure triggers,
avoiding unnecessary disk I/O for hot data.

Key changes:
- Skip PushOffloadingQueue in PutEnd when offload-on-evict is active
- BatchEvict: queue eviction candidates for disk offload via
  try_evict_or_offload before erasing MEMORY replicas
- Objects with existing LOCAL_DISK replicas are evicted immediately
- refcnt protects MEMORY data until offload completes
- Optional MOONCAKE_OFFLOAD_FORCE_EVICT=1 for cap-based forced eviction
  under extreme DRAM pressure (default OFF, preserves data)

Env vars:
- MOONCAKE_OFFLOAD_ON_EVICT=1: enable offload-on-evict (default off)
- MOONCAKE_OFFLOAD_FORCE_EVICT=1: allow data-lossy eviction when offload
  cap is exceeded (default off, requires OFFLOAD_ON_EVICT=1)

Signed-off-by: hnts03-moreh <[email protected]>
9 test cases covering all 4 environment variable combinations:
  A (default): offload at PutEnd, eviction works normally
  B (ON_EVICT=1): PutEnd skips offload queue, eviction triggers offload,
    fallback when no LocalDiskSegment mounted
  C (ON_EVICT=1 + FORCE_EVICT=1): PutEnd skips, eviction works with cap
  D (FORCE_EVICT=1 only): treated as default, FORCE_EVICT ignored

Tests validate: offload queue state after PutEnd, watermark-based
eviction under DRAM pressure, fallback behavior when PushOffloadingQueue
fails, and backward compatibility when env vars are unset.

Signed-off-by: hnts03-moreh <[email protected]>
Verifies the core behavioral difference between the default
enable_ssd_offload path and the offload-on-evict mode: small workloads
below the eviction watermark must not produce LOCAL_DISK replicas.

Under default behavior, every PutEnd queues the key for synchronous
offload via the FileStorage heartbeat. Under offload-on-evict, the
queue stays empty until DRAM pressure triggers BatchEvict — which is
the whole point of the feature (avoid redundant disk I/O for hot data).

A zero disk-replica count doubles as a guard that the master was
actually started with MOONCAKE_OFFLOAD_ON_EVICT=1.

End-to-end overflow→evict→offload→read contract is covered by the
existing test_ssd_offload_in_evict.py (legacy synchronous offload) and
by the verify_offload_on_eviction scenario in the
kv-cache-offload-benchmark project.

Test style follows the existing test_ssd_offload_in_evict.py and
test_batch_replica_clear.py conventions: unittest.TestCase, setUpClass
for shared store, env-var driven configuration.

Signed-off-by: hnts03-moreh <[email protected]>
Address 6 items from gemini-code-assist review on PR kvcache-ai#1899:

1. Extract magic number 0.5 to constexpr kOffloadCapRatio (medium)
2. Move per-key force-evict WARNING out of loop; aggregate at cycle
   end to prevent log flooding (medium)
3. Add early exit in VisitReplicas lambda after first successful
   queue — no need to visit remaining replicas for same key (medium)
4. Replace repeated std::chrono::system_clock::now() with captured
   `now` variable from BatchEvict scope (medium)
5. After pinning one replica for offload, also evict remaining
   refcnt==0 MEMORY replicas immediately instead of deferring to
   next cycle — safe because pinned replica has refcnt=1 (medium)
6. Fix unconditional force-eviction on PushOffloadingQueue failure:
   now only force-evicts when MOONCAKE_OFFLOAD_FORCE_EVICT=1 is
   set; default is skip (data-preserving). Prevents silent data
   loss when offload infra is unavailable. (HIGH)

Test update: rename ComboB_FallbackWhenNoLocalDiskSegment to
ComboB_NoFallbackWithoutForceEvict and flip assertion to verify
the new data-preserving default (EXPECT_LE instead of EXPECT_GT).

9/9 unit tests + 9/9 existing eviction regression tests pass.

Signed-off-by: hnts03-moreh <[email protected]>
Replace MOONCAKE_OFFLOAD_ON_EVICT and MOONCAKE_OFFLOAD_FORCE_EVICT
environment variables with proper MasterConfig fields, as requested
by @ykwd. Master-side parameters should be passed via config/CLI,
not environment variables.

Changes:
- MasterConfig: add offload_on_evict, offload_force_evict fields
- master.cpp: add DEFINE_bool + GetBool for JSON config and CLI flags
- MasterService: read from config instead of std::getenv()
- All config propagation paths updated (5 config classes)
- C++ GTest: remove ScopedEnv, use config fields directly (9 tests pass)
- Python test: update docstrings/error messages

Configuration (JSON):
  "offload_on_evict": true,
  "offload_force_evict": false

CLI:
  mooncake_master --offload_on_evict=true

Signed-off-by: Geonwoo Shin <[email protected]>
@hnts03-moreh hnts03-moreh force-pushed the hnts03/support-storage-hierarchy branch from bd633f6 to 3b5b3cc Compare April 20, 2026 05:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants