[Offload] Add asserts when removing and modifying files by kylesayrs · Pull Request #666 · vllm-project/compressed-tensors

kylesayrs · 2026-04-07T15:40:33Z

Purpose

Add stronger guards against modifying model checkpoint files.
- No checkpoint overwrite bugs exist today, but at least one has existed in the past.
- These asserts guard against such a bug being reintroduced in the future.

Summary by CodeRabbit

Refactor
- Improved disk-cache file naming to include execution rank to prevent cross-process name collisions; tightened on-disk file validation and lifecycle handling for safer cache writes, removals, and symlink updates.
Tests
- Updated tests to match the revised cache naming/validation behavior and adjusted test-run logging path handling for the subprocess test runner.

coderabbitai · 2026-04-07T15:40:48Z

Walkthrough

DiskCache filenames now use a rank-aware _ct_file_prefix and new helpers _get_ct_file_path, _is_ct_file_path, and _get_rank(); offload, symlink, update, and deletion logic were changed to centralize path construction and assert files are DiskCache-created before write/unlink.
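The helpers named in the walkthrough can be pictured with a minimal sketch. This is not the library's code: the class `DiskCacheSketch`, the prefix value `"ct_disk_cache"`, the integer `key` argument, the `remove` method, and reading the rank from the `RANK` environment variable (which `torchrun` sets) are all assumptions made for illustration.

```python
import os


def _get_rank() -> int:
    # Assumption: the rank is read from the RANK env var set by torchrun;
    # the real helper may query torch.distributed instead.
    return int(os.environ.get("RANK", "0"))


class DiskCacheSketch:
    """Hypothetical stand-in for DiskCache, for illustration only."""

    # Assumed prefix value marking files created by the cache
    _ct_file_prefix = "ct_disk_cache"

    @classmethod
    def _get_ct_file_path(cls, offload_dir: str, key: int) -> str:
        """Build a rank-aware path whose basename marks it as cache-created."""
        file_name = f"{cls._ct_file_prefix}_r{_get_rank()}_{key}.safetensors"
        return os.path.join(offload_dir, file_name)

    @classmethod
    def _is_ct_file_path(cls, file_path: str) -> bool:
        """Classify a path by its basename prefix."""
        return os.path.basename(file_path).startswith(cls._ct_file_prefix)

    @classmethod
    def remove(cls, file_path: str) -> None:
        """Unlink a file only after asserting the cache created it."""
        assert cls._is_ct_file_path(file_path), f"refusing to delete {file_path}"
        os.unlink(file_path)
```

With this shape, every write or unlink goes through one classification helper, so a path that was not produced by `_get_ct_file_path` fails the assert before the filesystem is touched.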

Changes

Cohort / File(s)	Summary
DiskCache path helpers & logic `src/compressed_tensors/offload/cache/disk.py`	Renamed `_new_file_prefix` → `_ct_file_prefix`. Added `DiskCache._get_ct_file_path(offload_dir, offloaded)`, `DiskCache._is_ct_file_path(file_path)`, and `_get_rank()`. Centralized safetensors filename construction to include rank and tightened offload/update/delete to assert classification before writing or unlinking.
Tests updating prefix expectations `tests/test_offload/convert/test_from_accelerate.py`	Updated test filtering to exclude `.safetensors` files by `DiskCache._ct_file_prefix` (rank-aware) instead of the old `_new_file_prefix` when computing pre/post-update hashes.
Test runner log-dir path `tests/test_offload/conftest.py`	Changed `torchrun` helper log directory from an absolute `/tmp/torchrun-logs` to a relative `tmp/torchrun-logs` path for subprocess launches.
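The test change summarized in the table (hashing checkpoint files before and after an update while ignoring cache-created files) could look roughly like the sketch below. The function name `hash_checkpoint_files` and the prefix value are hypothetical; the actual test may be structured differently.

```python
import hashlib
import os

# Assumed value of DiskCache._ct_file_prefix, for illustration only
CT_FILE_PREFIX = "ct_disk_cache"


def hash_checkpoint_files(directory: str, skip_prefix: str = CT_FILE_PREFIX) -> dict:
    """Map each .safetensors file NOT created by the cache to its SHA-256 digest."""
    digests = {}
    for name in sorted(os.listdir(directory)):
        # Skip non-checkpoint files and files the cache itself created
        if not name.endswith(".safetensors") or name.startswith(skip_prefix):
            continue
        with open(os.path.join(directory, name), "rb") as f:
            digests[name] = hashlib.sha256(f.read()).hexdigest()
    return digests
```

Comparing the dictionary returned before an update with the one returned afterwards shows whether any original checkpoint file was modified, regardless of how many cache files were added in between.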

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

brian-dellabetta
HDCharles

Poem

🐰 I hopped along the cache-lined trail,
I stamped each file with rank and name,
I sniffed the links and checked the trail,
No stray removals play the game.
My burrow's tidy, bytes stay tame. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding assertions when removing and modifying files in the offload cache system.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compressed_tensors/offload/cache/disk.py`:
- Around line 190-192: Update _is_created_file_path to accept the DiskCache
instance (e.g., def _is_created_file_path(file_path: str, disk_cache: DiskCache)
-> bool) and use the instance's prefix (disk_cache._new_file_prefix) rather than
the class constant so subclasses can override it; resolve both file_path and the
cache directory to real paths and verify the file is physically inside the disk
cache directory (use os.path.realpath + os.path.commonpath to compare against
disk_cache's directory property, e.g., disk_cache.cache_dir or
disk_cache.base_dir) and only then check that
os.path.basename(file_path).startswith(disk_cache._new_file_prefix).
- Around line 105-106: Replace the assert-based safety checks around
writing/reading cache files by enhancing _is_created_file_path to validate that
the resolved absolute path is inside offload_dir (not just basename prefix) and
then raise explicit exceptions (e.g., ValueError or PermissionError) instead of
using assert in the places that call _is_created_file_path and save_file; also
make the conditional usage of _is_created_file_path consistent by raising when
it fails (rather than silently skipping) so all checks for file creation use the
strengthened helper and explicit exceptions (update all call sites that
currently use assert or conditional checks with _is_created_file_path such as
where save_file is invoked and the other checks in the same class).
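One way the suggestion above could be realized is sketched here, assuming a free function rather than a method. The name `check_cache_owned`, the prefix value, and the choice of `PermissionError` are illustrative assumptions, not the PR's code (the PR itself uses asserts).

```python
import os

# Assumed prefix value, for illustration only
CT_FILE_PREFIX = "ct_disk_cache"


def check_cache_owned(file_path: str, offload_dir: str, prefix: str = CT_FILE_PREFIX) -> str:
    """Return the resolved path if it is a cache-created file inside offload_dir.

    Raises PermissionError otherwise, so the check still runs under `python -O`,
    where assert statements are stripped.
    """
    real_file = os.path.realpath(file_path)
    real_dir = os.path.realpath(offload_dir)

    # The resolved file must be physically located under the cache directory
    if os.path.commonpath([real_file, real_dir]) != real_dir:
        raise PermissionError(f"{file_path} is outside {offload_dir}")

    # The basename must carry the cache's prefix
    if not os.path.basename(real_file).startswith(prefix):
        raise PermissionError(f"{file_path} was not created by the disk cache")

    return real_file
```

Because both paths are resolved with `os.path.realpath`, a symlink inside the cache directory that points at a checkpoint elsewhere is rejected by the containment check.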

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 42a0c560-a30c-482a-983b-2fd645a93de3

📥 Commits

Reviewing files that changed from the base of the PR and between bf783b1 and 0ce32d7.

📒 Files selected for processing (1)

src/compressed_tensors/offload/cache/disk.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_offload/convert/test_from_accelerate.py`:
- Line 154: In DiskCache._is_ct_file_path, replace the incorrect reference to
cls._created_file_prefix with the correctly defined attribute
cls._ct_file_prefix so the method matches _get_ct_file_path and avoids the
AttributeError at runtime; update the check inside _is_ct_file_path to use
cls._ct_file_prefix and ensure any related comparisons use the same prefix
symbol.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 108746c9-cee5-47d1-a3e8-88da7dbbafd6

📥 Commits

Reviewing files that changed from the base of the PR and between 0ce32d7 and 657d3c6.

📒 Files selected for processing (2)

src/compressed_tensors/offload/cache/disk.py
tests/test_offload/convert/test_from_accelerate.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/compressed_tensors/offload/cache/disk.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compressed_tensors/offload/cache/disk.py`:
- Around line 169-177: The helper constants and method decorators were changed
but not updated everywhere: update _is_ct_file_path to reference the correct
constant name (use cls._ct_file_prefix instead of cls._created_file_prefix) and
add the missing `@classmethod` decorator to _get_ct_file_path so it can be called
as cls._get_ct_file_path(...) (also ensure its signature accepts cls,
offload_dir, offloaded). Verify create_checkpoint_symlink (and callers at lines
noted) now call the corrected classmethod and no AttributeError/TypeError
remains.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1bb913a2-1937-461b-9d73-9336ca0765c8

📥 Commits

Reviewing files that changed from the base of the PR and between 657d3c6 and d0a1d0b.

📒 Files selected for processing (1)

src/compressed_tensors/offload/cache/disk.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

src/compressed_tensors/offload/cache/disk.py (2)
162-162: ⚠️ Potential issue | 🔴 Critical

Add @classmethod back to _get_ct_file_path().

Line 162 calls cls._get_ct_file_path(...), but without the decorator Python does not bind cls there, so create_checkpoint_symlink() raises TypeError before the symlink is created. offload() only appears fine because the instance call binds self into the first parameter.
🔧 Proposed fix
     @classmethod
     def _is_ct_file_path(cls, file_path: str) -> bool:
         """Only write and delete files that DiskCache has created"""
         return os.path.basename(file_path).startswith(cls._ct_file_prefix)
 
+    @classmethod
     def _get_ct_file_path(cls, offload_dir: str, offloaded: torch.Tensor) -> str:
         """Create file path with a prefix marking it as modifiable"""
         file_name = f"{cls._ct_file_prefix}{_get_rank()}{id(offloaded)}.safetensors"
         return os.path.join(offload_dir, file_name)
Also applies to: 175-178
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/compressed_tensors/offload/cache/disk.py` at line 162, The call sites
like create_checkpoint_symlink use cls._get_ct_file_path(...) but
_get_ct_file_path was changed to a plain function, causing an unbound
first-argument TypeError; restore the method binding by re-adding the
`@classmethod` decorator to _get_ct_file_path so it accepts cls as its first
parameter (this will also fix the other calls in the same area such as the calls
around lines 175-178), ensure the method signature expects cls and not self, and
run tests for create_checkpoint_symlink and offload to verify behavior.
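The binding behavior described in this comment can be reproduced in isolation. The classes `Broken` and `Fixed` below are hypothetical and only mirror the shape of the issue; they are not taken from `disk.py`.

```python
class Broken:
    _prefix = "ct_"

    # Missing @classmethod: accessed through the class, this is a plain function
    def _get_path(cls, name: str) -> str:
        return cls._prefix + name

    @classmethod
    def make_link(cls, name: str) -> str:
        # "name" is bound to the first parameter, so the second one is missing
        return cls._get_path(name)


class Fixed:
    _prefix = "ct_"

    @classmethod
    def _get_path(cls, name: str) -> str:
        return cls._prefix + name

    @classmethod
    def make_link(cls, name: str) -> str:
        return cls._get_path(name)


try:
    Broken.make_link("x")
except TypeError as err:
    print(f"class call fails: {err}")

# An instance call appears to work because self is bound to the first parameter
print(Broken()._get_path("x"))
print(Fixed.make_link("x"))
```

This is why a call made on an instance can hide the bug while a call made from another classmethod fails.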
170-173: ⚠️ Potential issue | 🟠 Major

Prefix-only ownership checks can still target checkpoint files.

_is_ct_file_path() only checks startswith(cls._ct_file_prefix). That means a real checkpoint like ct_disk_cache_model.safetensors would still pass the guards on Lines 120, 141, and 145, so __delitem__() / update_offload() can still unlink or overwrite a non-cache-owned file. This helper needs to prove ownership, not just match a basename prefix.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/compressed_tensors/offload/cache/disk.py` around lines 170 - 173,
_is_ct_file_path currently only checks startswith(cls._ct_file_prefix), which
allows unrelated checkpoint files like "ct_disk_cache_model.safetensors" to be
treated as cache-owned; update _is_ct_file_path in class DiskCache to prove
ownership by validating the full expected filename pattern (not just prefix)
and/or location: check that file_path is inside the cache directory and that
os.path.basename matches the exact naming scheme the cache generates (e.g.,
prefix + separator + nonce/hex and expected suffix) or match a strict regex for
generated names; this will ensure __delitem__ and update_offload only operate on
true cache-created files and not arbitrary checkpoint files.
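A stricter name check along the lines suggested here might be sketched as follows. The naming scheme (`<prefix>_r<rank>_id<object id>.safetensors`), the prefix value, and the function name `is_ct_file_name` are assumptions for illustration; the cache's real scheme may differ.

```python
import os
import re

# Assumed prefix value and naming scheme, for illustration only
CT_FILE_PREFIX = "ct_disk_cache"
_CT_NAME_PATTERN = re.compile(
    rf"{re.escape(CT_FILE_PREFIX)}_r\d+_id\d+\.safetensors"
)


def is_ct_file_name(file_path: str) -> bool:
    """True only if the basename matches the full generated-name pattern."""
    return _CT_NAME_PATTERN.fullmatch(os.path.basename(file_path)) is not None


# A generated name matches; a checkpoint that merely shares the prefix does not
print(is_ct_file_name("/offload/ct_disk_cache_r0_id140234.safetensors"))
print(is_ct_file_name("/offload/ct_disk_cache_model.safetensors"))
```

Using `fullmatch` rather than a prefix test means the whole basename has to conform, which narrows the set of files the cache is willing to overwrite or unlink.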

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compressed_tensors/offload/cache/disk.py`:
- Around line 175-178: The filename generation in _get_ct_file_path currently
concatenates _get_rank() and id(offloaded) without a separator causing
collisions; change the formatting used to build file_name (referencing
_ct_file_prefix, _get_rank(), and id(offloaded)) to include a clear separator or
labeled segments (e.g., "r{rank}_id{obj_id}" or similar) between rank and object
id so filenames are unambiguous across ranks in offload_dir and cannot collide.
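The ambiguity is easy to see with small numbers. The two helper functions below are hypothetical and use a shortened prefix; they only contrast concatenation with labeled segments.

```python
def ambiguous_name(prefix: str, rank: int, obj_id: int) -> str:
    """Rank and object id concatenated with no separator."""
    return f"{prefix}{rank}{obj_id}.safetensors"


def separated_name(prefix: str, rank: int, obj_id: int) -> str:
    """Rank and object id in labeled, delimited segments."""
    return f"{prefix}r{rank}_id{obj_id}.safetensors"


# Rank 1 with id 23 and rank 12 with id 3 produce the same concatenated name
print(ambiguous_name("ct_", 1, 23) == ambiguous_name("ct_", 12, 3))  # True
# With labeled segments the two names differ
print(separated_name("ct_", 1, 23) == separated_name("ct_", 12, 3))  # False
```

Real object ids are much larger integers, but the same digit-boundary ambiguity applies whenever ranks have different numbers of digits.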

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4028389-e9d3-4d2b-a7e7-809bc827ea44

📥 Commits

Reviewing files that changed from the base of the PR and between d0a1d0b and 0cede2a.

📒 Files selected for processing (1)

src/compressed_tensors/offload/cache/disk.py

mergify · 2026-04-08T16:14:46Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-04-09T19:37:38Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

kylesayrs · 2026-04-10T02:03:20Z

@CodeRabbit resolve merge conflict

coderabbitai · 2026-04-10T02:03:26Z

✅ Merge conflicts resolved successfully!

Resolved 1 conflict file(s). Commit: 2d14d78e3e5a6d83fab097cf5b11e4da32dbff17 pushed to kylesayrs/assert-no-modify.

3 file operation(s)

src/compressed_tensors/offload/cache/disk.py (update)
tests/test_offload/conftest.py (update)
tests/test_offload/convert/test_from_accelerate.py (update)

View agent analysis

I'll start by inspecting the git state and then analyze the conflict.

The conflict is clear and simple. The file already has the `accel_device` fixture (from the main branch's hardware abstraction changes), and the only conflict is in the last line of that fixture:

- **HEAD (PR branch `kylesayrs/assert-no-modify`)**: `else torch.device("cuda:0")` — still uses CUDA-specific hardcoding
- **main**: `else torch.device(accel_type, 0)` — uses the generic accelerator type (hardware-agnostic)

The main branch's approach is correct — it's consistent with the entire hardware abstraction refactor already applied throughout the file (using `torch.accelerator.*` APIs, `accel_device` fixture, etc.). The PR branch's version still had the old CUDA-specific hardcoding in this spot.

No unmerged paths remain. Let me verify the final file state:

Now let me run the build and lint checks:

Both syntax and lint checks pass. The resolution is complete.

**Summary:**

**What each version was doing:**
- **HEAD (PR branch `kylesayrs/assert-no-modify`)**: The `accel_device` fixture still had `torch.device("cuda:0")` hardcoded in the `else` branch — a leftover CUDA-specific reference that hadn't been updated.
- **main**: The fixture used `torch.device(accel_type, 0)` where `accel_type` comes from `torch.accelerator.current_accelerator().type` — the hardware-agnostic approach.

**How resolved:** Chose main's version (`torch.device(accel_type, 0)`), which is the correct resolution. The rest of the file was already fully updated to use `torch.accelerator.*` APIs (the hardware abstraction from main), so keeping the CUDA-specific hardcoding in the fixture's `else` branch would be inconsistent and would break on non-CUDA accelerators.

**Behavioral snapshot:** Every line in the resolved file traces back to one of the two sides. The `--log-dir tmp/torchrun-logs` (relative path, from the PR branch) and all `torch.accelerator.*` calls (from main's hardware abstraction) are both present and combined correctly. The result is a fully hardware-agnostic conftest with the PR's relative log-dir path.

mergify · 2026-04-10T02:07:01Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages.

mergify · 2026-04-13T21:42:14Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

kylesayrs · 2026-04-17T14:30:48Z

@CodeRabbit resolve merge conflict

mergify · 2026-04-17T14:35:17Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages.

Signed-off-by: Kyle Sayers <[email protected]>

coderabbitai Bot reviewed Apr 7, 2026

Comment thread src/compressed_tensors/offload/cache/disk.py Outdated

Comment thread src/compressed_tensors/offload/cache/disk.py Outdated

coderabbitai Bot reviewed Apr 7, 2026

Comment thread tests/test_offload/convert/test_from_accelerate.py

coderabbitai Bot reviewed Apr 7, 2026

Comment thread src/compressed_tensors/offload/cache/disk.py

coderabbitai Bot reviewed Apr 7, 2026

Comment thread src/compressed_tensors/offload/cache/disk.py

brian-dellabetta approved these changes Apr 7, 2026

mergify Bot added the needs-rebase label Apr 8, 2026

kylesayrs mentioned this pull request Apr 9, 2026

[offload] Migrate offload stack to torch.accelerator + XPU emulation tests #665

Merged

kylesayrs force-pushed the kylesayrs/assert-no-modify branch from 3a2207b to e23f7e0 Compare April 9, 2026 06:22

mergify Bot removed the needs-rebase label Apr 9, 2026

mergify Bot added the needs-rebase label Apr 9, 2026

mergify Bot added quality-failed and removed needs-rebase labels Apr 10, 2026

mergify Bot added the needs-rebase label Apr 13, 2026

mergify Bot removed quality-failed needs-rebase labels Apr 17, 2026

mergify Bot added the quality-failed label Apr 17, 2026

kylesayrs mentioned this pull request Apr 17, 2026

[offload] DiskCache.clean_offload_dir #678

Draft

kylesayrs added 4 commits April 22, 2026 11:38

add asserts

dbd966f

Signed-off-by: Kyle Sayers <[email protected]>

typos, expressive asserts

876757a

Signed-off-by: Kyle Sayers <[email protected]>

centralize functions

f7ad94d

Signed-off-by: Kyle Sayers <[email protected]>

add comment

b2d8351

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs added 3 commits April 22, 2026 11:42

hash by rank

fa69460

Signed-off-by: Kyle Sayers <[email protected]>

fix typehint

ddac321

Signed-off-by: Kyle Sayers <[email protected]>

fix typo

77ad959

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs force-pushed the kylesayrs/assert-no-modify branch from 2d14d78 to 77ad959 Compare April 22, 2026 15:42

kylesayrs requested a review from HDCharles April 22, 2026 15:43

mergify Bot removed the quality-failed label Apr 22, 2026

kylesayrs added 2 commits April 22, 2026 11:45

better hashing

5543fe7

Signed-off-by: Kyle Sayers <[email protected]>

Merge branch 'main' into kylesayrs/assert-no-modify

a2aa012

HDCharles approved these changes Apr 22, 2026
