
[Offload] Add asserts when removing and modifying files #666

Open
kylesayrs wants to merge 9 commits into main from
kylesayrs/assert-no-modify

Conversation

@kylesayrs
Collaborator

@kylesayrs kylesayrs commented Apr 7, 2026

Purpose

  • Add stronger guards against modifying model checkpoint files
    • No checkpoint overwrite bugs exist today, but at least one has existed in the past
    • These asserts guard against the possibility in the future

Summary by CodeRabbit

  • Refactor

    • Improved disk-cache file naming to include execution rank to prevent cross-process name collisions; tightened on-disk file validation and lifecycle handling for safer cache writes, removals, and symlink updates.
  • Tests

    • Updated tests to match the revised cache naming/validation behavior and adjusted test-run logging path handling for the subprocess test runner.

@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2d7f6eb1-a984-4805-b51e-e1c1fb02eee7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

DiskCache filenames now use a rank-aware _ct_file_prefix and new helpers _get_ct_file_path, _is_ct_file_path, and _get_rank(); offload, symlink, update, and deletion logic were changed to centralize path construction and assert files are DiskCache-created before write/unlink.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| DiskCache path helpers & logic: src/compressed_tensors/offload/cache/disk.py | Renamed _new_file_prefix to _ct_file_prefix. Added DiskCache._get_ct_file_path(offload_dir, offloaded), DiskCache._is_ct_file_path(file_path), and _get_rank(). Centralized safetensors filename construction to include rank and tightened offload/update/delete to assert classification before writing or unlinking. |
| Tests updating prefix expectations: tests/test_offload/convert/test_from_accelerate.py | Updated test filtering to exclude .safetensors files by DiskCache._ct_file_prefix (rank-aware) instead of the old _new_file_prefix when computing pre/post-update hashes. |
| Test runner log-dir path: tests/test_offload/conftest.py | Changed torchrun helper log directory from an absolute /tmp/torchrun-logs to a relative tmp/torchrun-logs path for subprocess launches. |
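As a rough illustration of the helpers described in the walkthrough, the sketch below builds a rank-aware cache file name and asserts ownership before unlinking. It is not the PR's code: the prefix value, the `RANK` environment lookup standing in for `_get_rank()`, the `_` separator, and the `remove_cache_file` helper are all assumptions made for the example.

```python
import os

# Hypothetical prefix; the real value of DiskCache._ct_file_prefix is not shown on this page.
CT_FILE_PREFIX = "ct_disk_cache_"


def get_rank() -> int:
    """Assumed stand-in for _get_rank(): read the RANK variable that torchrun sets."""
    return int(os.environ.get("RANK", "0"))


def get_ct_file_path(offload_dir: str, offloaded: object) -> str:
    """Build a path whose file name carries the prefix, the rank and the object id."""
    # The "_" between rank and id is an assumption added to keep names unambiguous.
    file_name = f"{CT_FILE_PREFIX}{get_rank()}_{id(offloaded)}.safetensors"
    return os.path.join(offload_dir, file_name)


def is_ct_file_path(file_path: str) -> bool:
    """Prefix check on the basename, mirroring the check described in the walkthrough."""
    return os.path.basename(file_path).startswith(CT_FILE_PREFIX)


def remove_cache_file(file_path: str) -> None:
    """Unlink a file only after asserting that the cache created it."""
    assert is_ct_file_path(file_path), f"refusing to unlink {file_path}"
    os.unlink(file_path)
```

With this shape, a call such as `remove_cache_file("model.safetensors")` fails the assertion instead of deleting a checkpoint file.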

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • brian-dellabetta
  • HDCharles

Poem

🐰 I hopped along the cache-lined trail,
I stamped each file with rank and name,
I sniffed the links and checked the trail,
No stray removals play the game.
My burrow's tidy, bytes stay tame. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding assertions when removing and modifying files in the offload cache system. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compressed_tensors/offload/cache/disk.py`:
- Around line 190-192: Update _is_created_file_path to accept the DiskCache
instance (e.g., def _is_created_file_path(file_path: str, disk_cache: DiskCache)
-> bool) and use the instance's prefix (disk_cache._new_file_prefix) rather than
the class constant so subclasses can override it; resolve both file_path and the
cache directory to real paths and verify the file is physically inside the disk
cache directory (use os.path.realpath + os.path.commonpath to compare against
disk_cache's directory property, e.g., disk_cache.cache_dir or
disk_cache.base_dir) and only then check that
os.path.basename(file_path).startswith(disk_cache._new_file_prefix).
- Around line 105-106: Replace the assert-based safety checks around
writing/reading cache files by enhancing _is_created_file_path to validate that
the resolved absolute path is inside offload_dir (not just basename prefix) and
then raise explicit exceptions (e.g., ValueError or PermissionError) instead of
using assert in the places that call _is_created_file_path and save_file; also
make the conditional usage of _is_created_file_path consistent by raising when
it fails (rather than silently skipping) so all checks for file creation use the
strengthened helper and explicit exceptions (update all call sites that
currently use assert or conditional checks with _is_created_file_path such as
where save_file is invoked and the other checks in the same class).
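A minimal sketch of the check these two comments ask for, assuming a standalone helper rather than a `DiskCache` method: resolve both paths, confirm the file sits inside the cache directory, check the basename prefix, and raise an explicit exception instead of asserting. The name `ensure_cache_owned` and its parameters are hypothetical and do not come from the PR.

```python
import os


def ensure_cache_owned(file_path: str, cache_dir: str, prefix: str) -> str:
    """Return the resolved path if the cache owns it, otherwise raise PermissionError."""
    real_file = os.path.realpath(file_path)
    real_dir = os.path.realpath(cache_dir)

    try:
        # commonpath equals the directory only when the file lies inside it
        inside = os.path.commonpath([real_file, real_dir]) == real_dir
    except ValueError:
        # raised on Windows when the two paths are on different drives
        inside = False

    # The prefix is checked on the resolved name, so a prefixed symlink that
    # points at a checkpoint file is rejected as well.
    has_prefix = os.path.basename(real_file).startswith(prefix)

    if not (inside and has_prefix):
        raise PermissionError(f"{file_path} was not created by the disk cache")
    return real_file
```

Unlike `assert`, the explicit `raise` still runs when Python is started with `-O`, which is the main reason the review prefers it for a safety check.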
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 42a0c560-a30c-482a-983b-2fd645a93de3

📥 Commits

Reviewing files that changed from the base of the PR and between bf783b1 and 0ce32d7.

📒 Files selected for processing (1)
  • src/compressed_tensors/offload/cache/disk.py

Comment thread src/compressed_tensors/offload/cache/disk.py Outdated
Comment thread src/compressed_tensors/offload/cache/disk.py Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_offload/convert/test_from_accelerate.py`:
- Line 154: In DiskCache._is_ct_file_path, replace the incorrect reference to
cls._created_file_prefix with the correctly defined attribute
cls._ct_file_prefix so the method matches _get_ct_file_path and avoids the
AttributeError at runtime; update the check inside _is_ct_file_path to use
cls._ct_file_prefix and ensure any related comparisons use the same prefix
symbol.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 108746c9-cee5-47d1-a3e8-88da7dbbafd6

📥 Commits

Reviewing files that changed from the base of the PR and between 0ce32d7 and 657d3c6.

📒 Files selected for processing (2)
  • src/compressed_tensors/offload/cache/disk.py
  • tests/test_offload/convert/test_from_accelerate.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/compressed_tensors/offload/cache/disk.py

Comment thread tests/test_offload/convert/test_from_accelerate.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compressed_tensors/offload/cache/disk.py`:
- Around line 169-177: The helper constants and method decorators were changed
but not updated everywhere: update _is_ct_file_path to reference the correct
constant name (use cls._ct_file_prefix instead of cls._created_file_prefix) and
add the missing `@classmethod` decorator to _get_ct_file_path so it can be called
as cls._get_ct_file_path(...) (also ensure its signature accepts cls,
offload_dir, offloaded). Verify create_checkpoint_symlink (and callers at lines
noted) now call the corrected classmethod and no AttributeError/TypeError
remains.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1bb913a2-1937-461b-9d73-9336ca0765c8

📥 Commits

Reviewing files that changed from the base of the PR and between 657d3c6 and d0a1d0b.

📒 Files selected for processing (1)
  • src/compressed_tensors/offload/cache/disk.py

Comment thread src/compressed_tensors/offload/cache/disk.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
src/compressed_tensors/offload/cache/disk.py (2)

162-162: ⚠️ Potential issue | 🔴 Critical

Add @classmethod back to _get_ct_file_path().

Line 162 calls cls._get_ct_file_path(...), but without the decorator Python does not bind cls there, so create_checkpoint_symlink() raises TypeError before the symlink is created. offload() only appears fine because the instance call binds self into the first parameter.

🔧 Proposed fix
     @classmethod
     def _is_ct_file_path(cls, file_path: str) -> bool:
         """Only write and delete files that DiskCache has created"""
         return os.path.basename(file_path).startswith(cls._ct_file_prefix)
 
+    @classmethod
     def _get_ct_file_path(cls, offload_dir: str, offloaded: torch.Tensor) -> str:
         """Create file path with a prefix marking it as modifiable"""
         file_name = f"{cls._ct_file_prefix}{_get_rank()}{id(offloaded)}.safetensors"
         return os.path.join(offload_dir, file_name)

Also applies to: 175-178
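The binding behaviour behind this comment can be reproduced in isolation. The class below is a hypothetical stand-in, not `DiskCache` itself; it only shows why a call through `cls` fails without the decorator while a call through an instance appears to work.

```python
class DiskCacheSketch:
    # Missing @classmethod: accessed through the class, this is a plain function
    def _path_unbound(cls, offload_dir, key):
        return f"{offload_dir}/{key}.safetensors"

    @classmethod
    def _path_bound(cls, offload_dir, key):
        return f"{offload_dir}/{key}.safetensors"

    @classmethod
    def symlink_target_broken(cls, offload_dir, key):
        # Nothing is bound to the first parameter, so the two arguments fill
        # (cls, offload_dir) and `key` is reported missing: TypeError
        return cls._path_unbound(offload_dir, key)

    @classmethod
    def symlink_target_fixed(cls, offload_dir, key):
        return cls._path_bound(offload_dir, key)

    def offload(self, offload_dir, key):
        # The instance call binds `self` into the parameter named `cls`,
        # which is why this path looks fine
        return self._path_unbound(offload_dir, key)
```

Calling `DiskCacheSketch.symlink_target_broken("d", "k")` raises `TypeError`, while `DiskCacheSketch().offload("d", "k")` returns a path.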

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/compressed_tensors/offload/cache/disk.py` at line 162, The call sites
like create_checkpoint_symlink use cls._get_ct_file_path(...) but
_get_ct_file_path was changed to a plain function, causing an unbound
first-argument TypeError; restore the method binding by re-adding the
`@classmethod` decorator to _get_ct_file_path so it accepts cls as its first
parameter (this will also fix the other calls in the same area such as the calls
around lines 175-178), ensure the method signature expects cls and not self, and
run tests for create_checkpoint_symlink and offload to verify behavior.

170-173: ⚠️ Potential issue | 🟠 Major

Prefix-only ownership checks can still target checkpoint files.

_is_ct_file_path() only checks startswith(cls._ct_file_prefix). That means a real checkpoint like ct_disk_cache_model.safetensors would still pass the guards on Lines 120, 141, and 145, so __delitem__() / update_offload() can still unlink or overwrite a non-cache-owned file. This helper needs to prove ownership, not just match a basename prefix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/compressed_tensors/offload/cache/disk.py` around lines 170 - 173,
_is_ct_file_path currently only checks startswith(cls._ct_file_prefix), which
allows unrelated checkpoint files like "ct_disk_cache_model.safetensors" to be
treated as cache-owned; update _is_ct_file_path in class DiskCache to prove
ownership by validating the full expected filename pattern (not just prefix)
and/or location: check that file_path is inside the cache directory and that
os.path.basename matches the exact naming scheme the cache generates (e.g.,
prefix + separator + nonce/hex and expected suffix) or match a strict regex for
generated names; this will ensure __delitem__ and update_offload only operate on
true cache-created files and not arbitrary checkpoint files.
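One way to read the "strict regex" suggestion is sketched below. The prefix value and the `r<rank>_id<object id>` layout are assumptions chosen for the example; the PR does not confirm either.

```python
import os
import re

# Hypothetical prefix and naming scheme: <prefix>r<rank>_id<object id>.safetensors
PREFIX = "ct_disk_cache_"
GENERATED_NAME = re.compile(rf"{re.escape(PREFIX)}r\d+_id\d+\.safetensors")


def is_prefix_match(file_path: str) -> bool:
    """Prefix-only check, the behaviour the review flags as too weak."""
    return os.path.basename(file_path).startswith(PREFIX)


def is_generated_name(file_path: str) -> bool:
    """Accept only names that match the whole generated pattern."""
    return GENERATED_NAME.fullmatch(os.path.basename(file_path)) is not None
```

Under these assumptions, `ct_disk_cache_model.safetensors` passes `is_prefix_match` but is rejected by `is_generated_name`.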
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compressed_tensors/offload/cache/disk.py`:
- Around line 175-178: The filename generation in _get_ct_file_path currently
concatenates _get_rank() and id(offloaded) without a separator causing
collisions; change the formatting used to build file_name (referencing
_ct_file_prefix, _get_rank(), and id(offloaded)) to include a clear separator or
labeled segments (e.g., "r{rank}_id{obj_id}" or similar) between rank and object
id so filenames are unambiguous across ranks in offload_dir and cannot collide.
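The ambiguity is easy to demonstrate with small numbers. The two helpers below are illustrative only; the labeled format follows the `r{rank}_id{obj_id}` shape suggested in the comment, and the prefix is a placeholder.

```python
def name_without_separator(prefix: str, rank: int, obj_id: int) -> str:
    """Concatenate rank and object id directly, as in the flagged code."""
    return f"{prefix}{rank}{obj_id}.safetensors"


def name_with_labels(prefix: str, rank: int, obj_id: int) -> str:
    """Label each segment so the digits of rank and id cannot run together."""
    return f"{prefix}r{rank}_id{obj_id}.safetensors"


# rank 1 with id 23 and rank 12 with id 3 both produce the digits "123"
collide_a = name_without_separator("p_", 1, 23)
collide_b = name_without_separator("p_", 12, 3)

distinct_a = name_with_labels("p_", 1, 23)
distinct_b = name_with_labels("p_", 12, 3)
```

Real `id()` values are much larger, but the same digit overlap can occur whenever one rank's digits are a prefix of another's.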


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4028389-e9d3-4d2b-a7e7-809bc827ea44

📥 Commits

Reviewing files that changed from the base of the PR and between d0a1d0b and 0cede2a.

📒 Files selected for processing (1)
  • src/compressed_tensors/offload/cache/disk.py

Comment thread src/compressed_tensors/offload/cache/disk.py
@mergify

mergify Bot commented Apr 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify Bot commented Apr 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 9, 2026
@kylesayrs
Collaborator Author

@CodeRabbit resolve merge conflict

@coderabbitai

coderabbitai Bot commented Apr 10, 2026

Merge conflicts resolved successfully!

Resolved 1 conflict file(s). Commit: 2d14d78e3e5a6d83fab097cf5b11e4da32dbff17 pushed to kylesayrs/assert-no-modify.

3 file operation(s)
  • src/compressed_tensors/offload/cache/disk.py (update)
  • tests/test_offload/conftest.py (update)
  • tests/test_offload/convert/test_from_accelerate.py (update)
View agent analysis
I'll start by inspecting the git state and then analyze the conflict.

The conflict is clear and simple. The file already has the `accel_device` fixture (from the main branch's hardware abstraction changes), and the only conflict is in the last line of that fixture:

- **HEAD (PR branch `kylesayrs/assert-no-modify`)**: `else torch.device("cuda:0")` — still uses CUDA-specific hardcoding
- **main**: `else torch.device(accel_type, 0)` — uses the generic accelerator type (hardware-agnostic)

The main branch's approach is correct — it's consistent with the entire hardware abstraction refactor already applied throughout the file (using `torch.accelerator.*` APIs, `accel_device` fixture, etc.). The PR branch's version still had the old CUDA-specific hardcoding in this spot.

No unmerged paths remain. Let me verify the final file state:

Now let me run the build and lint checks:

Both syntax and lint checks pass. The resolution is complete.

**Summary:**

**What each version was doing:**
- **HEAD (PR branch `kylesayrs/assert-no-modify`)**: The `accel_device` fixture still had `torch.device("cuda:0")` hardcoded in the `else` branch — a leftover CUDA-specific reference that hadn't been updated.
- **main**: The fixture used `torch.device(accel_type, 0)` where `accel_type` comes from `torch.accelerator.current_accelerator().type` — the hardware-agnostic approach.

**How resolved:** Chose main's version (`torch.device(accel_type, 0)`), which is the correct resolution. The rest of the file was already fully updated to use `torch.accelerator.*` APIs (the hardware abstraction from main), so keeping the CUDA-specific hardcoding in the fixture's `else` branch would be inconsistent and would break on non-CUDA accelerators.

**Behavioral snapshot:** Every line in the resolved file traces back to one of the two sides. The `--log-dir tmp/torchrun-logs` (relative path, from the PR branch) and all `torch.accelerator.*` calls (from main's hardware abstraction) are both present and combined correctly. The result is a fully hardware-agnostic conftest with the PR's relative log-dir path.

@mergify

mergify Bot commented Apr 10, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages.

@mergify

mergify Bot commented Apr 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 13, 2026
@kylesayrs
Collaborator Author

@CodeRabbit resolve merge conflict

@mergify

mergify Bot commented Apr 17, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages.

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs force-pushed the kylesayrs/assert-no-modify branch from 2d14d78 to 77ad959 April 22, 2026 15:42
@kylesayrs kylesayrs requested a review from HDCharles April 22, 2026 15:43
@mergify mergify Bot removed the quality-failed label Apr 22, 2026


3 participants