feat: skip re-downloading models on shared storage#541

Draft
Kangyan-Zhou wants to merge 4 commits into sgl-project:main from Kangyan-Zhou:skip-redownload-shared-storage

Conversation

Collaborator

@Kangyan-Zhou Kangyan-Zhou commented Mar 16, 2026

Summary

When multiple nodes mount the same filesystem (e.g., GPFS/NFS at a shared model path), the model-agent on each node would independently re-download from HuggingFace or OCI on restart, causing rate-limiting and hours of unnecessary I/O.

  • Add isModelAlreadyDownloaded() that verifies model completeness locally using model.safetensors.index.json:
    1. config.json exists
    2. model.safetensors.index.json exists and is parseable
    3. ALL expected shards listed in the index are present on disk
  • Models without an index file proceed to the normal download path (no fallback heuristics)
  • Early-return in both OCI and HuggingFace download paths
  • Only applies to fresh Download tasks — DownloadOverride (spec updates, failed retries) still re-evaluates

Motivation

On a production cluster with 32 H200 nodes and shared GPFS, a DaemonSet restart caused all 32 model-agent pods to independently attempt HF downloads for a 756 GB model (142 shards). Only 1 of 32 nodes completed the download; the rest were stuck in Updating at ~100-500 KB/s due to HF rate limiting. This blocked pod scheduling because the clusterbasemodel Ready label was only applied to nodes that completed the download.

With this change, all 32 nodes detected the existing files and flipped to Ready within seconds of the pod restart.

Test plan

  • Unit tests for isModelAlreadyDownloaded covering: nonexistent dir, empty dir, config-only, weights-only (no index), shard completeness via index.json, missing shards, malformed index, empty weight_map
  • go vet and go test ./pkg/modelagent/... pass
  • Deployed to a production cluster and verified: model-agent pods detect existing files, verify all 142 shards, skip HF download, and mark nodes Ready immediately

🤖 Generated with Claude Code

@github-actions github-actions bot added the model-agent (Model agent changes) and tests (Test changes) labels Mar 16, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical performance and reliability issue where model-agent pods on nodes with shared storage would repeatedly attempt to download large models, leading to severe rate-limiting and prolonged startup times. By implementing a comprehensive check for already downloaded and complete models, the system can now efficiently detect existing files and skip unnecessary downloads, drastically improving model availability and reducing network strain in shared storage environments.

Highlights

  • Model Download Optimization: Introduced a new isModelAlreadyDownloaded function to prevent redundant model downloads when using shared storage, significantly reducing I/O and avoiding rate-limiting issues.
  • Download Logic Integration: Integrated the isModelAlreadyDownloaded check into both OCI and HuggingFace download workflows, allowing agents to skip downloads if models are already present and verified.
  • Model Completeness Verification: Defined a robust set of criteria for isModelAlreadyDownloaded, including the presence of config.json, at least one weight file, and full shard verification via model.safetensors.index.json if available.
  • Task Type Specificity: Ensured the optimization applies only to initial Download tasks, preserving re-evaluation for DownloadOverride tasks (e.g., spec updates or failed retries).

Changelog
  • pkg/modelagent/gopher.go
    • Imported the encoding/json package for JSON parsing.
    • Added a check for already downloaded models before initiating OCI downloads.
    • Added a check for already downloaded models before initiating HuggingFace downloads.
    • Implemented the isModelAlreadyDownloaded function to verify model presence and completeness on disk.
  • pkg/modelagent/gopher_test.go
    • Imported os and path/filepath packages for file system operations in tests.
    • Added a new test suite TestIsModelAlreadyDownloaded with various scenarios to validate the model download check logic.
Activity
  • Unit tests for isModelAlreadyDownloaded were developed, covering various scenarios like nonexistent directories, empty directories, config-only, weights-only, all weight formats, shard completeness via index.json, missing shards, and malformed index files.
  • go vet and go test ./pkg/modelagent/... commands were executed and passed successfully.
  • The changes were deployed to the prod-sci-us-central1-1 cluster, and verification confirmed that model-agent pods correctly detected existing files, verified all 142 shards, skipped HuggingFace downloads, and marked nodes as Ready immediately.
@Kangyan-Zhou Kangyan-Zhou force-pushed the skip-redownload-shared-storage branch from 9b2bfa2 to 3d33788 on March 16, 2026 04:01
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable optimization to skip re-downloading models that already exist on shared storage, which should significantly improve performance and reliability in multi-node environments. The implementation of isModelAlreadyDownloaded is robust and well-tested. My review includes a few suggestions to enhance maintainability by reducing code duplication and simplifying variable assignments.

I am having trouble creating individual review comments. My feedback follows.

pkg/modelagent/gopher.go (334-343)

medium

The variables baseModel and clusterBaseModel are redundant here. You can directly pass task.BaseModel and task.ClusterBaseModel to s.safeParseAndUpdateModelConfig to simplify the code and improve readability.

				if err := s.safeParseAndUpdateModelConfig(destPath, task.BaseModel, task.ClusterBaseModel, nil); err != nil {
					s.logger.Errorf("Failed to parse and update model config for pre-existing model: %v", err)
				}

pkg/modelagent/gopher.go (1003-1013)

medium

This block of code to determine baseModel and clusterBaseModel is duplicated from processTask. You can simplify this by passing task.BaseModel and task.ClusterBaseModel directly to s.safeParseAndUpdateModelConfig, which will improve readability and reduce code duplication.

		if err := s.safeParseAndUpdateModelConfig(destPath, task.BaseModel, task.ClusterBaseModel, nil); err != nil {
			s.logger.Errorf("Failed to parse and update model config for pre-existing model: %v", err)
		}

pkg/modelagent/gopher.go (1568)

medium

The weightExtensions slice contains a constant set of values. To improve maintainability and avoid re-declaration on each function call, consider defining it as a package-level constant or variable.
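
A minimal sketch of that suggestion, assuming the extension set from the first commit message (.safetensors, .bin, .pt, .gguf). Go has no constant slices, so a package-level `var` is the closest fit. `hasWeightExtension` is a hypothetical helper name, not a function from this PR.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// weightExtensions is declared once at package level instead of being
// rebuilt on every call. Go does not allow slice constants, hence var.
var weightExtensions = []string{".safetensors", ".bin", ".pt", ".gguf"}

// hasWeightExtension reports whether name ends in a known weight extension.
func hasWeightExtension(name string) bool {
	ext := strings.ToLower(filepath.Ext(name))
	for _, w := range weightExtensions {
		if ext == w {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasWeightExtension("model-00001-of-00142.safetensors")) // true
	fmt.Println(hasWeightExtension("config.json"))                      // false
}
```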

When multiple nodes mount the same filesystem (e.g., GPFS/NFS at
/storage/models), the model-agent on each node would independently
re-download from HuggingFace or OCI, causing rate-limiting and hours
of unnecessary I/O.

Add isModelAlreadyDownloaded() that checks:
1. config.json exists
2. If model.safetensors.index.json exists, ALL expected shards present
3. Otherwise, at least one weight file (.safetensors/.bin/.pt/.gguf)

Only applies to fresh Download tasks (not DownloadOverride) so spec
updates and failed retries still re-evaluate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou Kangyan-Zhou force-pushed the skip-redownload-shared-storage branch from 3d33788 to 8e81b16 on March 16, 2026 04:04
@pallasathena92
Collaborator

pallasathena92 commented Mar 31, 2026

  1. The isModelAlreadyDownloaded function checks destPath, which is not necessarily shared storage. This does not fully align with the comments.
  2. isModelAlreadyDownloaded only checks config.json, model.safetensors.index.json, and the xxx.safetensors shards. A model could have all shards present but still be missing other config files needed for inference. To guarantee that the model artifact is fully ready, the existing process re-downloads and checks the MD5 of the files.
  3. This fast-path optimization does not need to be added to the OCI storage type.
  4. We already have a download optimization for HF, which also skips downloading some reuseEligible model weights. The fast path is unnecessary here.

@Kangyan-Zhou Kangyan-Zhou marked this pull request as draft April 3, 2026 05:59
…le models

Extend isModelAlreadyDownloaded() to handle three model layouts:
1. Sharded safetensors (existing): verify all shards via index
2. Diffusion pipelines (new): verify component dirs via model_index.json
3. Single-file fallback (new): config.json + weight file heuristic

Also:
- Propagate safeParseAndUpdateModelConfig errors instead of swallowing
- Add path traversal guard for untrusted JSON keys
- Add detailed logging at every decision point for debugging
- Differentiate os.Stat permission errors from "not exist"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou Kangyan-Zhou force-pushed the skip-redownload-shared-storage branch from 4c83bd1 to cb1fd8a on April 3, 2026 06:55
On shared filesystems (NFS/GPFS/CephFS/Lustre), only one agent should
download model files per model. Others wait with jitter and recheck.

- Detect shared storage via syscall.Statfs filesystem magic numbers
- Per-model K8s Leases (model-download-<name>) for parallel downloads
  of different models while preventing duplicate downloads of the same model
- Non-leaders wait up to 5.5min with 15s jitter between rechecks
- Handle expired leases, API errors (IsNotFound vs transient), context cancellation
- Guard against nil HolderIdentity, lease renewal conflicts
- Use time.NewTimer with explicit Stop() to avoid timer leaks
- Fall back to downloading if leader times out

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou Kangyan-Zhou force-pushed the skip-redownload-shared-storage branch from cb1fd8a to 8d0baf6 on April 3, 2026 07:03
Real-world model_index.json files contain non-component entries like:
- "boundary_ratio": 0.9 (float metadata)
- "image_encoder": [null, null] (disabled component)

Only treat entries as components if they are arrays with at least 2
elements where the first is a non-null string (library name).

Also: add lease cleanup after download, fix lease name sanitization
(spaces, dots), and improve sanitizeLeaseName for RFC 1123 compliance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
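
The component filter from the commit message above (an array of at least two elements whose first element is a non-null string) can be illustrated as follows. `componentDirs` is a hypothetical name, and apart from `boundary_ratio` and `image_encoder`, the keys and values in the sample JSON are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// componentDirs returns the keys of a model_index.json document whose value
// is an array of at least two elements with a non-null string first element.
// Scalars, strings, and disabled entries such as [null, null] are skipped.
func componentDirs(raw []byte) ([]string, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(raw, &top); err != nil {
		return nil, err
	}
	var dirs []string
	for key, val := range top {
		var arr []json.RawMessage
		if json.Unmarshal(val, &arr) != nil || len(arr) < 2 {
			continue // not an array, or too short to be a component
		}
		var lib *string // stays nil when the JSON value is null
		if json.Unmarshal(arr[0], &lib) != nil || lib == nil {
			continue // first element is null or not a string
		}
		dirs = append(dirs, key)
	}
	sort.Strings(dirs) // map iteration order is random; sort for stable output
	return dirs, nil
}

func main() {
	sample := []byte(`{
		"_class_name": "ExamplePipeline",
		"boundary_ratio": 0.9,
		"image_encoder": [null, null],
		"vae": ["diffusers", "AutoencoderKL"],
		"text_encoder": ["transformers", "ExampleEncoder"]
	}`)
	dirs, err := componentDirs(sample)
	fmt.Println(dirs, err)
}
```

Each returned key would then be checked as a subdirectory of the model path. That directory check is not shown here.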