feat: skip re-downloading models on shared storage by Kangyan-Zhou · Pull Request #541 · sgl-project/ome

Kangyan-Zhou · 2026-03-16T03:58:13Z

Summary

When multiple nodes mount the same filesystem (e.g., GPFS/NFS at a shared model path), the model-agent on each node would independently re-download from HuggingFace or OCI on restart, causing rate-limiting and hours of unnecessary I/O.

Add isModelAlreadyDownloaded() that verifies model completeness locally using model.safetensors.index.json:
1. config.json exists
2. model.safetensors.index.json exists and is parseable
3. ALL expected shards listed in the index are present on disk
- Models without an index file proceed to the normal download path (no fallback heuristics)
- Early-return in both OCI and HuggingFace download paths
- Only applies to fresh Download tasks; DownloadOverride (spec updates, failed retries) still re-evaluates
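
The three checks above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the names `missingShards` and `modelLooksComplete` are hypothetical, and treating an empty `weight_map` as incomplete is an assumption (the test plan lists that case without stating the expected result).

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// safetensorsIndex holds the only field of model.safetensors.index.json this
// sketch needs: weight_map maps tensor names to shard file names.
type safetensorsIndex struct {
	WeightMap map[string]string `json:"weight_map"`
}

// missingShards (hypothetical helper) parses the index JSON and returns the
// sorted shard file names for which exists reports false. Malformed JSON and
// an empty weight_map are returned as errors (the latter is an assumption).
func missingShards(indexJSON []byte, exists func(name string) bool) ([]string, error) {
	var idx safetensorsIndex
	if err := json.Unmarshal(indexJSON, &idx); err != nil {
		return nil, fmt.Errorf("parse index: %w", err)
	}
	if len(idx.WeightMap) == 0 {
		return nil, fmt.Errorf("index has an empty weight_map")
	}
	seen := make(map[string]bool)
	var missing []string
	for _, shard := range idx.WeightMap {
		if seen[shard] {
			continue // many tensors share one shard file
		}
		seen[shard] = true
		if !exists(shard) {
			missing = append(missing, shard)
		}
	}
	sort.Strings(missing)
	return missing, nil
}

// modelLooksComplete (hypothetical) applies the three checks to dir:
// config.json exists, the index parses, and every listed shard is on disk.
// Shard names come from JSON and are not validated here.
func modelLooksComplete(dir string) bool {
	exists := func(name string) bool {
		info, err := os.Stat(filepath.Join(dir, name))
		return err == nil && !info.IsDir()
	}
	if !exists("config.json") {
		return false
	}
	data, err := os.ReadFile(filepath.Join(dir, "model.safetensors.index.json"))
	if err != nil {
		return false // no index: fall through to the normal download path
	}
	missing, err := missingShards(data, exists)
	return err == nil && len(missing) == 0
}

func main() {
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	fmt.Printf("%s complete: %v\n", dir, modelLooksComplete(dir))
}
```

Separating the JSON logic from the filesystem lookup keeps the shard check testable without creating files on disk.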

Motivation

On a production cluster with 32 H200 nodes and shared GPFS, a DaemonSet restart caused all 32 model-agent pods to independently attempt HF downloads for a 756 GB model (142 shards). Only 1 of 32 nodes completed the download; the rest were stuck in Updating at ~100-500 KB/s due to HF rate limiting. This blocked pod scheduling because the clusterbasemodel Ready label was only applied to nodes that completed the download.

With this change, all 32 nodes detected the existing files and flipped to Ready within seconds of the pod restart.

Test plan

Unit tests for isModelAlreadyDownloaded covering: nonexistent dir, empty dir, config-only, weights-only (no index), shard completeness via index.json, missing shards, malformed index, empty weight_map
go vet and go test ./pkg/modelagent/... pass
Deployed to a production cluster and verified: model-agent pods detect existing files, verify all 142 shards, skip HF download, and mark nodes Ready immediately

🤖 Generated with Claude Code

gemini-code-assist · 2026-03-16T03:58:31Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical performance and reliability issue where model-agent pods on nodes with shared storage would repeatedly attempt to download large models, leading to severe rate-limiting and prolonged startup times. By implementing a comprehensive check for already downloaded and complete models, the system can now efficiently detect existing files and skip unnecessary downloads, drastically improving model availability and reducing network strain in shared storage environments.

Highlights

Model Download Optimization: Introduced a new isModelAlreadyDownloaded function to prevent redundant model downloads when using shared storage, significantly reducing I/O and avoiding rate-limiting issues.
Download Logic Integration: Integrated the isModelAlreadyDownloaded check into both OCI and HuggingFace download workflows, allowing agents to skip downloads if models are already present and verified.
Model Completeness Verification: Defined a robust set of criteria for isModelAlreadyDownloaded, including the presence of config.json, at least one weight file, and full shard verification via model.safetensors.index.json if available.
Task Type Specificity: Ensured the optimization applies only to initial Download tasks, preserving re-evaluation for DownloadOverride tasks (e.g., spec updates or failed retries).


Changelog

pkg/modelagent/gopher.go
- Imported the encoding/json package for JSON parsing.
- Added a check for already downloaded models before initiating OCI downloads.
- Added a check for already downloaded models before initiating HuggingFace downloads.
- Implemented the isModelAlreadyDownloaded function to verify model presence and completeness on disk.
pkg/modelagent/gopher_test.go
- Imported os and path/filepath packages for file system operations in tests.
- Added a new test suite TestIsModelAlreadyDownloaded with various scenarios to validate the model download check logic.

Activity

Unit tests for isModelAlreadyDownloaded were developed, covering various scenarios like nonexistent directories, empty directories, config-only, weights-only, all weight formats, shard completeness via index.json, missing shards, and malformed index files.
go vet and go test ./pkg/modelagent/... commands were executed and passed successfully.
The changes were deployed to the prod-sci-us-central1-1 cluster, and verification confirmed that model-agent pods correctly detected existing files, verified all 142 shards, skipped HuggingFace downloads, and marked nodes as Ready immediately.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a valuable optimization to skip re-downloading models that already exist on shared storage, which should significantly improve performance and reliability in multi-node environments. The implementation of isModelAlreadyDownloaded is robust and well-tested. My review includes a few suggestions to enhance maintainability by reducing code duplication and simplifying variable assignments.

I am having trouble creating individual review comments. Click here to see my feedback.

pkg/modelagent/gopher.go (334-343)

The variables baseModel and clusterBaseModel are redundant here. You can directly pass task.BaseModel and task.ClusterBaseModel to s.safeParseAndUpdateModelConfig to simplify the code and improve readability.

				if err := s.safeParseAndUpdateModelConfig(destPath, task.BaseModel, task.ClusterBaseModel, nil); err != nil {
					s.logger.Errorf("Failed to parse and update model config for pre-existing model: %v", err)
				}

pkg/modelagent/gopher.go (1003-1013)

This block of code to determine baseModel and clusterBaseModel is duplicated from processTask. You can simplify this by passing task.BaseModel and task.ClusterBaseModel directly to s.safeParseAndUpdateModelConfig, which will improve readability and reduce code duplication.

		if err := s.safeParseAndUpdateModelConfig(destPath, task.BaseModel, task.ClusterBaseModel, nil); err != nil {
			s.logger.Errorf("Failed to parse and update model config for pre-existing model: %v", err)
		}

pkg/modelagent/gopher.go (1568)

The weightExtensions slice contains a constant set of values. To improve maintainability and avoid re-declaration on each function call, consider defining it as a package-level constant or variable.
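
A minimal sketch of that suggestion, assuming the extension set named in the commit message (.safetensors, .bin, .pt, .gguf). The helper `isWeightFile` is a hypothetical name for illustration and is not taken from the PR.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// weightExtensions is declared once at package level instead of being
// rebuilt on every call. Go has no slice constants, so a var is used.
var weightExtensions = []string{".safetensors", ".bin", ".pt", ".gguf"}

// isWeightFile (hypothetical helper) reports whether name ends in one of the
// known weight extensions. The comparison is case-sensitive.
func isWeightFile(name string) bool {
	ext := filepath.Ext(name)
	for _, w := range weightExtensions {
		if ext == w {
			return true
		}
	}
	return false
}

func main() {
	for _, n := range []string{"model.safetensors", "config.json", "pytorch_model.bin"} {
		fmt.Printf("%s: %v\n", n, isWeightFile(n))
	}
}
```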

When multiple nodes mount the same filesystem (e.g., GPFS/NFS at /storage/models), the model-agent on each node would independently re-download from HuggingFace or OCI, causing rate-limiting and hours of unnecessary I/O.

Add isModelAlreadyDownloaded() that checks:
1. config.json exists
2. If model.safetensors.index.json exists, ALL expected shards present
3. Otherwise, at least one weight file (.safetensors/.bin/.pt/.gguf)

Only applies to fresh Download tasks (not DownloadOverride) so spec updates and failed retries still re-evaluate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pallasathena92 · 2026-03-31T19:35:48Z

The isModelAlreadyDownloaded function checks destPath, which is not necessarily on shared storage. This is not entirely aligned with the comments.
isModelAlreadyDownloaded only checks config.json, model.safetensors.index.json, and the xxx.safetensors shards. A model could have all shards present but be missing other config files needed for inference. To guarantee that the model artifact is fully ready, the process will re-download and check the MD5 of the files.
This fast-path optimization does not need to be added to the OCI storage type.
We already have a download optimization for HF, which also skips downloading reuseEligible model weights. The fast path is unnecessary here.

…le models

Extend isModelAlreadyDownloaded() to handle three model layouts:
1. Sharded safetensors (existing): verify all shards via index
2. Diffusion pipelines (new): verify component dirs via model_index.json
3. Single-file fallback (new): config.json + weight file heuristic

Also:
- Propagate safeParseAndUpdateModelConfig errors instead of swallowing
- Add path traversal guard for untrusted JSON keys
- Add detailed logging at every decision point for debugging
- Differentiate os.Stat permission errors from "not exist"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
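
One common way to write a path traversal guard for names read from untrusted JSON is sketched below. The function name `safeJoin` and its exact rules are assumptions for illustration, and the PR's own guard may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin (hypothetical helper) joins an untrusted relative name onto base
// and rejects empty names, absolute paths, and anything that would resolve
// outside base after cleaning (for example "../secret" or "a/../../x").
func safeJoin(base, name string) (string, error) {
	if name == "" || filepath.IsAbs(name) {
		return "", fmt.Errorf("invalid name %q", name)
	}
	cleanBase := filepath.Clean(base)
	joined := filepath.Join(cleanBase, name) // Join also cleans the result
	rel, err := filepath.Rel(cleanBase, joined)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("name %q escapes %q", name, base)
	}
	return joined, nil
}

func main() {
	for _, name := range []string{"unet/model.safetensors", "../secret", "/etc/passwd"} {
		p, err := safeJoin("/models/m", name)
		fmt.Printf("%q -> %q, err=%v\n", name, p, err)
	}
}
```

This check is purely lexical and does not resolve symlinks, so it only guards against names that escape the directory by construction.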

On shared filesystems (NFS/GPFS/CephFS/Lustre), only one agent should download model files per model. Others wait with jitter and recheck.

- Detect shared storage via syscall.Statfs filesystem magic numbers
- Per-model K8s Leases (model-download-<name>) for parallel downloads of different models while preventing duplicate downloads of the same model
- Non-leaders wait up to 5.5min with 15s jitter between rechecks
- Handle expired leases, API errors (IsNotFound vs transient), context cancellation
- Guard against nil HolderIdentity, lease renewal conflicts
- Use time.NewTimer with explicit Stop() to avoid timer leaks
- Fall back to downloading if leader times out

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
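
The non-leader wait loop described in this commit could look roughly like the sketch below. The names `sleepWithJitter` and `waitForModel` are hypothetical, the base recheck interval is an assumption (the message only gives the 5.5 min cap and the 15 s jitter), and the Lease handling is left out entirely.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// sleepWithJitter (hypothetical) waits for base plus a random extra in
// [0, jitter). It returns true if the timer fired and false if done was
// closed first. The timer is stopped explicitly so it does not leak.
func sleepWithJitter(done <-chan struct{}, base, jitter time.Duration) bool {
	d := base
	if jitter > 0 {
		d += time.Duration(rand.Int63n(int64(jitter)))
	}
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return true
	case <-done:
		return false
	}
}

// waitForModel (hypothetical) rechecks ready until it reports true, maxWait
// has elapsed, or ctx is cancelled. A false result with a nil error means
// the caller should fall back to downloading the model itself.
func waitForModel(ctx context.Context, ready func() bool, maxWait, base, jitter time.Duration) (bool, error) {
	deadline := time.Now().Add(maxWait)
	for {
		if ready() {
			return true, nil
		}
		if !time.Now().Before(deadline) {
			return false, nil
		}
		if !sleepWithJitter(ctx.Done(), base, jitter) {
			return false, ctx.Err()
		}
	}
}

func main() {
	// Short durations keep the demo fast; the commit message describes a
	// cap of 5.5 minutes with 15 seconds of jitter.
	calls := 0
	ready := func() bool { calls++; return calls >= 3 }
	found, err := waitForModel(context.Background(), ready, time.Second, 10*time.Millisecond, 5*time.Millisecond)
	fmt.Println(found, err, calls)
}
```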

Real-world model_index.json files contain non-component entries like:
- "boundary_ratio": 0.9 (float metadata)
- "image_encoder": [null, null] (disabled component)

Only treat entries as components if they are arrays with at least 2 elements where the first is a non-null string (library name).

Also: add lease cleanup after download, fix lease name sanitization (spaces, dots), and improve sanitizeLeaseName for RFC 1123 compliance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
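
The component rule stated in this commit (an array with at least 2 elements whose first element is a non-null string) can be sketched as below. The function name `componentNames` is hypothetical and the sample JSON in `main` is made up for illustration, so this is not the PR's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// componentNames (hypothetical) returns the sorted keys of a
// model_index.json document whose value is an array with at least two
// elements and a string as the first element (the library name).
func componentNames(indexJSON []byte) ([]string, error) {
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(indexJSON, &raw); err != nil {
		return nil, fmt.Errorf("parse model_index.json: %w", err)
	}
	var names []string
	for key, val := range raw {
		var arr []interface{}
		if err := json.Unmarshal(val, &arr); err != nil {
			continue // not an array, e.g. "boundary_ratio": 0.9
		}
		if len(arr) < 2 {
			continue
		}
		if _, ok := arr[0].(string); !ok {
			continue // e.g. "image_encoder": [null, null]
		}
		names = append(names, key)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	// Illustrative sample only; not copied from a real model.
	sample := []byte(`{
		"_class_name": "ExamplePipeline",
		"boundary_ratio": 0.9,
		"image_encoder": [null, null],
		"unet": ["diffusers", "UNet2DConditionModel"],
		"vae": ["diffusers", "AutoencoderKL"]
	}`)
	names, err := componentNames(sample)
	fmt.Println(names, err)
}
```

Decoding each value as `json.RawMessage` first lets scalar metadata and disabled components be skipped without failing the whole parse.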

Kangyan-Zhou requested review from beiguo218, pallasathena92, slin1237 and truddy0 as code owners March 16, 2026 03:58

github-actions bot added the model-agent (Model agent changes) and tests (Test changes) labels Mar 16, 2026

Kangyan-Zhou force-pushed the skip-redownload-shared-storage branch from 9b2bfa2 to 3d33788 Compare March 16, 2026 04:01

gemini-code-assist bot reviewed Mar 16, 2026


Kangyan-Zhou force-pushed the skip-redownload-shared-storage branch from 3d33788 to 8e81b16 Compare March 16, 2026 04:04

Kangyan-Zhou marked this pull request as draft April 3, 2026 05:59

Kangyan-Zhou force-pushed the skip-redownload-shared-storage branch from 4c83bd1 to cb1fd8a Compare April 3, 2026 06:55

Kangyan-Zhou force-pushed the skip-redownload-shared-storage branch from cb1fd8a to 8d0baf6 Compare April 3, 2026 07:03

Kangyan-Zhou wants to merge 4 commits into sgl-project:main from Kangyan-Zhou:skip-redownload-shared-storage
