feat: skip re-downloading models on shared storage#541
feat: skip re-downloading models on shared storage#541Kangyan-Zhou wants to merge 4 commits intosgl-project:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical performance and reliability issue where model-agent pods on nodes with shared storage would repeatedly attempt to download large models, leading to severe rate-limiting and prolonged startup times. By implementing a comprehensive check for already downloaded and complete models, the system can now efficiently detect existing files and skip unnecessary downloads, drastically improving model availability and reducing network strain in shared storage environments. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
9b2bfa2 to
3d33788
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a valuable optimization to skip re-downloading models that already exist on shared storage, which should significantly improve performance and reliability in multi-node environments. The implementation of isModelAlreadyDownloaded is robust and well-tested. My review includes a few suggestions to enhance maintainability by reducing code duplication and simplifying variable assignments.
I am having trouble creating individual review comments. Click here to see my feedback.
pkg/modelagent/gopher.go (334-343)
The variables baseModel and clusterBaseModel are redundant here. You can directly pass task.BaseModel and task.ClusterBaseModel to s.safeParseAndUpdateModelConfig to simplify the code and improve readability.
if err := s.safeParseAndUpdateModelConfig(destPath, task.BaseModel, task.ClusterBaseModel, nil); err != nil {
s.logger.Errorf("Failed to parse and update model config for pre-existing model: %v", err)
}pkg/modelagent/gopher.go (1003-1013)
This block of code to determine baseModel and clusterBaseModel is duplicated from processTask. You can simplify this by passing task.BaseModel and task.ClusterBaseModel directly to s.safeParseAndUpdateModelConfig, which will improve readability and reduce code duplication.
if err := s.safeParseAndUpdateModelConfig(destPath, task.BaseModel, task.ClusterBaseModel, nil); err != nil {
s.logger.Errorf("Failed to parse and update model config for pre-existing model: %v", err)
}pkg/modelagent/gopher.go (1568)
The weightExtensions slice contains a constant set of values. To improve maintainability and avoid re-declaration on each function call, consider defining it as a package-level constant or variable.
When multiple nodes mount the same filesystem (e.g., GPFS/NFS at /storage/models), the model-agent on each node would independently re-download from HuggingFace or OCI, causing rate-limiting and hours of unnecessary I/O. Add isModelAlreadyDownloaded() that checks: 1. config.json exists 2. If model.safetensors.index.json exists, ALL expected shards present 3. Otherwise, at least one weight file (.safetensors/.bin/.pt/.gguf) Only applies to fresh Download tasks (not DownloadOverride) so spec updates and failed retries still re-evaluate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3d33788 to
8e81b16
Compare
|
…le models Extend isModelAlreadyDownloaded() to handle three model layouts: 1. Sharded safetensors (existing): verify all shards via index 2. Diffusion pipelines (new): verify component dirs via model_index.json 3. Single-file fallback (new): config.json + weight file heuristic Also: - Propagate safeParseAndUpdateModelConfig errors instead of swallowing - Add path traversal guard for untrusted JSON keys - Add detailed logging at every decision point for debugging - Differentiate os.Stat permission errors from "not exist" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4c83bd1 to
cb1fd8a
Compare
On shared filesystems (NFS/GPFS/CephFS/Lustre), only one agent should download model files per model. Others wait with jitter and recheck. - Detect shared storage via syscall.Statfs filesystem magic numbers - Per-model K8s Leases (model-download-<name>) for parallel downloads of different models while preventing duplicate downloads of the same model - Non-leaders wait up to 5.5min with 15s jitter between rechecks - Handle expired leases, API errors (IsNotFound vs transient), context cancellation - Guard against nil HolderIdentity, lease renewal conflicts - Use time.NewTimer with explicit Stop() to avoid timer leaks - Fall back to downloading if leader times out Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cb1fd8a to
8d0baf6
Compare
Real-world model_index.json files contain non-component entries like: - "boundary_ratio": 0.9 (float metadata) - "image_encoder": [null, null] (disabled component) Only treat entries as components if they are arrays with at least 2 elements where the first is a non-null string (library name). Also: add lease cleanup after download, fix lease name sanitization (spaces, dots), and improve sanitizeLeaseName for RFC 1123 compliance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
When multiple nodes mount the same filesystem (e.g., GPFS/NFS at a shared model path), the model-agent on each node would independently re-download from HuggingFace or OCI on restart, causing rate-limiting and hours of unnecessary I/O.
isModelAlreadyDownloaded()that verifies model completeness locally usingmodel.safetensors.index.json:config.jsonexistsmodel.safetensors.index.jsonexists and is parseableDownloadtasks —DownloadOverride(spec updates, failed retries) still re-evaluatesMotivation
On a production cluster with 32 H200 nodes and shared GPFS, a DaemonSet restart caused all 32 model-agent pods to independently attempt HF downloads for a 756 GB model (142 shards). Only 1 of 32 nodes completed the download; the rest were stuck in
Updatingat ~100-500 KB/s due to HF rate limiting. This blocked pod scheduling because theclusterbasemodelReady label was only applied to nodes that completed the download.With this change, all 32 nodes detected the existing files and flipped to
Readywithin seconds of the pod restart.Test plan
isModelAlreadyDownloadedcovering: nonexistent dir, empty dir, config-only, weights-only (no index), shard completeness via index.json, missing shards, malformed index, empty weight_mapgo vetandgo test ./pkg/modelagent/...pass🤖 Generated with Claude Code