Allow osdc-deploy-prod to target either prod cluster #584
Merged
[ghstack-poisoned]
This was referenced May 16, 2026
jeanschmidt
reviewed
May 17, 2026
Contributor
You know what would be cool (not sure whether we should do it, but it's worth discussing): instead of deploying to a single cluster per call, the workflow could by default deploy to one cluster, run the full test battery, and deploy to the next cluster only if every test passes. A selector could force one cluster or the other when triggering the workflow. This would make 99% of deploys operationally much simpler and much safer: the two clusters are never deployed at the same time, and the rollout proceeds to the second only after the first passes its tests.
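A minimal sketch of the gated rollout proposed in the comment above, not the workflow's actual implementation. The function `rollout`, its `deploy` and `run_tests` callbacks, and the `only` selector are hypothetical names introduced for illustration; the cluster IDs come from the PR description.

```python
from typing import Callable, List, Optional, Sequence

# Cluster IDs from the PR description; the ordering is an assumption.
PROD_CLUSTERS = ["arc-cbr-production", "arc-cbr-production-uw1"]


def rollout(
    deploy: Callable[[str], None],
    run_tests: Callable[[str], bool],
    only: Optional[str] = None,
    clusters: Sequence[str] = PROD_CLUSTERS,
) -> List[str]:
    """Deploy cluster by cluster, stopping at the first test failure.

    Returns the clusters that were deployed AND passed their tests.
    `only` plays the role of the selector that forces a single cluster.
    """
    if only is not None and only not in clusters:
        raise ValueError(f"unknown cluster: {only}")
    targets = [only] if only is not None else list(clusters)

    passed: List[str] = []
    for cluster in targets:
        deploy(cluster)
        if not run_tests(cluster):
            break  # never touch the next cluster after a failed test run
        passed.append(cluster)
    return passed
```

With a failing test battery on the first cluster, `rollout` deploys that cluster only and returns an empty list, so the second cluster is left untouched.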
jeanschmidt
requested changes
May 17, 2026
jeanschmidt
approved these changes
May 19, 2026
tofu plan for arc-cbr-production: ❌ Plan failed · commit Plan output
huydhn
added a commit
that referenced
this pull request
May 19, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom):
* #585
* #584
* #587
* #583
* __->__ #580
* #581
* #586

Adds the new us-west-1 prod cluster definition. Mirrors arc-cbr-production (us-east-2) for everything except:
- region: us-west-1
- vpc_cidr: 10.8.0.0/16 (non-overlapping with us-east-2's 10.4.0.0/16 and staging's 10.0.0.0/16)
- runner_group: "arc-cbr-prod-uw1" (distinct from us-east-2's default)
- nodepools-b200 + arc-runners-b200 dropped (no B200 capacity reserved in us-west-1)
- nodepools-h100.capacity_reservation_ids: [] placeholder

This commit is non-functional until the prerequisites listed in docs/prod-cluster-ha-us-west-1.md land:
1. Org runner group `arc-cbr-prod-uw1` created at pytorch.
2. Generator support for cluster-level runner_group override.
3. Generator support for cluster-level capacity_reservation_ids override.
4. Real H100 reservation IDs filled into the placeholder array.

Keeping the entry in main lets the design doc reference a concrete example and lets the prerequisite PRs reference this entry while they add the generator changes.
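The non-overlap claim for the three VPC CIDRs can be checked with the standard library. This is an illustrative sketch rather than anything in the repository: the helper `overlapping_pairs` and the dictionary labels are assumptions, while the CIDR values are the ones quoted in the commit message.

```python
import ipaddress
from itertools import combinations

# CIDRs quoted in the commit message; the keys are illustrative labels only.
VPC_CIDRS = {
    "staging": "10.0.0.0/16",
    "arc-cbr-production (us-east-2)": "10.4.0.0/16",
    "arc-cbr-production-uw1 (us-west-1)": "10.8.0.0/16",
}


def overlapping_pairs(cidrs):
    """Return every pair of labels whose networks share at least one address."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


print(overlapping_pairs(VPC_CIDRS))  # prints [] because no two ranges overlap
```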
huydhn
added a commit
that referenced
this pull request
May 19, 2026
…581)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom):
* #585
* #584
* #587
* #583
* #580
* __->__ #581
* #586

Reads clusters.<id>.arc-runners.runner_group from clusters.yaml and uses it to override the runner_group value set on each runner def. The existing repo-scope guard at line 187 still applies: repo-scoped githubConfigUrl values are forced to "default" even when the cluster config asks for a custom group.

This unblocks the multi-region prod design: two clusters can advertise the same mt-* runner labels and register in different GitHub runner groups (e.g. us-east-2 in "default", us-west-1 in "arc-cbr-prod-uw1") so GitHub routes runs-on jobs across both groups by capacity.

Adds three new tests:
- Cluster override wins over the def file's value
- Cluster override applies even when the def doesn't set runner_group
- Repo-scope guard still forces "default" when cluster sets a custom group

See osdc/docs/prod-cluster-ha-us-west-1.md for the broader design.
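The precedence described above (cluster override, then def value, with the repo-scope guard applied last) can be sketched as follows. This is not the generator's code: `resolve_runner_group` and `is_repo_scoped` are hypothetical names, the URL heuristic for detecting a repo-scoped `githubConfigUrl` is an assumption, and so is falling back to "default" when neither source sets a group.

```python
from urllib.parse import urlparse


def is_repo_scoped(github_config_url: str) -> bool:
    """Assumed heuristic: https://github.com/<owner>/<repo> is repo-scoped,
    https://github.com/<org> and https://github.com/enterprises/<name> are not."""
    parts = [p for p in urlparse(github_config_url).path.split("/") if p]
    if parts and parts[0] == "enterprises":
        return False
    return len(parts) >= 2


def resolve_runner_group(runner_def: dict, cluster_cfg: dict) -> str:
    """Pick the runner group for one runner def within one cluster.

    `cluster_cfg` stands for the mapping found at clusters.<id> in clusters.yaml.
    """
    override = cluster_cfg.get("arc-runners", {}).get("runner_group")
    # Cluster-level value wins; otherwise use the def's value (assumed default: "default").
    group = override or runner_def.get("runner_group") or "default"
    # The guard runs last, so a custom group never reaches a repo-scoped runner.
    if is_repo_scoped(runner_def.get("githubConfigUrl", "")):
        return "default"
    return group
```

Applying the guard after the override mirrors the third test listed in the commit message: a cluster-level custom group cannot bypass it.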
huydhn
added a commit
that referenced
this pull request
May 19, 2026
…583)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom):
* #585
* #584
* #587
* __->__ #583
* #580
* #581
* #586

Adds a per-cluster override for the capacity_reservation_ids that were previously hardcoded in nodepool def files (e.g. modules/nodepools-h100/defs/p5-48xlarge.yaml). The override key in clusters.yaml is namespaced per module so b200 and h100 stay independent:

    clusters:
      arc-cbr-production-uw1:
        nodepools-h100:
          capacity_reservation_ids:
            - cr-04d3d1d84e127a562
            - cr-09a53051589034fb8

This unblocks the multi-region prod design: one shared module def file, different reservation IDs per region. The us-east-2 IDs in the def file remain as the fallback when no cluster-level override is set.

Implementation:
- cluster-config.py: print list values as comma-separated for shell consumption. Existing string/bool behaviors unchanged.
- modules/nodepools/deploy.sh: read clusters.<id>.<MODULE_NAME>.capacity_reservation_ids and pass as NODEPOOLS_CAPACITY_RESERVATION_IDS_OVERRIDE env var. MODULE_NAME comes from the nodepools-b200 / nodepools-h100 delegators, so the override key namespaces automatically.
- generate_nodepools.py: if the env var is set and non-empty, parse it (comma-separated) and override the def file's value. Empty string leaves the def value alone.

Adds 5 new tests:
- cluster-config.py list output format (comma-separated, empty list)
- Generator override wins over def value
- Override applies when def has no value set
- Empty override env var keeps def value intact

See osdc/docs/prod-cluster-ha-us-west-1.md "Phase 1 prerequisite 2" for the broader context.
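The round trip described in the implementation notes (list printed comma-separated, passed through an env var, parsed back by the generator) might look like the sketch below. The helper names `format_list` and `apply_override` are hypothetical and do not come from cluster-config.py or generate_nodepools.py; only the env var name and the empty-string behavior are taken from the commit message.

```python
import os
from typing import Iterable, Mapping, Optional

ENV_VAR = "NODEPOOLS_CAPACITY_RESERVATION_IDS_OVERRIDE"


def format_list(values: Iterable[str]) -> str:
    """Render a YAML list as one comma-separated string for shell consumption."""
    return ",".join(str(v) for v in values)


def apply_override(nodepool_def: dict, env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the def with capacity_reservation_ids overridden.

    An unset or empty env var leaves the def value alone.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "")
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    if not ids:
        return dict(nodepool_def)
    return {**nodepool_def, "capacity_reservation_ids": ids}


# Usage with the IDs quoted in this PR stack:
fallback = {"capacity_reservation_ids": ["cr-0c3f05dffb85ed832"]}
exported = format_list(["cr-04d3d1d84e127a562", "cr-09a53051589034fb8"])
print(apply_override(fallback, {ENV_VAR: exported})["capacity_reservation_ids"])
```

An empty list formats to an empty string, which `apply_override` treats the same as an unset variable, so the def file's value survives in both cases.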
huydhn
added a commit
that referenced
this pull request
May 19, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom):
* #585
* #584
* __->__ #587
* #583
* #580
* #581
* #586

Both prod clusters now declare their capacity_reservation_ids in clusters.yaml instead of inside the shared module def files. The def files describe nodepool shape; clusters.yaml declares per-cluster reservations.

Moved out of def files into clusters.arc-cbr-production:

    nodepools-h100: cr-0c3f05dffb85ed832
    nodepools-b200: cr-02cf82c9a0f7fa8c0, cr-06ec9d6c14b9d9981

Added under clusters.arc-cbr-production-uw1:

    nodepools-h100: cr-04d3d1d84e127a562 (2 × p5.48xlarge, 16 H100 GPUs)
                    cr-09a53051589034fb8 (4 × p5.48xlarge, 32 H100 GPUs)

Future capacity-reservation rotations now touch clusters.yaml only, keeping the reservation lifecycle next to the cluster config rather than buried inside shared module data.

The generator change in the previous commit reads these values via the NODEPOOLS_CAPACITY_RESERVATION_IDS_OVERRIDE env var that the nodepools deploy.sh now exports per cluster.
huydhn
added a commit
that referenced
this pull request
May 19, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom):
* __->__ #585
* #584
* #587
* #583
* #580
* #581
* #586

Extracts the plan + PR-comment logic into a reusable _osdc-plan.yml and has osdc-plan-prod call it twice, once per cluster and in sequence (us-east-2, then us-west-1). Each cluster's plan posts its own PR comment with a per-cluster marker. The second leg runs regardless of the first's outcome (via `!cancelled()`) so reviewers see both diffs even when one is broken.
huydhn
added a commit
to huydhn/pytorch-ci-infra
that referenced
this pull request
May 19, 2026
…orch#586)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom):
* pytorch#585
* pytorch#584
* pytorch#587
* pytorch#583
* pytorch#580
* pytorch#581
* __->__ pytorch#586

Documents the active/active design for adding a second prod cluster in us-west-1 alongside arc-cbr-production (us-east-2):
- Both clusters advertise the same mt-* runner labels but register in different GitHub runner groups (us-east-2 in `default`, us-west-1 in `arc-cbr-prod-uw1`) so GitHub matches `runs-on` across both groups.
- GitHub routes jobs by capacity → active/active for free, failover as a side effect.
- B200 omitted (no capacity reservation in us-west-1); H100 supported via per-cluster `capacity_reservation_ids` in clusters.yaml.

The doc covers the design, prerequisites (org runner group, H100 reservation IDs, service-quota raises), the full clusters.yaml entry to paste, deploy commands, post-deploy validation, capacity ramp recommendation, and what's deliberately out of scope.
Stack from ghstack (oldest at bottom):
Adds a workflow_dispatch `cluster` choice input so the prod deploy workflow can target arc-cbr-production (us-east-2, default) or arc-cbr-production-uw1 (us-west-1).