Allow osdc-deploy-prod to target either prod cluster by huydhn · Pull Request #584 · pytorch/ci-infra

huydhn · 2026-05-16T02:43:50Z

Stack from ghstack (oldest at bottom):

Adds a workflow_dispatch cluster choice input so the prod deploy
workflow can target arc-cbr-production (us-east-2, default) OR
arc-cbr-production-uw1 (us-west-1).

[ghstack-poisoned]

jeanschmidt · 2026-05-17T17:27:14Z

You know what would be cool, not sure if we should do or not, but worth discussing:

instead of deploying a single cluster per call, by default it deploys to one cluster, runs the full test battery and if all successful deploys to the next.

A selector could force only one or another when triggering the workflow.

This would make it operationally much simpler for 99% of deploys and much safer (not deploying both at the same time, only proceeding with the next if the first pass tests, etc)

[ghstack-poisoned]

github-actions · 2026-05-19T18:45:26Z

tofu plan — arc-cbr-production

❌ Plan failed · commit d0662b1d · run log

Plan output

Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


Acquiring state lock. This may take a few moments...

Error: Error acquiring the state lock

Error message: operation error DynamoDB: PutItem, https response error
StatusCode: 400, RequestID:
K5687TMKP3L251K4OOUFN7JSIFVV4KQNSO5AEMVJF66Q9ASUAAJG,
ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        d3a1f807-e72a-3e34-0bc6-625156e3efe0
  Path:      ciforge-tfstate-arc-cbr-prod/arc-cbr-production/base/terraform.tfstate
  Operation: OperationTypePlan
  Who:       runner@runnervmrw5os
  Version:   1.7.10
  Created:   2026-05-19 18:45:23.744829717 +0000 UTC
  Info:      


OpenTofu acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.
error: recipe `plan` failed with exit code 1

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom): * #585 * #584 * #587 * #583 * __->__ #580 * #581 * #586 Adds the new us-west-1 prod cluster definition. Mirrors arc-cbr-production (us-east-2) for everything except: - region: us-west-1 - vpc_cidr: 10.8.0.0/16 (non-overlapping with us-east-2's 10.4.0.0/16 and staging's 10.0.0.0/16) - runner_group: "arc-cbr-prod-uw1" (distinct from us-east-2's default) - nodepools-b200 + arc-runners-b200 dropped (no B200 capacity reserved in us-west-1) - nodepools-h100.capacity_reservation_ids: [] placeholder This commit is non-functional until the prerequisites listed in docs/prod-cluster-ha-us-west-1.md land: 1. Org runner group `arc-cbr-prod-uw1` created at pytorch. 2. Generator support for cluster-level runner_group override. 3. Generator support for cluster-level capacity_reservation_ids override. 4. Real H100 reservation IDs filled into the placeholder array. Keeping the entry in main lets the design doc reference a concrete example and lets the prerequisite PRs reference this entry while they add the generator changes.

…581) Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom): * #585 * #584 * #587 * #583 * #580 * __->__ #581 * #586 Reads clusters.<id>.arc-runners.runner_group from clusters.yaml and uses it to override the runner_group value set on each runner def. The existing repo-scope guard at line 187 still applies — repo-scoped githubConfigUrl values are forced to "default" even when the cluster config asks for a custom group. This unblocks the multi-region prod design: two clusters can advertise the same mt-* runner labels and register in different GitHub runner groups (e.g. us-east-2 in "default", us-west-1 in "arc-cbr-prod-uw1") so GitHub routes runs-on jobs across both groups by capacity. Adds three new tests: - Cluster override wins over the def file's value - Cluster override applies even when the def doesn't set runner_group - Repo-scope guard still forces "default" when cluster sets a custom group See osdc/docs/prod-cluster-ha-us-west-1.md for the broader design.

…583) Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom): * #585 * #584 * #587 * __->__ #583 * #580 * #581 * #586 Adds a per-cluster override for the capacity_reservation_ids that were previously hardcoded in nodepool def files (e.g. modules/nodepools-h100/defs/p5-48xlarge.yaml). The override key in clusters.yaml is namespaced per module so b200 and h100 stay independent: clusters: arc-cbr-production-uw1: nodepools-h100: capacity_reservation_ids: - cr-04d3d1d84e127a562 - cr-09a53051589034fb8 This unblocks the multi-region prod design: one shared module def file, different reservation IDs per region. The us-east-2 IDs in the def file remain as the fallback when no cluster-level override is set. Implementation: - cluster-config.py: print list values as comma-separated for shell consumption. Existing string/bool behaviors unchanged. - modules/nodepools/deploy.sh: read clusters.<id>.<MODULE_NAME>. capacity_reservation_ids and pass as NODEPOOLS_CAPACITY_RESERVATION_IDS_OVERRIDE env var. MODULE_NAME comes from the nodepools-b200 / nodepools-h100 delegators, so the override key namespaces automatically. - generate_nodepools.py: if the env var is set and non-empty, parse it (comma-separated) and override the def file's value. Empty string leaves the def value alone. Adds 5 new tests: - cluster-config.py list output format (comma-separated, empty list) - Generator override wins over def value - Override applies when def has no value set - Empty override env var keeps def value intact See osdc/docs/prod-cluster-ha-us-west-1.md "Phase 1 prerequisite 2" for the broader context.

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom): * #585 * #584 * __->__ #587 * #583 * #580 * #581 * #586 Both prod clusters now declare their capacity_reservation_ids in clusters.yaml instead of inside the shared module def files. The def files describe nodepool shape; clusters.yaml declares per-cluster reservations. Moved out of def files into clusters.arc-cbr-production: nodepools-h100: cr-0c3f05dffb85ed832 nodepools-b200: cr-02cf82c9a0f7fa8c0, cr-06ec9d6c14b9d9981 Added under clusters.arc-cbr-production-uw1: nodepools-h100: cr-04d3d1d84e127a562 (2 × p5.48xlarge, 16 H100 GPUs) cr-09a53051589034fb8 (4 × p5.48xlarge, 32 H100 GPUs) Future capacity-reservation rotations now touch clusters.yaml only, keeping the reservation lifecycle next to the cluster config rather than buried inside shared module data. The generator change in the previous commit reads these values via the NODEPOOLS_CAPACITY_RESERVATION_IDS_OVERRIDE env var that the nodepools deploy.sh now exports per cluster.

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom): * __->__ #585 * #584 * #587 * #583 * #580 * #581 * #586 Extracts the plan + PR-comment logic into a reusable _osdc-plan.yml and has osdc-plan-prod call it twice — once per cluster, in sequence (us-east-2, then us-west-1). Each cluster's plan posts its own PR comment with a per-cluster marker. The second leg runs regardless of the first's outcome (via `!cancelled()`) so reviewers see both diffs even when one is broken.

…orch#586) Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0) (oldest at bottom): * pytorch#585 * pytorch#584 * pytorch#587 * pytorch#583 * pytorch#580 * pytorch#581 * __->__ pytorch#586 Documents the active/active design for adding a second prod cluster in us-west-1 alongside arc-cbr-production (us-east-2): - Both clusters advertise the same mt-* runner labels but register in different GitHub runner groups (us-east-2 in `default`, us-west-1 in `arc-cbr-prod-uw1`) so GitHub matches `runs-on` across both groups. - GitHub routes jobs by capacity → active/active for free, failover as a side effect. - B200 omitted (no capacity reservation in us-west-1); H100 supported via per-cluster `capacity_reservation_ids` in clusters.yaml. The doc covers the design, prerequisites (org runner group, H100 reservation IDs, service-quota raises), the full clusters.yaml entry to paste, deploy commands, post-deploy validation, capacity ramp recommendation, and what's deliberately out of scope.

huydhn added 2 commits May 15, 2026 19:43

Update (base update)

8894c6b

[ghstack-poisoned]

Update

1d1b30e

[ghstack-poisoned]

huydhn requested review from ZainRizvi, jeanschmidt and zxiiro as code owners May 16, 2026 02:43

huydhn added 2 commits May 15, 2026 20:05

Update (base update)

56b7f25

[ghstack-poisoned]

Update

70b3ea0

[ghstack-poisoned]

This was referenced May 16, 2026

Add design doc for arc-cbr-production-uw1 (us-west-1 HA cluster) #586

Merged

Move all prod capacity reservation IDs into clusters.yaml #587

Merged

jeanschmidt reviewed May 17, 2026

View reviewed changes

Comment thread .github/workflows/osdc-deploy-prod.yml Outdated

jeanschmidt requested changes May 17, 2026

View reviewed changes

huydhn added 2 commits May 17, 2026 19:17

Update (base update)

d304d21

[ghstack-poisoned]

Update

16595eb

[ghstack-poisoned]

huydhn requested a review from jeanschmidt May 18, 2026 02:19

huydhn added 2 commits May 17, 2026 22:34

Update (base update)

7784f36

[ghstack-poisoned]

Update

03eea75

[ghstack-poisoned]

jeanschmidt approved these changes May 19, 2026

View reviewed changes

huydhn added 2 commits May 19, 2026 11:44

Update (base update)

80a640b

[ghstack-poisoned]

Update

9dd9bb0

[ghstack-poisoned]

huydhn had a problem deploying to osdc-staging May 19, 2026 18:45 — with GitHub Actions Failure

huydhn changed the base branch from gh/huydhn/28/base to main May 19, 2026 18:51

Merge branch 'main' into gh/huydhn/28/head

68ce66a

huydhn merged commit d7e7815 into main May 19, 2026
10 checks passed

huydhn deleted the gh/huydhn/28/head branch May 19, 2026 18:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow osdc-deploy-prod to target either prod cluster#584

Allow osdc-deploy-prod to target either prod cluster#584
huydhn merged 11 commits into
mainfrom
gh/huydhn/28/head

huydhn commented May 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

jeanschmidt commented May 17, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

huydhn commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

jeanschmidt commented May 17, 2026

Uh oh!

github-actions Bot commented May 19, 2026

tofu plan — arc-cbr-production

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

huydhn commented May 16, 2026 •

edited

Loading