Skip to content

Allow osdc-deploy-prod to target either prod cluster#584

Merged
huydhn merged 11 commits into
mainfrom
gh/huydhn/28/head
May 19, 2026
Merged

Allow osdc-deploy-prod to target either prod cluster#584
huydhn merged 11 commits into
mainfrom
gh/huydhn/28/head

Conversation

@huydhn
Copy link
Copy Markdown
Contributor

@huydhn huydhn commented May 16, 2026

huydhn added 2 commits May 15, 2026 19:43
[ghstack-poisoned]
[ghstack-poisoned]
huydhn added 2 commits May 15, 2026 20:05
[ghstack-poisoned]
[ghstack-poisoned]
Comment thread .github/workflows/osdc-deploy-prod.yml Outdated
@jeanschmidt
Copy link
Copy Markdown
Contributor

You know what would be cool, not sure if we should do or not, but worth discussing:

instead of deploying a single cluster per call, by default it deploys to one cluster, runs the full test battery and if all successful deploys to the next.

A selector could force only one or another when triggering the workflow.

This would make it operationally much simpler for 99% of deploys and much safer (not deploying both at the same time, only proceeding with the next if the first pass tests, etc)

huydhn added 2 commits May 17, 2026 19:17
[ghstack-poisoned]
[ghstack-poisoned]
@huydhn huydhn requested a review from jeanschmidt May 18, 2026 02:19
huydhn added 2 commits May 17, 2026 22:34
[ghstack-poisoned]
[ghstack-poisoned]
huydhn added 2 commits May 19, 2026 11:44
[ghstack-poisoned]
[ghstack-poisoned]
@github-actions
Copy link
Copy Markdown

tofu plan — arc-cbr-production

❌ Plan failed · commit d0662b1d · run log

Plan output
Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


Acquiring state lock. This may take a few moments...

Error: Error acquiring the state lock

Error message: operation error DynamoDB: PutItem, https response error
StatusCode: 400, RequestID:
K5687TMKP3L251K4OOUFN7JSIFVV4KQNSO5AEMVJF66Q9ASUAAJG,
ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        d3a1f807-e72a-3e34-0bc6-625156e3efe0
  Path:      ciforge-tfstate-arc-cbr-prod/arc-cbr-production/base/terraform.tfstate
  Operation: OperationTypePlan
  Who:       runner@runnervmrw5os
  Version:   1.7.10
  Created:   2026-05-19 18:45:23.744829717 +0000 UTC
  Info:      


OpenTofu acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.
error: recipe `plan` failed with exit code 1

huydhn added a commit that referenced this pull request May 19, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0)
(oldest at bottom):
* #585
* #584
* #587
* #583
* __->__ #580
* #581
* #586

Adds the new us-west-1 prod cluster definition. Mirrors
arc-cbr-production (us-east-2) for everything except:

- region: us-west-1
- vpc_cidr: 10.8.0.0/16 (non-overlapping with us-east-2's 10.4.0.0/16
  and staging's 10.0.0.0/16)
- runner_group: "arc-cbr-prod-uw1" (distinct from us-east-2's default)
- nodepools-b200 + arc-runners-b200 dropped (no B200 capacity reserved
  in us-west-1)
- nodepools-h100.capacity_reservation_ids: [] placeholder

This commit is non-functional until the prerequisites listed in
docs/prod-cluster-ha-us-west-1.md land:
1. Org runner group `arc-cbr-prod-uw1` created at pytorch.
2. Generator support for cluster-level runner_group override.
3. Generator support for cluster-level capacity_reservation_ids
override.
4. Real H100 reservation IDs filled into the placeholder array.

Keeping the entry in main lets the design doc reference a concrete
example and lets the prerequisite PRs reference this entry while they
add the generator changes.
huydhn added a commit that referenced this pull request May 19, 2026
…581)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0)
(oldest at bottom):
* #585
* #584
* #587
* #583
* #580
* __->__ #581
* #586

Reads clusters.<id>.arc-runners.runner_group from clusters.yaml and
uses it to override the runner_group value set on each runner def.
The existing repo-scope guard at line 187 still applies — repo-scoped
githubConfigUrl values are forced to "default" even when the cluster
config asks for a custom group.

This unblocks the multi-region prod design: two clusters can advertise
the same mt-* runner labels and register in different GitHub runner
groups (e.g. us-east-2 in "default", us-west-1 in "arc-cbr-prod-uw1")
so GitHub routes runs-on jobs across both groups by capacity.

Adds three new tests:
- Cluster override wins over the def file's value
- Cluster override applies even when the def doesn't set runner_group
- Repo-scope guard still forces "default" when cluster sets a custom
group

See osdc/docs/prod-cluster-ha-us-west-1.md for the broader design.
huydhn added a commit that referenced this pull request May 19, 2026
…583)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0)
(oldest at bottom):
* #585
* #584
* #587
* __->__ #583
* #580
* #581
* #586

Adds a per-cluster override for the capacity_reservation_ids that
were previously hardcoded in nodepool def files (e.g.
modules/nodepools-h100/defs/p5-48xlarge.yaml). The override key in
clusters.yaml is namespaced per module so b200 and h100 stay
independent:

    clusters:
      arc-cbr-production-uw1:
        nodepools-h100:
          capacity_reservation_ids:
            - cr-04d3d1d84e127a562
            - cr-09a53051589034fb8

This unblocks the multi-region prod design: one shared module def
file, different reservation IDs per region. The us-east-2 IDs in the
def file remain as the fallback when no cluster-level override is
set.

Implementation:

- cluster-config.py: print list values as comma-separated for shell
  consumption. Existing string/bool behaviors unchanged.
- modules/nodepools/deploy.sh: read clusters.<id>.<MODULE_NAME>.
  capacity_reservation_ids and pass as
  NODEPOOLS_CAPACITY_RESERVATION_IDS_OVERRIDE env var. MODULE_NAME
  comes from the nodepools-b200 / nodepools-h100 delegators, so the
  override key namespaces automatically.
- generate_nodepools.py: if the env var is set and non-empty, parse
  it (comma-separated) and override the def file's value. Empty
  string leaves the def value alone.

Adds 5 new tests:
- cluster-config.py list output format (comma-separated, empty list)
- Generator override wins over def value
- Override applies when def has no value set
- Empty override env var keeps def value intact

See osdc/docs/prod-cluster-ha-us-west-1.md "Phase 1 prerequisite 2"
for the broader context.
huydhn added a commit that referenced this pull request May 19, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0)
(oldest at bottom):
* #585
* #584
* __->__ #587
* #583
* #580
* #581
* #586

Both prod clusters now declare their capacity_reservation_ids in
clusters.yaml instead of inside the shared module def files. The def
files describe nodepool shape; clusters.yaml declares per-cluster
reservations.

Moved out of def files into clusters.arc-cbr-production:
  nodepools-h100: cr-0c3f05dffb85ed832
  nodepools-b200: cr-02cf82c9a0f7fa8c0, cr-06ec9d6c14b9d9981

Added under clusters.arc-cbr-production-uw1:
  nodepools-h100: cr-04d3d1d84e127a562 (2 × p5.48xlarge, 16 H100 GPUs)
                  cr-09a53051589034fb8 (4 × p5.48xlarge, 32 H100 GPUs)

Future capacity-reservation rotations now touch clusters.yaml only,
keeping the reservation lifecycle next to the cluster config rather
than buried inside shared module data. The generator change in the
previous commit reads these values via the
NODEPOOLS_CAPACITY_RESERVATION_IDS_OVERRIDE env var that the nodepools
deploy.sh now exports per cluster.
@huydhn huydhn changed the base branch from gh/huydhn/28/base to main May 19, 2026 18:51
@huydhn huydhn merged commit d7e7815 into main May 19, 2026
10 checks passed
@huydhn huydhn deleted the gh/huydhn/28/head branch May 19, 2026 18:52
huydhn added a commit that referenced this pull request May 19, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0)
(oldest at bottom):
* __->__ #585
* #584
* #587
* #583
* #580
* #581
* #586

Extracts the plan + PR-comment logic into a reusable _osdc-plan.yml
and has osdc-plan-prod call it twice — once per cluster, in sequence
(us-east-2, then us-west-1). Each cluster's plan posts its own PR
comment with a per-cluster marker.

The second leg runs regardless of the first's outcome (via
`!cancelled()`) so reviewers see both diffs even when one is broken.
huydhn added a commit to huydhn/pytorch-ci-infra that referenced this pull request May 19, 2026
…orch#586)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.14.0)
(oldest at bottom):
* pytorch#585
* pytorch#584
* pytorch#587
* pytorch#583
* pytorch#580
* pytorch#581
* __->__ pytorch#586

Documents the active/active design for adding a second prod cluster in
us-west-1 alongside arc-cbr-production (us-east-2):

- Both clusters advertise the same mt-* runner labels but register in
  different GitHub runner groups (us-east-2 in `default`, us-west-1
  in `arc-cbr-prod-uw1`) so GitHub matches `runs-on` across both
  groups.
- GitHub routes jobs by capacity → active/active for free, failover
  as a side effect.
- B200 omitted (no capacity reservation in us-west-1); H100 supported
  via per-cluster `capacity_reservation_ids` in clusters.yaml.

The doc covers the design, prerequisites (org runner group, H100
reservation IDs, service-quota raises), the full clusters.yaml entry
to paste, deploy commands, post-deploy validation, capacity ramp
recommendation, and what's deliberately out of scope.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants