Add arc-staging-uw2 cluster for multi-region HA feasibility test by huydhn · Pull Request #574 · pytorch/ci-infra

huydhn · 2026-05-15T17:29:35Z

No need to review

Adds a second staging cluster in us-west-2 that shares the same runner_name_prefix and github_config_url as arc-staging. Purpose is to verify GitHub accepts duplicate ARC scale-set names and routes jobs by capacity — the load-bearing assumption for an eventual active/active prod deployment in us-west-1. Includes docs/prod-cluster-ha-us-west-1.md with the full Phase 0/Phase 1 plan, validation gates, and the H100 capacity-reservation override proposal.

Adds a workflow that runs `just bootstrap arc-staging-uw2` followed by `just deploy-base arc-staging-uw2` from a manual workflow_dispatch trigger. This lets us bring up the new staging cluster's base infra (VPC, EKS, Harbor, base k8s resources) from CI rather than requiring a local Linux machine. Module deploy is intentionally not included: modules need the pytorch-arc-staging GitHub App secret planted into the arc-runners namespace first, which is a manual kubectl step. Delete this workflow once the feasibility test wraps up.

Adds pull_request trigger with path filters so the workflow runs automatically when this workflow, clusters.yaml, or the plan doc changes. bootstrap and deploy-base are idempotent, so the first PR push provisions the cluster (~25min) and later pushes are quick no-ops. Concurrency is scoped per-ref so two open PRs don't collide.

github-actions · 2026-05-15T17:30:34Z

tofu plan — arc-cbr-production

✅ Plan succeeded · commit a1f81ae3 · run log

Plan output

Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=8115d61b-1bc1-49ad-b5a3-e8f88fc50cb1]
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-node-role]
data.aws_availability_zones.available: Reading...
module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3]
module.eks.data.aws_caller_identity.current: Reading...
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0a126b1613758a408]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNMSO5RRNP]
data.aws_availability_zones.available: Read complete after 1s [id=us-east-2]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-eks-secrets]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936813000000004]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936734100000003]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260316204739334600000001]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role-20260308084936681500000001]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role-20260308084936685500000002]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936816800000005]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-009f1fe7d56695348]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-03eb66e57d13af64b]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-084ed6fc52db22c39]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0701693364b79c021]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-023207cd15e79c81a]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-06a70b2818e270ed8]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0610564f678f81c5f]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-0078fd5c0f6bc05eb]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0545d26e4a1d0ba89]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-04682fc890bfd4630]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0ce6f1dcb7208cad8]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-07ac52a1aa741f267]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3-20260308084938596600000006]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0d2591f24cba79e7b]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-04d9bba8d43569bbf]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-0aa6ea5c845170545]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-0f34cc1aafea8fd16]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-07e2274170282eb8c]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-086e3e66fe238d459]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0f623a6fa9d7bde45]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-000d05ecec7d4b66e]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0777285eddd2bacd1]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-00dacd13031b1f5de]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0ec9764e9015e972e]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-08ccb8cfe4bfa80d7]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production:kube-proxy]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-090bac79dddc5b77f]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production:pytorch-arc-cbr-production-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/70AA0C12C21E1A843313EF1BDE82D29A]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=2255203180]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production:coredns]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry-2026030809125509320000000c]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role-2026030809125522790000000d]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-ebs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change]
data.terraform_remote_state.base: Read complete after 0s
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-03b965bcc0c037434,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-04682fc890bfd4630"]: Refreshing state... [id=subnet-04682fc890bfd4630,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0545d26e4a1d0ba89"]: Refreshing state... [id=subnet-0545d26e4a1d0ba89,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0ce6f1dcb7208cad8"]: Refreshing state... [id=subnet-0ce6f1dcb7208cad8,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-karpenter-controller]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change-KarpenterScheduledChange]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller-20260308154648023000000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-053d2ed886d9ac92d]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role]
aws_security_group.efs: Refreshing state... [id=sg-099ef6309262a93fd]
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role]
aws_efs_mount_target.pypi_cache["subnet-0545d26e4a1d0ba89"]: Refreshing state... [id=fsmt-05b0a0d538bd49c8e]
aws_efs_mount_target.pypi_cache["subnet-04682fc890bfd4630"]: Refreshing state... [id=fsmt-0743bba60c50ed499]
aws_efs_mount_target.pypi_cache["subnet-0ce6f1dcb7208cad8"]: Refreshing state... [id=fsmt-01378a00a07852987]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role-20260403211352357700000001]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role-20260330040250456800000003]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role-20260403211352439500000002]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-efs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.
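
A long plan log like the one above can be condensed into a per-section verdict. The following is a minimal sketch, assuming the section banners keep the `━━━ PLAN: <name> ━━━` shape and that a clean section contains a line starting with `No changes.`, as in this output. The function name `summarize_plan` is hypothetical and not part of the repository.

```python
import re

# Matches the banner lines printed before each plan, e.g.
# "━━━ PLAN: Base (arc-cbr-production) ━━━"
SECTION = re.compile(r"^━━━ PLAN: (.+?) ━━━$")


def summarize_plan(log: str) -> dict[str, bool]:
    """Map each plan section name to True when it reported no changes."""
    result: dict[str, bool] = {}
    current = None
    for line in log.splitlines():
        match = SECTION.match(line.strip())
        if match:
            current = match.group(1)
            result[current] = False  # assume changes until told otherwise
        elif current is not None and line.startswith("No changes."):
            result[current] = True
    return result


# Abbreviated excerpt of the plan output shown above.
sample = """\
━━━ PLAN: Base (arc-cbr-production) ━━━
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0a126b1613758a408]

No changes. Your infrastructure matches the configuration.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...

No changes. Your infrastructure matches the configuration.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...

No changes. Your infrastructure matches the configuration.
"""

for section, clean in summarize_plan(sample).items():
    print(f"{section}: {'no changes' if clean else 'changes or incomplete'}")
```

A section that is cut off before its verdict is reported as "changes or incomplete", which errs on the side of prompting a closer look.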

Alpine has rolled forward from util-linux=2.41.2-r0 to 2.41.4-r0, so the existing pin no longer resolves and `apk add` fails. Updating to the currently-available version unblocks `just deploy-base` on any cluster (the arc-staging-uw2 deploy hit this first).

deploy-base builds image-cache-janitor (and node-compactor) for both amd64 and arm64 via `docker build --platform linux/<arch>`. On a fresh amd64 GitHub Actions runner, the arm64 build fails with "exec format error" because binfmt_misc has no handler registered for arm64 binaries. Existing cluster deploys don't hit this: Harbor caches prior builds and the deploy script's skip-if-exists check short-circuits the rebuild. arc-staging-uw2 starts with an empty Harbor, so every image gets built fresh, surfacing the missing QEMU. docker/setup-qemu-action registers the binfmt handlers via tonistiigi/binfmt. The shared _osdc-deploy.yml will need the same fix the next time someone provisions a brand-new cluster.
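
The missing-handler condition described above can be checked before attempting a cross-arch build. This is a rough sketch, not something the deploy scripts do: it assumes the arm64 handler appears as an entry named `qemu-aarch64` under `/proc/sys/fs/binfmt_misc` (the name tonistiigi/binfmt is expected to register) and that the entry's first line reads `enabled`. The helper `has_binfmt_handler` is hypothetical.

```python
import tempfile
from pathlib import Path

# Default location of the kernel's binfmt_misc registry on Linux.
BINFMT_DIR = Path("/proc/sys/fs/binfmt_misc")


def has_binfmt_handler(name: str, binfmt_dir: Path = BINFMT_DIR) -> bool:
    """Return True if a binfmt_misc entry `name` exists and is enabled."""
    entry = binfmt_dir / name
    try:
        # The first line of an entry is "enabled" or "disabled".
        return entry.read_text().splitlines()[0].strip() == "enabled"
    except (OSError, IndexError):
        # Missing entry, unreadable file, or empty file: treat as absent.
        return False


# Demonstration against a temporary directory so it runs on any OS.
with tempfile.TemporaryDirectory() as tmp:
    fake_registry = Path(tmp)
    print(has_binfmt_handler("qemu-aarch64", fake_registry))  # False
    (fake_registry / "qemu-aarch64").write_text(
        "enabled\ninterpreter /usr/bin/qemu-aarch64\n"
    )
    print(has_binfmt_handler("qemu-aarch64", fake_registry))  # True
```

On a real runner the call would be `has_binfmt_handler("qemu-aarch64")`; a `False` result on an amd64 host suggests the arm64 build leg would hit the "exec format error" described above.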

Mirrors the QEMU setup from osdc-deploy-staging-uw2-base.yml into the reusable deploy workflow so existing prod/staging deploys don't break when the image-cache-janitor Dockerfile change in this PR forces every cluster's next deploy to rebuild from scratch (new content-addressed tag, empty Harbor cache for that tag). Holding this on the feasibility branch for now to verify it works before promoting to main.

Switches the deploy step from `just deploy-base` to `just deploy`, so the workflow runs base + every module in the cluster's modules: list (karpenter, arc, nodepools, arc-runners, monitoring, logging). Base is idempotent on re-run, so this is effectively the module phase after the previous base-only deploy. Assumes the pytorch-arc-staging GitHub App Secret is already planted into the arc-runners namespace (one-time manual step copying from the existing arc-staging cluster).

…ytorch#575)

## Summary

Two related fixes that surface when a brand-new OSDC cluster is deployed from CI:

- **`osdc/base/kubernetes/image-cache-janitor/docker/Dockerfile`**: bump `util-linux` pin from `2.41.2-r0` → `2.41.4-r0`. Alpine has rolled forward and the old pin no longer resolves; `apk add` fails.
- **`.github/workflows/_osdc-deploy.yml`**: add `docker/setup-qemu-action` before the deploy step so `docker build --platform linux/arm64` works on the amd64 GitHub Actions runner.

## Why both, why now

The janitor image's tag in the deploy script is content-addressed to the Dockerfile (`sha256(Dockerfile)[:12]`). Existing clusters' Harbors have the image cached against the *old* hash, so every deploy short-circuits the build via the skip-if-exists check at `osdc/base/kubernetes/image-cache-janitor/deploy.sh:80`. No build means no QEMU dependency, which is why this has worked silently for months.

The Dockerfile bump in this PR rotates the hash. After merge, the next deploy of every cluster will see a new tag, miss the Harbor cache, and fall through to the build branch, which includes an arm64 build leg that fails on amd64 runners without QEMU registered.

Bundling both fixes keeps the change reviewable as one cause-and-effect: pin rot forces a rebuild; rebuild forces QEMU.

## Test plan

- [x] `just lint` and `just test` on the change (covered locally)
- [x] arc-staging-uw2 PR (pytorch#574) exercises an identical QEMU step on the one-off bootstrap workflow and successfully built both arch variants of image-cache-janitor from scratch
- [ ] First post-merge deploy of any cluster via `_osdc-deploy.yml` should rebuild image-cache-janitor for both archs and complete cleanly

## Follow-up (not in this PR)

`python:3.12-alpine` is a mutable tag. The next Alpine release will rotate util-linux again. A later PR should pin the base by digest to make the build reproducible and stop this from recurring.
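
The interaction between the content-addressed tag and the skip-if-exists check can be illustrated with a small sketch. It is not the logic in `deploy.sh`; it assumes the tag is the first 12 hex characters of the Dockerfile's SHA-256, models Harbor as a plain set of tags, and uses two made-up one-line Dockerfiles. The names `content_tag` and `needs_build` are hypothetical.

```python
import hashlib


def content_tag(dockerfile: bytes, length: int = 12) -> str:
    """Tag derived from the Dockerfile content (assumed: hex SHA-256 prefix)."""
    return hashlib.sha256(dockerfile).hexdigest()[:length]


def needs_build(dockerfile: bytes, cached_tags: set[str]) -> bool:
    """Skip-if-exists: build only when the registry lacks the content tag."""
    return content_tag(dockerfile) not in cached_tags


# Hypothetical Dockerfile contents before and after the pin bump.
old_dockerfile = b"RUN apk add --no-cache util-linux=2.41.2-r0\n"
new_dockerfile = b"RUN apk add --no-cache util-linux=2.41.4-r0\n"

# Stand-in for an existing cluster's Harbor, which holds the old image.
harbor_tags = {content_tag(old_dockerfile)}

print(needs_build(old_dockerfile, harbor_tags))  # False: cache hit, no build
print(needs_build(new_dockerfile, harbor_tags))  # True: new tag, rebuild
```

Any edit to the Dockerfile, including a one-character version bump, changes the tag, so the first deploy after the change takes the build path on every cluster regardless of what its Harbor already holds.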

huydhn added 3 commits May 15, 2026 01:42

huydhn temporarily deployed to osdc-staging May 15, 2026 17:29 — with GitHub Actions Inactive

huydhn had a problem deploying to osdc-staging May 15, 2026 17:29 — with GitHub Actions Failure

huydhn had a problem deploying to osdc-staging May 15, 2026 17:48 — with GitHub Actions Failure

huydhn temporarily deployed to osdc-staging May 15, 2026 17:48 — with GitHub Actions Inactive

huydhn added 2 commits May 15, 2026 10:59

huydhn temporarily deployed to osdc-staging May 15, 2026 18:14 — with GitHub Actions Inactive

huydhn mentioned this pull request May 15, 2026

Fix cross-arch image-cache-janitor build for fresh cluster deploys #575

Merged

3 tasks

huydhn had a problem deploying to osdc-staging May 15, 2026 18:51 — with GitHub Actions Failure

huydhn temporarily deployed to osdc-staging May 15, 2026 18:51 — with GitHub Actions Inactive

huydhn wants to merge 7 commits into main from arc-staging-uw2-feasibility
