Add arc-staging-uw2 cluster for multi-region HA feasibility test #574

Draft
huydhn wants to merge 7 commits into main from arc-staging-uw2-feasibility

@huydhn huydhn commented May 15, 2026

No need to review

huydhn added 3 commits May 15, 2026 01:42
Adds a second staging cluster in us-west-2 that shares the same
runner_name_prefix and github_config_url as arc-staging. Purpose is to
verify GitHub accepts duplicate ARC scale-set names and routes jobs by
capacity — the load-bearing assumption for an eventual active/active
prod deployment in us-west-1.

Includes docs/prod-cluster-ha-us-west-1.md with the full Phase 0/Phase 1
plan, validation gates, and the H100 capacity-reservation override
proposal.

Runs `just bootstrap arc-staging-uw2` followed by `just deploy-base
arc-staging-uw2` from a manual workflow_dispatch trigger. Lets us bring
up the new staging cluster's base infra (VPC, EKS, Harbor, base k8s
resources) from CI rather than requiring a local Linux machine.

Module deploy is intentionally not included — modules need the
pytorch-arc-staging GitHub App secret planted into the arc-runners
namespace first, which is a manual kubectl step. Delete this workflow
once the feasibility test wraps up.

Adds pull_request trigger with path filters so the workflow runs
automatically when this workflow, clusters.yaml, or the plan doc
changes. bootstrap and deploy-base are idempotent, so the first PR
push provisions the cluster (~25min) and later pushes are quick
no-ops. Concurrency is scoped per-ref so two open PRs don't collide.

github-actions Bot commented May 15, 2026

tofu plan — arc-cbr-production

✅ Plan succeeded · commit a1f81ae3 · run log

Plan output
Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=8115d61b-1bc1-49ad-b5a3-e8f88fc50cb1]
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-node-role]
data.aws_availability_zones.available: Reading...
module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3]
module.eks.data.aws_caller_identity.current: Reading...
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0a126b1613758a408]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNMSO5RRNP]
data.aws_availability_zones.available: Read complete after 1s [id=us-east-2]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-eks-secrets]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936813000000004]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936734100000003]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260316204739334600000001]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role-20260308084936681500000001]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role-20260308084936685500000002]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936816800000005]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-009f1fe7d56695348]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-03eb66e57d13af64b]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-084ed6fc52db22c39]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0701693364b79c021]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-023207cd15e79c81a]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-06a70b2818e270ed8]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0610564f678f81c5f]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-0078fd5c0f6bc05eb]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0545d26e4a1d0ba89]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-04682fc890bfd4630]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0ce6f1dcb7208cad8]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-07ac52a1aa741f267]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3-20260308084938596600000006]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0d2591f24cba79e7b]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-04d9bba8d43569bbf]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-0aa6ea5c845170545]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-0f34cc1aafea8fd16]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-07e2274170282eb8c]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-086e3e66fe238d459]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0f623a6fa9d7bde45]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-000d05ecec7d4b66e]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0777285eddd2bacd1]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-00dacd13031b1f5de]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0ec9764e9015e972e]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-08ccb8cfe4bfa80d7]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production:kube-proxy]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-090bac79dddc5b77f]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production:pytorch-arc-cbr-production-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/70AA0C12C21E1A843313EF1BDE82D29A]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=2255203180]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production:coredns]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry-2026030809125509320000000c]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role-2026030809125522790000000d]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-ebs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change]
data.terraform_remote_state.base: Read complete after 0s
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-03b965bcc0c037434,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-04682fc890bfd4630"]: Refreshing state... [id=subnet-04682fc890bfd4630,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0545d26e4a1d0ba89"]: Refreshing state... [id=subnet-0545d26e4a1d0ba89,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0ce6f1dcb7208cad8"]: Refreshing state... [id=subnet-0ce6f1dcb7208cad8,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-karpenter-controller]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change-KarpenterScheduledChange]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller-20260308154648023000000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-053d2ed886d9ac92d]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role]
aws_security_group.efs: Refreshing state... [id=sg-099ef6309262a93fd]
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role]
aws_efs_mount_target.pypi_cache["subnet-0545d26e4a1d0ba89"]: Refreshing state... [id=fsmt-05b0a0d538bd49c8e]
aws_efs_mount_target.pypi_cache["subnet-04682fc890bfd4630"]: Refreshing state... [id=fsmt-0743bba60c50ed499]
aws_efs_mount_target.pypi_cache["subnet-0ce6f1dcb7208cad8"]: Refreshing state... [id=fsmt-01378a00a07852987]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role-20260403211352357700000001]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role-20260330040250456800000003]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role-20260403211352439500000002]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-efs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

Alpine has rolled forward from util-linux=2.41.2-r0 to 2.41.4-r0, so
the existing pin no longer resolves and `apk add` fails. Updating to
the currently-available version unblocks `just deploy-base` on any
cluster (the arc-staging-uw2 deploy hit this first).
huydhn added 2 commits May 15, 2026 10:59
deploy-base builds image-cache-janitor (and node-compactor) for both
amd64 and arm64 via `docker build --platform linux/<arch>`. On a fresh
amd64 GitHub Actions runner, the arm64 build fails with "exec format
error" because binfmt_misc has no handler registered for arm64
binaries.

Existing cluster deploys don't hit this — Harbor caches prior builds
and the deploy script's skip-if-exists check short-circuits the
rebuild. arc-staging-uw2 starts with an empty Harbor, so every image
gets built fresh, surfacing the missing QEMU.

docker/setup-qemu-action registers the binfmt handlers via
tonistiigi/binfmt. The shared _osdc-deploy.yml will need the same fix
the next time someone provisions a brand-new cluster.

Mirrors the QEMU setup from osdc-deploy-staging-uw2-base.yml into the
reusable deploy workflow so existing prod/staging deploys don't break
when the image-cache-janitor Dockerfile change in this PR forces every
cluster's next deploy to rebuild from scratch (new content-addressed
tag, empty Harbor cache for that tag).

Holding this on the feasibility branch for now to verify it works
before promoting to main.

Switches the deploy step from `just deploy-base` to `just deploy`, so
the workflow runs base + every module in the cluster's modules: list
(karpenter, arc, nodepools, arc-runners, monitoring, logging). Base is
idempotent on re-run, so this is effectively the module phase after
the previous base-only deploy.

Assumes the pytorch-arc-staging GitHub App Secret is already planted
into the arc-runners namespace (one-time manual step copying from the
existing arc-staging cluster).
huydhn temporarily deployed to osdc-staging May 15, 2026 18:51 with GitHub Actions (Inactive)
huydhn added a commit to huydhn/pytorch-ci-infra that referenced this pull request May 18, 2026
…ytorch#575)

## Summary

Two related fixes that surface when a brand-new OSDC cluster is deployed
from CI:

- **`osdc/base/kubernetes/image-cache-janitor/docker/Dockerfile`** —
  bump `util-linux` pin from `2.41.2-r0` → `2.41.4-r0`. Alpine has
  rolled forward and the old pin no longer resolves; `apk add` fails.
- **`.github/workflows/_osdc-deploy.yml`** — add
  `docker/setup-qemu-action` before the deploy step so
  `docker build --platform linux/arm64` works on the amd64 GitHub
  Actions runner.
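
The QEMU dependency described in the second bullet can be checked ahead of a build. The following is a minimal preflight sketch, not part of this PR: `missing_emulators` is a hypothetical helper, and the `qemu-<arch>` handler names under `/proc/sys/fs/binfmt_misc` are an assumption based on what tonistiigi/binfmt typically registers.

```python
import platform
from pathlib import Path

# Docker arch name -> binfmt_misc handler file name.
# ASSUMPTION: qemu-<arch> naming, as typically registered by tonistiigi/binfmt.
QEMU_HANDLER = {"amd64": "qemu-x86_64", "arm64": "qemu-aarch64"}

# platform.machine() value -> Docker arch name.
MACHINE_TO_DOCKER = {
    "x86_64": "amd64",
    "AMD64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def missing_emulators(targets, host_machine=None,
                      binfmt_dir="/proc/sys/fs/binfmt_misc"):
    """Return target arches that are foreign to the host and lack a handler."""
    host = MACHINE_TO_DOCKER.get(host_machine or platform.machine())
    missing = []
    for arch in targets:
        if arch == host:
            continue  # native build, no emulation needed
        if not (Path(binfmt_dir) / QEMU_HANDLER[arch]).exists():
            missing.append(arch)
    return missing


if __name__ == "__main__":
    gaps = missing_emulators(["amd64", "arm64"])
    print("missing emulation for:", gaps or "none")
```

On a fresh amd64 runner without QEMU set up, a check like this would be expected to report `arm64`, which corresponds to the "exec format error" seen during the arm64 build leg.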

## Why both, why now

The janitor image's tag in the deploy script is content-addressed to the
Dockerfile (`sha256(Dockerfile)[:12]`). Existing clusters' Harbors have
the image cached against the *old* hash, so every deploy short-circuits
the build via the skip-if-exists check at
`osdc/base/kubernetes/image-cache-janitor/deploy.sh:80`. No build means
no QEMU dependency, which is why this has worked silently for months.

The Dockerfile bump in this PR rotates the hash. After merge, the next
deploy of every cluster will see a new tag, miss the Harbor cache, and
fall through to the build branch — which includes an arm64 build leg
that fails on amd64 runners without QEMU registered.

Bundling both fixes keeps the change reviewable as one
cause-and-effect: pin rot forces a rebuild; rebuild forces QEMU.
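
The cause-and-effect above can be sketched in a few lines. This is an illustration only: the real logic lives in `deploy.sh`, the function names are hypothetical, and the two-line Dockerfile is a stand-in rather than the actual file.

```python
import hashlib


def content_tag(dockerfile_text: str, length: int = 12) -> str:
    """First `length` hex chars of sha256 over the Dockerfile contents."""
    return hashlib.sha256(dockerfile_text.encode("utf-8")).hexdigest()[:length]


def needs_build(tag: str, cached_tags: set) -> bool:
    """Skip-if-exists: build only when the registry has no image for this tag."""
    return tag not in cached_tags


# Stand-in Dockerfile contents (hypothetical), before and after the pin bump.
OLD = "FROM python:3.12-alpine\nRUN apk add --no-cache util-linux=2.41.2-r0\n"
NEW = OLD.replace("2.41.2-r0", "2.41.4-r0")

harbor = {content_tag(OLD)}  # existing clusters cached the image under the old hash

print(needs_build(content_tag(OLD), harbor))  # False: cache hit, no build, no QEMU needed
print(needs_build(content_tag(NEW), harbor))  # True: new tag misses the cache, build runs
```

Any edit to the Dockerfile, even a one-character pin bump, yields a different tag, so every cluster rebuilds on its next deploy.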

## Test plan

- [x] `just lint` and `just test` on the change (covered locally)
- [x] arc-staging-uw2 PR (pytorch#574) exercises an identical QEMU step on the
      one-off bootstrap workflow and successfully built both arch
      variants of image-cache-janitor from scratch
- [ ] First post-merge deploy of any cluster via `_osdc-deploy.yml`
      should rebuild image-cache-janitor for both archs and complete
      cleanly

## Follow-up (not in this PR)

`python:3.12-alpine` is a mutable tag. The next Alpine release will
rotate util-linux again. A later PR should pin the base by digest to
make the build reproducible and stop this from recurring.
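
One way such a follow-up could be guarded is a small lint that flags base images not pinned by digest. This is a sketch under assumptions: `unpinned_bases` is a hypothetical helper, and it treats every `FROM` reference as an external image, so a reference to an earlier build stage would also be flagged.

```python
import re

# A digest-pinned reference ends in @sha256:<64 hex chars>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_bases(dockerfile_text: str) -> list:
    """Return FROM image references that are not pinned by digest."""
    found = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "FROM":
            # Skip option flags such as --platform=... that may precede the image.
            image = next((p for p in parts[1:] if not p.startswith("--")), None)
            if image and image.lower() != "scratch" and not DIGEST_RE.search(image):
                found.append(image)
    return found


print(unpinned_bases("FROM python:3.12-alpine\nRUN true\n"))  # ['python:3.12-alpine']
```

A reference of the form `python:3.12-alpine@sha256:<digest>` would pass the check, since the digest fixes the image contents regardless of where the tag later points.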