Add arc-staging-uw2 cluster for multi-region HA feasibility test #574
Draft
huydhn wants to merge 7 commits into
Adds a second staging cluster in us-west-2 that shares the same runner_name_prefix and github_config_url as arc-staging. Purpose is to verify GitHub accepts duplicate ARC scale-set names and routes jobs by capacity — the load-bearing assumption for an eventual active/active prod deployment in us-west-1. Includes docs/prod-cluster-ha-us-west-1.md with the full Phase 0/Phase 1 plan, validation gates, and the H100 capacity-reservation override proposal.
Runs `just bootstrap arc-staging-uw2` followed by `just deploy-base arc-staging-uw2` from a manual workflow_dispatch trigger. Lets us bring up the new staging cluster's base infra (VPC, EKS, Harbor, base k8s resources) from CI rather than requiring a local Linux machine. Module deploy is intentionally not included — modules need the pytorch-arc-staging GitHub App secret planted into the arc-runners namespace first, which is a manual kubectl step. Delete this workflow once the feasibility test wraps up.
Adds pull_request trigger with path filters so the workflow runs automatically when this workflow, clusters.yaml, or the plan doc changes. bootstrap and deploy-base are idempotent, so the first PR push provisions the cluster (~25min) and later pushes are quick no-ops. Concurrency is scoped per-ref so two open PRs don't collide.
tofu plan (arc-cbr-production): ✅ Plan succeeded
Alpine has rolled forward from util-linux=2.41.2-r0 to 2.41.4-r0, so the existing pin no longer resolves and `apk add` fails. Updating to the currently-available version unblocks `just deploy-base` on any cluster (the arc-staging-uw2 deploy hit this first).
deploy-base builds image-cache-janitor (and node-compactor) for both amd64 and arm64 via `docker build --platform linux/<arch>`. On a fresh amd64 GitHub Actions runner, the arm64 build fails with "exec format error" because binfmt_misc has no handler registered for arm64 binaries. Existing cluster deploys don't hit this because Harbor caches prior builds and the deploy script's skip-if-exists check short-circuits the rebuild. arc-staging-uw2 starts with an empty Harbor, so every image gets built fresh, surfacing the missing QEMU. docker/setup-qemu-action registers the binfmt handlers via tonistiigi/binfmt. The shared _osdc-deploy.yml will need the same fix the next time someone provisions a brand-new cluster.
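A minimal Python sketch of why only a fresh cluster trips over the missing QEMU handlers. It models the skip-if-exists check as a set lookup. The function names and the `TARGET_ARCHES` constant are hypothetical and are not taken from the deploy script.

```python
# Hypothetical model of the deploy script's behavior; names are illustrative.
TARGET_ARCHES = ("amd64", "arm64")


def builds_needed(cached_arches: set[str]) -> list[str]:
    """Arches that still need a build after the skip-if-exists check."""
    return [arch for arch in TARGET_ARCHES if arch not in cached_arches]


def needs_emulation(runner_arch: str, cached_arches: set[str]) -> bool:
    """True when at least one pending build targets a non-native arch."""
    return any(arch != runner_arch for arch in builds_needed(cached_arches))


# Existing cluster: Harbor already holds both variants, so nothing is built.
existing_cluster = needs_emulation("amd64", {"amd64", "arm64"})

# Fresh cluster: empty Harbor, so the arm64 leg runs on an amd64 runner.
fresh_cluster = needs_emulation("amd64", set())
```

Under these assumptions `existing_cluster` is false and `fresh_cluster` is true, which matches the observation that only arc-staging-uw2 hit the "exec format error".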
Mirrors the QEMU setup from osdc-deploy-staging-uw2-base.yml into the reusable deploy workflow so existing prod/staging deploys don't break when the image-cache-janitor Dockerfile change in this PR forces every cluster's next deploy to rebuild from scratch (new content-addressed tag, empty Harbor cache for that tag). Holding this on the feasibility branch for now to verify it works before promoting to main.
Switches the deploy step from `just deploy-base` to `just deploy`, so the workflow runs base + every module in the cluster's modules: list (karpenter, arc, nodepools, arc-runners, monitoring, logging). Base is idempotent on re-run, so this is effectively the module phase after the previous base-only deploy. Assumes the pytorch-arc-staging GitHub App Secret is already planted into the arc-runners namespace (one-time manual step copying from the existing arc-staging cluster).
huydhn added a commit to huydhn/pytorch-ci-infra that referenced this pull request on May 18, 2026
…ytorch#575)

## Summary

Two related fixes that surface when a brand-new OSDC cluster is deployed from CI:

- **`osdc/base/kubernetes/image-cache-janitor/docker/Dockerfile`**: bump `util-linux` pin from `2.41.2-r0` → `2.41.4-r0`. Alpine has rolled forward and the old pin no longer resolves; `apk add` fails.
- **`.github/workflows/_osdc-deploy.yml`**: add `docker/setup-qemu-action` before the deploy step so `docker build --platform linux/arm64` works on the amd64 GitHub Actions runner.

## Why both, why now

The janitor image's tag in the deploy script is content-addressed to the Dockerfile (`sha256(Dockerfile)[:12]`). Existing clusters' Harbors have the image cached against the *old* hash, so every deploy short-circuits the build via the skip-if-exists check at `osdc/base/kubernetes/image-cache-janitor/deploy.sh:80`. No build means no QEMU dependency, which is why this has worked silently for months.

The Dockerfile bump in this PR rotates the hash. After merge, the next deploy of every cluster will see a new tag, miss the Harbor cache, and fall through to the build branch, which includes an arm64 build leg that fails on amd64 runners without QEMU registered.

Bundling both fixes keeps the change reviewable as one cause-and-effect: pin rot forces a rebuild; rebuild forces QEMU.

## Test plan

- [x] `just lint` and `just test` on the change (covered locally)
- [x] arc-staging-uw2 PR (pytorch#574) exercises an identical QEMU step on the one-off bootstrap workflow and successfully built both arch variants of image-cache-janitor from scratch
- [ ] First post-merge deploy of any cluster via `_osdc-deploy.yml` should rebuild image-cache-janitor for both archs and complete cleanly

## Follow-up (not in this PR)

`python:3.12-alpine` is a mutable tag. The next Alpine release will rotate util-linux again. A later PR should pin the base by digest to make the build reproducible and stop this from recurring.
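The tag rotation described under "Why both, why now" can be sketched in Python. This is an illustration only: it assumes the tag is the first 12 hex characters of the SHA-256 of the raw Dockerfile contents, and the two-line Dockerfile text, `image_tag`, and `must_build` are hypothetical stand-ins for the real file and for the logic in `deploy.sh`.

```python
import hashlib


def image_tag(dockerfile_text: str) -> str:
    """First 12 hex characters of the SHA-256 of the Dockerfile contents."""
    digest = hashlib.sha256(dockerfile_text.encode("utf-8")).hexdigest()
    return digest[:12]


def must_build(tag: str, cached_tags: set[str]) -> bool:
    """Skip-if-exists: build only when the tag is absent from the registry."""
    return tag not in cached_tags


# Illustrative Dockerfile text, not the real file.
old_dockerfile = "FROM python:3.12-alpine\nRUN apk add util-linux=2.41.2-r0\n"
new_dockerfile = old_dockerfile.replace("2.41.2-r0", "2.41.4-r0")

old_tag = image_tag(old_dockerfile)
new_tag = image_tag(new_dockerfile)

# An existing cluster's Harbor holds only the image built from the old file.
harbor_tags = {old_tag}

rebuild_before_bump = must_build(old_tag, harbor_tags)
rebuild_after_bump = must_build(new_tag, harbor_tags)
```

With these inputs the pin bump changes the tag, so `rebuild_before_bump` is false and `rebuild_after_bump` is true: the cache hit that used to hide the arm64 build disappears as soon as the Dockerfile changes.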
No need to review