[release/v26.1.x] Parallelize acceptance and integration tests (#1407) #1416

Merged
andrewstucki merged 11 commits into release/v26.1.x from backport/release/v26.1.x/pr-1407
Apr 7, 2026

@github-actions github-actions bot commented Apr 7, 2026

Backport

This will backport the following commits from main to release/v26.1.x:

Questions?

Please refer to the Backport tool documentation

## Summary

- Parallelize acceptance and integration tests, reducing CI wall time from 90+ minutes to ~22 minutes
- Add test infrastructure for parallel execution: file-based locking, per-test namespaces, shared operator install, image import caching
- Fix flaky tests exposed by parallel execution
- Add debugging tooling: diagnostics on failure, feature-name namespaces, timing markers
- Speed up CI builds by only compiling for the host architecture and parallelizing Docker image pulls

## Details

### Acceptance test parallelization

Refactored the harpoon test framework to run BDD features in parallel. Features are partitioned by a `@serial` tag: features without it run concurrently via `t.Run` + `t.Parallel()`, while `@serial` features (decommissioning, helm-chart) run sequentially afterward because they perform k3d node operations that would disrupt concurrently running features.

- Moved the Redpanda operator from per-feature helm install to a single shared instance installed during `BeforeSuite`
- Features that need their own operator (upgrade tests) run in isolated vclusters (`@vcluster` tag)
- Setup and teardown are separated so the cluster isn't torn down between parallel and serial phases
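
The two-phase scheduling described above can be sketched roughly as follows. This is an illustration only, not the actual harpoon code: the `Feature` type, `partition`, and `runFeatures` are hypothetical names. It relies on the documented behavior that `t.Run` does not return until its parallel subtests have completed, which is what lets the serial phase start only after the parallel phase is done.

```go
package main

import "testing"

// Feature is a hypothetical stand-in for a parsed BDD feature.
type Feature struct {
	Name string
	Tags []string
	Run  func(t *testing.T)
}

// hasTag reports whether the feature carries the given tag.
func hasTag(f Feature, tag string) bool {
	for _, t := range f.Tags {
		if t == tag {
			return true
		}
	}
	return false
}

// partition splits features into those that may run concurrently and
// those tagged @serial, preserving the original order within each group.
func partition(features []Feature) (parallel, serial []Feature) {
	for _, f := range features {
		if hasTag(f, "@serial") {
			serial = append(serial, f)
		} else {
			parallel = append(parallel, f)
		}
	}
	return parallel, serial
}

// runFeatures runs the parallel group first, then the serial group.
func runFeatures(t *testing.T, features []Feature) {
	parallel, serial := partition(features)

	// The wrapping t.Run returns only after every parallel subtest
	// inside it has finished.
	t.Run("parallel", func(t *testing.T) {
		for _, f := range parallel {
			f := f // capture a per-iteration copy for older Go versions
			t.Run(f.Name, func(t *testing.T) {
				t.Parallel()
				f.Run(t)
			})
		}
	})

	// @serial features run one at a time once the parallel phase is over.
	for _, f := range serial {
		t.Run(f.Name, f.Run)
	}
}
```

Grouping the parallel subtests under one parent subtest avoids any explicit synchronization between the two phases.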

### Integration test parallelization

- Refactored `RedpandaControllerSuite` from a testify suite with shared mutable state to parallel subtests with per-test namespaces
- Added `WatchAllNamespaces` option to `testenv.Env` so a single controller manager serves all test namespaces
- Parallelized `charts/redpanda` integration test subtests
- Parallelized `TestLicense` subtests across 4 image variants
- Removed `-p=1` gate on integration tests (now uses Go's default parallelism)
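
As a rough illustration of per-test namespaces, each parallel subtest could derive its own namespace name from the test name plus a random suffix. The helper `namespaceFor` and its naming scheme are assumptions made for this sketch, not the code in this PR. It keeps the result within the 63-character limit for Kubernetes namespace names.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"regexp"
	"strings"
)

// invalidChars matches runs of characters that are not allowed in a
// DNS-1123 label (lowercase alphanumerics and '-').
var invalidChars = regexp.MustCompile(`[^a-z0-9-]+`)

// namespaceFor derives a namespace name of the form
// "test-<sanitized test name>-<6 hex chars>", at most 63 characters long.
// (Hypothetical helper for illustration.)
func namespaceFor(testName string) (string, error) {
	suffix := make([]byte, 3) // 3 random bytes -> 6 hex characters
	if _, err := rand.Read(suffix); err != nil {
		return "", err
	}

	base := invalidChars.ReplaceAllString(strings.ToLower(testName), "-")
	base = strings.Trim(base, "-")

	// Leave room for the "test-" prefix, one '-', and the 6-char suffix.
	const maxBase = 63 - len("test-") - 1 - 6
	if len(base) > maxBase {
		base = strings.TrimRight(base[:maxBase], "-")
	}
	return "test-" + base + "-" + hex.EncodeToString(suffix), nil
}
```

A subtest would call `namespaceFor(t.Name())` before creating its resources, so no two subtests share mutable cluster state.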

### CI build speedups

- Only compile `linux/amd64` binaries in CI (was building all 4 os/arch combos)
- Parallelize Docker image pre-pulls (~25 images pulled concurrently instead of sequentially)
- Pre-pull k3d infrastructure images (`rancher/k3s`, `k3d-tools`, `k3d-proxy`)
- Batch `k3d image import` calls (one command with all images instead of N separate calls)
- Cache imported images with marker files to avoid redundant imports across parallel test packages

### Cross-process coordination

- Added file-based locking (`flock`) around k3d cluster creation and image imports to prevent conflicts when multiple test packages run simultaneously
- Made CRD installation idempotent (tolerates "already exists" errors)
- Made cert-manager installation idempotent in `helmtest.Setup`
- Added cert-manager webhook readiness wait after k3s manifest installation

### Flaky test fixes

- Fixed `require.FailNow` inside `assert.Eventually` goroutine causing panics in `rpk.go`
- Added namespace scoping to Redpanda list queries to avoid cross-feature interference
- Added `waitForStatefulSetReady` after cluster availability checks (looks up StatefulSet by label, not name)
- Bumped k3d cluster creation timeout from 3 to 5 minutes
- Bumped vcluster creation timeout from 3 to 5 minutes
- Bumped PVC unbinder test timeouts
- Added retry for schema registry readiness in `FactoryOperatorV1`
- Overrode upgrade test operator images to use Docker Hub (pre-loaded into k3d) instead of `docker.redpanda.com`
- Added `--rerun-fails` to gotestsum for automatic retry of flaky tests
- Disabled k3s servicelb (unused in tests)

### Debugging improvements

- Feature namespaces now include the feature name (e.g. `test-basic-cluster-tests-abc123`)
- Added `=== FEATURE START/END/FAILED ===` log markers with timing
- Added `DumpDiagnostics` on feature failure: pod statuses, events, resource descriptions, pod logs
- Diagnostics written to `ACCEPTANCE_ARTIFACTS_DIR` when set (uploaded as CI artifacts)
- Added `dumpDiagnostics` to `testenv.Env` for integration test failures
- Added `DumpContainerLogsOnFailure` helper for testcontainer-based tests
- Added vcluster creation failure diagnostics (pod state + events from host namespace)
- Added `[TestName]` prefix to `waitFor` log messages for traceability in parallel output
- CI test output uses `testname` format (per-test lines with timing, verbose on failure)

### Pre-loaded images

Ensured all Docker images used by tests are pre-pulled and imported into k3d clusters to avoid in-test pulls:
- Added operator images for upgrade tests (`v25.1.3`, `v25.2.2`, `v25.3.1`)
- Added Redpanda images for license tests (`v24.2.9`, `v24.3.1-rc4`)
- Added `redpanda-nightly`, `redpanda-operator-nightly` for topic controller and factory tests

(cherry picked from commit ca83446)

# Conflicts:
#	acceptance/steps/defaults.go
#	acceptance/steps/multicluster.go
#	charts/redpanda/testdata/template-cases.golden.txtar
#	ci/scripts/run-in-nix-docker.sh
#	harpoon/suite.go
#	operator/internal/controller/redpanda/redpanda_controller_test.go
#	pkg/vcluster/vcluster.go
@andrewstucki andrewstucki enabled auto-merge (squash) April 7, 2026 19:43
@andrewstucki andrewstucki merged commit 64dca84 into release/v26.1.x Apr 7, 2026
10 checks passed
@andrewstucki andrewstucki deleted the backport/release/v26.1.x/pr-1407 branch April 7, 2026 19:48