Commit 64dca84

[release/v26.1.x] Parallelize acceptance and integration tests (#1407) (#1416)
* Parallelize acceptance and integration tests (#1407)

## Summary

- Parallelize acceptance and integration tests, reducing CI wall time from ~90+ minutes to ~22 minutes
- Add test infrastructure for parallel execution: file-based locking, per-test namespaces, shared operator install, image import caching
- Fix flaky tests exposed by parallel execution
- Add debugging tooling: diagnostics on failure, feature-name namespaces, timing markers
- Speed up CI builds by only compiling for the host architecture and parallelizing Docker image pulls

## Details

### Acceptance test parallelization

Refactored the harpoon test framework to run BDD features in parallel. Features are partitioned by a `@serial` tag: features without it run concurrently via `t.Run` + `t.Parallel()`, while `@serial` features (decommissioning, helm-chart) run sequentially afterward since they perform k3d node operations.

- Moved the Redpanda operator from per-feature helm install to a single shared instance installed during `BeforeSuite`
- Features that need their own operator (upgrade tests) run in isolated vclusters (`@vcluster` tag)
- Setup and teardown are separated so the cluster isn't torn down between parallel and serial phases

### Integration test parallelization

- Refactored `RedpandaControllerSuite` from a testify suite with shared mutable state to parallel subtests with per-test namespaces
- Added `WatchAllNamespaces` option to `testenv.Env` so a single controller manager serves all test namespaces
- Parallelized `charts/redpanda` integration test subtests
- Parallelized `TestLicense` subtests across 4 image variants
- Removed `-p=1` gate on integration tests (now uses Go's default parallelism)

### CI build speedups

- Only compile `linux/amd64` binaries in CI (was building all 4 os/arch combos)
- Parallelize Docker image pre-pulls (~25 images pulled concurrently instead of sequentially)
- Pre-pull k3d infrastructure images (`rancher/k3s`, `k3d-tools`, `k3d-proxy`)
- Batch `k3d image import` calls (one command with all images instead of N separate calls)
- Cache imported images with marker files to avoid redundant imports across parallel test packages

### Cross-process coordination

- Added file-based locking (`flock`) around k3d cluster creation and image imports to prevent conflicts when multiple test packages run simultaneously
- Made CRD installation idempotent (tolerates "already exists" errors)
- Made cert-manager installation idempotent in `helmtest.Setup`
- Added cert-manager webhook readiness wait after k3s manifest installation

### Flaky test fixes

- Fixed `require.FailNow` inside `assert.Eventually` goroutine causing panics in `rpk.go`
- Added namespace scoping to Redpanda list queries to avoid cross-feature interference
- Added `waitForStatefulSetReady` after cluster availability checks (looks up StatefulSet by label, not name)
- Bumped k3d cluster creation timeout from 3 to 5 minutes
- Bumped vcluster creation timeout from 3 to 5 minutes
- Bumped PVC unbinder test timeouts
- Added retry for schema registry readiness in `FactoryOperatorV1`
- Overrode upgrade test operator images to use Docker Hub (pre-loaded into k3d) instead of `docker.redpanda.com`
- Added `--rerun-fails` to gotestsum for automatic retry of flaky tests
- Disabled k3s servicelb (unused in tests)

### Debugging improvements

- Feature namespaces now include the feature name (e.g. `test-basic-cluster-tests-abc123`)
- Added `=== FEATURE START/END/FAILED ===` log markers with timing
- Added `DumpDiagnostics` on feature failure: pod statuses, events, resource descriptions, pod logs
- Diagnostics written to `ACCEPTANCE_ARTIFACTS_DIR` when set (uploaded as CI artifacts)
- Added `dumpDiagnostics` to `testenv.Env` for integration test failures
- Added `DumpContainerLogsOnFailure` helper for testcontainer-based tests
- Added vcluster creation failure diagnostics (pod state + events from host namespace)
- Added `[TestName]` prefix to `waitFor` log messages for traceability in parallel output
- CI test output uses `testname` format (per-test lines with timing, verbose on failure)

### Pre-loaded images

Ensured all Docker images used by tests are pre-pulled and imported into k3d clusters to avoid in-test pulls:

- Added operator images for upgrade tests (`v25.1.3`, `v25.2.2`, `v25.3.1`)
- Added Redpanda images for license tests (`v24.2.9`, `v24.3.1-rc4`)
- Added `redpanda-nightly`, `redpanda-operator-nightly` for topic controller and factory tests

(cherry picked from commit ca83446)

# Conflicts:
# acceptance/steps/defaults.go
# acceptance/steps/multicluster.go
# charts/redpanda/testdata/template-cases.golden.txtar
# ci/scripts/run-in-nix-docker.sh
# harpoon/suite.go
# operator/internal/controller/redpanda/redpanda_controller_test.go
# pkg/vcluster/vcluster.go

* Fix up merge conflicts and remove non-backported multicluster acceptance test
* fix bad merge
* fix testing t usage
* fix test expectation
* try to fix up backport some more
* update to 26.1.2 for test runs
* regen golden files
* dump logs on cluster config failure
* fix configuration test by backport
* license constant swap

---------

Co-authored-by: Andrew Stucki <andrew.stucki@redpanda.com>
1 parent 1d57ad1 commit 64dca84

44 files changed, +1838 -720 lines changed

.buildkite/testsuite.yml

Lines changed: 10 additions & 0 deletions
@@ -28,6 +28,8 @@ steps:
 queue: k8s-m6id12xlarge
 command: ./ci/scripts/run-in-nix-docker.sh task ci:configure ci:test:unit
 env:
+BUILD_GOARCH: amd64
+BUILD_GOOS: linux
 LOG_LEVEL: trace
 OTLP_DIR: /work/artifacts
 OTLP_METRIC_INTERVAL: 5s
@@ -73,6 +75,8 @@ steps:
 queue: k8s-m6id12xlarge
 command: ./ci/scripts/run-in-nix-docker.sh task ci:configure ci:test:integration
 env:
+BUILD_GOARCH: amd64
+BUILD_GOOS: linux
 LOG_LEVEL: trace
 OTLP_DIR: /work/artifacts
 OTLP_METRIC_INTERVAL: 5s
@@ -121,6 +125,8 @@ steps:
 queue: k8s-m6id12xlarge
 command: ./ci/scripts/run-in-nix-docker.sh task ci:configure ci:test:acceptance
 env:
+BUILD_GOARCH: amd64
+BUILD_GOOS: linux
 LOG_LEVEL: trace
 OTLP_DIR: /work/artifacts
 OTLP_METRIC_INTERVAL: 5s
@@ -169,6 +175,8 @@ steps:
 queue: k8s-m6id12xlarge
 command: ./ci/scripts/run-in-nix-docker.sh task ci:configure ci:test:kuttl-v1
 env:
+BUILD_GOARCH: amd64
+BUILD_GOOS: linux
 LOG_LEVEL: trace
 OTLP_DIR: /work/artifacts
 OTLP_METRIC_INTERVAL: 5s
@@ -218,6 +226,8 @@ steps:
 queue: k8s-m6id12xlarge
 command: ./ci/scripts/run-in-nix-docker.sh task ci:configure ci:test:kuttl-v1-nodepools
 env:
+BUILD_GOARCH: amd64
+BUILD_GOOS: linux
 LOG_LEVEL: trace
 OTLP_DIR: /work/artifacts
 OTLP_METRIC_INTERVAL: 5s

Taskfile.yml

Lines changed: 34 additions & 4 deletions
@@ -264,8 +264,16 @@ tasks:
 _PKG:
 sh: go work edit -json | jq -j '.Use.[].DiskPath + "/... "'
 PKG: '{{ .PKG | default ._PKG }}'
+# When using gotestsum with --packages, packages must be a single
+# quoted arg and -- separates gotestsum flags from go test args.
+_USE_PACKAGES: '{{if contains "--packages" .GO_TEST_RUNNER}}true{{end}}'
 cmds:
-- '{{.GO_TEST_RUNNER}} {{.PKG}} {{.CLI_ARGS}}'
+- |
+{{- if ._USE_PACKAGES}}
+{{.GO_TEST_RUNNER}} "{{.PKG}}" -- {{.CLI_ARGS}}
+{{- else}}
+{{.GO_TEST_RUNNER}} {{.PKG}} {{.CLI_ARGS}}
+{{- end}}
 
 test:integration:
 desc: "Run all integration tests (~90m)"
@@ -284,7 +292,7 @@ tasks:
 vars:
 GO_TEST_RUNNER:
 ref: .GO_TEST_RUNNER
-CLI_ARGS: '{{.CLI_ARGS}} -p=1 -run {{.RUN}} -timeout 60m -tags integration'
+CLI_ARGS: '{{.CLI_ARGS}} -run {{.RUN}} -timeout 60m -tags integration'
 
 test:acceptance:
 desc: "Run all acceptance tests (~90m)"
@@ -330,6 +338,10 @@ tasks:
 TEST_KUBE_VERSION: '{{ .TEST_KUBE_VERSION | default .DEFAULT_TEST_KUBE_VERSION }}'
 TEST_COREDNS_VERSION: '{{ .TEST_COREDNS_VERSION | default .DEFAULT_TEST_COREDNS_VERSION }}'
 IMAGES:
+# k3d infrastructure images — pre-pulling avoids slow pulls during cluster creation.
+- rancher/k3s:v1.32.13-k3s1
+- ghcr.io/k3d-io/k3d-tools:5.8.3
+- ghcr.io/k3d-io/k3d-proxy:5.8.3
 - quay.io/jetstack/cert-manager-controller:{{.TEST_CERTMANAGER_VERSION}}
 - quay.io/jetstack/cert-manager-cainjector:{{.TEST_CERTMANAGER_VERSION}}
 - quay.io/jetstack/cert-manager-startupapicheck:{{.TEST_CERTMANAGER_VERSION}}
@@ -339,20 +351,38 @@ tasks:
 - quay.io/jetstack/cert-manager-webhook:{{.SECOND_TEST_CERTMANAGER_VERSION}}
 - '{{.TEST_REDPANDA_REPO}}:{{.TEST_REDPANDA_VERSION}}'
 - '{{.DEFAULT_TEST_UPGRADE_REDPANDA_REPO}}:{{.TEST_UPGRADE_REDPANDA_VERSION}}'
+- redpandadata/redpanda-operator:v25.1.3
+- redpandadata/redpanda-operator:v25.2.2
 - redpandadata/redpanda-operator:v25.3.1
 - redpandadata/redpanda-operator:{{.TEST_UPGRADE_OPERATOR_VERSION}}
 - ghcr.io/loft-sh/vcluster-pro:{{.TEST_VCLUSTER_VERSION}}
 - registry.k8s.io/kube-controller-manager:{{.TEST_KUBE_VERSION}}
 - registry.k8s.io/kube-apiserver:{{.TEST_KUBE_VERSION}}
 - coredns/coredns:{{.TEST_COREDNS_VERSION}}
+- redpandadata/redpanda-unstable:v24.3.1-rc4
 - redpandadata/redpanda-unstable:v24.3.1-rc8
+- redpandadata/redpanda-unstable:v25.2.1-rc7
 - redpandadata/redpanda-unstable:v25.3.1-rc2
+- redpandadata/redpanda-unstable:v25.3.1-rc4
 - redpandadata/redpanda-unstable:v26.1.1-rc5
+- redpandadata/redpanda-nightly:v0.0.0-20260330git0d4187b
+- redpandadata/redpanda-operator-nightly:v0.0.0-20250129gita89e202
+- redpandadata/redpanda:v23.2.8
+- redpandadata/redpanda:v24.2.9
+- redpandadata/redpanda:v25.1.1
+- redpandadata/redpanda:v25.2.1
+- redpandadata/redpanda:v25.2.11
 - redpandadata/redpanda:v26.1.1
+- redpandadata/redpanda:v26.1.2
 
 cmds:
-- for: {var: IMAGES}
-cmd: docker inspect {{.ITEM}} > /dev/null || docker pull {{.ITEM}}
+- |
+pids=""
+{{range .IMAGES}}
+(docker inspect "{{.}}" > /dev/null 2>&1 || docker pull -q "{{.}}") &
+pids="$pids $!"
+{{end}}
+for pid in $pids; do wait "$pid" || true; done
 
 pending-prs:
 desc: "Get all pending PRs for watched branches"

acceptance/features/cluster.feature

Lines changed: 1 addition & 30 deletions
@@ -3,7 +3,7 @@
 Feature: Basic cluster tests
 @skip:gke @skip:aks @skip:eks
 Scenario: Updating admin ports
-# replaces e2e-v2 "upgrade-values-check"
+# replaces e2e-v2 "upgrade-values-check"
 Given I apply Kubernetes manifest:
 """
 ---
@@ -44,32 +44,3 @@ Feature: Basic cluster tests
 Then cluster "upgrade" is stable with 1 nodes
 And service "upgrade-external" should have named port "admin-default" with value 9640
 And rpk is configured correctly in "upgrade" cluster
-
-
-@skip:gke @skip:aks @skip:eks
-Scenario: Rack Awareness
-Given I apply Kubernetes manifest:
-# NB: You wouldn't actually use kubernetes.io/os for the value of rack,
-# it's just a value that we know is both present and deterministic for the
-# purpose of testing.
-"""
----
-apiVersion: cluster.redpanda.com/v1alpha2
-kind: Redpanda
-metadata:
-name: rack-awareness
-spec:
-clusterSpec:
-console:
-enabled: false
-statefulset:
-replicas: 1
-rackAwareness:
-enabled: true
-nodeAnnotation: 'kubernetes.io/os'
-"""
-And cluster "rack-awareness" is stable with 1 nodes
-Then running `cat /etc/redpanda/redpanda.yaml | grep -o 'rack: .*$'` will output:
-"""
-rack: linux
-"""

acceptance/features/console-upgrades.feature

Lines changed: 9 additions & 1 deletion
@@ -1,9 +1,13 @@
-@operator:none
+@vcluster
 Feature: Upgrading the operator with Console installed
 @skip:gke @skip:aks @skip:eks
 Scenario: Console v2 to v3 no warnings
 Given I helm install "redpanda-operator" "redpanda/operator" --version v25.1.3 with values:
 """
+image:
+repository: redpandadata/redpanda-operator
+crds:
+enabled: true
 """
 And I apply Kubernetes manifest:
 """
@@ -47,6 +51,10 @@ Feature: Upgrading the operator with Console installed
 Scenario: Console v2 to v3 with warnings
 Given I helm install "redpanda-operator" "redpanda/operator" --version v25.1.3 with values:
 """
+image:
+repository: redpandadata/redpanda-operator
+crds:
+enabled: true
 """
 And I apply Kubernetes manifest:
 """

acceptance/features/decommissioning.feature

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
+@serial
 Feature: Decommissioning brokers
 # note that this test requires both the decommissioner and pvc unbinder
-# run in order to pass
+# run in order to pass
 @skip:gke @skip:aks @skip:eks
 Scenario: Pruning brokers on failed nodes
 Given I create a basic cluster "decommissioning" with 3 nodes

acceptance/features/helm-chart.feature

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-@operator:none
+@serial
 Feature: Redpanda Helm Chart
 
 Scenario: Tolerating Node Failure

acceptance/features/multicluster.feature

Lines changed: 0 additions & 16 deletions
This file was deleted.

acceptance/features/operator-upgrades.feature

Lines changed: 3 additions & 1 deletion
@@ -1,9 +1,11 @@
-@operator:none @vcluster
+@vcluster
 Feature: Upgrading the operator
 @skip:gke @skip:aks @skip:eks
 Scenario: Operator upgrade from 25.2.2
 Given I helm install "redpanda-operator" "redpanda/operator" --version v25.2.2 with values:
 """
+image:
+repository: redpandadata/redpanda-operator
 crds:
 enabled: true
 """

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+Feature: Rack Awareness
+@skip:gke @skip:aks @skip:eks
+Scenario: Rack Awareness
+Given I apply Kubernetes manifest:
+# NB: You wouldn't actually use kubernetes.io/os for the value of rack,
+# it's just a value that we know is both present and deterministic for the
+# purpose of testing.
+"""
+---
+apiVersion: cluster.redpanda.com/v1alpha2
+kind: Redpanda
+metadata:
+name: rack-awareness
+spec:
+clusterSpec:
+console:
+enabled: false
+statefulset:
+replicas: 1
+rackAwareness:
+enabled: true
+nodeAnnotation: 'kubernetes.io/os'
+"""
+And cluster "rack-awareness" is stable with 1 nodes
+Then running `cat /etc/redpanda/redpanda.yaml | grep -o 'rack: .*$'` will output:
+"""
+rack: linux
+"""

Lines changed: 0 additions & 7 deletions
@@ -5,10 +5,3 @@ Feature: Scaling down broker nodes
 And cluster "scaledown" is stable with 5 nodes
 When I scale "scaledown" to 3 nodes
 Then cluster "scaledown" should be stable with 3 nodes
-
-@skip:gke @skip:aks @skip:eks
-Scenario: Scaling up nodes
-Given I create a basic cluster "scaleup" with 1 nodes
-And cluster "scaleup" is stable with 1 nodes
-When I scale "scaleup" to 3 nodes
-Then cluster "scaleup" should be stable with 3 nodes
