
testworkflowexecution_controller: Status.Generation never written — all TWE CRDs re-trigger on every operator restart #7473

@zelig81

Summary

Every existing TestWorkflowExecution CRD re-triggers on every operator restart, regardless of how old it is or whether it has already executed.

Root Cause

The testkube-operator's testworkflowexecution_controller contains a deduplication guard that checks:

```go
if testWorkflowExecution.Generation == testWorkflowExecution.Status.Generation {
	return ctrl.Result{}, nil
}
```

However, Status.Generation is never written after the execution completes, so it stays at 0 permanently. Meanwhile, Generation is always 1 (the Kubernetes-assigned value for a CRD that was created but never had its spec updated). The guard therefore evaluates 1 == 0, which is always false, and the early return never fires.

Result: When the operator pod restarts, the informer replays all existing TWE CRDs as synthetic ADD events. Since the guard never triggers, every CRD re-executes simultaneously.
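To make the failure mode concrete, here is a minimal in-memory sketch of the guard and the replay. The `twe` type and the `shouldSkip` and `replay` helpers are hypothetical stand-ins written for this issue, not the operator's actual types; only the comparison itself is taken from the controller.

```go
package main

import "fmt"

// twe is a hypothetical, stripped-down stand-in for a TestWorkflowExecution
// resource. Only the two generation fields matter for the guard.
type twe struct {
	Name             string
	Generation       int64 // metadata.generation, assigned by the API server
	StatusGeneration int64 // status.generation, never written, so it stays 0
}

// shouldSkip mirrors the comparison used by the guard in Reconcile.
func shouldSkip(e twe) bool {
	return e.Generation == e.StatusGeneration
}

// replay simulates the informer delivering every existing object as an ADD
// event after a restart and returns the names that would execute again.
func replay(existing []twe) []string {
	var retriggered []string
	for _, e := range existing {
		if shouldSkip(e) {
			continue
		}
		retriggered = append(retriggered, e.Name)
	}
	return retriggered
}

func main() {
	// Two already-completed executions: Generation is 1, status was never updated.
	existing := []twe{
		{Name: "twe-a", Generation: 1},
		{Name: "twe-b", Generation: 1},
	}
	fmt.Println(replay(existing)) // prints [twe-a twe-b]
}
```

Both objects come back from `replay`, which matches the observed behavior: nothing is ever skipped because the right-hand side of the comparison is always 0.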

Steps to Reproduce

  1. Create several TestWorkflowExecution CRDs and let them complete
  2. Restart the testkube-operator pod (e.g., via kubectl rollout restart, or a GKE rolling node pool upgrade that evicts the pod)
  3. Observe: all existing TWE CRDs execute again immediately after the operator comes back

Observed Behavior

  • 23 stale TWE CRDs re-executed simultaneously, within seconds of the operator pod starting
  • All executions had runningContext.interface.type = internal and actor.type = testworkflowexecution
  • Confirmed on operator v1.10.7 / API server 2.8.2

Expected Behavior

The deduplication guard at the top of Reconcile should prevent already-executed CRDs from running again. This requires writing Status.Generation = testWorkflowExecution.Generation into the CRD status after a successful (or terminal) execution.

Suggested Fix

In testworkflowexecution_controller.go, after the execution is triggered and reaches a terminal state, write the observed generation back to the status:

```go
testWorkflowExecution.Status.Generation = testWorkflowExecution.Generation
if err := r.Status().Update(ctx, testWorkflowExecution); err != nil {
	return ctrl.Result{}, err
}
```
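The effect of that status write can be checked with a small in-memory simulation. This is only a sketch: the `execution` type and `reconcile` function are hypothetical, the status write is a plain field assignment standing in for `r.Status().Update`, and it is done immediately after triggering rather than at a terminal state, to keep the example short.

```go
package main

import "fmt"

// execution is a hypothetical stand-in for a TestWorkflowExecution resource.
type execution struct {
	Generation       int64 // metadata.generation
	StatusGeneration int64 // status.generation
	Runs             int   // how many times the workflow was triggered
}

// reconcile is an in-memory approximation of Reconcile with the suggested
// status write added. It is not the operator's real code.
func reconcile(e *execution) {
	// Existing guard: skip objects whose current generation was already handled.
	if e.Generation == e.StatusGeneration {
		return
	}
	e.Runs++ // stands in for triggering the workflow execution

	// Suggested fix: record the generation that was just handled.
	e.StatusGeneration = e.Generation
}

func main() {
	e := &execution{Generation: 1}

	reconcile(e) // original ADD event: executes
	reconcile(e) // synthetic ADD replayed after an operator restart: skipped
	fmt.Println(e.Runs) // prints 1

	e.Generation++ // a spec update bumps metadata.generation
	reconcile(e)   // executes again, because the spec really changed
	fmt.Println(e.Runs) // prints 2
}
```

With the write in place, a replayed event is a no-op, while a genuine spec change still triggers a new run.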

Workaround

We changed the ArgoCD custom Lua resource action: instead of creating a TestWorkflowExecution CRD, it now creates a short-lived batch/v1 Job that POSTs directly to the Testkube API:

```
POST http://testkube-api-server..svc.cluster.local:8088/v1/test-workflows/{name}/executions
```

With ttlSecondsAfterFinished: 300, the Job auto-deletes and leaves no persistent CRD, so there is nothing for the operator to replay after a restart.
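For reference, the request such a Job sends could look roughly like the Go sketch below. Only the path `/v1/test-workflows/{name}/executions` comes from the workaround above; the `executionsURL` and `triggerExecution` helpers, the command-line arguments, and the empty JSON body `{}` are assumptions made for illustration, and the real API may expect more in the request body.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"
)

// executionsURL builds the endpoint for a workflow. baseURL is whatever
// address the Testkube API server is reachable on inside the cluster.
func executionsURL(baseURL, workflow string) string {
	return strings.TrimRight(baseURL, "/") +
		"/v1/test-workflows/" + url.PathEscape(workflow) + "/executions"
}

// triggerExecution POSTs to the executions endpoint and treats any non-2xx
// response as an error. The "{}" body is an assumption, not a documented payload.
func triggerExecution(client *http.Client, baseURL, workflow string) error {
	resp, err := client.Post(executionsURL(baseURL, workflow),
		"application/json", strings.NewReader("{}"))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical usage: trigger <api-base-url> <workflow-name>
	if len(os.Args) != 3 {
		fmt.Println("usage: trigger <api-base-url> <workflow-name>")
		return
	}
	if err := triggerExecution(http.DefaultClient, os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, "trigger failed:", err)
		os.Exit(1)
	}
}
```

Because the request goes straight to the API, no TestWorkflowExecution object is created, so the reconcile guard is never involved.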

Environment

  • Operator: v1.10.7
  • API server: 2.8.2
  • Kubernetes: GKE
