Summary
Every existing `TestWorkflowExecution` CRD re-triggers on every operator restart, regardless of how old it is or whether it has already executed.
Root Cause
The testkube-operator's `testworkflowexecution_controller` contains a deduplication guard that checks:
```go
if testWorkflowExecution.Generation == testWorkflowExecution.Status.Generation {
return ctrl.Result{}, nil
}
```
However, `Status.Generation` is never written after the execution completes, so it stays at `0` permanently. Meanwhile, `Generation` is always `1` (the Kubernetes-assigned value for a CRD that was created but never had its spec updated). The guard therefore evaluates `1 == 0`, which is always `false`, and the early return never fires.
Result: When the operator pod restarts, the informer replays all existing TWE CRDs as synthetic ADD events. Since the guard never triggers, every CRD re-executes simultaneously.
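To make the failure mode concrete, here is a minimal sketch of the guard run against replayed objects. The types and the `replay` helper are simplified stand-ins invented for illustration; they are not the operator's real types and they do not use controller-runtime.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the CRD types.
// Only the two fields the guard compares are modelled.
type status struct{ Generation int64 }

type testWorkflowExecution struct {
	Name       string
	Generation int64 // metadata.generation, assigned by the API server
	Status     status
}

// shouldSkip mirrors the comparison at the top of Reconcile.
func shouldSkip(twe testWorkflowExecution) bool {
	return twe.Generation == twe.Status.Generation
}

// replay simulates the informer delivering every existing object after
// a restart and returns the names that would execute again.
func replay(objs []testWorkflowExecution) []string {
	var executed []string
	for _, o := range objs {
		if shouldSkip(o) {
			continue
		}
		executed = append(executed, o.Name)
	}
	return executed
}

func main() {
	// Status.Generation is left at its zero value, as described above.
	objs := []testWorkflowExecution{
		{Name: "twe-a", Generation: 1},
		{Name: "twe-b", Generation: 1},
	}
	fmt.Println(replay(objs)) // both names are returned
}
```

Because `Status.Generation` never leaves its zero value, no object is skipped, which matches the simultaneous re-execution seen after a restart.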
Steps to Reproduce
- Create several `TestWorkflowExecution` CRDs and let them complete
- Restart the testkube-operator pod (e.g., via `kubectl rollout restart`, or a GKE rolling node pool upgrade that evicts the pod)
- Observe: all existing TWE CRDs execute again immediately after the operator comes back
Observed Behavior
- 23 stale TWE CRDs re-executed simultaneously within seconds of the operator pod starting
- All executions had `runningContext.interface.type = internal` and `actor.type = testworkflowexecution`
- Confirmed on operator v1.10.7 / API server 2.8.2
Expected Behavior
The deduplication guard at the top of `Reconcile` should prevent already-executed CRDs from running again. This requires writing `Status.Generation = testWorkflowExecution.Generation` into the CRD status after a successful (or terminal) execution.
Suggested Fix
In `testworkflowexecution_controller.go`, after the execution is triggered and reaches a terminal state, write the handled generation back to the status:
```go
testWorkflowExecution.Status.Generation = testWorkflowExecution.Generation
if err := r.Status().Update(ctx, testWorkflowExecution); err != nil {
return ctrl.Result{}, err
}
```
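The effect of recording the generation can be sketched in isolation. This is a toy model under stated assumptions, not the controller's code: `execution`, `ObservedGeneration` (standing in for `Status.Generation`), and `reconcile` are hypothetical names, and the status write is modelled as a plain field assignment.

```go
package main

import "fmt"

// execution is a hypothetical stand-in for a TestWorkflowExecution object.
type execution struct {
	Generation         int64
	ObservedGeneration int64 // stands in for Status.Generation
	Runs               int   // how many times the workflow was triggered
}

// reconcile applies the guard, triggers a run, then records the
// generation it handled so later deliveries are skipped.
func reconcile(e *execution) {
	if e.Generation == e.ObservedGeneration {
		return // already handled this generation
	}
	e.Runs++
	e.ObservedGeneration = e.Generation
}

func main() {
	e := &execution{Generation: 1}
	reconcile(e)        // first delivery: runs
	reconcile(e)        // replay after an operator restart: skipped
	fmt.Println(e.Runs) // prints 1

	e.Generation = 2    // a spec edit bumps the generation
	reconcile(e)        // runs again, as intended
	fmt.Println(e.Runs) // prints 2
}
```

With the generation recorded, a replayed object is skipped, while a genuine spec change still triggers a new run.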
Workaround
We changed the ArgoCD custom Lua resource action so that, instead of creating a `TestWorkflowExecution` CRD, it creates a short-lived `batch/v1 Job` that POSTs directly to the Testkube API:
```
POST http://testkube-api-server..svc.cluster.local:8088/v1/test-workflows/{name}/executions
```
With `ttlSecondsAfterFinished: 300`, the Job is deleted automatically and leaves no persistent CRD behind, so the workaround is unaffected by operator restarts.
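For illustration, the request the Job makes could look roughly like the sketch below. The helper names `executionURL` and `triggerExecution` are hypothetical, the empty JSON body `{}` is an assumption about what the endpoint accepts, and the base URL is a caller-supplied parameter. A local `httptest` stub stands in for the API server so the sketch runs without a cluster.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// executionURL builds the endpoint path quoted above from a
// caller-supplied base URL and workflow name.
func executionURL(baseURL, workflow string) string {
	return strings.TrimRight(baseURL, "/") +
		"/v1/test-workflows/" + url.PathEscape(workflow) + "/executions"
}

// triggerExecution POSTs an empty JSON object to the endpoint.
// Assumption: an empty request body is acceptable to the API.
func triggerExecution(client *http.Client, baseURL, workflow string) error {
	resp, err := client.Post(executionURL(baseURL, workflow),
		"application/json", strings.NewReader("{}"))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Local stub in place of the real API server.
	stub := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Println(r.Method, r.URL.Path)
			w.WriteHeader(http.StatusCreated)
		}))
	defer stub.Close()

	if err := triggerExecution(stub.Client(), stub.URL, "my-workflow"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Since the trigger is a one-off HTTP call rather than a persistent object, there is nothing for the informer to replay when the operator restarts.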
Environment
- Operator: v1.10.7
- API server: 2.8.2
- Kubernetes: GKE