[BUG] Cassandra node gets decommissioned forever if scaling is partially done #410

**Describe the bug**

When scaling down, the Cassandra operator always decommissions a Cassandra node (pod) before deleting the pod. However, if the operator misses certain events, the node can be left decommissioned indefinitely, with its pod never deleted.

The scaling down logic is implemented as follows:
```go
if desiredSpecReplicas < currentSpecReplicas {
	...
	if len(decommissionedNodes) == 0 {
		// decommission one Cassandra node (pod)
	} else if len(decommissionedNodes) == 1 {
		// delete the decommissioned node (pod)
	}
}
```

Assume we have a Cassandra datacenter with three nodes (`currentSpecReplicas`) and the user wants to scale to two (`desiredSpecReplicas`). On seeing `desiredSpecReplicas < currentSpecReplicas`, the operator first finds that there is no decommissioned node (`len(decommissionedNodes) == 0`), so it decommissions one of the Cassandra nodes and finishes the current reconcile. Ideally, the operator then deletes the decommissioned node's pod in the next reconcile.

However, if the user changes the replica count back to three before the operator enters the next reconcile (this can happen when the operator runs slowly or crashes), the operator finds `desiredSpecReplicas == currentSpecReplicas` in the next reconcile and skips the scale-down branch, so the decommissioned node is neither deleted nor brought back. The node stays decommissioned until the user issues another scale-down. In the meantime only two Cassandra nodes are functioning, although the StatefulSet still hosts three pods.
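The sequence above can be reproduced with a small in-memory model. This is only a sketch, not the operator's code: the `Datacenter` type, the `reconcile` function, the pod names, and the choice of the last pod as the decommission target are all assumptions made for illustration.

```go
package main

import "fmt"

// Datacenter is a hypothetical, simplified stand-in for the state the
// operator inspects during a reconcile.
type Datacenter struct {
	CurrentReplicas int      // plays the role of currentSpecReplicas
	Pods            []string // pods hosted by the StatefulSet
	Decommissioned  []string // pods whose Cassandra node is decommissioned
}

// reconcile mirrors the two-step scale-down branch quoted in this issue.
// It returns a short description of the action taken.
func reconcile(dc *Datacenter, desired int) string {
	if desired < dc.CurrentReplicas {
		if len(dc.Decommissioned) == 0 {
			// Step 1: decommission one node (assumed here: the last pod).
			last := dc.Pods[len(dc.Pods)-1]
			dc.Decommissioned = append(dc.Decommissioned, last)
			return "decommissioned " + last
		} else if len(dc.Decommissioned) == 1 {
			// Step 2: delete the decommissioned pod.
			gone := dc.Decommissioned[0]
			dc.Pods = dc.Pods[:len(dc.Pods)-1]
			dc.Decommissioned = nil
			dc.CurrentReplicas--
			return "deleted " + gone
		}
	}
	// desired >= current: decommissioned nodes are never inspected.
	return "no-op"
}

func main() {
	dc := &Datacenter{
		CurrentReplicas: 3,
		Pods:            []string{"cassandra-0", "cassandra-1", "cassandra-2"},
	}
	fmt.Println(reconcile(dc, 2)) // scale down 3 -> 2: step 1 only
	fmt.Println(reconcile(dc, 3)) // user reverts to 3 before step 2 runs
	fmt.Printf("pods=%d decommissioned=%v\n", len(dc.Pods), dc.Decommissioned)
}
```

In this model the second call returns `no-op` while `Decommissioned` still holds one pod, which matches the stuck state described above.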

**To Reproduce**

Steps to reproduce the behavior:
1. Create a Cassandra datacenter with three replicas.
2. Scale down: three -> two. The operator decommissions one node but has not yet deleted its pod.
3. Scale up: two -> three. The operator finds `desiredSpecReplicas == currentSpecReplicas` and leaves the node decommissioned.


**Expected behavior**
The operator should check whether any node is decommissioned and, if that node is not supposed to be deleted, bring it back.

**Environment**
- OS Linux
- Kubernetes version v1.18.9
- `kubectl version` v1.20.1
- Go version 1.13.9
- Cassandra version 3

**Additional context**
We are willing to help fix this bug. One potential fix is to delete the pod whose Cassandra node is decommissioned. Since the pod is managed by the StatefulSet, it will be recreated automatically and leave the decommissioned state.
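A minimal sketch of how the proposed fix could fit into the reconcile loop, using an in-memory model rather than the operator's real types. `Datacenter`, `reconcile`, and `restartPod` are hypothetical names; `restartPod` only simulates "delete the pod and let the StatefulSet recreate it", and it assumes, as the proposal does, that the recreated pod comes back out of the decommissioned state.

```go
package main

import "fmt"

// Datacenter is a hypothetical, simplified stand-in for operator state.
type Datacenter struct {
	CurrentReplicas int
	Pods            []string
	Decommissioned  map[string]bool // pod name -> node is decommissioned
}

// restartPod simulates deleting a pod and having the StatefulSet recreate
// it. Assumption: the recreated pod is no longer decommissioned.
func restartPod(dc *Datacenter, pod string) {
	delete(dc.Decommissioned, pod)
}

// reconcile adds a recovery branch: when no scale-down is requested, any
// leftover decommissioned pod is restarted instead of being ignored.
func reconcile(dc *Datacenter, desired int) string {
	if desired >= dc.CurrentReplicas {
		for _, pod := range dc.Pods {
			if dc.Decommissioned[pod] {
				restartPod(dc, pod)
				return "restarted " + pod
			}
		}
		return "no-op"
	}
	// The existing decommission-then-delete branch would run here; it is
	// unchanged by the proposal and omitted from this sketch.
	return "scale-down"
}

func main() {
	// State left behind by a partially completed scale-down (3 -> 2 -> 3).
	dc := &Datacenter{
		CurrentReplicas: 3,
		Pods:            []string{"cassandra-0", "cassandra-1", "cassandra-2"},
		Decommissioned:  map[string]bool{"cassandra-2": true},
	}
	fmt.Println(reconcile(dc, 3)) // recovery branch restarts the stuck pod
	fmt.Println(reconcile(dc, 3)) // nothing left to repair
}
```

Handling one pod per reconcile keeps the recovery step in line with the existing one-action-per-reconcile scale-down logic; whether that is the right granularity for the real operator is an open design question.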

