submariner: make start hook idempotent with ensure model #2502

raghavendra-talur wants to merge 1 commit into RamenDR:main
Conversation
Refactor the submariner start hook to check whether the broker and the cluster joins are already healthy before re-running them. This avoids the "existing joined cluster with the same ID" error when re-running start on a partially deployed environment.

- Split deploy_broker/join_cluster into is_*/do_*/ensure_* functions
- Add are_deployments_available() to check deployment health
- Add clean_broker_registration() to remove stale broker-side state (clusters.submariner.io and endpoints.submariner.io) before re-joining
- Add subctl.uninstall() wrapper in drenv/subctl.py
- Fix typo: "deployuments" -> "deployments"

Assisted-by: Claude Code/claude-opus-4-6
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
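For readers skimming the conversation, here is a minimal sketch of the ensure model the description refers to. The is_*/do_*/ensure_* names come from the commit message; the bodies below are illustrative placeholders, not the actual diff.

```python
import os


def is_broker_deployed(broker, broker_info_file):
    """Check only, never mutate: True if broker state looks healthy.

    Placeholder check; the real function would also verify broker health.
    """
    return os.path.exists(broker_info_file)


def do_deploy_broker(broker):
    """Perform the deployment unconditionally (placeholder body)."""
    print(f"Deploying broker on cluster '{broker}'")


def ensure_broker(broker, broker_info_file):
    """Idempotent entry point: check first, act only when needed."""
    if is_broker_deployed(broker, broker_info_file):
        print(f"Broker on '{broker}' already deployed, skipping")
        return
    do_deploy_broker(broker)
```

The same check/do/ensure split applies to join_cluster, with clean_broker_registration() run before the join to clear the stale registration that triggers the "existing joined cluster with the same ID" error.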
```python
def deploy_broker(broker):
    print(f"Waiting until broker '{broker}' is ready")
    drenv_cluster.wait_until_ready(broker)


def is_broker_deployed(broker, broker_info):
```
I'm thinking this should be a bit more robust: it could validate broker health and the validity of the broker info file. A corrupted or stale broker info file could cause the function to skip deployment when it shouldn't.
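One way to harden the check, as a sketch: treat a missing, empty, or unreadable broker info file as "not deployed". The function name and the validity criteria here are assumptions, not part of the PR.

```python
import os


def is_broker_info_usable(broker_info_file):
    """Reject missing, empty, or unreadable broker info files.

    Assumption: non-empty and readable counts as plausibly valid; a
    stricter check would parse the file's actual format.
    """
    try:
        with open(broker_info_file, "rb") as f:
            return len(f.read(1)) == 1
    except OSError:
        return False
```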
```python
BROKER_NAMESPACE = "submariner-k8s-broker"
```
Are we sure this is the best place for this constant?
```python
    pass  # Not found is fine.
```
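For context, a hypothetical sketch of the clean_broker_registration() mentioned in the description, based on its stated job of deleting stale clusters.submariner.io and endpoints.submariner.io entries. The kubectl.delete() wrapper and deleting by cluster ID are assumptions; the snippet above suggests the real code simply ignores not-found errors.

```python
from drenv import kubectl  # assumed: drenv also wraps kubectl delete


def clean_broker_registration(broker, cluster_id):
    """Remove stale broker-side state for a cluster before re-joining.

    Hypothetical: assumes a kubectl.delete() wrapper with the same
    calling convention as the kubectl.get() used elsewhere in drenv.
    """
    for resource in ["clusters.submariner.io", "endpoints.submariner.io"]:
        kubectl.delete(
            resource,
            cluster_id,
            f"--namespace={BROKER_NAMESPACE}",
            "--ignore-not-found",
            context=broker,
        )
```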
```python
def are_deployments_available(cluster, names, namespace):
```
Looks like this doesn't distinguish between a deployment that doesn't exist and one that exists but isn't available.
```python
    for name in names:
        try:
            out = kubectl.get(
                f"deploy/{name}",
                f"--namespace={namespace}",
                "--output=jsonpath={.status.conditions[?(@.type=='Available')].status}",
                context=cluster,
            )
            if out.strip() != "True":
                return False
        except Exception:
            return False
```
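Addressing the earlier comment about distinguishing "missing" from "unavailable", here is a sketch that separates the two cases. It assumes kubectl.get() raises an exception whose message contains kubectl's NotFound text; the real drenv error type may differ.

```python
def deployment_state(cluster, name, namespace):
    """Return "missing", "unavailable", or "available" for one deployment."""
    try:
        out = kubectl.get(
            f"deploy/{name}",
            f"--namespace={namespace}",
            "--output=jsonpath={.status.conditions[?(@.type=='Available')].status}",
            context=cluster,
        )
    except Exception as e:
        if "NotFound" in str(e):
            return "missing"
        raise  # Unexpected errors should surface, not be swallowed.
    return "available" if out.strip() == "True" else "unavailable"
```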
Maybe this can be optimized to minimize kubectl.get calls by fetching all deployments in advance and then checking against each one?
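A sketch of that optimization: one kubectl call listing every deployment in the namespace, then a membership check per name. The JSON parsing below assumes standard kubectl get -o json output; it is not drenv code.

```python
import json


def are_deployments_available(cluster, names, namespace):
    """Check all deployments with a single kubectl call."""
    out = kubectl.get(
        "deploy",
        f"--namespace={namespace}",
        "--output=json",
        context=cluster,
    )
    available = {
        item["metadata"]["name"]
        for item in json.loads(out)["items"]
        if any(
            c["type"] == "Available" and c["status"] == "True"
            for c in item["status"].get("conditions", [])
        )
    }
    return all(name in available for name in names)
```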
I have never seen such an issue - how do you reproduce it? The start script should already be idempotent; deploying submariner twice works.

@raghavendra-talur I just tried, and submariner is idempotent:

```
% drenv start envs/submariner.yaml
2026-05-04 16:57:59,926 INFO [submariner] Starting environment
2026-05-04 16:57:59,974 INFO [hub] Starting minikube cluster
2026-05-04 16:57:59,978 INFO [dr1] Starting minikube cluster
2026-05-04 16:57:59,984 INFO [dr2] Starting minikube cluster
2026-05-04 16:58:15,424 INFO [dr2] Cluster started in 15.44 seconds
2026-05-04 16:58:15,764 INFO [dr2] Configuring containerd
2026-05-04 16:58:18,643 INFO [hub] Cluster started in 18.67 seconds
2026-05-04 16:58:18,978 INFO [hub] Configuring containerd
2026-05-04 16:58:20,070 INFO [hub/0] Running addons/submariner/start
2026-05-04 16:58:21,701 INFO [dr1] Cluster started in 21.72 seconds
2026-05-04 16:58:22,044 INFO [dr1] Configuring containerd
2026-05-04 16:59:19,887 INFO [hub/0] addons/submariner/start completed in 59.82 seconds
2026-05-04 16:59:19,887 INFO [hub/0] Running addons/submariner/test
2026-05-04 16:59:39,860 INFO [hub/0] addons/submariner/test completed in 19.97 seconds
2026-05-04 16:59:39,861 INFO [submariner] Environment started in 99.93 seconds
% drenv start envs/submariner.yaml
2026-05-04 17:02:09,311 INFO [submariner] Starting environment
2026-05-04 17:02:09,629 INFO [dr1] Starting minikube cluster
2026-05-04 17:02:09,634 INFO [dr2] Starting minikube cluster
2026-05-04 17:02:09,649 INFO [hub] Starting minikube cluster
2026-05-04 17:02:32,565 INFO [dr1] Cluster started in 22.94 seconds
2026-05-04 17:02:32,678 INFO [dr1] Waiting for fresh status
2026-05-04 17:02:39,585 INFO [hub] Cluster started in 29.94 seconds
2026-05-04 17:02:39,664 INFO [hub] Waiting for fresh status
2026-05-04 17:02:40,713 INFO [dr2] Cluster started in 31.08 seconds
2026-05-04 17:02:40,788 INFO [dr2] Waiting for fresh status
2026-05-04 17:03:02,671 INFO [dr1] Looking up failed deployments
2026-05-04 17:03:09,664 INFO [hub] Looking up failed deployments
2026-05-04 17:03:10,002 INFO [hub/0] Running addons/submariner/start
2026-05-04 17:03:10,780 INFO [dr2] Looking up failed deployments
2026-05-04 17:03:57,613 INFO [hub/0] addons/submariner/start completed in 47.61 seconds
2026-05-04 17:03:57,613 INFO [hub/0] Running addons/submariner/test
2026-05-04 17:04:16,690 INFO [hub/0] addons/submariner/test completed in 19.08 seconds
2026-05-04 17:04:16,690 INFO [submariner] Environment started in 127.39 seconds
```

Can you explain how to reproduce the issue you are trying to fix?