
adding vmi network filter outages #1234

Open
paigerube14 wants to merge 3 commits into krkn-chaos:main from paigerube14:vmi_network

Conversation

paigerube14 (Collaborator) commented Apr 17, 2026

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

Adding the ability to run network filter scenarios for VMIs (virtual machine instances)

Related Tickets & Documents

If there is no related issue, please create one and start the conversation on the wants/needs of this change.

Documentation

  • Is documentation needed for this update?

If checked, a documentation PR must be created and merged in the website repository.

Related Documentation PR (if applicable)

<!-- Add the link to the corresponding documentation PR in the website repository -->

Checklist before requesting a review

[ ] Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See the contributing guidelines.
See testing your changes, and run on any Kubernetes or OpenShift cluster to validate your changes.

  • I have performed a self-review of my code by running krkn and the specific scenario
  • If it is a core feature, I have added thorough unit tests with over 80% coverage

Validated in Prometheus:
[screenshot: Prometheus validation, 2026-04-29 10:53 AM]

REQUIRED:
Description of the combination of tests performed and output of the run

python run_kraken.py
...
<---insert test results output--->

OR

python -m coverage run -a -m unittest discover -s tests -v
...
<---insert test results output--->

@qodo-code-review

Review Summary by Qodo

Add VMI network chaos scenario support with nsenter-based injection

✨ Enhancement


Walkthroughs

Description
• Add VMI (Virtual Machine Instance) network chaos scenario support
• Implement get_vmi_targets() method for VMI target resolution
• Create new VmiNetworkChaosModule for injecting network chaos into VMIs
• Improve node interface recovery command execution reliability
• Register VMI network chaos module in factory
Diagram
flowchart LR
  A["NetworkChaosScenarioType"] -->|"Add VMI enum"| B["VMI = 3"]
  C["AbstractNetworkChaosModule"] -->|"Add get_vmi_targets()"| D["VMI target resolution"]
  E["VmiNetworkChaosModule"] -->|"New module"| F["Inject network chaos via nsenter"]
  G["NetworkChaosFactory"] -->|"Register"| E
  D -->|"Uses"| F


File Changes

1. krkn/scenario_plugins/network_chaos_ng/models.py ✨ Enhancement +1/-0

Add VMI enum to NetworkChaosScenarioType

• Add VMI = 3 enum value to NetworkChaosScenarioType class
• Enable VMI as a supported network chaos scenario type

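For orientation, a minimal sketch of the extended enum; only the `VMI = 3` member is confirmed by this diff, and the other members and their values are assumed for illustration:

```python
from enum import Enum

class NetworkChaosScenarioType(Enum):
    Node = 1  # assumed existing member, for illustration only
    Pod = 2   # assumed existing member, for illustration only
    VMI = 3   # new in this PR: enables VMI network chaos scenarios
```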


2. krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py ✨ Enhancement +29/-0

Add VMI target resolution method

• Implement get_vmi_targets() method for VMI target discovery
• Support regex matching on VMI names and namespaces
• Support optional label selector filtering in "key=value" format
• Return targets in "namespace/vmi-name" format

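A hedged sketch of the target resolution described above, stitched together from the review snippets quoted later on this page; the surrounding class and the `kubecli` accessor shapes are assumptions:

```python
def get_vmi_targets(self, config) -> list[str]:
    # Sketch only: mirrors the behavior described in the review, not the exact source.
    if not config.namespace:
        raise Exception("namespace not specified for VMI scenario, aborting")
    # VMI names are matched by regex; default to matching everything
    name_regex = config.target if config.target else ".*"
    vmis = self.kubecli.get_lib_kubernetes().get_vmis(name_regex, config.namespace)
    if config.label_selector:
        try:
            label_key, label_value = config.label_selector.split("=", 1)
        except ValueError:
            raise Exception(
                f"invalid label_selector format: '{config.label_selector}', expected 'key=value'"
            )
        vmis = [
            vmi
            for vmi in vmis
            if vmi.get("metadata", {}).get("labels", {}).get(label_key) == label_value
        ]
    # targets are returned in "namespace/vmi-name" format
    return [f'{config.namespace}/{vmi["metadata"]["name"]}' for vmi in vmis]
```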


3. krkn/scenario_plugins/network_chaos_ng/modules/node_interface_down.py 🐞 Bug fix +17/-8

Fix interface recovery and add license header

• Add Apache 2.0 license header
• Fix recovery command execution to use sh -c with stdio redirection
• Remove blocking sleep call after interface down command
• Improve comments explaining recovery process behavior

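To make the described fix concrete, here is a hedged sketch of the recovery pattern: the command is wrapped in `sh -c` with stdio redirected and backgrounded, so it keeps running after the exec session ends. The exec helper and interface handling are assumptions, not the module's actual API:

```python
def schedule_interface_recovery(exec_on_node, interfaces: list[str], test_duration: int):
    # Sketch: bring the interfaces back up after test_duration, detached from
    # the exec session via `sh -c` with stdio redirection and backgrounding.
    up_commands = "; ".join(f"ip link set {iface} up" for iface in interfaces)
    exec_on_node(f"sh -c 'sleep {test_duration}; {up_commands}' >/dev/null 2>&1 &")
```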


4. krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py ✨ Enhancement +229/-0

Implement VMI network chaos injection module

• Create new VmiNetworkChaosModule class for VMI network chaos injection
• Resolve VMI to node and virt-launcher pod mapping
• Deploy privileged chaos pod on VMI's node with nsenter support
• Apply traffic control rules via nsenter into virt-launcher network namespace
• Support latency, packet loss, and bandwidth restrictions
• Clean up chaos pod after test completion

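A hedged sketch of the injection step listed above: once the virt-launcher container's PID is known on the node, `nsenter` enters its network namespace and applies `tc` rules there. Helper names and the netem parameters are illustrative; only the nsenter-plus-tc approach is confirmed by the summary:

```python
def inject_vmi_network_chaos(exec_on_chaos_pod, launcher_pid: int, interface: str,
                             latency_ms: int, loss_percent: float):
    # Enter the virt-launcher network namespace by PID (-t) and shape traffic.
    exec_on_chaos_pod(
        f"nsenter -t {launcher_pid} -n tc qdisc add dev {interface} root netem "
        f"delay {latency_ms}ms loss {loss_percent}%"
    )

def remove_vmi_network_chaos(exec_on_chaos_pod, launcher_pid: int, interface: str):
    # Cleanup after the test window: drop the root qdisc that was added.
    exec_on_chaos_pod(f"nsenter -t {launcher_pid} -n tc qdisc del dev {interface} root")
```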


5. krkn/scenario_plugins/network_chaos_ng/network_chaos_factory.py ✨ Enhancement +10/-0

Register VMI network chaos module in factory

• Import new VmiNetworkChaosModule class
• Add "vmi_network_chaos" to supported modules list
• Add factory method to instantiate VmiNetworkChaosModule with config validation

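A minimal sketch of the factory wiring described above; the factory's real method names and validation logic are not shown on this page, so treat these as assumptions:

```python
from krkn.scenario_plugins.network_chaos_ng.modules.vmi_network_chaos import (
    VmiNetworkChaosModule,
)

def create_module(module_name: str, config):
    # Sketch of the dispatch: validate the config, then instantiate the module.
    if module_name == "vmi_network_chaos":
        if not config.namespace:
            raise Exception("vmi_network_chaos requires a namespace in its config")
        return VmiNetworkChaosModule(config)
    raise Exception(f"unsupported network chaos module: {module_name}")
```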




qodo-code-review Bot commented Apr 17, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (2)



Action required

1. label_selector lacks regex support 📎 Requirement gap ≡ Correctness
Description
The new VMI target selection only supports exact key=value label matching and does not support
regex matching for labels as required. This prevents users from flexibly selecting VMIs by labels
using regex patterns.
Code

krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[R106-117]

+        if config.label_selector:
+            try:
+                label_key, label_value = config.label_selector.split("=", 1)
+            except ValueError:
+                raise Exception(
+                    f"invalid label_selector format: '{config.label_selector}', expected 'key=value'"
+                )
+            vmis = [
+                vmi
+                for vmi in vmis
+                if vmi.get("metadata", {}).get("labels", {}).get(label_key) == label_value
+            ]
Evidence
PR Compliance ID 2 requires regex matching support for VMI selection via labels. The added code
parses label_selector as key=value and filters with strict equality (==), with no regex
matching applied to either key or value.

Allow selecting VMI by name, namespace, or labels with regex support
krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[106-117]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`get_vmi_targets()` post-filters VMIs by `label_selector` using exact `key=value` equality, but compliance requires regex matching support for label-based VMI selection.

## Issue Context
Compliance requires users can select VMIs by name/namespace/labels with regex support. Name selection uses a regex-like `target`, but label selection currently does not.

## Fix Focus Areas
- krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[106-117]
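One possible remediation, treating both sides of `key=value` as regular expressions; this is my reading of the compliance requirement, not a confirmed design:

```python
import re

def filter_vmis_by_label_regex(vmis: list[dict], label_selector: str) -> list[dict]:
    # Interpret "key=value" as a pair of regex patterns instead of literals.
    try:
        key_pattern, value_pattern = label_selector.split("=", 1)
    except ValueError:
        raise Exception(
            f"invalid label_selector format: '{label_selector}', expected 'key=value'"
        )
    key_re, value_re = re.compile(key_pattern), re.compile(value_pattern)
    return [
        vmi
        for vmi in vmis
        if any(
            key_re.fullmatch(key) and value_re.fullmatch(str(value))
            for key, value in vmi.get("metadata", {}).get("labels", {}).items()
        )
    ]
```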



2. namespace regex unsupported 📎 Requirement gap ≡ Correctness
Description
The new VMI target selection requires a single concrete namespace value and does not implement
regex-based namespace matching. This violates the requirement that namespace selection supports
regex matching.
Code

krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[R100-104]

+        if not config.namespace:
+            raise Exception("namespace not specified for VMI scenario, aborting")
+        name_regex = config.target if config.target else ".*"
+        vmis = self.kubecli.get_lib_kubernetes().get_vmis(name_regex, config.namespace)
+        if not vmis:
Evidence
PR Compliance ID 2 requires regex support for namespace-based VMI selection. The added code raises
if config.namespace is missing and passes config.namespace directly into get_vmis(...) without
any regex evaluation or namespace enumeration/filtering logic.

Allow selecting VMI by name, namespace, or labels with regex support
krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[100-104]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`get_vmi_targets()` does not implement regex matching for namespaces; it requires a single provided namespace and uses it directly.

## Issue Context
Compliance requires selecting VMIs by name, namespace, or labels with regex support for each selector.

## Fix Focus Areas
- krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[100-104]
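A possible remediation sketch: enumerate cluster namespaces and filter them with the configured pattern before resolving VMIs. `list_namespaces()` is a hypothetical helper, not a confirmed krkn API:

```python
import re

def resolve_target_namespaces(kubecli, namespace_pattern: str) -> list[str]:
    # Treat config.namespace as a regex over the namespaces that exist.
    pattern = re.compile(namespace_pattern)
    all_namespaces = kubecli.get_lib_kubernetes().list_namespaces()  # hypothetical helper
    matched = [ns for ns in all_namespaces if pattern.fullmatch(ns)]
    if not matched:
        raise Exception(f"no namespace matches pattern '{namespace_pattern}', aborting")
    return matched
```

`get_vmis(...)` would then be called once per matched namespace and the results merged.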



3. InterfaceDown timeout too short 🐞 Bug ≡ Correctness
Description
NodeInterfaceDownModule now starts polling node readiness immediately, but the poll loop is capped
at 5 minutes regardless of test_duration; when test_duration > 300s it will report the node never
recovered before the background recovery even runs. This also breaks existing unit tests that assert
a sleep(test_duration) call during run().
Code

krkn/scenario_plugins/network_chaos_ng/modules/node_interface_down.py[R120-122]

            log_info(
                f"waiting for node {target} to become Ready after interface recovery",
                parallel,
Evidence
The recovery is scheduled as a background command that sleeps for {test_duration} before bringing the links back up, but readiness polling begins immediately and only runs for 60*5 seconds (300s), so configurations with test_duration > 300s will always hit the timeout/log error before recovery occurs. The repository's unit tests explicitly assert that run() sleeps for test_duration, which is no longer true after the removal in this PR.

krkn/scenario_plugins/network_chaos_ng/modules/node_interface_down.py[95-139]
tests/test_node_interface_down.py[169-203]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`NodeInterfaceDownModule.run()` schedules interface recovery after `test_duration`, but its readiness polling loop is hard-coded to 5 minutes and starts immediately. For `test_duration > 300s`, the module will time out and log an error before recovery occurs. This PR also removed the explicit `sleep(test_duration)`, breaking existing unit tests that expect it.

### Issue Context
Recovery is performed by a background command that sleeps for `test_duration`, so the readiness wait must allow at least `test_duration` plus a post-recovery readiness window.

### Fix Focus Areas
- krkn/scenario_plugins/network_chaos_ng/modules/node_interface_down.py[95-149]
- tests/test_node_interface_down.py[169-203]

### Suggested fix
- Change readiness polling to wait for up to `test_duration + 300s` (or another configurable post-recovery window), e.g. compute iterations as `ceil((test_duration + 300)/5)`; see the sketch after this list.
- Alternatively, reintroduce `time.sleep(test_duration)` before the readiness poll (and then poll for readiness), but still keep the improved exec command behavior.
- Update/adjust unit tests accordingly depending on the chosen behavior (if you reintroduce sleep(test_duration), existing tests should pass again).
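A sketch of the first option, sizing the poll window to `test_duration` plus a post-recovery grace period; the readiness-check callable is a stand-in for the module's actual node-status query:

```python
import math
import time

def wait_for_node_ready(is_node_ready, node: str, test_duration: int,
                        grace_seconds: int = 300, poll_interval: int = 5) -> bool:
    # Poll long enough to cover the background recovery sleep plus a readiness window.
    iterations = math.ceil((test_duration + grace_seconds) / poll_interval)
    for _ in range(iterations):
        if is_node_ready(node):
            return True
        time.sleep(poll_interval)
    return False
```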



4. VMI chaos pod leaks on error 🐞 Bug ☼ Reliability
Description
VmiNetworkChaosModule deletes the privileged chaos pod only on the success path; any exception after
pod creation will skip cleanup and can leave the pod running (and possibly tc rules applied). This
can cause persistent cluster impact and require manual cleanup.
Code

krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py[R215-224]

+            self.kubecli.get_lib_kubernetes().delete_pod(
+                network_chaos_pod_name, self.config.namespace
+            )
+
+        except Exception as e:
+            if error_queue is None:
+                raise e
+            else:
+                error_queue.put(str(e))
+
Evidence
The module creates a privileged chaos pod and applies tc rules inside the try block, but on
exceptions it only re-raises/enqueues the error; there is no finally/best-effort cleanup. Therefore
failures after deploy (e.g., PID resolution) can leak the chaos pod and leave traffic shaping in
place until manually removed.

krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py[103-117]
krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py[184-224]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`VmiNetworkChaosModule.run()` deletes the chaos pod only if the happy-path completes. If an exception occurs after `deploy_network_chaos_ng_pod(...)` (e.g., cannot resolve container IDs/PIDs, tc command errors), the privileged pod is left running and tc rules may remain applied.

### Issue Context
This module creates a privileged pod and performs network shaping; leaving either behind is operationally risky.

### Fix Focus Areas
- krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py[103-224]

### Suggested fix
- Track whether the chaos pod was created (e.g., `pod_created = True` after deploy).
- Use a `try: ... finally:` that best-effort:
 - removes tc rules if they were applied (guard on `pids`/`interfaces`/a `rules_applied` flag), using `common_delete_limit_rules(...)` inside its own try/except to avoid masking the original error.
 - deletes the chaos pod if created, also best-effort.
- Preserve current parallel error reporting behavior (queue the original exception string), but still perform cleanup in finally; a sketch of this shape follows below.
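A sketch of the suggested `try/finally` shape; the helper names echo those mentioned in the review, but the signatures and surrounding state are assumptions:

```python
import time

def run_with_cleanup(self, pod_name: str, error_queue=None):
    pod_created = False
    rules_applied = False
    pids, interfaces = [], []
    try:
        self.deploy_network_chaos_ng_pod(pod_name)
        pod_created = True
        pids, interfaces = self.resolve_launcher_targets()  # hypothetical helper
        self.apply_limit_rules(pids, interfaces)            # hypothetical helper
        rules_applied = True
        time.sleep(self.config.test_duration)
    except Exception as e:
        if error_queue is None:
            raise
        error_queue.put(str(e))
    finally:
        # Best-effort cleanup that never masks the original error.
        if rules_applied:
            try:
                self.common_delete_limit_rules(pids, interfaces)
            except Exception:
                pass
        if pod_created:
            try:
                self.kubecli.get_lib_kubernetes().delete_pod(pod_name, self.config.namespace)
            except Exception:
                pass
```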




Comment thread on krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_filter.py (outdated)
paigerube14 force-pushed the vmi_network branch 2 times, most recently from 6645656 to e9379ed on April 21, 2026 at 20:10
paigerube14 force-pushed the vmi_network branch 2 times, most recently from 16881d1 to 28a230e on April 29, 2026 at 16:15
paigerube14 changed the title from "adding vmi network outages" to "adding vmi network filter outages" on Apr 29, 2026
egress: true
protocols:
- tcp
- udp
Contributor

Hello @paigerube14, since protocols and ports are not relevant for this scenario, maybe they can be removed from the config.

Collaborator Author

thanks for pointing this out, I changed the code to add the protocols and ports to better align with the pod/node network scenarios

Collaborator Author

do you feel it's needed, or should we automatically just do all ports and tcp/udp the way it was?

Contributor

yogananth-subramanian commented May 4, 2026

PR #1234 and #1260 are currently designed to handle bridge-mode networking only (detecting k6t-* bridges). For masquerade-mode VMIs, the current workaround is to use generic pod network chaos (pod_network_filter/pod_network_chaos) on the virt-launcher pod, which blocks all pod traffic including VM traffic.

However, a better design would be to make VMI network chaos mode-aware: automatically detect the VMI's networking mode (bridge, masquerade, SR-IOV, etc.) and apply the appropriate chaos technique, as sketched after this list. This would mean:

  • users don't need to understand VMI networking modes to choose the right scenario
  • the same configuration works across different VMI networking modes
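For illustration, the mode detection could start from the VMI spec itself, where KubeVirt records the binding method as a key on each interface entry; a minimal sketch assuming the VMI is available as a dict:

```python
def detect_vmi_network_mode(vmi: dict) -> str:
    # KubeVirt expresses the binding as a key on each interface entry,
    # e.g. {"name": "default", "masquerade": {}} or {"name": "net1", "bridge": {}}.
    interfaces = (
        vmi.get("spec", {})
        .get("domain", {})
        .get("devices", {})
        .get("interfaces", [])
    )
    for interface in interfaces:
        for mode in ("bridge", "masquerade", "sriov"):
            if mode in interface:
                return mode
    return "unknown"
```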

Collaborator Author

thanks for the background/advice here Yogi. Completely agree that a user shouldn't need to know their networking backend and the scenario should choose properly. Making updates for that.
