
adding vmi network filter outages #1234

Open
paigerube14 wants to merge 3 commits into krkn-chaos:main from paigerube14:vmi_network

Conversation

paigerube14 (Collaborator) commented Apr 17, 2026

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

Adding the ability to run network filter scenarios for VMIs (virtual machine instances)

Related Tickets & Documents

If there is no related issue, please create one and start the conversation on the wants/needs of this change.

Documentation

  • Is documentation needed for this update?

If checked, a documentation PR must be created and merged in the website repository.

Related Documentation PR (if applicable)

<!-- Add the link to the corresponding documentation PR in the website repository -->

Checklist before requesting a review

[ ] Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See the contributing guidelines.
See testing your changes, and run on any Kubernetes or OpenShift cluster to validate your changes.

  • I have performed a self-review of my code by running krkn and the specific scenario
  • If it is a core feature, I have added thorough unit tests with over 80% coverage

Validated in Prometheus:
[screenshot: Prometheus validation, 2026-04-29 10:53 AM]

REQUIRED:
Description of the combination of tests performed and output of the run

python run_kraken.py
...
<---insert test results output--->

OR

python -m coverage run -a -m unittest discover -s tests -v
...
<---insert test results output--->

@qodo-code-review

Review Summary by Qodo

Add VMI network chaos scenario support with nsenter-based injection

✨ Enhancement


Walkthroughs

Description
• Add VMI (Virtual Machine Instance) network chaos scenario support
• Implement get_vmi_targets() method for VMI target resolution
• Create new VmiNetworkChaosModule for injecting network chaos into VMIs
• Improve node interface recovery command execution reliability
• Register VMI network chaos module in factory
Diagram
flowchart LR
  A["NetworkChaosScenarioType"] -->|"Add VMI enum"| B["VMI = 3"]
  C["AbstractNetworkChaosModule"] -->|"Add get_vmi_targets()"| D["VMI target resolution"]
  E["VmiNetworkChaosModule"] -->|"New module"| F["Inject network chaos via nsenter"]
  G["NetworkChaosFactory"] -->|"Register"| E
  D -->|"Uses"| F


File Changes

1. krkn/scenario_plugins/network_chaos_ng/models.py ✨ Enhancement +1/-0

Add VMI enum to NetworkChaosScenarioType

• Add VMI = 3 enum value to NetworkChaosScenarioType class
• Enable VMI as a supported network chaos scenario type

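For orientation, a minimal sketch of the extended enum; only the `VMI = 3` member is confirmed by this diff, and the other members and their values are assumed for illustration:

```python
from enum import Enum

class NetworkChaosScenarioType(Enum):
    Node = 1  # assumed existing member, for illustration only
    Pod = 2   # assumed existing member, for illustration only
    VMI = 3   # new in this PR: enables VMI network chaos scenarios
```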


2. krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py ✨ Enhancement +29/-0

Add VMI target resolution method

• Implement get_vmi_targets() method for VMI target discovery
• Support regex matching on VMI names and namespaces
• Support optional label selector filtering in "key=value" format
• Return targets in "namespace/vmi-name" format

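A hedged sketch of the target resolution described above, stitched together from the review snippets quoted later on this page; the surrounding class and the `kubecli` accessor shapes are assumptions:

```python
def get_vmi_targets(self, config) -> list[str]:
    # Sketch only: mirrors the behavior described in the review, not the exact source.
    if not config.namespace:
        raise Exception("namespace not specified for VMI scenario, aborting")
    # VMI names are matched by regex; default to matching everything
    name_regex = config.target if config.target else ".*"
    vmis = self.kubecli.get_lib_kubernetes().get_vmis(name_regex, config.namespace)
    if config.label_selector:
        try:
            label_key, label_value = config.label_selector.split("=", 1)
        except ValueError:
            raise Exception(
                f"invalid label_selector format: '{config.label_selector}', expected 'key=value'"
            )
        vmis = [
            vmi
            for vmi in vmis
            if vmi.get("metadata", {}).get("labels", {}).get(label_key) == label_value
        ]
    # targets are returned in "namespace/vmi-name" format
    return [f'{config.namespace}/{vmi["metadata"]["name"]}' for vmi in vmis]
```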


3. krkn/scenario_plugins/network_chaos_ng/modules/node_interface_down.py 🐞 Bug fix +17/-8

Fix interface recovery and add license header

• Add Apache 2.0 license header
• Fix recovery command execution to use sh -c with stdio redirection
• Remove blocking sleep call after interface down command
• Improve comments explaining recovery process behavior

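To make the described fix concrete, here is a hedged sketch of the recovery pattern: the command is wrapped in `sh -c` with stdio redirected and backgrounded, so it keeps running after the exec session ends. The exec helper and interface handling are assumptions, not the module's actual API:

```python
def schedule_interface_recovery(exec_on_node, interfaces: list[str], test_duration: int):
    # Sketch: bring the interfaces back up after test_duration, detached from
    # the exec session via `sh -c` with stdio redirection and backgrounding.
    up_commands = "; ".join(f"ip link set {iface} up" for iface in interfaces)
    exec_on_node(f"sh -c 'sleep {test_duration}; {up_commands}' >/dev/null 2>&1 &")
```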


4. krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py ✨ Enhancement +229/-0

Implement VMI network chaos injection module

• Create new VmiNetworkChaosModule class for VMI network chaos injection
• Resolve VMI to node and virt-launcher pod mapping
• Deploy privileged chaos pod on VMI's node with nsenter support
• Apply traffic control rules via nsenter into virt-launcher network namespace
• Support latency, packet loss, and bandwidth restrictions
• Clean up chaos pod after test completion

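A hedged sketch of the injection step listed above: once the virt-launcher container's PID is known on the node, `nsenter` enters its network namespace and applies `tc` rules there. Helper names and the netem parameters are illustrative; only the nsenter-plus-tc approach is confirmed by the summary:

```python
def inject_vmi_network_chaos(exec_on_chaos_pod, launcher_pid: int, interface: str,
                             latency_ms: int, loss_percent: float):
    # Enter the virt-launcher network namespace by PID (-t) and shape traffic.
    exec_on_chaos_pod(
        f"nsenter -t {launcher_pid} -n tc qdisc add dev {interface} root netem "
        f"delay {latency_ms}ms loss {loss_percent}%"
    )

def remove_vmi_network_chaos(exec_on_chaos_pod, launcher_pid: int, interface: str):
    # Cleanup after the test window: drop the root qdisc that was added.
    exec_on_chaos_pod(f"nsenter -t {launcher_pid} -n tc qdisc del dev {interface} root")
```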


5. krkn/scenario_plugins/network_chaos_ng/network_chaos_factory.py ✨ Enhancement +10/-0

Register VMI network chaos module in factory

• Import new VmiNetworkChaosModule class
• Add "vmi_network_chaos" to supported modules list
• Add factory method to instantiate VmiNetworkChaosModule with config validation

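A minimal sketch of the factory wiring described above; the factory's real method names and validation logic are not shown on this page, so treat these as assumptions:

```python
from krkn.scenario_plugins.network_chaos_ng.modules.vmi_network_chaos import (
    VmiNetworkChaosModule,
)

def create_module(module_name: str, config):
    # Sketch of the dispatch: validate the config, then instantiate the module.
    if module_name == "vmi_network_chaos":
        if not config.namespace:
            raise Exception("vmi_network_chaos requires a namespace in its config")
        return VmiNetworkChaosModule(config)
    raise Exception(f"unsupported network chaos module: {module_name}")
```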




qodo-code-review Bot commented Apr 17, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (2)



Action required

1. label_selector lacks regex support 📎 Requirement gap ≡ Correctness
Description
The new VMI target selection only supports exact key=value label matching and does not support
regex matching for labels as required. This prevents users from flexibly selecting VMIs by labels
using regex patterns.
Code

krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[R106-117]

+        if config.label_selector:
+            try:
+                label_key, label_value = config.label_selector.split("=", 1)
+            except ValueError:
+                raise Exception(
+                    f"invalid label_selector format: '{config.label_selector}', expected 'key=value'"
+                )
+            vmis = [
+                vmi
+                for vmi in vmis
+                if vmi.get("metadata", {}).get("labels", {}).get(label_key) == label_value
+            ]
Evidence
PR Compliance ID 2 requires regex matching support for VMI selection via labels. The added code
parses label_selector as key=value and filters with strict equality (==), with no regex
matching applied to either key or value.

Allow selecting VMI by name, namespace, or labels with regex support
krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[106-117]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`get_vmi_targets()` post-filters VMIs by `label_selector` using exact `key=value` equality, but compliance requires regex matching support for label-based VMI selection.

## Issue Context
Compliance requires users can select VMIs by name/namespace/labels with regex support. Name selection uses a regex-like `target`, but label selection currently does not.

## Fix Focus Areas
- krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[106-117]
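One possible remediation, treating both sides of `key=value` as regular expressions; this is my reading of the compliance requirement, not a confirmed design:

```python
import re

def filter_vmis_by_label_regex(vmis: list[dict], label_selector: str) -> list[dict]:
    # Interpret "key=value" as a pair of regex patterns instead of literals.
    try:
        key_pattern, value_pattern = label_selector.split("=", 1)
    except ValueError:
        raise Exception(
            f"invalid label_selector format: '{label_selector}', expected 'key=value'"
        )
    key_re, value_re = re.compile(key_pattern), re.compile(value_pattern)
    return [
        vmi
        for vmi in vmis
        if any(
            key_re.fullmatch(key) and value_re.fullmatch(str(value))
            for key, value in vmi.get("metadata", {}).get("labels", {}).items()
        )
    ]
```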



2. namespace regex unsupported 📎 Requirement gap ≡ Correctness
Description
The new VMI target selection requires a single concrete namespace value and does not implement
regex-based namespace matching. This violates the requirement that namespace selection supports
regex matching.
Code

krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[R100-104]

+        if not config.namespace:
+            raise Exception("namespace not specified for VMI scenario, aborting")
+        name_regex = config.target if config.target else ".*"
+        vmis = self.kubecli.get_lib_kubernetes().get_vmis(name_regex, config.namespace)
+        if not vmis:
Evidence
PR Compliance ID 2 requires regex support for namespace-based VMI selection. The added code raises
if config.namespace is missing and passes config.namespace directly into get_vmis(...) without
any regex evaluation or namespace enumeration/filtering logic.

Allow selecting VMI by name, namespace, or labels with regex support
krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[100-104]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`get_vmi_targets()` does not implement regex matching for namespaces; it requires a single provided namespace and uses it directly.

## Issue Context
Compliance requires selecting VMIs by name, namespace, or labels with regex support for each selector.

## Fix Focus Areas
- krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py[100-104]
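A possible remediation sketch: enumerate cluster namespaces and filter them with the configured pattern before resolving VMIs. `list_namespaces()` is a hypothetical helper, not a confirmed krkn API:

```python
import re

def resolve_target_namespaces(kubecli, namespace_pattern: str) -> list[str]:
    # Treat config.namespace as a regex over the namespaces that exist.
    pattern = re.compile(namespace_pattern)
    all_namespaces = kubecli.get_lib_kubernetes().list_namespaces()  # hypothetical helper
    matched = [ns for ns in all_namespaces if pattern.fullmatch(ns)]
    if not matched:
        raise Exception(f"no namespace matches pattern '{namespace_pattern}', aborting")
    return matched
```

`get_vmis(...)` would then be called once per matched namespace and the results merged.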



3. InterfaceDown timeout too short 🐞 Bug ≡ Correctness
Description
NodeInterfaceDownModule now starts polling node readiness immediately, but the poll loop is capped
at 5 minutes regardless of test_duration; when test_duration > 300s it will report the node never
recovered before the background recovery even runs. This also breaks existing unit tests that assert
a sleep(test_duration) call during run().
Code

krkn/scenario_plugins/network_chaos_ng/modules/node_interface_down.py[R120-122]

            log_info(
                f"waiting for node {target} to become Ready after interface recovery",
                parallel,
Evidence
The recovery is scheduled as a background command that sleeps for {test_duration} before bringing the links back up, but readiness polling begins immediately and only runs for 60*5 seconds (300s), so configurations with test_duration > 300s will always hit the timeout/log error before recovery occurs. The repository's unit tests explicitly assert that run() sleeps for test_duration, which is no longer true after the removal in this PR.

krkn/scenario_plugins/network_chaos_ng/modules/node_interface_down.py[95-139]
tests/test_node_interface_down.py[169-203]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`NodeInterfaceDownModule.run()` schedules interface recovery after `test_duration`, but its readiness polling loop is hard-coded to 5 minutes and starts immediately. For `test_duration > 300s`, the module will time out and log an error before recovery occurs. This PR also removed the explicit `sleep(test_duration)`, breaking existing unit tests that expect it.

### Issue Context
Recovery is performed by a background command that sleeps for `test_duration`, so the readiness wait must allow at least `test_duration` plus a post-recovery readiness window.

### Fix Focus Areas
- krkn/scenario_plugins/network_chaos_ng/modules/node_interface_down.py[95-149]
- tests/test_node_interface_down.py[169-203]

### Suggested fix
- Change readiness polling to wait for up to `test_duration + 300s` (or another configurable post-recovery window), e.g. compute iterations as `ceil((test_duration + 300)/5)`; see the sketch after this list.
- Alternatively, reintroduce `time.sleep(test_duration)` before the readiness poll (and then poll for readiness), but still keep the improved exec command behavior.
- Update/adjust unit tests accordingly depending on the chosen behavior (if you reintroduce sleep(test_duration), existing tests should pass again).
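A sketch of the first option, sizing the poll window to `test_duration` plus a post-recovery grace period; the readiness-check callable is a stand-in for the module's actual node-status query:

```python
import math
import time

def wait_for_node_ready(is_node_ready, node: str, test_duration: int,
                        grace_seconds: int = 300, poll_interval: int = 5) -> bool:
    # Poll long enough to cover the background recovery sleep plus a readiness window.
    iterations = math.ceil((test_duration + grace_seconds) / poll_interval)
    for _ in range(iterations):
        if is_node_ready(node):
            return True
        time.sleep(poll_interval)
    return False
```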



4. VMI chaos pod leaks on error 🐞 Bug ☼ Reliability
Description
VmiNetworkChaosModule deletes the privileged chaos pod only on the success path; any exception after
pod creation will skip cleanup and can leave the pod running (and possibly tc rules applied). This
can cause persistent cluster impact and require manual cleanup.
Code

krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py[R215-224]

+            self.kubecli.get_lib_kubernetes().delete_pod(
+                network_chaos_pod_name, self.config.namespace
+            )
+
+        except Exception as e:
+            if error_queue is None:
+                raise e
+            else:
+                error_queue.put(str(e))
+
Evidence
The module creates a privileged chaos pod and applies tc rules inside the try block, but on
exceptions it only re-raises/enqueues the error; there is no finally/best-effort cleanup. Therefore
failures after deploy (e.g., PID resolution) can leak the chaos pod and leave traffic shaping in
place until manually removed.

krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py[103-117]
krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py[184-224]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`VmiNetworkChaosModule.run()` deletes the chaos pod only if the happy-path completes. If an exception occurs after `deploy_network_chaos_ng_pod(...)` (e.g., cannot resolve container IDs/PIDs, tc command errors), the privileged pod is left running and tc rules may remain applied.

### Issue Context
This module creates a privileged pod and performs network shaping; leaving either behind is operationally risky.

### Fix Focus Areas
- krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_chaos.py[103-224]

### Suggested fix
- Track whether the chaos pod was created (e.g., `pod_created = True` after deploy).
- Use a `try: ... finally:` that best-effort:
 - removes tc rules if they were applied (guard on `pids`/`interfaces`/a `rules_applied` flag), using `common_delete_limit_rules(...)` inside its own try/except to avoid masking the original error.
 - deletes the chaos pod if created, also best-effort.
- Preserve current parallel error reporting behavior (queue the original exception string), but still perform cleanup in finally; a sketch of this shape follows below.
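A sketch of the suggested `try/finally` shape; the helper names echo those mentioned in the review, but the signatures and surrounding state are assumptions:

```python
import time

def run_with_cleanup(self, pod_name: str, error_queue=None):
    pod_created = False
    rules_applied = False
    pids, interfaces = [], []
    try:
        self.deploy_network_chaos_ng_pod(pod_name)
        pod_created = True
        pids, interfaces = self.resolve_launcher_targets()  # hypothetical helper
        self.apply_limit_rules(pids, interfaces)            # hypothetical helper
        rules_applied = True
        time.sleep(self.config.test_duration)
    except Exception as e:
        if error_queue is None:
            raise
        error_queue.put(str(e))
    finally:
        # Best-effort cleanup that never masks the original error.
        if rules_applied:
            try:
                self.common_delete_limit_rules(pids, interfaces)
            except Exception:
                pass
        if pod_created:
            try:
                self.kubecli.get_lib_kubernetes().delete_pod(pod_name, self.config.namespace)
            except Exception:
                pass
```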




Comment thread on krkn/scenario_plugins/network_chaos_ng/modules/vmi_network_filter.py (outdated)
paigerube14 force-pushed the vmi_network branch 2 times, most recently from 6645656 to e9379ed on April 21, 2026 at 20:10
paigerube14 force-pushed the vmi_network branch 2 times, most recently from 16881d1 to 28a230e on April 29, 2026 at 16:15
paigerube14 changed the title from "adding vmi network outages" to "adding vmi network filter outages" on Apr 29, 2026
egress: true
protocols:
- tcp
- udp
Contributor

Hello @paigerube14, since protocols and ports are not relevant for this scenario, maybe they can be removed from the config.

Collaborator Author

thanks for pointing this out, I changed the code to add the protocols and ports to better align with the pod/node network scenarios

Collaborator Author

do you feel it's needed, or should we automatically just do all ports and tcp/udp the way it was?

Contributor

yogananth-subramanian commented May 4, 2026

PR #1234 and #1260 are currently designed to handle bridge-mode networking only (detecting k6t-* bridges). For masquerade-mode VMIs, the current workaround is to use generic pod network chaos (pod_network_filter/pod_network_chaos) on the virt-launcher pod, which blocks all pod traffic including VM traffic.

However, a better design would be to make VMI network chaos mode-aware: automatically detect the VMI's networking mode (bridge, masquerade, SR-IOV, etc.) and apply the appropriate chaos technique, as sketched after this list. This would mean:

  • users don't need to understand VMI networking modes to choose the right scenario
  • the same configuration works across different VMI networking modes
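For illustration, the mode detection could start from the VMI spec itself, where KubeVirt records the binding method as a key on each interface entry; a minimal sketch assuming the VMI is available as a dict:

```python
def detect_vmi_network_mode(vmi: dict) -> str:
    # KubeVirt expresses the binding as a key on each interface entry,
    # e.g. {"name": "default", "masquerade": {}} or {"name": "net1", "bridge": {}}.
    interfaces = (
        vmi.get("spec", {})
        .get("domain", {})
        .get("devices", {})
        .get("interfaces", [])
    )
    for interface in interfaces:
        for mode in ("bridge", "masquerade", "sriov"):
            if mode in interface:
                return mode
    return "unknown"
```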

Collaborator Author

thanks for the background/advice here Yogi. Completely agree that a user shouldn't need to know their networking backend and the scenario should choose properly. Making updates for that.
