Health check plugin #247
Open: paigerube14 wants to merge 2 commits into krkn-chaos:main from paigerube14:health_check_plugin
262 changes: 262 additions & 0 deletions
content/en/docs/developers-guide/health-check-plugins.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,262 @@ | ||
| --- | ||
| title: Health Check Plugins | ||
| date: 2024-01-01 | ||
| description: > | ||
| How to use and create health check plugins in Krkn to monitor services during chaos experiments | ||
| categories: [Developers Guide] | ||
| tags: [health-checks, plugins, docs] | ||
| weight: 5 | ||
| --- | ||
|
|
||
| # Health Check Plugins | ||
|
|
||
| Health check plugins allow Krkn to continuously monitor the health of your services and infrastructure **during** chaos experiments. They run in background threads alongside the chaos scenario, detecting outages, tracking downtime duration, and collecting telemetry data. | ||
|
|
||
| ## Overview | ||
|
|
||
| The health check system uses a plugin architecture: | ||
|
|
||
| - **`HealthCheckFactory`** — automatically discovers and loads all plugins from the `krkn.health_checks` package | ||
| - **`AbstractHealthCheckPlugin`** — base class all plugins must extend | ||
| - Plugins run in separate threads and write telemetry to a shared queue | ||
| - The factory tracks all active plugin instances and provides lifecycle management (`increment_all_iterations`, `stop_all`) | ||
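The threading model described in the list above can be pictured with a minimal, standard-library-only sketch. It is not Krkn code: `run_check` and `run_all` are hypothetical names, and the shared `threading.Event` only stands in for whatever the factory's `stop_all` does internally.

```python
import queue
import threading
import time


def run_check(name: str, results: queue.Queue, stop_event: threading.Event) -> None:
    """Stand-in for a plugin's check loop: record samples until told to stop."""
    samples = []
    while not stop_event.is_set():
        samples.append({"check": name, "healthy": True})
        stop_event.wait(0.01)  # short interval so the sketch finishes quickly
    results.put(samples)  # one batch of telemetry per check thread


def run_all(names: list[str], duration: float = 0.05) -> list[list[dict]]:
    """Start one thread per check, let them run, then stop and collect results."""
    results: queue.Queue = queue.Queue()
    stop_event = threading.Event()
    threads = [
        threading.Thread(target=run_check, args=(name, results, stop_event))
        for name in names
    ]
    for thread in threads:
        thread.start()
    time.sleep(duration)  # the chaos scenario would run during this window
    stop_event.set()      # hypothetical equivalent of stopping every plugin
    for thread in threads:
        thread.join()
    return [results.get() for _ in threads]
```

Calling `run_all(["http", "virt"])` returns two batches, one per check thread, in whatever order the threads finished.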
|
|
||
| ### Exit Codes | ||
|
|
||
| | Code | Meaning | | ||
| |------|---------| | ||
| | `0` | Success — all health checks passed | | ||
| | `2` | Critical alert detected during the run | | ||
| | `3` | Health check failure (e.g. `exit_on_failure: true` triggered) | | ||
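For wrapper scripts that branch on these codes, the table can be mirrored in a small enum. This is a sketch only: `ExitCode` and `describe` are hypothetical names, not part of Krkn, and only the three numeric values come from the table above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes from the table above; the member names are illustrative."""
    SUCCESS = 0
    CRITICAL_ALERT = 2
    HEALTH_CHECK_FAILURE = 3


def describe(code: int) -> str:
    """Return a lowercase label for a known exit code, or 'unknown' otherwise."""
    try:
        return ExitCode(code).name.lower()
    except ValueError:
        return "unknown"
```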
|
|
||
| --- | ||
|
|
||
| ## Built-in Plugins | ||
|
|
||
| ### [HTTP Health Check](../krkn/health-checks.md) (`health_checks`) | ||
|
|
||
| Monitors HTTP endpoints by making periodic GET requests. Tracks status changes, measures downtime duration, and records telemetry for each state transition. | ||
|
|
||
| **Configuration example:** | ||
|
|
||
| ```yaml | ||
| health_checks: | ||
|   interval: 2 # seconds between checks | ||
|   config: | ||
|     - url: "http://my-service/health" | ||
|       verify_url: true # SSL certificate verification (default: true) | ||
|       exit_on_failure: false # exit with code 3 if endpoint goes down (default: false) | ||
|
|
||
|     - url: "https://api.example.com/status" | ||
|       bearer_token: "your-token" # Authorization: Bearer <token> | ||
|       exit_on_failure: true | ||
|
|
||
|     - url: "http://internal-service" | ||
|       auth: "username,password" # HTTP basic auth | ||
|       verify_url: false | ||
| ``` | ||
|
|
||
| **Config fields:** | ||
|
|
||
| | Field | Type | Required | Default | Description | | ||
| |-------|------|----------|---------|-------------| | ||
| | `url` | string | yes | — | HTTP endpoint to monitor | | ||
| | `bearer_token` | string | no | — | Bearer token for Authorization header | | ||
| | `auth` | string | no | — | Basic auth as `"username,password"` | | ||
| | `verify_url` | bool | no | `true` | Verify SSL certificates | | ||
| | `exit_on_failure` | bool | no | `false` | Set exit code 3 when the endpoint returns a non-200 status | | ||
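One way to read the field table is as a mapping from a config entry to the keyword arguments of a requests-style HTTP client. The helper below is a hypothetical illustration of that mapping, not the plugin's actual code; `build_request_kwargs` and the choice of a requests-style `verify`/`headers`/`auth` interface are assumptions.

```python
from typing import Any


def build_request_kwargs(entry: dict[str, Any]) -> dict[str, Any]:
    """Translate one config entry into requests-style keyword arguments."""
    kwargs: dict[str, Any] = {
        "url": entry["url"],                      # required field
        "verify": entry.get("verify_url", True),  # certificate verification defaults to on
    }
    if entry.get("bearer_token"):
        kwargs["headers"] = {"Authorization": f"Bearer {entry['bearer_token']}"}
    if entry.get("auth"):
        # "username,password" becomes a (username, password) tuple
        username, password = entry["auth"].split(",", 1)
        kwargs["auth"] = (username, password)
    return kwargs
```

Splitting on the first comma only means a password may contain commas, while a username may not.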
|
|
||
| --- | ||
|
|
||
| ### [KubeVirt VM Health Check](../krkn/virt-checks.md) (`virt_health_check`) | ||
|
|
||
| Monitors KubeVirt VirtualMachineInstance (VMI) connectivity during chaos experiments. It tracks SSH/network access to VMs, detects disconnections, and records recovery data. | ||
|
|
||
| **Configuration example:** | ||
|
|
||
| ```yaml | ||
| kubevirt_checks: | ||
|   interval: 5 | ||
|   config: | ||
|     namespace: "my-namespace" | ||
|     node_name: "worker-1" # optional: filter VMIs by node | ||
|     exit_on_failure: false | ||
|     disconnected_mode: false # track VMs that become unreachable | ||
|     only_failures: true # only record failed checks in telemetry | ||
|     batch_size: 10 # VMIs to check concurrently (0 = no limit) | ||
|     ssh_port: 22 | ||
|     ssh_user: "cloud-user" | ||
|     ssh_private_key: "/path/to/key" | ||
| ``` | ||
|
Comment on lines +78 to +90

1. Kubevirt config nesting mismatch

The new Health Check Plugins page documents the `kubevirt_checks` configuration nested under a `config:` key and uses field names that conflict with the existing virt-check docs, so users copying the example will produce a config that doesn't match the `kubevirt_checks` shape documented elsewhere on this site.
|
||
|
|
||
| --- | ||
|
|
||
| ## Creating a Custom Health Check Plugin | ||
|
|
||
| You can add your own health check plugin by following these steps. | ||
|
|
||
| ### 1. Naming Conventions | ||
|
|
||
| The factory enforces strict naming conventions for auto-discovery. Both the file name and class name must follow these rules: | ||
|
|
||
| | Rule | Example | | ||
| |------|---------| | ||
| | File must be in `krkn/health_checks/` | `krkn/health_checks/my_service_health_check_plugin.py` | | ||
| | File name must end with `_health_check_plugin.py` | `my_service_health_check_plugin.py` | | ||
| | Class name must be CapitalCamelCase of the file name | `MyServiceHealthCheckPlugin` | | ||
| | Class name must end with `HealthCheckPlugin` | `MyServiceHealthCheckPlugin` | | ||
| | Class must inherit from `AbstractHealthCheckPlugin` | — | | ||
|
|
||
| The snake_case-to-CamelCase conversion is automatic — `my_service_health_check_plugin` becomes `MyServiceHealthCheckPlugin`. The factory will reject your plugin if these rules are not followed. | ||
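The naming rules can be checked locally before adding a file. The sketch below reimplements the rules from the table as described; it is not the factory's own validation code, and `expected_class_name` and `follows_conventions` are hypothetical helper names.

```python
def expected_class_name(module_name: str) -> str:
    """Convert a snake_case module name to CapitalCamelCase."""
    return "".join(part.capitalize() for part in module_name.split("_"))


def follows_conventions(file_name: str, class_name: str) -> bool:
    """Check the file-name suffix and the derived class name."""
    if not file_name.endswith("_health_check_plugin.py"):
        return False
    module_name = file_name[: -len(".py")]
    return (
        class_name.endswith("HealthCheckPlugin")
        and class_name == expected_class_name(module_name)
    )
```

Note that the conversion capitalizes only the first letter of each underscore-separated part, so a class named `MYServiceHealthCheckPlugin` would not match `my_service_health_check_plugin.py` under this reading of the rules.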
|
|
||
| ### 2. Create the Plugin File | ||
|
|
||
| Create `krkn/health_checks/my_service_health_check_plugin.py`: | ||
|
|
||
| ```python | ||
| import logging | ||
| import queue | ||
| import time | ||
| from typing import Any | ||
|
|
||
| from krkn.health_checks.abstract_health_check_plugin import AbstractHealthCheckPlugin | ||
|
|
||
|
|
||
| class MyServiceHealthCheckPlugin(AbstractHealthCheckPlugin): | ||
|     """ | ||
|     Health check plugin that monitors MyService during chaos experiments. | ||
|     """ | ||
|
|
||
|     def __init__( | ||
|         self, | ||
|         health_check_type: str = "my_service_health_check", | ||
|         iterations: int = 1, | ||
|         **kwargs | ||
|     ): | ||
|         super().__init__(health_check_type) | ||
|         self.iterations = iterations | ||
|         self.current_iterations = 0 | ||
|
|
||
|     def get_health_check_types(self) -> list[str]: | ||
|         """ | ||
|         Returns the health_check type strings this plugin handles. | ||
|         These must match the `type` field in config.yaml health_checks section. | ||
|         One plugin can handle multiple type strings. | ||
|         """ | ||
|         return ["my_service_health_check"] | ||
|
|
||
|     def increment_iterations(self) -> None: | ||
|         """ | ||
|         Called by the main run loop after each chaos iteration. | ||
|         Increment your counter so the plugin knows when to stop. | ||
|         """ | ||
|         self.current_iterations += 1 | ||
|
|
||
|     def run_health_check( | ||
|         self, | ||
|         config: dict[str, Any], | ||
|         telemetry_queue: queue.Queue, | ||
|     ) -> None: | ||
|         """ | ||
|         Main health check loop. Runs in a background thread. | ||
|
|
||
|         - Check `self._stop_event.is_set()` to support cooperative shutdown | ||
|         - Check `self.current_iterations < self.iterations` to stop after the run | ||
|         - Put telemetry results into `telemetry_queue` | ||
|         - Set `self.ret_value = 3` to signal health check failure | ||
|         """ | ||
|         if not config or not config.get("config"): | ||
|             logging.info("my_service_health_check config not defined, skipping") | ||
|             return | ||
|
|
||
|         interval = config.get("interval", 5) | ||
|         exit_on_failure = config.get("config", {}).get("exit_on_failure", False) | ||
|         endpoint = config.get("config", {}).get("endpoint") | ||
|
|
||
|         telemetry_results = [] | ||
|
|
||
|         while self.current_iterations < self.iterations and not self._stop_event.is_set(): | ||
|             healthy = self._check_my_service(endpoint) | ||
|
|
||
|             if not healthy: | ||
|                 logging.warning(f"MyService at {endpoint} is unhealthy") | ||
|                 if exit_on_failure and self.ret_value == 0: | ||
|                     self.ret_value = 3 | ||
|
|
||
|             # Collect telemetry (structure depends on your needs) | ||
|             telemetry_results.append({ | ||
|                 "endpoint": endpoint, | ||
|                 "status": healthy, | ||
|             }) | ||
|
|
||
|             time.sleep(interval) | ||
|
|
||
|         # Always put results into the queue when done | ||
|         telemetry_queue.put(telemetry_results) | ||
|
|
||
|     def _check_my_service(self, endpoint: str) -> bool: | ||
|         """Check if the service is healthy. Returns True if healthy.""" | ||
|         try: | ||
|             # Your service-specific health check logic here | ||
|             return True | ||
|         except Exception as e: | ||
|             logging.error(f"Health check failed: {e}") | ||
|             return False | ||
| ``` | ||
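The example above pauses with `time.sleep(interval)`, so a stop request is only noticed once the current sleep ends. One possible variation, shown here as a standalone sketch rather than a required change, is to wait on the stop event itself: `threading.Event.wait(timeout)` returns as soon as the event is set. The names `poll_until_stopped` and `demo` are hypothetical.

```python
import threading
import time


def poll_until_stopped(stop_event: threading.Event, interval: float, samples: list) -> None:
    """Poll until the stop event is set, pausing between polls."""
    while not stop_event.is_set():
        samples.append(time.monotonic())
        # Returns immediately once the event is set, or after `interval` seconds.
        stop_event.wait(interval)


def demo() -> float:
    """Start a poller with a 30 s interval and measure how fast it stops."""
    stop_event = threading.Event()
    samples: list = []
    worker = threading.Thread(target=poll_until_stopped, args=(stop_event, 30.0, samples))
    worker.start()
    started = time.monotonic()
    stop_event.set()        # request shutdown
    worker.join(timeout=5)  # should return long before the 30 s interval elapses
    return time.monotonic() - started
```

With `time.sleep(30.0)` in place of `stop_event.wait(30.0)`, the same `join` could time out while the worker finishes its sleep.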
|
|
||
| ### 3. Register the Type in `config.yaml` | ||
|
|
||
| Reference your plugin's type string in your chaos scenario config: | ||
|
|
||
| ```yaml | ||
| health_checks: | ||
|   type: my_service_health_check | ||
|   interval: 5 | ||
|   config: | ||
|     endpoint: "http://my-service:8080" | ||
|     exit_on_failure: true | ||
| ``` | ||
|
qodo-code-review[bot] marked this conversation as resolved.
Outdated
|
||
|
|
||
| The factory matches the `type` field against the strings returned by `get_health_check_types()` and automatically instantiates your plugin. | ||
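The matching step can be pictured as a lookup table keyed by type string. The sketch below is an assumption about the general shape of such a lookup, not the factory's actual implementation; `HttpCheck`, `MyServiceCheck`, `build_registry`, and `create_plugin` are hypothetical names.

```python
from typing import Any


class HttpCheck:
    def get_health_check_types(self) -> list[str]:
        return ["http_health_check"]


class MyServiceCheck:
    def get_health_check_types(self) -> list[str]:
        # One plugin may answer to several type strings.
        return ["my_service_health_check", "my_service"]


def build_registry(plugin_classes: list[type]) -> dict[str, type]:
    """Map every declared type string to the class that handles it."""
    registry: dict[str, type] = {}
    for cls in plugin_classes:
        for type_name in cls().get_health_check_types():
            if type_name in registry:
                raise ValueError(f"duplicate health check type: {type_name}")
            registry[type_name] = cls
    return registry


def create_plugin(registry: dict[str, type], config: dict[str, Any]) -> Any:
    """Instantiate the plugin whose type string matches config['type']."""
    return registry[config["type"]]()
```

Rejecting duplicates at registration time keeps the lookup unambiguous, which is consistent with duplicate types being listed among the load failures later on this page.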
|
|
||
| ### 4. AbstractHealthCheckPlugin API Reference | ||
|
|
||
| Your plugin inherits the following from `AbstractHealthCheckPlugin`: | ||
|
|
||
| | Member | Type | Description | | ||
| |--------|------|-------------| | ||
| | `self._stop_event` | `threading.Event` | Set when the main loop requests shutdown. Check `self._stop_event.is_set()` in your loop. | | ||
| | `self.ret_value` | `int` | Return code. `0` = success, `3` = health check failure. | | ||
| | `stop()` | method | Called by the factory to signal your loop to exit. Do not override — check `_stop_event` instead. | | ||
| | `get_return_value()` | method | Returns `self.ret_value`. Used by the main loop to detect failures. | | ||
| | `set_return_value(value)` | method | Sets `self.ret_value`. | | ||
|
|
||
| **Methods you must implement:** | ||
|
|
||
| | Method | Description | | ||
| |--------|-------------| | ||
| | `run_health_check(config, telemetry_queue)` | Main health check loop, runs in a background thread | | ||
| | `get_health_check_types()` | Returns list of type strings this plugin handles | | ||
| | `increment_iterations()` | Increments your iteration counter when called by the factory | | ||
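To make the contract in the two tables concrete, here is a stand-in base class built only from the members they list. It is a sketch for experimentation, not Krkn's `AbstractHealthCheckPlugin`: the class names `SketchHealthCheckPlugin` and `NoopPlugin`, the constructor body, and the use of `abc` are all assumptions.

```python
import queue
import threading
from abc import ABC, abstractmethod
from typing import Any


class SketchHealthCheckPlugin(ABC):
    """Stand-in exposing the documented members; not the real base class."""

    def __init__(self, health_check_type: str) -> None:
        self.health_check_type = health_check_type
        self._stop_event = threading.Event()
        self.ret_value = 0  # 0 = success, 3 = health check failure

    def stop(self) -> None:
        self._stop_event.set()

    def get_return_value(self) -> int:
        return self.ret_value

    def set_return_value(self, value: int) -> None:
        self.ret_value = value

    @abstractmethod
    def run_health_check(self, config: dict[str, Any], telemetry_queue: queue.Queue) -> None: ...

    @abstractmethod
    def get_health_check_types(self) -> list[str]: ...

    @abstractmethod
    def increment_iterations(self) -> None: ...


class NoopPlugin(SketchHealthCheckPlugin):
    """Smallest subclass that satisfies the three required methods."""

    def run_health_check(self, config: dict[str, Any], telemetry_queue: queue.Queue) -> None:
        telemetry_queue.put([])

    def get_health_check_types(self) -> list[str]:
        return ["noop"]

    def increment_iterations(self) -> None:
        pass
```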
|
|
||
| ### 5. Factory Auto-Discovery | ||
|
|
||
| The `HealthCheckFactory` uses `pkgutil.walk_packages` to scan the `krkn.health_checks` package at startup. Any file ending in `_health_check_plugin.py` that contains a class following the naming conventions will be automatically loaded. No registration step is needed beyond placing the file in the right directory. | ||
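The scanning step can be tried on any installed package. The sketch below uses the standard library's `json` package purely as a stand-in for `krkn.health_checks`, and a `coder` suffix in place of `_health_check_plugin`; `discover_modules` and `load_modules` are hypothetical helpers, and the real factory may differ in how it filters and loads modules.

```python
import importlib
import json  # stand-in package to scan; krkn itself is not imported here
import pkgutil
from types import ModuleType


def discover_modules(package: ModuleType, suffix: str) -> list[str]:
    """List fully qualified module names in `package` that end with `suffix`."""
    prefix = package.__name__ + "."
    return sorted(
        info.name
        for info in pkgutil.walk_packages(package.__path__, prefix)
        if info.name.endswith(suffix)
    )


def load_modules(names: list[str]) -> dict[str, ModuleType]:
    """Import each discovered module by name."""
    return {name: importlib.import_module(name) for name in names}
```

For example, `discover_modules(json, "coder")` finds the `json.decoder` and `json.encoder` modules, which `load_modules` can then import.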
|
|
||
| You can verify your plugin was loaded by checking the factory's `loaded_plugins` dict: | ||
|
|
||
| ```python | ||
| from krkn.health_checks.health_check_factory import HealthCheckFactory | ||
|
|
||
| factory = HealthCheckFactory() | ||
| print(factory.loaded_plugins.keys()) | ||
| # dict_keys(['http_health_check', 'virt_health_check', 'my_service_health_check']) | ||
| ``` | ||
|
|
||
| If your plugin fails to load (naming violation, import error, duplicate type), it will appear in `factory.failed_plugins` as a list of `(module_name, class_name, error_message)` tuples. | ||
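Given the tuple shape described above, load failures can be turned into readable log lines with a few lines of code. This is a sketch: `format_failures` is a hypothetical helper, and any tuples passed to it in examples are made-up data rather than real factory output.

```python
def format_failures(failed_plugins: list[tuple[str, str, str]]) -> list[str]:
    """Render (module_name, class_name, error_message) tuples as one line each."""
    return [
        f"{module_name}.{class_name}: {error_message}"
        for module_name, class_name, error_message in failed_plugins
    ]
```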
|
|
||
| --- | ||
|
|
||
| ## Questions? | ||
|
|
||
| For questions or guidance, reach out on the [Kubernetes Slack](https://kubernetes.slack.com/) in the `#krkn` channel. | ||