- Habana Gaudi docs
- Kernel Module Management (KMM) operator (KMMO) repository
The Habana AI Operator seamlessly enables Habana AI accelerators on Kubernetes and OpenShift. It provides an opinionated and extensible API, and it offloads the driver container and the device plugin to KMMO. It follows established software engineering practices, which improves maintainability, reliability and development velocity.
As a user, I want to enable the Habana AI hardware accelerators on Kubernetes/OpenShift on a group of nodes of my choice, with the driver version and configuration of my choice, while having as few configuration options as possible.
The following sections describe the design decisions and trade-offs of each implementation detail of the Habana AI Operator.
The Habana AI Operator starts from an extremely lean API that trades configuration flexibility for reliability: the operator takes ownership of setting up and managing the lifecycle of all required components with minimal user input.
This trade-off leads to:
- easy and robust dependency management
- simple API
- seamless user experience
- easier extensibility
- small and focused codebase, as all flows are highly opinionated
The DeviceConfig is the main Custom Resource Definition (CRD) of the Habana AI Operator.
| Field | Description | Scheme | Required |
|---|---|---|---|
| DriverImage | The Habana Labs driver image to use | string | true |
| DriverVersion | The Habana Labs Driver version to use | string | true |
| NodeSelector | Specifies the node selector to be used for this DeviceConfig | map[string]string | false |
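As a rough illustration of the table above, the specification could be modelled as a plain Go struct. This is a minimal sketch, not the operator's actual type definition: the JSON field names, the `Validate` helper, and the image, version and label values are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// DeviceConfigSpec mirrors the three fields listed in the table.
// The JSON names are assumed; the real CRD may serialize them differently.
type DeviceConfigSpec struct {
	DriverImage   string            `json:"driverImage"`
	DriverVersion string            `json:"driverVersion"`
	NodeSelector  map[string]string `json:"nodeSelector,omitempty"`
}

// Validate is a hypothetical helper that enforces the "Required" column.
func (s DeviceConfigSpec) Validate() error {
	if s.DriverImage == "" {
		return errors.New("driverImage is required")
	}
	if s.DriverVersion == "" {
		return errors.New("driverVersion is required")
	}
	return nil
}

func main() {
	spec := DeviceConfigSpec{
		DriverImage:   "registry.example.com/habana-driver",               // hypothetical image
		DriverVersion: "1.0.0",                                            // hypothetical version
		NodeSelector:  map[string]string{"example.com/accelerator": "hl"}, // hypothetical label
	}
	if err := spec.Validate(); err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out))
}
```

Leaving `NodeSelector` optional keeps the required user input down to the two driver fields.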
The DeviceConfig specification has the following goals:
- support multiple `DeviceConfigs` on a cluster, each one targeting a unique group of nodes via a `NodeSelector`
- each `DeviceConfig` can have a different driver configuration
- the `DeviceConfig` should accept minimum user input
The Habana AI Operator supports multiple DeviceConfigs with different driver configurations on
the same cluster. This is implemented by including a node selector in the DeviceConfig
specification. However, a node can only have one driver configuration, as it cannot have more than
one kernel module owning the same device. The operator therefore needs to validate each
DeviceConfig applied by the user, to verify that its node selector does not match a node that is
already matched by the node selector of another DeviceConfig.
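The overlap constraint can be illustrated with a small, self-contained check. This is a sketch under stated assumptions, not the operator's validation code: nodes are represented as a map from node name to labels, selectors use exact-match semantics as in a Kubernetes `nodeSelector`, and the function names `selects` and `conflictingNodes` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// selects reports whether every key/value pair in selector is present in labels.
// An empty selector selects every node, as with a Kubernetes nodeSelector.
func selects(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// conflictingNodes returns the sorted names of nodes that are matched by the
// candidate selector and by at least one of the existing selectors.
func conflictingNodes(candidate map[string]string, existing []map[string]string,
	nodes map[string]map[string]string) []string {
	var conflicts []string
	for name, labels := range nodes {
		if !selects(candidate, labels) {
			continue
		}
		for _, sel := range existing {
			if selects(sel, labels) {
				conflicts = append(conflicts, name)
				break
			}
		}
	}
	sort.Strings(conflicts)
	return conflicts
}

func main() {
	// Hypothetical cluster state and selectors.
	nodes := map[string]map[string]string{
		"node-a": {"pool": "training"},
		"node-b": {"pool": "inference"},
	}
	existing := []map[string]string{{"pool": "training"}}

	fmt.Println(conflictingNodes(map[string]string{"pool": "training"}, existing, nodes))
	fmt.Println(conflictingNodes(map[string]string{"pool": "inference"}, existing, nodes))
}
```

A candidate DeviceConfig would be rejected whenever the returned slice is non-empty. Note that this compares selectors against the nodes currently in the cluster, so a node labelled later could still end up matched twice.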
To validate the uniqueness of node selectors among the DeviceConfigs, when a new DeviceConfig is
created, the following validation is performed:
The Habana AI Operator integrates with KMM to offload the management of the Habana Labs driver container and the Habana Labs device plugin on a Kubernetes or OpenShift cluster. This integration lets the Habana AI Operator focus on its user experience and features, while benefiting from the KMM features and reducing its own codebase.
The current test coverage is above 70%, with the most critical parts of the operator already
covered. The frameworks and mocking tools adopted are described below.
The Habana AI Operator unit tests are written in:
Their integration in kubebuilder, wide adoption in the Kubernetes operator ecosystem, and active development make them a robust choice.
Mocking internal and external packages increases the testability of the operator. In Go one should not look further than:
Using `go generate` and `mockgen`, all mocks can be generated automatically.
Linting helps avoid accumulating technical debt and keeps the codebase consistent. The Habana AI Operator leverages golangci-lint with its default linters enabled.
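To make the mocking workflow concrete, here is a minimal sketch of how an interface might be annotated for `mockgen`. The `ModuleClient` interface, the `ensureModule` function and the file names in the directive are hypothetical. A hand-written fake stands in for the generated mock so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// Running `go generate ./...` would invoke mockgen in source mode for this file.
// The file names below are placeholders for this sketch.
//go:generate mockgen -source=main.go -destination=mock_main.go -package=main

// ModuleClient is a hypothetical dependency of a controller.
type ModuleClient interface {
	Apply(name string) error
}

// errApplyFailed is a sample error used to exercise the failure path.
var errApplyFailed = errors.New("apply failed")

// fakeModuleClient is a hand-written stand-in for a generated mock.
// It records calls and returns a preconfigured error.
type fakeModuleClient struct {
	err   error
	calls []string
}

func (f *fakeModuleClient) Apply(name string) error {
	f.calls = append(f.calls, name)
	return f.err
}

// ensureModule is the hypothetical code under test. It depends only on the
// interface, so a mock or fake can replace the real client in unit tests.
func ensureModule(c ModuleClient, name string) error {
	if err := c.Apply(name); err != nil {
		return fmt.Errorf("applying module %q: %w", name, err)
	}
	return nil
}

func main() {
	ok := &fakeModuleClient{}
	fmt.Println(ensureModule(ok, "example-module"), ok.calls)

	failing := &fakeModuleClient{err: errApplyFailed}
	fmt.Println(ensureModule(failing, "example-module"))
}
```

The point of the design is that the code under test accepts an interface, so the generated mock can be swapped in without touching production code.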
The operator reports its state through DeviceConfigStatus conditions and strives to adhere to the
respective suggestions
of the Kubernetes community. There are currently 2 conditions:
- Ready
- Errored
But while these 2 conditions are all we need, they currently only track the result of the creation
or patching of the managed CRs and not their actual status. As a future enhancement, the
DeviceConfig controller will watch changes in the status of the managed CRs, e.g. the KMM
Module, in order to update the DeviceConfig's conditions with the respective reasons.
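The current behaviour, where the conditions reflect only the outcome of creating or patching the managed CRs, can be sketched as follows. This is an illustration under assumptions rather than the controller's code: the `Condition` struct is a simplified stand-in for `metav1.Condition`, and the reason strings `Applied` and `ApplyFailed` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

const (
	ConditionReady   = "Ready"
	ConditionErrored = "Errored"
)

// errDenied is a sample error used to exercise the failure path.
var errDenied = errors.New("patching Module: denied")

// setCondition replaces the condition of the same type, or appends it.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// conditionsFor derives both conditions from the result of applying the
// managed CRs. It does not inspect the status of those CRs.
func conditionsFor(conds []Condition, applyErr error) []Condition {
	if applyErr != nil {
		msg := applyErr.Error()
		conds = setCondition(conds, Condition{ConditionReady, "False", "ApplyFailed", msg})
		return setCondition(conds, Condition{ConditionErrored, "True", "ApplyFailed", msg})
	}
	conds = setCondition(conds, Condition{ConditionReady, "True", "Applied", ""})
	return setCondition(conds, Condition{ConditionErrored, "False", "Applied", ""})
}

func main() {
	conds := conditionsFor(nil, nil)
	conds = conditionsFor(conds, errDenied)
	for _, c := range conds {
		fmt.Printf("%s=%s reason=%s message=%q\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

In a real controller, `metav1.Condition` and `meta.SetStatusCondition` from `k8s.io/apimachinery` would likely take the place of the hand-rolled type and helper. The future enhancement described above would add a second input to `conditionsFor`: the observed status of the managed CRs.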


