Habana AI Operator

References

Habana Gaudi docs
Kernel Module Management (KMM) operator (KMMO) repository

Summary

The Habana AI Operator fullfils the goal to seamlessly enable Habana AI accelerators on Kubernetes and OpenShift. It provides an opinionated and extendable API. It offloads the driver container and the device plugin to KMMO. It follows the software engineering practices, leading to great maintainability, reliability and development velocity.

User Stories

User Story 1

As a user I want to enable the Habana AI hardware accelerators on Kubernetes/OpenShift on a group of nodes of my choice, with the driver version and configuration of my choice. But I want to have as minimum configuration options as possible.

Design Details

The following sections describe the design decisions and trade-offs of each implementation detail of the Habana AI Operator.

API

The Habana AI Operator starts by designing an extremely lean API that trades configuration flexibility with reliability, as the operator takes ownership of setting up and lifecycling all required components with minimum user input.

This trade-off leads to:

easy and robust dependency management
simple API
seamless user experience
easier extendability
small and focused codebase, as all flows are highly opinionated

DeviceConfig

The DeviceConfig is the main Custom Resource Definition (CRD) of the Habana AI Operator.

DeviceConfigSpec

Field	Description	Scheme	Required
DriverImage	The Habana Labs driver image to use	string	true
DriverVersion	The Habana Labs Driver version to use	string	true
NodeSelector	Specifies the node selector to be used for this DeviceConfig	map[string]string	false

The DeviceConfig specification has the following goals:

support multiple DeviceConfigs on a cluster, each one targeting a unique group of nodes via a NodeSelector
each DeviceConfig can have a different driver configuration
the DeviceConfig should accept minimum user input

Node Selector Validation

The Habana AI Operator supports multiple DeviceConfigs with different driver configurations on the same cluster. This is implemented by including a node selector in its specification. But a node can only have one driver configuration, as it cannot have more than one kernel modules owning the same device. The operator therefore, needs to validate each DeviceConfig applied by the user, in order to verify that its node selector does not include a node that is already part of another DeviceConfig node selector.

To validate the uniqueness of node selectors among the DeviceConfigs, when a new DeviceConfig is created, the following validation is performed:

Kernel Module Management (KMM) Operator Integration

The Habana AI Operator integrates with KMM to offload the management of the Habana Labs driver container and the Habana Labs device plugin on a Kubernetes or OpenShift cluser. This integration helps the Habana AI Operator focus on its user experience and features, while gaining from the KMM features and reducing its own codebase.

Unit Testing

The current test coverage is above 70%, with the most critical parts of the operator already covered. The frameworks and mocking tools adopted are described below.

Frameworks

The Habana AI Operator unit tests are written in:

Their integration in kubebuilder, huge adoption in the Kubernetes operator ecosystem and active development, make them a robust choice.

Mocks

Mocking internal and external packages increases the testability of the operator. In Go one should not look further than:

Using go generate and mockgen all mocks can be automatically generated.

Linting

Linting helps to not accumulate technical debt and keep a consistent codebase. The Habana AI Operator leverages golangci-lint with its default linters enabled.

Future Work

Conditions

DeviceConfigStatus conditions and thrive to adhere to the respective suggestions of the Kubernetes community. There are currently 2 conditions:

Ready
Errored

But while these 2 conditions are all we need, they currently only track the result of the creation or patching of the managed CRs and not their actual status. As a future enhancement, the DeviceConfig controller will watch changes in the status of the managed CRs, e.g. the KMM Module, in order to update the DeviceConfig's conditions with the respective reasons.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Habana AI Operator

References

Summary

User Stories

User Story 1

Design Details

API

DeviceConfig

DeviceConfigSpec

Node Selector Validation

Kernel Module Management (KMM) Operator Integration

Unit Testing

Frameworks

Mocks

Linting

Future Work

Conditions

FilesExpand file tree

design.md

Latest commit

History

design.md

File metadata and controls

Habana AI Operator

References

Summary

User Stories

User Story 1

Design Details

API

DeviceConfig

DeviceConfigSpec

Node Selector Validation

Kernel Module Management (KMM) Operator Integration

Unit Testing

Frameworks

Mocks

Linting

Future Work

Conditions