Improving operator performance #2674

simar7 · 2025-07-29T06:27:40Z

simar7
Jul 29, 2025
Collaborator

Description

This proposal suggests several tracks along which we can improve the performance of the operator today.

Author(s)

Simar (@simar7)

Motivation

Trivy operator is a fairly scalable Kubernetes controller, but there are areas where we can do better. Some of them involve improving the current business logic, while others focus on following the correct Kubernetes design patterns and idioms.

By having this document as an open discussion, we can also gather feedback from more experienced Kubernetes controller developers and operators who can help us improve the performance of the operator.

Completed optimizations

Filtering Events with Predicates

We should only reconcile on changes that matter to our business logic, thereby avoiding reconciliations on status-only or metadata-only changes.

An example of this was first implemented in this PR: #727

A more recent PR further improved the predicate logic itself: #2669
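To make the idea concrete, here is a minimal sketch of such a filter. It is not the operator's actual predicate: it uses simplified stand-in types instead of controller-runtime's event types, and the choice of "generation or labels changed" as the relevance rule is an assumption made for illustration. `objectMeta`, `updateEvent`, `sameLabels`, and `shouldReconcile` are hypothetical names.

```go
package main

import "fmt"

// objectMeta is a simplified, hypothetical stand-in for Kubernetes object metadata.
type objectMeta struct {
	Generation      int64             // typically bumped when the spec changes
	ResourceVersion string            // changes on every write, including status updates
	Labels          map[string]string // labels the controller may care about
}

// updateEvent is a hypothetical stand-in for an update event carrying both versions.
type updateEvent struct {
	Old, New objectMeta
}

// sameLabels reports whether two label maps hold identical key/value pairs.
func sameLabels(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// shouldReconcile returns true only for updates this example treats as relevant:
// a generation change or a label change. Everything else is filtered out.
func shouldReconcile(e updateEvent) bool {
	if e.Old.Generation != e.New.Generation {
		return true
	}
	return !sameLabels(e.Old.Labels, e.New.Labels)
}

func main() {
	statusOnly := updateEvent{
		Old: objectMeta{Generation: 3, ResourceVersion: "100"},
		New: objectMeta{Generation: 3, ResourceVersion: "101"},
	}
	specChange := updateEvent{
		Old: objectMeta{Generation: 3, ResourceVersion: "101"},
		New: objectMeta{Generation: 4, ResourceVersion: "102"},
	}
	fmt.Println("status-only update reconciles:", shouldReconcile(statusOnly))
	fmt.Println("spec change reconciles:", shouldReconcile(specChange))
}
```

In a real controller the same decision would live in an update predicate attached to the watch, so filtered events never reach the work queue.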

Using protobuf for k8s api (pending upstream merge)

Today the default way to communicate with the k8s API is JSON. This requires marshaling and unmarshaling JSON on every request and response. We can switch to protobuf on the wire, which would eliminate the JSON overhead. This could be especially useful when large k8s objects are involved.

One crucial thing to note is that custom resources can only be served as JSON and not protobuf. This is by Kubernetes design. There are currently no plans to add protobuf support for custom resources, and it is unlikely to happen because CRDs are loaded dynamically, unlike the core components.

There's a PR that I made to implement this here: #2676

But it's currently blocked by a bug in Kubernetes that prevents falling back to JSON for CRDs: kubernetes/kubernetes#86253

Replaced Deep Equality Checks

We should avoid reflect.DeepEqual unless it is necessary. We can replace it with more efficient hash-based checks or custom equality functions that only compare relevant fields.

One example of this is how we invoke compareReports today. Currently, we run a reflect.DeepEqual, which is quite expensive, as seen here:

trivy-operator/pkg/vulnerabilityreport/controller/helper.go

Line 66 in f6d43a8

return reflect.DeepEqual(actual, expected)

We can replace it with hash-based checks, which can still tell whether two reports differ but do so more efficiently.
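A minimal sketch of what a hash-based check could look like, assuming we hash a JSON encoding of only the fields that matter. The `report` and `vulnerability` types below are trimmed-down, hypothetical stand-ins rather than the operator's real types, `hashReport` and `unchanged` are hypothetical helpers, and the CVE IDs are made up.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// vulnerability is a hypothetical, trimmed-down finding.
type vulnerability struct {
	ID       string `json:"id"`
	Severity string `json:"severity"`
}

// report is a hypothetical, trimmed-down report.
type report struct {
	Artifact        string          `json:"artifact"`
	Vulnerabilities []vulnerability `json:"vulnerabilities"`
	// UpdatedAt is excluded from the hash (json:"-") because it changes on every scan.
	UpdatedAt string `json:"-"`
}

// hashReport returns a hex-encoded SHA-256 digest of the JSON encoding of r.
// encoding/json writes struct fields in declaration order, so equal field
// values always produce the same bytes and therefore the same digest.
func hashReport(r report) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// unchanged compares a previously stored hash with the hash of a new report,
// so only the new report has to be encoded and hashed.
func unchanged(storedHash string, next report) (bool, error) {
	h, err := hashReport(next)
	if err != nil {
		return false, err
	}
	return h == storedHash, nil
}

func main() {
	first := report{
		Artifact:        "nginx:1.25",
		Vulnerabilities: []vulnerability{{ID: "CVE-0000-0001", Severity: "HIGH"}},
		UpdatedAt:       "2025-07-29T06:00:00Z",
	}
	stored, err := hashReport(first)
	if err != nil {
		panic(err)
	}

	rescan := first
	rescan.UpdatedAt = "2025-07-29T07:00:00Z" // only the timestamp differs
	same, _ := unchanged(stored, rescan)
	fmt.Println("rescan unchanged:", same)

	changed := first
	changed.Vulnerabilities = []vulnerability{{ID: "CVE-0000-0002", Severity: "LOW"}}
	same, _ = unchanged(stored, changed)
	fmt.Println("new finding unchanged:", same)
}
```

Two caveats with this approach: the order of the vulnerabilities affects the hash, so the list would need to be sorted before hashing if scan order is not stable, and a nil slice encodes differently from an empty one. The stored hash could live in an annotation or label on the report, which is an assumption here rather than something the operator does today.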

Future Optimizations

Adding reconciliation benchmarks

While the recent PR improved performance, we still need to quantify that improvement in terms of reconciliation performance. The number of reconciliation attempts is a metric we cannot easily track within the code today, so an additional test framework will need to be written to do so.
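One possible building block for such a framework is a thin wrapper that counts how often the reconcile function is invoked. The sketch below assumes a simplified reconcile signature keyed by a string. `reconcileFunc` and `countingReconciler` are hypothetical names, not existing operator code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// reconcileFunc is a hypothetical, simplified reconcile signature.
type reconcileFunc func(key string) error

// countingReconciler wraps a reconcileFunc and records how often it runs.
type countingReconciler struct {
	calls int64 // accessed atomically; kept first for 64-bit alignment
	next  reconcileFunc
}

// Reconcile increments the counter and delegates to the wrapped function.
func (c *countingReconciler) Reconcile(key string) error {
	atomic.AddInt64(&c.calls, 1)
	return c.next(key)
}

// Calls returns the number of reconcile invocations seen so far.
func (c *countingReconciler) Calls() int64 {
	return atomic.LoadInt64(&c.calls)
}

func main() {
	c := &countingReconciler{next: func(string) error { return nil }}

	// Simulated event stream: only events marked relevant pass the filter.
	events := []struct {
		key      string
		relevant bool
	}{
		{"default/app-a", true},
		{"default/app-a", false},
		{"default/app-b", false},
		{"default/app-b", true},
		{"default/app-c", false},
	}
	for _, e := range events {
		if e.relevant {
			_ = c.Reconcile(e.key)
		}
	}
	fmt.Printf("events=%d reconciles=%d\n", len(events), c.Calls())
}
```

Running the same event stream with and without a given predicate and comparing the two counts would give a number we can track across changes.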

Exponential backoff Rate Limit Requeues

We use RequeueAfter with appropriate backoff intervals, but today they are statically defined in the configuration with ScanJobRetryAfter. We can improve this by implementing a more dynamic backoff strategy that adapts based on the error type or the number of retries.

To further improve this, we can also limit the number of retries for certain error types, especially those that are unlikely to succeed on subsequent attempts (e.g., network errors).

Use controller caches wisely

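Today we make use of client.Get in our code, which can be served from the controller's cache, but we need to double-check that all call sites actually rely on this approach rather than querying the API server directly. Reading through the cache reduces the API server load.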

We also need to be cognizant of the data that gets cached. A recent PR improves this by removing unnecessary data and binary data fields #2677
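To illustrate the general idea of trimming objects before they are cached (this is not a description of what the PR above does), here is a sketch using a simplified stand-in type. It assumes, purely for the example, that the controller only reads the name and labels of the object. `configMap` and `stripForCache` are hypothetical names.

```go
package main

import "fmt"

// configMap is a hypothetical, simplified stand-in for a cached Kubernetes object.
type configMap struct {
	Name          string
	Labels        map[string]string
	Data          map[string]string
	BinaryData    map[string][]byte
	ManagedFields []string
}

// stripForCache returns a copy that keeps only the fields the controller reads
// (assumed here to be Name and Labels) and drops the potentially large payloads.
func stripForCache(in configMap) configMap {
	return configMap{
		Name:   in.Name,
		Labels: in.Labels,
	}
}

func main() {
	full := configMap{
		Name:          "app-config",
		Labels:        map[string]string{"app": "demo"},
		Data:          map[string]string{"config.yaml": "key: value"},
		BinaryData:    map[string][]byte{"blob": make([]byte, 1<<20)},
		ManagedFields: []string{"kubectl"},
	}
	slim := stripForCache(full)
	fmt.Println("name:", slim.Name)
	fmt.Println("data entries kept:", len(slim.Data))
	fmt.Println("binary entries kept:", len(slim.BinaryData))
	fmt.Println("managed fields kept:", len(slim.ManagedFields))
}
```

The trade-off is that any code path that later needs a stripped field has to fetch the full object from the API server, so the list of kept fields must match what the controllers actually read.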

Use Field Indexing

As part of the previous point, we should also ensure that we are using field indexing in our controllers. This will allow us to quickly retrieve objects based on specific fields without having to list all objects and filter them in memory.

One use case for this approach is retrieving the VulnerabilityReport objects for a given deployment. When such a vulnerability report is created, we can also save a custom field such as .metadata.owner.uid, which could be the key that we index on. Later on, we can use this index to quickly retrieve all vulnerability reports for a given deployment without having to list all reports and filter them in memory.
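The difference between an indexed lookup and list-then-filter can be sketched with plain maps. In a real controller this would go through controller-runtime's field indexer (registering an index function up front and listing with a matching-fields option). The code below is only an in-memory illustration of the concept, with hypothetical `vulnReport` and `ownerIndex` types.

```go
package main

import "fmt"

// vulnReport is a hypothetical, minimal stand-in for a VulnerabilityReport.
type vulnReport struct {
	Name     string
	OwnerUID string // UID of the owning workload, used as the index key
}

// ownerIndex maps an owner UID to the names of the reports it owns.
type ownerIndex map[string][]string

// add registers a report under its owner UID. It runs once per stored report.
func (idx ownerIndex) add(r vulnReport) {
	idx[r.OwnerUID] = append(idx[r.OwnerUID], r.Name)
}

// reportsFor returns the report names for one owner via a single map lookup.
func (idx ownerIndex) reportsFor(uid string) []string {
	return idx[uid]
}

// filterByOwner is the list-then-filter alternative: it scans every report.
func filterByOwner(all []vulnReport, uid string) []string {
	var names []string
	for _, r := range all {
		if r.OwnerUID == uid {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	all := []vulnReport{
		{Name: "replicaset-web-nginx", OwnerUID: "uid-web"},
		{Name: "replicaset-web-sidecar", OwnerUID: "uid-web"},
		{Name: "replicaset-api-server", OwnerUID: "uid-api"},
	}

	idx := ownerIndex{}
	for _, r := range all {
		idx.add(r)
	}

	fmt.Println("indexed lookup:", idx.reportsFor("uid-web"))
	fmt.Println("full scan:     ", filterByOwner(all, "uid-web"))
}
```

Both calls return the same names; the indexed version does the grouping once at insert time instead of on every query, which is what matters when the number of reports is large.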

Isolate Business Logic

We can refactor some of the core business logic components of the controllers today into a new package or library. This would allow us not only to reuse the code but also to test and benchmark it independently of the controller logic. Currently, there's tight coupling between the business logic and the controller logic, which makes it hard to test and benchmark the business logic independently.
