Improving operator performance #2674
simar7 started this conversation in Development
Replies: 0 comments
Description
This proposal outlines several areas in which we can improve the performance of the operator.
Author(s)
Motivation
Trivy operator is a fairly scalable Kubernetes controller but there are avenues where we can do better. Some of them involve improving the current business logic, while others focus on following the correct Kubernetes design patterns and idioms.
By having this document as an open discussion, we can also gather feedback from more experienced Kubernetes controller developers and operators who can help us improve the performance of the operator.
Completed optimizations
Filtering Events with Predicates
We should only reconcile on changes that matter to us in the business logic, thereby avoiding reconciliations on status-only or metadata-only changes.
An example of this was first implemented in this PR: #727
A more recent PR, #2669, further improved the predicate logic itself.
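The filtering idea can be sketched without any Kubernetes dependencies. The types and the `shouldReconcile` function below are hypothetical stand-ins, not the code from #727 or #2669, and the sketch assumes the watched resource bumps `metadata.generation` only when its spec changes. controller-runtime ships `predicate.GenerationChangedPredicate`, which applies the same kind of check to real update events.

```go
package main

import "fmt"

// objectMeta is a hypothetical stand-in for the metadata fields that matter
// here; a real controller would read them from the watched object.
type objectMeta struct {
	Generation      int64  // assumed to change only when the spec changes
	ResourceVersion string // changes on every write, including status updates
}

// updateEvent mirrors the shape of an update event: the object before and
// after the change.
type updateEvent struct {
	Old, New objectMeta
}

// shouldReconcile reports whether an update is worth a reconcile. Updates
// that leave the generation untouched (status-only or metadata-only writes)
// are filtered out.
func shouldReconcile(e updateEvent) bool {
	return e.Old.Generation != e.New.Generation
}

func main() {
	statusOnly := updateEvent{
		Old: objectMeta{Generation: 3, ResourceVersion: "100"},
		New: objectMeta{Generation: 3, ResourceVersion: "101"},
	}
	specChange := updateEvent{
		Old: objectMeta{Generation: 3, ResourceVersion: "101"},
		New: objectMeta{Generation: 4, ResourceVersion: "102"},
	}
	fmt.Println("status-only update reconciles:", shouldReconcile(statusOnly))
	fmt.Println("spec change reconciles:", shouldReconcile(specChange))
}
```

Whether generation is the right signal depends on the resource; some controllers may also need to react to label or annotation changes, which this sketch deliberately ignores.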
Using protobuf for k8s api (pending upstream merge)
Today, the default way to communicate with the k8s API is JSON. This requires marshaling and unmarshaling JSON on every request and response. We can switch to protobuf on the wire, which would eliminate the JSON overhead. This could be especially useful when large k8s objects are involved.
One crucial thing to note is that CRDs support only JSON, not protobuf. This is by Kubernetes design. There are currently no plans to add protobuf support for custom resources, and there likely won't be, since CRDs are loaded dynamically, unlike the core components.
There's a PR that I made to implement this here: #2676
However, it is currently blocked by a Kubernetes bug that prevents falling back to JSON for CRDs: kubernetes/kubernetes#86253
Replaced Deep Equality Checks
We should avoid reflect.DeepEqual unless it is necessary. It can be replaced with more efficient hash-based checks or with custom equality functions that compare only the relevant fields.
One example is how we invoke compareReports today. It currently runs reflect.DeepEqual, which is quite expensive, as seen here:
trivy-operator/pkg/vulnerabilityreport/controller/helper.go
Line 66 in f6d43a8
We can replace it with a hash-based check, which still verifies whether two reports match but does so more efficiently.
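A minimal sketch of a hash-based comparison, assuming a trimmed-down report shape: the `report` and `vulnerability` types, `hashReport`, and `reportsEqual` are hypothetical and do not mirror the operator's actual types or its compareReports implementation. Only the fields considered relevant are serialized and hashed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// vulnerability and report are hypothetical stand-ins for the operator's
// report types; only fields relevant to the comparison are serialized.
type vulnerability struct {
	ID       string `json:"id"`
	Severity string `json:"severity"`
}

type report struct {
	Artifact        string          `json:"artifact"`
	Vulnerabilities []vulnerability `json:"vulnerabilities"`
	// UpdateTimestamp is excluded from the JSON form, and therefore from the
	// hash, so a rescan with identical findings does not count as a change.
	UpdateTimestamp string `json:"-"`
}

// hashReport returns a hex-encoded SHA-256 digest of the report's JSON form.
func hashReport(r report) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// reportsEqual compares two reports by digest instead of walking both
// structures with reflection.
func reportsEqual(a, b report) (bool, error) {
	ha, err := hashReport(a)
	if err != nil {
		return false, err
	}
	hb, err := hashReport(b)
	if err != nil {
		return false, err
	}
	return ha == hb, nil
}

func main() {
	// The vulnerability ID is a placeholder, not a real CVE.
	old := report{
		Artifact:        "example/image:1.0",
		Vulnerabilities: []vulnerability{{ID: "CVE-EXAMPLE-1", Severity: "HIGH"}},
		UpdateTimestamp: "first-scan",
	}
	rescanned := old
	rescanned.UpdateTimestamp = "second-scan"

	same, err := reportsEqual(old, rescanned)
	if err != nil {
		panic(err)
	}
	fmt.Println("reports equal:", same)
}
```

Hashing both sides on every comparison is not automatically cheaper than reflection. The saving is likely to come from computing the digest once and storing it alongside the report, so that later comparisons reduce to a string compare. Note also that this digest is sensitive to the order of the vulnerabilities slice.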
Future Optimizations
Adding reconciliation benchmarks
The recent PR improved performance, but we still need to quantify the improvement in terms of reconciliation performance. The number of reconciliation attempts is a metric we cannot easily track within the code today, so additional test tooling will need to be written to measure it.
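One possible shape for such tooling is a harness that replays a recorded sequence of events through a filter and counts how many would reach the reconciler. Everything below (`event`, `countReconciles`, the two filters) is hypothetical and only illustrates the measurement, not an existing framework in the operator.

```go
package main

import "fmt"

// event is a hypothetical update event; SpecChanged marks the updates that
// the business logic actually cares about.
type event struct {
	Key         string
	SpecChanged bool
}

// countReconciles replays events through filter and returns how many of
// them would have triggered a reconcile.
func countReconciles(events []event, filter func(event) bool) int {
	n := 0
	for _, e := range events {
		if filter(e) {
			n++
		}
	}
	return n
}

// acceptAll models a controller with no event filtering.
func acceptAll(event) bool { return true }

// specChangesOnly models a controller that filters out status-only updates.
func specChangesOnly(e event) bool { return e.SpecChanged }

func main() {
	events := []event{
		{Key: "default/app", SpecChanged: true},
		{Key: "default/app", SpecChanged: false},
		{Key: "default/app", SpecChanged: false},
		{Key: "default/db", SpecChanged: false},
	}
	fmt.Println("reconciles without filtering:", countReconciles(events, acceptAll))
	fmt.Println("reconciles with filtering:", countReconciles(events, specChangesOnly))
}
```

Comparing the two counts for the same event trace gives a number that can be tracked across changes, independent of wall-clock timing.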
Exponential backoff Rate Limit Requeues
We use RequeueAfter with backoff intervals, but today they are statically defined in the configuration through ScanJobRetryAfter. We can improve this by implementing a more dynamic backoff strategy that adapts to the error type or the number of retries.
To further improve this, we can also limit the number of retries for certain error types, especially those that are unlikely to succeed on subsequent attempts (e.g., network errors).
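A minimal sketch of such a strategy, assuming a doubling delay with a cap and a retry limit. The constants, the sentinel errors, and `nextRequeue` are all hypothetical; which error types count as non-retryable is a policy decision this sketch does not settle, and jitter is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors: errPermanent marks failures that a retry is
// not expected to fix, errTransient marks failures worth retrying.
var (
	errPermanent = errors.New("permanent failure")
	errTransient = errors.New("transient failure")
)

// Assumed tuning values, chosen only for illustration.
const (
	baseDelay  = 1 * time.Second
	maxDelay   = 1 * time.Minute
	maxRetries = 10
)

// nextRequeue returns the delay before the next attempt and whether a retry
// should happen at all. attempt is the number of retries already made.
func nextRequeue(err error, attempt int) (time.Duration, bool) {
	if errors.Is(err, errPermanent) || attempt >= maxRetries {
		return 0, false
	}
	delay := baseDelay
	// Double the delay once per previous attempt, stopping at the cap.
	for i := 0; i < attempt && delay < maxDelay; i++ {
		delay *= 2
	}
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay, true
}

func main() {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		delay, retry := nextRequeue(errTransient, attempt)
		fmt.Printf("attempt %d: retry=%v delay=%v\n", attempt, retry, delay)
	}
	_, retry := nextRequeue(errPermanent, 0)
	fmt.Println("permanent error retried:", retry)
}
```

In a controller, the returned delay would feed the RequeueAfter value, and the attempt count would need to be tracked per object, for example in its status or an annotation.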
Use controller caches wisely
Today we use client.Get in our code, which should be served from the controller cache, but we need to double-check that all call sites do indeed rely on this approach. Reading from the cache rather than querying the API server directly reduces API server load.
We also need to be cognizant of the data that gets cached. A recent PR, #2677, improves this by removing unnecessary data and binary data fields.
Use Field Indexing
As part of the previous point, we should also ensure that we are using field indexing in our controllers. This will allow us to quickly retrieve objects based on specific fields without having to list all objects and filter them in memory.
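The difference between list-and-filter and an indexed lookup can be sketched with plain maps. The `reportRef` type, the owner UID key, and both helper functions are hypothetical and stand in for cached objects. In controller-runtime, the corresponding pieces are registering an index with IndexField on the manager's field indexer and querying it with client.MatchingFields on a List call.

```go
package main

import "fmt"

// reportRef is a hypothetical, minimal view of a cached report: its name
// and the UID of the workload that owns it.
type reportRef struct {
	Name     string
	OwnerUID string
}

// listAndFilter scans every report and keeps the ones owned by ownerUID.
// The cost grows with the total number of reports on every call.
func listAndFilter(all []reportRef, ownerUID string) []string {
	var names []string
	for _, r := range all {
		if r.OwnerUID == ownerUID {
			names = append(names, r.Name)
		}
	}
	return names
}

// ownerIndex maps an owner UID to the names of the reports it owns.
type ownerIndex map[string][]string

// buildOwnerIndex walks the reports once; afterwards each lookup is a
// single map access.
func buildOwnerIndex(all []reportRef) ownerIndex {
	idx := make(ownerIndex)
	for _, r := range all {
		idx[r.OwnerUID] = append(idx[r.OwnerUID], r.Name)
	}
	return idx
}

func main() {
	reports := []reportRef{
		{Name: "report-1", OwnerUID: "uid-a"},
		{Name: "report-2", OwnerUID: "uid-b"},
		{Name: "report-3", OwnerUID: "uid-a"},
	}
	idx := buildOwnerIndex(reports)
	fmt.Println("scan: ", listAndFilter(reports, "uid-a"))
	fmt.Println("index:", idx["uid-a"])
}
```

A real index has to be kept in sync as objects are added, updated, and deleted; the cache-backed indexer handles that, whereas this sketch builds the map once.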
One use case for this approach is retrieving VulnerabilityReports objects for a given deployment. When such a vulnerability report is created, we can also save a custom field such as .metadata.owner.uid, which could serve as the key that we index on. Later, we can use this index to quickly retrieve all vulnerability reports for a given deployment without having to list all reports and filter them in memory.
Isolate Business Logic
We can refactor some of the core business logic components of the controllers today into a new package or library. This would allow us not only to reuse the code but also to test and benchmark it independently of the controller logic. Currently, there's tight coupling between the business logic and the controller logic, which makes it hard to test and benchmark the business logic independently.