Skip to content

fix(kustomize,helm): Align Helm and Kustomize RBAC, args, and sideEffects via code tracing#2936

Merged
google-oss-prow[bot] merged 2 commits into
kubeflow:masterfrom
RobuRishabh:fix-kustomize-drift
May 15, 2026
Merged

fix(kustomize,helm): Align Helm and Kustomize RBAC, args, and sideEffects via code tracing#2936
google-oss-prow[bot] merged 2 commits into
kubeflow:masterfrom
RobuRishabh:fix-kustomize-drift

Conversation

@RobuRishabh
Copy link
Copy Markdown
Contributor

@RobuRishabh RobuRishabh commented May 11, 2026

Purpose of this PR

Follow-up to #2915 where the review called out: "We need to file a follow-up issue to fix the Helm chart templates too, if required to keep Helm chart in proper alignment with Kustomize." and it fixes all the drifts occurring in #2933 and eliminates all semantic drift between Helm and Kustomize.

Proposed changes:

  • Remove over-provisioned webhook ClusterRole permissions (events, resourcequotas, finalizers/status, extra get verb on pods) — code tracing proves zero usage in the webhook binary
  • Add missing pods (list/watch) and scheduledsparkapplications (get/list/watch) to Helm webhook policyRules — the controller-runtime cache unconditionally needs these (Helm bug)
  • Add events (create/update/patch) to Kustomize controller leader-election Role — controller has 3 EventRecorders and 19 recorder.Eventf calls
  • Add explicit --kube-api-qps=20, --kube-api-burst=30, --feature-gates, --scheduled-spark-application-timestamp-precision=nanos to Kustomize controller deployment
  • Add explicit --kube-api-qps=20, --kube-api-burst=30 to Kustomize webhook deployment
  • Fix Helm webhook sideEffects from NoneOnDryRunNone to match all 7 kubebuilder markers (completes fix(webhook): Fix path for mutating ScheduledSparkApplication, correct sideEffects, remove update verb in manifests #2915 Helm-side)

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

The Kustomize manifests had accumulated drift from the Helm chart over time. Rather than blindly aligning to Helm, each permission was code-traced to determine what's genuinely required:

Decision Evidence
Remove events from webhook Zero EventRecorder/Eventf in internal/webhook/ or cmd/operator/webhook/
Remove resourcequotas from webhook Only used when --enable-resource-quota-enforcement=true (defaults to false)
Add pods + scheduledsparkapplications to Helm cache.ByObject in cmd/operator/webhook/start.go:396-403
Keep controller events 19 recorder.Eventf calls in internal/controller/sparkapplication/controller.go
Scope leases by resourceNames Only touch your own lock, not all leases in namespace
sideEffects: None All 7 +kubebuilder:webhook markers declare sideEffects=None

Result: 14 semantic checks pass with zero drift between Helm and Kustomize rendered output.

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

  • The drift detection pipeline (kustomize-drift-check branch) will validate this PR passes once both branches are merged

Copilot AI review requested due to automatic review settings May 11, 2026 05:09
@google-oss-prow google-oss-prow Bot requested review from ImpSy and nabuskey May 11, 2026 05:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to eliminate semantic drift between the Helm chart and Kustomize manifests for the spark-operator controller/webhook by aligning RBAC permissions, deployment args (notably API client throttling and feature flags), and webhook sideEffects settings based on traced runtime requirements.

Changes:

  • Adjust webhook/controller RBAC rules to remove unused permissions and add missing cache-required permissions.
  • Make Kustomize controller/webhook deployments explicit about API client throttling (QPS/burst) and controller feature flags/timestamp precision.
  • Align Helm webhook configurations to sideEffects: None (matching kubebuilder markers).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
config/webhook/deployment.yaml Adds explicit --kube-api-qps / --kube-api-burst args to the Kustomize webhook deployment.
config/webhook/clusterrole.yaml Removes webhook ClusterRole permissions deemed unused (e.g., events/resourcequotas/status/finalizers).
config/rbac/leader-election-role.yaml Adds events create/update/patch to the Kustomize controller leader-election Role.
config/manager/manager.yaml Adds explicit controller args for QPS/burst, feature gates, and ScheduledSparkApplication timestamp precision.
charts/spark-operator-chart/templates/webhook/validatingwebhookconfiguration.yaml Changes validating webhooks sideEffects to None.
charts/spark-operator-chart/templates/webhook/mutatingwebhookconfiguration.yaml Changes mutating webhooks sideEffects to None.
charts/spark-operator-chart/templates/webhook/rbac.yaml Removes events permissions from the Helm webhook ClusterRole.
charts/spark-operator-chart/templates/webhook/_helpers.tpl Updates Helm webhook policyRules to include pods watch/list and scheduledsparkapplications access.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +138 to 139
- watch
- apiGroups:
Comment on lines 12 to 13
resources:
- pods
@RobuRishabh RobuRishabh force-pushed the fix-kustomize-drift branch 2 times, most recently from 7ed2b2c to fdf98de Compare May 11, 2026 14:56
@google-oss-prow google-oss-prow Bot added size/L and removed size/M labels May 11, 2026
…ects via code tracing

Remove over-provisioned webhook ClusterRole permissions (events,
resourcequotas, finalizers/status, extra verbs) that have zero code
usage in the webhook binary. Add missing pods and
scheduledsparkapplications to Helm policyRules that the
controller-runtime cache requires.
Add events permission to Kustomize controller leader-election Role
(controller has 19 recorder.Eventf calls). Add explicit --kube-api-qps,
--kube-api-burst, --feature-gates, and --timestamp-precision flags to
Kustomize deployments matching Helm defaults.
Fix Helm webhook sideEffects from NoneOnDryRun to None matching all 7
kubebuilder markers (follow-up PR#2915)

Signed-off-by: roburishabh <[email protected]>
@RobuRishabh RobuRishabh force-pushed the fix-kustomize-drift branch from fdf98de to 642a82e Compare May 11, 2026 15:02
- watch
- apiGroups:
- ""
resources:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removes resourcequotas access, but the webhook still needs it. SparkApplicationValidator.validateResourceUsage() in internal/webhook/sparkapplication_validator.go:192 calls v.client.List(ctx, resourceQuotaList, ...) to list ResourceQuotas when --enable-resource-quota-enforcement is enabled. Removing this RBAC rule will cause 403 errors for anyone using that flag.
The flag defaults to false so this won't be caught by CI or default installs — it'll only break at runtime for users who opted in.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @alinaryan. So the removal here was intentional. Kustomize's config/webhook/deployment.yaml doesn't expose --enable-resource-quota-enforcement at all, the flag is always at its default (false), so the webhook will never call List on ResourceQuota. But, I understand your point, replacing resourcequotas with pods instead of adding pods alongside it will cause this 403 error. I removed it to make it aligned with Kustomize, because If Helm has resourcequotas in the ClusterRole but Kustomize doesn't, that's a drift item that the drift detection CI gate (PR #2933) will flag. so what I can do, I will add resourcequotas: list to both Helm policyRules AND Kustomize ClusterRole, zero drift, and both deployment methods support the feature if a user enables it. Thanks for catching it.

Comment on lines +87 to +90
- --kube-api-qps=20
- --kube-api-burst=30
- --feature-gates=PartialRestart=false,LoadSparkDefaults=false
- --scheduled-spark-application-timestamp-precision=nanos
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values (--kube-api-qps=20, --kube-api-burst=30) override the Go client defaults and force every downstream user to either accept them or patch around them. Same for --feature-gates=PartialRestart=false,LoadSparkDefaults=false. I believe these are already the defaults. Consider moving these to an overlay or removing them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree here all four values match the Go defaults and adding them doesn't change runtime behavior. The reason they're here: Helm's values.yaml renders all four as explicit args in every deployment. Without them in Kustomize, the drift detection CI gate (PR #2933) flags them as missing, Helm has them, Kustomize doesn't.

So its a tradeoff:

  • Keeping them: Zero drift, args serve as documentation of the operating parameters, and the drift test passes without special-case exclusions.
  • Removing them: Fewer args to maintain, but re-introduces drift items that the CI gate will flag, requiring either continue-on-error in the workflow or normalization logic in the test

I'd prefer to keep them for consistency with Helm, but I understand the maintenance concern. What do you prefer ?

Copy link
Copy Markdown
Contributor

@nabuskey nabuskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@google-oss-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit 0e61185 into kubeflow:master May 15, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants