Configure options on add_kubernetes_metadata to wait for processor initialization #50509

Open
khushijain21 wants to merge 15 commits into elastic:main from khushijain21:fix-async-kube

Conversation

@khushijain21
Contributor

@khushijain21 khushijain21 commented May 6, 2026

Proposed commit message

This PR adds two new configuration options
wait_for_processor_ready and wait_for_processor_ready_timeout. These allow users to configure whether processor initialization occurs synchronously or asynchronously; each mode has its own benefits and downsides, as listed in the docs. We set wait_for_processor_ready to false for now, until it can be configured from elastic-agent.
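For illustration, a Beat configuration enabling the new options might look like the sketch below. The option names follow this PR; the timeout value shown matches the 30s default discussed later in the review.

```yaml
processors:
  - add_kubernetes_metadata:
      # Block the event pipeline until the processor has initialized.
      # Defaults to false in this PR.
      wait_for_processor_ready: true
      # How long to wait before logging a warning and continuing.
      wait_for_processor_ready_timeout: 30s
```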

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

Disruptive User Impact

None, since wait_for_processor_ready still defaults to false. We can choose to enable it once it is configurable from inside elastic-agent.

How to test this PR locally

Related issues

@botelastic botelastic Bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 6, 2026
@github-actions
Contributor

github-actions Bot commented May 6, 2026

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)
  • /test : Run the Buildkite pipeline.

@mergify
Contributor

mergify Bot commented May 6, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @khushijain21? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch (\d is the version digit)
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@khushijain21 khushijain21 added backport-active-all Automated backport with mergify to all the active branches Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 6, 2026
@botelastic botelastic Bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 6, 2026
@khushijain21 khushijain21 marked this pull request as ready for review May 6, 2026 08:39
@khushijain21 khushijain21 requested a review from a team as a code owner May 6, 2026 08:39
@khushijain21 khushijain21 requested review from leehinman and mauri870 May 6, 2026 08:39
@infra-vault-gh-plugin-prod

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

Walkthrough

This PR adds readiness coordination to the Kubernetes metadata processor to address the race condition where events are processed before Kubernetes discovery initialization completes. The changes introduce wait_for_processor_ready and wait_for_processor_ready_timeout configuration options that enable synchronous initialization with timeout handling. When enabled, the processor blocks until Kubernetes is available or the timeout expires. When disabled, initialization runs asynchronously in the background. The enrichment logic now skips events that already contain metadata and returns early when the processor is unavailable, and the metadata existence check is refined to use proper value lookup with error handling. OCI container handling is extended to support multiple map types during transformation.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed PR addresses issue #50507 by adding configurable wait_for_processor_ready and wait_for_processor_ready_timeout options to synchronize processor initialization and prevent events from processing before Kubernetes metadata is ready.
Out of Scope Changes check ✅ Passed All changes directly support the linked issue objective: config fields, processor logic, tests, docs, and changelog are all necessary for the wait-for-ready feature.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@libbeat/processors/add_kubernetes_metadata/kubernetes_test.go`:
- Around line 98-116: The test TestAnnotatorRunWhenMatchersNil intends to
validate behavior when k.matchers is nil, but kubernetesAvailable defaults false
so Run exits early; fix by adding a nil-guard in kubernetesAnnotator.Run that
checks if k.matchers == nil (or k.matchers.MetadataIndex is nil) and returns the
event unchanged, and update the test to set kubernetesAvailable: true on the
kubernetesAnnotator so the Run path actually reaches the matchers check;
reference kubernetesAnnotator.Run, the kubernetesAvailable field, k.matchers and
MetadataIndex, and the TestAnnotatorRunWhenMatchersNil test when making these
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 95c65146-c8c4-48b2-afcc-9afe743550ef

📥 Commits

Reviewing files that changed from the base of the PR and between 9702d0a and 98c553d.

📒 Files selected for processing (3)
  • changelog/fragments/1778052475-fix-k8s-processor.yaml
  • libbeat/processors/add_kubernetes_metadata/kubernetes.go
  • libbeat/processors/add_kubernetes_metadata/kubernetes_test.go

Comment thread libbeat/processors/add_kubernetes_metadata/kubernetes_test.go Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libbeat/processors/add_kubernetes_metadata/kubernetes_test.go (1)

1-619: ⚠️ Potential issue | 🔴 Critical

Add nil-matchers guard to kubernetes.go and test coverage for that case.

kubernetes.go Run() calls k.matchers.MetadataIndex() at line 356 without checking if k.matchers is nil. If kubernetesAvailable is true and k.matchers is nil, this causes a runtime panic. The test file should include TestAnnotatorRunWhenMatchersNil (referenced in the PR summary but absent) to cover this scenario and prevent regression.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libbeat/processors/add_kubernetes_metadata/kubernetes_test.go` around lines 1
- 619, The Run path in kubernetesAnnotator can panic because
kubernetesAnnotator.Run calls k.matchers.MetadataIndex() without checking
k.matchers for nil; update kubernetesAnnotator.Run in kubernetes.go to guard
against a nil k.matchers (e.g., if k.matchers == nil or
k.matchers.MetadataIndex() cannot be called, behave as when no matcher matches:
skip lookup and return the original event/error nil), and add a unit test named
TestAnnotatorRunWhenMatchersNil in kubernetes_test.go that constructs a
kubernetesAnnotator with kubernetesAvailable=true but matchers=nil and asserts
Run does not panic and returns the unmodified event.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: aeff278d-15f0-4d7d-95ed-b73316858703

📥 Commits

Reviewing files that changed from the base of the PR and between 98c553d and 9620b27.

📒 Files selected for processing (1)
  • libbeat/processors/add_kubernetes_metadata/kubernetes_test.go

@khushijain21
Contributor Author

khushijain21 commented May 6, 2026

I see that we retry 10 times before we error out. Would it be an acceptable tradeoff to block the pipeline until initialization fails or succeeds, so that no events pass through the pipeline un-enriched? Or we could reduce the number of attempts so as not to block the pipeline for long. cc: @rdner

```go
func isKubernetesAvailableWithRetry(client k8sclient.Interface) bool {
	connectionAttempts := 1
	for {
		kubernetesAvailable, err := isKubernetesAvailable(client)
		if kubernetesAvailable {
			return true
		}
		if connectionAttempts > checkNodeReadyAttempts {
			logp.Info("%v: could not detect kubernetes env: %v", "add_kubernetes_metadata", err)
			return false
		}
		time.Sleep(3 * time.Second)
		connectionAttempts += 1
	}
}
```

@rdner
Member

rdner commented May 6, 2026

@khushijain21 looks like we have 3 seconds in between and it's going to be at least 30 seconds in total. We need to check what kind of timeouts the k8sclient itself has.

We don't want to end up in a situation when a wrong k8s API address or connectivity issues result in minutes of a blocked event processing pipeline.

Waiting for 30 seconds and printing a warning sounds acceptable, but I suspect we might have hidden timeouts in this code that need to be checked.

@rdner
Member

rdner commented May 6, 2026

@khushijain21 another concern: once we failed to initialize the processor on startup, what do we do? Do we crash the process? Do we ignore and never retry again?

I can see both of these options can be an issue.

I think it would be better to issue a warning, start ingesting and retry again later.

@cmacknz what do you think?
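The warn-and-retry-later behaviour suggested above could be sketched as follows. The annotator type, the atomic ready flag, and the string-based Run are all illustrative stand-ins for the real processor, not its actual implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// annotator sketches "issue a warning, start ingesting, retry later":
// Run never blocks, and events simply pass through un-enriched until the
// background initialization succeeds and flips the ready flag.
type annotator struct {
	ready atomic.Bool
}

// startBackgroundInit keeps probing until check succeeds; the real
// processor would log a warning on each failed attempt.
func (a *annotator) startBackgroundInit(check func() bool, interval time.Duration) {
	go func() {
		for !check() {
			time.Sleep(interval)
		}
		a.ready.Store(true)
	}()
}

// Run enriches the event only once metadata is available.
func (a *annotator) Run(event string) string {
	if !a.ready.Load() {
		return event // pass through instead of blocking the pipeline
	}
	return event + " +k8s"
}

func main() {
	a := &annotator{}
	attempts := 0
	a.startBackgroundInit(func() bool {
		attempts++
		return attempts >= 2 // init succeeds on the second probe
	}, 50*time.Millisecond)

	fmt.Println(a.Run("event-1")) // passes through un-enriched: init not done yet
	time.Sleep(200 * time.Millisecond)
	fmt.Println(a.Run("event-2")) // enriched once background init completed
}
```

The tradeoff is the one this PR is about: early events are ingested without Kubernetes metadata instead of being held back.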

Comment on lines +348 to +349
```go
// wait for kubernetes metadata processor to be initialized before processing any events
k.wg.Wait()
```
Contributor


Worst case scenario: We have to retry multiple times to connect to kubernetes. Until we do that, the event will not be processed and we're essentially blocked. Subsequently, filebeat cannot process more messages because the add_kubernetes_metadata.Run has not returned yet.

Is this acceptable?

Contributor


This would lead to a temporary reduction in throughput.

Contributor


But again, k8sattributes processor blocks until it either establishes the connection, or it fails.

Member


With k8sattributes, if it blocks forever and you want data anyway, you can remove it from the configuration. For agent, you can only remove add_kubernetes_metadata once everything runs as a beats receiver. So I don't think it's completely safe to backport this, as there's no way to work around it if it doesn't work as expected.

We could similarly add configuration controlling whether to block and how to handle failure but in agent you can't configure it yet.

@VihasMakwana
Contributor

@khushijain21 another concern: once we failed to initialize the processor on startup, what do we do? Do we crash the process? Do we ignore and never retry again?

Crashing the process would be a breaking change. We don't do that right now.

@cmacknz cmacknz removed the backport-active-all Automated backport with mergify to all the active branches label May 6, 2026
@cmacknz
Member

cmacknz commented May 6, 2026

I think the safest thing to do is to make this behaviour configurable, the trouble is we don't have a way to configure this processor in agent yet (hopefully in 9.5.0 we will).

@mergify
Contributor

mergify Bot commented May 6, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @khushijain21? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch (\d is the version digit)
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@khushijain21
Contributor Author

khushijain21 commented May 7, 2026

The k8sattributes processor actually exposes two fields, wait_for_metadata and wait_for_metadata_timeout (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sattributesprocessor#configuration-options-reference), and I think we can follow suit. But instead of crashing the process entirely, we can log a warning and continue.

the trouble is we don't have a way to configure this processor in agent yet (hopefully in 9.5.0 we will).
👍

@khushijain21 khushijain21 marked this pull request as draft May 7, 2026 07:22
@github-actions
Contributor

github-actions Bot commented May 7, 2026

TL;DR

The failing Buildkite job is the pipeline upload step itself, and it exits immediately with `Missing agent. See: buildkite-agent bootstrap --help`. This points to CI agent/bootstrap context for the upload job, not to the Go changes in this PR.

Remediation

  • Ensure the :pipeline::arrow_up: Upload Pipeline: .buildkite/pipeline.yml step runs in an environment where Buildkite agent bootstrap context is present (agent metadata/socket/token available to buildkite-agent pipeline upload).
  • Re-run the build after fixing the upload-step environment; no code change in libbeat/processors/add_kubernetes_metadata/* should be required for this failure.
Investigation details

Root Cause

The only failed step is the Buildkite pipeline upload step, and its log contains a single error:

  • `Missing agent. See: buildkite-agent bootstrap --help`

This indicates the upload command is being executed without a valid agent bootstrap context.

I also checked the PR commit referenced by the build (1e24afd199cce20bbfca592ea0cc20a974ea3150), which changes only:

  • libbeat/processors/add_kubernetes_metadata/config.go
  • libbeat/processors/add_kubernetes_metadata/kubernetes.go

No .buildkite/* files are changed in that commit, so this failure is not caused by the PR diff.

Evidence

  • Build: https://buildkite.com/elastic/beats/builds/45547
  • Job/step: :pipeline::arrow_up: Upload Pipeline: .buildkite/pipeline.yml
  • Failure summary: /tmp/gh-aw/buildkite-failures.txt:6-10
  • Key log excerpt: /tmp/gh-aw/buildkite-logs/beats-pipelinearrow_up-upload-pipeline-buildkitepipelineyml.txt:1
    • `Missing agent. See: buildkite-agent bootstrap --help`

Verification

  • Not run (this is a CI environment/bootstrap failure before test execution).

Follow-up

  • If this reproduces after retry, inspect the upload step’s agent image/bootstrap wiring and any recent CI platform changes affecting the beats pipeline uploader job.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

What is this? | From workflow: PR Buildkite Detective


@khushijain21 khushijain21 changed the title [bug] fix add_kubernetes_metadata not enriching initial data Configure options on add_kubernetes_metadata to wait for processor initialization May 7, 2026
Comment on lines +88 to +90
```go
if waitReady && waitReadyTimeout > 0 {
	timer = time.NewTimer(waitReadyTimeout)
} else {
```
Contributor

@VihasMakwana VihasMakwana May 7, 2026


Perhaps we can do some form of validation here? Something like "waitReadyTimeout should be a positive integer if waitReady is true"
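A minimal sketch of such validation, assuming the field names used elsewhere in this PR; the Validate method here is illustrative, not the actual libbeat implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// kubeAnnotatorConfig holds only the two fields relevant here; the names
// follow the PR, the Validate method is a sketch.
type kubeAnnotatorConfig struct {
	WaitReady        bool          `config:"wait_for_processor_ready"`
	WaitReadyTimeout time.Duration `config:"wait_for_processor_ready_timeout"`
}

// Validate rejects the combination flagged in this review comment:
// waiting enabled with a non-positive timeout.
func (c *kubeAnnotatorConfig) Validate() error {
	if c.WaitReady && c.WaitReadyTimeout <= 0 {
		return errors.New("wait_for_processor_ready_timeout must be a positive duration when wait_for_processor_ready is enabled")
	}
	return nil
}

func main() {
	bad := kubeAnnotatorConfig{WaitReady: true, WaitReadyTimeout: 0}
	fmt.Println(bad.Validate()) // non-nil: invalid combination rejected

	good := kubeAnnotatorConfig{WaitReady: true, WaitReadyTimeout: 30 * time.Second}
	fmt.Println(good.Validate()) // <nil>
}
```

Failing fast at config load keeps the timer-selection branch above from ever seeing the ambiguous "wait, but with no timeout" state.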

@khushijain21 khushijain21 marked this pull request as ready for review May 7, 2026 13:08
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@libbeat/processors/add_kubernetes_metadata/docs/add_kubernetes_metadata.asciidoc`:
- Line 138: The docs state wait_for_processor_ready_timeout default is 10s but
the actual default is set to 30s in kubeAnnotatorConfig.InitDefaults(); update
the documentation to match the code (or change InitDefaults() if the intended
default is 10s) — locate the symbol wait_for_processor_ready_timeout in
add_kubernetes_metadata.asciidoc and sync its default to the value from
kubeAnnotatorConfig.InitDefaults() in config.go so doc and implementation agree.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3047d836-9f77-4ecd-b278-3815317a4a96

📥 Commits

Reviewing files that changed from the base of the PR and between 9620b27 and d2f41a2.

📒 Files selected for processing (5)
  • changelog/fragments/1778052475-fix-k8s-processor.yaml
  • libbeat/processors/add_kubernetes_metadata/config.go
  • libbeat/processors/add_kubernetes_metadata/config_test.go
  • libbeat/processors/add_kubernetes_metadata/docs/add_kubernetes_metadata.asciidoc
  • libbeat/processors/add_kubernetes_metadata/kubernetes.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • changelog/fragments/1778052475-fix-k8s-processor.yaml
  • libbeat/processors/add_kubernetes_metadata/kubernetes.go


```go
AddResourceMetadata *metadata.AddResourceMetadataConfig `config:"add_resource_metadata"`
WaitReady           bool                                `config:"wait_for_processor_ready"`
WaitReadyTimeout    time.Duration                       `config:"wait_for_processor_ready_timeout"`
```
Member


Suggested change
```go
// before
WaitReadyTimeout time.Duration `config:"wait_for_processor_ready_timeout"`
// after
WaitMetadata        bool          `config:"wait_for_metadata"`
WaitMetadataTimeout time.Duration `config:"wait_for_metadata_timeout"`
```

I like the k8sattributes names, they are specific about what we are waiting for.

Member


Mainly I don't like including the word processor again in something that is already obviously a processor configuration :)

```go
k.DefaultIndexers = Enabled{true}
k.Scope = "node"
k.AddResourceMetadata = metadata.GetDefaultResourceMetadataConfig()
k.WaitReadyTimeout = 30 * time.Second
```
Member


I think WaitReady/WaitMetadata should be true by default, that is the least surprising way for this to work. People who have problems with it can manually switch back. I don't view this as breaking because if everything is working properly it isn't.

```go
	timer = time.NewTimer(waitReadyTimeout)
} else {
	// hard coding a 5 minutes timeout in case the function is called without waiting for metadata, to avoid infinite loops
	timer = time.NewTimer(5 * time.Minute)
```
Member


IMO 5 minutes is too long when someone has explicitly opted out of waiting for metadata.


Labels

Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] add_kubernetes_metadata async creation can lead to events missing data

5 participants