Configure options on add_kubernetes_metadata to wait for processor initialization #50509
khushijain21 wants to merge 15 commits into elastic:main
Conversation
This pull request does not have a backport label.
To fix up this pull request, you need to add the backport labels for the needed branches.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
📝 Walkthrough
This PR adds readiness coordination to the Kubernetes metadata processor to address the race condition where events are processed before Kubernetes discovery initialization completes.
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 1

In `libbeat/processors/add_kubernetes_metadata/kubernetes_test.go`:
- Around lines 98-116: The test `TestAnnotatorRunWhenMatchersNil` intends to validate behavior when `k.matchers` is nil, but `kubernetesAvailable` defaults to false, so `Run` exits early. Fix by adding a nil guard in `kubernetesAnnotator.Run` that checks whether `k.matchers == nil` (or `k.matchers.MetadataIndex` is nil) and returns the event unchanged, and update the test to set `kubernetesAvailable: true` on the `kubernetesAnnotator` so the `Run` path actually reaches the matchers check.
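The guard the review asks for could be sketched as below. The stand-in types (`Event`, `Matchers`) and the trailing enrichment comment are simplifications for illustration; the real definitions in libbeat differ.

```go
package main

import "fmt"

// Minimal stand-ins for the beats types; hypothetical, trimmed to what the
// review comment discusses.
type Event struct{ Fields map[string]interface{} }

type Matchers struct{}

func (m *Matchers) MetadataIndex(fields map[string]interface{}) string { return "" }

type kubernetesAnnotator struct {
	kubernetesAvailable bool
	matchers            *Matchers
}

// Run sketches the proposed guard: when matchers is nil, behave as if no
// matcher matched and return the event unchanged instead of panicking.
func (k *kubernetesAnnotator) Run(event *Event) (*Event, error) {
	if !k.kubernetesAvailable {
		return event, nil
	}
	if k.matchers == nil {
		return event, nil
	}
	index := k.matchers.MetadataIndex(event.Fields)
	if index == "" {
		return event, nil
	}
	// ...metadata lookup and event enrichment would happen here...
	return event, nil
}

func main() {
	k := &kubernetesAnnotator{kubernetesAvailable: true, matchers: nil}
	ev := &Event{Fields: map[string]interface{}{"message": "hello"}}
	out, err := k.Run(ev)
	fmt.Println(out == ev, err == nil) // true true
}
```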
📒 Files selected for processing (3)
- changelog/fragments/1778052475-fix-k8s-processor.yaml
- libbeat/processors/add_kubernetes_metadata/kubernetes.go
- libbeat/processors/add_kubernetes_metadata/kubernetes_test.go
Caution: some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libbeat/processors/add_kubernetes_metadata/kubernetes_test.go (1)
- Lines 1-619: ⚠️ Potential issue | 🔴 Critical. Add a nil-matchers guard to `kubernetes.go` and test coverage for that case. `Run()` in `kubernetes.go` calls `k.matchers.MetadataIndex()` at line 356 without checking if `k.matchers` is nil. If `kubernetesAvailable` is true and `k.matchers` is nil, this causes a runtime panic. The test file should include `TestAnnotatorRunWhenMatchersNil` (referenced in the PR summary but absent) to cover this scenario and prevent regression.
I see that we retry 10 times before we error out. Would it be an acceptable tradeoff to block the pipeline until initialization fails or succeeds, so that no events pass through the pipeline un-enriched? Or we could reduce the number of attempts so the pipeline is not blocked for long. cc: @rdner (see beats/libbeat/processors/add_kubernetes_metadata/kubernetes.go, lines 83 to 97 in caea6ba)
@khushijain21 looks like we have 3 seconds in between, so it's going to be at least 30 seconds in total. We need to check what kind of timeouts the k8s client itself has. We don't want to end up in a situation where a wrong k8s API address or connectivity issues result in minutes of a blocked event processing pipeline. Waiting for 30 seconds and printing a warning sounds acceptable, but I suspect we might have hidden timeouts in this code that need to be checked.
@khushijain21 another concern: once we fail to initialize the processor on startup, what do we do? Do we crash the process? Do we ignore and never retry again? I can see how both of these options could be an issue. I think it would be better to issue a warning, start ingesting, and retry again later. @cmacknz what do you think?
```go
// wait for kubernetes metadata processor to be initialized before processing any events
k.wg.Wait()
```
Worst case scenario: we have to retry multiple times to connect to Kubernetes. Until we do, the event will not be processed and we're essentially blocked. Subsequently, Filebeat cannot process more messages because `add_kubernetes_metadata.Run` has not returned yet.
Is this acceptable?
This would lead to a temporary reduction in throughput.
But again, k8sattributes processor blocks until it either establishes the connection, or it fails.
With k8sattributes, if it blocks forever and you want data anyway, you can remove it from the configuration. For agent, you can only remove add_kubernetes_metadata once everything runs as a beats receiver. So I don't think it's completely safe to backport this, as there's no way to work around it if it doesn't work as expected.
We could similarly add configuration controlling whether to block and how to handle failure, but in agent you can't configure it yet.
Crashing the process would be a breaking change. We don't do that right now.
I think the safest thing to do is to make this behaviour configurable; the trouble is we don't have a way to configure this processor in agent yet (hopefully in 9.5.0 we will).
TL;DR
The failing Buildkite job is the pipeline upload step itself, and it exits immediately.
Remediation
Investigation details
Root cause: the only failed step is the Buildkite pipeline upload step, and its log contains a single error. This indicates the upload command is being executed without a valid agent bootstrap context. I also checked the PR commit referenced by the build.
No evidence
Verification
Follow-up
```go
if waitReady && waitReadyTimeout > 0 {
	timer = time.NewTimer(waitReadyTimeout)
} else {
```
Perhaps we can do some form of validation here? Something like "waitReadyTimeout should be a positive duration if waitReady is true".
Actionable comments posted: 1

In `libbeat/processors/add_kubernetes_metadata/docs/add_kubernetes_metadata.asciidoc`:
- Line 138: The docs state the `wait_for_processor_ready_timeout` default is 10s, but the actual default is set to 30s in `kubeAnnotatorConfig.InitDefaults()`. Update the documentation to match the code (or change `InitDefaults()` in config.go if the intended default is 10s) so doc and implementation agree.
📒 Files selected for processing (5)
- changelog/fragments/1778052475-fix-k8s-processor.yaml
- libbeat/processors/add_kubernetes_metadata/config.go
- libbeat/processors/add_kubernetes_metadata/config_test.go
- libbeat/processors/add_kubernetes_metadata/docs/add_kubernetes_metadata.asciidoc
- libbeat/processors/add_kubernetes_metadata/kubernetes.go
🚧 Files skipped from review as they are similar to previous changes (2)
- changelog/fragments/1778052475-fix-k8s-processor.yaml
- libbeat/processors/add_kubernetes_metadata/kubernetes.go
```go
AddResourceMetadata *metadata.AddResourceMetadataConfig `config:"add_resource_metadata"`
WaitReady           bool                                `config:"wait_for_processor_ready"`
WaitReadyTimeout    time.Duration                       `config:"wait_for_processor_ready_timeout"`
```
Suggested change:
```go
WaitMetadata        bool          `config:"wait_for_metadata"`
WaitMetadataTimeout time.Duration `config:"wait_for_metadata_timeout"`
```
I like the k8sattributes names, they are specific about what we are waiting for.
Mainly I don't like including the word processor again in something that is already obviously a processor configuration :)
```go
k.DefaultIndexers = Enabled{true}
k.Scope = "node"
k.AddResourceMetadata = metadata.GetDefaultResourceMetadataConfig()
k.WaitReadyTimeout = 30 * time.Second
```
I think WaitReady/WaitMetadata should be true by default, that is the least surprising way for this to work. People who have problems with it can manually switch back. I don't view this as breaking because if everything is working properly it isn't.
```go
	timer = time.NewTimer(waitReadyTimeout)
} else {
	// hard-code a 5-minute timeout in case the function is called without
	// waiting for metadata, to avoid infinite loops
	timer = time.NewTimer(5 * time.Minute)
```
IMO 5 minutes is too long when someone has explicitly opted out of waiting for metadata.
Proposed commit message
This PR adds two new configuration options, `wait_for_processor_ready` and `wait_for_processor_ready_timeout`. They allow users to configure whether processor initialization should occur synchronously or asynchronously; each mode has its own benefits and downsides, as listed in the docs. We set `wait_for_processor_ready` to false for now, until it can be configured from elastic-agent.

Checklist
- Used the `stresstest.sh` script to run tests under stress conditions and with the race detector to verify their stability.
- Added an entry in `./changelog/fragments` using the changelog tool.

Disruptive User Impact
None, since by default `wait_for_processor_ready` is still set to false. We can choose to enable this when it is configurable from inside elastic-agent.

How to test this PR locally
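A minimal processor configuration to exercise the new options locally might look like this. The YAML shape is inferred from the option names in this PR, so verify it against the updated docs:

```yaml
processors:
  - add_kubernetes_metadata:
      wait_for_processor_ready: true
      wait_for_processor_ready_timeout: 30s
```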
Related issues