
[BUG] InferenceService status misses pods for long names #575

@chengjieyao

Description


What happened?

For long InferenceService names, the controller can fail to find component pods during status propagation: the pods are labeled with a truncated/hashed app value to fit Kubernetes naming/label limits (label values are capped at 63 characters), while the status lookup builds its selector from the full name. This hides or delays accurate component status reporting and leaves the status dependent only on higher-level Deployment conditions.
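For illustration, here is a minimal sketch of the kind of derivation the creation path appears to apply; the function name, hash choice, and encoding are assumptions, not the project's actual helper. When `<name>-<component>` would exceed the 63-character label-value limit, the name portion is replaced by a shortened hash, which is why the pod's app label no longer equals the full name.

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"strings"
)

// Kubernetes limits label values to 63 characters.
const maxLabelValueLen = 63

// deriveAppLabel is a hypothetical sketch of the creation-path behavior:
// "<name>-<component>" is used verbatim when it fits the label limit;
// otherwise the name portion is replaced by a shortened, hashed form.
func deriveAppLabel(isvcName, component string) string {
	full := isvcName + "-" + component
	if len(full) <= maxLabelValueLen {
		return full
	}
	// Hash the long name and encode it so the result is a valid label value.
	sum := sha256.Sum256([]byte(isvcName))
	enc := base32.StdEncoding.WithPadding(base32.NoPadding)
	hashed := strings.ToLower(enc.EncodeToString(sum[:]))
	if budget := maxLabelValueLen - len(component) - 1; len(hashed) > budget {
		hashed = hashed[:budget]
	}
	return hashed + "-" + component
}

func main() {
	longName := strings.Repeat("a", 80) // stand-in for a long InferenceService name
	fmt.Println("pod label:     ", deriveAppLabel(longName, "engine"))
	fmt.Println("status lookup: ", longName+"-engine") // what the selector uses today
}
```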

What did you expect to happen?

The controller should use the same label derivation for pod lookup as it uses for deployment/pod creation, so it can always find the component pods and propagate their status correctly.
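A minimal sketch of that expectation, assuming a controller-runtime client; `listComponentPods` and `deriveAppLabel` are hypothetical names standing in for whatever shared helper the creation path already uses.

```go
package status

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deriveAppLabel is a placeholder standing in for the shared helper: the point
// of the fix is that the lookup calls the exact same function the creation
// path uses when it labels the pods (placeholder body only).
func deriveAppLabel(isvcName, component string) string {
	return isvcName + "-" + component
}

// listComponentPods sketches the status-time lookup: the selector is built
// from the shared derivation rather than by concatenating the full
// InferenceService name, so it always matches what the pods actually carry.
func listComponentPods(ctx context.Context, c client.Client, namespace, isvcName, component string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := c.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"app": deriveAppLabel(isvcName, component)},
	)
	return pods, err
}
```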

How can we reproduce it (as minimally and precisely as possible)?

  1. Create an InferenceService with a long name.
  2. Let it create a raw deployment engine pod.
  3. Check controller logs during status propagation.
  4. Compare:
    • the podLabelValue used by the status lookup
    • the actual pod metadata.labels.app
    (a small client-go sketch for this comparison follows the label excerpt below)

The controller logs "Listed pods while updating component model status" with:

```json
{
  "component": "engine",
  "podCount": 0,
  "podLabelKey": "app",
  "podLabelValue": "<full-inference-service-name>-engine"
}
```

But the actual pod label is truncated/hashed, for example:

```yaml
labels:
  app: a5b5c2cf-jqa4tzjnvnaeioaw6ewzj5uevu2qlj6ii6vknafdarwgmfq-engine
```

So the selector does not match the pod.
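For step 4 of the reproduction, a throwaway client-go sketch (not project code; namespace and name are placeholders) that prints how many pods the naive selector matches next to the app labels the pods actually carry:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	namespace := "default"                                  // placeholder namespace
	isvcName := "replace-with-your-long-inference-service-name" // placeholder name
	naive := isvcName + "-engine"                           // what the status lookup selects on

	// Pods matching the naive selector (expected: 0 for long names).
	selected, err := cs.CoreV1().Pods(namespace).List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=" + naive})
	if err != nil {
		panic(err)
	}
	fmt.Printf("pods matching app=%s: %d\n", naive, len(selected.Items))

	// The app label the pods in the namespace actually carry.
	all, err := cs.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range all.Items {
		fmt.Printf("%s  app=%s\n", p.Name, p.Labels["app"])
	}
}
```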

Impact

Even when the pod lookup fails, the controller still updates EngineReady earlier in the reconcile from the Deployment status via PropagateRawStatus(), and that path does not depend on matching pods. As a result, the status falls back to generic Deployment-level conditions such as MinimumReplicasUnavailable, which makes debugging harder and hides the real runtime failure reason.
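For context on what gets hidden: if the lookup did match pods, the pod-based path could surface a container-level reason instead of the generic Deployment condition. A sketch of that kind of extraction (a hypothetical helper, not the controller's actual code):

```go
package status

import corev1 "k8s.io/api/core/v1"

// podFailureReason illustrates the kind of detail the pod-based path could
// report once the lookup matches pods: the container's waiting reason
// (e.g. CrashLoopBackOff, ImagePullBackOff) rather than the generic
// MinimumReplicasUnavailable condition inherited from the Deployment.
func podFailureReason(pod *corev1.Pod) string {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason != "" {
			return cs.State.Waiting.Reason
		}
	}
	return ""
}
```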
