### What happened?
For long InferenceService names, the controller may fail to find component pods during status propagation: it looks pods up by a label value built from the full name, while the pods are actually labeled with a truncated/hashed `app` value that fits the Kubernetes 63-character limit on label values. This hides or delays accurate component status reporting and leaves status dependent only on higher-level Deployment conditions.
### What did you expect to happen?
The controller should use the same label derivation for pod lookup as deployment/pod creation, so it can always find component pods and propagate status correctly.
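A minimal sketch of what a shared derivation could look like, assuming the truncation is driven by the 63-character limit on label values. The helper name and the hashing scheme are illustrative, not the controller's actual code; the point is that creation and lookup call one function:

```go
package labels

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// maxLabelValueLen is the Kubernetes limit on label values.
const maxLabelValueLen = 63

// componentAppLabel derives the `app` label value for a component's pods.
// If both pod creation and status lookup call this one function, the two
// can never disagree. (Illustrative: the real controller appears to use a
// hash-based encoding; the exact scheme does not matter as long as it is
// shared. The output shape here mimics the observed
// "<8-char-digest>-<encoded-name>-<component>" capped at 63 characters.)
func componentAppLabel(isvcName, component string) string {
	full := fmt.Sprintf("%s-%s", isvcName, component)
	if len(full) <= maxLabelValueLen {
		return full
	}
	// Too long: replace the raw name with a digest-derived value, keeping
	// the component suffix readable.
	sum := sha256.Sum256([]byte(isvcName))
	digest := hex.EncodeToString(sum[:]) // 64 hex characters
	budget := maxLabelValueLen - len(component) - 10 // 8-char prefix + two '-'
	if budget < 0 {
		budget = 0
	}
	if budget > len(digest)-8 {
		budget = len(digest) - 8
	}
	return fmt.Sprintf("%s-%s-%s", digest[:8], digest[8:8+budget], component)
}
```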
### How can we reproduce it (as minimally and precisely as possible)?
- Create an InferenceService whose name is long enough that `<name>-engine` exceeds the 63-character limit on label values.
- Let it create a raw deployment engine pod.
- Check controller logs during status propagation.
- Compare:
  - the `podLabelValue` used by the status lookup
  - the actual pod `metadata.labels.app`
The controller logs:

```
Listed pods while updating component model status
{
  "component": "engine",
  "podCount": 0,
  "podLabelKey": "app",
  "podLabelValue": "<full-inference-service-name>-engine"
}
```
But the actual pod label is truncated/hashed, for example:
```yaml
labels:
  app: a5b5c2cf-jqa4tzjnvnaeioaw6ewzj5uevu2qlj6ii6vknafdarwgmfq-engine
```
So the selector does not match the pod.
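The fix implied by this comparison is to route the lookup through the same derivation used at creation time. A hedged sketch using controller-runtime's client (`componentAppLabel` is the hypothetical shared helper from the sketch above; the surrounding reconciler plumbing is omitted):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listComponentPods lists a component's pods by the same derived label
// value that was stamped on them at creation time, instead of rebuilding
// "<name>-<component>" from the raw InferenceService name.
func listComponentPods(ctx context.Context, c client.Client, namespace, isvcName, component string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := c.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"app": componentAppLabel(isvcName, component)},
	)
	return pods, err
}
```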
### Impact
Even when the pod lookup fails, the controller still updates `EngineReady` from the Deployment status earlier in the reconcile via `PropagateRawStatus()`. That path does not depend on matching pods. As a result, status falls back to generic deployment-level conditions like `MinimumReplicasUnavailable`, which makes debugging harder and hides the real runtime failure reason.
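For reference, that fallback path amounts to something like the following sketch (not the controller's actual code): readiness and reason are derived purely from the Deployment's `Available` condition, so an empty pod list never surfaces as an error.

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// readinessFromDeployment mirrors the PropagateRawStatus() behaviour
// described above: readiness and reason come from the Deployment alone.
// With zero matched pods there is nothing to override this, so the status
// keeps a generic reason like "MinimumReplicasUnavailable" instead of the
// pod-level failure detail.
func readinessFromDeployment(dep *appsv1.Deployment) (ready bool, reason string) {
	for _, cond := range dep.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable {
			return cond.Status == corev1.ConditionTrue, cond.Reason
		}
	}
	return false, "Unknown"
}
```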