What is the problem?
When PSI (Pressure Stall Information) is not available on the host (kernel compiled without CONFIG_PSI, or booted with psi=0), cAdvisor still emits PSI Prometheus metrics (container_pressure_cpu_stalled_seconds_total, container_pressure_cpu_waiting_seconds_total, etc.) with zero values. This is misleading for monitoring and alerting systems, which cannot distinguish "PSI unavailable" from "PSI available but system is idle."
Main cause
statPSI() in opencontainers/cgroups (fs2/psi.go) correctly returns nil when PSI files don't exist or the kernel returns ENOTSUP. However, the nil signal is lost in cAdvisor's processing:
1. setPSIStats() in container/libcontainer/handler.go receives the nil *cgroups.PSIStats and silently does nothing, leaving the info.PSIStats at its zero value.
2. The Prometheus collector in metrics/prometheus.go reads those zero values and emits them as real metrics.
Since CpuStats.PSI, DiskIoStats.PSI, and MemoryStats.PSI in info/v1/container.go are value types (not pointers), there is no way for code to distinguish "PSI unavailable" from "PSI value is genuinely zero."
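
A minimal sketch of how the nil signal gets lost, using simplified stand-in types rather than the real cAdvisor or opencontainers/cgroups definitions (the field layout and the setPSIStats signature here are assumptions made for illustration only):

```go
package main

import "fmt"

// Simplified stand-ins for illustration; NOT the real cAdvisor types.
type PSIData struct{ Total uint64 }
type PSIStats struct{ Some, Full PSIData }

// CpuStats holds PSI as a value type, so it always has a (zero) value.
type CpuStats struct {
	PSI PSIStats
}

// setPSIStats mimics the behavior described above: a nil source is
// silently ignored and dst keeps its zero value.
func setPSIStats(src *PSIStats, dst *PSIStats) {
	if src == nil {
		return
	}
	*dst = *src
}

func main() {
	var unavailable, idle CpuStats

	setPSIStats(nil, &unavailable.PSI) // host without PSI support
	setPSIStats(&PSIStats{}, &idle.PSI) // PSI supported, no stalls recorded

	// Both structs are identical, so a downstream collector cannot tell
	// "PSI unavailable" apart from "PSI available but idle".
	fmt.Println(unavailable == idle) // prints "true"
}
```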
Possible approaches
1. Change PSI fields to pointer types (PSI *PSIStats instead of PSI PSIStats) so nil means unavailable. The Prometheus collector can then skip metrics when the pointer is nil. This is a breaking API change.
2. Add a PSISupported bool field alongside the existing PSI fields. Non-breaking, but adds a redundant field that consumers must remember to check.
3. Rely on callers to gate PressureMetrics in includedMetrics before creating the manager. This is what Kubernetes currently does (kubernetes/kubernetes#137326, "Fix zero PSI metrics emitted when OS doesn't enable PSI"), but it requires every cAdvisor consumer to implement their own PSI detection.
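
To make the pointer-based approach concrete, here is a rough sketch under stated assumptions: the types are simplified stand-ins, collectCPUPressure and the sample struct are hypothetical helpers (not the real collector in metrics/prometheus.go), and the mapping of "some" to the waiting metric and "full" to the stalled metric is assumed for illustration.

```go
package main

import "fmt"

// Simplified stand-ins for illustration; NOT the real cAdvisor types.
type PSIData struct{ Total uint64 } // total stall time in microseconds
type PSIStats struct{ Some, Full PSIData }

// CpuStats holds PSI as a pointer: nil means PSI is unavailable.
type CpuStats struct {
	PSI *PSIStats
}

// setPSIStats preserves the nil signal instead of dropping it.
func setPSIStats(src *PSIStats) *PSIStats {
	if src == nil {
		return nil
	}
	c := *src // copy so the caller's struct is not aliased
	return &c
}

// sample is a hypothetical stand-in for an emitted Prometheus metric.
type sample struct {
	name  string
	value float64
}

// collectCPUPressure is a hypothetical collector step: it emits nothing
// when PSI is unavailable, and real (possibly zero) values otherwise.
func collectCPUPressure(s CpuStats) []sample {
	if s.PSI == nil {
		return nil
	}
	return []sample{
		// Assumed mapping: "some" -> waiting, "full" -> stalled.
		{"container_pressure_cpu_waiting_seconds_total", float64(s.PSI.Some.Total) / 1e6},
		{"container_pressure_cpu_stalled_seconds_total", float64(s.PSI.Full.Total) / 1e6},
	}
}

func main() {
	unavailable := CpuStats{PSI: setPSIStats(nil)}
	idle := CpuStats{PSI: setPSIStats(&PSIStats{})}

	fmt.Println(len(collectCPUPressure(unavailable))) // no metrics emitted
	fmt.Println(len(collectCPUPressure(idle)))        // two zero-valued metrics
}
```

With this shape, an idle host still reports explicit zeros while a host without PSI reports no pressure series at all, which is the distinction alerting rules need. The cost is that every existing consumer that reads the PSI field directly must add a nil check.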
Related
kubernetes/kubernetes#136333 (original bug report)
kubernetes/kubernetes#137326 (kubelet-side fix)