Use PromQL info function instead of resource attribute promotion#2869
aknuds1 wants to merge 5 commits into open-telemetry:main
Conversation
```yaml
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_RESOURCE_ATTRIBUTES=${OTEL_RESOURCE_ATTRIBUTES},service.instance.id=checkout
```
`service.instance.id` is expected to be generated by SDKs or derived from the K8s environment; moreover, it should be a GUID.
Specs: https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id
K8s naming specs: https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/
```yaml
resource/postgresql:
  attributes:
    - key: service.name
      value: postgresql
      action: upsert
    - key: service.instance.id
      value: ${env:POSTGRES_HOST}
      action: upsert
```
Reading the `service.name` specs here, we could try to broaden the requirements in the specs to also define `service.name` for infrastructure monitoring use cases, and convince OTel Collector receiver maintainers to adopt this. But today, no infrastructure monitoring receiver produces `service.name` or `service.instance.id`.
```yaml
- context: resource
  statements:
    # Set service.instance.id to service.name if not already set (needed for Prometheus info() joins)
    - set(attributes["service.instance.id"], attributes["service.name"]) where attributes["service.instance.id"] == nil and attributes["service.name"] != nil
```
We would have collisions if the same service type (e.g. a Redis) is running multiple times. For infra monitoring metrics, we commonly use attributes like `host.name` to differentiate the instances.
I agree, we don't want to set the id to a name that may have a collision. This needs to be set to the pod/container/host name which should be unique.
> This needs to be set to the pod/container/host name which should be unique.

How is this implementable under Docker Compose @puckpuck?
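A minimal sketch of what the reviewers suggest, assuming the resourcedetection processor has already populated `host.name` (the processor name `transform/instance_id` is hypothetical):

```yaml
processors:
  transform/instance_id:
    error_mode: ignore
    metric_statements:
      - context: resource
        statements:
          # Fall back to host.name, which is unique per host/container, rather
          # than service.name, which collides when a service type runs twice
          - set(attributes["service.instance.id"], attributes["host.name"]) where attributes["service.instance.id"] == nil and attributes["host.name"] != nil
```

Under Docker Compose, the system detector typically reports the container hostname as `host.name`, which is unique per container.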
Overall this looks good outside of what @cyrille-leclerc already pointed out. Did you test this locally? Given the large number of changes to the queries (even just re-formatting), do all of the panels still show metrics? It would potentially be nice to show screenshots of it.
@jmichalek132 I did some simple testing locally, but I don't know the demo that well, so I'm not a very effective tester :/ Do you know the demo well enough to look for discrepancies? I did fix the bugs I could find from checking the APM and PostgreSQL dashboards, on Docker Compose.
As discussed offline with @cyrille-leclerc, it might be better to implement …
Can we verify that Prometheus alerts report enough context, not only …

How can we verify this @cyrille-leclerc? I would appreciate any help :D

@cyrille-leclerc Can you check now? Claude helped me implement what I think is a fix for the …
I've fixed the OTel Collector dashboard too.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
@aknuds1 sorry for the lack of activity on this. I'd like to get it merged. Can you resolve the conflicts, and I'll run it through a full end-to-end test to make sure.
Switch from promoting resource attributes to metric labels to using the experimental `info()` PromQL function, which joins metrics with `target_info` at query time. This reduces metric cardinality in Prometheus by storing resource attributes once in `target_info` rather than on every metric series.

Changes:
- Upgrade Prometheus to v3.11.1 (includes `info()` bug fix #17817) and enable the promql-experimental-functions feature flag
- Remove most promoted resource attributes from the Prometheus config, keeping only `kafka.cluster.alias`, `collector.instance.id`, and `host.name` (needed for hostmetrics without `service.name`)
- Add a `resource/postgresql` processor to set `service.name` on PostgreSQL receiver metrics
- Add a `transform/postgresql` processor to construct a unique `service.instance.id` per PostgreSQL resource scope, preventing duplicate `target_info` entries
- Add `metric_statements` to set a default `service.instance.id` for services that don't provide one
- Add a dedicated `metrics/postgresql` pipeline
- Set a stable `service.instance.id` for Go services and flagd in docker-compose.yml

Signed-off-by: Arve Knudsen <[email protected]>
Thanks @puckpuck - I'm fixing it up. I will let you know when it's ready to review.
Migrate all PromQL queries in dashboards and alerts to use the `info()` function for accessing resource attributes instead of filtering on promoted labels directly.

Updated dashboards:
- APM dashboard: HTTP/RPC latency, error rate, throughput, and per-operation breakdown queries
- PostgreSQL dashboard: transaction, tuple, deadlock, and background writer queries
- OpenTelemetry Collector dashboard: receiver, processor, and exporter throughput/error queries
- Cart service alert: p95 latency threshold query

Dashboard variable queries now use `label_values(target_info, ...)` to populate dropdowns from resource attributes.

Signed-off-by: Arve Knudsen <[email protected]>
Add Helm values and deployment scripts for running the demo on Kubernetes with `info()` function support.

- values-info-function.yaml: Prometheus v3.11.1 with `info()` enabled, minimal resource attribute promotion, collector transform `metric_statements` for PostgreSQL `service.name`/`service.instance.id` and duplicate `target_info` prevention
- values-kind.yaml: Kind-specific overrides (NodePort, memory limits)
- kind-config.yaml: Kind cluster with port mapping
- deploy-kind.sh: creates a Kind cluster and deploys the demo
- deploy-info-function.sh: deploys to an existing k8s cluster with custom Grafana dashboards as ConfigMaps

Signed-off-by: Arve Knudsen <[email protected]>
@puckpuck I've finalized the PR, and checked the dashboards in Docker Compose and k8s modes. AFAICT they look fine. Please go ahead with your review :) Do you want me to also make PRs to update docs and Helm charts?
Signed-off-by: Arve Knudsen <[email protected]>
All of these files in kubernetes/ should be removed. Ultimately we are moving all K8s support to be based on our Helm chart. The existing manifest file here also needs to be removed, since this folder really just causes confusion for other contributors.
I see, thanks for making me aware.
I thought about it. The kubernetes/deploy-kind.sh script is useful for me to install into a local k8s cluster and test my changes. It's already based around Helm; what's the argument against it? One would think it's generally useful for testing the OTel demo in k8s mode.
```yaml
    spike_limit_percentage: 25
  resourcedetection:
    detectors: [env, docker, system]
  resource/postgresql:
```
I understand why we need this, but I feel like this is a lot of transformation required to get common infra metrics working with the info function in PromQL.

What about Kafka or Redis metrics, or metrics from other infra? We would need to ensure they all have unique `service.instance.id` attributes as well.
Please see my Postgres receiver issue. As @lmolkova points out, I think its current modeling of resources and metrics is broken: not just due to emitting a non-uniquely identifying `service.instance.id`, but more generally due to putting attributes on the resource level which should be on the metrics level (causing there to be multiple distinct resources per `service.instance.id`, and an unnecessary need to promote resource attributes to metric labels).

At least one of the maintainers (@antonblock) agrees that the issue's proposed solution is good.

Do you think we should hold off on this PR until the Postgres issue is fixed?
```yaml
      receivers: [docker_stats, httpcheck/frontend-proxy, hostmetrics, nginx, otlp, redis, spanmetrics, kafkametrics]
      processors: [resourcedetection, transform/sanitize_spans, memory_limiter]
      exporters: [otlp_http/prometheus, debug]
    metrics/postgresql:
```
We should not expose multiple metrics pipelines; this will cause issues for people that fork or extend the demo. Let's get everything into the standard pipelines (traces, metrics, logs), and use filters in the processors themselves.

Also fwiw, memory_limiter should always be the first processor in the list. I noticed it is not here, but it should be.
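As a sketch of what that consolidation could look like, the PostgreSQL-specific statements could move into the shared transform processor guarded by `where` clauses. The guard attribute `postgresql.database.name` is an assumption about what the receiver emits:

```yaml
transform/sanitize_spans:
  metric_statements:
    - context: resource
      statements:
        # Only touch resources coming from the postgresql receiver,
        # identified here by a receiver-set attribute (assumed name)
        - set(attributes["service.name"], "postgresql") where attributes["postgresql.database.name"] != nil
```

This keeps a single metrics pipeline while still scoping the transform to the receiver that needs it.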
I've tried to fix this in the latest commit.
I'm aligned with @puckpuck. This PR does a great job highlighting the challenges around OTel infrastructure metrics resource attributes with Prometheus, and it reinforces the value of the OTel Demo as a realistic environment to understand how these pieces fit together. That said, it feels a bit early to replace Prometheus resource attribute promotion in the OTel Demo with the `info()` function. It might be helpful to align as a group on a few milestones that could guide a confident transition.

Curious to hear what others think about this direction.
Yes, we should use this as an initiative to push the underlying ecosystem (i.e. infra metrics receivers) to support the upcoming standard. Perhaps something we can discuss at the next SIG meeting to understand who else needs to be notified and coordinated with.
@cyrille-leclerc If …

@puckpuck Please also note that when the OTel Entity Data Model becomes standard, Prometheus will no longer need …
Move PostgreSQL-specific processors into the shared transform processor using conditional where clauses, eliminating the separate metrics/postgresql pipeline. Also fix memory_limiter to be first processor in all pipelines. Signed-off-by: Arve Knudsen <[email protected]>
@puckpuck With respect to this, please see my other comment.
I added this PR to the agenda as something to discuss in the next Demo SIG meeting on 4/15 @ 11am New York time. In the spirit of specifications and standards moving forward, we want to support this, but we need to make sure it is ready enough today for us to support it.
Appreciate it @puckpuck.
We have decided not to move forward with using this yet. Ideally we can do this without the need for custom transforms. We would also like to see the Prometheus community come forward and advocate for this being the right configuration going forward. I created issue #3264 to track adding this capability in the future.
@puckpuck Could you elaborate on what this means in practice? Not sure what the expectation is. |
Changes
Switch from promoting resource attributes to metric labels to using the experimental PromQL `info()` function, which joins metrics with `target_info` at query time. This reduces metric cardinality in Prometheus by storing resource attributes once in `target_info` rather than on every metric series, aligning with Prometheus recommendations. Bear also in mind that Prometheus might in the future store OTel resource attributes as native metadata; this PR would prepare for that, because `info()` should keep working.

The motivation is that I'm a Prometheus OTLP endpoint code owner, and the `info` PromQL function creator, and would like for the demo to reflect Prometheus' recommendation to include OTel resource attributes via `info` instead of through promotion :) Relying too much on promotion has some serious downsides, e.g. high cardinality.

What changed
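To illustrate the query-time difference, a sketch with hypothetical metric and label names (`info()` matches series to `target_info` on the identifying job/instance labels):

```promql
# Before: deployment_environment must be promoted onto every series
sum by (service_name) (
  rate(http_server_request_duration_seconds_count{deployment_environment="production"}[5m])
)

# After: the attribute lives only on target_info and is joined at query time
sum by (service_name) (
  info(
    rate(http_server_request_duration_seconds_count[5m]),
    {deployment_environment="production"}
  )
)
```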
Prometheus configuration (`src/prometheus/prometheus-config.yaml`)

- Remove most promoted resource attributes, keeping only `kafka.cluster.alias`, `collector.instance.id`, and `host.name` (needed for hostmetrics which lack `service.name`)
- Upgrade Prometheus to v3.11.1 (includes `info()` bug fix prometheus/prometheus#17817)
- Enable `--enable-feature=promql-experimental-functions`
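The corresponding Prometheus settings might look roughly like this (a sketch; see the Prometheus OTLP receiver docs for the exact keys):

```yaml
# prometheus-config.yaml (sketch)
otlp:
  # Promote only attributes that cannot be joined via target_info
  promote_resource_attributes:
    - kafka.cluster.alias
    - collector.instance.id
    - host.name
```

with the feature flag passed on the Prometheus command line as `--enable-feature=promql-experimental-functions`.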
OTel Collector configuration (`src/otel-collector/otelcol-config.yml`)

- Add `resource/postgresql` processor to set `service.name` on PostgreSQL receiver metrics
- Add `transform/postgresql` processor to construct a unique `service.instance.id` per PostgreSQL resource scope (database, table, index), preventing duplicate `target_info` entries
- Add `metric_statements` to `transform/sanitize_spans` to set a default `service.instance.id` for services that don't provide one
- Add a dedicated `metrics/postgresql` pipeline
- Add `transform/sanitize_spans` to the main metrics pipeline
Docker Compose (`docker-compose.yml`, `.env`)

- Set a stable `service.instance.id` for checkout, product-catalog, and flagd (Go/flagd services that generate random IDs from the SDK)

Grafana dashboards and alerts

- Migrate dashboard and alert queries to `info()` for resource attribute access
- Dashboard variable queries use `label_values(target_info, ...)`
- Remove `pg_stat_bgwriter_*` queries from the PostgreSQL dashboard Buffers panel (pre-existing upstream issue, but made verification difficult)

Kubernetes deployment (`kubernetes/`)

- `values.yaml` with Prometheus v3.11.1, the `info()` feature flag, minimal resource attribute promotion, and collector transform `metric_statements` for PostgreSQL and duplicate `target_info` prevention
- `values-kind.yaml` with Kind-specific overrides
- `deploy.sh` for deploying to any k8s cluster (requires `--context`)
- `deploy-kind.sh` for local Kind cluster deployment
- `kind-config.yaml` with port mapping

Merge Requirements
For new feature contributions, please make sure you have completed the following essential items:

- `CHANGELOG.md` updated to document new feature additions

Maintainers will not merge until the above have been completed. If you're unsure which docs need to be changed, ping the @open-telemetry/demo-approvers.