Grafana App Plugin
Service-centric incident correlation and reliability analysis for Grafana.
Reliability Control Plane is a Grafana App Plugin that correlates metrics, logs, and traces into a unified reliability analysis surface. It enables faster incident investigation by synthesizing Prometheus, Loki, and Tempo signals into actionable service-level insights.
- Plugin Type: Grafana App Plugin
- Frontend: React (create-plugin scaffold)
- Backend: Go (optional persistence APIs)
- Minimum Grafana Version (current manifest): 11.6.0+ (src/plugin.json)
- Signature Required: Yes (for production installation)
Reliability Control Plane provides a service-first reliability view inside Grafana by:
- Querying Prometheus, Loki, and Tempo directly through Grafana datasources
- Performing correlation logic in the frontend
- Optionally persisting incident snapshots using a lightweight backend
The plugin does not introduce a separate observability store and does not duplicate telemetry. Reliability indicators are derived from live datasource responses.
The plugin relies on existing Grafana datasources:
| Signal | Datasource | Purpose |
|---|---|---|
| Metrics | Prometheus | Error rate, latency, request rate, SLO burn |
| Logs | Loki | Error pattern grouping |
| Traces | Tempo | Trace extraction and dependency context |
No external APIs are called for core correlation features.
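As an illustration of how the metric signals in the table above could be expressed, the following sketch builds PromQL strings for a single service. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), the `service` and `status` labels, and the `buildServiceQueries` helper are assumptions made for this example; they are not taken from the plugin source and should be adapted to your instrumentation.

```typescript
// Hypothetical metric and label names; the plugin's actual queries may differ.
interface ServiceQueries {
  errorRate: string;
  requestRate: string;
  latencyP95: string;
}

function buildServiceQueries(service: string, range = '5m'): ServiceQueries {
  // Escape backslashes and double quotes so the label value stays a valid PromQL string.
  const svc = service.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  const sel = `service="${svc}"`;
  return {
    // 5xx responses per second
    errorRate: `sum(rate(http_requests_total{${sel},status=~"5.."}[${range}]))`,
    // all requests per second
    requestRate: `sum(rate(http_requests_total{${sel}}[${range}]))`,
    // 95th percentile latency from a classic histogram
    latencyP95: `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{${sel}}[${range}])))`,
  };
}
```

Calling `buildServiceQueries('sample-app')` returns three query strings that could be sent to a Prometheus datasource.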
Correlation and scoring logic is executed in the frontend:
- Root cause confidence scoring
- Error pattern dominance calculation
- Temporal alignment between metrics and logs
- SLO burn and error budget modeling
- Blast radius estimation
The Go backend provides APIs for:
- Incident snapshots
- Incident history/listing
- Basic incident lifecycle management
Incident data is stored in SQLite (incidents.db).
- Error Rate (5xx/sec)
- Request Rate
- Latency (p95)
- SLO Burn Rate
- Remaining Error Budget
- Time to Exhaustion
All values are computed from live Prometheus metrics.
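The relationship between burn rate, remaining error budget, and time to exhaustion can be sketched as follows. This uses the common definition of burn rate (observed error ratio divided by the error ratio the SLO allows); the plugin's exact formulas are not documented here, and the `SloInput` shape and `evaluateSlo` name are assumptions for illustration.

```typescript
// Illustrative SLO math; field names and the function are hypothetical.
interface SloInput {
  sloTarget: number;      // e.g. 0.999 for "99.9% of requests succeed"
  windowHours: number;    // length of the SLO window, e.g. 30 * 24
  errorRatio: number;     // failed / total requests over the current lookback
  budgetConsumed: number; // fraction of the window's error budget already spent (0..1)
}

interface SloStatus {
  burnRate: number;          // 1 means the budget lasts exactly one SLO window
  remainingBudget: number;   // fraction of the budget left (0..1)
  hoursToExhaustion: number; // Infinity when nothing is burning
}

function evaluateSlo(input: SloInput): SloStatus {
  const allowedErrorRatio = 1 - input.sloTarget;
  let burnRate = 0;
  if (input.errorRatio > 0) {
    burnRate = allowedErrorRatio > 0 ? input.errorRatio / allowedErrorRatio : Infinity;
  }
  const remainingBudget = Math.min(1, Math.max(0, 1 - input.budgetConsumed));
  // At burn rate B the full budget lasts windowHours / B, so the remaining
  // fraction lasts remainingBudget * windowHours / B.
  const hoursToExhaustion =
    burnRate > 0 ? (remainingBudget * input.windowHours) / burnRate : Infinity;
  return { burnRate, remainingBudget, hoursToExhaustion };
}
```

For example, with a 99% target over a 720-hour window, a current error ratio of 2% and half the budget spent, the burn rate is about 2 and the remaining budget lasts roughly 180 hours.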
- Groups Loki logs by route, status, and message signature
- Identifies dominant failure patterns
- Calculates occurrence counts
- Highlights probable root cause signals
No mock data is used in runtime calculations.
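A minimal sketch of grouping by route, status, and message signature is shown below. It assumes log lines have already been parsed into a `LogEntry` shape; the signature rule (replacing hex IDs and numbers with placeholders) and all names are assumptions for illustration, not the plugin's actual implementation.

```typescript
// Hypothetical shapes; the plugin's parsed log format may differ.
interface LogEntry {
  route: string;
  status: number;
  message: string;
}

interface ErrorPattern {
  route: string;
  status: number;
  signature: string;
  count: number;
  dominance: number; // share of all entries that fall into this pattern (0..1)
}

// Replace volatile tokens so that similar messages collapse into one signature.
function messageSignature(message: string): string {
  return message
    .replace(/\b[0-9a-f]{16,32}\b/gi, '<id>')
    .replace(/\d+/g, '<n>')
    .trim();
}

function groupPatterns(entries: LogEntry[]): ErrorPattern[] {
  const byKey: Record<string, ErrorPattern> = {};
  for (const e of entries) {
    const signature = messageSignature(e.message);
    const key = `${e.route}|${e.status}|${signature}`;
    const existing = byKey[key];
    if (existing) {
      existing.count += 1;
    } else {
      byKey[key] = { route: e.route, status: e.status, signature, count: 1, dominance: 0 };
    }
  }
  const patterns = Object.keys(byKey).map((k) => byKey[k]);
  for (const p of patterns) {
    p.dominance = p.count / entries.length;
  }
  // Most frequent pattern first: the dominant failure pattern.
  return patterns.sort((a, b) => b.count - a.count);
}
```

The first element of the returned array is the dominant pattern; its `count` is the occurrence count and its `dominance` is the share of all grouped entries.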
Computes a confidence percentage based on:
- Pattern dominance
- Signal consistency
- Temporal alignment between logs and metric spikes
The score updates with time range and traffic changes.
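One way to combine the three inputs into a percentage is a weighted sum, sketched below. The weights, the 60-second alignment tolerance, and the function names are assumptions chosen for the example; the plugin's real scoring model may weigh or compute these signals differently.

```typescript
// Hypothetical inputs, each normalised to the range 0..1.
interface ConfidenceSignals {
  patternDominance: number;
  signalConsistency: number;
  temporalAlignment: number;
}

// Assumed weights for illustration only; they sum to 1.
const WEIGHTS: ConfidenceSignals = {
  patternDominance: 0.5,
  signalConsistency: 0.2,
  temporalAlignment: 0.3,
};

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Fraction of error-log timestamps that fall within toleranceMs of any metric spike.
function temporalAlignment(logTimesMs: number[], spikeTimesMs: number[], toleranceMs = 60000): number {
  if (logTimesMs.length === 0 || spikeTimesMs.length === 0) {
    return 0;
  }
  const aligned = logTimesMs.filter((t) =>
    spikeTimesMs.some((s) => Math.abs(s - t) <= toleranceMs)
  );
  return aligned.length / logTimesMs.length;
}

function confidencePercent(s: ConfidenceSignals): number {
  const score =
    WEIGHTS.patternDominance * clamp01(s.patternDominance) +
    WEIGHTS.signalConsistency * clamp01(s.signalConsistency) +
    WEIGHTS.temporalAlignment * clamp01(s.temporalAlignment);
  return Math.round(score * 100);
}
```

Because every input depends on the data in the selected window, a score computed this way changes whenever the time range or traffic mix changes.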
Estimates service impact by analyzing:
- Failure distribution
- Dependency signals from traces
- Affected service count
- Impact percentage classification
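
The estimate above can be sketched as a count of services whose error ratio exceeds a threshold, expressed as a share of all observed services. The `ServiceStats` shape, the 1% threshold, and the low/medium/high cut-offs are assumptions for illustration; per-service counts could come from metrics or from trace spans.

```typescript
type ImpactLevel = 'low' | 'medium' | 'high';

// Hypothetical per-service failure distribution.
interface ServiceStats {
  service: string;
  errors: number;
  requests: number;
}

interface BlastRadius {
  affectedServices: string[];
  impactPercent: number;
  level: ImpactLevel;
}

function estimateBlastRadius(stats: ServiceStats[], minErrorRatio = 0.01): BlastRadius {
  // A service counts as affected when its error ratio reaches the threshold.
  const affectedServices = stats
    .filter((s) => s.requests > 0 && s.errors / s.requests >= minErrorRatio)
    .map((s) => s.service);
  const impactPercent = stats.length > 0 ? (affectedServices.length / stats.length) * 100 : 0;
  // Assumed cut-offs: under 10% low, under 50% medium, otherwise high.
  const level: ImpactLevel = impactPercent >= 50 ? 'high' : impactPercent >= 10 ? 'medium' : 'low';
  return { affectedServices, impactPercent, level };
}
```

With four services of which one exceeds the threshold, this yields one affected service, a 25% impact, and a `medium` classification.
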
- Extracts Trace IDs from logs
- Displays trace metadata
- Provides deep links to Tempo in Grafana Explore
- Parses and displays trace durations
No simulated traces are generated.
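Trace ID extraction and duration parsing could look roughly like the sketch below. It assumes trace IDs appear in log lines as a `trace_id`, `traceId`, or `traceID` field holding 16 or 32 hex characters, and that durations arrive as strings such as `250ms` or `1.5s`; both formats and the helper names are assumptions, not confirmed behavior of the plugin.

```typescript
// Matches JSON ("trace_id":"...") and logfmt (traceID=...) style fields.
const TRACE_ID_RE = /trace[_-]?id["']?\s*[:=]\s*["']?([0-9a-f]{32}|[0-9a-f]{16})\b/i;

function extractTraceId(line: string): string | undefined {
  const m = TRACE_ID_RE.exec(line);
  return m ? m[1].toLowerCase() : undefined;
}

// Milliseconds per supported unit (assumed set of units).
const UNIT_MS: Record<string, number> = { us: 0.001, ms: 1, s: 1000, m: 60000 };

function parseDurationMs(text: string): number | undefined {
  const m = /^(\d+(?:\.\d+)?)(us|ms|s|m)$/.exec(text.trim());
  return m ? parseFloat(m[1]) * UNIT_MS[m[2]] : undefined;
}
```

Lines without a recognisable trace ID, or durations in an unexpected format, return `undefined` so the caller can show an empty state instead of a guessed value.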
- Grafana 11.6.0+ (as currently declared in plugin manifest)
- Prometheus datasource configured
- Loki datasource configured
- Tempo datasource configured
Optional:
- OpenTelemetry instrumentation
- Structured JSON logs with trace IDs
A complete Docker-based test environment is included.
npm install
npm run build
docker compose up --build

This starts:
- Grafana
- Prometheus
- Loki
- Tempo
- Sample application
- Traffic generator
- URL: http://localhost:3000
- Username: admin
- Password: admin
Navigate to:
Apps -> Reliability Control Plane -> Services
Select sample-app from the service dropdown and allow ~15-30 seconds for initial scraping.
| Component | Location | Purpose |
|---|---|---|
| Datasources | provisioning/datasources/ | Auto-configures Prometheus, Loki, Tempo |
| Plugin Provisioning | provisioning/plugins/ | Enables the App plugin on startup |
| Sample App | sample-app/ | Generates metrics, logs, and traces |
| Docker Stack | docker-compose.yaml | Reproducible local test environment |
The included sample app provides:
- GET / -> 200 responses
- GET /error -> 500 responses
- GET /metrics -> Prometheus metrics export
A traffic generator produces mixed traffic to demonstrate:
- Error spikes
- Pattern grouping
- Trace extraction
- SLO burn adjustments
Traffic is generated automatically by traffic-generator in Docker Compose.
docker compose ps traffic-generator
docker compose logs -f traffic-generator

# success traffic
for i in $(seq 1 40); do curl -sS -o /dev/null http://localhost:4000/; done
# error traffic
for i in $(seq 1 40); do curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:4000/error; done

- Open plugin page: Apps -> Reliability Control Plane -> Services
- Select sample-app
- Wait 30-90 seconds for ingestion
- Verify:
  - Error Rate / Request Rate / Latency p95 update
  - Loki error logs visible
  - Tempo traces visible with Explore links
  - Root Cause Confidence recalculates
  - SLO Burn Rate and Remaining Budget update
  - Blast Radius impact updates
# Frontend dev
npm run dev
# Unit tests
npm run test
# CI-safe unit tests
npm run test:ci
# Type checks
npm run typecheck
# End-to-end tests
npm run e2e
# Sign plugin (required for production distribution)
export GRAFANA_API_KEY=...
npm run sign

- No mock runtime data in reliability calculations
- No external API dependency for core plugin behavior
- No embedded credentials required for operation
- No telemetry exfiltration by design
- No eval or dynamic code execution
- No Angular
- Queries are executed through Grafana datasource access patterns
- Incident writes are role-restricted (Editor/Admin) and org-scoped
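
The role and org restriction on incident writes can be illustrated with a small check. The actual enforcement lives in the Go backend and is not shown in this document; the `Caller` and `Incident` shapes and both function names below are hypothetical and only express the rule in the frontend's language.

```typescript
type OrgRole = 'None' | 'Viewer' | 'Editor' | 'Admin';

// Hypothetical shapes for the requesting user and a stored incident.
interface Caller {
  orgId: number;
  role: OrgRole;
}

interface Incident {
  id: string;
  orgId: number;
  service: string;
}

// Writes require Editor or Admin and must target the caller's own org.
function canWriteIncident(caller: Caller, incident: Incident): boolean {
  const roleAllowed = caller.role === 'Editor' || caller.role === 'Admin';
  return roleAllowed && caller.orgId === incident.orgId;
}

// Listing is org-scoped: callers only see incidents from their own org.
function visibleIncidents(caller: Caller, all: Incident[]): Incident[] {
  return all.filter((i) => i.orgId === caller.orgId);
}
```
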
To validate behavior end-to-end:
- Generate error traffic (/error) using auto or manual traffic
- Observe:
  - Error rate increases
  - Pattern occurrence counts increase
  - Confidence score recalculates
  - Burn rate updates
  - Blast radius adjusts
- Change Grafana time range
- Confirm recomputation across all panels
If traffic is stopped, panels should degrade gracefully (lower rates / empty states depending on selected window).
/src Frontend React application
/pkg Go backend (incident APIs)
/sample-app Test application
/observability Prometheus, Loki, Tempo configs
/provisioning Datasource and plugin provisioning
Semantic Versioning is used:
MAJOR.MINOR.PATCH
- Breaking changes increment MAJOR
- Backward-compatible features increment MINOR
- Fixes increment PATCH
Licensed under Apache 2.0.