Reliability Control Plane

Grafana App Plugin
Service-centric incident correlation and reliability analysis for Grafana.

Short Description

Reliability Control Plane is a Grafana App Plugin that correlates metrics, logs, and traces into a unified reliability analysis surface. It enables faster incident investigation by synthesizing Prometheus, Loki, and Tempo signals into actionable service-level insights.

Plugin Information

Plugin Type: Grafana App Plugin
Frontend: React (create-plugin scaffold)
Backend: Go (optional persistence APIs)
Minimum Grafana Version: 11.6.0 (as currently declared in src/plugin.json)
Signature Required: Yes (for production installation)

Overview

Reliability Control Plane provides a service-first reliability view inside Grafana by:

Querying Prometheus, Loki, and Tempo directly through Grafana datasources
Performing correlation logic in the frontend
Optionally persisting incident snapshots using a lightweight backend

The plugin does not introduce a separate observability store and does not duplicate telemetry. Reliability indicators are derived from live datasource responses.

Architecture

1. Datasource-Driven Model

The plugin relies on existing Grafana datasources:

| Signal | Datasource | Purpose |
| --- | --- | --- |
| Metrics | Prometheus | Error rate, latency, request rate, SLO burn |
| Logs | Loki | Error pattern grouping |
| Traces | Tempo | Trace extraction and dependency context |

No external APIs are called for core correlation features.
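
As an illustration of the datasource-driven model, the sketch below builds PromQL strings for the three metric signals in the table above. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), the `service` and `status` labels, and the function names are assumptions made for this example; the plugin's actual queries may differ.

```typescript
// Hypothetical metric and label names; adjust to your own instrumentation.
interface ServiceQueries {
  errorRate: string;
  requestRate: string;
  latencyP95: string;
}

// Escape backslashes and double quotes so the value is safe inside a
// double-quoted PromQL label matcher.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Build one PromQL expression per dashboard signal for a given service.
function buildServiceQueries(service: string, window = "5m"): ServiceQueries {
  const svc = escapeLabelValue(service);
  return {
    // 5xx responses per second
    errorRate: `sum(rate(http_requests_total{service="${svc}",status=~"5.."}[${window}]))`,
    // all responses per second
    requestRate: `sum(rate(http_requests_total{service="${svc}"}[${window}]))`,
    // 95th percentile latency from histogram buckets
    latencyP95: `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="${svc}"}[${window}])))`,
  };
}
```

Strings like these would then be passed to the configured Prometheus datasource through Grafana rather than sent to Prometheus directly.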

2. Frontend Correlation Engine

Correlation and scoring logic is executed in the frontend:

Root cause confidence scoring
Error pattern dominance calculation
Temporal alignment between metrics and logs
SLO burn and error budget modeling
Blast radius estimation

3. Backend (Optional Persistence Layer)

The Go backend provides APIs for:

Incident snapshots
Incident history/listing
Basic incident lifecycle management

Incident data is stored in SQLite (incidents.db).

Core Features

Unified Reliability Dashboard

Error Rate (5xx/sec)
Request Rate
Latency (p95)
SLO Burn Rate
Remaining Error Budget
Time to Exhaustion

All values are computed from live Prometheus metrics.
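
The relationship between burn rate, remaining budget, and time to exhaustion can be sketched as follows. This uses the common definition of burn rate (observed error ratio divided by the error ratio the SLO allows); the type names, field names, and the assumption that the current burn rate stays constant are illustrative and not taken from the plugin source.

```typescript
interface SloInput {
  sloTarget: number;      // e.g. 0.999 for a 99.9% availability SLO
  errorRatio: number;     // failed / total requests over the lookback window
  budgetConsumed: number; // fraction of the period's error budget already used (0..1)
  sloPeriodHours: number; // length of the SLO period, e.g. 30 * 24
}

interface SloStatus {
  burnRate: number;          // 1 means the budget lasts exactly one SLO period
  remainingBudget: number;   // fraction of budget left (0..1)
  hoursToExhaustion: number; // Infinity when nothing is burning
}

function computeSloStatus(input: SloInput): SloStatus {
  const allowedErrorRatio = 1 - input.sloTarget;
  if (allowedErrorRatio <= 0) {
    throw new Error("sloTarget must be below 1");
  }
  const burnRate = input.errorRatio / allowedErrorRatio;
  const remainingBudget = Math.min(1, Math.max(0, 1 - input.budgetConsumed));
  // At burn rate b, a full budget lasts sloPeriodHours / b hours,
  // so the remaining fraction lasts proportionally less.
  const hoursToExhaustion =
    burnRate > 0 ? (remainingBudget * input.sloPeriodHours) / burnRate : Infinity;
  return { burnRate, remainingBudget, hoursToExhaustion };
}
```

For example, a 99% SLO with a 2% observed error ratio gives a burn rate of about 2, so half of a 30-day budget would last roughly 180 hours if the rate held steady.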

Error Pattern Analysis

Groups Loki logs by route, status, and message signature
Identifies dominant failure patterns
Calculates occurrence counts
Highlights probable root cause signals

No mock data is used in runtime calculations.
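
A minimal sketch of this kind of grouping is shown below. The `LogEntry` shape, the signature normalisation rules, and the `dominance` field are assumptions for illustration; the plugin's own grouping logic may normalise messages differently.

```typescript
interface LogEntry {
  route: string;
  status: number;
  message: string;
}

interface ErrorPattern {
  route: string;
  status: number;
  signature: string;
  count: number;
  dominance: number; // share of all grouped entries, 0..1
}

// Collapse variable parts of a message so similar errors share one signature.
function messageSignature(message: string): string {
  return message
    .replace(/\b[0-9a-f]{16,}\b/gi, "<id>") // long hex identifiers
    .replace(/\d+/g, "<n>")                 // remaining numbers
    .trim();
}

// Group entries by route, status, and signature; most frequent pattern first.
function groupErrorPatterns(entries: LogEntry[]): ErrorPattern[] {
  const groups = new Map<string, ErrorPattern>();
  for (const entry of entries) {
    const signature = messageSignature(entry.message);
    const key = `${entry.route}|${entry.status}|${signature}`;
    const existing = groups.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      groups.set(key, { route: entry.route, status: entry.status, signature, count: 1, dominance: 0 });
    }
  }
  const patterns = Array.from(groups.values());
  for (const pattern of patterns) {
    pattern.dominance = pattern.count / entries.length;
  }
  return patterns.sort((a, b) => b.count - a.count);
}
```

The first element of the returned array is the dominant failure pattern, and its `dominance` value is one possible input to a confidence score.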

Root Cause Confidence Scoring

Computes a confidence percentage based on:

Pattern dominance
Signal consistency
Temporal alignment between logs and metric spikes

The score is recomputed whenever the selected time range or the underlying traffic changes.
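
One way to combine the three inputs is a weighted average, sketched below. The weights, the function names, and the definition of temporal alignment (the share of error logs that fall inside the metric spike window) are assumptions chosen for illustration, not the plugin's actual formula.

```typescript
interface ConfidenceSignals {
  patternDominance: number;  // 0..1, share of errors in the top pattern
  signalConsistency: number; // 0..1, agreement between metrics, logs, traces
  temporalAlignment: number; // 0..1, overlap of error logs with the metric spike
}

// Hypothetical weights; they sum to 1 so the result stays within 0..100.
const CONFIDENCE_WEIGHTS: ConfidenceSignals = {
  patternDominance: 0.5,
  signalConsistency: 0.2,
  temporalAlignment: 0.3,
};

function clamp01(value: number): number {
  return Number.isFinite(value) ? Math.min(1, Math.max(0, value)) : 0;
}

// Fraction of log timestamps that fall inside the spike window (inclusive).
function temporalAlignment(logTimestampsMs: number[], spikeStartMs: number, spikeEndMs: number): number {
  if (logTimestampsMs.length === 0) {
    return 0;
  }
  const inside = logTimestampsMs.filter((t) => t >= spikeStartMs && t <= spikeEndMs).length;
  return inside / logTimestampsMs.length;
}

// Weighted average of the three signals, expressed as a whole percentage.
function confidencePercent(signals: ConfidenceSignals): number {
  const score =
    CONFIDENCE_WEIGHTS.patternDominance * clamp01(signals.patternDominance) +
    CONFIDENCE_WEIGHTS.signalConsistency * clamp01(signals.signalConsistency) +
    CONFIDENCE_WEIGHTS.temporalAlignment * clamp01(signals.temporalAlignment);
  return Math.round(score * 100);
}
```

Because every input depends on the queried window, changing the time range changes the inputs and therefore the score.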

Blast Radius Estimation

Estimates service impact by analyzing:

Failure distribution
Dependency signals from traces
Affected service count
Impact percentage classification
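
The sketch below shows one possible shape for such an estimate: a service counts as affected if it reports errors itself or directly calls a service that does, and the affected share is then mapped to a level. The one-hop dependency rule, the 25% and 60% thresholds, and all names are assumptions for illustration only.

```typescript
type ImpactLevel = "low" | "medium" | "high";

interface BlastRadius {
  affectedServices: string[];
  impactPercent: number;
  level: ImpactLevel;
}

function estimateBlastRadius(
  allServices: string[],
  errorCounts: Record<string, number>,
  callEdges: Array<[string, string]> // [caller, callee] pairs, e.g. derived from traces
): BlastRadius {
  // Services that are failing themselves.
  const failing = new Set<string>();
  for (const service of allServices) {
    if ((errorCounts[service] ?? 0) > 0) {
      failing.add(service);
    }
  }
  // Add direct callers of failing services (one hop only).
  const affected = new Set<string>(failing);
  for (const [caller, callee] of callEdges) {
    if (failing.has(callee)) {
      affected.add(caller);
    }
  }
  // Keep only known services, in their original order.
  const affectedServices = allServices.filter((s) => affected.has(s));
  const impactPercent =
    allServices.length === 0 ? 0 : (affectedServices.length / allServices.length) * 100;
  // Hypothetical classification thresholds.
  const level: ImpactLevel = impactPercent >= 60 ? "high" : impactPercent >= 25 ? "medium" : "low";
  return { affectedServices, impactPercent, level };
}
```

A real implementation might walk the dependency graph further than one hop or weight services by traffic; this version only shows how failure distribution and trace-derived dependencies can feed a single impact percentage.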

Distributed Trace Integration

Extracts Trace IDs from logs
Displays trace metadata
Provides deep links to Tempo in Grafana Explore
Parses and displays trace durations

No simulated traces are generated.
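
Trace ID extraction from a log line can be sketched as below. It assumes 32-character hexadecimal trace IDs (the W3C Trace Context format) and tries a few common JSON field names before falling back to a pattern match on the raw line; the field names and the function name are assumptions, not the plugin's documented behavior.

```typescript
// A 32-hex-character trace ID delimited by word boundaries.
const TRACE_ID_PATTERN = /\b[0-9a-f]{32}\b/i;

// Hypothetical field names that structured loggers commonly use.
const TRACE_ID_FIELDS = ["trace_id", "traceId", "traceID"];

function extractTraceId(line: string): string | undefined {
  // Prefer an explicit field when the line is a JSON object.
  try {
    const parsed: unknown = JSON.parse(line);
    if (parsed !== null && typeof parsed === "object") {
      const record = parsed as Record<string, unknown>;
      for (const field of TRACE_ID_FIELDS) {
        const value = record[field];
        if (typeof value === "string" && TRACE_ID_PATTERN.test(value)) {
          return value.toLowerCase();
        }
      }
    }
  } catch {
    // Not JSON; fall through to the pattern match below.
  }
  // Fallback: first 32-hex token anywhere in the raw line.
  const match = line.match(TRACE_ID_PATTERN);
  return match ? match[0].toLowerCase() : undefined;
}
```

Lines without a recognisable trace ID return `undefined`, so callers can simply skip them rather than link to a trace that does not exist.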

Requirements

Grafana 11.6.0+ (as currently declared in plugin manifest)
Prometheus datasource configured
Loki datasource configured
Tempo datasource configured

Optional:

OpenTelemetry instrumentation
Structured JSON logs with trace IDs

Reviewer Quick Start (5-Minute Setup)

A complete Docker-based test environment is included.

1. Build the Plugin

npm install
npm run build

2. Start the Full Test Stack

docker compose up --build

This starts:

Grafana
Prometheus
Loki
Tempo
Sample application
Traffic generator

3. Access Grafana

URL: http://localhost:3000
Username: admin
Password: admin

Navigate to:

Apps -> Reliability Control Plane -> Services

Select sample-app from the service dropdown and allow ~15-30 seconds for initial scraping.

Included Provisioning Resources

| Component | Location | Purpose |
| --- | --- | --- |
| Datasources | `provisioning/datasources/` | Auto-configures Prometheus, Loki, Tempo |
| Plugin Provisioning | `provisioning/plugins/` | Enables the App plugin on startup |
| Sample App | `sample-app/` | Generates metrics, logs, and traces |
| Docker Stack | `docker-compose.yaml` | Reproducible local test environment |

Sample Application Behavior

The included sample app provides:

GET / -> 200 responses
GET /error -> 500 responses
GET /metrics -> Prometheus metrics export

A traffic generator produces mixed traffic to demonstrate:

Error spikes
Pattern grouping
Trace extraction
SLO burn adjustments

Traffic Generation and Testing Flow

Auto Traffic (Default)

Traffic is generated automatically by traffic-generator in Docker Compose.

docker compose ps traffic-generator
docker compose logs -f traffic-generator

Manual Traffic (Optional)

# success traffic
for i in $(seq 1 40); do curl -sS -o /dev/null http://localhost:4000/; done

# error traffic
for i in $(seq 1 40); do curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:4000/error; done

App Selection and Validation

Open plugin page: Apps -> Reliability Control Plane -> Services
Select sample-app
Wait 30-90 seconds for ingestion
Verify:
- Error Rate / Request Rate / Latency p95 update
- Loki error logs visible
- Tempo traces visible with Explore links
- Root Cause Confidence recalculates
- SLO Burn Rate and Remaining Budget update
- Blast Radius impact updates

Development Workflow

# Frontend dev
npm run dev

# Unit tests
npm run test

# CI-safe unit tests
npm run test:ci

# Type checks
npm run typecheck

# End-to-end tests
npm run e2e

# Sign plugin (required for production distribution)
export GRAFANA_API_KEY=...
npm run sign

Security and Compliance

No mock runtime data in reliability calculations
No external API dependency for core plugin behavior
No embedded credentials required for operation
No telemetry exfiltration by design
No eval or dynamic code execution
No Angular
Queries are executed through Grafana datasource access patterns
Incident writes are role-restricted (Editor/Admin) and org-scoped

Testing Guidance

To validate behavior end-to-end:

Generate error traffic (/error) using auto or manual traffic
Observe:
- Error rate increases
- Pattern occurrence counts increase
- Confidence score recalculates
- Burn rate updates
- Blast radius adjusts
Change Grafana time range
Confirm recomputation across all panels

If traffic is stopped, panels should degrade gracefully (lower rates / empty states depending on selected window).

Codebase Structure

/src            Frontend React application
/pkg            Go backend (incident APIs)
/sample-app     Test application
/observability  Prometheus, Loki, Tempo configs
/provisioning   Datasource and plugin provisioning

Versioning

Semantic Versioning is used:

MAJOR.MINOR.PATCH

Breaking changes increment MAJOR
Backward-compatible features increment MINOR
Fixes increment PATCH

License

Licensed under Apache 2.0.
