sarika-03/reliability-control-plane

Reliability Control Plane

Grafana App Plugin
Service-centric incident correlation and reliability analysis for Grafana.


Short Description

Reliability Control Plane is a Grafana App Plugin that correlates metrics, logs, and traces into a unified reliability analysis surface. It enables faster incident investigation by synthesizing Prometheus, Loki, and Tempo signals into actionable service-level insights.


Plugin Information

  • Plugin Type: Grafana App Plugin
  • Frontend: React (create-plugin scaffold)
  • Backend: Go (optional persistence APIs)
  • Minimum Grafana Version: 11.6.0 (as currently declared in src/plugin.json)
  • Signature Required: Yes (for production installation)

Overview

Reliability Control Plane provides a service-first reliability view inside Grafana by:

  • Querying Prometheus, Loki, and Tempo directly through Grafana datasources
  • Performing correlation logic in the frontend
  • Optionally persisting incident snapshots using a lightweight backend

The plugin does not introduce a separate observability store and does not duplicate telemetry. Reliability indicators are derived from live datasource responses.


Architecture

1. Datasource-Driven Model

The plugin relies on existing Grafana datasources:

| Signal | Datasource | Purpose |
| --- | --- | --- |
| Metrics | Prometheus | Error rate, latency, request rate, SLO burn |
| Logs | Loki | Error pattern grouping |
| Traces | Tempo | Trace extraction and dependency context |

No external APIs are called for core correlation features.
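
To make the signal-to-datasource mapping more concrete, the sketch below builds the kind of PromQL and LogQL expressions such a plugin could send through Grafana datasources. It is not taken from the plugin source: the helper `buildServiceQueries`, the metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), and the `service` and `status` labels are all assumptions chosen for illustration.

```typescript
// Hypothetical query builders. Metric and label names are illustrative
// assumptions, not the plugin's actual queries.
interface ServiceQueries {
  errorRate: string;   // PromQL: 5xx responses per second
  requestRate: string; // PromQL: all requests per second
  latencyP95: string;  // PromQL: 95th percentile latency from a histogram
  errorLogs: string;   // LogQL: log lines containing "error"
}

function buildServiceQueries(service: string, rateWindow: string = "5m"): ServiceQueries {
  // Escape backslashes and double quotes so the name is a valid label value.
  const svc = service.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return {
    errorRate: `sum(rate(http_requests_total{service="${svc}",status=~"5.."}[${rateWindow}]))`,
    requestRate: `sum(rate(http_requests_total{service="${svc}"}[${rateWindow}]))`,
    latencyP95:
      `histogram_quantile(0.95, sum by (le) ` +
      `(rate(http_request_duration_seconds_bucket{service="${svc}"}[${rateWindow}])))`,
    errorLogs: `{service="${svc}"} |= "error"`,
  };
}

// Usage: build the expressions for the bundled sample service.
const sampleQueries = buildServiceQueries("sample-app");
console.log(sampleQueries.errorRate);
```

Keeping the expressions in one place makes it easier to adapt them to the metric names a given service actually exports.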

2. Frontend Correlation Engine

Correlation and scoring logic is executed in the frontend:

  • Root cause confidence scoring
  • Error pattern dominance calculation
  • Temporal alignment between metrics and logs
  • SLO burn and error budget modeling
  • Blast radius estimation
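
One plausible way to express "temporal alignment between metrics and logs" is a correlation between two equally spaced series, for example a per-minute error rate and per-minute error log counts. The sketch below uses a Pearson correlation. This is an assumed technique for illustration only; the README does not state which alignment measure the plugin uses, and `temporalAlignment` is a hypothetical name.

```typescript
// Pearson correlation between two equally spaced series.
// Returns a value in [-1, 1], or 0 when the inputs cannot be compared
// (fewer than two points, or one series is constant).
function temporalAlignment(metricSeries: number[], logCounts: number[]): number {
  const n = Math.min(metricSeries.length, logCounts.length);
  if (n < 2) {
    return 0;
  }
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += metricSeries[i];
    sumY += logCounts[i];
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = metricSeries[i] - meanX;
    const dy = logCounts[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) {
    return 0;
  }
  return cov / Math.sqrt(varX * varY);
}
```

A value near 1 suggests that log bursts and metric spikes rise and fall together in the selected window; values near 0 suggest the two signals are unrelated.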

3. Backend (Optional Persistence Layer)

The Go backend provides APIs for:

  • Incident snapshots
  • Incident history/listing
  • Basic incident lifecycle management

Incident data is stored in SQLite (incidents.db).
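
The README does not describe the snapshot schema or the lifecycle states, so the following is only a sketch of what "basic incident lifecycle management" could look like from the frontend's point of view. The `IncidentSnapshot` fields, the three states, and the allowed transitions are all hypothetical.

```typescript
// Hypothetical lifecycle model. States, fields, and transitions are assumptions.
type IncidentState = "open" | "acknowledged" | "resolved";

interface IncidentSnapshot {
  id: string;
  service: string;
  state: IncidentState;
  createdAt: number;  // epoch milliseconds
  errorRate: number;  // errors per second at capture time
  confidence: number; // root cause confidence, 0 to 100
}

const allowedTransitions: Record<IncidentState, IncidentState[]> = {
  open: ["acknowledged", "resolved"],
  acknowledged: ["resolved"],
  resolved: [],
};

function transitionIncident(snapshot: IncidentSnapshot, next: IncidentState): IncidentSnapshot {
  if (allowedTransitions[snapshot.state].indexOf(next) === -1) {
    throw new Error(`invalid transition: ${snapshot.state} -> ${next}`);
  }
  // Return a copy so stored history entries are never mutated in place.
  return { ...snapshot, state: next };
}
```

Making transitions explicit keeps invalid updates, such as reopening a resolved incident, from reaching the persistence layer.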


Core Features

Unified Reliability Dashboard

  • Error Rate (5xx/sec)
  • Request Rate
  • Latency (p95)
  • SLO Burn Rate
  • Remaining Error Budget
  • Time to Exhaustion

All values are computed from live Prometheus metrics.
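
The three SLO figures are related by simple arithmetic. The sketch below uses the common definitions: burn rate is the observed error ratio divided by the error ratio the SLO allows, and a burn rate of 1 consumes the whole budget in exactly one SLO window. Whether the plugin uses these exact formulas is an assumption, and the function names are hypothetical.

```typescript
// Burn rate: observed error ratio divided by the allowed error ratio.
function burnRate(errorRatePerSec: number, requestRatePerSec: number, sloTarget: number): number {
  const allowed = 1 - sloTarget; // e.g. 0.01 for a 99% SLO
  if (requestRatePerSec <= 0 || allowed <= 0) {
    return 0;
  }
  return errorRatePerSec / requestRatePerSec / allowed;
}

// Remaining budget as a fraction in [0, 1], from totals over the SLO window so far.
function remainingBudget(totalRequests: number, failedRequests: number, sloTarget: number): number {
  const allowed = 1 - sloTarget;
  if (totalRequests <= 0) {
    return 1; // no traffic, nothing consumed
  }
  if (allowed <= 0) {
    return 0; // a 100% target leaves no budget
  }
  const consumed = failedRequests / totalRequests / allowed;
  return Math.max(0, 1 - consumed);
}

// Hours until the budget is gone if the current burn rate is sustained.
function hoursToExhaustion(remaining: number, burn: number, windowHours: number): number {
  return burn > 0 ? (remaining * windowHours) / burn : Infinity;
}

// Usage: 99% SLO over 30 days, 0.5 errors/s out of 100 requests/s.
const currentBurn = burnRate(0.5, 100, 0.99);
const budgetLeft = remainingBudget(10000, 25, 0.99);
console.log(currentBurn, budgetLeft, hoursToExhaustion(budgetLeft, currentBurn, 720));
```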

Error Pattern Analysis

  • Groups Loki logs by route, status, and message signature
  • Identifies dominant failure patterns
  • Calculates occurrence counts
  • Highlights probable root cause signals

No mock data is used in runtime calculations.
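
Grouping by route, status, and message signature can be sketched as follows. The normalization rules in `messageSignature` (collapsing long hex identifiers and numbers) and the `LogEntry` shape are assumptions for illustration; the plugin's actual signature logic is not described in this README.

```typescript
interface LogEntry {
  route: string;
  status: number;
  message: string;
}

interface ErrorPattern {
  route: string;
  status: number;
  signature: string;
  count: number;
  share: number; // fraction of all entries, a simple dominance measure
}

// Collapse variable parts of a message so similar errors share one signature.
function messageSignature(message: string): string {
  return message
    .replace(/\b[0-9a-f]{8,}\b/gi, "<id>")
    .replace(/\d+/g, "<n>")
    .trim();
}

function groupErrorPatterns(entries: LogEntry[]): ErrorPattern[] {
  const byKey: Record<string, ErrorPattern> = {};
  for (const e of entries) {
    const signature = messageSignature(e.message);
    const key = `${e.route}|${e.status}|${signature}`;
    if (!byKey[key]) {
      byKey[key] = { route: e.route, status: e.status, signature, count: 0, share: 0 };
    }
    byKey[key].count += 1;
  }
  const patterns = Object.keys(byKey).map((k) => byKey[k]);
  patterns.forEach((p) => {
    p.share = p.count / entries.length;
  });
  // Most frequent pattern first: the dominant failure pattern.
  return patterns.sort((a, b) => b.count - a.count);
}
```

The first element of the result is the dominant pattern, and its `share` can feed a dominance-based score.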

Root Cause Confidence Scoring

Computes a confidence percentage based on:

  • Pattern dominance
  • Signal consistency
  • Temporal alignment between logs and metric spikes

The score is recomputed whenever the selected time range or the observed traffic changes.
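
A minimal sketch of how the three inputs could be combined into a percentage is a clamped weighted average. The weights (0.4, 0.3, 0.3) and the names `ConfidenceSignals` and `confidencePercent` are assumptions; the README does not specify the actual scoring formula.

```typescript
interface ConfidenceSignals {
  dominance: number;   // share of the dominant error pattern, 0 to 1
  consistency: number; // agreement between signals, 0 to 1
  alignment: number;   // temporal alignment of logs and metric spikes, 0 to 1
}

// Assumed weights, chosen only for illustration.
const defaultWeights: ConfidenceSignals = { dominance: 0.4, consistency: 0.3, alignment: 0.3 };

function clamp01(x: number): number {
  return isNaN(x) ? 0 : Math.min(1, Math.max(0, x));
}

function confidencePercent(s: ConfidenceSignals, w: ConfidenceSignals = defaultWeights): number {
  const totalWeight = w.dominance + w.consistency + w.alignment;
  if (totalWeight <= 0) {
    return 0;
  }
  const score =
    (w.dominance * clamp01(s.dominance) +
      w.consistency * clamp01(s.consistency) +
      w.alignment * clamp01(s.alignment)) /
    totalWeight;
  return Math.round(score * 100);
}

// Usage: a strongly dominant pattern with moderate consistency.
console.log(confidencePercent({ dominance: 0.8, consistency: 0.5, alignment: 0.9 }));
```

Clamping each input keeps a single noisy signal, such as a negative alignment value, from pushing the result outside the 0 to 100 range.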

Blast Radius Estimation

Estimates service impact by analyzing:

  • Failure distribution
  • Dependency signals from traces
  • Affected service count
  • Impact percentage classification
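
The affected service count and the impact classification can be illustrated with a small helper. The thresholds (below 25% is low, below 60% is medium, otherwise high) and the name `estimateBlastRadius` are assumptions, not values taken from the plugin.

```typescript
type ImpactLevel = "none" | "low" | "medium" | "high";

interface BlastRadius {
  affected: number; // number of known services seen in failing traces
  percent: number;  // affected services as a percentage of all known services
  level: ImpactLevel;
}

// failingServices: service names observed in failing traces.
// allServices: unique names of all known services.
function estimateBlastRadius(failingServices: string[], allServices: string[]): BlastRadius {
  const failing: Record<string, boolean> = {};
  failingServices.forEach((s) => {
    failing[s] = true;
  });
  const affected = allServices.filter((s) => failing[s] === true).length;
  const percent = allServices.length === 0 ? 0 : (affected * 100) / allServices.length;
  let level: ImpactLevel = "high";
  if (percent === 0) {
    level = "none";
  } else if (percent < 25) {
    level = "low";
  } else if (percent < 60) {
    level = "medium";
  }
  return { affected, percent, level };
}
```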

Distributed Trace Integration

  • Extracts Trace IDs from logs
  • Displays trace metadata
  • Provides deep links to Tempo in Grafana Explore
  • Parses and displays trace durations

No simulated traces are generated.
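
Trace ID extraction and duration parsing can be sketched as below. The accepted field spellings (`trace_id`, `traceId`, `traceID`), the 16 to 32 character hex ID format, and the single-unit duration strings are assumptions about the log and trace payloads, and both helper names are hypothetical.

```typescript
// Pull unique trace IDs out of JSON or logfmt style log lines.
function extractTraceIds(logLines: string[]): string[] {
  const pattern = /trace_?id["']?\s*[:=]\s*["']?([0-9a-f]{16,32})\b/i;
  const seen: Record<string, boolean> = {};
  const ids: string[] = [];
  for (const line of logLines) {
    const m = pattern.exec(line);
    if (m) {
      const id = m[1].toLowerCase();
      if (!seen[id]) {
        seen[id] = true;
        ids.push(id);
      }
    }
  }
  return ids;
}

const msPerUnit: Record<string, number> = {
  ns: 1e-6, us: 1e-3, "µs": 1e-3, ms: 1, s: 1000, m: 60000, h: 3600000,
};

// Parse a single-unit duration such as "250ms" or "1.5s" into milliseconds.
// Returns null for anything it does not recognize.
function parseDurationMs(text: string): number | null {
  const m = /^(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)$/.exec(text.trim());
  if (!m) {
    return null;
  }
  return parseFloat(m[1]) * msPerUnit[m[2]];
}
```

Lines without a recognizable trace ID are skipped, which matches the expectation that trace context in logs is optional.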


Requirements

  • Grafana 11.6.0 or later (as currently declared in the plugin manifest)
  • Prometheus datasource configured
  • Loki datasource configured
  • Tempo datasource configured

Optional:

  • OpenTelemetry instrumentation
  • Structured JSON logs with trace IDs

Reviewer Quick Start (5-Minute Setup)

A complete Docker-based test environment is included.

1. Build the Plugin

npm install
npm run build

2. Start the Full Test Stack

docker compose up --build

This starts:

  • Grafana
  • Prometheus
  • Loki
  • Tempo
  • Sample application
  • Traffic generator

3. Access Grafana

Navigate to:

Apps -> Reliability Control Plane -> Services

Select sample-app from the service dropdown and allow ~15-30 seconds for initial scraping.


Included Provisioning Resources

| Component | Location | Purpose |
| --- | --- | --- |
| Datasources | provisioning/datasources/ | Auto-configures Prometheus, Loki, Tempo |
| Plugin Provisioning | provisioning/plugins/ | Enables the App plugin on startup |
| Sample App | sample-app/ | Generates metrics, logs, and traces |
| Docker Stack | docker-compose.yaml | Reproducible local test environment |

Sample Application Behavior

The included sample app provides:

  • GET / -> 200 responses
  • GET /error -> 500 responses
  • GET /metrics -> Prometheus metrics export

A traffic generator produces mixed traffic to demonstrate:

  • Error spikes
  • Pattern grouping
  • Trace extraction
  • SLO burn adjustments

Traffic Generation and Testing Flow

Auto Traffic (Default)

Traffic is generated automatically by traffic-generator in Docker Compose.

docker compose ps traffic-generator
docker compose logs -f traffic-generator

Manual Traffic (Optional)

# success traffic
for i in $(seq 1 40); do curl -sS -o /dev/null http://localhost:4000/; done

# error traffic
for i in $(seq 1 40); do curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:4000/error; done

App Selection and Validation

  1. Open plugin page: Apps -> Reliability Control Plane -> Services
  2. Select sample-app
  3. Wait 30-90 seconds for ingestion
  4. Verify:
    • Error Rate / Request Rate / Latency p95 update
    • Loki error logs visible
    • Tempo traces visible with Explore links
    • Root Cause Confidence recalculates
    • SLO Burn Rate and Remaining Budget update
    • Blast Radius impact updates

Development Workflow

# Frontend dev
npm run dev

# Unit tests
npm run test

# CI-safe unit tests
npm run test:ci

# Type checks
npm run typecheck

# End-to-end tests
npm run e2e

# Sign plugin (required for production distribution)
export GRAFANA_API_KEY=...
npm run sign

Security and Compliance

  • No mock runtime data in reliability calculations
  • No external API dependency for core plugin behavior
  • No embedded credentials required for operation
  • No telemetry exfiltration by design
  • No eval or dynamic code execution
  • No Angular
  • Queries are executed through Grafana datasource access patterns
  • Incident writes are role-restricted (Editor/Admin) and org-scoped
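
The last rule, role-restricted and org-scoped incident writes, amounts to a two-part check. The sketch below expresses it in TypeScript for illustration only: the real enforcement would live in the Go backend, and the `RequestUser` shape and `canWriteIncident` name are assumptions.

```typescript
// Grafana's basic organization roles.
type OrgRole = "Viewer" | "Editor" | "Admin";

interface RequestUser {
  orgId: number;
  role: OrgRole;
}

// A write is allowed only inside the user's own organization
// and only for the Editor or Admin role.
function canWriteIncident(user: RequestUser, incidentOrgId: number): boolean {
  const sameOrg = user.orgId === incidentOrgId;
  const canEdit = user.role === "Editor" || user.role === "Admin";
  return sameOrg && canEdit;
}
```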

Testing Guidance

To validate behavior end-to-end:

  1. Generate error traffic (/error) using auto or manual traffic
  2. Observe:
    • Error rate increases
    • Pattern occurrence counts increase
    • Confidence score recalculates
    • Burn rate updates
    • Blast radius adjusts
  3. Change Grafana time range
  4. Confirm recomputation across all panels

If traffic is stopped, panels should degrade gracefully (lower rates / empty states depending on selected window).
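
Graceful degradation usually means a panel distinguishes "no data" from "a value of zero". A minimal sketch, assuming samples arrive as an array that may contain nulls or NaN values (the `PanelState` type and `resolvePanelState` name are hypothetical):

```typescript
type PanelState =
  | { kind: "empty"; reason: string }
  | { kind: "value"; value: number };

function resolvePanelState(samples: Array<number | null>): PanelState {
  // Walk backwards to find the most recent usable sample.
  for (let i = samples.length - 1; i >= 0; i--) {
    const v = samples[i];
    if (v !== null && isFinite(v)) {
      return { kind: "value", value: v };
    }
  }
  return { kind: "empty", reason: "No data in the selected time range" };
}
```

With this shape, a window that contains only zero-valued samples still renders a rate of 0, while a window with no samples at all renders an empty state.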


Codebase Structure

/src            Frontend React application
/pkg            Go backend (incident APIs)
/sample-app     Test application
/observability  Prometheus, Loki, Tempo configs
/provisioning   Datasource and plugin provisioning

Versioning

Semantic Versioning is used:

MAJOR.MINOR.PATCH

  • Breaking changes increment MAJOR
  • Backward-compatible features increment MINOR
  • Fixes increment PATCH
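
The three rules can be written as a small helper. This is a generic Semantic Versioning sketch, not part of the plugin's release tooling; `bumpVersion` and `ChangeKind` are hypothetical names, and pre-release or build suffixes are deliberately not handled.

```typescript
type ChangeKind = "breaking" | "feature" | "fix";

function bumpVersion(version: string, change: ChangeKind): string {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!m) {
    throw new Error(`not a MAJOR.MINOR.PATCH version: ${version}`);
  }
  const major = parseInt(m[1], 10);
  const minor = parseInt(m[2], 10);
  const patch = parseInt(m[3], 10);
  if (change === "breaking") {
    return `${major + 1}.0.0`; // MAJOR resets MINOR and PATCH
  }
  if (change === "feature") {
    return `${major}.${minor + 1}.0`; // MINOR resets PATCH
  }
  return `${major}.${minor}.${patch + 1}`;
}
```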

License

Licensed under Apache 2.0.
