Operations Runbook

Overview

This document provides operational procedures for monitoring, troubleshooting, and maintaining the Flash Sale Engine in production.

Monitoring

Prometheus Metrics

Both services expose Prometheus metrics at the /metrics endpoint:

Gateway Metrics (:8080/metrics):

  • gateway_orders_received_total - Total orders received
  • gateway_orders_successful_total - Orders successfully queued
  • gateway_orders_failed_total - Orders that failed to queue
  • gateway_orders_validation_failed_total - Validation failures
  • gateway_orders_idempotency_rejected_total - Duplicate requests rejected
  • gateway_request_duration_seconds - Request processing time histogram
  • gateway_circuit_breaker_state - Circuit breaker state (0=closed, 1=open, 2=half-open)

Processor Metrics (:9090/metrics):

  • processor_orders_processed_total - Total orders processed
  • processor_orders_processed_success_total - Successfully processed
  • processor_orders_processed_failed_total - Failed processing
  • processor_orders_sold_out_total - Orders rejected due to sold out
  • processor_orders_moved_to_dlq_total - Orders moved to the dead-letter queue (DLQ)
  • processor_order_processing_duration_seconds - Processing time histogram
  • processor_dlq_size - Current DLQ depth
  • processor_dlq_oldest_message_age_seconds - Age of oldest DLQ message
  • processor_inventory_level{item_id="..."} - Inventory level per item
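
A quick way to confirm both endpoints are live and exporting:

# Spot-check a few counters from each service
curl -s http://localhost:8080/metrics | grep ^gateway_orders
curl -s http://localhost:9090/metrics | grep ^processor_orders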

Health Checks

Gateway Health (GET /health):

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "redis": true,
  "kafka": true,
  "circuit_breaker_state": "closed"
}
  • 200 OK: All services healthy
  • 503 Service Unavailable: One or more services unhealthy
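
For scripting, individual fields can be pulled out with jq (assuming jq is available):

# Print just the circuit breaker state
curl -s http://localhost:8080/health | jq -r .circuit_breaker_state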

Logging

Both services use structured JSON logging with correlation IDs:

{
  "timestamp": "2025-11-29T21:56:00.000Z",
  "level": "INFO",
  "message": "Order queued successfully",
  "correlation_id": "uuid-123",
  "service": "gateway",
  "event": "order_queued",
  "user_id": "u1",
  "item_id": "101",
  "processing_time_ms": 145
}

Key Fields:

  • correlation_id: Trace requests across services
  • service: Service name (gateway/processor)
  • event: Event type (order_received, order_queued, order_processed, etc.)
  • processing_time_ms: Request processing time
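
To trace a single request end to end, filter both services' logs by its correlation ID (uuid-123 here, from the example above):

# Follow one order across gateway and processor
docker-compose logs gateway processor | grep uuid-123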

Alerting Thresholds

Critical Alerts

  1. Circuit Breaker Open

    • Metric: gateway_circuit_breaker_state == 1
    • Action: Check Kafka connectivity, restart if needed
    • Impact: All orders rejected with 503
  2. DLQ Size Exceeds Threshold

    • Metric: processor_dlq_size > 100
    • Action: Investigate failure reasons, process DLQ manually
    • Impact: Orders not being processed
  3. DLQ Age Too High

    • Metric: processor_dlq_oldest_message_age_seconds > 3600
    • Action: Process oldest messages first
    • Impact: Stale orders in DLQ
  4. High Failure Rate

    • Metric: rate(gateway_orders_failed_total[5m]) / rate(gateway_orders_received_total[5m]) > 0.1
    • Action: Check service health, review logs (a manual spot-check sketch follows this list)
    • Impact: 10%+ of orders failing
  5. Processing Time High

    • Metric: p99 of processor_order_processing_duration_seconds > 5s, e.g. histogram_quantile(0.99, rate(processor_order_processing_duration_seconds_bucket[5m])) > 5
    • Action: Check Redis/Kafka latency, scale processor
    • Impact: Slow order processing
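
For alert #4, the failure rate can also be spot-checked by hand from the raw counters. A minimal sketch, assuming the two counters are exported without labels:

# Compute failed/received from the current counter values
curl -s http://localhost:8080/metrics | awk '
  /^gateway_orders_failed_total/ {f=$2}
  /^gateway_orders_received_total/ {r=$2}
  END {if (r > 0) printf "failure rate: %.2f%%\n", 100*f/r}'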

Warning Alerts

  1. Rate Limit Approaching

    • Monitor: Rate limit rejections increasing
    • Action: Review rate limit configuration
  2. Inventory Low

    • Metric: processor_inventory_level < 10
    • Action: Restock or prepare for sold out

Troubleshooting

Issue: Circuit Breaker Open

Symptoms:

  • All requests return 503 Service Unavailable
  • Health check shows circuit_breaker_state: "open"

Diagnosis:

# Check Kafka connectivity
docker exec flash-sale-engine-redpanda-1 rpk cluster info

# Check gateway logs
docker-compose logs gateway | grep -i "circuit"

Resolution:

  1. Check if Kafka/Redpanda is running: docker-compose ps redpanda
  2. Restart Kafka if needed: docker-compose restart redpanda
  3. Wait ~30 seconds (CIRCUIT_BREAKER_BASE_TIMEOUT) for the circuit breaker to attempt recovery
  4. Check health endpoint: curl http://localhost:8080/health (a polling sketch follows)
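
Because the breaker re-tests on its own schedule, it can be convenient to poll until the gateway reports healthy again:

# curl -f exits non-zero on the 503 returned while the breaker is open
while ! curl -fsS http://localhost:8080/health > /dev/null; do sleep 5; done
echo "circuit breaker recovered"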

Issue: Orders Not Processing

Symptoms:

  • Orders accepted but not processed
  • Inventory not decreasing

Diagnosis:

# Check processor logs
docker-compose logs processor

# Check Kafka topic
docker exec flash-sale-engine-redpanda-1 rpk topic consume orders

# Check processor metrics
curl http://localhost:9090/metrics | grep processor_orders_processed
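
If the processor is running but idle, consumer group lag shows whether it is keeping up with the topic. The group name below is hypothetical; list the real ones first:

# Check consumer group lag
docker exec flash-sale-engine-redpanda-1 rpk group list
docker exec flash-sale-engine-redpanda-1 rpk group describe order-processor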

Resolution:

  1. Check processor is running: docker-compose ps processor
  2. Check Kafka connectivity from processor
  3. Verify Redis connection
  4. Restart processor if needed: docker-compose restart processor

Issue: High DLQ Size

Symptoms:

  • processor_dlq_size metric increasing
  • Many failed orders

Diagnosis:

# Check DLQ messages
docker exec flash-sale-engine-redpanda-1 rpk topic consume orders-dlq

# Check failure reasons in logs
docker-compose logs processor | grep -i "dlq"

Resolution:

  1. Identify failure pattern (check DLQ message headers for error reasons)
  2. Common reasons:
    • Payment Timeout: Expected (failures are simulated at a 10% rate), can be ignored
    • Redis Failure: Check Redis health
    • Invalid Order Format: Check gateway message format
  3. Process DLQ manually or implement retry logic
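
For transient failures, one manual replay approach is to pipe DLQ message values back onto the orders topic. A rough sketch, assuming each value is a self-contained JSON order and that keys and headers are not needed for reprocessing:

# Replay the first 50 DLQ messages onto the orders topic
docker exec flash-sale-engine-redpanda-1 sh -c \
  "rpk topic consume orders-dlq -o start -n 50 -f '%v\n' | rpk topic produce orders"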

Issue: Inventory Mismatch

Symptoms:

  • Inventory count doesn't match expected value
  • Negative inventory (shouldn't happen with Lua scripts)

Diagnosis:

# Check current inventory
docker exec flash-sale-engine-redis-1 redis-cli GET inventory:101

# Check order status keys (SCAN avoids blocking Redis, unlike KEYS)
docker exec flash-sale-engine-redis-1 redis-cli --scan --pattern "order_status:*"

Resolution:

  1. Verify Lua scripts are being used (check processor logs)
  2. Check for Redis connection issues during script execution
  3. Manually correct inventory if needed:
    docker exec flash-sale-engine-redis-1 redis-cli SET inventory:101 100
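
Before overwriting, it helps to sanity-check the drawdown against recorded orders. A rough sketch, assuming item 101 started at 100 units and that order_status keys are written only for confirmed orders:

# Confirmed order count vs. remaining inventory
docker exec flash-sale-engine-redis-1 redis-cli --scan --pattern "order_status:*" | wc -l
docker exec flash-sale-engine-redis-1 redis-cli GET inventory:101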

Issue: Rate Limiting Too Aggressive

Symptoms:

  • Many 429 Too Many Requests responses
  • Legitimate users being blocked

Diagnosis:

# Check rate limit configuration
docker-compose exec gateway env | grep RATE_LIMIT

# Check rate limit keys in Redis (SCAN avoids blocking Redis, unlike KEYS)
docker exec flash-sale-engine-redis-1 redis-cli --scan --pattern "ratelimit:*"
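
To see how close a specific user is to the limit, inspect their counter and remaining window. The exact key format is an assumption based on the pattern above; u1 is the example user from the logging section:

# Current count and seconds left in the window for user u1 (hypothetical key name)
docker exec flash-sale-engine-redis-1 redis-cli GET ratelimit:u1
docker exec flash-sale-engine-redis-1 redis-cli TTL ratelimit:u1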

Resolution:

  1. Adjust rate limit via environment variables:
    # docker-compose.yml
    environment:
      RATE_LIMIT_MAX_REQUESTS: 120  # Increase from default 60
      RATE_LIMIT_WINDOW: 1m
  2. Restart gateway: docker-compose restart gateway

Configuration

Environment Variables

Gateway:

  • REDIS_ADDR: Redis address (default: redis-service:6379)
  • KAFKA_ADDR: Kafka address (default: kafka-service:9092)
  • LOG_LEVEL: Log level (default: info)
  • CIRCUIT_BREAKER_FAILURE_THRESHOLD: Failures before opening (default: 5)
  • CIRCUIT_BREAKER_SUCCESS_THRESHOLD: Successes in half-open (default: 2)
  • CIRCUIT_BREAKER_BASE_TIMEOUT: Base timeout (default: 30s)
  • CIRCUIT_BREAKER_MAX_TIMEOUT: Max timeout (default: 300s)
  • RATE_LIMIT_MAX_REQUESTS: Max requests per window (default: 60)
  • RATE_LIMIT_WINDOW: Rate limit window (default: 1m)

Processor:

  • REDIS_ADDR: Redis address (default: redis-service:6379)
  • KAFKA_ADDR: Kafka address (default: kafka-service:9092)
  • LOG_LEVEL: Log level (default: info)

Backup and Recovery

Redis Backup

# Create backup (SAVE blocks Redis while writing; prefer BGSAVE on a busy instance)
docker exec flash-sale-engine-redis-1 redis-cli SAVE
docker cp flash-sale-engine-redis-1:/data/dump.rdb ./backup-$(date +%Y%m%d).rdb

# Restore backup (stop Redis first so its shutdown save doesn't overwrite the restored file)
docker-compose stop redis
docker cp ./backup-20251129.rdb flash-sale-engine-redis-1:/data/dump.rdb
docker-compose start redis

Kafka/Redpanda Backup

Redpanda data is stored in volumes. Backup the volume:

docker run --rm \
  -v flash-sale-engine_redpanda-data:/data \
  -v $(pwd):/backup \
  alpine tar czf /backup/redpanda-backup-$(date +%Y%m%d).tar.gz /data
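
Tarring a live data directory can capture a torn state, so stop Redpanda around backup and restore. A restore sketch for the archive created above (the date suffix is illustrative):

docker-compose stop redpanda
docker run --rm -v flash-sale-engine_redpanda-data:/data -v $(pwd):/backup alpine \
  sh -c "rm -rf /data/* && tar xzf /backup/redpanda-backup-20251129.tar.gz -C /"
docker-compose start redpanda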

Performance Tuning

Scaling

Horizontal Scaling:

  • Gateway: Stateless, can scale horizontally
  • Processor: Use Kafka consumer groups for parallel processing
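
With Compose this is a one-liner, assuming all processor instances join the same Kafka consumer group:

# Run three processors; the orders topic needs at least three partitions
# for all of them to receive work
docker-compose up -d --scale processor=3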

Vertical Scaling:

  • Increase Redis memory for larger inventory
  • Increase Kafka partitions for higher throughput

Optimization

  1. Redis Connection Pooling: Already configured in go-redis
  2. Kafka Batch Size: Adjust producer batch size for throughput
  3. Lua Script Caching: Redis caches Lua scripts automatically
  4. Circuit Breaker Tuning: Adjust thresholds based on failure patterns

Maintenance Windows

Zero-Downtime Deployment

  1. Deploy new version to new pods
  2. Wait for health checks to pass
  3. Gradually shift traffic
  4. Monitor metrics for issues
  5. Rollback if problems detected

Graceful Shutdown

Services handle SIGTERM gracefully:

  • Gateway: Stops accepting new requests, waits for in-flight requests to finish (30s timeout)
  • Processor: Stops consuming and finishes processing the current message (30s timeout)
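
When stopping services by hand, give Compose a grace period longer than the 30s drain so containers are not SIGKILLed mid-flight:

# SIGTERM, then SIGKILL only after 35 seconds
docker-compose stop -t 35 gateway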

Emergency Procedures

Complete System Failure

  1. Stop all services: docker-compose down
  2. Check data integrity: Verify Redis and Kafka data
  3. Restore from backup if needed
  4. Restart services: docker-compose up -d
  5. Verify health: Check all health endpoints
  6. Monitor metrics: Watch for anomalies
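
A quick verification sketch for steps 5-6, using the endpoints documented above (curl -f treats the gateway's 503 as a failure):

curl -fsS http://localhost:8080/health && echo "gateway OK"
curl -fsS -o /dev/null http://localhost:9090/metrics && echo "processor metrics OK"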

Data Corruption

  1. Stop services: Prevent further corruption
  2. Restore from backup
  3. Verify inventory counts
  4. Replay DLQ messages if needed
  5. Restart services

Contact and Escalation

  • On-Call Engineer: Check team rotation schedule
  • Critical Issues: Escalate immediately
  • Documentation: Update this runbook with new procedures