This document provides operational procedures for monitoring, troubleshooting, and maintaining the Flash Sale Engine in production.
Both services expose Prometheus metrics on a /metrics endpoint:
Gateway Metrics (:8080/metrics):
- `gateway_orders_received_total` - Total orders received
- `gateway_orders_successful_total` - Orders successfully queued
- `gateway_orders_failed_total` - Orders that failed to queue
- `gateway_orders_validation_failed_total` - Validation failures
- `gateway_orders_idempotency_rejected_total` - Duplicate requests rejected
- `gateway_request_duration_seconds` - Request processing time histogram
- `gateway_circuit_breaker_state` - Circuit breaker state (0=closed, 1=open, 2=half-open)
Processor Metrics (:9090/metrics):
- `processor_orders_processed_total` - Total orders processed
- `processor_orders_processed_success_total` - Successfully processed
- `processor_orders_processed_failed_total` - Failed processing
- `processor_orders_sold_out_total` - Orders rejected due to sold out
- `processor_orders_moved_to_dlq_total` - Orders moved to DLQ
- `processor_order_processing_duration_seconds` - Processing time histogram
- `processor_dlq_size` - Current DLQ depth
- `processor_dlq_oldest_message_age_seconds` - Age of oldest DLQ message
- `processor_inventory_level{item_id="..."}` - Inventory level per item
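To spot-check a single metric without a Prometheus server, the text-format scrape can be filtered directly. A minimal sketch with an inlined sample scrape; in practice, pipe `curl -s http://localhost:9090/metrics` into the same filter:

```shell
# Pull one metric value out of a Prometheus text-format scrape.
# The scrape is inlined here for illustration only.
scrape='processor_orders_processed_total 1523
processor_dlq_size 4'

metric_value() {
  echo "$scrape" | awk -v m="$1" '$1 == m { print $2 }'
}

metric_value processor_dlq_size   # -> 4
```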
Gateway Health (GET /health):
curl http://localhost:8080/health

Response:
{
"status": "healthy",
"redis": true,
"kafka": true,
"circuit_breaker_state": "closed"
}

- 200 OK: All services healthy
- 503 Service Unavailable: One or more services unhealthy
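For scripting against the health endpoint, the response body can be reduced to an exit code without requiring jq. A sketch where the sample body stands in for a live `curl -s http://localhost:8080/health` response:

```shell
# Turn a health-check body into a pass/fail exit code.
is_healthy() {
  case "$1" in
    *'"status": "healthy"'*) return 0 ;;
    *) return 1 ;;
  esac
}

body='{"status": "healthy", "redis": true, "kafka": true}'
is_healthy "$body" && echo "gateway OK"   # -> gateway OK
```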
All services use structured JSON logging with correlation IDs:
{
"timestamp": "2025-11-29T21:56:00.000Z",
"level": "INFO",
"message": "Order queued successfully",
"correlation_id": "uuid-123",
"service": "gateway",
"event": "order_queued",
"user_id": "u1",
"item_id": "101",
"processing_time_ms": 145
}

Key Fields:
- `correlation_id`: Trace requests across services
- `service`: Service name (gateway/processor)
- `event`: Event type (order_received, order_queued, order_processed, etc.)
- `processing_time_ms`: Request processing time
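The correlation_id field makes it possible to follow one order end to end. A sketch using inlined sample lines in place of real `docker-compose logs gateway processor` output:

```shell
# Follow a single order across services by correlation_id.
logs='{"service": "gateway", "correlation_id": "uuid-123", "event": "order_queued"}
{"service": "processor", "correlation_id": "uuid-123", "event": "order_processed"}
{"service": "gateway", "correlation_id": "uuid-999", "event": "order_received"}'

trace() {
  echo "$logs" | grep "\"correlation_id\": \"$1\""
}

trace uuid-123   # both the gateway and processor entries for this order
```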
Circuit Breaker Open
- Metric: `gateway_circuit_breaker_state == 1`
- Action: Check Kafka connectivity, restart if needed
- Impact: All orders rejected with 503

DLQ Size Exceeds Threshold
- Metric: `processor_dlq_size > 100`
- Action: Investigate failure reasons, process DLQ manually
- Impact: Orders not being processed

DLQ Age Too High
- Metric: `processor_dlq_oldest_message_age_seconds > 3600`
- Action: Process oldest messages first
- Impact: Stale orders in DLQ

High Failure Rate
- Metric: `gateway_orders_failed_total / gateway_orders_received_total > 0.1`
- Action: Check service health, review logs
- Impact: 10%+ of orders failing

Processing Time High
- Metric: `processor_order_processing_duration_seconds{p99} > 5`
- Action: Check Redis/Kafka latency, scale processor
- Impact: Slow order processing

Rate Limit Approaching
- Monitor: Rate limit rejections increasing
- Action: Review rate limit configuration

Inventory Low
- Metric: `processor_inventory_level < 10`
- Action: Restock or prepare for sold out
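The "High Failure Rate" threshold above can be checked ad hoc by dividing the two gateway counters. A sketch with sample counter values; read the live values from the gateway's /metrics endpoint:

```shell
# Ad-hoc check for the High Failure Rate alert: failed / received.
failure_rate() {
  awk -v failed="$1" -v received="$2" 'BEGIN {
    if (received == 0) { print "0.00"; exit }
    printf "%.2f\n", failed / received
  }'
}

failure_rate 12 100   # -> 0.12 (above the 0.1 alert threshold)
```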
Circuit Breaker Open
Symptoms:
- All requests return 503 Service Unavailable
- Health check shows
circuit_breaker_state: "open"
Diagnosis:
# Check Kafka connectivity
docker exec flash-sale-engine-redpanda-1 rpk cluster info
# Check gateway logs
docker-compose logs gateway | grep -i "circuit"

Resolution:
- Check if Kafka/Redpanda is running: `docker-compose ps redpanda`
- Restart Kafka if needed: `docker-compose restart redpanda`
- Wait 30 seconds for the circuit breaker to attempt recovery
- Check the health endpoint: `curl http://localhost:8080/health`
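When reading the gauge directly from /metrics, it helps to map the numeric state back to a name, using the encoding documented above (0=closed, 1=open, 2=half-open). A small helper sketch:

```shell
# Translate the gateway_circuit_breaker_state gauge into its name.
cb_state() {
  case "$1" in
    0) echo closed ;;
    1) echo open ;;
    2) echo half-open ;;
    *) echo unknown ;;
  esac
}

cb_state 1   # -> open
```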
Orders Not Being Processed
Symptoms:
- Orders accepted but not processed
- Inventory not decreasing
Diagnosis:
# Check processor logs
docker-compose logs processor
# Check Kafka topic
docker exec flash-sale-engine-redpanda-1 rpk topic consume orders
# Check processor metrics
curl http://localhost:9090/metrics | grep processor_orders_processed

Resolution:
- Check the processor is running: `docker-compose ps processor`
- Check Kafka connectivity from the processor
- Verify the Redis connection
- Restart the processor if needed: `docker-compose restart processor`
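If the processor is up but falling behind rather than stopped, consumer lag is the number to watch: the partition high watermark minus the committed offset, as reported per partition by `rpk group describe <group>` (the group name depends on the processor's Kafka configuration). A sketch of the arithmetic with sample offsets:

```shell
# Consumer lag = high watermark - committed offset for a partition.
lag() {
  echo $(( $1 - $2 ))
}

lag 1500 1400   # -> 100 messages behind
```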
DLQ Growing
Symptoms:
- `processor_dlq_size` metric increasing
- Many failed orders
Diagnosis:
# Check DLQ messages
docker exec flash-sale-engine-redpanda-1 rpk topic consume orders-dlq
# Check failure reasons in logs
docker-compose logs processor | grep -i "dlq"

Resolution:
- Identify failure pattern (check DLQ message headers for error reasons)
- Common reasons:
  - Payment Timeout: Expected (10% simulation), can be ignored
  - Redis Failure: Check Redis health
  - Invalid Order Format: Check gateway message format
- Process DLQ manually or implement retry logic
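Since payment timeouts are an expected simulation, a manual replay should filter them out before producing the rest back onto the orders topic (e.g. with `rpk topic produce orders`). A sketch with inlined sample DLQ messages; the `error` field name is an assumption about the DLQ message format:

```shell
# Filter DLQ messages before replay: keep only genuine failures.
dlq='{"order_id": "o1", "error": "payment_timeout"}
{"order_id": "o2", "error": "redis_failure"}'

replayable() {
  echo "$dlq" | grep -v '"error": "payment_timeout"'
}

replayable   # only o2 would be produced back onto orders
```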
Inventory Mismatch
Symptoms:
- Inventory count doesn't match expected value
- Negative inventory (shouldn't happen with Lua scripts)
Diagnosis:
# Check current inventory
docker exec flash-sale-engine-redis-1 redis-cli GET inventory:101
# Check order status keys
docker exec flash-sale-engine-redis-1 redis-cli KEYS "order_status:*"

Resolution:
- Verify Lua scripts are being used (check processor logs)
- Check for Redis connection issues during script execution
- Manually correct inventory if needed:
docker exec flash-sale-engine-redis-1 redis-cli SET inventory:101 100
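Before correcting inventory by hand, reconcile it: starting stock minus successful orders (from `processor_orders_processed_success_total`) should equal the value Redis reports for the item. A sketch of the check with sample numbers:

```shell
# Inventory reconciliation: starting stock - successful orders
# should match `redis-cli GET inventory:<item_id>`.
expected_inventory() {
  echo $(( $1 - $2 ))
}

expected_inventory 100 37   # -> 63; compare with GET inventory:101
```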
Rate Limiting Too Aggressive
Symptoms:
- Many 429 Too Many Requests responses
- Legitimate users being blocked
Diagnosis:
# Check rate limit configuration
docker-compose exec gateway env | grep RATE_LIMIT
# Check rate limit keys in Redis
docker exec flash-sale-engine-redis-1 redis-cli KEYS "ratelimit:*"

Resolution:
- Adjust rate limit via environment variables:
# docker-compose.yml
environment:
  RATE_LIMIT_MAX_REQUESTS: 120  # Increase from default 60
  RATE_LIMIT_WINDOW: 1m
- Restart gateway:
docker-compose restart gateway
Gateway:
- `REDIS_ADDR`: Redis address (default: redis-service:6379)
- `KAFKA_ADDR`: Kafka address (default: kafka-service:9092)
- `LOG_LEVEL`: Log level (default: info)
- `CIRCUIT_BREAKER_FAILURE_THRESHOLD`: Failures before opening (default: 5)
- `CIRCUIT_BREAKER_SUCCESS_THRESHOLD`: Successes in half-open (default: 2)
- `CIRCUIT_BREAKER_BASE_TIMEOUT`: Base timeout (default: 30s)
- `CIRCUIT_BREAKER_MAX_TIMEOUT`: Max timeout (default: 300s)
- `RATE_LIMIT_MAX_REQUESTS`: Max requests per window (default: 60)
- `RATE_LIMIT_WINDOW`: Rate limit window (default: 1m)
Processor:
- `REDIS_ADDR`: Redis address (default: redis-service:6379)
- `KAFKA_ADDR`: Kafka address (default: kafka-service:9092)
- `LOG_LEVEL`: Log level (default: info)
# Create backup
docker exec flash-sale-engine-redis-1 redis-cli SAVE
docker cp flash-sale-engine-redis-1:/data/dump.rdb ./backup-$(date +%Y%m%d).rdb
# Restore backup
docker cp ./backup-20251129.rdb flash-sale-engine-redis-1:/data/dump.rdb
docker-compose restart redis

Redpanda data is stored in volumes. Backup the volume:
docker run --rm -v flash-sale-engine_redpanda-data:/data -v $(pwd):/backup alpine tar czf /backup/redpanda-backup-$(date +%Y%m%d).tar.gz /data

Horizontal Scaling:
- Gateway: Stateless, can scale horizontally
- Processor: Use Kafka consumer groups for parallel processing
Vertical Scaling:
- Increase Redis memory for larger inventory
- Increase Kafka partitions for higher throughput
- Redis Connection Pooling: Already configured in go-redis
- Kafka Batch Size: Adjust producer batch size for throughput
- Lua Script Caching: Redis caches Lua scripts automatically
- Circuit Breaker Tuning: Adjust thresholds based on failure patterns
- Deploy new version to new pods
- Wait for health checks to pass
- Gradually shift traffic
- Monitor metrics for issues
- Rollback if problems detected
Services handle SIGTERM gracefully:
- Gateway: Stops accepting new requests, waits for in-flight (30s timeout)
- Processor: Stops consuming, processes current message (30s timeout)
- Stop all services: `docker-compose down`
- Check data integrity: Verify Redis and Kafka data
- Restore from backup if needed
- Restart services: `docker-compose up -d`
- Verify health: Check all health endpoints
- Monitor metrics: Watch for anomalies
- Monitor metrics: Watch for anomalies
- Stop services: Prevent further corruption
- Restore from backup
- Verify inventory counts
- Replay DLQ messages if needed
- Restart services
- On-Call Engineer: Check team rotation schedule
- Critical Issues: Escalate immediately
- Documentation: Update this runbook with new procedures