|
| 1 | +# Operations Runbook |
| 2 | + |
| 3 | +## Overview |
| 4 | +This document provides operational procedures for monitoring, troubleshooting, and maintaining the Flash Sale Engine in production. |
| 5 | + |
| 6 | +## Monitoring |
| 7 | + |
| 8 | +### Prometheus Metrics |
| 9 | + |
| 10 | +Both services expose Prometheus metrics on the `/metrics` endpoint: |
| 11 | + |
| 12 | +**Gateway Metrics** (`:8080/metrics`): |
| 13 | +- `gateway_orders_received_total` - Total orders received |
| 14 | +- `gateway_orders_successful_total` - Orders successfully queued |
| 15 | +- `gateway_orders_failed_total` - Orders that failed to queue |
| 16 | +- `gateway_orders_validation_failed_total` - Validation failures |
| 17 | +- `gateway_orders_idempotency_rejected_total` - Duplicate requests rejected |
| 18 | +- `gateway_request_duration_seconds` - Request processing time histogram |
| 19 | +- `gateway_circuit_breaker_state` - Circuit breaker state (0=closed, 1=open, 2=half-open) |
| 20 | + |
| 21 | +**Processor Metrics** (`:9090/metrics`): |
| 22 | +- `processor_orders_processed_total` - Total orders processed |
| 23 | +- `processor_orders_processed_success_total` - Successfully processed |
| 24 | +- `processor_orders_processed_failed_total` - Failed processing |
| 25 | +- `processor_orders_sold_out_total` - Orders rejected due to sold out |
| 26 | +- `processor_orders_moved_to_dlq_total` - Orders moved to DLQ |
| 27 | +- `processor_order_processing_duration_seconds` - Processing time histogram |
| 28 | +- `processor_dlq_size` - Current DLQ depth |
| 29 | +- `processor_dlq_oldest_message_age_seconds` - Age of oldest DLQ message |
| 30 | +- `processor_inventory_level{item_id="..."}` - Inventory level per item |
| 31 | + |
| 32 | +### Health Checks |
| 33 | + |
| 34 | +**Gateway Health** (`GET /health`): |
| 35 | +```bash |
| 36 | +curl http://localhost:8080/health |
| 37 | +``` |
| 38 | + |
| 39 | +Response: |
| 40 | +```json |
| 41 | +{ |
| 42 | + "status": "healthy", |
| 43 | + "redis": true, |
| 44 | + "kafka": true, |
| 45 | + "circuit_breaker_state": "closed" |
| 46 | +} |
| 47 | +``` |
| 48 | + |
| 49 | +- `200 OK`: All services healthy |
| 50 | +- `503 Service Unavailable`: One or more services unhealthy |
| 51 | + |
| 52 | +### Logging |
| 53 | + |
| 54 | +All services use structured JSON logging with correlation IDs: |
| 55 | + |
| 56 | +```json |
| 57 | +{ |
| 58 | + "timestamp": "2025-11-29T21:56:00.000Z", |
| 59 | + "level": "INFO", |
| 60 | + "message": "Order queued successfully", |
| 61 | + "correlation_id": "uuid-123", |
| 62 | + "service": "gateway", |
| 63 | + "event": "order_queued", |
| 64 | + "user_id": "u1", |
| 65 | + "item_id": "101", |
| 66 | + "processing_time_ms": 145 |
| 67 | +} |
| 68 | +``` |
| 69 | + |
| 70 | +**Key Fields**: |
| 71 | +- `correlation_id`: Trace requests across services |
| 72 | +- `service`: Service name (gateway/processor) |
| 73 | +- `event`: Event type (order_received, order_queued, order_processed, etc.) |
| 74 | +- `processing_time_ms`: Request processing time |
| 75 | + |
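To follow a single request across both services, the combined logs can be filtered on `correlation_id`. The sketch below assumes each log entry is emitted as one JSON object per line (the usual layout for structured logging, but not confirmed by this runbook); `filter_by_correlation_id` is a hypothetical helper name, not part of the project.

```shell
# Hypothetical helper: keep only the log lines that carry one correlation ID.
# Assumes one JSON object per line; reads log lines from stdin.
filter_by_correlation_id() {
  local id="$1"
  # -F matches literal strings. The two patterns cover serializers that do
  # and do not put a space after the colon. The closing quote keeps
  # "uuid-123" from also matching "uuid-1234".
  grep -F -e "\"correlation_id\":\"${id}\"" -e "\"correlation_id\": \"${id}\""
}

# Usage (requires the running stack):
# docker-compose logs --no-color gateway processor | filter_by_correlation_id uuid-123
```

A plain `grep` keeps the helper dependency-free. If `jq` is available, selecting on the parsed field is more robust against unusual formatting.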
| 76 | +## Alerting Thresholds |
| 77 | + |
| 78 | +### Critical Alerts |
| 79 | + |
| 80 | +1. **Circuit Breaker Open** |
| 81 | + - Metric: `gateway_circuit_breaker_state == 1` |
| 82 | + - Action: Check Kafka connectivity, restart if needed |
| 83 | + - Impact: All orders rejected with 503 |
| 84 | + |
| 85 | +2. **DLQ Size Exceeds Threshold** |
| 86 | + - Metric: `processor_dlq_size > 100` |
| 87 | + - Action: Investigate failure reasons, process DLQ manually |
| 88 | + - Impact: Orders not being processed |
| 89 | + |
| 90 | +3. **DLQ Age Too High** |
| 91 | + - Metric: `processor_dlq_oldest_message_age_seconds > 3600` |
| 92 | + - Action: Process oldest messages first |
| 93 | + - Impact: Stale orders in DLQ |
| 94 | + |
| 95 | +4. **High Failure Rate** |
| 96 | + - Metric: `rate(gateway_orders_failed_total[5m]) / rate(gateway_orders_received_total[5m]) > 0.1` |
| 97 | + - Action: Check service health, review logs |
| 98 | + - Impact: 10%+ of orders failing |
| 99 | + |
| 100 | +5. **Processing Time High** |
| 101 | + - Metric: `histogram_quantile(0.99, sum by (le) (rate(processor_order_processing_duration_seconds_bucket[5m]))) > 5` |
| 102 | + - Action: Check Redis/Kafka latency, scale processor |
| 103 | + - Impact: Slow order processing |
| 104 | + |
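When Prometheus itself is unavailable, the failure-rate threshold can be spot-checked by hand from the gateway's raw `/metrics` output. This is a minimal sketch, assuming both counters are exposed without labels; `failure_ratio` is a hypothetical helper name. It computes the lifetime ratio of the two counters, which is coarser than a windowed rate and only useful as a quick sanity check.

```shell
# Hypothetical helper: compute failed/received from Prometheus text format on stdin.
# Assumes the two counters carry no labels, so the metric name is the first field.
failure_ratio() {
  awk '
    $1 == "gateway_orders_failed_total"   { failed = $2 }
    $1 == "gateway_orders_received_total" { received = $2 }
    END {
      # Avoid dividing by zero before any order has been received.
      if (received > 0) printf "%.4f\n", failed / received
      else print "0.0000"
    }'
}

# Usage (requires the running gateway):
# curl -s http://localhost:8080/metrics | failure_ratio
```

A result above `0.1` corresponds to the 10% threshold of the High Failure Rate alert.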
| 105 | +### Warning Alerts |
| 106 | + |
| 107 | +1. **Rate Limit Approaching** |
| 108 | + - Monitor: Rate limit rejections increasing |
| 109 | + - Action: Review rate limit configuration |
| 110 | + |
| 111 | +2. **Inventory Low** |
| 112 | + - Metric: `processor_inventory_level < 10` |
| 113 | + - Action: Restock or prepare for sold out |
| 114 | + |
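The Inventory Low condition can also be checked ad hoc against the processor's `/metrics` output. The sketch below assumes the gauge is exposed as `processor_inventory_level{item_id="..."} <value>` with no spaces inside the label set; `low_inventory_items` is a hypothetical helper name.

```shell
# Hypothetical helper: list inventory gauges below a threshold.
# Reads Prometheus text format on stdin; the threshold is the first argument.
low_inventory_items() {
  local threshold="$1"
  awk -v t="$threshold" '
    # Match only the inventory gauge, then compare its sample value numerically.
    index($1, "processor_inventory_level{") == 1 && ($NF + 0) < (t + 0) {
      print $1, $NF
    }'
}

# Usage (requires the running processor):
# curl -s http://localhost:9090/metrics | low_inventory_items 10
```

Empty output means no item is below the threshold.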
| 115 | +## Troubleshooting |
| 116 | + |
| 117 | +### Issue: Circuit Breaker Open |
| 118 | + |
| 119 | +**Symptoms**: |
| 120 | +- All requests return 503 Service Unavailable |
| 121 | +- Health check shows `circuit_breaker_state: "open"` |
| 122 | + |
| 123 | +**Diagnosis**: |
| 124 | +```bash |
| 125 | +# Check Kafka connectivity |
| 126 | +docker exec flash-sale-engine-redpanda-1 rpk cluster info |
| 127 | + |
| 128 | +# Check gateway logs |
| 129 | +docker-compose logs gateway | grep -i "circuit" |
| 130 | +``` |
| 131 | + |
| 132 | +**Resolution**: |
| 133 | +1. Check if Kafka/Redpanda is running: `docker-compose ps redpanda` |
| 134 | +2. Restart Kafka if needed: `docker-compose restart redpanda` |
| 135 | +3. Wait 30 seconds (the default `CIRCUIT_BREAKER_BASE_TIMEOUT`) for the circuit breaker to attempt recovery |
| 136 | +4. Check health endpoint: `curl http://localhost:8080/health` |
| 137 | + |
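The wait-then-check steps above can be scripted instead of repeated by hand. This is a sketch only: `wait_until` is a hypothetical helper that retries any command a fixed number of times, and it relies on the health endpoint returning a non-2xx status (such as `503`) while the gateway is unhealthy, as described under Health Checks.

```shell
# Hypothetical helper: retry a check command until it succeeds or attempts run out.
# Arguments: <attempts> <delay-seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Succeed as soon as the check command exits with status 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt remains.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (requires the running gateway); curl -f exits non-zero on HTTP errors:
# wait_until 12 5 curl -fsS http://localhost:8080/health && echo "gateway healthy"
```

Twelve attempts at five-second intervals cover roughly one minute, which leaves room for the default 30-second recovery timeout.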
| 138 | +### Issue: Orders Not Processing |
| 139 | + |
| 140 | +**Symptoms**: |
| 141 | +- Orders accepted but not processed |
| 142 | +- Inventory not decreasing |
| 143 | + |
| 144 | +**Diagnosis**: |
| 145 | +```bash |
| 146 | +# Check processor logs |
| 147 | +docker-compose logs processor |
| 148 | + |
| 149 | +# Peek at the Kafka topic (stop after 5 records instead of tailing forever) |
| 150 | +docker exec flash-sale-engine-redpanda-1 rpk topic consume orders --num 5 |
| 151 | + |
| 152 | +# Check processor metrics |
| 153 | +curl http://localhost:9090/metrics | grep processor_orders_processed |
| 154 | +``` |
| 155 | + |
| 156 | +**Resolution**: |
| 157 | +1. Check processor is running: `docker-compose ps processor` |
| 158 | +2. Check Kafka connectivity from processor |
| 159 | +3. Verify Redis connection |
| 160 | +4. Restart processor if needed: `docker-compose restart processor` |
| 161 | + |
| 162 | +### Issue: High DLQ Size |
| 163 | + |
| 164 | +**Symptoms**: |
| 165 | +- `processor_dlq_size` metric increasing |
| 166 | +- Many failed orders |
| 167 | + |
| 168 | +**Diagnosis**: |
| 169 | +```bash |
| 170 | +# Peek at DLQ messages (stop after 5 records instead of tailing forever) |
| 171 | +docker exec flash-sale-engine-redpanda-1 rpk topic consume orders-dlq --num 5 |
| 172 | + |
| 173 | +# Check failure reasons in logs |
| 174 | +docker-compose logs processor | grep -i "dlq" |
| 175 | +``` |
| 176 | + |
| 177 | +**Resolution**: |
| 178 | +1. Identify failure pattern (check DLQ message headers for error reasons) |
| 179 | +2. Common reasons: |
| 180 | + - `Payment Timeout`: Expected (10% simulation), can be ignored |
| 181 | + - `Redis Failure`: Check Redis health |
| 182 | + - `Invalid Order Format`: Check gateway message format |
| 183 | +3. Process DLQ manually or implement retry logic |
| 184 | + |
| 185 | +### Issue: Inventory Mismatch |
| 186 | + |
| 187 | +**Symptoms**: |
| 188 | +- Inventory count doesn't match expected value |
| 189 | +- Negative inventory (shouldn't happen with Lua scripts) |
| 190 | + |
| 191 | +**Diagnosis**: |
| 192 | +```bash |
| 193 | +# Check current inventory |
| 194 | +docker exec flash-sale-engine-redis-1 redis-cli GET inventory:101 |
| 195 | + |
| 196 | +# List order status keys (SCAN iterates incrementally; KEYS blocks Redis on a large keyspace) |
| 197 | +docker exec flash-sale-engine-redis-1 redis-cli --scan --pattern "order_status:*" |
| 198 | +``` |
| 199 | + |
| 200 | +**Resolution**: |
| 201 | +1. Verify Lua scripts are being used (check processor logs) |
| 202 | +2. Check for Redis connection issues during script execution |
| 203 | +3. Manually correct inventory if needed: |
| 204 | + ```bash |
| 205 | + docker exec flash-sale-engine-redis-1 redis-cli SET inventory:101 100 |
| 206 | + ``` |
| 207 | + |
| 208 | +### Issue: Rate Limiting Too Aggressive |
| 209 | + |
| 210 | +**Symptoms**: |
| 211 | +- Many 429 Too Many Requests responses |
| 212 | +- Legitimate users being blocked |
| 213 | + |
| 214 | +**Diagnosis**: |
| 215 | +```bash |
| 216 | +# Check rate limit configuration |
| 217 | +docker-compose exec gateway env | grep RATE_LIMIT |
| 218 | + |
| 219 | +# List rate limit keys in Redis (SCAN avoids blocking the server the way KEYS does) |
| 220 | +docker exec flash-sale-engine-redis-1 redis-cli --scan --pattern "ratelimit:*" |
| 221 | +``` |
| 222 | + |
| 223 | +**Resolution**: |
| 224 | +1. Adjust rate limit via environment variables: |
| 225 | + ```yaml |
| 226 | + # docker-compose.yml |
| 227 | + environment: |
| 228 | + RATE_LIMIT_MAX_REQUESTS: 120 # Increase from default 60 |
| 229 | + RATE_LIMIT_WINDOW: 1m |
| 230 | + ``` |
| 231 | +2. Restart gateway: `docker-compose restart gateway` |
| 232 | + |
| 233 | +## Configuration |
| 234 | + |
| 235 | +### Environment Variables |
| 236 | + |
| 237 | +**Gateway**: |
| 238 | +- `REDIS_ADDR`: Redis address (default: `redis-service:6379`) |
| 239 | +- `KAFKA_ADDR`: Kafka address (default: `kafka-service:9092`) |
| 240 | +- `LOG_LEVEL`: Log level (default: `info`) |
| 241 | +- `CIRCUIT_BREAKER_FAILURE_THRESHOLD`: Failures before opening (default: `5`) |
| 242 | +- `CIRCUIT_BREAKER_SUCCESS_THRESHOLD`: Successes in half-open (default: `2`) |
| 243 | +- `CIRCUIT_BREAKER_BASE_TIMEOUT`: Base timeout (default: `30s`) |
| 244 | +- `CIRCUIT_BREAKER_MAX_TIMEOUT`: Max timeout (default: `300s`) |
| 245 | +- `RATE_LIMIT_MAX_REQUESTS`: Max requests per window (default: `60`) |
| 246 | +- `RATE_LIMIT_WINDOW`: Rate limit window (default: `1m`) |
| 247 | + |
| 248 | +**Processor**: |
| 249 | +- `REDIS_ADDR`: Redis address (default: `redis-service:6379`) |
| 250 | +- `KAFKA_ADDR`: Kafka address (default: `kafka-service:9092`) |
| 251 | +- `LOG_LEVEL`: Log level (default: `info`) |
| 252 | + |
| 253 | +## Backup and Recovery |
| 254 | + |
| 255 | +### Redis Backup |
| 256 | + |
| 257 | +```bash |
| 258 | +# Create backup |
| 259 | +docker exec flash-sale-engine-redis-1 redis-cli SAVE |
| 260 | +docker cp flash-sale-engine-redis-1:/data/dump.rdb ./backup-$(date +%Y%m%d).rdb |
| 261 | +# Restore backup: stop Redis first so a save on shutdown cannot overwrite the restored file |
| 262 | +docker-compose stop redis |
| 263 | +docker cp ./backup-20251129.rdb flash-sale-engine-redis-1:/data/dump.rdb |
| 264 | +docker-compose start redis |
| 265 | +``` |
| 266 | + |
| 267 | +### Kafka/Redpanda Backup |
| 268 | + |
| 269 | +Redpanda data is stored in volumes. Backup the volume: |
| 270 | +```bash |
| 271 | +docker run --rm -v flash-sale-engine_redpanda-data:/data -v $(pwd):/backup alpine tar czf /backup/redpanda-backup-$(date +%Y%m%d).tar.gz /data |
| 272 | +``` |
| 273 | + |
| 274 | +## Performance Tuning |
| 275 | + |
| 276 | +### Scaling |
| 277 | + |
| 278 | +**Horizontal Scaling**: |
| 279 | +- Gateway: Stateless, can scale horizontally |
| 280 | +- Processor: Use Kafka consumer groups for parallel processing; parallelism is capped by the topic's partition count |
| 281 | + |
| 282 | +**Vertical Scaling**: |
| 283 | +- Increase Redis memory for larger inventory |
| 284 | +- Increase Kafka partitions for higher throughput |
| 285 | + |
| 286 | +### Optimization |
| 287 | + |
| 288 | +1. **Redis Connection Pooling**: Already configured in go-redis |
| 289 | +2. **Kafka Batch Size**: Adjust producer batch size for throughput |
| 290 | +3. **Lua Script Caching**: Redis caches Lua scripts automatically |
| 291 | +4. **Circuit Breaker Tuning**: Adjust thresholds based on failure patterns |
| 292 | + |
| 293 | +## Maintenance Windows |
| 294 | + |
| 295 | +### Zero-Downtime Deployment |
| 296 | + |
| 297 | +1. Deploy new version to new pods |
| 298 | +2. Wait for health checks to pass |
| 299 | +3. Gradually shift traffic |
| 300 | +4. Monitor metrics for issues |
| 301 | +5. Rollback if problems detected |
| 302 | + |
| 303 | +### Graceful Shutdown |
| 304 | + |
| 305 | +Services handle SIGTERM gracefully: |
| 306 | +- Gateway: Stops accepting new requests, waits for in-flight (30s timeout) |
| 307 | +- Processor: Stops consuming, processes current message (30s timeout) |
| 308 | + |
| 309 | +## Emergency Procedures |
| 310 | + |
| 311 | +### Complete System Failure |
| 312 | + |
| 313 | +1. **Stop all services**: `docker-compose down` |
| 314 | +2. **Check data integrity**: Verify Redis and Kafka data |
| 315 | +3. **Restore from backup** if needed |
| 316 | +4. **Restart services**: `docker-compose up -d` |
| 317 | +5. **Verify health**: Check all health endpoints |
| 318 | +6. **Monitor metrics**: Watch for anomalies |
| 319 | + |
| 320 | +### Data Corruption |
| 321 | + |
| 322 | +1. **Stop services**: Prevent further corruption |
| 323 | +2. **Restore from backup** |
| 324 | +3. **Verify inventory counts** |
| 325 | +4. **Replay DLQ messages** if needed |
| 326 | +5. **Restart services** |
| 327 | + |
| 328 | +## Contact and Escalation |
| 329 | + |
| 330 | +- **On-Call Engineer**: Check team rotation schedule |
| 331 | +- **Critical Issues**: Escalate immediately |
| 332 | +- **Documentation**: Update this runbook with new procedures |
| 333 | + |