Commit 500a594

feat: Add production enhancement
1 parent a04a5d7 commit 500a594

15 files changed

Lines changed: 1477 additions & 94 deletions

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-FROM golang:1.22-alpine AS builder
+FROM golang:1.23-alpine AS builder
 WORKDIR /app
 COPY go.mod go.sum ./
 RUN go mod download

OPERATIONS.md

Lines changed: 333 additions & 0 deletions
@@ -0,0 +1,333 @@
1+
# Operations Runbook
2+
3+
## Overview
4+
This document provides operational procedures for monitoring, troubleshooting, and maintaining the Flash Sale Engine in production.
5+
6+
## Monitoring
7+
8+
### Prometheus Metrics
9+
10+
Both services expose Prometheus metrics at the `/metrics` endpoint:
11+
12+
**Gateway Metrics** (`:8080/metrics`):
13+
- `gateway_orders_received_total` - Total orders received
14+
- `gateway_orders_successful_total` - Orders successfully queued
15+
- `gateway_orders_failed_total` - Orders that failed to queue
16+
- `gateway_orders_validation_failed_total` - Validation failures
17+
- `gateway_orders_idempotency_rejected_total` - Duplicate requests rejected
18+
- `gateway_request_duration_seconds` - Request processing time histogram
19+
- `gateway_circuit_breaker_state` - Circuit breaker state (0=closed, 1=open, 2=half-open)
20+
21+
**Processor Metrics** (`:9090/metrics`):
22+
- `processor_orders_processed_total` - Total orders processed
23+
- `processor_orders_processed_success_total` - Successfully processed
24+
- `processor_orders_processed_failed_total` - Failed processing
25+
- `processor_orders_sold_out_total` - Orders rejected due to sold out
26+
- `processor_orders_moved_to_dlq_total` - Orders moved to DLQ
27+
- `processor_order_processing_duration_seconds` - Processing time histogram
28+
- `processor_dlq_size` - Current DLQ depth
29+
- `processor_dlq_oldest_message_age_seconds` - Age of oldest DLQ message
30+
- `processor_inventory_level{item_id="..."}` - Inventory level per item
31+
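For a quick spot check without a Prometheus server, a single sample can be read straight from the text exposition output. The helper below is a hypothetical sketch (`metric_value` is not part of the repository) and assumes the metric appears as an unlabelled `name value` line, which would not match labelled series such as `processor_inventory_level{item_id="..."}`.

```bash
# Hypothetical helper: print the value of an unlabelled metric read from stdin.
# In the Prometheus text format the sample value is the second
# whitespace-separated field; "# HELP" / "# TYPE" lines never match
# because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; found = 1 } END { exit !found }'
}

# Usage (assumes the processor is reachable on localhost:9090):
#   curl -s http://localhost:9090/metrics | metric_value processor_dlq_size
```

The function exits non-zero when the metric is absent, so it can be used directly in an `if` or `||` chain.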
32+
### Health Checks
33+
34+
**Gateway Health** (`GET /health`):
35+
```bash
36+
curl http://localhost:8080/health
37+
```
38+
39+
Response:
40+
```json
41+
{
42+
"status": "healthy",
43+
"redis": true,
44+
"kafka": true,
45+
"circuit_breaker_state": "closed"
46+
}
47+
```
48+
49+
- `200 OK`: All services healthy
50+
- `503 Service Unavailable`: One or more services unhealthy
51+
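Because the status code alone distinguishes healthy from unhealthy, a script can probe the endpoint without parsing the JSON body. This is a minimal sketch; `check_health` is a hypothetical name, and the URL in the usage line is the one used elsewhere in this runbook.

```bash
# Hypothetical helper: succeed only when the health endpoint answers 200 OK.
# -s silences progress output, -o /dev/null discards the body,
# -w '%{http_code}' prints only the HTTP status code.
check_health() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1")
  [ "$code" = "200" ]
}

# Usage:
#   check_health http://localhost:8080/health && echo "gateway healthy"
```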
52+
### Logging
53+
54+
All services use structured JSON logging with correlation IDs:
55+
56+
```json
57+
{
58+
"timestamp": "2025-11-29T21:56:00.000Z",
59+
"level": "INFO",
60+
"message": "Order queued successfully",
61+
"correlation_id": "uuid-123",
62+
"service": "gateway",
63+
"event": "order_queued",
64+
"user_id": "u1",
65+
"item_id": "101",
66+
"processing_time_ms": 145
67+
}
68+
```
69+
70+
**Key Fields**:
71+
- `correlation_id`: Trace requests across services
72+
- `service`: Service name (gateway/processor)
73+
- `event`: Event type (order_received, order_queued, order_processed, etc.)
74+
- `processing_time_ms`: Request processing time
75+
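To follow one request across both services, the logs can be filtered on `correlation_id`. The sketch below assumes one JSON object per log line, as in the sample above; `trace_request` is a hypothetical helper, and the ID is interpolated into a regular expression, so it should contain only characters such as hex digits and hyphens.

```bash
# Hypothetical helper: keep only log lines carrying the given correlation ID.
# Tolerates optional whitespace around the colon in the JSON output.
trace_request() {
  grep -E "\"correlation_id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage (service names as used in the docker-compose commands in this runbook):
#   docker-compose logs --no-color gateway processor | trace_request uuid-123
```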
76+
## Alerting Thresholds
77+
78+
### Critical Alerts
79+
80+
1. **Circuit Breaker Open**
81+
- Metric: `gateway_circuit_breaker_state == 1`
82+
- Action: Check Kafka connectivity, restart if needed
83+
- Impact: All orders rejected with 503
84+
85+
2. **DLQ Size Exceeds Threshold**
86+
- Metric: `processor_dlq_size > 100`
87+
- Action: Investigate failure reasons, process DLQ manually
88+
- Impact: Orders not being processed
89+
90+
3. **DLQ Age Too High**
91+
- Metric: `processor_dlq_oldest_message_age_seconds > 3600`
92+
- Action: Process oldest messages first
93+
- Impact: Stale orders in DLQ
94+
95+
4. **High Failure Rate**
96+
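- Metric: `rate(gateway_orders_failed_total[5m]) / rate(gateway_orders_received_total[5m]) > 0.1` (5-minute window shown as an example; dividing raw counter totals would average over the whole process lifetime)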
97+
- Action: Check service health, review logs
98+
- Impact: 10%+ of orders failing
99+
100+
5. **Processing Time High**
101+
- Metric: `histogram_quantile(0.99, sum by (le) (rate(processor_order_processing_duration_seconds_bucket[5m]))) > 5` (p99 over an example 5-minute window)
102+
- Action: Check Redis/Kafka latency, scale processor
103+
- Impact: Slow order processing
104+
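The failure-rate alert compares failed orders to received orders. As a rough sketch of that arithmetic outside Prometheus, the hypothetical helper below takes two counter increases sampled over the same interval (an assumption about how the values are collected) and guards against an interval with no traffic.

```bash
# Hypothetical helper: ratio of failed to received orders.
# $1 = increase in gateway_orders_failed_total over the interval
# $2 = increase in gateway_orders_received_total over the same interval
# Prints 0.0000 when no orders were received, to avoid dividing by zero.
failure_ratio() {
  awk -v failed="$1" -v received="$2" 'BEGIN {
    if (received <= 0) { print "0.0000"; exit }
    printf "%.4f\n", failed / received
  }'
}

# Usage:
#   failure_ratio 15 100   # 15 failures out of 100 orders
```

A result above `0.1` corresponds to the 10% threshold listed for the High Failure Rate alert.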
105+
### Warning Alerts
106+
107+
1. **Rate Limit Approaching**
108+
- Monitor: Rate limit rejections increasing
109+
- Action: Review rate limit configuration
110+
111+
2. **Inventory Low**
112+
- Metric: `processor_inventory_level < 10`
113+
- Action: Restock or prepare for sold out
114+
115+
## Troubleshooting
116+
117+
### Issue: Circuit Breaker Open
118+
119+
**Symptoms**:
120+
- All requests return 503 Service Unavailable
121+
- Health check shows `circuit_breaker_state: "open"`
122+
123+
**Diagnosis**:
124+
```bash
125+
# Check Kafka connectivity
126+
docker exec flash-sale-engine-redpanda-1 rpk cluster info
127+
128+
# Check gateway logs
129+
docker-compose logs gateway | grep -i "circuit"
130+
```
131+
132+
**Resolution**:
133+
1. Check if Kafka/Redpanda is running: `docker-compose ps redpanda`
134+
2. Restart Kafka if needed: `docker-compose restart redpanda`
135+
3. Wait 30 seconds for circuit breaker to attempt recovery
136+
4. Check health endpoint: `curl http://localhost:8080/health`
137+
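Steps 3 and 4 can be combined into a polling loop that waits for the breaker to report `closed`. This is a sketch under assumptions: the helper names are hypothetical, and it relies only on the `circuit_breaker_state` field shown in the health response earlier in this runbook.

```bash
# Hypothetical helper: extract circuit_breaker_state from health JSON on stdin.
breaker_state() {
  sed -n 's/.*"circuit_breaker_state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll the health endpoint until the breaker reports "closed".
# $1 = health URL, $2 = maximum attempts, $3 = seconds between attempts.
# Returns 0 once closed, 1 if the attempts run out.
wait_for_closed() {
  attempt=0
  while [ "$attempt" -lt "$2" ]; do
    state=$(curl -s "$1" | breaker_state)
    if [ "$state" = "closed" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$3"
  done
  return 1
}

# Usage: check every 5 seconds for up to a minute
#   wait_for_closed http://localhost:8080/health 12 5 || echo "breaker still not closed"
```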
138+
### Issue: Orders Not Processing
139+
140+
**Symptoms**:
141+
- Orders accepted but not processed
142+
- Inventory not decreasing
143+
144+
**Diagnosis**:
145+
```bash
146+
# Check processor logs
147+
docker-compose logs processor
148+
149+
# Check Kafka topic
150+
docker exec flash-sale-engine-redpanda-1 rpk topic consume orders --num 10  # stop after 10 records instead of following the topic indefinitely
151+
152+
# Check processor metrics
153+
curl http://localhost:9090/metrics | grep processor_orders_processed
154+
```
155+
156+
**Resolution**:
157+
1. Check processor is running: `docker-compose ps processor`
158+
2. Check Kafka connectivity from processor
159+
3. Verify Redis connection
160+
4. Restart processor if needed: `docker-compose restart processor`
161+
162+
### Issue: High DLQ Size
163+
164+
**Symptoms**:
165+
- `processor_dlq_size` metric increasing
166+
- Many failed orders
167+
168+
**Diagnosis**:
169+
```bash
170+
# Check DLQ messages
171+
docker exec flash-sale-engine-redpanda-1 rpk topic consume orders-dlq
172+
173+
# Check failure reasons in logs
174+
docker-compose logs processor | grep -i "dlq"
175+
```
176+
177+
**Resolution**:
178+
1. Identify failure pattern (check DLQ message headers for error reasons)
179+
2. Common reasons:
180+
- `Payment Timeout`: Expected (10% simulation), can be ignored
181+
- `Redis Failure`: Check Redis health
182+
- `Invalid Order Format`: Check gateway message format
183+
3. Process DLQ manually or implement retry logic
184+
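To see which failure pattern dominates, the reason strings listed above can be tallied. The sketch assumes those strings appear verbatim somewhere in the text piped in (for example in consumed DLQ records or processor logs); `dlq_reason_counts` is a hypothetical helper, not an existing tool.

```bash
# Hypothetical helper: count occurrences of the known DLQ failure reasons
# on stdin and print "count<TAB>reason", most frequent first.
dlq_reason_counts() {
  awk '{
    # Consume every match on the line, not just the first one.
    while (match($0, /Payment Timeout|Redis Failure|Invalid Order Format/)) {
      count[substr($0, RSTART, RLENGTH)]++
      $0 = substr($0, RSTART + RLENGTH)
    }
  }
  END { for (r in count) printf "%d\t%s\n", count[r], r }' | sort -rn
}

# Usage:
#   docker-compose logs --no-color processor | dlq_reason_counts
```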
185+
### Issue: Inventory Mismatch
186+
187+
**Symptoms**:
188+
- Inventory count doesn't match expected value
189+
- Negative inventory (shouldn't happen with Lua scripts)
190+
191+
**Diagnosis**:
192+
```bash
193+
# Check current inventory
194+
docker exec flash-sale-engine-redis-1 redis-cli GET inventory:101
195+
196+
# Check order status keys
197+
docker exec flash-sale-engine-redis-1 redis-cli --scan --pattern "order_status:*"  # SCAN iterates incrementally; KEYS blocks Redis while it walks the whole keyspace
198+
```
199+
200+
**Resolution**:
201+
1. Verify Lua scripts are being used (check processor logs)
202+
2. Check for Redis connection issues during script execution
203+
3. Manually correct inventory if needed:
204+
```bash
205+
docker exec flash-sale-engine-redis-1 redis-cli SET inventory:101 100
206+
```
207+
208+
### Issue: Rate Limiting Too Aggressive
209+
210+
**Symptoms**:
211+
- Many 429 Too Many Requests responses
212+
- Legitimate users being blocked
213+
214+
**Diagnosis**:
215+
```bash
216+
# Check rate limit configuration
217+
docker-compose exec gateway env | grep RATE_LIMIT
218+
219+
# Check rate limit keys in Redis
220+
docker exec flash-sale-engine-redis-1 redis-cli --scan --pattern "ratelimit:*"  # SCAN iterates incrementally; KEYS blocks Redis while it walks the whole keyspace
221+
```
222+
223+
**Resolution**:
224+
1. Adjust rate limit via environment variables:
225+
```yaml
226+
# docker-compose.yml
227+
environment:
228+
  RATE_LIMIT_MAX_REQUESTS: 120 # Increase from default 60
229+
  RATE_LIMIT_WINDOW: 1m
230+
```
231+
2. Restart gateway: `docker-compose restart gateway`
232+
233+
## Configuration
234+
235+
### Environment Variables
236+
237+
**Gateway**:
238+
- `REDIS_ADDR`: Redis address (default: `redis-service:6379`)
239+
- `KAFKA_ADDR`: Kafka address (default: `kafka-service:9092`)
240+
- `LOG_LEVEL`: Log level (default: `info`)
241+
- `CIRCUIT_BREAKER_FAILURE_THRESHOLD`: Failures before opening (default: `5`)
242+
- `CIRCUIT_BREAKER_SUCCESS_THRESHOLD`: Successes in half-open (default: `2`)
243+
- `CIRCUIT_BREAKER_BASE_TIMEOUT`: Base timeout (default: `30s`)
244+
- `CIRCUIT_BREAKER_MAX_TIMEOUT`: Max timeout (default: `300s`)
245+
- `RATE_LIMIT_MAX_REQUESTS`: Max requests per window (default: `60`)
246+
- `RATE_LIMIT_WINDOW`: Rate limit window (default: `1m`)
247+
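The variable list does not say how `CIRCUIT_BREAKER_BASE_TIMEOUT` and `CIRCUIT_BREAKER_MAX_TIMEOUT` interact. One common reading, shown here purely as an assumption and not confirmed by this runbook or the gateway code, is that the open-state timeout doubles after each consecutive trip until it reaches the maximum. The helper name is hypothetical.

```bash
# ASSUMED policy (not confirmed by the source): the timeout doubles after
# each consecutive trip and is capped at the maximum.
# $1 = base timeout in seconds, $2 = max timeout in seconds,
# $3 = number of consecutive trips so far (0 for the first trip).
breaker_timeout() {
  timeout=$1
  i=0
  while [ "$i" -lt "$3" ] && [ "$timeout" -lt "$2" ]; do
    timeout=$((timeout * 2))
    i=$((i + 1))
  done
  if [ "$timeout" -gt "$2" ]; then
    timeout=$2
  fi
  echo "$timeout"
}

# Usage with the documented defaults (30s base, 300s max):
#   breaker_timeout 30 300 2
```

Under that assumption the defaults would give waits of 30, 60, 120, and 240 seconds, then 300 seconds for every later trip.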
248+
**Processor**:
249+
- `REDIS_ADDR`: Redis address (default: `redis-service:6379`)
250+
- `KAFKA_ADDR`: Kafka address (default: `kafka-service:9092`)
251+
- `LOG_LEVEL`: Log level (default: `info`)
252+
253+
## Backup and Recovery
254+
255+
### Redis Backup
256+
257+
```bash
258+
# Create backup
259+
docker exec flash-sale-engine-redis-1 redis-cli SAVE
260+
docker cp flash-sale-engine-redis-1:/data/dump.rdb ./backup-$(date +%Y%m%d).rdb
261+
262+
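# Restore backup (stop Redis first; otherwise the snapshot it writes on shutdown can overwrite the restored file)
263+
docker-compose stop redis
docker cp ./backup-20251129.rdb flash-sale-engine-redis-1:/data/dump.rdb
264+
docker-compose start redis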
265+
```
266+
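Before relying on a copied snapshot, it is worth a basic sanity check. RDB files begin with the ASCII magic string `REDIS`, so a non-empty file with that prefix is at least plausibly a snapshot. The helper below is a hypothetical sketch and does not validate the rest of the file.

```bash
# Hypothetical helper: succeed if the file is non-empty and starts with the
# RDB magic string "REDIS". This is a plausibility check, not full validation.
is_rdb_file() {
  [ -s "$1" ] && [ "$(head -c 5 "$1")" = "REDIS" ]
}

# Usage:
#   is_rdb_file ./backup-20251129.rdb || echo "backup looks invalid"
```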
267+
### Kafka/Redpanda Backup
268+
269+
Redpanda data is stored in volumes. Backup the volume:
270+
```bash
271+
docker run --rm -v flash-sale-engine_redpanda-data:/data -v "$(pwd)":/backup alpine tar czf "/backup/redpanda-backup-$(date +%Y%m%d).tar.gz" /data
272+
```
273+
274+
## Performance Tuning
275+
276+
### Scaling
277+
278+
**Horizontal Scaling**:
279+
- Gateway: Stateless, can scale horizontally
280+
- Processor: Run multiple instances in one Kafka consumer group; parallelism is capped by the number of partitions on the `orders` topic
281+
282+
**Vertical Scaling**:
283+
- Increase Redis memory for larger inventory
284+
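- Increase Kafka partitions for higher throughput (this adds parallelism rather than per-node capacity, and a partition count can be raised but not lowered later)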
285+
286+
### Optimization
287+
288+
1. **Redis Connection Pooling**: Already configured in go-redis
289+
2. **Kafka Batch Size**: Adjust producer batch size for throughput
290+
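3. **Lua Script Caching**: Redis keeps a script cached after its first `EVAL` or `SCRIPT LOAD`, so later calls can use `EVALSHA` instead of resending the script body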
291+
4. **Circuit Breaker Tuning**: Adjust thresholds based on failure patterns
292+
293+
## Maintenance Windows
294+
295+
### Zero-Downtime Deployment
296+
297+
1. Deploy new version to new pods
298+
2. Wait for health checks to pass
299+
3. Gradually shift traffic
300+
4. Monitor metrics for issues
301+
5. Rollback if problems detected
302+
303+
### Graceful Shutdown
304+
305+
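Services handle SIGTERM gracefully (`docker-compose stop` and `restart` send SIGKILL after 10 seconds by default, so pass `-t 30` or higher to allow the full drain):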
306+
- Gateway: Stops accepting new requests, waits for in-flight (30s timeout)
307+
- Processor: Stops consuming, processes current message (30s timeout)
308+
309+
## Emergency Procedures
310+
311+
### Complete System Failure
312+
313+
1. **Stop all services**: `docker-compose down`
314+
2. **Check data integrity**: Verify Redis and Kafka data
315+
3. **Restore from backup** if needed
316+
4. **Restart services**: `docker-compose up -d`
317+
5. **Verify health**: Check all health endpoints
318+
6. **Monitor metrics**: Watch for anomalies
319+
320+
### Data Corruption
321+
322+
1. **Stop services**: Prevent further corruption
323+
2. **Restore from backup**
324+
3. **Verify inventory counts**
325+
4. **Replay DLQ messages** if needed
326+
5. **Restart services**
327+
328+
## Contact and Escalation
329+
330+
- **On-Call Engineer**: Check team rotation schedule
331+
- **Critical Issues**: Escalate immediately
332+
- **Documentation**: Update this runbook with new procedures
333+

README.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ A high-concurrency distributed system for handling flash sales with **idempotenc
 ┌─────────┐ ┌─────────┐
 │ Redis │ │ DLQ │
 │(Idempot │ │(Failed │
-│ ency) │ │ Orders) │
+│ ency) │ │ Orders) │
 └─────────┘ └─────────┘
 ```
 
