feat: add stale job lock cleanup and production safeguards

This addresses a production incident where stale locks accumulated over
time, blocking job processing. The root cause was twofold:

1. Cleanup only targeted queue locks, not job locks
2. Cleanup only ran on startup, not periodically

Changes:
- Add release_stale_job_locks() to clean orphaned job-level locks
- Add periodic cleanup task (default: every 60 seconds)
- Add separate configurable timeouts for queue vs job locks
- Add minimum timeout enforcement with warnings/errors
- Add comprehensive metrics for cleanup operations
- Add cleanup health gauge for alerting
- Add integration tests proving the SQL actually works
- Document new config options and metrics

New WorkerConfig options:
- stale_lock_cleanup_interval: periodic cleanup interval (default: 60s)
- stale_queue_lock_timeout: queue lock staleness threshold (default: 5 min)
- stale_job_lock_timeout: job lock staleness threshold (default: 30 min)

New metrics:
- backfill_cleanup_queue_locks_released (counter)
- backfill_cleanup_job_locks_released (counter)
- backfill_cleanup_failed_jobs_deleted (counter)
- backfill_cleanup_failures (counter with operation/error_type labels)
- backfill_cleanup_last_success_timestamp (gauge for health alerting)

BREAKING: startup_cleanup() now returns (u64, u64, u64) instead of (u64, u64)
    .with_stale_queue_lock_timeout(Duration::from_secs(300)) // 5 min (queue locks)
    .with_stale_job_lock_timeout(Duration::from_secs(1800)); // 30 min (job locks)

let worker = WorkerRunner::builder(config).await?
    .define_job::<MyJob>()
    .build().await?;
```

#### Stale Lock Cleanup

When workers crash without graceful shutdown, they can leave locks behind that prevent jobs from being processed. Backfill automatically cleans these up:

- **Startup cleanup**: Runs when the worker starts
- **Periodic cleanup**: Runs every 60 seconds by default (configurable)

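
Both cleanup passes come down to the same question for each lock: has it been held longer than the staleness threshold for its kind? The block below is a minimal sketch of that age check, not Backfill's actual implementation (the commit describes the real cleanup as SQL). `LockKind`, `StaleThresholds`, and `is_stale` are hypothetical names introduced for illustration; only the 5-minute and 30-minute values come from the documented defaults.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical lock categories, mirroring the queue/job split in the config.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LockKind {
    Queue,
    Job,
}

/// Illustrative thresholds (assumed type); values match the documented defaults.
struct StaleThresholds {
    queue: Duration,
    job: Duration,
}

impl Default for StaleThresholds {
    fn default() -> Self {
        Self {
            queue: Duration::from_secs(300), // 5 min
            job: Duration::from_secs(1800),  // 30 min
        }
    }
}

/// Returns true when a lock has been held strictly longer than the threshold
/// for its kind. A `locked_at` in the future (clock skew) is never stale.
fn is_stale(
    kind: LockKind,
    locked_at: SystemTime,
    now: SystemTime,
    thresholds: &StaleThresholds,
) -> bool {
    let threshold = match kind {
        LockKind::Queue => thresholds.queue,
        LockKind::Job => thresholds.job,
    };
    match now.duration_since(locked_at) {
        Ok(age) => age > threshold,
        Err(_) => false,
    }
}
```

Under these assumptions, a job lock held for 10 minutes is left alone, while a queue lock of the same age is released. This is why the two thresholds are configured separately.
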

**Configuration options:**

| Option | Default | Description |
|--------|---------|-------------|
| `stale_lock_cleanup_interval` | 60s | How often to check for stale locks. Set to `None` to disable periodic cleanup. |
| `stale_queue_lock_timeout` | 5 min | Queue locks older than this are considered stale. Queue locks are normally held for milliseconds. |
| `stale_job_lock_timeout` | 30 min | Job locks older than this are considered stale. **Set this longer than your longest-running job!** |

**⚠️ Warning:** Setting `stale_job_lock_timeout` too short can cause duplicate job execution if jobs legitimately run longer than the timeout. This can lead to data corruption.

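
One way to stay clear of this failure mode is to derive the timeout from the longest job runtime you expect instead of hard-coding it. The helper below is a sketch under that assumption. `recommended_job_lock_timeout`, its safety factor, and its floor are hypothetical and not part of Backfill's API.

```rust
use std::time::Duration;

/// Hypothetical helper: scale the longest expected job runtime by a safety
/// factor, and never return less than `floor`.
fn recommended_job_lock_timeout(
    longest_job: Duration,
    safety_factor: u32,
    floor: Duration,
) -> Duration {
    // saturating_mul avoids overflow for very large durations.
    longest_job.saturating_mul(safety_factor).max(floor)
}
```

With a 20-minute longest job, a factor of 2, and a 30-minute floor, this yields 40 minutes, which could then be passed to `.with_stale_job_lock_timeout(...)`. The right factor depends on how variable your job runtimes are.
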

### SQLx Compile-Time Query Verification

This library uses SQLx's compile-time query verification for production safety. Set `DATABASE_URL` during compilation to enable type-safe, compile-time checked SQL queries: