Commit 4cf4fb9

feat: add stale job lock cleanup and production safeguards
This addresses a production incident where stale locks accumulated over time, blocking job processing. The root cause was twofold:

1. Cleanup only targeted queue locks, not job locks
2. Cleanup only ran on startup, not periodically

Changes:
- Add release_stale_job_locks() to clean orphaned job-level locks
- Add periodic cleanup task (default: every 60 seconds)
- Add separate configurable timeouts for queue vs job locks
- Add minimum timeout enforcement with warnings/errors
- Add comprehensive metrics for cleanup operations
- Add cleanup health gauge for alerting
- Add integration tests proving the SQL actually works
- Document new config options and metrics

New WorkerConfig options:
- stale_lock_cleanup_interval: periodic cleanup interval (default: 60s)
- stale_queue_lock_timeout: queue lock staleness threshold (default: 5 min)
- stale_job_lock_timeout: job lock staleness threshold (default: 30 min)

New metrics:
- backfill_cleanup_queue_locks_released (counter)
- backfill_cleanup_job_locks_released (counter)
- backfill_cleanup_failed_jobs_deleted (counter)
- backfill_cleanup_failures (counter with operation/error_type labels)
- backfill_cleanup_last_success_timestamp (gauge for health alerting)

BREAKING: startup_cleanup() now returns (u64, u64, u64) instead of (u64, u64)
1 parent 3ad1125 commit 4cf4fb9
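The BREAKING change means every caller that destructures the old two-element tuple must be updated. A minimal sketch of the migration, using a synchronous stub in place of the real async `BackfillClient::startup_cleanup` (the returned counts here are hypothetical):

```rust
// Stub standing in for BackfillClient::startup_cleanup after this commit.
// The real method is async and returns a Result; the values are made up.
fn startup_cleanup() -> (u64, u64, u64) {
    (3, 1, 7) // (queue_locks_released, job_locks_released, failed_jobs_deleted)
}

fn main() {
    // Before: let (locks_released, jobs_deleted) = client.startup_cleanup().await?;
    // After: the tuple gains a middle element for released job locks.
    let (queue_locks, job_locks, failed_jobs) = startup_cleanup();
    println!("queue={queue_locks} job={job_locks} failed={failed_jobs}");
}
```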

File tree

9 files changed, +773 −17 lines


README.md

Lines changed: 40 additions & 0 deletions
````diff
@@ -25,6 +25,7 @@ Built on top of `graphile_worker` (v0.8.6), backfill adds these production-ready
 - 🏃 **Flexible Worker Patterns** - `WorkerRunner` supporting tokio::select!, background tasks, and one-shot processing
 - 🔧 **Admin API** - Optional Axum router for HTTP-based job management (experimental)
 - 📝 **Convenience Functions** - `enqueue_fast()`, `enqueue_bulk()`, `enqueue_critical()`, etc.
+- 🧹 **Stale Lock Cleanup** - Automatic cleanup of orphaned locks from crashed workers (startup + periodic)
 
 All built on graphile_worker's rock-solid foundation of PostgreSQL SKIP LOCKED and LISTEN/NOTIFY.
 
@@ -69,6 +70,45 @@ All configuration is passed in via environment variables:
 - `POLL_INTERVAL_MS`: Job polling interval (default: 200ms)
 - `RUST_LOG`: Logging configuration
 
+### WorkerConfig Options
+
+When building a `WorkerRunner`, you can configure additional options:
+
+```rust
+use std::time::Duration;
+use backfill::{WorkerConfig, WorkerRunner};
+
+let config = WorkerConfig::new(&database_url)
+    .with_schema("graphile_worker") // PostgreSQL schema (default)
+    .with_poll_interval(Duration::from_millis(200)) // Job polling interval
+    .with_dlq_processor_interval(Some(Duration::from_secs(60))) // DLQ processing
+    // Stale lock cleanup configuration
+    .with_stale_lock_cleanup_interval(Some(Duration::from_secs(60))) // Periodic cleanup
+    .with_stale_queue_lock_timeout(Duration::from_secs(300)) // 5 min (queue locks)
+    .with_stale_job_lock_timeout(Duration::from_secs(1800)); // 30 min (job locks)
+
+let worker = WorkerRunner::builder(config).await?
+    .define_job::<MyJob>()
+    .build().await?;
+```
+
+#### Stale Lock Cleanup
+
+When workers crash without graceful shutdown, they can leave locks behind that prevent jobs from being processed. Backfill automatically cleans these up:
+
+- **Startup cleanup**: Runs when the worker starts
+- **Periodic cleanup**: Runs every 60 seconds by default (configurable)
+
+**Configuration options:**
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `stale_lock_cleanup_interval` | 60s | How often to check for stale locks. Set to `None` to disable periodic cleanup. |
+| `stale_queue_lock_timeout` | 5 min | Queue locks older than this are considered stale. Queue locks are normally held for milliseconds. |
+| `stale_job_lock_timeout` | 30 min | Job locks older than this are considered stale. **Set this longer than your longest-running job!** |
+
+**⚠️ Warning:** Setting `stale_job_lock_timeout` too short can cause duplicate job execution if jobs legitimately run longer than the timeout. This can lead to data corruption.
+
 ### SQLx Compile-Time Query Verification
 
 This library uses SQLx's compile-time query verification for production safety. Set `DATABASE_URL` during compilation to enable type-safe, compile-time checked SQL queries:
````
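The README's advice to set `stale_job_lock_timeout` longer than the longest job can be made mechanical. A sketch of one possible heuristic — the 2× multiplier and five-minute margin are assumptions, not part of the library:

```rust
use std::time::Duration;

/// Hypothetical heuristic: twice the longest observed job runtime plus a
/// fixed safety margin, but never below the library's 30-minute default.
fn suggested_job_lock_timeout(longest_job: Duration) -> Duration {
    let default = Duration::from_secs(1800); // DEFAULT_STALE_JOB_LOCK_TIMEOUT
    let padded = longest_job * 2 + Duration::from_secs(300);
    padded.max(default)
}

fn main() {
    // A 20-minute job pushes the suggestion past the default...
    assert_eq!(suggested_job_lock_timeout(Duration::from_secs(1200)).as_secs(), 2700);
    // ...while short jobs fall back to the 30-minute default.
    assert_eq!(suggested_job_lock_timeout(Duration::from_secs(60)).as_secs(), 1800);
}
```

Erring long only delays recovery of a crashed job; erring short risks the duplicate execution the warning above describes.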

docs/03-metrics.md

Lines changed: 50 additions & 0 deletions
````diff
@@ -350,6 +350,42 @@ Monitor worker pool health and utilization.
 - `result`: Poll result (jobs_found, empty, error)
 - **Use**: Monitor polling efficiency, detect issues
 
+### Cleanup Metrics
+
+Track stale lock cleanup operations. These are critical for detecting when cleanup isn't working properly.
+
+#### `backfill_cleanup_queue_locks_released`
+- **Type**: Counter
+- **Description**: Total number of stale queue locks released
+- **Labels**: None
+- **Use**: Monitor cleanup activity, detect stuck queues
+
+#### `backfill_cleanup_job_locks_released`
+- **Type**: Counter
+- **Description**: Total number of stale job locks released
+- **Labels**: None
+- **Use**: Monitor cleanup activity, detect crashed workers leaving orphaned jobs
+
+#### `backfill_cleanup_failed_jobs_deleted`
+- **Type**: Counter
+- **Description**: Total number of permanently failed jobs cleaned up from main queue
+- **Labels**: None
+- **Use**: Track cleanup of exhausted-retry jobs
+
+#### `backfill_cleanup_failures`
+- **Type**: Counter
+- **Description**: Cleanup operations that failed
+- **Labels**:
+  - `operation`: Which cleanup operation failed (queue_locks, job_locks)
+  - `error_type`: Error classification (timeout, network, etc.)
+- **Use**: Alert on cleanup failures
+
+#### `backfill_cleanup_last_success_timestamp`
+- **Type**: Gauge
+- **Description**: Unix timestamp of last successful cleanup run
+- **Labels**: None
+- **Use**: **Critical health signal** - alert if cleanup hasn't succeeded recently
+
 ### Retry Metrics
 
 Understand retry patterns and effectiveness.
@@ -511,6 +547,20 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
     for: 5m
     severity: critical
     description: Job failure rate >10% for 5+ minutes
+
+  # Cleanup not running (CRITICAL - can cause job lock buildup)
+  - alert: CleanupNotRunning
+    expr: time() - backfill_cleanup_last_success_timestamp > 300
+    for: 5m
+    severity: critical
+    description: Stale lock cleanup hasn't succeeded in 5+ minutes
+
+  # Cleanup releasing locks (indicates crashed workers)
+  - alert: StaleLocksReleased
+    expr: increase(backfill_cleanup_job_locks_released[5m]) > 0
+    for: 0m
+    severity: warning
+    description: Cleanup released stale job locks - indicates worker crash
 ```
 
 ### Warning Alerts
````
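The `CleanupNotRunning` alert compares `time()` against the timestamp gauge; the same staleness check can be performed in application code, e.g. for a readiness probe. A dependency-free sketch — the 300-second threshold mirrors the alert rule, and the function name is illustrative:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Mirrors the PromQL `time() - backfill_cleanup_last_success_timestamp > 300`:
/// cleanup is healthy only if its last success is within the threshold.
fn cleanup_is_healthy(last_success_unix: u64, now_unix: u64, threshold_secs: u64) -> bool {
    now_unix.saturating_sub(last_success_unix) <= threshold_secs
}

fn main() {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    assert!(cleanup_is_healthy(now - 60, now, 300));   // succeeded 1 min ago: healthy
    assert!(!cleanup_is_healthy(now - 600, now, 300)); // 10 min ago: would fire the alert
}
```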

examples/basic_worker.rs

Lines changed: 1 addition & 0 deletions
```diff
@@ -118,6 +118,7 @@ impl From<ExampleWorkerConfig> for WorkerConfig {
             ],
             poll_interval: value.poll_interval,
             dlq_processor_interval: Some(value.dlq_processor_interval),
+            ..Default::default()
         }
     }
 }
```
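The one-line fix uses Rust's struct update syntax, so the new `WorkerConfig` fields (such as the cleanup timeouts) pick up their `Default` values instead of breaking this conversion whenever a field is added. A minimal standalone illustration, with a hypothetical `Config` struct standing in for `WorkerConfig`:

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
struct Config {
    poll_interval: Duration,
    cleanup_interval: Option<Duration>, // newer field, added later
}

impl Default for Config {
    fn default() -> Self {
        Config {
            poll_interval: Duration::from_millis(200),
            cleanup_interval: Some(Duration::from_secs(60)),
        }
    }
}

fn main() {
    // Set only the fields you care about; `..Default::default()` fills the
    // rest, so adding fields to Config later cannot break this constructor.
    let cfg = Config {
        poll_interval: Duration::from_millis(500),
        ..Default::default()
    };
    assert_eq!(cfg.poll_interval, Duration::from_millis(500));
    assert_eq!(cfg.cleanup_interval, Some(Duration::from_secs(60)));
}
```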

src/client/cleanup.rs

Lines changed: 192 additions & 11 deletions
```diff
@@ -2,17 +2,43 @@
 //!
 //! Provides functions to clean up stale state that can accumulate when workers
 //! crash or are forcibly terminated without graceful shutdown.
+//!
+//! ## Lock Types
+//!
+//! Graphile Worker uses two types of locks:
+//!
+//! 1. **Queue locks** (`_private_job_queues.locked_at`) - Brief locks held
+//!    during job selection. These are typically held for milliseconds.
+//!
+//! 2. **Job locks** (`_private_jobs.locked_at`) - Locks held while jobs are
+//!    being processed. These can be held for the duration of job execution
+//!    (minutes).
+//!
+//! When a worker crashes, both types of locks can become orphaned. This module
+//! provides functions to clean up both.
 
 use std::time::Duration;
 
+use tokio::task::JoinHandle;
+use tokio_util::sync::CancellationToken;
+
 use super::BackfillClient;
 use crate::BackfillError;
 
 /// Default timeout for considering a queue lock stale.
 ///
 /// Queue locks are held briefly during job selection (milliseconds), so any
 /// lock older than this is almost certainly from a crashed worker.
-pub const DEFAULT_STALE_LOCK_TIMEOUT: Duration = Duration::from_secs(300); // 5 minutes
+pub const DEFAULT_STALE_QUEUE_LOCK_TIMEOUT: Duration = Duration::from_secs(300); // 5 minutes
+
+/// Default timeout for considering a job lock stale.
+///
+/// Job locks are held while jobs execute, which can take minutes. This timeout
+/// should be longer than your longest-running job. Default: 30 minutes.
+pub const DEFAULT_STALE_JOB_LOCK_TIMEOUT: Duration = Duration::from_secs(1800); // 30 minutes
+
+/// Default interval for periodic stale lock cleanup.
+pub const DEFAULT_STALE_LOCK_CLEANUP_INTERVAL: Duration = Duration::from_secs(60); // 1 minute
 
 impl BackfillClient {
     /// Release stale queue locks that were left behind by crashed workers.
@@ -45,6 +71,9 @@ impl BackfillClient {
         let result = sqlx::query(&query).execute(&self.pool).await?;
         let released = result.rows_affected();
 
+        // Always emit metrics (counter increments even for 0)
+        crate::metrics::record_cleanup_queue_locks_released(released);
+
         if released > 0 {
             log::info!(
                 "Released stale queue locks (count: {}, timeout_secs: {})",
@@ -56,6 +85,49 @@ impl BackfillClient {
         Ok(released)
     }
 
+    /// Release stale job locks that were left behind by crashed workers.
+    ///
+    /// When a worker crashes while processing a job, the job remains locked
+    /// in `_private_jobs` and will never be retried. This function releases
+    /// any job locks older than the specified timeout, allowing the jobs to
+    /// be picked up again by other workers.
+    ///
+    /// # Arguments
+    /// * `timeout` - Locks older than this duration are considered stale
+    ///
+    /// # Returns
+    /// Number of job locks that were released
+    pub async fn release_stale_job_locks(&self, timeout: Duration) -> Result<u64, BackfillError> {
+        let timeout_secs = timeout.as_secs();
+
+        let query = format!(
+            r#"
+            UPDATE {schema}._private_jobs
+            SET locked_at = NULL, locked_by = NULL
+            WHERE locked_at IS NOT NULL
+              AND locked_at < NOW() - INTERVAL '{timeout_secs} seconds'
+            "#,
+            schema = self.schema,
+            timeout_secs = timeout_secs
+        );
+
+        let result = sqlx::query(&query).execute(&self.pool).await?;
+        let released = result.rows_affected();
+
+        // Always emit metrics (counter increments even for 0)
+        crate::metrics::record_cleanup_job_locks_released(released);
+
+        if released > 0 {
+            log::info!(
+                "Released stale job locks (count: {}, timeout_secs: {})",
+                released,
+                timeout_secs
+            );
+        }
+
+        Ok(released)
+    }
+
     /// Delete permanently failed jobs from the main queue.
     ///
     /// Jobs that have exhausted all retry attempts (attempts >= max_attempts)
@@ -81,6 +153,9 @@ impl BackfillClient {
         let result = sqlx::query(&query).execute(&self.pool).await?;
         let deleted = result.rows_affected();
 
+        // Emit metric
+        crate::metrics::record_cleanup_failed_jobs_deleted(deleted);
+
         if deleted > 0 {
             log::info!(
                 "Cleaned up permanently failed jobs from main queue (count: {})",
@@ -91,27 +166,133 @@ impl BackfillClient {
         Ok(deleted)
     }
 
-    /// Run all startup cleanup tasks.
+    /// Run all startup cleanup tasks with default timeouts.
     ///
     /// This should be called when a worker starts to clean up any stale state
     /// left behind by previous workers. It performs:
-    /// 1. Release stale queue locks (using default timeout)
-    /// 2. Delete permanently failed jobs from main queue
+    /// 1. Release stale queue locks (5 minute timeout)
+    /// 2. Release stale job locks (30 minute timeout)
+    /// 3. Delete permanently failed jobs from main queue
     ///
     /// # Returns
-    /// Tuple of (stale_locks_released, failed_jobs_deleted)
-    pub async fn startup_cleanup(&self) -> Result<(u64, u64), BackfillError> {
-        log::info!("Running startup cleanup tasks");
+    /// Tuple of (queue_locks_released, job_locks_released, failed_jobs_deleted)
+    pub async fn startup_cleanup(&self) -> Result<(u64, u64, u64), BackfillError> {
+        self.startup_cleanup_with_timeouts(DEFAULT_STALE_QUEUE_LOCK_TIMEOUT, DEFAULT_STALE_JOB_LOCK_TIMEOUT)
+            .await
+    }
 
-        let locks_released = self.release_stale_queue_locks(DEFAULT_STALE_LOCK_TIMEOUT).await?;
+    /// Run all startup cleanup tasks with custom timeouts.
+    ///
+    /// This allows configuring the stale lock thresholds for environments
+    /// where the defaults aren't appropriate.
+    ///
+    /// # Arguments
+    /// * `queue_lock_timeout` - Timeout for queue locks (normally held for ms)
+    /// * `job_lock_timeout` - Timeout for job locks (held during job execution)
+    ///
+    /// # Returns
+    /// Tuple of (queue_locks_released, job_locks_released, failed_jobs_deleted)
+    pub async fn startup_cleanup_with_timeouts(
+        &self,
+        queue_lock_timeout: Duration,
+        job_lock_timeout: Duration,
+    ) -> Result<(u64, u64, u64), BackfillError> {
+        log::info!(
+            "Running startup cleanup (queue_lock_timeout: {}s, job_lock_timeout: {}s)",
+            queue_lock_timeout.as_secs(),
+            job_lock_timeout.as_secs()
+        );
+
+        let queue_locks_released = self.release_stale_queue_locks(queue_lock_timeout).await?;
+        let job_locks_released = self.release_stale_job_locks(job_lock_timeout).await?;
         let jobs_deleted = self.cleanup_permanently_failed_jobs().await?;
 
         log::info!(
-            "Startup cleanup completed (stale_locks_released: {}, failed_jobs_deleted: {})",
-            locks_released,
+            "Startup cleanup completed (queue_locks: {}, job_locks: {}, failed_jobs: {})",
+            queue_locks_released,
+            job_locks_released,
             jobs_deleted
         );
 
-        Ok((locks_released, jobs_deleted))
+        Ok((queue_locks_released, job_locks_released, jobs_deleted))
+    }
+
+    /// Start a background task that periodically cleans up stale locks.
+    ///
+    /// This spawns a task that runs at the specified interval, cleaning up
+    /// both queue-level and job-level locks using separate timeout thresholds.
+    ///
+    /// # Arguments
+    /// * `interval` - How often to check for stale locks
+    /// * `queue_lock_timeout` - Timeout for queue locks (normally held for ms)
+    /// * `job_lock_timeout` - Timeout for job locks (held during job execution)
+    /// * `cancellation_token` - Token to signal when to stop the background
+    ///   task
+    ///
+    /// # Returns
+    /// A JoinHandle for the background task
+    pub fn start_stale_lock_cleanup(
+        &self,
+        interval: Duration,
+        queue_lock_timeout: Duration,
+        job_lock_timeout: Duration,
+        cancellation_token: CancellationToken,
+    ) -> JoinHandle<()> {
+        let client = self.clone();
+
+        tokio::spawn(async move {
+            let mut interval_timer = tokio::time::interval(interval);
+            interval_timer.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
+
+            log::info!(
+                "Starting stale lock cleanup task (interval: {}s, queue_timeout: {}s, job_timeout: {}s)",
+                interval.as_secs(),
+                queue_lock_timeout.as_secs(),
+                job_lock_timeout.as_secs()
+            );
+
+            loop {
+                tokio::select! {
+                    _ = cancellation_token.cancelled() => {
+                        log::info!("Stale lock cleanup task shutting down");
+                        break;
+                    }
+                    _ = interval_timer.tick() => {
+                        let mut all_succeeded = true;
+
+                        // Clean queue locks (short timeout - these are normally held briefly)
+                        match client.release_stale_queue_locks(queue_lock_timeout).await {
+                            Ok(_) => {}
+                            Err(e) => {
+                                log::warn!("Failed to release stale queue locks: {}", e);
+                                crate::metrics::record_cleanup_failure(
                                    "queue_locks",
+                                    crate::metrics::classify_error_for_metrics(&e),
+                                );
+                                all_succeeded = false;
+                            }
+                        }
+
+                        // Clean job locks (longer timeout - jobs can run for a while)
+                        match client.release_stale_job_locks(job_lock_timeout).await {
+                            Ok(_) => {}
+                            Err(e) => {
+                                log::warn!("Failed to release stale job locks: {}", e);
+                                crate::metrics::record_cleanup_failure(
+                                    "job_locks",
+                                    crate::metrics::classify_error_for_metrics(&e),
+                                );
+                                all_succeeded = false;
+                            }
+                        }
+
+                        // Update health timestamp if both operations succeeded
+                        if all_succeeded {
+                            crate::metrics::update_cleanup_health_timestamp();
+                        }
+                    }
+                }
+            }
+        })
     }
 }
```
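`release_stale_job_locks` interpolates `timeout_secs` into the SQL with `format!` rather than a bind parameter; that is acceptable here only because the value is a `u64` derived from a `Duration`, never user-supplied text. A self-contained sketch of the same string construction (the schema name `graphile_worker` is an assumption):

```rust
use std::time::Duration;

// Build the UPDATE statement the same way the diff does: numeric interpolation
// of the timeout into a Postgres INTERVAL literal.
fn build_release_job_locks_sql(schema: &str, timeout: Duration) -> String {
    let timeout_secs = timeout.as_secs();
    format!(
        "UPDATE {schema}._private_jobs \
         SET locked_at = NULL, locked_by = NULL \
         WHERE locked_at IS NOT NULL \
           AND locked_at < NOW() - INTERVAL '{timeout_secs} seconds'"
    )
}

fn main() {
    let sql = build_release_job_locks_sql("graphile_worker", Duration::from_secs(1800));
    assert!(sql.starts_with("UPDATE graphile_worker._private_jobs"));
    assert!(sql.contains("INTERVAL '1800 seconds'"));
    println!("{sql}");
}
```

If the schema name could ever come from outside the application, a bind parameter or identifier quoting would be the safer design.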

src/client/mod.rs

Lines changed: 1 addition & 2 deletions
```diff
@@ -1,10 +1,9 @@
 //! The backfill client, split across a couple of files.
 
-mod cleanup;
+pub mod cleanup;
 mod dlq;
 mod enqueue;
 
-pub use cleanup::DEFAULT_STALE_LOCK_TIMEOUT;
 pub use dlq::*;
 
 /// High-level client for the backfill job queue system.
```
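The periodic task in `cleanup.rs` pairs `tokio::time::interval` with a `CancellationToken` inside `tokio::select!`. The same shutdown pattern can be sketched without tokio, using a channel whose `recv_timeout` doubles as both the interval tick and the cancellation check — a structural analogue for illustration, not the library's implementation:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run a periodic "cleanup" loop on a background thread until a shutdown
/// signal arrives; returns how many ticks fired. The mpsc channel stands in
/// for CancellationToken, recv_timeout for the interval timer.
fn run_until_shutdown(run_for: Duration, tick: Duration) -> u32 {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();

    let worker = thread::spawn(move || {
        let mut ticks = 0;
        loop {
            // recv_timeout covers both select! arms: a message (or a dropped
            // sender) means "cancelled", a timeout means "interval tick".
            match shutdown_rx.recv_timeout(tick) {
                Ok(()) | Err(mpsc::RecvTimeoutError::Disconnected) => break,
                Err(mpsc::RecvTimeoutError::Timeout) => {
                    ticks += 1; // the real task would release stale locks here
                }
            }
        }
        ticks
    });

    thread::sleep(run_for);
    shutdown_tx.send(()).unwrap(); // analogous to cancellation_token.cancel()
    worker.join().unwrap()
}

fn main() {
    let ticks = run_until_shutdown(Duration::from_millis(80), Duration::from_millis(10));
    assert!(ticks >= 1);
    println!("cleanup ran {ticks} times before shutdown");
}
```

The tokio version has the same three properties worth preserving in any reimplementation: the loop cannot outlive a cancel signal, a missed tick is skipped rather than replayed, and the caller can `join` to confirm the task actually stopped.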
