Eliminate Event_Summaries / bucket-table deadlocks (zmstats, zmaudit, Event::delete, zmDbDo retry) by connortechnology · Pull Request #4810 · ZoneMinder/zoneminder

connortechnology · 2026-05-07T02:38:04Z

Summary

Eliminates the recurring InnoDB deadlock cycle observed under load between zma's event_update_trigger path, zmstats.pl's Event_Summaries maintenance, zmaudit.pl's aggregations, and zmfilter.pl (Event::delete).

The root cause was a multi-table UPDATE Event_Summaries es JOIN Events_Hour ... pattern: MariaDB takes S-locks on the joined bucket rows and holds them to TX commit regardless of isolation level, which inverts the canonical lock order

Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId]

against the trigger writers that hold X-locks on those same bucket rows. The InnoDB deadlock log captured the exact zma vs zmstats cycle.

What changes

zmDbDo auto-retries on errno 1213 (5 attempts, exponential backoff with jitter) when called outside an explicit transaction. Most call sites are autocommit-style and already idempotent on retry.
Event::delete sets READ COMMITTED for the per-event delete TX, runs Stats/Event_Data/Frames/Events deletes, and retries the whole TX on 1213. Also fixes a latent commit() -> rollback() bug on the error path.
zmstats.pl Event_Summaries maintenance is rewritten to:
1. Prune aged rows from Events_Hour/Day/Week/Month (RC, no gap locks).
2. Snapshot per-monitor counts/disk-space via plain SELECT (consistent read at RC, no locks).
3. Apply one UPDATE Event_Summaries per MonitorId using the snapshotted values.
This keeps the lock prefix (buckets -> Event_Summaries) consistent with the trigger path and never holds bucket S-locks while writing ES. Wrapped in 5-retry deadlock loop.
zmaudit.pl receives the same snapshot-then-per-row treatment for its Event_Summaries and Storage DiskSpace recomputations, replacing five correlated/JOINed UPDATEs that had the same lock-ordering hazard.
db/triggers.sql picks up a comment block above event_update_trigger documenting the canonical lock acquisition order so future writers don't reintroduce the cycle.

Why isolation level alone wasn't enough

SET TRANSACTION ISOLATION LEVEL READ COMMITTED drops the next-key/gap locks on plain DELETE/UPDATE range scans, but a multi-table UPDATE always takes S-locks on the JOINed rows for as long as the transaction lives. The structural fix had to be removing the JOIN, not lowering isolation.

Notes for reviewers

No schema changes. Pure script + Perl module + retry-loop changes plus one comment block in triggers.sql.
The zmDbDo retry is conditional on AutoCommit — it will not silently re-run statements that are part of an explicit caller transaction (those callers are responsible for their own retry, e.g. the Event::delete loop).
Event::delete's isolation set + retry path is bypassed when the caller is already inside a transaction ($in_transaction short-circuit), so callers that wrap multiple deletes in their own TX keep their semantics.

Test plan

Build clean (cmake --build build) — Perl scripts go through cmake configure_file.
perl -c on the touched modules with a real install (placeholders substituted).
Run zmstats.pl, zmaudit.pl against a busy install and confirm no Deadlock found errors in /var/log/zm/.
Force contention by deleting a hot event via filter while zma is updating the same monitor; confirm Event::delete retries succeed and the row is removed.
Verify Event_Summaries totals match recomputed SUM(DiskSpace)/COUNT(*) from the bucket tables after a full zmstats cycle.
Confirm zmaudit.pl --interactive Storage DiskSpace UPDATE still produces correct totals.

🤖 Generated with Claude Code

Deadlock detection (errno 1213) is part of normal InnoDB operation under contention; the engine rolls back the loser and expects the caller to re-run the statement on a fresh transaction. Most callers into zmDbDo go through autocommit, where there's no caller-managed transaction state for a retry to disturb. When AutoCommit is on, retry the statement up to 5 times with exponential backoff (~100ms -> ~1.6s, jittered). When AutoCommit is off, the caller owns the transaction and a unilateral retry would silently succeed against a TX that no longer reflects the work the caller staged before this statement; preserve the existing behavior of logging and returning undef so the caller can rebuild the TX itself. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Three issues in the existing Stats/Event_Data/Frames/Events delete sequence: - On any zmDbDo error inside the transaction the code called dbh->commit() instead of dbh->rollback(). The server-side transaction was already rolled back when InnoDB picked us as the deadlock victim, so the commit() was effectively against a fresh auto-started TX, but the bug pattern leaked through as confusing state and prevented any retry. - There was no retry. errno 1213 is expected under contention with zmstats and zma touching the same Event_Summaries[MonitorId] row, and the loser is supposed to re-run. - At REPEATABLE READ, two concurrent filter workers deleting events with adjacent EventIds take next-key/gap locks on each other's rows in the bucket tables. Rewrite the delete block as a retry loop: SET TRANSACTION ISOLATION LEVEL READ COMMITTED, begin_work, run the four DELETEs, commit on success. On any error rollback (was: commit). On errno 1213 retry up to 5 times with backoff. Skip both the isolation switch and the rollback-then-retry when the caller is managing their own transaction (in_transaction); they would be the wrong scope to act in. Falls through to the storage DiskSpace adjustment only on commit, so a deadlocked delete leaves the event for the next filter pass instead of orphaning the row with stale storage accounting. Note: do NOT pre-lock Event_Summaries[MonitorId] FOR UPDATE here, even though the trigger touches it last. Pre-locking puts ES before buckets[Id] in the lock acquisition order, which inverts against zma's event_update_trigger path (Events[A] -> buckets[A] -> ES[N]) and re-introduces the cycle the rest of this work is removing. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…s resync The previous resync code in zmstats and zmaudit used multi-table UPDATEs against Event_Summaries that joined the bucket tables: UPDATE Event_Summaries es LEFT JOIN (SELECT ... FROM Events_Hour ...) h ON ... LEFT JOIN (SELECT ... FROM Events_Day ...) d ON ... ... SET es.HourEvents = h.c, ... zmaudit additionally used scalar correlated subqueries against Events for the Total/Archived columns and against Events for Storage.DiskSpace. MariaDB takes S-locks on the joined and sub-queried rows for the duration of any multi-table UPDATE statement, regardless of isolation level. event_update_trigger and event_delete_trigger hold X-locks on those same bucket rows while they walk the trigger body, so the resync deadlocks against active event lifecycle traffic. Captured a textbook example in SHOW ENGINE INNODB STATUS: TX(1) zma: HOLDS X Events_Hour[42229643] WAITS X Event_Summaries[28] TX(2) zmstats: HOLDS X Event_Summaries[2,4,5,...,28,...,73] WAITS S Events_Hour[42229643] Replace the JOIN/subquery pattern in both scripts with a snapshot phase followed by per-monitor UPDATEs: 1. SELECT MonitorId, COUNT(*), SUM(DiskSpace) FROM each bucket (and the equivalent Total/Archived aggregate from Events). Plain SELECTs do consistent reads and take no row locks. 2. SELECT MonitorId FROM Event_Summaries to widen the universe so monitors with empty buckets still get zeroed out. 3. For each monitor, UPDATE Event_Summaries SET ... WHERE MonitorId=?. Each UPDATE only X-locks one ES row and reads no other table. zmaudit's five separate UPDATEs collapse to one snapshot phase plus one UPDATE per monitor. Storage DiskSpace gets the same treatment. zmstats keeps the same outer transaction (BEGIN ... COMMIT, RC isolation, retry on 1213) so the bucket DELETEs and the resync stay atomic, but the resync no longer reads the bucket tables under lock. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Records the InnoDB X-lock order that all writers (zma trigger path, zmstats prune+resync, zmfilter/Event::delete) must follow to avoid the deadlock cycles fixed in the surrounding commits. Comment only; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Copilot

Pull request overview

Reduces recurring MariaDB/InnoDB deadlocks between event-trigger writers (zma), maintenance scripts (zmstats.pl, zmaudit.pl), and event deletion by removing join-based multi-table UPDATE patterns and adding targeted deadlock retries.

Changes:

Add deadlock (errno 1213) retry with exponential backoff to ZoneMinder::Database::zmDbDo when running in autocommit mode.
Rework Event::delete to run under READ COMMITTED (when it owns the transaction) and retry the whole delete transaction on deadlock.
Rewrite zmstats.pl and zmaudit.pl Event_Summaries (and Storage DiskSpace in zmaudit.pl) maintenance to “snapshot via SELECT, then per-row UPDATE” to avoid lock-order inversions; add deadlock retry loop in zmstats.pl.
Add documentation in db/triggers.sql describing intended lock acquisition order.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
scripts/ZoneMinder/lib/ZoneMinder/Event.pm	Adds RC isolation + deadlock retry loop around per-event delete transaction and updates trigger/locking rationale comments.
scripts/ZoneMinder/lib/ZoneMinder/Database.pm	Adds autocommit-only deadlock retry/backoff behavior to `zmDbDo`.
scripts/zmstats.pl.in	Rewrites bucket prune + Event_Summaries resync to snapshot aggregates then update per MonitorId inside an RC transaction with retries.
scripts/zmaudit.pl.in	Rewrites Event_Summaries and Storage DiskSpace recomputation to snapshot aggregates then update per row.
db/triggers.sql	Adds lock-order comment block near `event_update_trigger` definition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    # event_delete_trigger fires BEFORE DELETE on Events; after the
+    # accompanying triggers.sql change, event_update_trigger now also fires
+    # BEFORE UPDATE. That gives every Events writer the same lock acquisition
+    # order: buckets[Id] -> Event_Summaries[MonitorId] -> Events[Id] (the
+    # outer DML row last). zmstats.pl runs at the same RC isolation and takes
+    # bucket-row X-locks first, then UPDATE Event_Summaries, which is the
+    # same prefix order — so no cycle is possible across filter / zma /
+    # zmstats. Do NOT pre-lock Event_Summaries here: that puts ES before
+    # buckets and re-introduces the inversion against zma's UPDATE path.
+    #


 drop trigger if exists event_update_trigger//

-CREATE TRIGGER event_update_trigger AFTER UPDATE ON Events 
+CREATE TRIGGER event_update_trigger AFTER UPDATE ON Events


+    my $events_rows = $dbh->selectall_arrayref(q{
+      SELECT MonitorId,
+             COUNT(Id),
+             COALESCE(SUM(CASE WHEN DiskSpace IS NOT NULL THEN DiskSpace ELSE 0 END), 0),
+             COUNT(CASE WHEN Archived = 1 THEN 1 END),
+             COALESCE(SUM(CASE WHEN Archived = 1 AND DiskSpace IS NOT NULL THEN DiskSpace ELSE 0 END), 0)
+      FROM Events
+      GROUP BY MonitorId
+    });
+    if ($dbh->err()) {
+      Error("zmaudit Events aggregate failed: ".$dbh->errstr());
+    } else {
+      for my $r (@$events_rows) {
+        $agg{$r->[0]}{total_c}    = $r->[1];
+        $agg{$r->[0]}{total_s}    = $r->[2];
+        $agg{$r->[0]}{archived_c} = $r->[3];
+        $agg{$r->[0]}{archived_s} = $r->[4];
+      }


The previous comment block claimed event_update_trigger fires BEFORE UPDATE and that the lock order was buckets -> Event_Summaries -> Events. Neither matches the code: triggers.sql defines event_update_trigger as AFTER UPDATE, and InnoDB X-locks the matched Events row during WHERE evaluation before either BEFORE or AFTER trigger bodies fire — so the canonical chain is Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId] which is what triggers.sql already documents. Comment-only change. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…in zmaudit Previously, if any of the five aggregate SELECTs (Events, Events_Hour/Day/ Week/Month) failed transiently, the per-monitor UPDATE phase still ran and wrote `// 0` for every column from those failed groups, destroying valid counters across all monitors. Track per-aggregate success and build the UPDATE SET clause from only the column groups whose SELECT succeeded. zmaudit re-runs on its normal interval, so a missed group is corrected on the next pass instead of being overwritten with zeros now. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

connortechnology · 2026-05-07T13:40:10Z

Pushed two follow-up commits addressing the review feedback:

19c5bcbbd docs: correct Event::delete lock-order comment — fixes both review comments about the inconsistent comment vs. trigger definition. The block in Event.pm previously claimed event_update_trigger was changed to BEFORE UPDATE; that change was considered then reverted (lock order doesn't depend on trigger timing — InnoDB X-locks the matched row during WHERE evaluation before either trigger fires), but the comment never got rewritten. Now matches the chain documented in triggers.sql: Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId].
943114da9 fix: don't zero Event_Summaries counters on aggregate SELECT failure in zmaudit — addresses the third review comment. Now tracks per-aggregate success and only updates the column groups whose SELECT succeeded; if every aggregate fails the resync is skipped entirely with an error logged, and the next zmaudit pass corrects it.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

+        my ($table, $interval) = @$bucket;
+        my $event_ids = $dbh->selectcol_arrayref(
+          "SELECT EventId FROM $table WHERE StartDateTime < DATE_SUB(NOW(), INTERVAL $interval)"
+        );


selectcol_arrayref returns undef on a DB error. The previous code only checked truthiness of the result before deciding to DELETE, so a transient SELECT failure would silently skip the prune for that bucket and let the transaction continue to commit an incomplete state. Capture \$dbh->err() after the SELECT and bail out the same way the DELETE error path does. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

connortechnology · 2026-05-07T13:57:06Z

Pushed 0dcd500b7 addressing the new review comment: the bucket-prune SELECT now checks $dbh->err() and bails the same way the DELETE error path does. (Same hazard wasn't present in the snapshot/aggregate sections — those already check explicitly.)

Inline comments at every occurrence so future readers don't have to look up the errno. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (5)

scripts/zmaudit.pl.in:1025

The loop ignores whether a per-monitor UPDATE ultimately failed after all zmDbDo retries. If one row fails, the script continues updating later rows and prints the success message, leaving Event_Summaries only partially resynced with no clear failure at the resync level.

        ZoneMinder::Database::zmDbDo(
          $sql,
          (map { $a->{$_} // 0 } @bind_template),
          $mid
        );

scripts/zmaudit.pl.in:1050

This Storage overwrite uses a previously snapshotted sum, so a concurrent event writer can update Storage.DiskSpace after the SELECT and then be overwritten here with the stale value. That can lose the live storage increment/decrement until a later audit pass.

          ZoneMinder::Database::zmDbDo(
            'UPDATE Storage SET DiskSpace=? WHERE Id=?',
            $disk{$sid} // 0, $sid

scripts/zmaudit.pl.in:1055

This success message is emitted even when the aggregate SELECT or Storage enumeration failed, because it is outside the success branch. In those cases no Storage rows were updated, so the message can mislead interactive users and logs into thinking the resync completed.

    aud_print('Finished updating Storage DiskSpace');

scripts/zmaudit.pl.in:1050

The return value from the per-storage UPDATE is ignored. If one Storage row still fails after zmDbDo's retries, the script silently continues with later rows and then reports completion, leaving DiskSpace only partially resynced.

          ZoneMinder::Database::zmDbDo(
            'UPDATE Storage SET DiskSpace=? WHERE Id=?',
            $disk{$sid} // 0, $sid

scripts/zmaudit.pl.in:1027

This reports a full Event_Summaries resync whenever at least one aggregate group succeeded, even if other aggregate SELECTs failed and their column groups were intentionally skipped. The message should distinguish a partial resync so operators do not assume all counters were refreshed.

      aud_print('Finished resyncing Event_Summaries from Events + bucket tables');

+      Debug("Deadlock on '"._sql_with_bind_values($sql, @params)."' attempt $attempt/$max_attempts, retrying");
+      select(undef, undef, undef, 0.05 * (1 << $attempt) + rand(0.05));
+      next;
+    }


+          my $rows = $dbh->selectall_arrayref(
+            "SELECT MonitorId, COUNT(*), COALESCE(SUM(DiskSpace), 0) FROM $table GROUP BY MonitorId"


+    my $existing = $dbh->selectcol_arrayref('SELECT MonitorId FROM Event_Summaries');
+    if ($dbh->err()) {
+      Error("zmaudit Event_Summaries enumerate failed: ".$dbh->errstr());
+    } else {
+      $agg{$_} ||= {} for @$existing;
+    }
+
+    # Build the SET clause from only the column groups we successfully read.
+    my @set;
+    my @bind_template;
+    if ($ok{events}) {
+      push @set, 'TotalEvents=?, TotalEventDiskSpace=?, ArchivedEvents=?, ArchivedEventDiskSpace=?';
+      push @bind_template, qw(total_c total_s archived_c archived_s);
+    }
+    for my $bucket (['h','Hour'], ['d','Day'], ['w','Week'], ['m','Month']) {
+      my ($key, $col) = @$bucket;
+      next unless $ok{$key};
+      push @set, "${col}Events=?, ${col}EventDiskSpace=?";
+      push @bind_template, $key.'_c', $key.'_s';
+    }
+
+    if (@set) {
+      my $sql = 'UPDATE Event_Summaries SET '.join(', ', @set).' WHERE MonitorId=?';
+      for my $mid (sort { $a <=> $b } keys %agg) {
+        my $a = $agg{$mid};
+        ZoneMinder::Database::zmDbDo(
+          $sql,
+          (map { $a->{$_} // 0 } @bind_template),
+          $mid
+        );
+      }
+      aud_print('Finished resyncing Event_Summaries from Events + bucket tables');
+    } else {
+      Error('zmaudit: every Event_Summaries aggregate SELECT failed; skipping resync');
+    }


+ * scripts that touch the same tables). InnoDB X-locks the matched Events row
+ * during WHERE evaluation, before either BEFORE or AFTER trigger bodies fire,
+ * so the order is the same regardless of trigger timing:
+ *   Events[Id] -> Events_Hour/Day/Week/Month[EventId] -> Event_Summaries[MonitorId]
+ * zmstats.pl prune+resync follows the matching prefix (bucket DELETEs then
+ * UPDATE Event_Summaries) and crucially does NOT pre-lock Event_Summaries —
+ * pre-locking ES would invert against the trigger body order and reintroduce
+ * deadlocks against filter / zma writers. The bucket update/delete triggers
+ * also propagate into Event_Summaries[MonitorId] in the same direction.


When zmDbDo is called inside a caller-managed transaction (AutoCommit off), max_attempts is 1 and the loop falls through to Error on a 1213 deadlock — which is misleading, because the caller (Event::delete, the zmstats prune+resync TX) has its own outer retry loop that will roll back and succeed. Downgrade to a Debug message in that path; Error is still emitted for non-deadlock failures and for autocommit calls that exhaust their retries.

A concurrent trigger writer can adjust Event_Summaries between our snapshot SELECTs and the per-monitor UPDATE; the UPDATE then overwrites that adjustment with the older snapshot. Drift is bounded by the zmstats/zmaudit interval and corrected on the next pass, because the incremental triggers continue to maintain ES correctly between resyncs. Locking ES before reading aggregates would invert the canonical lock order and re-introduce the deadlock cycle the resync rewrite eliminated.

Adds zm_event.cpp's create path (Events INSERT -> bucket INSERTs -> Event_Summaries UPDATE; event_insert_trigger itself is commented out so zmc does this directly), and lists zmaudit.pl alongside zmstats.pl, so the comment enumerates every writer obligated to follow the canonical order rather than just the trigger paths.

connortechnology · 2026-05-17T02:45:58Z

Addressed the four new review comments:

628bcb3e8 fix: don't log Error on deadlock inside a caller-managed transaction — zmDbDo no longer emits the misleading Error('Failed ... Deadlock found') when called from Event::delete / zmstats prune+resync TXs; those have their own outer retry loops. Now a Debug line in that path; Error still fires for non-1213 failures and for autocommit calls that exhaust retries.
7b4617b1a docs: document atomicity tradeoff in zmstats/zmaudit ES resync — agreed the snapshot-then-update split is non-atomic; that's intentional. The trigger writers maintain ES incrementally between resyncs, drift is bounded by the resync interval and corrected on the next pass, and locking ES before reading aggregates would invert the canonical order and re-introduce the deadlock we just removed. Comment block on both scripts now spells this out.
e254ed41d docs: expand triggers.sql lock-order block with full writer list — added src/zm_event.cpp's create path (event_insert_trigger is commented out below, so zmc inserts Events then bucket rows then Event_Summaries directly), the bucket triggers cascading into ES, and zmaudit.pl alongside zmstats.pl. The doc now enumerates every writer obligated to follow the canonical order, not just the Events-level triggers.

PR is now at 11 commits.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

scripts/zmaudit.pl.in:1064

This success message is emitted even when the aggregate SELECT or Storage enumeration fails and no Storage rows were updated. Move it into the successful update path (or make it conditional) so operators do not see “Finished updating Storage DiskSpace” immediately after an error.

    aud_print('Finished updating Storage DiskSpace');

+      if ($in_transaction or $err != 1213 or $attempt >= $max_attempts) { # 1213 = ER_LOCK_DEADLOCK
+        return;


+ *   - src/zm_event.cpp Event::createNotification path: INSERT Events, then
+ *     INSERT Events_Hour/Day/Week/Month, then INSERT/UPDATE Event_Summaries
+ *     (event_insert_trigger is commented out below; zmc does it directly)


+      # One UPDATE per monitor: each takes a single ES X-lock and reads
+      # nothing else, so it can't hold any bucket-row lock that would
+      # deadlock with the trigger path.


+          ZoneMinder::Database::zmDbDo(
+            'UPDATE Storage SET DiskSpace=? WHERE Id=?',
+            $disk{$sid} // 0, $sid
+          );


+      for my $mid (sort { $a <=> $b } keys %agg) {
+        my $a = $agg{$mid};
+        ZoneMinder::Database::zmDbDo(
+          $sql,
+          (map { $a->{$_} // 0 } @bind_template),
+          $mid
+        );
+      }
+      aud_print('Finished resyncing Event_Summaries from Events + bucket tables');


zmDbDo suppresses its Error log on 1213 inside a caller-managed TX (the caller owns the retry), and the previous fallthrough at the end of the retry loop just `return`ed silently. After 5 failed attempts on persistent contention the event was effectively un-deleted with no record of the failure. Capture errstr before rollback (some drivers clear it) and emit an Error on the bail path.

Previous wording named a fictional Event::createNotification function; the bucket+ES insert sequence in src/zm_event.cpp lives in the Event constructor (line 46). Point future readers at the right symbol so they can grep and verify the path.

The previous comment claimed each UPDATE couldn't hold any bucket lock that would deadlock with the trigger path, which conflated statement- level locks with TX-level locks. By the time we reach this loop the TX already holds bucket-row X-locks from the earlier DELETEs plus any ES X-locks acquired by the bucket DELETE triggers cascading. Rewrite the comment to distinguish those TX-held locks from the locks acquired by the new UPDATE statement and to be explicit that the TX's lock acquisition direction is preserved.

Previously zmaudit logged "Finished resyncing Event_Summaries" / "Finished updating Storage DiskSpace" unconditionally as long as the aggregate SELECTs succeeded, masking per-row UPDATE failures (e.g. zmDbDo exhausting its deadlock retries) and skipped aggregate column groups. Track which aggregates were skipped and which per-monitor / per-storage UPDATEs failed (zmDbDo returns undef on failure), and surface that in the audit log instead of claiming the resync is complete.

connortechnology · 2026-05-17T03:15:02Z

Five new comments addressed in four follow-up commits:

`d01300832` fix: log Error in Event::delete when deadlock retries are exhausted — addresses the "silent return after 5 failed attempts" hazard. zmDbDo intentionally suppresses Error on 1213 inside a caller-managed TX, so Event::delete now logs its own Error (with errstr captured before rollback) on the exhausted-retries path.
`1323e9f02` docs: reference the actual Event::Event constructor in lock-order block — fixed the bogus `Event::createNotification` symbol; the bucket+ES insert sequence lives in `Event::Event` at `src/zm_event.cpp:46`.
`52d5f80e0` docs: clarify per-monitor UPDATE lock state in zmstats resync — agreed the previous comment conflated statement-level locks with TX-level locks. Rewrote it to distinguish the TX's already-held bucket and ES X-locks (acquired in canonical order by the earlier DELETEs and their cascading triggers) from the new statement-level UPDATE.
`ccc0bcf82` fix: track per-row UPDATE failures in zmaudit ES and Storage resync — both `aud_print('Finished ...')` messages now reflect actual completion: skipped aggregate column groups and per-row `zmDbDo` failures are counted and reported instead of being masked. An Error is logged if Storage resync didn't complete.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

+      $dbh->rollback();
+      if ($err != 1213 or $attempt >= $max_attempts) { # 1213 = ER_LOCK_DEADLOCK
+        Error("Event_Summaries prune+resync gave up after $attempt attempt(s): ".$dbh->errstr());


+      if (!$err) {
+        for my $mid (sort { $a <=> $b } keys %agg) {
+          my $a = $agg{$mid};
+          zmDbDo(
+            'UPDATE Event_Summaries SET '.


+          my $rows = $dbh->selectall_arrayref(
+            "SELECT MonitorId, COUNT(*), COALESCE(SUM(DiskSpace), 0) FROM $table GROUP BY MonitorId"
+          );


+    my $existing = $dbh->selectcol_arrayref('SELECT MonitorId FROM Event_Summaries');
+    if ($dbh->err()) {
+      Error("zmaudit Event_Summaries enumerate failed: ".$dbh->errstr());
+    } else {


+      # Caller-managed TX owns the retry loop; don't pollute logs with a
+      # misleading "Failed ... Deadlock found" Error before they roll back.
+      Debug('Deadlock in caller-managed TX on '._sql_with_bind_values($sql, @params).'; deferring to caller');


…ansaction ZoneMinder::Logger->logPrint runs INSERT INTO Logs on the same dbh. Calling Debug()/Error() from zmDbDo's failure path inside a caller-managed transaction would execute another statement on the connection, clearing the err/errstr state the caller needs to see for rollback/retry. The result could be a caller observing err=0 after a deadlock-victim TX and committing what looks like success but is actually a rolled-back no-op. Bail silently from zmDbDo when AutoCommit is off; the caller owns the retry loop and is responsible for logging. Logging in the autocommit path is still safe because each statement is its own TX.

Two issues with the previous implementation: 1. Aggregate SELECTs ran GROUP BY MonitorId across the full bucket tables every zmstats cycle (default 60s). Events_Month grows for weeks; this turned the stats daemon into a constant full-scan workload on busy installs. 2. The per-monitor UPDATE loop X-locked every Event_Summaries row on every cycle even when nothing changed, adding avoidable contention with the trigger writers this rewrite is supposed to protect. Capture MonitorIds as we SELECT bucket rows for pruning, then skip the resync entirely if no rows were pruned. When rows were pruned, restrict the aggregate SELECTs (WHERE MonitorId IN ...) and the per-monitor UPDATEs to that touched set. zmaudit remains the periodic deep-resync safety net for drift in untouched monitors. Also capture errstr before rollback so the gave-up Error reports the actual reason instead of an empty string on drivers that clear errstr on rollback.

@Skipped

…udit Without this, an enumerate failure (SELECT MonitorId FROM Event_Summaries) left @Skipped empty and the per-monitor UPDATE loop ran for whatever rows the bucket aggregates returned — but monitors that exist only in Event_Summaries (no current bucket rows) never got zeroed, while the audit log claimed a full resync. Track enumerate success and report partial resync when it fails.

connortechnology · 2026-05-17T13:12:26Z

Five new comments addressed in three follow-up commits (the zmstats commit folds K + L together because the touched-monitors restriction fixes both at once):

`411b6916b` fix: avoid DB-backed logging inside zmDbDo when caller manages the transaction — the most serious of the batch. `ZoneMinder::Logger->logPrint` does `INSERT INTO Logs` on the same `$dbh`, so calling `Debug()`/`Error()` from zmDbDo's failure path inside a caller-managed TX would execute another statement, clearing the err/errstr the caller needs for rollback/retry. A caller could then read err=0 after a deadlock-victim TX and `commit()` what looks like success but is actually a rolled-back no-op. Now zmDbDo bails silently when AutoCommit is off; the caller's own retry loop handles logging.
`8fb4ad79e` perf: restrict zmstats ES resync to monitors with pruned buckets — addresses both the "full Events_Month scan every 60s" workload concern and the "X-lock every ES row every pass" contention concern. Capture MonitorIds during the prune SELECT, skip the resync entirely if nothing was pruned, and restrict the aggregate SELECTs (`WHERE MonitorId IN (...)`) and per-monitor UPDATEs to the touched set. zmaudit remains the deep-resync safety net for drift in untouched monitors. Also captures errstr before rollback so the gave-up Error reports the actual reason on drivers that clear errstr on rollback.
`d098a055b` fix: track Event_Summaries enumerate failure as partial resync in zmaudit — without it, the enumerate failure left `@skipped` empty and the audit log still claimed a full resync while monitors with no current bucket rows never got zeroed. Now tracked alongside the per-aggregate flags and surfaced in the partial-resync message.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

+        ZoneMinder::Database::zmDbDo('SET TRANSACTION ISOLATION LEVEL READ COMMITTED');
+        $ZoneMinder::Database::dbh->begin_work();


+      # SET TRANSACTION ... applies only to the next transaction, so it must
+      # be issued before begin_work and re-issued on each retry.
+      zmDbDo('SET TRANSACTION ISOLATION LEVEL READ COMMITTED');
+      $dbh->begin_work();
+
+      my $err = 0;
+      my $errstr;  # captured before rollback() — rollback can clear errstr


+    Error('Failed '._sql_with_bind_values($sql, @params).' : '.$dbh->errstr());
+    last;
+  }
+  if ( defined $rows and ZoneMinder::Logger::logLevel() > INFO ) {


+  # Storage DiskSpace resync: same trap. Snapshot per-Storage sums via plain
+  # SELECT, then UPDATE Storage one row at a time.
+  {
+    my $storage_done = 0;
+    my $rows = $dbh->selectall_arrayref(
+      'SELECT StorageId, COALESCE(SUM(DiskSpace), 0) FROM Events GROUP BY StorageId'
+    );
+    if ($dbh->err()) {
+      Error('zmaudit Storage aggregate failed: '.$dbh->errstr());
+    } else {
+      my %disk = map { $_->[0] => $_->[1] } grep { defined $_->[0] } @$rows;
+      my $storage_ids = $dbh->selectcol_arrayref('SELECT Id FROM Storage');
+      if ($dbh->err()) {
+        Error('zmaudit Storage enumerate failed: '.$dbh->errstr());
+      } else {
+        my $attempted = 0;
+        my $failed    = 0;
+        for my $sid (@$storage_ids) {
+          $attempted++;
+          my $rv = ZoneMinder::Database::zmDbDo(
+            'UPDATE Storage SET DiskSpace=? WHERE Id=?',
+            $disk{$sid} // 0, $sid
+          );
+          $failed++ if !defined $rv;
+        }
+        if (!$failed) {
+          aud_print('Finished updating Storage DiskSpace');
+          $storage_done = 1;
+        } else {
+          aud_print("Partial Storage DiskSpace update: $failed/$attempted row(s) failed");
+        }
+      }


@Bind

Every row in the previous arrayref-of-arrayref carried the same single bind value (the event Id), so the [$sql, $$event{Id}] wrapping and the my ($sql, @Bind) = @$stmt unpacking were doing no work. Iterate over the SQL strings directly and pass $$event{Id} as the one bind value.

connortechnology and others added 4 commits May 6, 2026 15:13

Copilot AI review requested due to automatic review settings May 7, 2026 02:38

Copilot started reviewing on behalf of connortechnology May 7, 2026 02:38 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

connortechnology and others added 2 commits May 7, 2026 09:39

connortechnology requested a review from Copilot May 7, 2026 13:40

Copilot started reviewing on behalf of connortechnology May 7, 2026 13:41 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

Comment thread scripts/zmstats.pl.in

my ($table, $interval) = @$bucket;

my $event_ids = $dbh->selectcol_arrayref(

"SELECT EventId FROM $table WHERE StartDateTime < DATE_SUB(NOW(), INTERVAL $interval)"

);

docs: identify the 1213 magic number as MariaDB ER_LOCK_DEADLOCK

f5e8eae

Inline comments at every occurrence so future readers don't have to look up the errno. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

connortechnology requested a review from Copilot May 16, 2026 23:38

Copilot started reviewing on behalf of connortechnology May 16, 2026 23:39 View session

Copilot AI reviewed May 16, 2026

View reviewed changes

connortechnology added 3 commits May 16, 2026 22:44

connortechnology requested a review from Copilot May 17, 2026 02:46

Copilot started reviewing on behalf of connortechnology May 17, 2026 02:47 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

connortechnology added 4 commits May 16, 2026 23:14

connortechnology requested a review from Copilot May 17, 2026 03:16

Copilot started reviewing on behalf of connortechnology May 17, 2026 03:17 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

connortechnology added 3 commits May 17, 2026 08:50

connortechnology requested a review from Copilot May 17, 2026 13:21

Copilot started reviewing on behalf of connortechnology May 17, 2026 13:22 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Eliminate Event_Summaries / bucket-table deadlocks (zmstats, zmaudit, Event::delete, zmDbDo retry)#4810

Eliminate Event_Summaries / bucket-table deadlocks (zmstats, zmaudit, Event::delete, zmDbDo retry)#4810
connortechnology wants to merge 19 commits into
ZoneMinder:masterfrom
connortechnology:deadlock-fixes

connortechnology commented May 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

connortechnology commented May 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

connortechnology commented May 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

connortechnology commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

connortechnology commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

connortechnology commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		my $rows = $dbh->selectall_arrayref(
		"SELECT MonitorId, COUNT(*), COALESCE(SUM(DiskSpace), 0) FROM $table GROUP BY MonitorId"

		if ($in_transaction or $err != 1213 or $attempt >= $max_attempts) { # 1213 = ER_LOCK_DEADLOCK
		return;

		ZoneMinder::Database::zmDbDo('SET TRANSACTION ISOLATION LEVEL READ COMMITTED');
		$ZoneMinder::Database::dbh->begin_work();

Uh oh!

Conversation

connortechnology commented May 7, 2026

Summary

What changes

Why isolation level alone wasn't enough

Notes for reviewers

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

connortechnology commented May 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

connortechnology commented May 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

connortechnology commented May 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

connortechnology commented May 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

connortechnology commented May 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants