Skip to content

Eliminate Event_Summaries / bucket-table deadlocks (zmstats, zmaudit, Event::delete, zmDbDo retry)#4810

Open
connortechnology wants to merge 19 commits into
ZoneMinder:masterfrom
connortechnology:deadlock-fixes
Open

Eliminate Event_Summaries / bucket-table deadlocks (zmstats, zmaudit, Event::delete, zmDbDo retry)#4810
connortechnology wants to merge 19 commits into
ZoneMinder:masterfrom
connortechnology:deadlock-fixes

Conversation

@connortechnology
Copy link
Copy Markdown
Member

Summary

Eliminates the recurring InnoDB deadlock cycle observed under load between zma's event_update_trigger path, zmstats.pl's Event_Summaries maintenance, zmaudit.pl's aggregations, and zmfilter.pl (Event::delete).

The root cause was a multi-table UPDATE Event_Summaries es JOIN Events_Hour ... pattern: MariaDB takes S-locks on the joined bucket rows and holds them to TX commit regardless of isolation level, which inverts the canonical lock order

Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId]

against the trigger writers that hold X-locks on those same bucket rows. The InnoDB deadlock log captured the exact zma vs zmstats cycle.

What changes

  • zmDbDo auto-retries on errno 1213 (5 attempts, exponential backoff with jitter) when called outside an explicit transaction. Most call sites are autocommit-style and already idempotent on retry.

  • Event::delete sets READ COMMITTED for the per-event delete TX, runs Stats/Event_Data/Frames/Events deletes, and retries the whole TX on 1213. Also fixes a latent commit() -> rollback() bug on the error path.

  • zmstats.pl Event_Summaries maintenance is rewritten to:

    1. Prune aged rows from Events_Hour/Day/Week/Month (RC, no gap locks).
    2. Snapshot per-monitor counts/disk-space via plain SELECT (consistent read at RC, no locks).
    3. Apply one UPDATE Event_Summaries per MonitorId using the snapshotted values.

    This keeps the lock prefix (buckets -> Event_Summaries) consistent with the trigger path and never holds bucket S-locks while writing ES. Wrapped in 5-retry deadlock loop.

  • zmaudit.pl receives the same snapshot-then-per-row treatment for its Event_Summaries and Storage DiskSpace recomputations, replacing five correlated/JOINed UPDATEs that had the same lock-ordering hazard.

  • db/triggers.sql picks up a comment block above event_update_trigger documenting the canonical lock acquisition order so future writers don't reintroduce the cycle.

Why isolation level alone wasn't enough

SET TRANSACTION ISOLATION LEVEL READ COMMITTED drops the next-key/gap locks on plain DELETE/UPDATE range scans, but a multi-table UPDATE always takes S-locks on the JOINed rows for as long as the transaction lives. The structural fix had to be removing the JOIN, not lowering isolation.

Notes for reviewers

  • No schema changes. Pure script + Perl module + retry-loop changes plus one comment block in triggers.sql.
  • The zmDbDo retry is conditional on AutoCommit — it will not silently re-run statements that are part of an explicit caller transaction (those callers are responsible for their own retry, e.g. the Event::delete loop).
  • Event::delete's isolation set + retry path is bypassed when the caller is already inside a transaction ($in_transaction short-circuit), so callers that wrap multiple deletes in their own TX keep their semantics.

Test plan

  • Build clean (cmake --build build) — Perl scripts go through cmake configure_file.
  • perl -c on the touched modules with a real install (placeholders substituted).
  • Run zmstats.pl, zmaudit.pl against a busy install and confirm no Deadlock found errors in /var/log/zm/.
  • Force contention by deleting a hot event via filter while zma is updating the same monitor; confirm Event::delete retries succeed and the row is removed.
  • Verify Event_Summaries totals match recomputed SUM(DiskSpace)/COUNT(*) from the bucket tables after a full zmstats cycle.
  • Confirm zmaudit.pl --interactive Storage DiskSpace UPDATE still produces correct totals.

🤖 Generated with Claude Code

connortechnology and others added 4 commits May 6, 2026 15:13
Deadlock detection (errno 1213) is part of normal InnoDB operation
under contention; the engine rolls back the loser and expects the
caller to re-run the statement on a fresh transaction. Most callers
into zmDbDo go through autocommit, where there's no caller-managed
transaction state for a retry to disturb.

When AutoCommit is on, retry the statement up to 5 times with
exponential backoff (~100ms -> ~1.6s, jittered). When AutoCommit is
off, the caller owns the transaction and a unilateral retry would
silently succeed against a TX that no longer reflects the work the
caller staged before this statement; preserve the existing behavior
of logging and returning undef so the caller can rebuild the TX
itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Three issues in the existing Stats/Event_Data/Frames/Events delete
sequence:

  - On any zmDbDo error inside the transaction the code called
    dbh->commit() instead of dbh->rollback(). The server-side
    transaction was already rolled back when InnoDB picked us as the
    deadlock victim, so the commit() was effectively against a fresh
    auto-started TX, but the bug pattern leaked through as confusing
    state and prevented any retry.

  - There was no retry. errno 1213 is expected under contention with
    zmstats and zma touching the same Event_Summaries[MonitorId] row,
    and the loser is supposed to re-run.

  - At REPEATABLE READ, two concurrent filter workers deleting events
    with adjacent EventIds take next-key/gap locks on each other's
    rows in the bucket tables.

Rewrite the delete block as a retry loop: SET TRANSACTION ISOLATION
LEVEL READ COMMITTED, begin_work, run the four DELETEs, commit on
success. On any error rollback (was: commit). On errno 1213 retry up
to 5 times with backoff. Skip both the isolation switch and the
rollback-then-retry when the caller is managing their own transaction
(in_transaction); they would be the wrong scope to act in.

Falls through to the storage DiskSpace adjustment only on commit, so
a deadlocked delete leaves the event for the next filter pass instead
of orphaning the row with stale storage accounting.

Note: do NOT pre-lock Event_Summaries[MonitorId] FOR UPDATE here, even
though the trigger touches it last. Pre-locking puts ES before
buckets[Id] in the lock acquisition order, which inverts against zma's
event_update_trigger path (Events[A] -> buckets[A] -> ES[N]) and
re-introduces the cycle the rest of this work is removing.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…s resync

The previous resync code in zmstats and zmaudit used multi-table
UPDATEs against Event_Summaries that joined the bucket tables:

  UPDATE Event_Summaries es
  LEFT JOIN (SELECT ... FROM Events_Hour ...) h ON ...
  LEFT JOIN (SELECT ... FROM Events_Day  ...) d ON ...
  ... SET es.HourEvents = h.c, ...

zmaudit additionally used scalar correlated subqueries against Events
for the Total/Archived columns and against Events for Storage.DiskSpace.

MariaDB takes S-locks on the joined and sub-queried rows for the
duration of any multi-table UPDATE statement, regardless of isolation
level. event_update_trigger and event_delete_trigger hold X-locks on
those same bucket rows while they walk the trigger body, so the resync
deadlocks against active event lifecycle traffic. Captured a textbook
example in SHOW ENGINE INNODB STATUS:

  TX(1) zma:    HOLDS X Events_Hour[42229643]
                WAITS X Event_Summaries[28]
  TX(2) zmstats: HOLDS X Event_Summaries[2,4,5,...,28,...,73]
                 WAITS S Events_Hour[42229643]

Replace the JOIN/subquery pattern in both scripts with a snapshot phase
followed by per-monitor UPDATEs:

  1. SELECT MonitorId, COUNT(*), SUM(DiskSpace) FROM each bucket
     (and the equivalent Total/Archived aggregate from Events).
     Plain SELECTs do consistent reads and take no row locks.
  2. SELECT MonitorId FROM Event_Summaries to widen the universe so
     monitors with empty buckets still get zeroed out.
  3. For each monitor, UPDATE Event_Summaries SET ... WHERE MonitorId=?.
     Each UPDATE only X-locks one ES row and reads no other table.

zmaudit's five separate UPDATEs collapse to one snapshot phase plus
one UPDATE per monitor. Storage DiskSpace gets the same treatment.

zmstats keeps the same outer transaction (BEGIN ... COMMIT, RC
isolation, retry on 1213) so the bucket DELETEs and the resync stay
atomic, but the resync no longer reads the bucket tables under lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Records the InnoDB X-lock order that all writers (zma trigger path, zmstats
prune+resync, zmfilter/Event::delete) must follow to avoid the deadlock cycles
fixed in the surrounding commits. Comment only; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copilot AI review requested due to automatic review settings May 7, 2026 02:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces recurring MariaDB/InnoDB deadlocks between event-trigger writers (zma), maintenance scripts (zmstats.pl, zmaudit.pl), and event deletion by removing join-based multi-table UPDATE patterns and adding targeted deadlock retries.

Changes:

  • Add deadlock (errno 1213) retry with exponential backoff to ZoneMinder::Database::zmDbDo when running in autocommit mode.
  • Rework Event::delete to run under READ COMMITTED (when it owns the transaction) and retry the whole delete transaction on deadlock.
  • Rewrite zmstats.pl and zmaudit.pl Event_Summaries (and Storage DiskSpace in zmaudit.pl) maintenance to “snapshot via SELECT, then per-row UPDATE” to avoid lock-order inversions; add deadlock retry loop in zmstats.pl.
  • Add documentation in db/triggers.sql describing intended lock acquisition order.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
scripts/ZoneMinder/lib/ZoneMinder/Event.pm Adds RC isolation + deadlock retry loop around per-event delete transaction and updates trigger/locking rationale comments.
scripts/ZoneMinder/lib/ZoneMinder/Database.pm Adds autocommit-only deadlock retry/backoff behavior to zmDbDo.
scripts/zmstats.pl.in Rewrites bucket prune + Event_Summaries resync to snapshot aggregates then update per MonitorId inside an RC transaction with retries.
scripts/zmaudit.pl.in Rewrites Event_Summaries and Storage DiskSpace recomputation to snapshot aggregates then update per row.
db/triggers.sql Adds lock-order comment block near event_update_trigger definition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +398 to +407
# event_delete_trigger fires BEFORE DELETE on Events; after the
# accompanying triggers.sql change, event_update_trigger now also fires
# BEFORE UPDATE. That gives every Events writer the same lock acquisition
# order: buckets[Id] -> Event_Summaries[MonitorId] -> Events[Id] (the
# outer DML row last). zmstats.pl runs at the same RC isolation and takes
# bucket-row X-locks first, then UPDATE Event_Summaries, which is the
# same prefix order — so no cycle is possible across filter / zma /
# zmstats. Do NOT pre-lock Event_Summaries here: that puts ES before
# buckets and re-introduces the inversion against zma's UPDATE path.
#
Comment thread db/triggers.sql
drop trigger if exists event_update_trigger//

CREATE TRIGGER event_update_trigger AFTER UPDATE ON Events
CREATE TRIGGER event_update_trigger AFTER UPDATE ON Events
Comment thread scripts/zmaudit.pl.in
Comment on lines +946 to +963
my $events_rows = $dbh->selectall_arrayref(q{
SELECT MonitorId,
COUNT(Id),
COALESCE(SUM(CASE WHEN DiskSpace IS NOT NULL THEN DiskSpace ELSE 0 END), 0),
COUNT(CASE WHEN Archived = 1 THEN 1 END),
COALESCE(SUM(CASE WHEN Archived = 1 AND DiskSpace IS NOT NULL THEN DiskSpace ELSE 0 END), 0)
FROM Events
GROUP BY MonitorId
});
if ($dbh->err()) {
Error("zmaudit Events aggregate failed: ".$dbh->errstr());
} else {
for my $r (@$events_rows) {
$agg{$r->[0]}{total_c} = $r->[1];
$agg{$r->[0]}{total_s} = $r->[2];
$agg{$r->[0]}{archived_c} = $r->[3];
$agg{$r->[0]}{archived_s} = $r->[4];
}
connortechnology and others added 2 commits May 7, 2026 09:39
The previous comment block claimed event_update_trigger fires BEFORE UPDATE
and that the lock order was buckets -> Event_Summaries -> Events. Neither
matches the code: triggers.sql defines event_update_trigger as AFTER UPDATE,
and InnoDB X-locks the matched Events row during WHERE evaluation before
either BEFORE or AFTER trigger bodies fire — so the canonical chain is
  Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId]
which is what triggers.sql already documents. Comment-only change.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…in zmaudit

Previously, if any of the five aggregate SELECTs (Events, Events_Hour/Day/
Week/Month) failed transiently, the per-monitor UPDATE phase still ran and
wrote `// 0` for every column from those failed groups, destroying valid
counters across all monitors.

Track per-aggregate success and build the UPDATE SET clause from only the
column groups whose SELECT succeeded. zmaudit re-runs on its normal
interval, so a missed group is corrected on the next pass instead of being
overwritten with zeros now.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@connortechnology
Copy link
Copy Markdown
Member Author

Pushed two follow-up commits addressing the review feedback:

  • 19c5bcbbd docs: correct Event::delete lock-order comment — fixes both review comments about the inconsistent comment vs. trigger definition. The block in Event.pm previously claimed event_update_trigger was changed to BEFORE UPDATE; that change was considered then reverted (lock order doesn't depend on trigger timing — InnoDB X-locks the matched row during WHERE evaluation before either trigger fires), but the comment never got rewritten. Now matches the chain documented in triggers.sql: Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId].
  • 943114da9 fix: don't zero Event_Summaries counters on aggregate SELECT failure in zmaudit — addresses the third review comment. Now tracks per-aggregate success and only updates the column groups whose SELECT succeeded; if every aggregate fails the resync is skipped entirely with an error logged, and the next zmaudit pass corrects it.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread scripts/zmstats.pl.in
my ($table, $interval) = @$bucket;
my $event_ids = $dbh->selectcol_arrayref(
"SELECT EventId FROM $table WHERE StartDateTime < DATE_SUB(NOW(), INTERVAL $interval)"
);
selectcol_arrayref returns undef on a DB error. The previous code only
checked truthiness of the result before deciding to DELETE, so a transient
SELECT failure would silently skip the prune for that bucket and let the
transaction continue to commit an incomplete state. Capture \$dbh->err()
after the SELECT and bail out the same way the DELETE error path does.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@connortechnology
Copy link
Copy Markdown
Member Author

Pushed 0dcd500b7 addressing the new review comment: the bucket-prune SELECT now checks $dbh->err() and bails the same way the DELETE error path does. (Same hazard wasn't present in the snapshot/aggregate sections — those already check explicitly.)

Inline comments at every occurrence so future readers don't have to look up
the errno. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (5)

scripts/zmaudit.pl.in:1025

  • The loop ignores whether a per-monitor UPDATE ultimately failed after all zmDbDo retries. If one row fails, the script continues updating later rows and prints the success message, leaving Event_Summaries only partially resynced with no clear failure at the resync level.
        ZoneMinder::Database::zmDbDo(
          $sql,
          (map { $a->{$_} // 0 } @bind_template),
          $mid
        );

scripts/zmaudit.pl.in:1050

  • This Storage overwrite uses a previously snapshotted sum, so a concurrent event writer can update Storage.DiskSpace after the SELECT and then be overwritten here with the stale value. That can lose the live storage increment/decrement until a later audit pass.
          ZoneMinder::Database::zmDbDo(
            'UPDATE Storage SET DiskSpace=? WHERE Id=?',
            $disk{$sid} // 0, $sid

scripts/zmaudit.pl.in:1055

  • This success message is emitted even when the aggregate SELECT or Storage enumeration failed, because it is outside the success branch. In those cases no Storage rows were updated, so the message can mislead interactive users and logs into thinking the resync completed.
    aud_print('Finished updating Storage DiskSpace');

scripts/zmaudit.pl.in:1050

  • The return value from the per-storage UPDATE is ignored. If one Storage row still fails after zmDbDo's retries, the script silently continues with later rows and then reports completion, leaving DiskSpace only partially resynced.
          ZoneMinder::Database::zmDbDo(
            'UPDATE Storage SET DiskSpace=? WHERE Id=?',
            $disk{$sid} // 0, $sid

scripts/zmaudit.pl.in:1027

  • This reports a full Event_Summaries resync whenever at least one aggregate group succeeded, even if other aggregate SELECTs failed and their column groups were intentionally skipped. The message should distinguish a partial resync so operators do not assume all counters were refreshed.
      aud_print('Finished resyncing Event_Summaries from Events + bucket tables');

Debug("Deadlock on '"._sql_with_bind_values($sql, @params)."' attempt $attempt/$max_attempts, retrying");
select(undef, undef, undef, 0.05 * (1 << $attempt) + rand(0.05));
next;
}
Comment thread scripts/zmstats.pl.in Outdated
Comment on lines +149 to +150
my $rows = $dbh->selectall_arrayref(
"SELECT MonitorId, COUNT(*), COALESCE(SUM(DiskSpace), 0) FROM $table GROUP BY MonitorId"
Comment thread scripts/zmaudit.pl.in
Comment on lines +996 to +1030
my $existing = $dbh->selectcol_arrayref('SELECT MonitorId FROM Event_Summaries');
if ($dbh->err()) {
Error("zmaudit Event_Summaries enumerate failed: ".$dbh->errstr());
} else {
$agg{$_} ||= {} for @$existing;
}

# Build the SET clause from only the column groups we successfully read.
my @set;
my @bind_template;
if ($ok{events}) {
push @set, 'TotalEvents=?, TotalEventDiskSpace=?, ArchivedEvents=?, ArchivedEventDiskSpace=?';
push @bind_template, qw(total_c total_s archived_c archived_s);
}
for my $bucket (['h','Hour'], ['d','Day'], ['w','Week'], ['m','Month']) {
my ($key, $col) = @$bucket;
next unless $ok{$key};
push @set, "${col}Events=?, ${col}EventDiskSpace=?";
push @bind_template, $key.'_c', $key.'_s';
}

if (@set) {
my $sql = 'UPDATE Event_Summaries SET '.join(', ', @set).' WHERE MonitorId=?';
for my $mid (sort { $a <=> $b } keys %agg) {
my $a = $agg{$mid};
ZoneMinder::Database::zmDbDo(
$sql,
(map { $a->{$_} // 0 } @bind_template),
$mid
);
}
aud_print('Finished resyncing Event_Summaries from Events + bucket tables');
} else {
Error('zmaudit: every Event_Summaries aggregate SELECT failed; skipping resync');
}
Comment thread db/triggers.sql Outdated
Comment on lines +121 to +129
* scripts that touch the same tables). InnoDB X-locks the matched Events row
* during WHERE evaluation, before either BEFORE or AFTER trigger bodies fire,
* so the order is the same regardless of trigger timing:
* Events[Id] -> Events_Hour/Day/Week/Month[EventId] -> Event_Summaries[MonitorId]
* zmstats.pl prune+resync follows the matching prefix (bucket DELETEs then
* UPDATE Event_Summaries) and crucially does NOT pre-lock Event_Summaries —
* pre-locking ES would invert against the trigger body order and reintroduce
* deadlocks against filter / zma writers. The bucket update/delete triggers
* also propagate into Event_Summaries[MonitorId] in the same direction.
When zmDbDo is called inside a caller-managed transaction (AutoCommit off),
max_attempts is 1 and the loop falls through to Error on a 1213 deadlock —
which is misleading, because the caller (Event::delete, the zmstats
prune+resync TX) has its own outer retry loop that will roll back and
succeed. Downgrade to a Debug message in that path; Error is still emitted
for non-deadlock failures and for autocommit calls that exhaust their
retries.
A concurrent trigger writer can adjust Event_Summaries between our
snapshot SELECTs and the per-monitor UPDATE; the UPDATE then overwrites
that adjustment with the older snapshot. Drift is bounded by the
zmstats/zmaudit interval and corrected on the next pass, because the
incremental triggers continue to maintain ES correctly between resyncs.
Locking ES before reading aggregates would invert the canonical lock
order and re-introduce the deadlock cycle the resync rewrite eliminated.
Adds zm_event.cpp's create path (Events INSERT -> bucket INSERTs ->
Event_Summaries UPDATE; event_insert_trigger itself is commented out so
zmc does this directly), and lists zmaudit.pl alongside zmstats.pl, so
the comment enumerates every writer obligated to follow the canonical
order rather than just the trigger paths.
@connortechnology
Copy link
Copy Markdown
Member Author

Addressed the four new review comments:

  • 628bcb3e8 fix: don't log Error on deadlock inside a caller-managed transactionzmDbDo no longer emits the misleading Error('Failed ... Deadlock found') when called from Event::delete / zmstats prune+resync TXs; those have their own outer retry loops. Now a Debug line in that path; Error still fires for non-1213 failures and for autocommit calls that exhaust retries.
  • 7b4617b1a docs: document atomicity tradeoff in zmstats/zmaudit ES resync — agreed the snapshot-then-update split is non-atomic; that's intentional. The trigger writers maintain ES incrementally between resyncs, drift is bounded by the resync interval and corrected on the next pass, and locking ES before reading aggregates would invert the canonical order and re-introduce the deadlock we just removed. Comment block on both scripts now spells this out.
  • e254ed41d docs: expand triggers.sql lock-order block with full writer list — added src/zm_event.cpp's create path (event_insert_trigger is commented out below, so zmc inserts Events then bucket rows then Event_Summaries directly), the bucket triggers cascading into ES, and zmaudit.pl alongside zmstats.pl. The doc now enumerates every writer obligated to follow the canonical order, not just the Events-level triggers.

PR is now at 11 commits.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

scripts/zmaudit.pl.in:1064

  • This success message is emitted even when the aggregate SELECT or Storage enumeration fails and no Storage rows were updated. Move it into the successful update path (or make it conditional) so operators do not see “Finished updating Storage DiskSpace” immediately after an error.
    aud_print('Finished updating Storage DiskSpace');

Comment on lines +447 to +448
if ($in_transaction or $err != 1213 or $attempt >= $max_attempts) { # 1213 = ER_LOCK_DEADLOCK
return;
Comment thread db/triggers.sql Outdated
Comment on lines +130 to +132
* - src/zm_event.cpp Event::createNotification path: INSERT Events, then
* INSERT Events_Hour/Day/Week/Month, then INSERT/UPDATE Event_Summaries
* (event_insert_trigger is commented out below; zmc does it directly)
Comment thread scripts/zmstats.pl.in Outdated
Comment on lines +182 to +184
# One UPDATE per monitor: each takes a single ES X-lock and reads
# nothing else, so it can't hold any bucket-row lock that would
# deadlock with the trigger path.
Comment thread scripts/zmaudit.pl.in Outdated
Comment on lines +1057 to +1060
ZoneMinder::Database::zmDbDo(
'UPDATE Storage SET DiskSpace=? WHERE Id=?',
$disk{$sid} // 0, $sid
);
Comment thread scripts/zmaudit.pl.in Outdated
Comment on lines +1028 to +1036
for my $mid (sort { $a <=> $b } keys %agg) {
my $a = $agg{$mid};
ZoneMinder::Database::zmDbDo(
$sql,
(map { $a->{$_} // 0 } @bind_template),
$mid
);
}
aud_print('Finished resyncing Event_Summaries from Events + bucket tables');
zmDbDo suppresses its Error log on 1213 inside a caller-managed TX (the
caller owns the retry), and the previous fallthrough at the end of the
retry loop just `return`ed silently. After 5 failed attempts on persistent
contention the event was effectively un-deleted with no record of the
failure. Capture errstr before rollback (some drivers clear it) and emit
an Error on the bail path.
Previous wording named a fictional Event::createNotification function;
the bucket+ES insert sequence in src/zm_event.cpp lives in the Event
constructor (line 46). Point future readers at the right symbol so they
can grep and verify the path.
The previous comment claimed each UPDATE couldn't hold any bucket lock
that would deadlock with the trigger path, which conflated statement-
level locks with TX-level locks. By the time we reach this loop the TX
already holds bucket-row X-locks from the earlier DELETEs plus any ES
X-locks acquired by the bucket DELETE triggers cascading. Rewrite the
comment to distinguish those TX-held locks from the locks acquired by
the new UPDATE statement and to be explicit that the TX's lock
acquisition direction is preserved.
Previously zmaudit logged "Finished resyncing Event_Summaries" /
"Finished updating Storage DiskSpace" unconditionally as long as the
aggregate SELECTs succeeded, masking per-row UPDATE failures (e.g.
zmDbDo exhausting its deadlock retries) and skipped aggregate column
groups. Track which aggregates were skipped and which per-monitor /
per-storage UPDATEs failed (zmDbDo returns undef on failure), and
surface that in the audit log instead of claiming the resync is
complete.
@connortechnology
Copy link
Copy Markdown
Member Author

Five new comments addressed in four follow-up commits:

  • `d01300832` fix: log Error in Event::delete when deadlock retries are exhausted — addresses the "silent return after 5 failed attempts" hazard. zmDbDo intentionally suppresses Error on 1213 inside a caller-managed TX, so Event::delete now logs its own Error (with errstr captured before rollback) on the exhausted-retries path.
  • `1323e9f02` docs: reference the actual Event::Event constructor in lock-order block — fixed the bogus `Event::createNotification` symbol; the bucket+ES insert sequence lives in `Event::Event` at `src/zm_event.cpp:46`.
  • `52d5f80e0` docs: clarify per-monitor UPDATE lock state in zmstats resync — agreed the previous comment conflated statement-level locks with TX-level locks. Rewrote it to distinguish the TX's already-held bucket and ES X-locks (acquired in canonical order by the earlier DELETEs and their cascading triggers) from the new statement-level UPDATE.
  • `ccc0bcf82` fix: track per-row UPDATE failures in zmaudit ES and Storage resync — both `aud_print('Finished ...')` messages now reflect actual completion: skipped aggregate column groups and per-row `zmDbDo` failures are counted and reported instead of being masked. An Error is logged if Storage resync didn't complete.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Comment thread scripts/zmstats.pl.in Outdated
Comment on lines +216 to +218
$dbh->rollback();
if ($err != 1213 or $attempt >= $max_attempts) { # 1213 = ER_LOCK_DEADLOCK
Error("Event_Summaries prune+resync gave up after $attempt attempt(s): ".$dbh->errstr());
Comment thread scripts/zmstats.pl.in Outdated
Comment on lines +190 to +194
if (!$err) {
for my $mid (sort { $a <=> $b } keys %agg) {
my $a = $agg{$mid};
zmDbDo(
'UPDATE Event_Summaries SET '.
Comment thread scripts/zmstats.pl.in
Comment on lines +158 to +160
my $rows = $dbh->selectall_arrayref(
"SELECT MonitorId, COUNT(*), COALESCE(SUM(DiskSpace), 0) FROM $table GROUP BY MonitorId"
);
Comment thread scripts/zmaudit.pl.in
Comment on lines +1005 to +1008
my $existing = $dbh->selectcol_arrayref('SELECT MonitorId FROM Event_Summaries');
if ($dbh->err()) {
Error("zmaudit Event_Summaries enumerate failed: ".$dbh->errstr());
} else {
Comment on lines +284 to +286
# Caller-managed TX owns the retry loop; don't pollute logs with a
# misleading "Failed ... Deadlock found" Error before they roll back.
Debug('Deadlock in caller-managed TX on '._sql_with_bind_values($sql, @params).'; deferring to caller');
…ansaction

ZoneMinder::Logger->logPrint runs INSERT INTO Logs on the same dbh.
Calling Debug()/Error() from zmDbDo's failure path inside a caller-managed
transaction would execute another statement on the connection, clearing
the err/errstr state the caller needs to see for rollback/retry. The
result could be a caller observing err=0 after a deadlock-victim TX and
committing what looks like success but is actually a rolled-back no-op.

Bail silently from zmDbDo when AutoCommit is off; the caller owns the
retry loop and is responsible for logging. Logging in the autocommit
path is still safe because each statement is its own TX.
Two issues with the previous implementation:
  1. Aggregate SELECTs ran GROUP BY MonitorId across the full bucket
     tables every zmstats cycle (default 60s). Events_Month grows for
     weeks; this turned the stats daemon into a constant full-scan
     workload on busy installs.
  2. The per-monitor UPDATE loop X-locked every Event_Summaries row on
     every cycle even when nothing changed, adding avoidable contention
     with the trigger writers this rewrite is supposed to protect.

Capture MonitorIds as we SELECT bucket rows for pruning, then skip the
resync entirely if no rows were pruned. When rows were pruned, restrict
the aggregate SELECTs (WHERE MonitorId IN ...) and the per-monitor
UPDATEs to that touched set. zmaudit remains the periodic deep-resync
safety net for drift in untouched monitors.

Also capture errstr before rollback so the gave-up Error reports the
actual reason instead of an empty string on drivers that clear errstr
on rollback.
…udit

Without this, an enumerate failure (SELECT MonitorId FROM Event_Summaries)
left @Skipped empty and the per-monitor UPDATE loop ran for whatever rows
the bucket aggregates returned — but monitors that exist only in
Event_Summaries (no current bucket rows) never got zeroed, while the
audit log claimed a full resync. Track enumerate success and report
partial resync when it fails.
@connortechnology
Copy link
Copy Markdown
Member Author

Five new comments addressed in three follow-up commits (the zmstats commit folds K + L together because the touched-monitors restriction fixes both at once):

  • `411b6916b` fix: avoid DB-backed logging inside zmDbDo when caller manages the transaction — the most serious of the batch. `ZoneMinder::Logger->logPrint` does `INSERT INTO Logs` on the same `$dbh`, so calling `Debug()`/`Error()` from zmDbDo's failure path inside a caller-managed TX would execute another statement, clearing the err/errstr the caller needs for rollback/retry. A caller could then read err=0 after a deadlock-victim TX and `commit()` what looks like success but is actually a rolled-back no-op. Now zmDbDo bails silently when AutoCommit is off; the caller's own retry loop handles logging.
  • `8fb4ad79e` perf: restrict zmstats ES resync to monitors with pruned buckets — addresses both the "full Events_Month scan every 60s" workload concern and the "X-lock every ES row every pass" contention concern. Capture MonitorIds during the prune SELECT, skip the resync entirely if nothing was pruned, and restrict the aggregate SELECTs (`WHERE MonitorId IN (...)`) and per-monitor UPDATEs to the touched set. zmaudit remains the deep-resync safety net for drift in untouched monitors. Also captures errstr before rollback so the gave-up Error reports the actual reason on drivers that clear errstr on rollback.
  • `d098a055b` fix: track Event_Summaries enumerate failure as partial resync in zmaudit — without it, the enumerate failure left `@skipped` empty and the audit log still claimed a full resync while monitors with no current bucket rows never got zeroed. Now tracked alongside the per-aggregate flags and surfaced in the partial-resync message.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment on lines +423 to +424
ZoneMinder::Database::zmDbDo('SET TRANSACTION ISOLATION LEVEL READ COMMITTED');
$ZoneMinder::Database::dbh->begin_work();
Comment thread scripts/zmstats.pl.in
Comment on lines +118 to +124
# SET TRANSACTION ... applies only to the next transaction, so it must
# be issued before begin_work and re-issued on each retry.
zmDbDo('SET TRANSACTION ISOLATION LEVEL READ COMMITTED');
$dbh->begin_work();

my $err = 0;
my $errstr; # captured before rollback() — rollback can clear errstr
Error('Failed '._sql_with_bind_values($sql, @params).' : '.$dbh->errstr());
last;
}
if ( defined $rows and ZoneMinder::Logger::logLevel() > INFO ) {
Comment thread scripts/zmaudit.pl.in
Comment on lines +1059 to +1090
# Storage DiskSpace resync: same trap. Snapshot per-Storage sums via plain
# SELECT, then UPDATE Storage one row at a time.
{
my $storage_done = 0;
my $rows = $dbh->selectall_arrayref(
'SELECT StorageId, COALESCE(SUM(DiskSpace), 0) FROM Events GROUP BY StorageId'
);
if ($dbh->err()) {
Error('zmaudit Storage aggregate failed: '.$dbh->errstr());
} else {
my %disk = map { $_->[0] => $_->[1] } grep { defined $_->[0] } @$rows;
my $storage_ids = $dbh->selectcol_arrayref('SELECT Id FROM Storage');
if ($dbh->err()) {
Error('zmaudit Storage enumerate failed: '.$dbh->errstr());
} else {
my $attempted = 0;
my $failed = 0;
for my $sid (@$storage_ids) {
$attempted++;
my $rv = ZoneMinder::Database::zmDbDo(
'UPDATE Storage SET DiskSpace=? WHERE Id=?',
$disk{$sid} // 0, $sid
);
$failed++ if !defined $rv;
}
if (!$failed) {
aud_print('Finished updating Storage DiskSpace');
$storage_done = 1;
} else {
aud_print("Partial Storage DiskSpace update: $failed/$attempted row(s) failed");
}
}
Every row in the previous arrayref-of-arrayref carried the same single
bind value (the event Id), so the [$sql, $$event{Id}] wrapping and the
my ($sql, @Bind) = @$stmt unpacking were doing no work. Iterate over the
SQL strings directly and pass $$event{Id} as the one bind value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants