Eliminate Event_Summaries / bucket-table deadlocks (zmstats, zmaudit, Event::delete, zmDbDo retry)#4810
Eliminate Event_Summaries / bucket-table deadlocks (zmstats, zmaudit, Event::delete, zmDbDo retry)#4810connortechnology wants to merge 19 commits into
Conversation
Deadlock detection (errno 1213) is part of normal InnoDB operation under contention; the engine rolls back the loser and expects the caller to re-run the statement on a fresh transaction. Most callers into zmDbDo go through autocommit, where there's no caller-managed transaction state for a retry to disturb. When AutoCommit is on, retry the statement up to 5 times with exponential backoff (~100ms -> ~1.6s, jittered). When AutoCommit is off, the caller owns the transaction and a unilateral retry would silently succeed against a TX that no longer reflects the work the caller staged before this statement; preserve the existing behavior of logging and returning undef so the caller can rebuild the TX itself. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Three issues in the existing Stats/Event_Data/Frames/Events delete
sequence:
- On any zmDbDo error inside the transaction the code called
dbh->commit() instead of dbh->rollback(). The server-side
transaction was already rolled back when InnoDB picked us as the
deadlock victim, so the commit() was effectively against a fresh
auto-started TX, but the bug pattern leaked through as confusing
state and prevented any retry.
- There was no retry. errno 1213 is expected under contention with
zmstats and zma touching the same Event_Summaries[MonitorId] row,
and the loser is supposed to re-run.
- At REPEATABLE READ, two concurrent filter workers deleting events
with adjacent EventIds take next-key/gap locks on each other's
rows in the bucket tables.
Rewrite the delete block as a retry loop: SET TRANSACTION ISOLATION
LEVEL READ COMMITTED, begin_work, run the four DELETEs, commit on
success. On any error rollback (was: commit). On errno 1213 retry up
to 5 times with backoff. Skip both the isolation switch and the
rollback-then-retry when the caller is managing their own transaction
(in_transaction); they would be the wrong scope to act in.
Falls through to the storage DiskSpace adjustment only on commit, so
a deadlocked delete leaves the event for the next filter pass instead
of orphaning the row with stale storage accounting.
Note: do NOT pre-lock Event_Summaries[MonitorId] FOR UPDATE here, even
though the trigger touches it last. Pre-locking puts ES before
buckets[Id] in the lock acquisition order, which inverts against zma's
event_update_trigger path (Events[A] -> buckets[A] -> ES[N]) and
re-introduces the cycle the rest of this work is removing.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…s resync
The previous resync code in zmstats and zmaudit used multi-table
UPDATEs against Event_Summaries that joined the bucket tables:
UPDATE Event_Summaries es
LEFT JOIN (SELECT ... FROM Events_Hour ...) h ON ...
LEFT JOIN (SELECT ... FROM Events_Day ...) d ON ...
... SET es.HourEvents = h.c, ...
zmaudit additionally used scalar correlated subqueries against Events
for the Total/Archived columns and against Events for Storage.DiskSpace.
MariaDB takes S-locks on the joined and sub-queried rows for the
duration of any multi-table UPDATE statement, regardless of isolation
level. event_update_trigger and event_delete_trigger hold X-locks on
those same bucket rows while they walk the trigger body, so the resync
deadlocks against active event lifecycle traffic. Captured a textbook
example in SHOW ENGINE INNODB STATUS:
TX(1) zma: HOLDS X Events_Hour[42229643]
WAITS X Event_Summaries[28]
TX(2) zmstats: HOLDS X Event_Summaries[2,4,5,...,28,...,73]
WAITS S Events_Hour[42229643]
Replace the JOIN/subquery pattern in both scripts with a snapshot phase
followed by per-monitor UPDATEs:
1. SELECT MonitorId, COUNT(*), SUM(DiskSpace) FROM each bucket
(and the equivalent Total/Archived aggregate from Events).
Plain SELECTs do consistent reads and take no row locks.
2. SELECT MonitorId FROM Event_Summaries to widen the universe so
monitors with empty buckets still get zeroed out.
3. For each monitor, UPDATE Event_Summaries SET ... WHERE MonitorId=?.
Each UPDATE only X-locks one ES row and reads no other table.
zmaudit's five separate UPDATEs collapse to one snapshot phase plus
one UPDATE per monitor. Storage DiskSpace gets the same treatment.
zmstats keeps the same outer transaction (BEGIN ... COMMIT, RC
isolation, retry on 1213) so the bucket DELETEs and the resync stay
atomic, but the resync no longer reads the bucket tables under lock.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Records the InnoDB X-lock order that all writers (zma trigger path, zmstats prune+resync, zmfilter/Event::delete) must follow to avoid the deadlock cycles fixed in the surrounding commits. Comment only; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
Pull request overview
Reduces recurring MariaDB/InnoDB deadlocks between event-trigger writers (zma), maintenance scripts (zmstats.pl, zmaudit.pl), and event deletion by removing join-based multi-table UPDATE patterns and adding targeted deadlock retries.
Changes:
- Add deadlock (errno 1213) retry with exponential backoff to
ZoneMinder::Database::zmDbDowhen running in autocommit mode. - Rework
Event::deleteto run underREAD COMMITTED(when it owns the transaction) and retry the whole delete transaction on deadlock. - Rewrite
zmstats.plandzmaudit.plEvent_Summaries (and Storage DiskSpace inzmaudit.pl) maintenance to “snapshot via SELECT, then per-row UPDATE” to avoid lock-order inversions; add deadlock retry loop inzmstats.pl. - Add documentation in
db/triggers.sqldescribing intended lock acquisition order.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/ZoneMinder/lib/ZoneMinder/Event.pm | Adds RC isolation + deadlock retry loop around per-event delete transaction and updates trigger/locking rationale comments. |
| scripts/ZoneMinder/lib/ZoneMinder/Database.pm | Adds autocommit-only deadlock retry/backoff behavior to zmDbDo. |
| scripts/zmstats.pl.in | Rewrites bucket prune + Event_Summaries resync to snapshot aggregates then update per MonitorId inside an RC transaction with retries. |
| scripts/zmaudit.pl.in | Rewrites Event_Summaries and Storage DiskSpace recomputation to snapshot aggregates then update per row. |
| db/triggers.sql | Adds lock-order comment block near event_update_trigger definition. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # event_delete_trigger fires BEFORE DELETE on Events; after the | ||
| # accompanying triggers.sql change, event_update_trigger now also fires | ||
| # BEFORE UPDATE. That gives every Events writer the same lock acquisition | ||
| # order: buckets[Id] -> Event_Summaries[MonitorId] -> Events[Id] (the | ||
| # outer DML row last). zmstats.pl runs at the same RC isolation and takes | ||
| # bucket-row X-locks first, then UPDATE Event_Summaries, which is the | ||
| # same prefix order — so no cycle is possible across filter / zma / | ||
| # zmstats. Do NOT pre-lock Event_Summaries here: that puts ES before | ||
| # buckets and re-introduces the inversion against zma's UPDATE path. | ||
| # |
| drop trigger if exists event_update_trigger// | ||
|
|
||
| CREATE TRIGGER event_update_trigger AFTER UPDATE ON Events | ||
| CREATE TRIGGER event_update_trigger AFTER UPDATE ON Events |
| my $events_rows = $dbh->selectall_arrayref(q{ | ||
| SELECT MonitorId, | ||
| COUNT(Id), | ||
| COALESCE(SUM(CASE WHEN DiskSpace IS NOT NULL THEN DiskSpace ELSE 0 END), 0), | ||
| COUNT(CASE WHEN Archived = 1 THEN 1 END), | ||
| COALESCE(SUM(CASE WHEN Archived = 1 AND DiskSpace IS NOT NULL THEN DiskSpace ELSE 0 END), 0) | ||
| FROM Events | ||
| GROUP BY MonitorId | ||
| }); | ||
| if ($dbh->err()) { | ||
| Error("zmaudit Events aggregate failed: ".$dbh->errstr()); | ||
| } else { | ||
| for my $r (@$events_rows) { | ||
| $agg{$r->[0]}{total_c} = $r->[1]; | ||
| $agg{$r->[0]}{total_s} = $r->[2]; | ||
| $agg{$r->[0]}{archived_c} = $r->[3]; | ||
| $agg{$r->[0]}{archived_s} = $r->[4]; | ||
| } |
The previous comment block claimed event_update_trigger fires BEFORE UPDATE and that the lock order was buckets -> Event_Summaries -> Events. Neither matches the code: triggers.sql defines event_update_trigger as AFTER UPDATE, and InnoDB X-locks the matched Events row during WHERE evaluation before either BEFORE or AFTER trigger bodies fire — so the canonical chain is Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId] which is what triggers.sql already documents. Comment-only change. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…in zmaudit Previously, if any of the five aggregate SELECTs (Events, Events_Hour/Day/ Week/Month) failed transiently, the per-monitor UPDATE phase still ran and wrote `// 0` for every column from those failed groups, destroying valid counters across all monitors. Track per-aggregate success and build the UPDATE SET clause from only the column groups whose SELECT succeeded. zmaudit re-runs on its normal interval, so a missed group is corrected on the next pass instead of being overwritten with zeros now. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Pushed two follow-up commits addressing the review feedback:
|
| my ($table, $interval) = @$bucket; | ||
| my $event_ids = $dbh->selectcol_arrayref( | ||
| "SELECT EventId FROM $table WHERE StartDateTime < DATE_SUB(NOW(), INTERVAL $interval)" | ||
| ); |
selectcol_arrayref returns undef on a DB error. The previous code only checked truthiness of the result before deciding to DELETE, so a transient SELECT failure would silently skip the prune for that bucket and let the transaction continue to commit an incomplete state. Capture \$dbh->err() after the SELECT and bail out the same way the DELETE error path does. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Pushed |
Inline comments at every occurrence so future readers don't have to look up the errno. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (5)
scripts/zmaudit.pl.in:1025
- The loop ignores whether a per-monitor UPDATE ultimately failed after all zmDbDo retries. If one row fails, the script continues updating later rows and prints the success message, leaving Event_Summaries only partially resynced with no clear failure at the resync level.
ZoneMinder::Database::zmDbDo(
$sql,
(map { $a->{$_} // 0 } @bind_template),
$mid
);
scripts/zmaudit.pl.in:1050
- This Storage overwrite uses a previously snapshotted sum, so a concurrent event writer can update Storage.DiskSpace after the SELECT and then be overwritten here with the stale value. That can lose the live storage increment/decrement until a later audit pass.
ZoneMinder::Database::zmDbDo(
'UPDATE Storage SET DiskSpace=? WHERE Id=?',
$disk{$sid} // 0, $sid
scripts/zmaudit.pl.in:1055
- This success message is emitted even when the aggregate SELECT or Storage enumeration failed, because it is outside the success branch. In those cases no Storage rows were updated, so the message can mislead interactive users and logs into thinking the resync completed.
aud_print('Finished updating Storage DiskSpace');
scripts/zmaudit.pl.in:1050
- The return value from the per-storage UPDATE is ignored. If one Storage row still fails after zmDbDo's retries, the script silently continues with later rows and then reports completion, leaving DiskSpace only partially resynced.
ZoneMinder::Database::zmDbDo(
'UPDATE Storage SET DiskSpace=? WHERE Id=?',
$disk{$sid} // 0, $sid
scripts/zmaudit.pl.in:1027
- This reports a full Event_Summaries resync whenever at least one aggregate group succeeded, even if other aggregate SELECTs failed and their column groups were intentionally skipped. The message should distinguish a partial resync so operators do not assume all counters were refreshed.
aud_print('Finished resyncing Event_Summaries from Events + bucket tables');
| Debug("Deadlock on '"._sql_with_bind_values($sql, @params)."' attempt $attempt/$max_attempts, retrying"); | ||
| select(undef, undef, undef, 0.05 * (1 << $attempt) + rand(0.05)); | ||
| next; | ||
| } |
| my $rows = $dbh->selectall_arrayref( | ||
| "SELECT MonitorId, COUNT(*), COALESCE(SUM(DiskSpace), 0) FROM $table GROUP BY MonitorId" |
| my $existing = $dbh->selectcol_arrayref('SELECT MonitorId FROM Event_Summaries'); | ||
| if ($dbh->err()) { | ||
| Error("zmaudit Event_Summaries enumerate failed: ".$dbh->errstr()); | ||
| } else { | ||
| $agg{$_} ||= {} for @$existing; | ||
| } | ||
|
|
||
| # Build the SET clause from only the column groups we successfully read. | ||
| my @set; | ||
| my @bind_template; | ||
| if ($ok{events}) { | ||
| push @set, 'TotalEvents=?, TotalEventDiskSpace=?, ArchivedEvents=?, ArchivedEventDiskSpace=?'; | ||
| push @bind_template, qw(total_c total_s archived_c archived_s); | ||
| } | ||
| for my $bucket (['h','Hour'], ['d','Day'], ['w','Week'], ['m','Month']) { | ||
| my ($key, $col) = @$bucket; | ||
| next unless $ok{$key}; | ||
| push @set, "${col}Events=?, ${col}EventDiskSpace=?"; | ||
| push @bind_template, $key.'_c', $key.'_s'; | ||
| } | ||
|
|
||
| if (@set) { | ||
| my $sql = 'UPDATE Event_Summaries SET '.join(', ', @set).' WHERE MonitorId=?'; | ||
| for my $mid (sort { $a <=> $b } keys %agg) { | ||
| my $a = $agg{$mid}; | ||
| ZoneMinder::Database::zmDbDo( | ||
| $sql, | ||
| (map { $a->{$_} // 0 } @bind_template), | ||
| $mid | ||
| ); | ||
| } | ||
| aud_print('Finished resyncing Event_Summaries from Events + bucket tables'); | ||
| } else { | ||
| Error('zmaudit: every Event_Summaries aggregate SELECT failed; skipping resync'); | ||
| } |
| * scripts that touch the same tables). InnoDB X-locks the matched Events row | ||
| * during WHERE evaluation, before either BEFORE or AFTER trigger bodies fire, | ||
| * so the order is the same regardless of trigger timing: | ||
| * Events[Id] -> Events_Hour/Day/Week/Month[EventId] -> Event_Summaries[MonitorId] | ||
| * zmstats.pl prune+resync follows the matching prefix (bucket DELETEs then | ||
| * UPDATE Event_Summaries) and crucially does NOT pre-lock Event_Summaries — | ||
| * pre-locking ES would invert against the trigger body order and reintroduce | ||
| * deadlocks against filter / zma writers. The bucket update/delete triggers | ||
| * also propagate into Event_Summaries[MonitorId] in the same direction. |
When zmDbDo is called inside a caller-managed transaction (AutoCommit off), max_attempts is 1 and the loop falls through to Error on a 1213 deadlock — which is misleading, because the caller (Event::delete, the zmstats prune+resync TX) has its own outer retry loop that will roll back and succeed. Downgrade to a Debug message in that path; Error is still emitted for non-deadlock failures and for autocommit calls that exhaust their retries.
A concurrent trigger writer can adjust Event_Summaries between our snapshot SELECTs and the per-monitor UPDATE; the UPDATE then overwrites that adjustment with the older snapshot. Drift is bounded by the zmstats/zmaudit interval and corrected on the next pass, because the incremental triggers continue to maintain ES correctly between resyncs. Locking ES before reading aggregates would invert the canonical lock order and re-introduce the deadlock cycle the resync rewrite eliminated.
Adds zm_event.cpp's create path (Events INSERT -> bucket INSERTs -> Event_Summaries UPDATE; event_insert_trigger itself is commented out so zmc does this directly), and lists zmaudit.pl alongside zmstats.pl, so the comment enumerates every writer obligated to follow the canonical order rather than just the trigger paths.
|
Addressed the four new review comments:
PR is now at 11 commits. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
scripts/zmaudit.pl.in:1064
- This success message is emitted even when the aggregate SELECT or Storage enumeration fails and no Storage rows were updated. Move it into the successful update path (or make it conditional) so operators do not see “Finished updating Storage DiskSpace” immediately after an error.
aud_print('Finished updating Storage DiskSpace');
| if ($in_transaction or $err != 1213 or $attempt >= $max_attempts) { # 1213 = ER_LOCK_DEADLOCK | ||
| return; |
| * - src/zm_event.cpp Event::createNotification path: INSERT Events, then | ||
| * INSERT Events_Hour/Day/Week/Month, then INSERT/UPDATE Event_Summaries | ||
| * (event_insert_trigger is commented out below; zmc does it directly) |
| # One UPDATE per monitor: each takes a single ES X-lock and reads | ||
| # nothing else, so it can't hold any bucket-row lock that would | ||
| # deadlock with the trigger path. |
| ZoneMinder::Database::zmDbDo( | ||
| 'UPDATE Storage SET DiskSpace=? WHERE Id=?', | ||
| $disk{$sid} // 0, $sid | ||
| ); |
| for my $mid (sort { $a <=> $b } keys %agg) { | ||
| my $a = $agg{$mid}; | ||
| ZoneMinder::Database::zmDbDo( | ||
| $sql, | ||
| (map { $a->{$_} // 0 } @bind_template), | ||
| $mid | ||
| ); | ||
| } | ||
| aud_print('Finished resyncing Event_Summaries from Events + bucket tables'); |
zmDbDo suppresses its Error log on 1213 inside a caller-managed TX (the caller owns the retry), and the previous fallthrough at the end of the retry loop just `return`ed silently. After 5 failed attempts on persistent contention the event was effectively un-deleted with no record of the failure. Capture errstr before rollback (some drivers clear it) and emit an Error on the bail path.
Previous wording named a fictional Event::createNotification function; the bucket+ES insert sequence in src/zm_event.cpp lives in the Event constructor (line 46). Point future readers at the right symbol so they can grep and verify the path.
The previous comment claimed each UPDATE couldn't hold any bucket lock that would deadlock with the trigger path, which conflated statement- level locks with TX-level locks. By the time we reach this loop the TX already holds bucket-row X-locks from the earlier DELETEs plus any ES X-locks acquired by the bucket DELETE triggers cascading. Rewrite the comment to distinguish those TX-held locks from the locks acquired by the new UPDATE statement and to be explicit that the TX's lock acquisition direction is preserved.
Previously zmaudit logged "Finished resyncing Event_Summaries" / "Finished updating Storage DiskSpace" unconditionally as long as the aggregate SELECTs succeeded, masking per-row UPDATE failures (e.g. zmDbDo exhausting its deadlock retries) and skipped aggregate column groups. Track which aggregates were skipped and which per-monitor / per-storage UPDATEs failed (zmDbDo returns undef on failure), and surface that in the audit log instead of claiming the resync is complete.
|
Five new comments addressed in four follow-up commits:
|
| $dbh->rollback(); | ||
| if ($err != 1213 or $attempt >= $max_attempts) { # 1213 = ER_LOCK_DEADLOCK | ||
| Error("Event_Summaries prune+resync gave up after $attempt attempt(s): ".$dbh->errstr()); |
| if (!$err) { | ||
| for my $mid (sort { $a <=> $b } keys %agg) { | ||
| my $a = $agg{$mid}; | ||
| zmDbDo( | ||
| 'UPDATE Event_Summaries SET '. |
| my $rows = $dbh->selectall_arrayref( | ||
| "SELECT MonitorId, COUNT(*), COALESCE(SUM(DiskSpace), 0) FROM $table GROUP BY MonitorId" | ||
| ); |
| my $existing = $dbh->selectcol_arrayref('SELECT MonitorId FROM Event_Summaries'); | ||
| if ($dbh->err()) { | ||
| Error("zmaudit Event_Summaries enumerate failed: ".$dbh->errstr()); | ||
| } else { |
| # Caller-managed TX owns the retry loop; don't pollute logs with a | ||
| # misleading "Failed ... Deadlock found" Error before they roll back. | ||
| Debug('Deadlock in caller-managed TX on '._sql_with_bind_values($sql, @params).'; deferring to caller'); |
…ansaction ZoneMinder::Logger->logPrint runs INSERT INTO Logs on the same dbh. Calling Debug()/Error() from zmDbDo's failure path inside a caller-managed transaction would execute another statement on the connection, clearing the err/errstr state the caller needs to see for rollback/retry. The result could be a caller observing err=0 after a deadlock-victim TX and committing what looks like success but is actually a rolled-back no-op. Bail silently from zmDbDo when AutoCommit is off; the caller owns the retry loop and is responsible for logging. Logging in the autocommit path is still safe because each statement is its own TX.
Two issues with the previous implementation:
1. Aggregate SELECTs ran GROUP BY MonitorId across the full bucket
tables every zmstats cycle (default 60s). Events_Month grows for
weeks; this turned the stats daemon into a constant full-scan
workload on busy installs.
2. The per-monitor UPDATE loop X-locked every Event_Summaries row on
every cycle even when nothing changed, adding avoidable contention
with the trigger writers this rewrite is supposed to protect.
Capture MonitorIds as we SELECT bucket rows for pruning, then skip the
resync entirely if no rows were pruned. When rows were pruned, restrict
the aggregate SELECTs (WHERE MonitorId IN ...) and the per-monitor
UPDATEs to that touched set. zmaudit remains the periodic deep-resync
safety net for drift in untouched monitors.
Also capture errstr before rollback so the gave-up Error reports the
actual reason instead of an empty string on drivers that clear errstr
on rollback.
…udit Without this, an enumerate failure (SELECT MonitorId FROM Event_Summaries) left @Skipped empty and the per-monitor UPDATE loop ran for whatever rows the bucket aggregates returned — but monitors that exist only in Event_Summaries (no current bucket rows) never got zeroed, while the audit log claimed a full resync. Track enumerate success and report partial resync when it fails.
|
Five new comments addressed in three follow-up commits (the zmstats commit folds K + L together because the touched-monitors restriction fixes both at once):
|
| ZoneMinder::Database::zmDbDo('SET TRANSACTION ISOLATION LEVEL READ COMMITTED'); | ||
| $ZoneMinder::Database::dbh->begin_work(); |
| # SET TRANSACTION ... applies only to the next transaction, so it must | ||
| # be issued before begin_work and re-issued on each retry. | ||
| zmDbDo('SET TRANSACTION ISOLATION LEVEL READ COMMITTED'); | ||
| $dbh->begin_work(); | ||
|
|
||
| my $err = 0; | ||
| my $errstr; # captured before rollback() — rollback can clear errstr |
| Error('Failed '._sql_with_bind_values($sql, @params).' : '.$dbh->errstr()); | ||
| last; | ||
| } | ||
| if ( defined $rows and ZoneMinder::Logger::logLevel() > INFO ) { |
| # Storage DiskSpace resync: same trap. Snapshot per-Storage sums via plain | ||
| # SELECT, then UPDATE Storage one row at a time. | ||
| { | ||
| my $storage_done = 0; | ||
| my $rows = $dbh->selectall_arrayref( | ||
| 'SELECT StorageId, COALESCE(SUM(DiskSpace), 0) FROM Events GROUP BY StorageId' | ||
| ); | ||
| if ($dbh->err()) { | ||
| Error('zmaudit Storage aggregate failed: '.$dbh->errstr()); | ||
| } else { | ||
| my %disk = map { $_->[0] => $_->[1] } grep { defined $_->[0] } @$rows; | ||
| my $storage_ids = $dbh->selectcol_arrayref('SELECT Id FROM Storage'); | ||
| if ($dbh->err()) { | ||
| Error('zmaudit Storage enumerate failed: '.$dbh->errstr()); | ||
| } else { | ||
| my $attempted = 0; | ||
| my $failed = 0; | ||
| for my $sid (@$storage_ids) { | ||
| $attempted++; | ||
| my $rv = ZoneMinder::Database::zmDbDo( | ||
| 'UPDATE Storage SET DiskSpace=? WHERE Id=?', | ||
| $disk{$sid} // 0, $sid | ||
| ); | ||
| $failed++ if !defined $rv; | ||
| } | ||
| if (!$failed) { | ||
| aud_print('Finished updating Storage DiskSpace'); | ||
| $storage_done = 1; | ||
| } else { | ||
| aud_print("Partial Storage DiskSpace update: $failed/$attempted row(s) failed"); | ||
| } | ||
| } |
Every row in the previous arrayref-of-arrayref carried the same single
bind value (the event Id), so the [$sql, $$event{Id}] wrapping and the
my ($sql, @Bind) = @$stmt unpacking were doing no work. Iterate over the
SQL strings directly and pass $$event{Id} as the one bind value.
Summary
Eliminates the recurring InnoDB deadlock cycle observed under load between
zma'sevent_update_triggerpath,zmstats.pl's Event_Summaries maintenance,zmaudit.pl's aggregations, andzmfilter.pl(Event::delete).The root cause was a multi-table
UPDATE Event_Summaries es JOIN Events_Hour ...pattern: MariaDB takes S-locks on the joined bucket rows and holds them to TX commit regardless of isolation level, which inverts the canonical lock orderagainst the trigger writers that hold X-locks on those same bucket rows. The InnoDB deadlock log captured the exact
zmavszmstatscycle.What changes
zmDbDoauto-retries on errno 1213 (5 attempts, exponential backoff with jitter) when called outside an explicit transaction. Most call sites are autocommit-style and already idempotent on retry.Event::deletesetsREAD COMMITTEDfor the per-event delete TX, runs Stats/Event_Data/Frames/Events deletes, and retries the whole TX on 1213. Also fixes a latentcommit()->rollback()bug on the error path.zmstats.plEvent_Summaries maintenance is rewritten to:Events_Hour/Day/Week/Month(RC, no gap locks).SELECT(consistent read at RC, no locks).UPDATE Event_SummariesperMonitorIdusing the snapshotted values.This keeps the lock prefix (
buckets -> Event_Summaries) consistent with the trigger path and never holds bucket S-locks while writing ES. Wrapped in 5-retry deadlock loop.zmaudit.plreceives the same snapshot-then-per-row treatment for its Event_Summaries and StorageDiskSpacerecomputations, replacing five correlated/JOINed UPDATEs that had the same lock-ordering hazard.db/triggers.sqlpicks up a comment block aboveevent_update_triggerdocumenting the canonical lock acquisition order so future writers don't reintroduce the cycle.Why isolation level alone wasn't enough
SET TRANSACTION ISOLATION LEVEL READ COMMITTEDdrops the next-key/gap locks on plain DELETE/UPDATE range scans, but a multi-table UPDATE always takes S-locks on the JOINed rows for as long as the transaction lives. The structural fix had to be removing the JOIN, not lowering isolation.Notes for reviewers
triggers.sql.zmDbDoretry is conditional onAutoCommit— it will not silently re-run statements that are part of an explicit caller transaction (those callers are responsible for their own retry, e.g. theEvent::deleteloop).Event::delete's isolation set + retry path is bypassed when the caller is already inside a transaction ($in_transactionshort-circuit), so callers that wrap multiple deletes in their own TX keep their semantics.Test plan
cmake --build build) — Perl scripts go through cmakeconfigure_file.perl -con the touched modules with a real install (placeholders substituted).zmstats.pl,zmaudit.plagainst a busy install and confirm noDeadlock founderrors in/var/log/zm/.zmais updating the same monitor; confirmEvent::deleteretries succeed and the row is removed.SUM(DiskSpace)/COUNT(*)from the bucket tables after a fullzmstatscycle.zmaudit.pl --interactiveStorage DiskSpace UPDATE still produces correct totals.🤖 Generated with Claude Code