Skip to content

BaseTrigger Max Retries + Pre-Ack Handling#1969

Open
DylanTinianov wants to merge 9 commits intomainfrom
CRE-3248-basetrigger-attempts-max
Open

BaseTrigger Max Retries + Pre-Ack Handling#1969
DylanTinianov wants to merge 9 commits intomainfrom
CRE-3248-basetrigger-attempts-max

Conversation

@DylanTinianov
Copy link
Copy Markdown
Contributor

@DylanTinianov DylanTinianov commented Apr 7, 2026

  • Adds max retries to base trigger along with DB pruning loop for cleanup.
  • Adds pre-ACK cache for slow nodes which receive an ACK before even seeing the trigger event
  • Fixes race condition on ACK during event Delivery

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

⚠️ API Diff Results - github.com/smartcontractkit/chainlink-common

⚠️ Breaking Changes (1)

pkg/capabilities.BaseTriggerMetrics (1)
  • IncGaveUp — ➕ Added

✅ Compatible Changes (3)

pkg/capabilities.(*BaseTriggerBeholderMetrics) (1)
  • IncGaveUp — ➕ Added
pkg/settings/cresettings.Schema (2)
  • BaseTriggerMaxRetries — ➕ Added

  • BaseTriggerPruneAge — ➕ Added


📄 View full apidiff report

@DylanTinianov DylanTinianov changed the title Implement max retries BaseTrigger Max Retries Apr 7, 2026
@DylanTinianov DylanTinianov marked this pull request as ready for review April 7, 2026 15:55
@DylanTinianov DylanTinianov requested a review from a team as a code owner April 7, 2026 15:55
return
}

maxAttempts := b.maxRetries(ctx)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you be consistent, either maxAttempt or maxRetries?

activeRegistrations metric.Int64UpDownCounter
pendingEvents metric.Int64UpDownCounter
stuckEvents metric.Int64UpDownCounter
gaveUpCount metric.Int64Counter
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gaveUp shouldn't be something more like stopResendingEvents? specially since the other metrics are stuckEvents/pendingEvents/etc.

var toGiveUp []gaveUpEvent
for triggerID, pendingForTrigger := range b.pending {
for eventID, rec := range pendingForTrigger {
if maxAttempts > 0 && rec.Attempts >= maxAttempts {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace maxAttempts > 0 by a function to make it clear

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, could you make a new method with this new logic which is "stop resending and fire a metric", and the old behaviour the code for "appendToTryResending"

so there's a clear

if reachedMaxAttempts(rec.Attempts){
  "stop resending and fire a metric"()
} else {
  "appendToTryResending"()
}

triggerID: triggerID,
eventID: eventID,
attempts: rec.Attempts,
wasCritical: wasCritical,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe for another PR but this metric of criticality somehow collides with AddPendingEvents() one.

I believe critically shouldn't be handled by the code here but in alerts, so you might drop this metric of critical and just have a metric of resending.

In an ideal scenario, resending should be almost near 0

if ev.wasCritical {
b.metrics.DecStuckEvent(ev.triggerID, ev.eventID)
}
if err := b.store.DeleteEvent(ctx, ev.triggerID, ev.eventID); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really wondering if we want to do this DB deletion here, or having another long lived process that prunes old data from DB.

Specially since if this happens for a payload that's unrecoverable such as the HTTP Trigger, you have no means to somehow restore this data to the customer (just a thought)

b.mu.Unlock()

if inMemory {
// Still actively tracked — scanPending will handle it (gave-up or ACK).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should technically never happen, right?

If the prune time is set to 24hs, and max attempts to 20, with a retrial of 30 seconds, that's 10m of retrials, so eventually this should have been deleted from in-mem.

I believe you should error here, or even throw a metric to show there's an inconsistence

}
cutoff := time.Now().Add(-age)

recs, err := b.store.List(b.ctx)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be better to have a query just to hit the DB asking for events which have been modified recently, and potentially also parameterize the maxAttempts to it

}

for _, rec := range recs {
if rec.FirstAt.After(cutoff) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why FirstAt field and not LastSeenAt?
Shouldn't you remove based on the last time you modified the row in the DB instead of the first time you inserted it?

@DylanTinianov DylanTinianov changed the title BaseTrigger Max Retries BaseTrigger Max Retries + Pre-Ack Handling Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants