
fix(channel): retry retrigger relay sends to avoid dropped final replies #582

Open
brendandebeasi wants to merge 2 commits into spacedriveapp:main from brendandebeasi:fix/telegram-retrigger-relay-retry

Conversation

@brendandebeasi

Summary

Implements a first-pass mitigation for #581 by hardening retrigger/fallback delivery in src/agent/channel.rs.

What changed

  • Added bounded retry/backoff for outbound text sends (send_outbound_text_with_retry).
  • Applied retries to the retrigger plaintext fallback path (where dropped final replies were observed).
  • When the retrigger fallback send succeeds, sets replied_flag = true so relay state is preserved correctly.
  • If all retries fail on the retrigger relay, sends a short backup notice instructing the user to send 'continue' to replay the preserved result.
  • Applied a smaller retry budget to the non-retrigger plaintext fallback path.
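
A minimal sketch of what a bounded retry with backoff could look like. This is not the PR's code: `send_with_retry`, its closure-based `send` parameter, and the `backoff` slice are hypothetical names, and the sketch is synchronous (`std::thread::sleep`) where the real channel code is async.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for `send_outbound_text_with_retry`.
/// `send` is injected so the policy can be exercised without a real adapter;
/// it returns true when the message was delivered.
fn send_with_retry<F>(mut send: F, max_attempts: usize, backoff: &[Duration]) -> bool
where
    F: FnMut() -> bool,
{
    for attempt in 0..max_attempts {
        if send() {
            return true;
        }
        // Sleep only when another attempt remains. If the schedule is shorter
        // than the attempt budget, reuse its last entry (or do not wait at all).
        if attempt + 1 < max_attempts {
            let delay = backoff
                .get(attempt)
                .or(backoff.last())
                .copied()
                .unwrap_or(Duration::ZERO);
            sleep(delay);
        }
    }
    false
}
```

The attempt budget bounds the worst-case stall: with `max_attempts = 3` and a two-entry schedule, the caller blocks for at most the sum of those two delays before giving up.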

Why

Logs showed successful worker completion followed by:

  • retrigger produced text without reply tool, sending as fallback
  • retrigger relay failed, preserving result in history for next turn

That leads to user-visible truncation (partial opener, no final rundown). This PR adds retries and clearer user-facing fallback behavior.

Validation

  • cargo check --all-targets passes.

Follow-ups (optional)

  • Add adapter-level tests with injected send failure to assert retry + backup notice behavior.
  • Include richer transport error metadata from messaging adapters for easier debugging.

@coderabbitai
Contributor

coderabbitai Bot commented May 2, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 03684f88-5ad0-48c3-945b-3ca4adb0387f

📥 Commits

Reviewing files that changed from the base of the PR and between d7b5c0d and 1010830.

📒 Files selected for processing (1)
  • src/agent/channel.rs

Walkthrough

send_outbound_text now returns bool indicating routed-send success. handle_agent_result fallback branches check that return value and set replied_flag = true when fallback text is sent successfully; one non-retrigger call site now awaits the bool return without using it.

Changes

Channel outbound send updates

| Layer / File(s) | Summary |
|---|---|
| Send return type (src/agent/channel.rs) | send_outbound_text changed to return bool (true on routed send success, false on routed channel closed); metrics and logging retained. |
| Retrigger-skipped fallback (src/agent/channel.rs) | Retrigger-skipped-but-produced-text path now checks outbound send result and sets replied_flag = true when send returns true. |
| Retrigger produced-text, no reply tool (src/agent/channel.rs) | Retrigger produced-text without reply-tool path now checks outbound send result and sets replied_flag = true when send returns true. |
| Non-retrigger fallback call site (src/agent/channel.rs) | Non-retrigger fallback send updated to await send_outbound_text's bool return value; the returned bool is not used to set replied_flag in this path. |
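
The shape of that change can be sketched as follows. This is an illustration under assumptions, not the code in src/agent/channel.rs: it uses the synchronous `std::sync::mpsc` channel instead of the async one, `send_outbound_text` here is a free function rather than a method, and `send_retrigger_fallback` is a hypothetical name for the fallback branch.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for `send_outbound_text`: true when the message was
/// handed to the routed channel, false when the receiving side is gone.
fn send_outbound_text(tx: &Sender<String>, text: String) -> bool {
    match tx.send(text) {
        Ok(()) => true,
        Err(error) => {
            // Log once and fail fast; no retry at this layer.
            eprintln!("outbound send failed, routed channel closed: {error}");
            false
        }
    }
}

/// Fallback branch: mark the turn as replied only when the send succeeded.
fn send_retrigger_fallback(tx: &Sender<String>, text: String, replied_flag: &AtomicBool) {
    if send_outbound_text(tx, text) {
        replied_flag.store(true, Ordering::Relaxed);
    }
}
```

A failed send leaves `replied_flag` untouched, so whatever logic runs after the turn can still treat the result as undelivered.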

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • jamiepine
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding retry logic to prevent dropped final replies in retrigger relay sends, which directly aligns with the changeset's primary objective. |
| Description check | ✅ Passed | The description provides relevant context about the changes made to handle_agent_result and send_outbound_text in channel.rs, explaining the motivation to avoid dropped final replies and improve fallback delivery. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agent/channel.rs`:
- Around line 3114-3131: Add an injected-failure unit/integration test that
exercises the retrigger fallback state machine around
send_outbound_text_with_retry: simulate controlled send failures and time/retry
behavior to assert both branches — (1) success after retry sets replied_flag to
true and does not enqueue the preserved replay, and (2) permanent failure leaves
replied_flag false and sends the backup "continue" notice while preserving the
background result for replay. Use dependency injection/mocking or an injectable
send function used by send_outbound_text_with_retry to force 0, partial, and
full-failure scenarios, advance any timers/async delays as needed, and assert
final state (replied_flag, preserved-result replay state, and that the correct
backup notice was sent).
- Around line 2991-3037: The send_outbound_text_with_retry function implements
sleeps and retries inside the channel event loop but mpsc::Sender::send() only
errors on a permanently closed receiver, so remove the retry loop and delay
sleeps from send_outbound_text_with_retry; instead, call send_outbound_text
once, log a clear warning including channel_id and error_context when it fails,
and return false immediately (fail fast). Keep send_outbound_text unchanged and
avoid introducing retry logic or sleeps at this channel layer—if transient
transport retries are needed, implement them in the adapter/transport layer with
tests and rationale.
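
The closed-receiver point can be illustrated with a small sketch. It uses `std::sync::mpsc` as a stand-in for the tokio channel (an assumption made so the example needs no external crate); both report a send error only once the receiver is gone. The helper names are hypothetical.

```rust
use std::sync::mpsc::{channel, Sender};

/// Counts how many of `attempts` sends succeed on the given sender.
fn count_successful_sends(tx: &Sender<&'static str>, attempts: usize) -> usize {
    (0..attempts)
        .filter(|_| tx.send("final reply").is_ok())
        .count()
}

/// Sends three times while the receiver is alive, then three times after it
/// has been dropped, and returns both success counts.
fn closed_channel_demo() -> (usize, usize) {
    let (tx, rx) = channel();
    let while_open = count_successful_sends(&tx, 3);
    drop(rx); // the receiver is closed for good from here on
    let after_close = count_successful_sends(&tx, 3);
    (while_open, after_close)
}
```

Once the receiver is dropped, every later attempt fails in the same way, so sleeping between attempts only delays the failure.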

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2a18f698-a307-41f1-98cc-ebbfaf3ffe92

📥 Commits

Reviewing files that changed from the base of the PR and between 8fb07b8 and d7b5c0d.

📒 Files selected for processing (1)
  • src/agent/channel.rs

Comment thread src/agent/channel.rs Outdated
Comment on lines +3114 to +3131
let delivered = self
    .send_outbound_text_with_retry(
        final_text,
        "failed to send retrigger fallback reply",
        3,
    )
    .await;
if delivered {
    replied_flag.store(true, std::sync::atomic::Ordering::Relaxed);
} else {
    let _ = self
        .send_outbound_text_with_retry(
            "Delivery issue: your background result is preserved. Send 'continue' to replay it.".to_string(),
            "failed to send relay failure backup notice",
            1,
        )
        .await;
}

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Add an injected-failure test for the new retrigger fallback state machine.

These paths now couple retry timing, replied_flag mutation, preserved-result replay, and the backup "continue" notice, but the provided validation is still just cargo check. Please add a targeted test that forces controlled send failures and asserts both branches: success after retry and failure-after-all-attempts preserving replay state.

As per coding guidelines, "For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to just `just gate-pr`."

Also applies to: 3192-3209, 3265-3270
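
One possible shape for such a test, sketched under assumptions: `ScriptedSender` is a hypothetical test double whose sends follow a scripted list of outcomes, and `retrigger_fallback` is a simplified synchronous model of the branch under review (three attempts, then a single backup notice) with the delays left out.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

const BACKUP_NOTICE: &str =
    "Delivery issue: your background result is preserved. Send 'continue' to replay it.";

/// Hypothetical test double: each send pops the next scripted outcome.
struct ScriptedSender {
    outcomes: VecDeque<bool>,
    sent: Vec<String>,
}

impl ScriptedSender {
    fn new(outcomes: &[bool]) -> Self {
        Self {
            outcomes: outcomes.iter().copied().collect(),
            sent: Vec::new(),
        }
    }

    /// Records the text only on a scripted success; an exhausted script fails.
    fn send(&mut self, text: &str) -> bool {
        let ok = self.outcomes.pop_front().unwrap_or(false);
        if ok {
            self.sent.push(text.to_string());
        }
        ok
    }
}

/// Simplified model of the fallback: up to three attempts for the final text,
/// then one attempt for the backup notice if none of them succeeded.
fn retrigger_fallback(sender: &mut ScriptedSender, final_text: &str, replied_flag: &AtomicBool) {
    let delivered = (0..3).any(|_| sender.send(final_text));
    if delivered {
        replied_flag.store(true, Ordering::Relaxed);
    } else {
        let _ = sender.send(BACKUP_NOTICE);
    }
}
```

Scripting `[false, false, true]` covers the success-after-retry branch; `[false, false, false, true]` covers exhaustion followed by a delivered backup notice, with `replied_flag` left false.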


… success

Per CodeRabbit: tokio mpsc::Sender::send only fails when the receiver is
closed, so the 250/1000/2000ms retry loop and the "send 'continue' to replay"
backup notice are dead code (the backup goes through the same closed channel
and would also fail).

The real bug fix in this PR was setting replied_flag = true on successful
fallback delivery — without it, the post-turn check at line ~2354 still logs
"retrigger relay failed" and injects "[background work completed but relay
to user failed]" into history even when the fallback succeeded, which is
what produced the user-visible truncation symptom.

Now: send_outbound_text returns bool, logs a clear warning on failure,
fallback paths set replied_flag on success. Net diff: +13/-77.
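
To illustrate why the flag matters, here is a rough model of the post-turn check described above. The function name `post_turn_check` and the `Vec<String>` history are assumptions for illustration; only the log line and the injected note are quoted from this PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

const RELAY_FAILED_NOTE: &str = "[background work completed but relay to user failed]";

/// Hypothetical post-turn check: when nothing was marked as replied during a
/// retrigger turn, log the failure and record it in history for the next turn.
fn post_turn_check(replied_flag: &AtomicBool, history: &mut Vec<String>) {
    if !replied_flag.load(Ordering::Relaxed) {
        eprintln!("retrigger relay failed, preserving result in history for next turn");
        history.push(RELAY_FAILED_NOTE.to_string());
    }
}
```

If a fallback send succeeds but never sets the flag, this check still takes the failure branch, which matches the symptom described in the commit message.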