fix: add SafeGo panic recovery and fix unbuffered channel goroutine deadlock #4862
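This commit includes the following fixes:

**Goroutine panic recovery**
- common/gopool_safe.go: adds a SafeGo function that wraps gopool.Go with automatic panic recovery. A panic in any fire-and-forget goroutine no longer crashes the process; it is recorded as an error log via SysLog instead.
- Replaces every gopool.Go() call in 19 files with common.SafeGo(), so that all asynchronous goroutines have panic protection.

**Unbuffered channel deadlock fix**
- relay/channel/cohere/relay-cohere.go: makes dataChan a buffered channel (capacity 10) and adds a stopChan exit mechanism for the scanner goroutine. When the SSE stream consumer (c.Stream) exits early, the goroutine no longer blocks forever.
- relay/channel/zhipu/relay-zhipu.go: makes dataChan and metaChan buffered channels (capacity 10), preventing a goroutine deadlock after the client disconnects.
- relay/channel/palm/relay-palm.go: makes dataChan a buffered channel (capacity 10).

**Import cleanup**
- Removes the gopool imports that are no longer used in 19 files.

Co-Authored-By: deepseek-v4-pro[1m] <deepseek-ai@claude-code-best.win>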
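The panic-recovery wrapper described in the commit message might look roughly like the sketch below. It is not the code from common/gopool_safe.go: `safeGo` and `reportPanic` are hypothetical names, the sketch starts a plain goroutine instead of going through gopool, and it logs with the standard `log` package where the PR uses SysLog.

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// reportPanic receives every recovered panic value. The PR logs through
// SysLog; this sketch (an assumption) falls back to the standard logger.
var reportPanic = func(r any) {
	log.Printf("goroutine panic recovered: %v\n%s", r, debug.Stack())
}

// safeGo is a hypothetical stand-in for common.SafeGo. It runs f in a new
// goroutine and recovers any panic so it cannot crash the whole process.
func safeGo(f func()) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				reportPanic(r)
			}
		}()
		f()
	}()
}

func main() {
	// Route recovered panics to a channel so main can wait for the result.
	recovered := make(chan any, 1)
	reportPanic = func(r any) { recovered <- r }

	safeGo(func() { panic("boom") })

	fmt.Println("recovered:", <-recovered)
	fmt.Println("process still running")
}
```

Note that recovery only keeps the process alive. The goroutine that panicked still ends, so a long-running loop started this way stops unless it recovers per iteration.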
Walkthrough
This PR removes the external
Changes
Goroutine concurrency refactor: gopool → common.SafeGo
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
model/utils.go (1)
33-38: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Keep the batch updater alive after a panic.
A panic inside `batchUpdate()` is recovered by `common.SafeGo`, but then this goroutine exits permanently. That can silently stop all future batch flushes.
💡 Proposed fix
```diff
 func InitBatchUpdater() {
 	common.SafeGo(func() {
 		for {
-			time.Sleep(time.Duration(common.BatchUpdateInterval) * time.Second)
-			batchUpdate()
+			func() {
+				defer func() {
+					if r := recover(); r != nil {
+						common.SysError("batch updater recovered from panic")
+					}
+				}()
+				time.Sleep(time.Duration(common.BatchUpdateInterval) * time.Second)
+				batchUpdate()
+			}()
 		}
 	})
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@model/utils.go` around lines 33 - 38, A panic inside batchUpdate called from the goroutine spawned by common.SafeGo causes that goroutine to exit and stop future flushes; to fix, ensure the for loop continues after a panic by wrapping each batchUpdate invocation in a per-iteration recover block (e.g., call batchUpdate inside an anonymous func with a defer that recovers and logs the panic) rather than letting the panic escape the loop; keep the outer common.SafeGo and the sleep using common.BatchUpdateInterval, and reference the functions/idents batchUpdate, common.SafeGo, and common.BatchUpdateInterval when making the change.
relay/channel/zhipu/relay-zhipu.go (1)
164-183: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Potential goroutine leak: blocking `stopChan` send.
At line 183, the goroutine sends to the unbuffered `stopChan`. If the `c.Stream` consumer exits early (e.g., client disconnect or context cancellation), the send blocks indefinitely, leaking the goroutine.
Cohere solved this with a non-blocking send. Consider applying the same pattern here.
Proposed fix: non-blocking stopChan send
```diff
-	stopChan <- true
+	select {
+	case stopChan <- true:
+	default:
+	}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@relay/channel/zhipu/relay-zhipu.go` around lines 164 - 183, The goroutine can block when sending to the unbuffered stopChan causing a leak; change the send to be non-blocking (or make stopChan buffered) so the goroutine can exit without waiting for a receiver. Locate stopChan declaration and the anonymous goroutine that reads from scanner (referenced by stopChan, scanner, dataChan, metaChan) and either initialize stopChan := make(chan bool, 1) or replace the blocking send stopChan <- true with a non-blocking select { case stopChan <- true: default: } so the send never blocks if the consumer has exited.
relay/channel/palm/relay-palm.go (1)
58-87: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Potential goroutine leak: blocking `stopChan` sends.
The goroutine sends to the unbuffered `stopChan` at lines 63, 71, 83, and 87. If the `c.Stream` consumer exits early (e.g., client disconnect or context cancellation), the send will block indefinitely, leaking the goroutine.
Cohere solved this with a non-blocking send (lines 113-115 in `relay-cohere.go`). Consider the same pattern here.
Proposed fix: non-blocking stopChan send
Replace blocking sends with:
```diff
-	stopChan <- true
+	select {
+	case stopChan <- true:
+	default:
+	}
```
Apply this pattern at lines 63, 71, 83, and 87.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@relay/channel/palm/relay-palm.go` around lines 58 - 87, The goroutine uses an unbuffered stopChan and performs blocking sends (stopChan) in the anonymous goroutine, which can leak if the c.Stream consumer exits; change the send strategy to avoid blocking by either making stopChan buffered (size 1) or using a non-blocking send via select { case stopChan <- true: default: } at each send site (the sends after io.ReadAll error, json.Unmarshal error, json.Marshal error, and the final send), updating references in this block that interact with resp, palmResponse, fullTextResponse, and streamResponsePaLM2OpenAI so the goroutine never blocks waiting on stopChan.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@model/utils.go`:
- Around line 33-38: A panic inside batchUpdate called from the goroutine
spawned by common.SafeGo causes that goroutine to exit and stop future flushes;
to fix, ensure the for loop continues after a panic by wrapping each batchUpdate
invocation in a per-iteration recover block (e.g., call batchUpdate inside an
anonymous func with a defer that recovers and logs the panic) rather than
letting the panic escape the loop; keep the outer common.SafeGo and the sleep
using common.BatchUpdateInterval, and reference the functions/idents
batchUpdate, common.SafeGo, and common.BatchUpdateInterval when making the
change.
In `@relay/channel/palm/relay-palm.go`:
- Around line 58-87: The goroutine uses an unbuffered stopChan and performs
blocking sends (stopChan) in the anonymous goroutine, which can leak if the
c.Stream consumer exits; change the send strategy to avoid blocking by either
making stopChan buffered (size 1) or using a non-blocking send via select { case
stopChan <- true: default: } at each send site (the sends after io.ReadAll
error, json.Unmarshal error, json.Marshal error, and the final send), updating
references in this block that interact with resp, palmResponse,
fullTextResponse, and streamResponsePaLM2OpenAI so the goroutine never blocks
waiting on stopChan.
In `@relay/channel/zhipu/relay-zhipu.go`:
- Around line 164-183: The goroutine can block when sending to the unbuffered
stopChan causing a leak; change the send to be non-blocking (or make stopChan
buffered) so the goroutine can exit without waiting for a receiver. Locate
stopChan declaration and the anonymous goroutine that reads from scanner
(referenced by stopChan, scanner, dataChan, metaChan) and either initialize
stopChan := make(chan bool, 1) or replace the blocking send stopChan <- true
with a non-blocking select { case stopChan <- true: default: } so the send never
blocks if the consumer has exited.
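The stop-signal fix requested in the comments above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the relay code: `produce` and the `finished` channel are hypothetical, and the sketch combines both suggested options (a `stopChan` of capacity 1 plus a `select` with `default`).

```go
package main

import "fmt"

// produce is a hypothetical producer. It queues n items on a buffered data
// channel and then signals completion on stopChan. finished is closed when
// the goroutine returns, so a caller can observe that it did not leak.
// n must not exceed the data buffer (10), or the data sends themselves
// would block when nobody is reading.
func produce(n int) (dataChan chan int, stopChan chan bool, finished chan struct{}) {
	dataChan = make(chan int, 10)
	stopChan = make(chan bool, 1) // capacity 1: one stop signal always fits
	finished = make(chan struct{})
	go func() {
		defer close(finished)
		for i := 0; i < n; i++ {
			dataChan <- i
		}
		select {
		case stopChan <- true:
		default: // signal cannot be delivered; drop it instead of blocking
		}
	}()
	return dataChan, stopChan, finished
}

func main() {
	// Case 1: the consumer never reads stopChan, as if the client disconnected.
	_, _, finished := produce(3)
	<-finished
	fmt.Println("producer exited without a receiver on stopChan")

	// Case 2: the consumer reads the stop signal later; it was not lost.
	dataChan, stopChan, _ := produce(3)
	<-stopChan
	fmt.Println("items still buffered:", len(dataChan))
}
```

One design point to weigh: with an unbuffered `stopChan`, a non-blocking send drops the signal whenever the receiver is not waiting at that exact moment, so a consumer that is still busy with data could miss the stop. Giving the channel capacity 1 avoids both the leak and the lost signal.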
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c742db48-17b9-4548-b184-67e943cd067a
📒 Files selected for processing (23)
- common/gopool.go
- controller/channel-test.go
- controller/relay.go
- logger/logger.go
- main.go
- model/log.go
- model/token.go
- model/user.go
- model/user_cache.go
- model/utils.go
- relay/channel/api_request.go
- relay/channel/cohere/relay-cohere.go
- relay/channel/openai/relay-openai.go
- relay/channel/palm/relay-palm.go
- relay/channel/zhipu/relay-zhipu.go
- relay/helper/stream_scanner.go
- service/billing_session.go
- service/codex_credential_refresh_task.go
- service/notify-limit.go
- service/pre_consume_quota.go
- service/quota.go
- service/subscription_reset_task.go
- service/text_quota.go
💤 Files with no reviewable changes (1)
- common/gopool.go
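Summary
This PR fixes the risk of a goroutine panic crashing the process, and the goroutine deadlocks caused by unbuffered channels in streaming responses.
Goroutine panic recovery
- `common/gopool_safe.go`: adds a `SafeGo()` function that wraps `gopool.Go` with automatic panic recovery. A panic in any fire-and-forget goroutine no longer crashes the process; it is recorded as an error log via `SysLog` instead.
- All `gopool.Go()` calls are replaced with `common.SafeGo()`.
- Unused `gopool` imports are removed.
Unbuffered channel deadlock fix
- `relay/channel/cohere/relay-cohere.go`: makes `dataChan` a buffered channel (capacity 10) and adds a `stopChan` exit mechanism for the scanner goroutine. When the SSE stream consumer (`c.Stream`) exits early, the goroutine no longer blocks forever.
- `relay/channel/zhipu/relay-zhipu.go`: makes `dataChan` and `metaChan` buffered channels (capacity 10).
- `relay/channel/palm/relay-palm.go`: makes `dataChan` a buffered channel (capacity 10).
Changed files
23 files, +54 lines, -67 lines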
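Testing
- `SafeGo` has no effect on goroutines that already recover from panics (the inner recover catches the panic first)
- The `gopool` import removal only affects files that no longer use `gopool` directly
🤖 Generated with Claude Code Best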
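The testing note about an inner recover catching the panic first can be checked with a small sketch. It runs synchronously to keep the result deterministic, and `runBatches` is a hypothetical function whose outer deferred recover only plays the role of the `SafeGo` wrapper.

```go
package main

import "fmt"

// runBatches calls step n times. Each call has its own recover, so a panic
// in one iteration does not end the loop. The outer deferred recover stands
// in for a SafeGo-style wrapper and records whether a panic ever reached it.
func runBatches(n int, step func(i int)) (recovered int, outerSaw bool) {
	defer func() {
		if r := recover(); r != nil {
			outerSaw = true // only reached if the inner recover missed a panic
		}
	}()
	for i := 0; i < n; i++ {
		func() {
			defer func() {
				if r := recover(); r != nil {
					recovered++ // inner recover handles the panic first
				}
			}()
			step(i)
		}()
	}
	return recovered, outerSaw
}

func main() {
	completed := 0
	recovered, outerSaw := runBatches(5, func(i int) {
		if i == 2 {
			panic("bad batch")
		}
		completed++
	})
	fmt.Println(completed, recovered, outerSaw) // prints "4 1 false"
}
```

The same shape is what the review asks for in the batch updater: the loop body recovers per iteration, so one failed batch does not stop later ones, and the outer wrapper never sees the panic.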