Skip to content

fix: 添加 SafeGo panic 恢复并修复无缓冲 channel goroutine 死锁#4862

Open
Evsdrg wants to merge 1 commit into
QuantumNous:mainfrom
Evsdrg:fix/goroutine-panic-channel-leaks
Open

fix: 添加 SafeGo panic 恢复并修复无缓冲 channel goroutine 死锁#4862
Evsdrg wants to merge 1 commit into
QuantumNous:mainfrom
Evsdrg:fix/goroutine-panic-channel-leaks

Conversation

@Evsdrg
Copy link
Copy Markdown

@Evsdrg Evsdrg commented May 14, 2026

摘要

本次 PR 修复了 goroutine panic 导致进程崩溃的风险,以及流式响应中无缓冲 channel 导致的 goroutine 死锁问题。

goroutine panic 恢复

  • common/gopool_safe.go:新增 SafeGo() 函数,在 gopool.Go 基础上自动添加 panic 恢复。所有 fire-and-forget goroutine 中的 panic 不再导致进程崩溃,而是通过 SysLog 记录错误日志。
  • 19 个文件中的 gopool.Go() 调用全部替换为 common.SafeGo()
  • 清理了不再使用的 gopool 导包。

无缓冲 channel 死锁修复

  • relay/channel/cohere/relay-cohere.go:将 dataChan 改为缓冲 channel(10),为 scanner goroutine 添加 stopChan 退出机制。当 SSE 流消费者(c.Stream)提前退出时,goroutine 不再永久阻塞。
  • relay/channel/zhipu/relay-zhipu.go:将 dataChanmetaChan 改为缓冲 channel(10)。
  • relay/channel/palm/relay-palm.go:将 dataChan 改为缓冲 channel(10)。

变更文件

23 个文件,+54 行,-67 行

测试

  • SafeGo 对已有 panic 恢复的 goroutine 无影响(内层 recover 先捕获)
  • 缓冲 channel 大小 10 足以容纳 SSE 流帧,不影响正常流程
  • 删除 gopool 导包仅对不再直接使用 gopool 的文件生效

🤖 Generated with Claude Code Best

Summary by CodeRabbit

  • Refactor
    • Improved internal goroutine execution management across background tasks including logging, caching, relay operations, and billing processes to enhance system stability and error resilience.

Review Change Stack

本次提交包含以下修复:

**goroutine panic 恢复**
- common/gopool_safe.go: 新增 SafeGo 函数,在 gopool.Go 基础上自动添加
  panic 恢复。所有 fire-and-forget goroutine 中的 panic 不再导致进程崩溃,
  而是通过 SysLog 记录错误日志。
- 将 19 个文件中的 gopool.Go() 调用全部替换为 common.SafeGo(),
  确保所有异步 goroutine 都有 panic 保护。

**无缓冲 channel 死锁修复**
- relay/channel/cohere/relay-cohere.go: 将 dataChan 改为缓冲 channel(10),
  并为 scanner goroutine 添加 stopChan 退出机制。当 SSE 流消费者
  (c.Stream) 提前退出时,goroutine 不再永久阻塞。
- relay/channel/zhipu/relay-zhipu.go: 将 dataChan 和 metaChan 改为
  缓冲 channel(10),防止客户端断开后 goroutine 死锁。
- relay/channel/palm/relay-palm.go: 将 dataChan 改为缓冲 channel(10)。

**Import 清理**
- 移除 19 个文件中不再使用的 gopool 导入。

Co-Authored-By: deepseek-v4-pro[1m] <deepseek-ai@claude-code-best.win>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

Walkthrough

This PR removes the external github.com/bytedance/gopkg/util/gopool dependency and replaces all gopool.Go(...) calls throughout the codebase with a local common.SafeGo(...) wrapper. It also buffers streaming channels in relay handlers to reduce producer blocking.

Changes

Goroutine concurrency refactor: gopool → common.SafeGo

Layer / File(s) Summary
Main entry point and controller goroutine migration
main.go, controller/channel-test.go
Replace gopool.Go with common.SafeGo in main.go startup routines for Midjourney bulk update and pprof HTTP listener; update controller channel test background execution.
Model layer Redis cache async updates
model/token.go, model/user.go, model/user_cache.go, model/utils.go, model/log.go
Migrate deferred async Redis cache operations (DB-read fills, cache sets/deletes, quota mutations) from gopool.Go to common.SafeGo across all model layer files.
Service-layer async quota and relay sampling
controller/relay.go, service/quota.go, service/text_quota.go, service/pre_consume_quota.go, service/billing_session.go
Replace gopool.Go with common.SafeGo for quota notifications, relay sample recording, channel auto-disable, and refund scheduling across controller and service modules.
Service background task scheduling
logger/logger.go, service/codex_credential_refresh_task.go, service/subscription_reset_task.go, service/notify-limit.go
Switch background task startup (credential refresh, subscription reset, notification cleanup, log rotation) from gopool.Go to common.SafeGo.
Relay streaming and API request goroutine management
relay/channel/api_request.go, relay/channel/openai/relay-openai.go, relay/helper/stream_scanner.go
Replace gopool.Go with common.SafeGo in relay channel API request ping keep-alive, OpenAI realtime websocket readers, and stream scanner goroutine coordination.
Streaming channel buffering optimization
relay/channel/cohere/relay-cohere.go, relay/channel/palm/relay-palm.go, relay/channel/zhipu/relay-zhipu.go, common/gopool.go
Buffer streaming channels (dataChan, metaChan capacity 10), add stopChan select logic in Cohere handler to reduce producer blocking during streaming and shutdown; remove gopool dependency from common/gopool.go.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • QuantumNous/new-api#2877: Both PRs touch the billing refund async execution path; main PR swaps gopool.Go to common.SafeGo in service/billing_session.go.
  • QuantumNous/new-api#1775: Both PRs modify controller/relay.go's async channel auto-disable path around service.DisableChannel.

Suggested reviewers

  • creamlike1024

Poem

🐰 A rabbit hops through code so neat,
Swapping pools for safety's beat,
From Bytedance to common's care,
Goroutines now buffered, fair! 🎀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main changes: adding SafeGo panic recovery and fixing unbuffered channel goroutine deadlocks.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
model/utils.go (1)

33-38: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the batch updater alive after a panic.

A panic inside batchUpdate() is recovered by common.SafeGo, but then this goroutine exits permanently. That can silently stop all future batch flushes.

💡 Proposed fix
 func InitBatchUpdater() {
 	common.SafeGo(func() {
 		for {
-			time.Sleep(time.Duration(common.BatchUpdateInterval) * time.Second)
-			batchUpdate()
+			func() {
+				defer func() {
+					if r := recover(); r != nil {
+						common.SysError("batch updater recovered from panic")
+					}
+				}()
+				time.Sleep(time.Duration(common.BatchUpdateInterval) * time.Second)
+				batchUpdate()
+			}()
 		}
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model/utils.go` around lines 33 - 38, A panic inside batchUpdate called from
the goroutine spawned by common.SafeGo causes that goroutine to exit and stop
future flushes; to fix, ensure the for loop continues after a panic by wrapping
each batchUpdate invocation in a per-iteration recover block (e.g., call
batchUpdate inside an anonymous func with a defer that recovers and logs the
panic) rather than letting the panic escape the loop; keep the outer
common.SafeGo and the sleep using common.BatchUpdateInterval, and reference the
functions/idents batchUpdate, common.SafeGo, and common.BatchUpdateInterval when
making the change.
relay/channel/zhipu/relay-zhipu.go (1)

164-183: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Potential goroutine leak: blocking stopChan send.

At line 183, the goroutine sends to the unbuffered stopChan. If the c.Stream consumer exits early (e.g., client disconnect or context cancellation), the send blocks indefinitely, leaking the goroutine.

Cohere solved this with a non-blocking send. Consider applying the same pattern here.

Proposed fix: non-blocking stopChan send
-	stopChan <- true
+	select {
+	case stopChan <- true:
+	default:
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@relay/channel/zhipu/relay-zhipu.go` around lines 164 - 183, The goroutine can
block when sending to the unbuffered stopChan causing a leak; change the send to
be non-blocking (or make stopChan buffered) so the goroutine can exit without
waiting for a receiver. Locate stopChan declaration and the anonymous goroutine
that reads from scanner (referenced by stopChan, scanner, dataChan, metaChan)
and either initialize stopChan := make(chan bool, 1) or replace the blocking
send stopChan <- true with a non-blocking select { case stopChan <- true:
default: } so the send never blocks if the consumer has exited.
relay/channel/palm/relay-palm.go (1)

58-87: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Potential goroutine leak: blocking stopChan sends.

The goroutine sends to the unbuffered stopChan at lines 63, 71, 83, and 87. If the c.Stream consumer exits early (e.g., client disconnect or context cancellation), the send will block indefinitely, leaking the goroutine.

Cohere solved this with a non-blocking send (lines 113-115 in relay-cohere.go). Consider the same pattern here.

Proposed fix: non-blocking stopChan send

Replace blocking sends with:

-		stopChan <- true
+		select {
+		case stopChan <- true:
+		default:
+		}

Apply this pattern at lines 63, 71, 83, and 87.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@relay/channel/palm/relay-palm.go` around lines 58 - 87, The goroutine uses an
unbuffered stopChan and performs blocking sends (stopChan) in the anonymous
goroutine, which can leak if the c.Stream consumer exits; change the send
strategy to avoid blocking by either making stopChan buffered (size 1) or using
a non-blocking send via select { case stopChan <- true: default: } at each send
site (the sends after io.ReadAll error, json.Unmarshal error, json.Marshal
error, and the final send), updating references in this block that interact with
resp, palmResponse, fullTextResponse, and streamResponsePaLM2OpenAI so the
goroutine never blocks waiting on stopChan.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@model/utils.go`:
- Around line 33-38: A panic inside batchUpdate called from the goroutine
spawned by common.SafeGo causes that goroutine to exit and stop future flushes;
to fix, ensure the for loop continues after a panic by wrapping each batchUpdate
invocation in a per-iteration recover block (e.g., call batchUpdate inside an
anonymous func with a defer that recovers and logs the panic) rather than
letting the panic escape the loop; keep the outer common.SafeGo and the sleep
using common.BatchUpdateInterval, and reference the functions/idents
batchUpdate, common.SafeGo, and common.BatchUpdateInterval when making the
change.

In `@relay/channel/palm/relay-palm.go`:
- Around line 58-87: The goroutine uses an unbuffered stopChan and performs
blocking sends (stopChan) in the anonymous goroutine, which can leak if the
c.Stream consumer exits; change the send strategy to avoid blocking by either
making stopChan buffered (size 1) or using a non-blocking send via select { case
stopChan <- true: default: } at each send site (the sends after io.ReadAll
error, json.Unmarshal error, json.Marshal error, and the final send), updating
references in this block that interact with resp, palmResponse,
fullTextResponse, and streamResponsePaLM2OpenAI so the goroutine never blocks
waiting on stopChan.

In `@relay/channel/zhipu/relay-zhipu.go`:
- Around line 164-183: The goroutine can block when sending to the unbuffered
stopChan causing a leak; change the send to be non-blocking (or make stopChan
buffered) so the goroutine can exit without waiting for a receiver. Locate
stopChan declaration and the anonymous goroutine that reads from scanner
(referenced by stopChan, scanner, dataChan, metaChan) and either initialize
stopChan := make(chan bool, 1) or replace the blocking send stopChan <- true
with a non-blocking select { case stopChan <- true: default: } so the send never
blocks if the consumer has exited.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c742db48-17b9-4548-b184-67e943cd067a

📥 Commits

Reviewing files that changed from the base of the PR and between 18282e6 and 244bddf.

📒 Files selected for processing (23)
  • common/gopool.go
  • controller/channel-test.go
  • controller/relay.go
  • logger/logger.go
  • main.go
  • model/log.go
  • model/token.go
  • model/user.go
  • model/user_cache.go
  • model/utils.go
  • relay/channel/api_request.go
  • relay/channel/cohere/relay-cohere.go
  • relay/channel/openai/relay-openai.go
  • relay/channel/palm/relay-palm.go
  • relay/channel/zhipu/relay-zhipu.go
  • relay/helper/stream_scanner.go
  • service/billing_session.go
  • service/codex_credential_refresh_task.go
  • service/notify-limit.go
  • service/pre_consume_quota.go
  • service/quota.go
  • service/subscription_reset_task.go
  • service/text_quota.go
💤 Files with no reviewable changes (1)
  • common/gopool.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant