fix: prevent document processing cleanup from using expired context #1348
Open
HelloWeit wants to merge 1 commit into
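Background
Document processing tasks can fail during the embedding/indexing stage when either the task itself or an upstream service times out. Previously, the `document:process` task relied on Asynq's default 30-minute timeout, so for large files it could hit `context deadline exceeded` before processing had finished. The timeout configuration itself was addressed in the previous PR, "fix: extend document processing timeout for large files". This PR focuses on the cleanup logic that runs after a failure: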
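After a failure, `processChunks` tries to mark the knowledge as `failed` and clean up its chunks/index. Currently this failure cleanup keeps using the original task `ctx`. When the failure is caused by `context deadline exceeded`, that `ctx` has already been canceled, so the subsequent status update and cleanup fail as well. The knowledge is left in `processing`, and in the frontend it looks as if the task never finishes.
Changes
… `parse_status=failed` and `error_message`.
Testing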
Ran:
- `gofmt -w internal/application/service/knowledge_process.go`
- `go test ./internal/application/service/... ./internal/router/...`
Notes:
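- The code directly related to this change has been formatted with gofmt.
- If external dependencies such as Elasticsearch or OSS are not running locally, some existing tests may fail; those failures are unrelated to this change.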
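Manual verification:
1. Temporarily lower the document processing timeout and use a large file or a slow embedding service to make `BatchIndex` time out.
2. After the timeout, confirm that the knowledge moves from `processing` to `failed`.
3. Confirm that the failure cleanup uses the new cleanup context, so the status update and the chunks/index cleanup no longer fail immediately because the original task `ctx` has been canceled.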
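The change is confined to the failure branch of `processChunks` in `knowledge_process.go`; it requires no frontend changes and no queue configuration changes. This PR complements the earlier timeout configuration PR: that one reduces the likelihood of large files being cut off at 30 minutes, while this one makes a best effort to persist the failed status and complete the cleanup even when a timeout failure does occur.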