feat(ingest+parser): large-file routing — doc_large queue + hybrid parsers for PDF/Word/PPT by toy0116 · Pull Request #1232 · Tencent/WeKnora

toy0116 · 2026-05-09T05:33:17Z

Problem

Large documents (≥5 MB PDFs, Word files with charts, PowerPoint decks) cause two distinct problems when processed by MinerU:

Queue starvation: A 100-page PDF occupies a MinerU worker for 10–40 min, blocking all subsequent uploads
Image loss: PDFParser uses a chain-of-responsibility pattern that stops as soon as MarkItDown succeeds. MarkItDown's PDF path (pdfminer.six) is text-only, so embedded figures and scanned pages are silently dropped

Changes

Queue routing (`internal/`)

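Add doc_large asynq queue (weight 1) alongside default (weight 2) so small files are dequeued ahead of large ones. By default asynq treats weights as relative polling frequencies; waiting until default is empty requires StrictPriority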
docProcessQueue(fileSize int64): files ≥5 MB → doc_large, otherwise → default
Applied at all three enqueue sites: upload, reparse, clone/move
Concurrency: 16 to fully utilize I/O-bound LLM/embedding calls
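
The queue choice above can be paraphrased in a few lines. This is a Python sketch of the Go helper docProcessQueue, not the PR's code: the byte value used for "5 MB" and the `polling_share` helper are assumptions added for illustration. The share calculation only applies if the server runs asynq's default (non-strict) weighted scheduling.

```python
# Python paraphrase of the Go helper docProcessQueue (illustrative only).
DOC_LARGE_FILE_THRESHOLD = 5 * 1024 * 1024  # assumed: "5 MB" read as binary megabytes

# Queue weights as stated in the PR description.
QUEUE_WEIGHTS = {"default": 2, "doc_large": 1}


def doc_process_queue(file_size: int) -> str:
    """Return the queue name for a document:process task of the given size."""
    return "doc_large" if file_size >= DOC_LARGE_FILE_THRESHOLD else "default"


def polling_share(queue: str) -> float:
    """Hypothetical helper: fraction of polls a queue gets under weighted scheduling."""
    return QUEUE_WEIGHTS[queue] / sum(QUEUE_WEIGHTS.values())
```

Under these assumptions a file of exactly 5 MB goes to doc_large, and default receives about two thirds of the polls while both queues are non-empty.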

MinerU bypass routing (`knowledge_process.go`)

Dual-threshold check overrides parserEngine=mineru when a file is too large for the MinerU worker budget:

| Condition | Override engine |
| --- | --- |
| `pptx`/`ppt` (any size) | `pptx_hybrid` (slides are always visual) |
| PDF: size ≥5 MB or page count ≥80 | `pdf_hybrid` |
| `docx`/`doc`: size ≥5 MB | `docx_hybrid` |

Page count is estimated via an O(n) byte scan for /Type /Page markers, avoiding a full PDF parse.
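
A minimal sketch of the page-count scan and the override table, written in Python for illustration (the PR implements both in Go inside convert()). The regular expression, the exclusion of /Type /Pages nodes, the byte value of the 5 MB threshold, and the function names are assumptions. A raw byte scan of this kind can undercount when page dictionaries are stored in compressed object streams, so the result should be read as an estimate.

```python
import re

# Matches "/Type /Page" but not "/Type /Pages" (the page-tree node). Assumed pattern.
_PAGE_MARKER = re.compile(rb"/Type\s*/Page(?![A-Za-z])")

SIZE_THRESHOLD = 5 * 1024 * 1024  # assumed: "5 MB" read as binary megabytes
PAGE_THRESHOLD = 80               # docLargePageThreshold from the PR


def pdf_estimate_page_count(data: bytes) -> int:
    """Estimate the page count by counting page markers in the raw bytes."""
    return len(_PAGE_MARKER.findall(data))


def override_engine(engine: str, ext: str, size: int, data: bytes = b"") -> str:
    """Return the engine to use, applying the overrides only when engine is mineru."""
    if engine != "mineru":
        return engine
    ext = ext.lower().lstrip(".")
    if ext in ("pptx", "ppt"):
        return "pptx_hybrid"  # any size: slides are always visual
    if ext == "pdf":
        if size >= SIZE_THRESHOLD or pdf_estimate_page_count(data) >= PAGE_THRESHOLD:
            return "pdf_hybrid"
    if ext in ("docx", "doc") and size >= SIZE_THRESHOLD:
        return "docx_hybrid"
    return engine
```

With this sketch, a 3 MB PDF whose bytes contain 150 page markers is routed to pdf_hybrid by the page-count branch alone, which mirrors the case listed under Testing.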

Hybrid parsers (`docreader/parser/`)

PDFHybridParser (new, registered as pdf_hybrid):

Fast O(n) byte scan for /Subtype /Image decides text-only vs hybrid path per file
Text-only PDFs: MarkItDown only — fast, no page rendering
Image-rich PDFs: MarkItDown + PDFScannedParser run concurrently via ThreadPoolExecutor; merged result preserves both text structure and per-page JPEG renders for VLM
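
The two paths can be sketched as follows. This is an illustration under stated assumptions, not the PR's PDFHybridParser: `extract_text` and `render_pages` are hypothetical callables standing in for MarkItDown and PDFScannedParser, and the merge format (text followed by one image reference per page) is assumed.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

_IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image")


def pdf_has_images(data: bytes) -> bool:
    """Heuristic: True if the raw bytes contain an image XObject marker."""
    return _IMAGE_MARKER.search(data) is not None


def parse_pdf_hybrid(
    data: bytes,
    extract_text: Callable[[bytes], str],             # stand-in for MarkItDown
    render_pages: Callable[[bytes], Sequence[str]],   # stand-in for PDFScannedParser
) -> str:
    """Return markdown: text only, or text plus page-image references."""
    if not pdf_has_images(data):
        # Text-only fast path: no page rendering, no image references.
        return extract_text(data)
    # Hybrid path: both passes read the same bytes and run side by side.
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(extract_text, data)
        pages_future = pool.submit(render_pages, data)
        text = text_future.result()
        page_refs = pages_future.result()
    image_lines = [f"![page {i}]({ref})" for i, ref in enumerate(page_refs, start=1)]
    return "\n\n".join([text, *image_lines])
```

Passing the two passes in as callables keeps the sketch runnable without the real parsers; in the PR the image references are what trigger the downstream VLM pipeline.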

office_hybrid_parser.py (new file):

DocxHybridParser and PptxHybridParser share _OfficeHybridParser base
MarkItDown extracts text structure concurrently with LibreOffice headless conversion to PDF
PDFScannedParser renders every page/slide as JPEG for the VLM pipeline
Availability gated on LibreOffice (soffice) being present
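
One way the LibreOffice gate and the headless conversion could look is sketched below. The PR names find_soffice and says it scans PATH plus known install paths; the specific fallback paths and the `build_convert_command` helper here are assumptions. The command-line flags are LibreOffice's standard headless conversion options.

```python
import shutil
from pathlib import Path
from typing import List, Optional

# Assumed fallback locations; the PR only says "known install paths".
_KNOWN_SOFFICE_PATHS = (
    "/usr/bin/soffice",
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",
)


def find_soffice() -> Optional[str]:
    """Return the path to the soffice binary, or None if LibreOffice is absent."""
    found = shutil.which("soffice")
    if found:
        return found
    for candidate in _KNOWN_SOFFICE_PATHS:
        if Path(candidate).is_file():
            return candidate
    return None


def build_convert_command(soffice: str, src: str, out_dir: str) -> List[str]:
    """Hypothetical helper: argv for converting an office file to PDF headlessly."""
    return [soffice, "--headless", "--convert-to", "pdf", "--outdir", out_dir, src]
```

A parser would report itself unavailable when find_soffice() returns None, and otherwise pass the command to subprocess.run before handing the resulting PDF to the page renderer.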

registry.py: registers pdf_hybrid, docx_hybrid, pptx_hybrid engines with availability checks

engine_registry.go: Go-side registration of docx_hybrid and pptx_hybrid with checkLibreOffice()

Why not always use MinerU?

MinerU uses subprocess.Popen isolation with MAX_CONCURRENT=1. Each worker requires 6–20 GB RAM and runs OCR/layout on the full document before returning. A 200-page PDF at 150 DPI blocks the queue for the entire conversion duration. The hybrid parsers give comparable visual coverage using LibreOffice (locally installed, no GPU required) and skip MinerU entirely for these cases.

Testing

Verified queue routing with files above/below 5 MB threshold
Verified page-count threshold with a 3 MB / 150-page text PDF correctly routes to pdf_hybrid
PptxHybridParser tested with a real 987 KB Robustel product deck (pptx): LibreOffice conversion + page renders succeeded
go build ./... passes

MinerU runs with max_concurrent=1 via subprocess isolation. Files > 5 MB cause two problems:

1. They block the queue for all subsequent files (serial processing).
2. They can OOM or hang the MinerU subprocess indefinitely, leaving the calling document task stuck in "processing" with no further retry.

Evidence from the 2026-05-09 00:55 batch:

- milesight-iot-brochure-cloud.pdf, 20 MB → hung
- milesight-iot-brochure-collection-en, 13 MB → hung
- milesight-iot-lorawan-series-cat., 11 MB → hung
- ur35-user-guide-en.pdf, 8 MB → hung

Small files queued behind them waited 72+ min before first attempt.

Fix: in convert(), if the selected engine is "mineru" and the file is > 5 MB, downgrade to "builtin". Builtin is always available, handles large PDFs reliably, and does not block a shared queue. Files under the threshold keep using MinerU for better OCR quality.

Threshold constant: mineruLargeFileThreshold = 5 MB (easily adjustable).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…c_large queue

Adds a "doc_large" asynq queue (weight 1) alongside the existing "default" queue (weight 2) for document:process tasks. Files at or above 5 MB are routed to "doc_large" so small files always drain their higher-weight queue first, preventing a handful of large PDFs from blocking dozens of small docs.

The 5 MB threshold matches the existing mineru→builtin engine override (c3719d5) and is now defined as the package-level docLargeFileThreshold constant, replacing the inline const in convert(). A new docProcessQueue() helper makes the routing decision at every Enqueue call site consistently.

Changed sites:
- knowledge_create.go: file upload path
- knowledge_process.go: file reparse + file-URL reparse
- knowledge_clone_move.go: knowledge move reparse
- router/task.go: NewAsynqServer() queue weight table

URL-based knowledge and passage-text knowledge stay on "default" (no known FileSize at enqueue time or inherently tiny payloads).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ch PDFs

Problem: builtin engine (MarkItDown/pdfminer.six) is text-only. For a text-searchable PDF like a Milesight product manual, FirstParser stops at MarkItDown on the first success and PDFScannedParser never runs, so all embedded figures, wiring diagrams and photos are silently dropped.

Fix: new PDFHybridParser runs both passes unconditionally:

1. MarkitdownParser: extracts text structure (headings, tables, lists)
2. PDFScannedParser: renders every page as JPEG so multimodal VLM sees all visual content (figures, diagrams, photos)

Output markdown = MarkItDown text + page image refs merged; image refs trigger the existing async multimodal VLM pipeline for captioning/OCR.

Registered as engine "pdf_hybrid" in the Python parser registry.

Go routing updated in convert(): large PDF files (≥5 MB, engine=mineru) now override to "pdf_hybrid" instead of "builtin", preserving image coverage. Non-PDF large files (docx, pptx, …) still fall back to "builtin" since MarkItDown handles those formats' embedded images natively.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…g concurrently

MarkitdownParser (pdfminer.six) and PDFScannedParser (pypdfium2 page render) are independent: both read the same input bytes and write to separate outputs. Run them in a ThreadPoolExecutor(max_workers=2) so they execute in parallel, cutting total wall-clock time from step1 + step2 to max(step1, step2).

For a 50-page 10 MB PDF this saves ~2–5 s of text-extraction time that was previously paid as pure serial overhead before page rendering started.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… file

Adds _pdf_has_images() fast heuristic (O(n) byte scan for /Subtype /Image) to decide at runtime whether a large PDF needs page rendering:

- No images detected → text-only fast path: MarkitdownParser only, no page rendering, no VLM tasks generated
- Images detected → hybrid path: MarkitdownParser + PDFScannedParser run concurrently, results merged

This implements the three-tier best practice for PDFs, split at the 5 MB threshold:

1. <5 MB → MinerU (best quality, handles all content)
2. ≥5 MB, text-only → pdf_hybrid text path (fast, lightweight)
3. ≥5 MB, text+images → pdf_hybrid hybrid path (concurrent, full coverage)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Problem: a text-heavy PDF can be many pages yet small in file size (<5 MB), causing MinerU to block for 30-90 min on what looks like a "small" file. File size alone is a poor predictor of MinerU processing time; page count is the direct driver (~1-2 min/10 pages on 8-core M-series with pipeline backend).

Changes:
- Add docLargePageThreshold = 80 (≈10-20 min wall-clock on this machine)
- Add pdfEstimatePageCount(): O(n) byte scan for /Type /Page markers, same heuristic approach as _pdf_has_images in Python, with zero dependencies
- For PDF + engine=mineru: load file bytes early (before engine routing) to count pages, then cache them in cachedFileBytes to avoid a second storage read when building req.FileContent
- Route to pdf_hybrid when FileSize >= 5MB OR page count >= 80
- Log both dimensions in the warning so the trigger is visible in logs

Queue routing (docProcessQueue) is unchanged: page count is not available at enqueue time and file size is a good-enough proxy for queue priority.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rid rendering

Adds two new hybrid parser engines for Word and PowerPoint using the same pattern as PDFHybridParser: MarkItDown text extraction + LibreOffice page rendering run concurrently via ThreadPoolExecutor(max_workers=2). LibreOffice converts docx/pptx→PDF (headless, subprocess), then PDFScannedParser renders each page as a JPEG for the multimodal VLM pipeline.

DocxHybridParser: captures Charts/SmartArt/OLE objects that DocxParser (python-docx inline image extraction) silently drops. Triggered for large Word files (≥5 MB) when engine=mineru.

PptxHybridParser: slides are fundamentally visual; MarkItDown alone extracts text only. Triggered for all pptx/ppt when engine=mineru.

Python registry: registers docx_hybrid + pptx_hybrid with LibreOffice availability check (find_soffice scans PATH + known install paths).

Go engine_registry: docxHybridEngine + pptxHybridEngine with checkLibreOffice() so both engines appear in the UI dropdown with availability status.

Go routing (convert()):
- pptx/ppt + mineru → always pptx_hybrid (visual format regardless of size)
- docx/doc + mineru + ≥5MB → docx_hybrid
- pdf + mineru + (≥5MB or ≥80 pages) → pdf_hybrid (unchanged)

resolveDocReader: pdf_hybrid/docx_hybrid/pptx_hybrid all route to s.documentReader (Python docreader), same as builtin.

Tested:
- PptxHybridParser on 987 KB real deck → 9 slides rendered ✅
- DocxHybridParser on generated docx → text + 1 page ✅
- LibreOffice availability detection ✅

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

toy0116 and others added 7 commits May 9, 2026 13:29

toy0116 wants to merge 7 commits into Tencent:main from toy0116:upstream-pr/large-file-handling
