feat(ingest+parser): large-file routing — doc_large queue + hybrid parsers for PDF/Word/PPT#1232
Open
toy0116 wants to merge 7 commits into
MinerU runs with max_concurrent=1 via subprocess isolation. Files > 5 MB
cause two problems:
1. They block the queue for all subsequent files (serial processing).
2. They can OOM or hang the MinerU subprocess indefinitely, leaving the
calling document task stuck in "processing" with no further retry.
Evidence from 2026-05-09 00:55 batch:
  milesight-iot-brochure-cloud.pdf 20 MB → hung
  milesight-iot-brochure-collection-en 13 MB → hung
  milesight-iot-lorawan-series-cat. 11 MB → hung
  ur35-user-guide-en.pdf 8 MB → hung
Small files queued behind them waited 72+ min before first attempt.
Fix: in convert(), if the selected engine is "mineru" and the file is
> 5 MB, downgrade to "builtin". Builtin is always available, handles large
PDFs reliably, and does not block a shared queue. Files under the threshold
keep using MinerU for better OCR quality.
Threshold constant: mineruLargeFileThreshold = 5 MB (easily adjustable).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…c_large queue
Adds a "doc_large" asynq queue (weight 1) alongside the existing "default"
queue (weight 2) for document:process tasks. Files at or above 5 MB are
routed to "doc_large" so small files always drain their higher-weight queue
first, preventing a handful of large PDFs from blocking dozens of small docs.
The 5 MB threshold matches the existing mineru→builtin engine override
(c3719d5) and is now defined as the package-level docLargeFileThreshold
constant, replacing the inline const in convert(). A new docProcessQueue()
helper makes the routing decision at every Enqueue call site consistently.
Changed sites:
- knowledge_create.go: file upload path
- knowledge_process.go: file reparse + file-URL reparse
- knowledge_clone_move.go: knowledge move reparse
- router/task.go: NewAsynqServer() queue weight table
URL-based knowledge and passage-text knowledge stay on "default" (no known
FileSize at enqueue time or inherently tiny payloads).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
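The routing decision described above is small enough to sketch. The following is a Python paraphrase of the Go helper, for illustration only: the function and constant names are hypothetical renderings of docProcessQueue and docLargeFileThreshold, and 5 MB is assumed to mean 5 × 1024 × 1024 bytes, which the commit does not state.

```python
# Assumption: "5 MB" means 5 MiB; the Go constant's exact value is not shown.
DOC_LARGE_FILE_THRESHOLD = 5 * 1024 * 1024

QUEUE_DEFAULT = "default"      # weight 2 in the commit's queue table
QUEUE_DOC_LARGE = "doc_large"  # weight 1 in the commit's queue table


def doc_process_queue(file_size: int) -> str:
    """Pick the queue name for a document:process task.

    Files at or above the threshold go to the low-weight queue. A size of 0
    (unknown at enqueue time, e.g. URL-based knowledge) stays on "default".
    """
    if file_size >= DOC_LARGE_FILE_THRESHOLD:
        return QUEUE_DOC_LARGE
    return QUEUE_DEFAULT
```

Keeping the decision in one helper means every enqueue site applies the same boundary (inclusive at exactly 5 MB in this sketch).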
…ch PDFs
Problem: builtin engine (MarkItDown/pdfminer.six) is text-only. For a
text-searchable PDF like a Milesight product manual, FirstParser stops at
MarkItDown on the first success and PDFScannedParser never runs, so all
embedded figures, wiring diagrams and photos are silently dropped.
Fix: new PDFHybridParser runs both passes unconditionally:
1. MarkitdownParser — extracts text structure (headings, tables, lists)
2. PDFScannedParser — renders every page as JPEG so multimodal VLM sees
all visual content (figures, diagrams, photos)
Output markdown = MarkItDown text + page image refs merged; image refs
trigger the existing async multimodal VLM pipeline for captioning/OCR.
Registered as engine "pdf_hybrid" in the Python parser registry.
Go routing updated in convert(): large PDF files (≥5 MB, engine=mineru)
now override to "pdf_hybrid" instead of "builtin", preserving image coverage.
Non-PDF large files (docx, pptx, …) still fall back to "builtin" since
MarkItDown handles those formats' embedded images natively.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g concurrently
MarkitdownParser (pdfminer.six) and PDFScannedParser (pypdfium2 page render)
are independent: both read the same input bytes and write to separate
outputs. Run them in a ThreadPoolExecutor(max_workers=2) so they execute in
parallel, cutting total wall-clock time from sum(step1 + step2) to
max(step1, step2).
For a 50-page 10 MB PDF this saves ~2–5 s of text-extraction time that was
previously paid as pure serial overhead before page rendering started.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
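A minimal sketch of the concurrent two-pass pattern follows. It is not the PR's code: run_both and merge_markdown are hypothetical names, and the two passes are injected as plain callables standing in for MarkitdownParser and PDFScannedParser. The markdown image syntax used for page references is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_both(
    content: bytes,
    extract_text: Callable[[bytes], str],
    render_pages: Callable[[bytes], List[str]],
) -> Tuple[str, List[str]]:
    """Run the text pass and the page-render pass in parallel threads."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(extract_text, content)
        pages_future = pool.submit(render_pages, content)
        # result() re-raises any exception from the worker thread.
        return text_future.result(), pages_future.result()


def merge_markdown(text: str, image_refs: List[str]) -> str:
    """Append one image reference per rendered page after the text."""
    if not image_refs:
        return text
    refs = "\n".join(
        f"![page {i}]({ref})" for i, ref in enumerate(image_refs, start=1)
    )
    return f"{text}\n\n{refs}"
```

Threads only overlap to the extent that each pass releases the GIL (native rendering code and I/O typically do), so the max(step1, step2) figure is best read as a lower bound on wall-clock time.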
… file
Adds _pdf_has_images() fast heuristic (O(n) byte scan for /Subtype /Image)
to decide at runtime whether a large PDF needs page rendering:
  No images detected → text-only fast path: MarkitdownParser only,
    no page rendering, no VLM tasks generated
  Images detected → hybrid path: MarkitdownParser + PDFScannedParser
    run concurrently, results merged
This implements a three-tier scheme for PDFs around the 5 MB threshold:
1. <5 MB → MinerU (best quality, handles all content)
2. ≥5 MB, text-only → pdf_hybrid text path (fast, lightweight)
3. ≥5 MB, text+images → pdf_hybrid hybrid path (concurrent, full coverage)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
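The image heuristic from this commit can be approximated as below. This is a sketch of the idea, not the PR's _pdf_has_images(): the regular expression and the choose_path helper are assumptions made for illustration.

```python
import re

# Image XObjects carry "/Subtype /Image" in their stream dictionary, which is
# stored uncompressed, so a raw byte scan can see it. Whitespace between the
# two names is optional in PDF syntax.
_IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def pdf_has_images(content: bytes) -> bool:
    """Return True if the raw bytes contain at least one image XObject marker."""
    return _IMAGE_XOBJECT.search(content) is not None


def choose_path(content: bytes) -> str:
    """Pick the hybrid path only when page rendering is likely to add value."""
    return "hybrid" if pdf_has_images(content) else "text_only"
```

A scan like this can miss inline images and vector drawings inside compressed content streams, so a "text_only" verdict is a heuristic rather than a guarantee.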
Problem: a text-heavy PDF can be many pages yet small in file size (<5 MB),
causing MinerU to block for 30-90 min on what looks like a "small" file.
File size alone is a poor predictor of MinerU processing time; page count is
the direct driver (~1-2 min/10 pages on 8-core M-series with pipeline backend).
Changes:
- Add docLargePageThreshold = 80 (≈10-20 min wall-clock on this machine)
- Add pdfEstimatePageCount(): O(n) byte scan for /Type /Page markers,
same heuristic approach as _pdf_has_images in Python — zero dependencies
- For PDF + engine=mineru: load file bytes early (before engine routing)
to count pages, then cache them in cachedFileBytes to avoid a second
storage read when building req.FileContent
- Route to pdf_hybrid when FileSize >= 5MB OR page count >= 80
- Log both dimensions in the warning so the trigger is visible in logs
Queue routing (docProcessQueue) unchanged — page count is not available
at enqueue time and file size is a good-enough proxy for queue priority.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
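For illustration, here is a Python analogue of the page-count estimate and the dual-threshold check. The Go helper pdfEstimatePageCount is not shown in this PR text, so the regular expression, the function names, and the reading of 5 MB as 5 × 1024 × 1024 bytes are all assumptions.

```python
import re

# Match "/Type /Page" but not "/Type /Pages" (the page-tree node).
_PAGE_OBJECT = re.compile(rb"/Type\s*/Page(?![A-Za-z])")

DOC_LARGE_FILE_THRESHOLD = 5 * 1024 * 1024  # assumed byte value of "5 MB"
DOC_LARGE_PAGE_THRESHOLD = 80               # from the commit message


def estimate_page_count(content: bytes) -> int:
    """Count page-object markers with a single pass over the raw bytes."""
    return len(_PAGE_OBJECT.findall(content))


def should_bypass_mineru(file_size: int, content: bytes) -> bool:
    """True when either dimension crosses its threshold."""
    return (
        file_size >= DOC_LARGE_FILE_THRESHOLD
        or estimate_page_count(content) >= DOC_LARGE_PAGE_THRESHOLD
    )
```

PDFs that store page dictionaries inside compressed object streams hide these markers from a raw scan, so the estimate can undercount; in that case the file-size threshold is the only trigger left.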
…rid rendering
Adds two new hybrid parser engines for Word and PowerPoint using the same
pattern as PDFHybridParser: MarkItDown text extraction + LibreOffice page
rendering run concurrently via ThreadPoolExecutor(max_workers=2).
LibreOffice converts docx/pptx→PDF (headless, subprocess), then
PDFScannedParser renders each page as a JPEG for the multimodal VLM pipeline.
DocxHybridParser — captures Charts/SmartArt/OLE objects that DocxParser
(python-docx inline image extraction) silently drops. Triggered for
large Word files (≥5 MB) when engine=mineru.
PptxHybridParser — slides are fundamentally visual; MarkItDown alone
extracts text only. Triggered for all pptx/ppt when engine=mineru.
Python registry: registers docx_hybrid + pptx_hybrid with LibreOffice
availability check (find_soffice scans PATH + known install paths).
Go engine_registry: docxHybridEngine + pptxHybridEngine with checkLibreOffice()
so both engines appear in the UI dropdown with availability status.
Go routing (convert()):
pptx/ppt + mineru → always pptx_hybrid (visual format regardless of size)
docx/doc + mineru + ≥5MB → docx_hybrid
pdf + mineru + (≥5MB or ≥80 pages) → pdf_hybrid (unchanged)
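The three routing rules above can be written as one decision function. This is a Python paraphrase of the Go logic in convert(), given only for illustration: select_engine is a hypothetical name, and 5 MB is assumed to be 5 × 1024 × 1024 bytes.

```python
SIZE_THRESHOLD = 5 * 1024 * 1024  # assumed byte value of "5 MB"
PAGE_THRESHOLD = 80


def select_engine(ext: str, file_size: int, page_count: int, engine: str) -> str:
    """Return the engine to use after applying the MinerU bypass rules."""
    if engine != "mineru":
        return engine  # overrides only apply when MinerU was selected
    ext = ext.lower().lstrip(".")
    if ext in ("pptx", "ppt"):
        return "pptx_hybrid"  # visual format, regardless of size
    if ext in ("docx", "doc") and file_size >= SIZE_THRESHOLD:
        return "docx_hybrid"
    if ext == "pdf" and (
        file_size >= SIZE_THRESHOLD or page_count >= PAGE_THRESHOLD
    ):
        return "pdf_hybrid"
    # Other cases keep MinerU in this sketch; any fallback the real code
    # applies to other large file types is not modelled here.
    return engine
```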
resolveDocReader: pdf_hybrid/docx_hybrid/pptx_hybrid all route to
s.documentReader (Python docreader) — same as builtin.
Tested: PptxHybridParser on 987 KB real deck → 9 slides rendered ✅
DocxHybridParser on generated docx → text + 1 page ✅
LibreOffice availability detection ✅
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
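The availability check and headless conversion might look roughly like this. The sketch is not the PR's find_soffice: the list of install paths is an assumption, and soffice_convert_command is a hypothetical helper. The --headless, --convert-to and --outdir options are standard LibreOffice command-line flags.

```python
import os
import shutil
from typing import Iterable, List, Optional

# Assumed typical install locations; the PR's actual list is not shown.
KNOWN_SOFFICE_PATHS = (
    "/usr/bin/soffice",
    "/usr/local/bin/soffice",
    "/opt/libreoffice/program/soffice",
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",
    r"C:\Program Files\LibreOffice\program\soffice.exe",
)


def find_soffice(extra_paths: Iterable[str] = KNOWN_SOFFICE_PATHS) -> Optional[str]:
    """Return a path to the LibreOffice binary, or None if it is not installed."""
    for name in ("soffice", "libreoffice"):
        found = shutil.which(name)  # searches PATH
        if found:
            return found
    for candidate in extra_paths:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def soffice_convert_command(soffice: str, src: str, outdir: str) -> List[str]:
    """Build the argv for a headless docx/pptx to PDF conversion."""
    return [soffice, "--headless", "--convert-to", "pdf", "--outdir", outdir, src]
```

The command list is meant to be passed to subprocess.run with a timeout, so that a stuck LibreOffice process cannot hang the parser the way an oversized file could hang MinerU.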
Problem

Large documents (≥5 MB PDFs, Word files with charts, PowerPoint decks) cause two distinct problems when processed by MinerU:
- MinerU runs with `MAX_CONCURRENT=1`, so a large file blocks the queue for every file behind it and can hang the subprocess
- `PDFParser` uses a chain-of-responsibility pattern that stops at MarkItDown success; pdfminer.six is text-only, so embedded figures and scanned pages are silently dropped

Changes

Queue routing (internal/)
- `doc_large` asynq queue (weight 1) alongside `default` (weight 2): large files drain only after small files complete
- `docProcessQueue(fileSize int64)`: files ≥5 MB → `doc_large`, otherwise → `default`
- `Concurrency: 16` to fully utilize I/O-bound LLM/embedding calls

MinerU bypass routing (knowledge_process.go)
Dual-threshold check overrides `parserEngine=mineru` when a file is too large for the MinerU worker budget:
- `pptx`/`ppt` (any size) → `pptx_hybrid` (slides are always visual)
- pdf, ≥5 MB or ≥80 pages → `pdf_hybrid`
- docx/doc, ≥5 MB → `docx_hybrid`
Page count is estimated via an O(n) byte scan for `/Type /Page` markers, avoiding a full PDF parse.

Hybrid parsers (docreader/parser/)
- `PDFHybridParser` (new, registered as `pdf_hybrid`):
  - a byte scan for `/Subtype /Image` decides text-only vs hybrid path per file
  - text extraction and page rendering run concurrently in a `ThreadPoolExecutor`; merged result preserves both text structure and per-page JPEG renders for VLM
- `office_hybrid_parser.py` (new file): `DocxHybridParser` and `PptxHybridParser` share the `_OfficeHybridParser` base
  - availability depends on LibreOffice (`soffice`) being present
- `registry.py`: registers `pdf_hybrid`, `docx_hybrid`, `pptx_hybrid` engines with availability checks
- `engine_registry.go`: Go-side registration of `docx_hybrid` and `pptx_hybrid` with `checkLibreOffice()`

Why not always use MinerU?

MinerU uses `subprocess.Popen` isolation with `MAX_CONCURRENT=1`. Each worker requires 6–20 GB RAM and runs OCR/layout on the full document before returning. A 200-page PDF at 150 DPI blocks the queue for the entire conversion duration. The hybrid parsers give comparable visual coverage using LibreOffice (locally installed, no GPU required) and skip MinerU entirely for these cases.

Testing
- `pdf_hybrid`
- `DocxHybridParser` tested with a real 987 KB Robustel product deck (pptx): LibreOffice conversion + page renders succeeded
- `go build ./...` passes