feat(ingest+parser): large-file routing — doc_large queue + hybrid parsers for PDF/Word/PPT#1232

Open
toy0116 wants to merge 7 commits into Tencent:main from toy0116:upstream-pr/large-file-handling

toy0116 commented May 9, 2026

Problem

Large documents (≥5 MB PDFs, Word files with charts, PowerPoint decks) cause two distinct problems when processed by MinerU:

  1. Queue starvation: A 100-page PDF occupies a MinerU worker for 10–40 min, blocking all subsequent uploads
  2. Image loss: PDFParser uses a chain-of-responsibility pattern that stops at MarkItDown success — pdfminer.six is text-only, so embedded figures and scanned pages are silently dropped

Changes

Queue routing (internal/)

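  • Add doc_large asynq queue (weight 1) alongside default (weight 2), so workers pick small files about twice as often as large ones (asynq weights are proportional unless StrictPriority is enabled, so this is a bias toward small files, not strict ordering)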
  • docProcessQueue(fileSize int64): files ≥5 MB → doc_large, otherwise → default
  • Applied at all three enqueue sites: upload, reparse, clone/move
  • Concurrency: 16 to fully utilize I/O-bound LLM/embedding calls

MinerU bypass routing (knowledge_process.go)

Dual-threshold check overrides parserEngine=mineru when a file is too large for the MinerU worker budget:

Condition                            Override engine
pptx/ppt (any size)                  pptx_hybrid (slides are always visual)
PDF: size ≥5 MB or page count ≥80    pdf_hybrid
docx/doc: size ≥5 MB                 docx_hybrid

Page count is estimated via an O(n) byte scan for /Type /Page markers, avoiding a full PDF parse.
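
The queue and engine routing rules above can be sketched as follows. The PR implements them in Go (docProcessQueue, convert(), pdfEstimatePageCount); this is a Python transliteration for illustration only. The function names, the 5 * 1024 * 1024 reading of "5 MB", and the exclusion of /Type /Pages nodes are assumptions, not details confirmed by the PR.

```python
import re

# Assumption: "5 MB" means 5 MiB; the PR may use a different byte count.
LARGE_FILE_BYTES = 5 * 1024 * 1024
LARGE_PAGE_COUNT = 80  # page-count threshold stated in the PR

# Match "/Type /Page" but not "/Type /Pages" (the page-tree node),
# which would otherwise inflate the estimate.
_PAGE_MARKER = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def estimate_pdf_pages(data: bytes) -> int:
    """Estimate the page count with one pass over the raw bytes."""
    return len(_PAGE_MARKER.findall(data))


def doc_process_queue(file_size: int) -> str:
    """Pick the queue name from file size alone (hypothetical helper)."""
    return "doc_large" if file_size >= LARGE_FILE_BYTES else "default"


def override_engine(engine: str, ext: str, file_size: int, data: bytes = b"") -> str:
    """Apply the routing table; only engine == "mineru" is ever overridden."""
    if engine != "mineru":
        return engine
    ext = ext.lower().lstrip(".")
    if ext in ("pptx", "ppt"):
        return "pptx_hybrid"  # any size
    if ext == "pdf" and (
        file_size >= LARGE_FILE_BYTES
        or estimate_pdf_pages(data) >= LARGE_PAGE_COUNT
    ):
        return "pdf_hybrid"
    if ext in ("docx", "doc") and file_size >= LARGE_FILE_BYTES:
        return "docx_hybrid"
    return engine
```

A byte scan of this kind only sees page dictionaries stored uncompressed; pages packed into compressed object streams would be undercounted, so the result is best treated as a lower-bound heuristic.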

Hybrid parsers (docreader/parser/)

PDFHybridParser (new, registered as pdf_hybrid):

  • Fast O(n) byte scan for /Subtype /Image decides text-only vs hybrid path per file
  • Text-only PDFs: MarkItDown only — fast, no page rendering
  • Image-rich PDFs: MarkItDown + PDFScannedParser run concurrently via ThreadPoolExecutor; merged result preserves both text structure and per-page JPEG renders for VLM
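
A minimal sketch of the per-file dispatch described above, assuming the two passes are plain callables. extract_text and render_pages are hypothetical stand-ins for MarkitdownParser and PDFScannedParser, and the markdown format used for page references is an assumption; the PR's actual class interfaces are not shown here.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

_IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image")


def pdf_has_images(data: bytes) -> bool:
    """Single-pass heuristic: does the raw PDF mention an image XObject?"""
    return _IMAGE_MARKER.search(data) is not None


def parse_pdf_hybrid(
    data: bytes,
    extract_text: Callable[[bytes], str],
    render_pages: Callable[[bytes], List[str]],
) -> Tuple[str, List[str]]:
    """Return (markdown text, page image refs) for one PDF."""
    if not pdf_has_images(data):
        # Text-only fast path: no page rendering, no image refs.
        return extract_text(data), []
    # Hybrid path: both passes read the same bytes and run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(extract_text, data)
        pages_future = pool.submit(render_pages, data)
        return text_future.result(), pages_future.result()


def merge_markdown(text: str, page_refs: List[str]) -> str:
    """Append one image reference per rendered page after the text."""
    refs = "\n".join(
        f"![page {i}]({ref})" for i, ref in enumerate(page_refs, start=1)
    )
    return text if not refs else f"{text}\n\n{refs}"
```

Threads shorten wall-clock time here only to the extent that one of the passes spends its time outside the interpreter lock, for example in native rendering code or I/O.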

office_hybrid_parser.py (new file):

  • DocxHybridParser and PptxHybridParser share _OfficeHybridParser base
  • MarkItDown extracts text structure concurrently with LibreOffice headless conversion to PDF
  • PDFScannedParser renders every page/slide as JPEG for the VLM pipeline
  • Availability gated on LibreOffice (soffice) being present
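
The LibreOffice step can be sketched as below. locate_soffice, build_convert_command, and convert_to_pdf are hypothetical helpers; the PR's own find_soffice also checks known install paths, which this sketch does not attempt. The 300-second timeout is likewise an assumption.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def locate_soffice() -> Optional[str]:
    """Look for a LibreOffice binary on PATH; None means unavailable."""
    for name in ("soffice", "libreoffice"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_convert_command(soffice: str, src: Path, out_dir: Path) -> List[str]:
    """Command line for a headless docx/pptx to PDF conversion."""
    return [
        soffice, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_to_pdf(src: Path, out_dir: Path, timeout_s: float = 300.0) -> Path:
    """Convert src to PDF in out_dir and return the path of the new file."""
    soffice = locate_soffice()
    if soffice is None:
        raise RuntimeError("LibreOffice (soffice) not found on PATH")
    subprocess.run(
        build_convert_command(soffice, src, out_dir),
        check=True,            # raise if soffice exits non-zero
        timeout=timeout_s,     # avoid a hung conversion blocking the worker
        capture_output=True,
    )
    pdf_path = out_dir / (src.stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"conversion produced no output for {src}")
    return pdf_path
```

The resulting PDF would then be handed to the page renderer, so the availability check only needs to confirm that locate_soffice() returns a path.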

registry.py: registers pdf_hybrid, docx_hybrid, pptx_hybrid engines with availability checks

engine_registry.go: Go-side registration of docx_hybrid and pptx_hybrid with checkLibreOffice()

Why not always use MinerU?

MinerU uses subprocess.Popen isolation with MAX_CONCURRENT=1. Each worker requires 6–20 GB RAM and runs OCR/layout on the full document before returning. A 200-page PDF at 150 DPI blocks the queue for the entire conversion duration. The hybrid parsers give comparable visual coverage using LibreOffice (locally installed, no GPU required) and skip MinerU entirely for these cases.

Testing

  • Verified queue routing with files above/below 5 MB threshold
  • Verified page-count threshold with a 3 MB / 150-page text PDF correctly routes to pdf_hybrid
  • PptxHybridParser tested with a real 987 KB Robustel product deck (pptx): LibreOffice conversion and page renders succeeded
  • go build ./... passes

toy0116 and others added 7 commits May 9, 2026 13:29
MinerU runs with max_concurrent=1 via subprocess isolation. Files > 5 MB
cause two problems:
1. They block the queue for all subsequent files (serial processing).
2. They can OOM or hang the MinerU subprocess indefinitely, leaving
   the calling document task stuck in "processing" with no further retry.

Evidence from 2026-05-09 00:55 batch:
  milesight-iot-brochure-cloud.pdf      20 MB  → hung
  milesight-iot-brochure-collection-en  13 MB  → hung
  milesight-iot-lorawan-series-cat.     11 MB  → hung
  ur35-user-guide-en.pdf                 8 MB  → hung
  Small files queued behind them waited 72+ min before first attempt.

Fix: in convert(), if the selected engine is "mineru" and the file is
> 5 MB, downgrade to "builtin". Builtin is always available, handles
large PDFs reliably, and does not block a shared queue. Files under the
threshold keep using MinerU for better OCR quality.

Threshold constant: mineruLargeFileThreshold = 5 MB (easily adjustable).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…c_large queue

Adds a "doc_large" asynq queue (weight 1) alongside the existing "default"
queue (weight 2) for document:process tasks. Files at or above 5 MB are
routed to "doc_large" so small files always drain their higher-weight queue
first, preventing a handful of large PDFs from blocking dozens of small docs.

The 5 MB threshold matches the existing mineru→builtin engine override
(c3719d5) and is now defined as the package-level docLargeFileThreshold
constant, replacing the inline const in convert(). A new docProcessQueue()
helper makes the routing decision at every Enqueue call site consistently.

Changed sites:
- knowledge_create.go: file upload path
- knowledge_process.go: file reparse + file-URL reparse
- knowledge_clone_move.go: knowledge move reparse
- router/task.go: NewAsynqServer() queue weight table

URL-based knowledge and passage-text knowledge stay on "default" (no
known FileSize at enqueue time or inherently tiny payloads).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ch PDFs

Problem: builtin engine (MarkItDown/pdfminer.six) is text-only. For a
text-searchable PDF like a Milesight product manual, FirstParser stops at
MarkItDown on the first success and PDFScannedParser never runs, so all
embedded figures, wiring diagrams and photos are silently dropped.

Fix: new PDFHybridParser runs both passes unconditionally:
  1. MarkitdownParser  — extracts text structure (headings, tables, lists)
  2. PDFScannedParser  — renders every page as JPEG so multimodal VLM sees
                         all visual content (figures, diagrams, photos)
Output markdown = MarkItDown text + page image refs merged; image refs
trigger the existing async multimodal VLM pipeline for captioning/OCR.

Registered as engine "pdf_hybrid" in the Python parser registry.

Go routing updated in convert(): large PDF files (≥5 MB, engine=mineru)
now override to "pdf_hybrid" instead of "builtin", preserving image coverage.
Non-PDF large files (docx, pptx, …) still fall back to "builtin" since
MarkItDown handles those formats' embedded images natively.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g concurrently

MarkitdownParser (pdfminer.six) and PDFScannedParser (pypdfium2 page render)
are independent: both read the same input bytes and write to separate outputs.
Run them in a ThreadPoolExecutor(max_workers=2) so they execute in parallel,
cutting total wall-clock time from sum(step1 + step2) to max(step1, step2).
For a 50-page 10 MB PDF this saves ~2–5 s of text-extraction time that was
previously paid as pure serial overhead before page rendering started.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… file

Adds _pdf_has_images() fast heuristic (O(n) byte scan for /Subtype /Image)
to decide at runtime whether a large PDF needs page rendering:

  No images detected  → text-only fast path: MarkitdownParser only,
                        no page rendering, no VLM tasks generated
  Images detected     → hybrid path: MarkitdownParser + PDFScannedParser
                        run concurrently, results merged

This implements three-tier routing for PDFs around the 5 MB threshold:
  1. <5 MB              → MinerU  (best quality, handles all content)
  2. ≥5 MB, text-only   → pdf_hybrid text path  (fast, lightweight)
  3. ≥5 MB, text+images → pdf_hybrid hybrid path (concurrent, full coverage)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem: a text-heavy PDF can be many pages yet small in file size (<5 MB),
causing MinerU to block for 30-90 min on what looks like a "small" file.
File size alone is a poor predictor of MinerU processing time; page count is
the direct driver (~1-2 min/10 pages on 8-core M-series with pipeline backend).

Changes:
  - Add docLargePageThreshold = 80 (≈10-20 min wall-clock on this machine)
  - Add pdfEstimatePageCount(): O(n) byte scan for /Type /Page markers,
    same heuristic approach as _pdf_has_images in Python — zero dependencies
  - For PDF + engine=mineru: load file bytes early (before engine routing)
    to count pages, then cache them in cachedFileBytes to avoid a second
    storage read when building req.FileContent
  - Route to pdf_hybrid when FileSize >= 5MB OR page count >= 80
  - Log both dimensions in the warning so the trigger is visible in logs

Queue routing (docProcessQueue) unchanged — page count is not available
at enqueue time and file size is a good-enough proxy for queue priority.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rid rendering

Adds two new hybrid parser engines for Word and PowerPoint using the same
pattern as PDFHybridParser: MarkItDown text extraction + LibreOffice page
rendering run concurrently via ThreadPoolExecutor(max_workers=2).

LibreOffice converts docx/pptx→PDF (headless, subprocess), then
PDFScannedParser renders each page as a JPEG for the multimodal VLM pipeline.

  DocxHybridParser — captures Charts/SmartArt/OLE objects that DocxParser
    (python-docx inline image extraction) silently drops. Triggered for
    large Word files (≥5 MB) when engine=mineru.

  PptxHybridParser — slides are fundamentally visual; MarkItDown alone
    extracts text only. Triggered for all pptx/ppt when engine=mineru.

Python registry: registers docx_hybrid + pptx_hybrid with LibreOffice
availability check (find_soffice scans PATH + known install paths).

Go engine_registry: docxHybridEngine + pptxHybridEngine with checkLibreOffice()
so both engines appear in the UI dropdown with availability status.

Go routing (convert()):
  pptx/ppt + mineru → always pptx_hybrid (visual format regardless of size)
  docx/doc + mineru + ≥5MB → docx_hybrid
  pdf + mineru + (≥5MB or ≥80 pages) → pdf_hybrid  (unchanged)

resolveDocReader: pdf_hybrid/docx_hybrid/pptx_hybrid all route to
s.documentReader (Python docreader) — same as builtin.

Tested: PptxHybridParser on 987 KB real deck → 9 slides rendered ✅
        DocxHybridParser on generated docx → text + 1 page ✅
        LibreOffice availability detection ✅

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>