studio: chat document extraction with runtime VLM probe and bounded concurrency #5351

Draft
Etherll wants to merge 6 commits into unslothai:main from Etherll:document-extractor-refactor

Conversation


@Etherll Etherll commented May 9, 2026

Description

Adds first-class document extraction to the Studio chat composer. Users drop a PDF, DOCX, HTML, Markdown, plain-text or source-code file into the composer; the backend converts it to layout-aware Markdown, optionally captions selected figures through the user's currently-loaded vision model, and the result is spliced into the outgoing chat message. The whole pipeline streams progress back to the UI, respects a server-side concurrency budget, and gracefully degrades when optional parsers or a vision model are not present.

Motivation

The previous flow had no way to attach a document to a chat - only inline images. Two further pain points are addressed:

  • The static VISION_ARCHITECTURES allow-list silently excluded legitimate newer vision architectures (DeepSeek-OCR, PaddleOCR-VL, GLM-OCR, …) and had no visibility into the model the user actually has loaded. It is replaced by a runtime probe.
  • Multi-select uploads previously raced and produced 503 busy errors as soon as the bounded extractor pool saturated. The new pipeline parks callers in a bounded queue and exposes the limit so the UI can cap parallelism client-side.

Backend

  • New module studio/backend/core/chat/document_extractor.py (1.1k LOC). Parses PDFs via pymupdf + pymupdf4llm, DOCX via mammoth, HTML via Markdown conversion, plain text / Markdown / source files as UTF-8 with replacement. Bounded by _EXTRACT_SEMAPHORE (UNSLOTH_STUDIO_EXTRACT_CONCURRENCY, default 2) with a queue wait window before any 503. Per-format dependency checks so plain-text formats keep working when optional libs are missing.
  • New module studio/backend/core/chat/vlm_capability.py. A read-only runtime probe across the three Studio inference backends - embedded llama-server (GGUF), transformers, and Unsloth/LoRA - exposed as a single immutable VlmCapability dataclass (is_vlm, endpoint_url, model_name, source, reason). When the loaded model is vision-capable the extractor captions selected figures through its OpenAI-compatible /v1/chat/completions endpoint; otherwise figures come back with caption=None and a human-readable describe_skipped_reason.
  • New routes GET /api/inference/document-support and POST /api/inference/extract-document. The latter streams NDJSON {stage: "parsing" | "captioning" | "done" | "result" | "error", …} events; document-support advertises supported formats, parser availability, and max_extract_concurrency.
  • Hardening: oversized request rejection by Content-Length, streaming overflow rejection for raw uploads and multipart bodies, proper task-waiter exception draining, larger chat JSON body limit (_OPENAI_CHAT_BODY_MAX_BYTES) that accounts for embedded image payloads, and hf_token is no longer accepted as a query string on probe endpoints.
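The interplay of the bounded pool and the NDJSON stream can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the 30-second queue-wait window, the event payload fields, and the helper names (`extract_ndjson`, `parse`) are assumptions; only `_EXTRACT_SEMAPHORE`, `UNSLOTH_STUDIO_EXTRACT_CONCURRENCY`, the `VlmCapability` fields, and the stage names come from the description above.

```python
import asyncio
import json
import os
from dataclasses import dataclass
from typing import Optional

# Concurrency budget mirrors the env var named in the PR; the queue-wait
# window below is an illustrative assumption.
_EXTRACT_CONCURRENCY = int(os.environ.get("UNSLOTH_STUDIO_EXTRACT_CONCURRENCY", "2"))
_QUEUE_WAIT_SECONDS = 30.0
_EXTRACT_SEMAPHORE = asyncio.Semaphore(_EXTRACT_CONCURRENCY)

@dataclass(frozen=True)
class VlmCapability:
    is_vlm: bool
    endpoint_url: Optional[str] = None
    model_name: Optional[str] = None
    source: Optional[str] = None   # e.g. "llama-server", "transformers", "unsloth"
    reason: Optional[str] = None   # human-readable reason when is_vlm is False

class ExtractorBusy(Exception):
    """No slot freed up within the queue-wait window (maps to HTTP 503)."""

async def extract_ndjson(parse, capability: VlmCapability):
    """Yield one JSON object per NDJSON line: parsing -> captioning? -> result -> done."""
    try:
        # Park the caller in a bounded queue instead of failing immediately.
        await asyncio.wait_for(_EXTRACT_SEMAPHORE.acquire(), _QUEUE_WAIT_SECONDS)
    except asyncio.TimeoutError:
        raise ExtractorBusy("extractor pool saturated")
    try:
        yield json.dumps({"stage": "parsing"})
        markdown, n_figures = await parse()
        if n_figures and capability.is_vlm:
            # The real pipeline captions figures through the loaded model's
            # OpenAI-compatible /v1/chat/completions endpoint; elided here.
            yield json.dumps({"stage": "captioning", "figures": n_figures})
        yield json.dumps({"stage": "result", "markdown": markdown})
        yield json.dumps({"stage": "done"})
    finally:
        _EXTRACT_SEMAPHORE.release()
```

When no vision model is loaded, the `captioning` stage is simply skipped and figures carry only a `describe_skipped_reason`, matching the graceful-degradation behavior described above.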

Frontend

  • New runWithTemporaryOcrModel orchestrator that - if the user picked a dedicated OCR preset - validates, unloads the active chat model, loads the OCR model, runs the extraction, then restores the prior snapshot. Concurrent uploads are serialized through a module-level promise queue; manual mid-run swaps are never overwritten.
  • New components: attachment chips, DocAttachmentChip, DocumentPreviewSheet, and DocumentStack for grouped attachments. Three OCR presets shipped with model defaults: deepseek-ai/DeepSeek-OCR, unsloth/PaddleOCR-VL, zai-org/GLM-OCR.
  • extraction-queue.ts honours the backend's advertised concurrency limit; the settings sheet caps the user-facing slider accordingly while preserving an explicit user override.

Tests

  • test_chat_document_extraction.py (900 LOC): VLM probe across backend shapes, dataclass round-trips, format detection, optional-dep skips.
  • test_chat_document_routes.py (1093 LOC): oversized-request rejection, streaming-overflow rejection on raw + multipart, NDJSON shape, end-to-end fakes.
  • test_inference_worker.py, test_models_get_model_config_case_resolution.py, test_anthropic_messages.py, test_openai_tool_passthrough.py, test_vision_cache.py updated for the new probe and request limits.

Notes

  • OCR is deliberately not run locally in this build. Scanned PDFs without a text layer yield empty Markdown; toggling use_vlm_ocr renders bounded page images for the loaded vision model to describe.
  • PPTX is no longer advertised in SUPPORTED_SUFFIXES / SUPPORTED_MIME_TYPES.
  • No crates/ui API breakage - the ## Breaking Changes section is intentionally omitted.

Etherll added 3 commits May 9, 2026 10:53
Expose and propagate the server-side document extraction concurrency limit so clients can cap their parallel extractions and avoid 503 busy responses.

Backend: export _EXTRACT_CONCURRENCY, add max_extract_concurrency to DocumentSupportResponse, surface a default on import failure, and compute a larger chat JSON body limit (_OPENAI_CHAT_BODY_MAX_BYTES) that accounts for embedded image payloads. Also create the extraction task waiter once so its exceptions are properly drained. Tests updated to assert the new values and chat body behavior.

Frontend: add setExtractionBackendLimit and apply the backend limit in the extraction queue and UI (cap the slider, adjust stored settings), call setExtractionBackendLimit when caching document support, and change the NDJSON progress wording from "uploaded" to "processed".

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a document extraction pipeline for the Chat interface, enabling users to upload and process PDF, DOCX, HTML, and text files with optional VLM-powered figure captioning and OCR support. Key additions include server-side parsing logic, a temporary OCR model swap mechanism, and a revamped frontend UI for managing document attachments and previews. Feedback identifies a potential semaphore leak during process initialization and suggests simplifying the cancellation logic for document extraction to avoid unnecessary complexity and potential race conditions.

Comment on lines +807 to +814

def _run_extract_process_sync(
file_bytes: bytes,
filename: str,
options: dict,
content_type: str,
timeout_seconds: int,
cancel_event: Optional[threading.Event] = None,

Severity: high

This section has two issues:

  1. Semaphore Leak: If an exception occurs during process initialization (e.g., in multiprocessing.get_context or ctx.Queue), the acquired semaphore will be leaked because the try...finally block hasn't been entered yet. The try block should start immediately after the semaphore is successfully acquired.
  2. Process Context: While spawn is more robust in multi-threaded environments like FastAPI, ensure this change is consistent with the repository's established process management patterns to avoid one-off architectural changes.
    try:
        ctx = multiprocessing.get_context("spawn")
        result_queue = ctx.Queue(maxsize = 1)
        proc = ctx.Process(
            target = _run_extract_worker,
            args = (result_queue, file_bytes, filename, options, content_type),
            daemon = True,
        )
References
  1. Architectural changes, like resource management or process strategies, should be applied consistently across the codebase rather than as one-off changes.
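The acquire/try ordering the review asks for can be sketched as follows (a minimal illustration with a synchronous `threading.Semaphore`; `start_process` stands in for the process-construction code that may raise):

```python
import threading

def run_guarded(semaphore: threading.Semaphore, start_process):
    # Acquire first; the try/finally must begin immediately afterwards so
    # that an exception during process construction still releases the slot.
    if not semaphore.acquire(timeout=30):
        raise RuntimeError("busy")
    try:
        return start_process()  # anything here may raise, e.g. ctx.Queue()
    finally:
        semaphore.release()
```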

authorization_header = caption_authorization_header,
cancel_event = cancel_event,
)
)

Severity: medium

The use of asyncio.shield combined with a 10-second wait_for and manual cancel() is unnecessarily complex for handling client disconnects. Since the extraction task already monitors the cancel_event and runs in a separate process that responds to it, simply setting the event and allowing the task to terminate naturally (or awaiting it briefly) would be more straightforward and less prone to race conditions.
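The simpler disconnect handling the review suggests might look like this sketch (names and the grace period are illustrative, not the PR's code):

```python
import asyncio
import threading

async def on_client_disconnect(task: asyncio.Task, cancel_event: threading.Event,
                               grace_seconds: float = 2.0):
    # Signal the worker process (which polls cancel_event) and let the task
    # finish naturally; no shield/wait_for/cancel dance is needed.
    cancel_event.set()
    done, pending = await asyncio.wait({task}, timeout=grace_seconds)
    if pending:
        task.cancel()  # last resort, only if the worker ignores the event
```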


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ae1091f8a


Comment thread studio/backend/run.py

if _llama_cpp_backend is not None:
_llama_cpp_backend._kill_process()
backend = getattr(_llama_cpp_mod, "_llama_cpp_backend", None)

P2: Kill the GGUF backend that inference actually uses

When a GGUF model is loaded through /api/inference/load, the active LlamaCppBackend is still the singleton in routes.inference (_llama_cpp_backend returned by that route module). This shutdown path now checks core.inference.llama_cpp._llama_cpp_backend instead, which is a separate lazy singleton and is typically still None, so Studio shutdown no longer calls _kill_process() for the llama-server child started by inference. Use the same backend instance as the load route, or migrate the route to the core singleton before switching shutdown over.
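The fix direction for this singleton split can be sketched as a single shared accessor that the load, list, and shutdown paths all use. The toy `LlamaCppBackend` below only stands in for the real class; the accessor name mirrors the one in the review:

```python
# Toy stand-in for the real LlamaCppBackend; the point is the single
# module-level singleton that every route and the shutdown path share.
class LlamaCppBackend:
    def __init__(self):
        self.is_loaded = False
        self.model = None

    def load(self, model):
        self.model, self.is_loaded = model, True

    def _kill_process(self):
        self.model, self.is_loaded = None, False

_llama_cpp_backend = None

def get_llama_cpp_backend():
    global _llama_cpp_backend
    if _llama_cpp_backend is None:
        _llama_cpp_backend = LlamaCppBackend()
    return _llama_cpp_backend
```

With one accessor, a load performed by the inference route is visible to shutdown and to `/api/models/list`, avoiding the "fresh inactive backend" problem described in this comment and the two related ones below.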


Comment on lines +2709 to 2711
from core.inference.llama_cpp import get_llama_cpp_backend

llama_backend = get_llama_cpp_backend()

P1: Check the active GGUF singleton before deleting cache files

This guard now queries the new core.inference.llama_cpp singleton, but GGUF loads in routes.inference still populate the route-local singleton. In the common case where a user loaded a cached GGUF through /api/inference/load, this llama_backend.is_loaded check sees a fresh inactive backend and allows deleting the files for the model that is currently being served, which can break the running llama-server session. Point this check at the same singleton used by inference loading (or complete the singleton migration first).


Etherll added 2 commits May 9, 2026 23:45
Stub document-parser availability in chat document route tests, by monkeypatching _document_parser_support and _document_parser_unavailable_reasons, to avoid spurious 501 errors in CI when optional parsers are missing.

Update several AST-based tests to consider both the public handler (openai_chat_completions) and the implementation function (_openai_chat_completions_impl), preferring the impl when present, so structural checks still pass after streaming bodies were moved into the implementation. Changes touch test_chat_document_routes.py and multiple tests in test_stream_cancel_registration_timing.py.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a79102a745


Comment on lines +1519 to 1521
from core.inference.llama_cpp import get_llama_cpp_backend

llama_backend = get_llama_cpp_backend()

P2: Use the route-owned GGUF singleton when listing models

This is another occurrence of the GGUF singleton split flagged elsewhere: /api/inference/load still populates routes.inference._llama_cpp_backend, but this endpoint now instantiates/reads core.inference.llama_cpp instead. After loading a GGUF model through the normal route, /api/models/list will see a fresh inactive backend and omit the active GGUF model from loaded_models, so the UI/API no longer reports it as available. Point this check at the same singleton used by the load route, or migrate the load route before switching this import.


) -> tuple[str, list[ExtractedFigure], int, int, int]:
html = file_bytes.decode("utf-8", errors = "replace")
try:
from core.inference._html_to_md import html_to_markdown

P2: Avoid depending on inference imports for HTML cleanup

HTML extraction now imports the stdlib-only converter through the core.inference package, which executes core/inference/__init__.py and pulls in the inference orchestrator before reaching _html_to_md. In extraction-only or partially installed environments where an inference dependency is missing or broken, this catch block returns raw HTML instead of cleaned Markdown, leaving scripts/styles and markup in the prompt even though HTML support is advertised as independent of optional parsers. Move the converter to a lightweight package (or import it without triggering inference initialization) so HTML uploads are still sanitized when inference imports fail.
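One way to satisfy "import it without triggering inference initialization" is to load the helper module directly from its file path, so the package's `__init__.py` never executes. A sketch (the path and module name are illustrative):

```python
import importlib.util

def load_module_standalone(path: str, name: str):
    # Executes only the target file itself; the package's __init__.py (and
    # any heavy inference imports it pulls in) never runs.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Moving `_html_to_md` into a lightweight package, as the comment suggests, is the cleaner long-term fix; this is the minimal-change alternative.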


@rolandtannous rolandtannous marked this pull request as draft May 10, 2026 11:16