studio: chat document extraction with runtime VLM probe and bounded concurrency #5351

Draft
Etherll wants to merge 6 commits into unslothai:main from Etherll:document-extractor-refactor

Conversation


@Etherll Etherll commented May 9, 2026

Description

Adds first-class document extraction to the Studio chat composer. Users drop a PDF, DOCX, HTML, Markdown, plain-text or source-code file into the composer; the backend converts it to layout-aware Markdown, optionally captions selected figures through the user's currently-loaded vision model, and the result is spliced into the outgoing chat message. The whole pipeline streams progress back to the UI, respects a server-side concurrency budget, and gracefully degrades when optional parsers or a vision model are not present.

Motivation

The previous flow had no way to attach a document to a chat - only inline images. Two further pain points are addressed:

  • The static VISION_ARCHITECTURES allow-list silently excluded legitimate newer vision architectures (DeepSeek-OCR, PaddleOCR-VL, GLM-OCR, …) and had no visibility into the model the user actually has loaded. It is replaced by a runtime probe.
  • Multi-select uploads previously raced and produced 503 busy errors as soon as the bounded extractor pool saturated. The new pipeline parks callers in a bounded queue and exposes the limit so the UI can cap parallelism client-side.

Backend

  • New module studio/backend/core/chat/document_extractor.py (1.1k LOC). Parses PDFs via pymupdf + pymupdf4llm, DOCX via mammoth, HTML via Markdown conversion, plain text / Markdown / source files as UTF-8 with replacement. Bounded by _EXTRACT_SEMAPHORE (UNSLOTH_STUDIO_EXTRACT_CONCURRENCY, default 2) with a queue wait window before any 503. Per-format dependency checks so plain-text formats keep working when optional libs are missing.
  • New module studio/backend/core/chat/vlm_capability.py. A read-only runtime probe across the three Studio inference backends - embedded llama-server (GGUF), transformers, and Unsloth/LoRA - exposed as a single immutable VlmCapability dataclass (is_vlm, endpoint_url, model_name, source, reason). When the loaded model is vision-capable the extractor captions selected figures through its OpenAI-compatible /v1/chat/completions endpoint; otherwise figures come back with caption=None and a human-readable describe_skipped_reason.
  • New routes GET /api/inference/document-support and POST /api/inference/extract-document. The latter streams NDJSON {stage: "parsing" | "captioning" | "done" | "result" | "error", …} events; document-support advertises supported formats, parser availability, and max_extract_concurrency.
  • Hardening: oversized request rejection by Content-Length, streaming overflow rejection for raw uploads and multipart bodies, proper task-waiter exception draining, larger chat JSON body limit (_OPENAI_CHAT_BODY_MAX_BYTES) that accounts for embedded image payloads, and hf_token is no longer accepted as a query string on probe endpoints.
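The interplay of the bounded pool and the NDJSON stream can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the 30-second queue-wait window, the event payload fields, and the helper names (`extract_ndjson`, `parse`) are assumptions; only `_EXTRACT_SEMAPHORE`, `UNSLOTH_STUDIO_EXTRACT_CONCURRENCY`, the `VlmCapability` fields, and the stage names come from the description above.

```python
import asyncio
import json
import os
from dataclasses import dataclass
from typing import Optional

# Concurrency budget mirrors the env var named in the PR; the queue-wait
# window below is an illustrative assumption.
_EXTRACT_CONCURRENCY = int(os.environ.get("UNSLOTH_STUDIO_EXTRACT_CONCURRENCY", "2"))
_QUEUE_WAIT_SECONDS = 30.0
_EXTRACT_SEMAPHORE = asyncio.Semaphore(_EXTRACT_CONCURRENCY)

@dataclass(frozen=True)
class VlmCapability:
    is_vlm: bool
    endpoint_url: Optional[str] = None
    model_name: Optional[str] = None
    source: Optional[str] = None   # e.g. "llama-server", "transformers", "unsloth"
    reason: Optional[str] = None   # human-readable reason when is_vlm is False

class ExtractorBusy(Exception):
    """No slot freed up within the queue-wait window (maps to HTTP 503)."""

async def extract_ndjson(parse, capability: VlmCapability):
    """Yield one JSON object per NDJSON line: parsing -> captioning? -> result -> done."""
    try:
        # Park the caller in a bounded queue instead of failing immediately.
        await asyncio.wait_for(_EXTRACT_SEMAPHORE.acquire(), _QUEUE_WAIT_SECONDS)
    except asyncio.TimeoutError:
        raise ExtractorBusy("extractor pool saturated")
    try:
        yield json.dumps({"stage": "parsing"})
        markdown, n_figures = await parse()
        if n_figures and capability.is_vlm:
            # The real pipeline captions figures through the loaded model's
            # OpenAI-compatible /v1/chat/completions endpoint; elided here.
            yield json.dumps({"stage": "captioning", "figures": n_figures})
        yield json.dumps({"stage": "result", "markdown": markdown})
        yield json.dumps({"stage": "done"})
    finally:
        _EXTRACT_SEMAPHORE.release()
```

When no vision model is loaded, the `captioning` stage is simply skipped and figures carry only a `describe_skipped_reason`, matching the graceful-degradation behavior described above.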

Frontend

  • New runWithTemporaryOcrModel orchestrator that - if the user picked a dedicated OCR preset - validates, unloads the active chat model, loads the OCR model, runs the extraction, then restores the prior snapshot. Concurrent uploads are serialized through a module-level promise queue; manual mid-run swaps are never overwritten.
  • New components: attachment chips, DocAttachmentChip, DocumentPreviewSheet, and DocumentStack for grouped attachments. Three OCR presets shipped with model defaults: deepseek-ai/DeepSeek-OCR, unsloth/PaddleOCR-VL, zai-org/GLM-OCR.
  • extraction-queue.ts honours the backend's advertised concurrency limit; the settings sheet caps the user-facing slider accordingly while preserving an explicit user override.

Tests

  • test_chat_document_extraction.py (900 LOC): VLM probe across backend shapes, dataclass round-trips, format detection, optional-dep skips.
  • test_chat_document_routes.py (1093 LOC): oversized-request rejection, streaming-overflow rejection on raw + multipart, NDJSON shape, end-to-end fakes.
  • test_inference_worker.py, test_models_get_model_config_case_resolution.py, test_anthropic_messages.py, test_openai_tool_passthrough.py, test_vision_cache.py updated for the new probe and request limits.

Notes

  • OCR is deliberately not run locally in this build. Scanned PDFs without a text layer yield empty Markdown; toggling use_vlm_ocr renders bounded page images for the loaded vision model to describe.
  • PPTX is no longer advertised in SUPPORTED_SUFFIXES / SUPPORTED_MIME_TYPES.
  • No crates/ui API breakage - the ## Breaking Changes section is intentionally omitted.

Etherll added 3 commits May 9, 2026 10:53
Expose and propagate the server-side document extraction concurrency limit so clients can cap their parallel extractions and avoid 503 busy responses.

Backend: export _EXTRACT_CONCURRENCY, add max_extract_concurrency to DocumentSupportResponse, surface a default on import failure, and compute a larger chat JSON body limit (_OPENAI_CHAT_BODY_MAX_BYTES) that accounts for embedded image payloads. Also create the extraction task waiter once so its exceptions are properly drained. Tests updated to assert the new values and chat body behavior.

Frontend: add setExtractionBackendLimit and apply the backend limit in the extraction queue and UI (cap the slider, adjust stored settings), call setExtractionBackendLimit when caching document support, and change the NDJSON progress wording from "uploaded" to "processed".

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a document extraction pipeline for the Chat interface, enabling users to upload and process PDF, DOCX, HTML, and text files with optional VLM-powered figure captioning and OCR support. Key additions include server-side parsing logic, a temporary OCR model swap mechanism, and a revamped frontend UI for managing document attachments and previews. Feedback identifies a potential semaphore leak during process initialization and suggests simplifying the cancellation logic for document extraction to avoid unnecessary complexity and potential race conditions.

Comment on lines +807 to +814

def _run_extract_process_sync(
file_bytes: bytes,
filename: str,
options: dict,
content_type: str,
timeout_seconds: int,
cancel_event: Optional[threading.Event] = None,

Severity: high

This section has two issues:

  1. Semaphore Leak: If an exception occurs during process initialization (e.g., in multiprocessing.get_context or ctx.Queue), the acquired semaphore will be leaked because the try...finally block hasn't been entered yet. The try block should start immediately after the semaphore is successfully acquired.
  2. Process Context: While spawn is more robust in multi-threaded environments like FastAPI, ensure this change is consistent with the repository's established process management patterns to avoid one-off architectural changes.
    try:
        ctx = multiprocessing.get_context("spawn")
        result_queue = ctx.Queue(maxsize = 1)
        proc = ctx.Process(
            target = _run_extract_worker,
            args = (result_queue, file_bytes, filename, options, content_type),
            daemon = True,
        )
References
  1. Architectural changes, like resource management or process strategies, should be applied consistently across the codebase rather than as one-off changes.
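The acquire/try ordering the review asks for can be sketched as follows (a minimal illustration with a synchronous `threading.Semaphore`; `start_process` stands in for the process-construction code that may raise):

```python
import threading

def run_guarded(semaphore: threading.Semaphore, start_process):
    # Acquire first; the try/finally must begin immediately afterwards so
    # that an exception during process construction still releases the slot.
    if not semaphore.acquire(timeout=30):
        raise RuntimeError("busy")
    try:
        return start_process()  # anything here may raise, e.g. ctx.Queue()
    finally:
        semaphore.release()
```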

authorization_header = caption_authorization_header,
cancel_event = cancel_event,
)
)

Severity: medium

The use of asyncio.shield combined with a 10-second wait_for and manual cancel() is unnecessarily complex for handling client disconnects. Since the extraction task already monitors the cancel_event and runs in a separate process that responds to it, simply setting the event and allowing the task to terminate naturally (or awaiting it briefly) would be more straightforward and less prone to race conditions.
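The simpler disconnect handling the review suggests might look like this sketch (names and the grace period are illustrative, not the PR's code):

```python
import asyncio
import threading

async def on_client_disconnect(task: asyncio.Task, cancel_event: threading.Event,
                               grace_seconds: float = 2.0):
    # Signal the worker process (which polls cancel_event) and let the task
    # finish naturally; no shield/wait_for/cancel dance is needed.
    cancel_event.set()
    done, pending = await asyncio.wait({task}, timeout=grace_seconds)
    if pending:
        task.cancel()  # last resort, only if the worker ignores the event
```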


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ae1091f8a


Comment thread studio/backend/run.py

if _llama_cpp_backend is not None:
_llama_cpp_backend._kill_process()
backend = getattr(_llama_cpp_mod, "_llama_cpp_backend", None)

P2: Kill the GGUF backend that inference actually uses

When a GGUF model is loaded through /api/inference/load, the active LlamaCppBackend is still the singleton in routes.inference (_llama_cpp_backend returned by that route module). This shutdown path now checks core.inference.llama_cpp._llama_cpp_backend instead, which is a separate lazy singleton and is typically still None, so Studio shutdown no longer calls _kill_process() for the llama-server child started by inference. Use the same backend instance as the load route, or migrate the route to the core singleton before switching shutdown over.
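The fix direction for this singleton split can be sketched as a single shared accessor that the load, list, and shutdown paths all use. The toy `LlamaCppBackend` below only stands in for the real class; the accessor name mirrors the one in the review:

```python
# Toy stand-in for the real LlamaCppBackend; the point is the single
# module-level singleton that every route and the shutdown path share.
class LlamaCppBackend:
    def __init__(self):
        self.is_loaded = False
        self.model = None

    def load(self, model):
        self.model, self.is_loaded = model, True

    def _kill_process(self):
        self.model, self.is_loaded = None, False

_llama_cpp_backend = None

def get_llama_cpp_backend():
    global _llama_cpp_backend
    if _llama_cpp_backend is None:
        _llama_cpp_backend = LlamaCppBackend()
    return _llama_cpp_backend
```

With one accessor, a load performed by the inference route is visible to shutdown and to `/api/models/list`, avoiding the "fresh inactive backend" problem described in this comment and the two related ones below.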


Comment on lines +2709 to 2711
from core.inference.llama_cpp import get_llama_cpp_backend

llama_backend = get_llama_cpp_backend()

P1: Check the active GGUF singleton before deleting cache files

This guard now queries the new core.inference.llama_cpp singleton, but GGUF loads in routes.inference still populate the route-local singleton. In the common case where a user loaded a cached GGUF through /api/inference/load, this llama_backend.is_loaded check sees a fresh inactive backend and allows deleting the files for the model that is currently being served, which can break the running llama-server session. Point this check at the same singleton used by inference loading (or complete the singleton migration first).


Etherll added 2 commits May 9, 2026 23:45
Stub document-parser availability in chat document route tests, by monkeypatching _document_parser_support and _document_parser_unavailable_reasons, to avoid spurious 501 errors in CI when optional parsers are missing.

Update several AST-based tests to consider both the public handler (openai_chat_completions) and the implementation function (_openai_chat_completions_impl), preferring the impl when present, so structural checks still pass after streaming bodies were moved into the implementation. Changes touch test_chat_document_routes.py and multiple tests in test_stream_cancel_registration_timing.py.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a79102a745


Comment on lines +1519 to 1521
from core.inference.llama_cpp import get_llama_cpp_backend

llama_backend = get_llama_cpp_backend()

P2: Use the route-owned GGUF singleton when listing models

This is another occurrence of the GGUF singleton split flagged elsewhere: /api/inference/load still populates routes.inference._llama_cpp_backend, but this endpoint now instantiates/reads core.inference.llama_cpp instead. After loading a GGUF model through the normal route, /api/models/list will see a fresh inactive backend and omit the active GGUF model from loaded_models, so the UI/API no longer reports it as available. Point this check at the same singleton used by the load route, or migrate the load route before switching this import.


) -> tuple[str, list[ExtractedFigure], int, int, int]:
html = file_bytes.decode("utf-8", errors = "replace")
try:
from core.inference._html_to_md import html_to_markdown

P2: Avoid depending on inference imports for HTML cleanup

HTML extraction now imports the stdlib-only converter through the core.inference package, which executes core/inference/__init__.py and pulls in the inference orchestrator before reaching _html_to_md. In extraction-only or partially installed environments where an inference dependency is missing or broken, this catch block returns raw HTML instead of cleaned Markdown, leaving scripts/styles and markup in the prompt even though HTML support is advertised as independent of optional parsers. Move the converter to a lightweight package (or import it without triggering inference initialization) so HTML uploads are still sanitized when inference imports fail.
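One way to satisfy "import it without triggering inference initialization" is to load the helper module directly from its file path, so the package's `__init__.py` never executes. A sketch (the path and module name are illustrative):

```python
import importlib.util

def load_module_standalone(path: str, name: str):
    # Executes only the target file itself; the package's __init__.py (and
    # any heavy inference imports it pulls in) never runs.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Moving `_html_to_md` into a lightweight package, as the comment suggests, is the cleaner long-term fix; this is the minimal-change alternative.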


@rolandtannous rolandtannous marked this pull request as draft May 10, 2026 11:16