studio: chat document extraction with runtime VLM probe and bounded concurrency #5351
Etherll wants to merge 6 commits into
Conversation
Expose and propagate the server-side document extraction concurrency limit so clients can cap their parallel extractions and avoid 503 busy responses. Backend: export `_EXTRACT_CONCURRENCY`, add `max_extract_concurrency` to `DocumentSupportResponse`, surface a default on import failure, and compute a larger chat JSON body limit (`_OPENAI_CHAT_BODY_MAX_BYTES`) that accounts for embedded image payloads. Also create the extraction task waiter once to properly drain exceptions. Tests updated to assert the new values and chat body behavior. Frontend: add `setExtractionBackendLimit` and apply the backend limit in the extraction queue and UI (cap the slider, adjust stored settings), call `setExtractionBackendLimit` when caching document support, and tweak NDJSON progress wording from “uploaded” to “processed.”
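A client-side sketch of the point of exposing the limit - capping parallel extraction calls to the advertised value. The endpoint paths and the `max_extract_concurrency` field come from this PR; the `httpx` usage and the multipart field name are assumptions:

```python
import asyncio
import httpx

async def extract_all(files: list[tuple[str, bytes]], base_url: str) -> list[str]:
    async with httpx.AsyncClient(base_url = base_url, timeout = None) as client:
        support = (await client.get("/api/inference/document-support")).json()
        # Never run more parallel extractions than the server advertises.
        sem = asyncio.Semaphore(support.get("max_extract_concurrency", 2))

        async def extract_one(name: str, data: bytes) -> str:
            async with sem:
                resp = await client.post(
                    "/api/inference/extract-document",
                    files = {"file": (name, data)},  # field name is an assumption
                )
                return resp.text  # NDJSON progress events, one per line

        return list(await asyncio.gather(*(extract_one(n, d) for n, d in files)))
```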
for more information, see https://pre-commit.ci
Code Review
This pull request implements a document extraction pipeline for the Chat interface, enabling users to upload and process PDF, DOCX, HTML, and text files with optional VLM-powered figure captioning and OCR support. Key additions include server-side parsing logic, a temporary OCR model swap mechanism, and a revamped frontend UI for managing document attachments and previews. Feedback identifies a potential semaphore leak during process initialization and suggests simplifying the cancellation logic for document extraction to avoid unnecessary complexity and potential race conditions.
```python
def _run_extract_process_sync(
    file_bytes: bytes,
    filename: str,
    options: dict,
    content_type: str,
    timeout_seconds: int,
    cancel_event: Optional[threading.Event] = None,
```
This section has two issues:
- Semaphore Leak: If an exception occurs during process initialization (e.g., in `multiprocessing.get_context` or `ctx.Queue`), the acquired semaphore will be leaked because the `try...finally` block hasn't been entered yet. The `try` block should start immediately after the semaphore is successfully acquired.
- Process Context: While `spawn` is more robust in multi-threaded environments like FastAPI, ensure this change is consistent with the repository's established process management patterns to avoid one-off architectural changes.
```python
try:
    ctx = multiprocessing.get_context("spawn")
    result_queue = ctx.Queue(maxsize = 1)
    proc = ctx.Process(
        target = _run_extract_worker,
        args = (result_queue, file_bytes, filename, options, content_type),
        daemon = True,
    )
```

References
- Architectural changes, like resource management or process strategies, should be applied consistently across the codebase rather than as one-off changes.
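A minimal sketch of the reordering the review asks for - the `try` begins immediately after a successful acquire, so a failure in `get_context` or `Queue` can no longer leak the slot. `_run_extract_worker` and the `= ` spacing follow the diff; the queue-wait constant and the return handling are illustrative:

```python
import multiprocessing
import threading

_QUEUE_WAIT_SECONDS = 5.0  # illustrative queue wait window before any 503
_EXTRACT_SEMAPHORE = threading.Semaphore(2)

def _run_extract_process_sync(file_bytes, filename, options, content_type,
                              timeout_seconds, cancel_event = None):
    if not _EXTRACT_SEMAPHORE.acquire(timeout = _QUEUE_WAIT_SECONDS):
        raise RuntimeError("extraction slots busy")  # mapped to the 503 path
    try:
        # Everything that can raise - get_context, Queue, Process - is now
        # inside try/finally, so the acquired slot is always released.
        ctx = multiprocessing.get_context("spawn")
        result_queue = ctx.Queue(maxsize = 1)
        proc = ctx.Process(
            target = _run_extract_worker,
            args = (result_queue, file_bytes, filename, options, content_type),
            daemon = True,
        )
        proc.start()
        proc.join(timeout = timeout_seconds)
        return result_queue.get_nowait()
    finally:
        _EXTRACT_SEMAPHORE.release()
```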
```python
        authorization_header = caption_authorization_header,
        cancel_event = cancel_event,
    )
)
```
The use of `asyncio.shield` combined with a 10-second `wait_for` and a manual `cancel()` is unnecessarily complex for handling client disconnects. Since the extraction task already monitors the `cancel_event` and runs in a separate process that responds to it, simply setting the event and allowing the task to terminate naturally (or awaiting it briefly) would be more straightforward and less prone to race conditions.
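A minimal sketch of the suggested simplification, assuming `extract_task` wraps the extraction coroutine and `cancel_event` is the `threading.Event` the worker process polls (the grace period is illustrative):

```python
import asyncio
import threading

async def on_client_disconnect(
    extract_task: asyncio.Task,
    cancel_event: threading.Event,
    grace_seconds: float = 10.0,
) -> None:
    cancel_event.set()  # the worker process polls this and exits on its own
    # asyncio.wait does not cancel on timeout, so no shield/cancel() dance.
    done, _pending = await asyncio.wait({extract_task}, timeout = grace_seconds)
    if extract_task in done and extract_task.exception() is not None:
        pass  # log and swallow; the client is gone, nothing left to send
```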
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ae1091f8a
```diff
-if _llama_cpp_backend is not None:
-    _llama_cpp_backend._kill_process()
+backend = getattr(_llama_cpp_mod, "_llama_cpp_backend", None)
```
Kill the GGUF backend that inference actually uses
When a GGUF model is loaded through `/api/inference/load`, the active `LlamaCppBackend` is still the singleton in `routes.inference` (the `_llama_cpp_backend` returned by that route module). This shutdown path now checks `core.inference.llama_cpp._llama_cpp_backend` instead, which is a separate lazy singleton and is typically still `None`, so Studio shutdown no longer calls `_kill_process()` for the llama-server child started by inference. Use the same backend instance as the load route, or migrate the route to the core singleton before switching shutdown over.
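A hedged sketch of the second option the comment offers - migrating shutdown (and eventually the load route) onto the core singleton so both sides see the same process handle. Only `get_llama_cpp_backend` and `_kill_process()` come from this PR; the wrapper is illustrative:

```python
from core.inference.llama_cpp import get_llama_cpp_backend

def _shutdown_gguf_backend() -> None:
    # Must resolve the same singleton that /api/inference/load populated,
    # or the llama-server child keeps running after Studio shuts down.
    backend = get_llama_cpp_backend()
    if backend is not None:
        backend._kill_process()
```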
```python
from core.inference.llama_cpp import get_llama_cpp_backend

llama_backend = get_llama_cpp_backend()
```
Check the active GGUF singleton before deleting cache files
This guard now queries the new `core.inference.llama_cpp` singleton, but GGUF loads in `routes.inference` still populate the route-local singleton. In the common case where a user loaded a cached GGUF through `/api/inference/load`, this `llama_backend.is_loaded` check sees a fresh inactive backend and allows deleting the files for the model that is currently being served, which can break the running llama-server session. Point this check at the same singleton used by inference loading (or complete the singleton migration first).
Stub document-parser availability in chat document route tests to avoid spurious 501 errors in CI when optional parsers are missing, by monkeypatching `_document_parser_support` and `_document_parser_unavailable_reasons`. Update several AST-based tests to consider both the public handler (`openai_chat_completions`) and the implementation function (`_openai_chat_completions_impl`), preferring the impl when present, so structural checks still pass after streaming bodies were moved into the implementation. Changes touch `test_chat_document_routes.py` and multiple tests in `test_stream_cancel_registration_timing.py`.
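A hedged sketch of that stubbing, assuming pytest's `monkeypatch`; the module import path is hypothetical, and only the attribute names come from the change:

```python
import pytest

import backend.routes.chat_documents as chat_routes  # hypothetical path

@pytest.fixture(autouse = True)
def _force_parser_support(monkeypatch):
    # Pretend every optional parser is installed so the routes never
    # answer 501 in CI environments that lack pymupdf/mammoth etc.
    monkeypatch.setattr(chat_routes, "_document_parser_support",
                        lambda: True, raising = False)
    monkeypatch.setattr(chat_routes, "_document_parser_unavailable_reasons",
                        lambda: {}, raising = False)
```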
…rll/unsloth into document-extractor-refactor
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a79102a745
```python
from core.inference.llama_cpp import get_llama_cpp_backend

llama_backend = get_llama_cpp_backend()
```
Use the route-owned GGUF singleton when listing models
This is another occurrence of the GGUF singleton split flagged elsewhere: `/api/inference/load` still populates `routes.inference._llama_cpp_backend`, but this endpoint now instantiates/reads `core.inference.llama_cpp` instead. After loading a GGUF model through the normal route, `/api/models/list` will see a fresh inactive backend and omit the active GGUF model from `loaded_models`, so the UI/API no longer reports it as available. Point this check at the same singleton used by the load route, or migrate the load route before switching this import.
```python
) -> tuple[str, list[ExtractedFigure], int, int, int]:
    html = file_bytes.decode("utf-8", errors = "replace")
    try:
        from core.inference._html_to_md import html_to_markdown
```
Avoid depending on inference imports for HTML cleanup
HTML extraction now imports the stdlib-only converter through the `core.inference` package, which executes `core/inference/__init__.py` and pulls in the inference orchestrator before reaching `_html_to_md`. In extraction-only or partially installed environments where an inference dependency is missing or broken, this catch block returns raw HTML instead of cleaned Markdown, leaving scripts/styles and markup in the prompt even though HTML support is advertised as independent of optional parsers. Move the converter to a lightweight package (or import it without triggering inference initialization) so HTML uploads are still sanitized when inference imports fail.
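A minimal sketch of the import-without-package-init idea, assuming the converter lives at `core/inference/_html_to_md.py` relative to the extractor and exposes `html_to_markdown`:

```python
import importlib.util
from pathlib import Path

def _load_html_to_markdown():
    # Load the stdlib-only converter straight from its file so that
    # core/inference/__init__.py (and the inference orchestrator it
    # pulls in) never executes on the HTML extraction path.
    path = Path(__file__).resolve().parent.parent / "inference" / "_html_to_md.py"
    spec = importlib.util.spec_from_file_location("_html_to_md", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.html_to_markdown
```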
Description
Adds first-class document extraction to the Studio chat composer. Users drop a PDF, DOCX, HTML, Markdown, plain-text or source-code file into the composer; the backend converts it to layout-aware Markdown, optionally captions selected figures through the user's currently-loaded vision model, and the result is spliced into the outgoing chat message. The whole pipeline streams progress back to the UI, respects a server-side concurrency budget, and gracefully degrades when optional parsers or a vision model are not present.
Motivation
The previous flow had no way to attach a document to a chat - only inline images. Two further pain points are addressed:

- The `VISION_ARCHITECTURES` allow-list silently excluded legitimate new vision architectures (DeepSeek-OCR, PaddleOCR-VL, GLM-OCR, …) and could not see the user's actually loaded model. It is replaced by a runtime probe.

Backend
- `studio/backend/core/chat/document_extractor.py` (1.1k LOC). Parses PDFs via `pymupdf` + `pymupdf4llm`, DOCX via `mammoth`, HTML via Markdown conversion, and plain text / Markdown / source files as UTF-8 with replacement. Bounded by `_EXTRACT_SEMAPHORE` (`UNSLOTH_STUDIO_EXTRACT_CONCURRENCY`, default 2) with a queue wait window before any 503. Per-format dependency checks keep plain-text formats working when optional libs are missing.
- `studio/backend/core/chat/vlm_capability.py`. A read-only runtime probe across the three Studio inference backends - embedded llama-server (GGUF), `transformers`, and Unsloth/LoRA - exposed as a single immutable `VlmCapability` dataclass (`is_vlm`, `endpoint_url`, `model_name`, `source`, `reason`); see the sketch after this list. When the loaded model is vision-capable, the extractor captions selected figures through its OpenAI-compatible `/v1/chat/completions` endpoint; otherwise figures come back with `caption=None` and a human-readable `describe_skipped_reason`.
- `GET /api/inference/document-support` and `POST /api/inference/extract-document`. The latter streams NDJSON `{stage: "parsing" | "captioning" | "done" | "result" | "error", …}` events; `document-support` advertises supported formats, parser availability, and `max_extract_concurrency`.
- `Content-Length` enforcement, streaming overflow rejection for raw uploads and multipart bodies, proper task-waiter exception draining, a larger chat JSON body limit (`_OPENAI_CHAT_BODY_MAX_BYTES`) that accounts for embedded image payloads; `hf_token` is no longer accepted as a query string on probe endpoints.
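A sketch of the probe result from the field list above (the types, defaults, and `source` values are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen = True)
class VlmCapability:
    is_vlm: bool
    endpoint_url: Optional[str]  # OpenAI-compatible /v1/chat/completions base URL
    model_name: Optional[str]
    source: str                  # e.g. "llama-server", "transformers", "unsloth"
    reason: Optional[str]        # human-readable skip reason when is_vlm is False
```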
Frontend

- A `runWithTemporaryOcrModel` orchestrator that - if the user picked a dedicated OCR preset - validates, unloads the active chat model, loads the OCR model, runs the extraction, then restores the prior snapshot. Concurrent uploads are serialized through a module-level promise queue; manual mid-run swaps are never overwritten.
- `DocAttachmentChip`, `DocumentPreviewSheet`, and `DocumentStack` for grouped attachments. Three OCR presets shipped with model defaults: `deepseek-ai/DeepSeek-OCR`, `unsloth/PaddleOCR-VL`, `zai-org/GLM-OCR`.
- `extraction-queue.ts` honours the backend's advertised concurrency limit; the settings sheet caps the user-facing slider accordingly while preserving an explicit user override.

Tests
- `test_chat_document_extraction.py` (900 LOC): VLM probe across backend shapes, dataclass round-trips, format detection, optional-dep skips.
- `test_chat_document_routes.py` (1093 LOC): oversized-request rejection, streaming-overflow rejection on raw + multipart, NDJSON shape, end-to-end fakes.
- `test_inference_worker.py`, `test_models_get_model_config_case_resolution.py`, `test_anthropic_messages.py`, `test_openai_tool_passthrough.py`, `test_vision_cache.py` updated for the new probe and request limits.

Notes
- `use_vlm_ocr` renders bounded page images for the loaded vision model to describe.
- `SUPPORTED_SUFFIXES` / `SUPPORTED_MIME_TYPES`.
- No `crates/ui` API breakage - the `## Breaking Changes` section is intentionally omitted.