Skip to content

feat: batch text-only MLX VLM requests#4918

Open
qinxuye wants to merge 11 commits into
xorbitsai:mainfrom
qinxuye:feat/mlx-vlm-batch
Open

feat: batch text-only MLX VLM requests#4918
qinxuye wants to merge 11 commits into
xorbitsai:mainfrom
qinxuye:feat/mlx-vlm-batch

Conversation

@qinxuye
Copy link
Copy Markdown
Contributor

@qinxuye qinxuye commented May 16, 2026

Summary

  • enable continuous batching for text-only MLX VLM requests via the underlying language model
  • keep image requests on the mlx-vlm path
  • isolate MLX batch generator state per model instance and add a VLM text-only parallel regression test

Tests

  • python -m py_compile xinference/model/llm/mlx/core.py xinference/model/llm/mlx/tests/test_mlx.py
  • python -m black --check xinference/model/llm/mlx/core.py xinference/model/llm/mlx/tests/test_mlx.py
  • PYTHONPATH=... python -X faulthandler -m pytest xinference/model/llm/mlx/tests/test_mlx.py::test_mlx_vision_text_only_parallel_inference -q -s
  • PYTHONPATH=... python -X faulthandler -m pytest xinference/model/llm/mlx/tests/test_mlx.py::test_load_mlx_vision -q -s

@XprobeBot XprobeBot added this to the v2.x milestone May 16, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables continuous batching for text-only requests in MLX Vision models by introducing the _MLXLogitsModelAdapter and refactoring MLXBatchModel to use instance-level state instead of class-level variables. It also includes a new test case to verify parallel inference for text-only prompts. Feedback was provided to simplify the logic in _is_text_only_prompt for better readability and maintainability.

Comment thread xinference/model/llm/mlx/core.py Outdated
@qinxuye
Copy link
Copy Markdown
Contributor Author

qinxuye commented May 16, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables continuous batching for text-only requests in MLX vision models by introducing a logits adapter and routing logic. It also refactors MLXBatchModel to use instance-level storage for improved state isolation. A review comment suggests making the text-only prompt detection more robust by checking for additional modalities like video and audio.

Comment thread xinference/model/llm/mlx/core.py Outdated
@qinxuye
Copy link
Copy Markdown
Contributor Author

qinxuye commented May 16, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors MLXBatchModel to use instance-level state instead of class-level variables and enables continuous batching for text-only prompts within MLXVisionModel. It introduces a _MLXLogitsModelAdapter to bridge mlx-vlm models with the mlx-lm API and includes new tests for prompt detection and parallel inference. Feedback was provided to move the _lock attribute in MLXBatchModel from the class level to the instance level to prevent unnecessary performance contention when multiple models are running in parallel.

Comment thread xinference/model/llm/mlx/core.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants