
[Data] Apply DataProto to vLLM Inference & Align API with SGLang #967

Open
wheresmyhair wants to merge 1 commit into main from lmflow-vllm-dataproto

Conversation


@wheresmyhair wheresmyhair commented Apr 11, 2026

Overview

  • Apply DataProto to the vllm inference pipeline, aligning its API with the sglang inferencer introduced in #960 (Unified data exchange protocol across modules). This unifies data exchange across inference engines and modernizes the vllm integration.
  • Remove the Ray dependency from the vllm path, paving the way for a Ray-less lmflow implementation.

Detailed Description

DataProto integration

  • VLLMInferencer now returns DataProto instead of list[VLLMInferenceResultWithInput], with prompts in non_tensor_batch["inputs"] and generated text in non_tensor_batch["outputs"]
  • prepare_inputs_for_inference creates DataProto for both sglang and vllm through a unified code path
  • __vllm_inference in HFDecoderModel extracts prompts and sampling params from DataProto, converts to vllm.SamplingParams, and stores outputs back into the proto
  • Inference results are saved/loaded as pickle via DataProto.save_to_disk / load_from_disk
  • inference_results_path now accepts a directory — results are automatically saved as inference_results.pkl inside it
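The result layout and directory-based save path described above can be sketched with a minimal stand-in. To be clear, `MiniDataProto` and `resolve_results_path` below are illustrative placeholders, not the actual lmflow `DataProto` implementation; only the `non_tensor_batch["inputs"/"outputs"]` keys, the pickle round-trip, and the fixed `inference_results.pkl` file name come from this PR:

```python
import pickle
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class MiniDataProto:
    """Stand-in for DataProto: non-tensor payloads keyed by name."""
    non_tensor_batch: dict = field(default_factory=dict)

    def save_to_disk(self, path) -> None:
        # The PR persists results as a pickle via DataProto.save_to_disk.
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load_from_disk(path) -> "MiniDataProto":
        with open(path, "rb") as f:
            return pickle.load(f)


def resolve_results_path(inference_results_path: str) -> Path:
    # Per this PR, inference_results_path may be a directory, in which
    # case results land in inference_results.pkl inside it.
    p = Path(inference_results_path)
    if p.is_dir():
        return p / "inference_results.pkl"
    return p


# Prompts go in "inputs", generated text in "outputs".
proto = MiniDataProto(non_tensor_batch={
    "inputs": ["What is 2+2?"],
    "outputs": ["4"],
})
```

This mirrors the round trip the unit tests exercise: save to a directory, load back, and read `non_tensor_batch["outputs"]`.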

API alignment with sglang and modernization

  • VLLMInferencer now mirrors SGLangInferencer
  • Removed the InferencerWithOffloading base class and all Ray-based distributed inference code; vllm >= 0.8 supports data_parallel_size natively in vllm.LLM() via a multiprocessing backend, with no Ray dependency
  • Added --inference_data_parallel_size argument
  • Total GPUs used = tensor_parallel_size × data_parallel_size
  • Removed use_beam_search from sampling params (dropped in vLLM V1), added deprecation warning
  • Fixed deactivate_model_for_inference — old cleanup code referenced llm_engine.model_executor.driver_worker which no longer exists in V1
  • Added --inference_max_model_len to cap context length (prompt and output) for models with large defaults
  • Bumped vllm version constraint from >=0.4.3 to >=0.8.0 in setup.py
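Two of the points above lend themselves to a short sketch: the GPU accounting (total GPUs is the product of the two parallelism degrees) and the handling of the removed `use_beam_search` flag. Both helpers below are hypothetical illustrations, not code from this PR:

```python
import warnings


def total_gpus(tensor_parallel_size: int, data_parallel_size: int = 1) -> int:
    """Total GPUs a run consumes: one tensor-parallel group per
    data-parallel replica (hypothetical helper, not lmflow code)."""
    if tensor_parallel_size < 1 or data_parallel_size < 1:
        raise ValueError("parallel sizes must be >= 1")
    return tensor_parallel_size * data_parallel_size


def to_vllm_sampling_kwargs(params: dict) -> dict:
    """Filter a flat sampling-params dict before building vllm.SamplingParams.

    Hypothetical sketch of the behavior described above: use_beam_search
    was dropped in vLLM V1, so it is stripped with a deprecation warning
    and everything else passes through unchanged.
    """
    params = dict(params)  # avoid mutating the caller's dict
    if params.pop("use_beam_search", None):
        warnings.warn(
            "use_beam_search was removed in vLLM V1 and is ignored",
            DeprecationWarning,
        )
    return params
```

For example, `tensor_parallel_size=2` with `--inference_data_parallel_size 4` occupies 8 GPUs, and a caller still passing `use_beam_search=True` gets a warning rather than a crash.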

Files changed

| File | Change |
| --- | --- |
| src/lmflow/pipeline/vllm_inferencer.py | Major rewrite: DataProto, aligned API, native DP |
| src/lmflow/models/hf_decoder_model.py | DataProto for vllm, unified prepare_inputs |
| src/lmflow/models/hf_model_mixin.py | DP, max_model_len, V1-compatible deactivation |
| src/lmflow/args.py | New args, dir-based results path |
| src/lmflow/pipeline/sglang_inferencer.py | Dir-based results path |
| src/lmflow/pipeline/utils/memory_safe_vllm_inference.py | Simplified to new API |
| examples/vllm_inference.py | Simplified to match sglang pattern |
| scripts/run_vllm_inference.sh | New script |
| scripts/run_sglang_inference.sh | Updated results path |
| setup.py | vllm >= 0.8.0 |
| tests/pipeline/test_vllm_inferencer.py | New, 8 tests |

Downstream impact

MemorySafeVLLMInferencer is updated to return DataProto. iterative_dpo_aligner.py consumes MemorySafeVLLMInferencer and will need a separate update to handle DataProto instead of list[VLLMInferenceResultWithInput].
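A downstream consumer that previously iterated over list[VLLMInferenceResultWithInput] would adapt along these lines. This is a sketch only: it assumes nothing beyond the `non_tensor_batch["inputs"/"outputs"]` layout stated above, and the `pair_results` name is invented for illustration:

```python
def pair_results(proto) -> list:
    """Zip prompts with generations from a DataProto-style result.

    Old-style code indexed a list of result objects; with DataProto the
    prompts and outputs live in parallel arrays under non_tensor_batch,
    so pairing them is a single zip. `proto` can be any object exposing
    that attribute.
    """
    batch = proto.non_tensor_batch
    return list(zip(batch["inputs"], batch["outputs"]))
```

For iterative_dpo_aligner.py this means replacing per-item attribute access with one pass over the paired arrays when the follow-up update lands.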

Tests

  • 6 unit tests pass (no GPU): sampling params parsing, DataProto save/load round-trip, DataProto repeat logic
  • 2 GPU integration tests pass: full inference pipeline + save/load with Qwen3-0.6B on RTX 4090
  • Ran scripts/run_vllm_inference.sh end-to-end with the target model
