System Info
vLLM 0.18 Performance Regression on GB200 (aarch64)
Summary
Compared vLLM 0.17 (stock) vs vLLM 0.18 (pip install vllm==0.18.0) on GB200 nodes.
Training reward curves (critic/rewards/mean) match closely between the two versions,
confirming correctness. However, vLLM 0.18 shows a ~10% throughput regression.
Observations
- perf/throughput: vLLM 0.18 is ~10% slower than vLLM 0.17.
- perf/mfu/actor (training): nearly identical between versions; no regression in the training phase.
- perf/mfu/actor_infer (rollout): vLLM 0.18 is noticeably lower than vLLM 0.17, a clear gap.
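The ~10% figure is the relative drop in perf/throughput between the two runs. A minimal sketch of that computation (the numbers below are illustrative placeholders, not the measured readings):

```python
def relative_slowdown(baseline_tps: float, candidate_tps: float) -> float:
    """Fractional throughput drop of candidate relative to baseline."""
    return (baseline_tps - candidate_tps) / baseline_tps

# Illustrative values only, not the actual perf/throughput numbers:
drop = relative_slowdown(1000.0, 900.0)
print(f"{drop:.0%}")  # prints "10%"
```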
Conclusion
The regression is isolated to the rollout/inference phase and does not affect training.
This points to a performance issue introduced in the vLLM 0.18 rollout backend on aarch64/GB200.
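For context on the perf/mfu/* metrics: MFU is achieved model FLOPs divided by peak hardware FLOPs. A hedged sketch using the common 6N-FLOPs-per-token training approximation (the parameter count and peak-FLOPs inputs below are placeholders, not GB200 specifications):

```python
def train_mfu(tokens_per_sec: float, n_params: float,
              peak_flops_per_gpu: float, n_gpus: int) -> float:
    # ~6 FLOPs per parameter per token (forward + backward), a standard
    # approximation; ignores attention-length and recompute terms.
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / (peak_flops_per_gpu * n_gpus)

# Placeholder inputs: 7B params, 4 GPUs, assumed 1 PFLOP/s peak per GPU.
print(train_mfu(tokens_per_sec=1e4, n_params=7e9,
                peak_flops_per_gpu=1e15, n_gpus=4))
```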
Screenshots
(W&B charts comparing perf/throughput, perf/mfu/actor, and perf/mfu/actor_infer across the two runs; images not reproduced here.)
Reproduction
Reproduction Steps
Environment
- Hardware: NVIDIA GB200 NVL4 (aarch64), 4 GPUs per node
- verl commit: 516657fa
- vLLM 0.17: stock (as shipped in docker/Dockerfile.stable.vllm)
- vLLM 0.18: pip install vllm==0.18.0 on top of the same base image
1. Build the container
One image is built and used for both runs. The vLLM 0.18 variant upgrades vLLM inside the container at runtime (step 3).
# Clone verl at the tested commit
git clone https://github.com/volcengine/verl.git
cd verl
git checkout 516657fa
# Build the aarch64 image (runs on a GB200 node or via buildx)
docker build -f docker/Dockerfile.stable.vllm \
-t verl:vllm-arm64 .
2. Prepare the dataset
docker run --rm \
-v /path/to/models:/models \
-v $(pwd):/workspace/verl \
-w /workspace/verl \
verl:vllm-arm64 \
python3 examples/data_preprocess/gsm8k.py \
--local_save_dir /models/verl-datasets/gsm8k
Dataset is saved to /models/verl-datasets/gsm8k/{train,test}.parquet.
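A quick sanity check that step 2 produced the expected files (hypothetical helper; the root path matches the --local_save_dir above):

```python
from pathlib import Path

def gsm8k_parquet_files(root: str = "/models/verl-datasets/gsm8k") -> dict:
    """Expected outputs of examples/data_preprocess/gsm8k.py."""
    base = Path(root)
    return {split: base / f"{split}.parquet" for split in ("train", "test")}

# Report any file the preprocessing step failed to write.
missing = [str(p) for p in gsm8k_parquet_files().values() if not p.exists()]
if missing:
    print("missing dataset files:", missing)
```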
3. Run training
vLLM 0.17 (baseline) — use the image as-is:
docker run --rm --gpus all --privileged --shm-size=128g --network=host \
-v /path/to/models:/models \
-v $(pwd):/workspace/verl \
-w /workspace/verl \
verl:vllm-arm64 \
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=/models/verl-datasets/gsm8k/train.parquet \
data.val_files=/models/verl-datasets/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.ppo_mini_batch_size=512 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=128 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=80 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.enable_chunked_prefill=True \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=160 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.ref.fsdp_config.forward_prefetch=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name=vllm_vllm-0.17 \
trainer.n_gpus_per_node=4 \
+ray_kwargs.ray_init.num_gpus=4 \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=1000 \
trainer.total_epochs=15
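As a cross-check on the rollout load this config implies, here is simple arithmetic on the command-line values above (how verl actually shards these across workers is not verified here):

```python
# Values taken from the command line above.
train_batch_size = 1024   # data.train_batch_size
rollout_n = 5             # actor_rollout_ref.rollout.n
n_gpus = 4                # trainer.n_gpus_per_node * trainer.nnodes

# Each prompt is sampled rollout_n times per step.
sequences_per_step = train_batch_size * rollout_n
per_gpu = sequences_per_step // n_gpus

print(sequences_per_step, per_gpu)  # prints "5120 1280"
```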
vLLM 0.18 — launch the same image, upgrade vLLM, then run training:
docker run --rm --gpus all --privileged --shm-size=128g --network=host \
-v /path/to/models:/models \
-v $(pwd):/workspace/verl \
-w /workspace/verl \
verl:vllm-arm64 \
bash -c "pip install vllm==0.18.0 && python3 -m verl.trainer.main_ppo \
... (same args as above) \
trainer.experiment_name=vllm_vllm-0.18"
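Because the 0.18 run upgrades vLLM at container start, it is worth confirming the upgrade took effect before training begins. A minimal sketch of such a check (the version parsing is a hypothetical helper; importlib.metadata is standard library):

```python
import re
from importlib.metadata import PackageNotFoundError, version

def parse_version(v: str) -> tuple:
    """'0.18.0' -> (0, 18, 0); keeps only leading numeric fields."""
    parts = []
    for field in v.split(".")[:3]:
        m = re.match(r"\d+", field)
        if not m:
            break
        parts.append(int(m.group()))
    return tuple(parts)

def vllm_at_least(required: str = "0.18.0") -> bool:
    """True if the installed vllm meets the required version."""
    try:
        return parse_version(version("vllm")) >= parse_version(required)
    except PackageNotFoundError:
        return False
```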
Expected behavior
We will first profile the rollout backend to root cause the regression. @borisfom