
Aarch64 GB200 container perf regression from vllm 0.17 to vllm 0.18 #5964

@kaixih

Description


System Info

vLLM 0.18 Performance Regression on GB200 (aarch64)

Summary

We compared vLLM 0.17 (stock) against vLLM 0.18 (pip install vllm==0.18.0) on GB200 nodes.
Training reward curves (critic/rewards/mean) match closely between the two versions,
confirming correctness. However, vLLM 0.18 shows a ~10% end-to-end throughput regression.

Observations

  • perf/throughput: vLLM 0.18 is ~10% slower than vLLM 0.17.
  • perf/mfu/actor (training): nearly identical between versions — no regression in the training phase.
  • perf/mfu/actor_infer (rollout): vLLM 0.18 is noticeably lower than vLLM 0.17 — clear gap.

Conclusion

The regression is isolated to the rollout/inference phase and does not affect training.
This points to a performance issue introduced in the vLLM 0.18 rollout backend on aarch64/GB200.
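
As a rough illustration of how the ~10% figure is computed (the throughput samples below are hypothetical placeholders, not measurements from these runs; the real values come from the perf/throughput curves logged to W&B):

```python
# Hypothetical per-step throughput samples (tokens/s) for the two runs.
v017 = [100.0, 102.0, 98.0, 101.0]  # vLLM 0.17 baseline
v018 = [90.0, 91.5, 88.0, 90.5]     # vLLM 0.18

mean_017 = sum(v017) / len(v017)
mean_018 = sum(v018) / len(v018)

# Relative regression of 0.18 vs. the 0.17 baseline.
regression = (mean_017 - mean_018) / mean_017
print(f"{regression:.1%} slower")  # ~10% with these placeholder numbers
```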

Screenshots

(W&B screenshots: perf/throughput, perf/mfu/actor, and perf/mfu/actor_infer curves for the two runs.)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Reproduction Steps

Environment

  • Hardware: NVIDIA GB200 NVL4 (aarch64), 4 GPUs per node
  • verl commit: 516657fa
  • vLLM 0.17: stock (as shipped in docker/Dockerfile.stable.vllm)
  • vLLM 0.18: pip install vllm==0.18.0 on top of the same base image

1. Build the container

One image is built and used for both runs. The vLLM 0.18 variant upgrades vLLM inside the container at runtime (step 3).

# Clone verl at the tested commit
git clone https://github.com/volcengine/verl.git
cd verl
git checkout 516657fa

# Build the aarch64 image (runs on a GB200 node or via buildx)
docker build -f docker/Dockerfile.stable.vllm \
    -t verl:vllm-arm64 .

2. Prepare the dataset

docker run --rm \
    -v /path/to/models:/models \
    -v $(pwd):/workspace/verl \
    -w /workspace/verl \
    verl:vllm-arm64 \
    python3 examples/data_preprocess/gsm8k.py \
        --local_save_dir /models/verl-datasets/gsm8k

The dataset is saved to /models/verl-datasets/gsm8k/{train,test}.parquet.

3. Run training

vLLM 0.17 (baseline) — use the image as-is:

docker run --rm --gpus all --privileged --shm-size=128g --network=host \
    -v /path/to/models:/models \
    -v $(pwd):/workspace/verl \
    -w /workspace/verl \
    verl:vllm-arm64 \
    python3 -m verl.trainer.main_ppo \
        algorithm.adv_estimator=grpo \
        data.train_files=/models/verl-datasets/gsm8k/train.parquet \
        data.val_files=/models/verl-datasets/gsm8k/test.parquet \
        data.train_batch_size=1024 \
        data.max_prompt_length=512 \
        data.max_response_length=1024 \
        data.filter_overlong_prompts=True \
        data.truncation='error' \
        actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
        actor_rollout_ref.actor.optim.lr=1e-6 \
        actor_rollout_ref.model.use_remove_padding=False \
        actor_rollout_ref.actor.ppo_mini_batch_size=512 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=128 \
        actor_rollout_ref.actor.use_kl_loss=True \
        actor_rollout_ref.actor.kl_loss_coef=0.001 \
        actor_rollout_ref.actor.kl_loss_type=low_var_kl \
        actor_rollout_ref.actor.entropy_coeff=0 \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.fsdp_config.param_offload=False \
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
        actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
        actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=80 \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
        actor_rollout_ref.rollout.enforce_eager=False \
        actor_rollout_ref.rollout.enable_chunked_prefill=True \
        actor_rollout_ref.rollout.n=5 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=160 \
        actor_rollout_ref.ref.fsdp_config.param_offload=True \
        actor_rollout_ref.ref.fsdp_config.forward_prefetch=True \
        algorithm.use_kl_in_reward=False \
        trainer.critic_warmup=0 \
        trainer.logger='["console","wandb"]' \
        trainer.project_name='verl_grpo_example_gsm8k' \
        trainer.experiment_name=vllm_vllm-0.17 \
        trainer.n_gpus_per_node=4 \
        +ray_kwargs.ray_init.num_gpus=4 \
        trainer.nnodes=1 \
        trainer.save_freq=20 \
        trainer.test_freq=1000 \
        trainer.total_epochs=15

vLLM 0.18 — launch the same image, upgrade vLLM, then run training:

docker run --rm --gpus all --privileged --shm-size=128g --network=host \
    -v /path/to/models:/models \
    -v $(pwd):/workspace/verl \
    -w /workspace/verl \
    verl:vllm-arm64 \
    bash -c "pip install vllm==0.18.0 && python3 -m verl.trainer.main_ppo \
        ... (same args as above) \
        trainer.experiment_name=vllm_vllm-0.18"
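
Since the upgrade happens at container runtime, it's worth confirming the interpreter actually picked up vLLM 0.18 before training starts. A minimal stdlib-only check (nothing here is vLLM-specific beyond the package name):

```python
import importlib.metadata


def installed_version(pkg: str):
    """Return the installed version string of pkg, or None if not installed."""
    try:
        return importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        return None


# Inside the upgraded container this should report 0.18.0.
print(installed_version("vllm") or "vllm not installed")
```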

Expected behavior

We will first profile the rollout backend to root-cause the regression. cc @borisfom

Metadata

Assignees: none
Labels: bug (Something isn't working)