
Aarch64 GB200 container perf regression from vllm 0.17 to vllm 0.18 #5964

@kaixih

Description


System Info

vLLM 0.18 Performance Regression on GB200 (aarch64)

Summary

We compared vLLM 0.17 (stock) against vLLM 0.18 (pip install vllm==0.18.0) on GB200 nodes.
Training reward curves (critic/rewards/mean) match closely between the two versions,
confirming correctness. However, vLLM 0.18 shows a ~10% end-to-end throughput regression.

Observations

  • perf/throughput: vLLM 0.18 is ~10% slower than vLLM 0.17.
  • perf/mfu/actor (training): nearly identical between versions — no regression in the training phase.
  • perf/mfu/actor_infer (rollout): vLLM 0.18 is noticeably lower than vLLM 0.17 — clear gap.

Conclusion

The regression is isolated to the rollout/inference phase and does not affect training.
This points to a performance issue introduced in the vLLM 0.18 rollout backend on aarch64/GB200.
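
As a rough illustration of how the ~10% figure is computed (the throughput samples below are hypothetical placeholders, not measurements from these runs; the real values come from the perf/throughput curves logged to W&B):

```python
# Hypothetical per-step throughput samples (tokens/s) for the two runs.
v017 = [100.0, 102.0, 98.0, 101.0]  # vLLM 0.17 baseline
v018 = [90.0, 91.5, 88.0, 90.5]     # vLLM 0.18

mean_017 = sum(v017) / len(v017)
mean_018 = sum(v018) / len(v018)

# Relative regression of 0.18 vs. the 0.17 baseline.
regression = (mean_017 - mean_018) / mean_017
print(f"{regression:.1%} slower")  # ~10% with these placeholder numbers
```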

Screenshots

(W&B screenshots: perf/throughput, perf/mfu/actor, and perf/mfu/actor_infer curves for the two runs.)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Reproduction Steps

Environment

  • Hardware: NVIDIA GB200 NVL4 (aarch64), 4 GPUs per node
  • verl commit: 516657fa
  • vLLM 0.17: stock (as shipped in docker/Dockerfile.stable.vllm)
  • vLLM 0.18: pip install vllm==0.18.0 on top of the same base image

1. Build the container

One image is built and used for both runs. The vLLM 0.18 variant upgrades vLLM inside the container at runtime (step 3).

# Clone verl at the tested commit
git clone https://github.com/volcengine/verl.git
cd verl
git checkout 516657fa

# Build the aarch64 image (runs on a GB200 node or via buildx)
docker build -f docker/Dockerfile.stable.vllm \
    -t verl:vllm-arm64 .

2. Prepare the dataset

docker run --rm \
    -v /path/to/models:/models \
    -v $(pwd):/workspace/verl \
    -w /workspace/verl \
    verl:vllm-arm64 \
    python3 examples/data_preprocess/gsm8k.py \
        --local_save_dir /models/verl-datasets/gsm8k

The dataset is saved to /models/verl-datasets/gsm8k/{train,test}.parquet.

3. Run training

vLLM 0.17 (baseline) — use the image as-is:

docker run --rm --gpus all --privileged --shm-size=128g --network=host \
    -v /path/to/models:/models \
    -v $(pwd):/workspace/verl \
    -w /workspace/verl \
    verl:vllm-arm64 \
    python3 -m verl.trainer.main_ppo \
        algorithm.adv_estimator=grpo \
        data.train_files=/models/verl-datasets/gsm8k/train.parquet \
        data.val_files=/models/verl-datasets/gsm8k/test.parquet \
        data.train_batch_size=1024 \
        data.max_prompt_length=512 \
        data.max_response_length=1024 \
        data.filter_overlong_prompts=True \
        data.truncation='error' \
        actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
        actor_rollout_ref.actor.optim.lr=1e-6 \
        actor_rollout_ref.model.use_remove_padding=False \
        actor_rollout_ref.actor.ppo_mini_batch_size=512 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=128 \
        actor_rollout_ref.actor.use_kl_loss=True \
        actor_rollout_ref.actor.kl_loss_coef=0.001 \
        actor_rollout_ref.actor.kl_loss_type=low_var_kl \
        actor_rollout_ref.actor.entropy_coeff=0 \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.fsdp_config.param_offload=False \
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
        actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
        actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=80 \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
        actor_rollout_ref.rollout.enforce_eager=False \
        actor_rollout_ref.rollout.enable_chunked_prefill=True \
        actor_rollout_ref.rollout.n=5 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=160 \
        actor_rollout_ref.ref.fsdp_config.param_offload=True \
        actor_rollout_ref.ref.fsdp_config.forward_prefetch=True \
        algorithm.use_kl_in_reward=False \
        trainer.critic_warmup=0 \
        trainer.logger='["console","wandb"]' \
        trainer.project_name='verl_grpo_example_gsm8k' \
        trainer.experiment_name=vllm_vllm-0.17 \
        trainer.n_gpus_per_node=4 \
        +ray_kwargs.ray_init.num_gpus=4 \
        trainer.nnodes=1 \
        trainer.save_freq=20 \
        trainer.test_freq=1000 \
        trainer.total_epochs=15

vLLM 0.18 — launch the same image, upgrade vLLM, then run training:

docker run --rm --gpus all --privileged --shm-size=128g --network=host \
    -v /path/to/models:/models \
    -v $(pwd):/workspace/verl \
    -w /workspace/verl \
    verl:vllm-arm64 \
    bash -c "pip install vllm==0.18.0 && python3 -m verl.trainer.main_ppo \
        ... (same args as above) \
        trainer.experiment_name=vllm_vllm-0.18"
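
Since the upgrade happens at container runtime, it's worth confirming the interpreter actually picked up vLLM 0.18 before training starts. A minimal stdlib-only check (nothing here is vLLM-specific beyond the package name):

```python
import importlib.metadata


def installed_version(pkg: str):
    """Return the installed version string of pkg, or None if not installed."""
    try:
        return importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        return None


# Inside the upgraded container this should report 0.18.0.
print(installed_version("vllm") or "vllm not installed")
```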

Expected behavior

We will first profile the rollout backend to root-cause the regression. cc @borisfom

Metadata

Assignees: none
Labels: bug (Something isn't working)