This document describes AReaL's allocation mode system, which controls how GPUs are distributed between inference and training backends during distributed RL training.
Each engine component (actor, critic, rollout, ref, teacher) has its own backend
configuration field that specifies:
- Which backend to use (SGLang, vLLM for inference; FSDP, Megatron, Archon for training)
- The parallelization strategy
- The total number of GPUs required
AReaL parses each backend string into a ModelAllocation object that drives resource
allocation for that specific engine.
Each engine in the YAML config has its own `backend` field:

```yaml
# Rollout (inference) engine
rollout:
  backend: "sglang:d4t2"

# Actor (training) engine
actor:
  backend: "fsdp:d8"

# Critic engine (falls back to actor.backend if empty)
critic:
  backend: ""

# Ref engine (falls back to actor.backend if empty)
ref:
  backend: ""
```

When `critic.backend` or `ref.backend` is empty, it automatically inherits from
`actor.backend`.
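The fallback rule can be illustrated with a minimal sketch. The `resolve_backends` helper and the plain-dict config shape are hypothetical and are not part of AReaL's API; only the inheritance rule itself comes from the text above.

```python
def resolve_backends(config: dict) -> dict:
    """Return the effective backend string for each engine.

    Hypothetical helper: an empty critic/ref backend inherits actor.backend.
    """
    actor_backend = config["actor"]["backend"]
    resolved = {}
    for engine, section in config.items():
        backend = section.get("backend", "")
        # Only critic and ref are documented as falling back to the actor.
        if not backend and engine in ("critic", "ref"):
            backend = actor_backend
        resolved[engine] = backend
    return resolved


config = {
    "rollout": {"backend": "sglang:d4t2"},
    "actor": {"backend": "fsdp:d8"},
    "critic": {"backend": ""},
    "ref": {"backend": ""},
}
effective = resolve_backends(config)
print(effective["critic"], effective["ref"])  # fsdp:d8 fsdp:d8
```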
Note: The top-level `allocation_mode` config field is deprecated and only retained for backward compatibility with legacy SPMD launchers (local/ray/slurm). It is ignored by the single-controller scheduler. Use the per-engine `backend` fields shown above instead.

```
<backend>:<parallelism_dims>
```
For example, fsdp:d4t2 means: use the FSDP backend with data parallelism 4 and tensor
parallelism 2.
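To make the syntax concrete, here is a minimal parsing sketch. The `parse_backend` function is hypothetical and is not AReaL's parser; it returns a plain tuple rather than a `ModelAllocation` object, and it does not handle the hybrid MoE syntax.

```python
import re


def parse_backend(spec: str) -> tuple[str, dict[str, int]]:
    """Split '<backend>:<parallelism_dims>' into (backend, {abbrev: size})."""
    backend, sep, dims = spec.partition(":")
    if not sep or not backend or not dims:
        raise ValueError(f"expected '<backend>:<parallelism_dims>', got {spec!r}")

    tokens = re.findall(r"([a-z])(\d+)", dims)
    # findall silently skips stray characters, so check nothing was dropped.
    if "".join(abbrev + size for abbrev, size in tokens) != dims:
        raise ValueError(f"malformed dimensions: {dims!r}")

    parsed: dict[str, int] = {}
    for abbrev, size in tokens:
        if abbrev in parsed:
            raise ValueError(f"dimension {abbrev!r} given twice in {spec!r}")
        parsed[abbrev] = int(size)
    return backend, parsed


print(parse_backend("fsdp:d4t2"))  # ('fsdp', {'d': 4, 't': 2})
```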

| Dimension | Abbreviation | Description | Valid For |
|---|---|---|---|
| Data | `d` | Number of model replicas | All backends |
| Tensor | `t` | Split operations across GPUs | All backends |
| Pipeline | `p` | Split layers across GPUs in stages | Megatron, Archon |
| Context | `c` | Split sequence length across GPUs | All backends |
| Expert | `e` | Split MoE experts across GPUs | Megatron, Archon |

Dimensions are specified as `<abbrev><size>`, e.g., `d4t2` means data parallel size 4
and tensor parallel size 2.
The total GPUs for a component is computed as:
world_size = dp × tp × pp × cp
Expert parallelism (e) does not increase world size—it redistributes how experts are
placed within the existing GPU mesh.
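The formula can be checked with a short sketch. The `world_size` helper is hypothetical, and it assumes that any omitted dimension has size 1, which matches the examples in this document (for instance, `fsdp:d8` uses 8 GPUs).

```python
from math import prod


def world_size(dims: dict[str, int]) -> int:
    """GPUs for one component: dp x tp x pp x cp.

    Omitted dimensions are assumed to be 1; 'e' is deliberately excluded
    because expert parallelism reuses the existing mesh.
    """
    return prod(dims.get(abbrev, 1) for abbrev in ("d", "t", "p", "c"))


print(world_size({"d": 2, "p": 2, "t": 4}))          # 16
print(world_size({"d": 2, "p": 2, "t": 4, "e": 4}))  # 16, expert size is ignored
```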

| Backend String | GPUs per Engine | Notes |
|---|---|---|
| `fsdp:d8` | 8 | 8 data-parallel replicas |
| `sglang:d2t4` | 8 | 2 instances × 4 TP GPUs |
| `megatron:d2p2t4` | 16 | 2 DP × 2 PP × 4 TP |
| `megatron:d2p2t4e4` | 16 | Same mesh, 4-way expert parallelism |

```yaml
# 16-GPU setup: 8 inference + 8 training
rollout:
  backend: "sglang:d2t4"   # 2 × 4 = 8 GPUs

actor:
  backend: "fsdp:d4t2"     # 4 × 2 = 8 GPUs
```

| Backend | Supported Dimensions |
|---|---|
| `sglang` | `d`, `t` |
| `vllm` | `d`, `t`, `p` |

For inference, d represents the number of independent server instances, and each
instance uses t × p GPUs.
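As a rough illustration of how an inference backend string maps to server instances, consider the sketch below. The `inference_layout` helper and its return shape are hypothetical; omitted `t` and `p` are assumed to default to 1.

```python
def inference_layout(d: int, t: int = 1, p: int = 1) -> dict[str, int]:
    """Describe the GPU layout implied by an inference backend string."""
    gpus_per_instance = t * p  # each server instance spans t x p GPUs
    return {
        "instances": d,  # d independent server instances
        "gpus_per_instance": gpus_per_instance,
        "total_gpus": d * gpus_per_instance,
    }


# sglang:d2t4 -> 2 instances, 4 GPUs each, 8 GPUs in total
print(inference_layout(d=2, t=4))
```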
Note that the internal backend configurations do not affect how AReaL allocates GPUs.
Given rollout.backend: "sglang:d4t4", you can also configure sglang.dp_size=4,
sglang.ep_size=4, and sglang.enable_dp_attention=True. In this case, we launch 4
model replicas each with 4 GPUs. Within each instance, SGLang will still use DP
attention and expert parallelism to distribute computations in attention and expert
layers.

| Backend | Supported Dimensions | Use Case |
|---|---|---|
| `fsdp` | `d`, `t`, `c` | Default for simple parallelism |
| `megatron` | `d`, `t`, `p`, `c`, `e` | Required for pipeline or expert parallel |
| `archon` | `d`, `t`, `p`, `c`, `e` | Alternative to Megatron (experimental) |

Important: An explicit backend prefix is required in all allocation strings. Bare dimension strings (e.g., `d4t2`) are no longer accepted. Always specify the backend explicitly: `fsdp:d4t2`, `megatron:d2p2t4`, `sglang:d4t2`.
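The prefix requirement and the per-backend dimension tables can be combined into a simple check. This is a sketch only: `SUPPORTED_DIMS` and `validate_backend` are hypothetical names, the table is transcribed from the supported-dimension tables in this document, and the hybrid MoE syntax is explicitly out of scope.

```python
# Transcribed from the supported-dimension tables in this document.
SUPPORTED_DIMS = {
    "sglang": set("dt"),
    "vllm": set("dtp"),
    "fsdp": set("dtc"),
    "megatron": set("dtpce"),
    "archon": set("dtpce"),
}


def validate_backend(spec: str) -> None:
    """Raise ValueError if the prefix is missing or a dimension is unsupported."""
    backend, sep, dims = spec.partition(":")
    if not sep:
        raise ValueError(f"missing backend prefix in {spec!r}")
    if backend not in SUPPORTED_DIMS:
        raise ValueError(f"unknown backend {backend!r}")
    if dims.startswith("("):
        raise NotImplementedError("hybrid syntax is not covered by this sketch")
    used = {ch for ch in dims if ch.isalpha()}
    unsupported = used - SUPPORTED_DIMS[backend]
    if unsupported:
        raise ValueError(f"{backend} does not support {sorted(unsupported)}")


validate_backend("fsdp:d4t2")        # passes silently
validate_backend("megatron:d2p2t4")  # passes silently
```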
For Mixture-of-Experts models, Megatron/Archon supports different parallelism strategies for attention and FFN (expert) modules using the hybrid syntax:
```
megatron:(attn:<attn_dims>|ffn:<ffn_dims>)
```
This enables MoE Parallel Folding, which reduces the minimum GPU requirement for combined context and expert parallelism.
- Pipeline parallel size (`p`) must be identical for `attn` and `ffn`
- World size must match (if `d` is omitted in `ffn`, it is derived automatically)
- Expert parallel (`e`) is only valid in the `ffn` section
```yaml
actor:
  backend: "megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)"
```

| Module | dp | pp | tp | cp | ep | World Size |
|---|---|---|---|---|---|---|
| attn | 4 | 2 | 2 | 2 | - | 32 |
| ffn | 2 | 2 | 4 | - | 2 | 32 |

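The hybrid constraints can be sketched as a small validator. Everything here is an assumption made for illustration: `parse_hybrid` is a hypothetical name, omitted dimensions default to 1, and the `ffn` world size is taken as dp × pp × tp × ep, which is inferred from the table above rather than from AReaL's source.

```python
import re
from math import prod


def _dims(text: str) -> dict[str, int]:
    return {abbrev: int(size) for abbrev, size in re.findall(r"([a-z])(\d+)", text)}


def parse_hybrid(spec: str) -> dict[str, dict[str, int]]:
    """Parse '<backend>:(attn:<dims>|ffn:<dims>)' and check the constraints."""
    match = re.fullmatch(r"(\w+):\(attn:(\w+)\|ffn:(\w+)\)", spec)
    if match is None:
        raise ValueError(f"not a hybrid allocation string: {spec!r}")
    attn, ffn = _dims(match.group(2)), _dims(match.group(3))

    if "e" in attn:
        raise ValueError("expert parallel (e) is only valid in the ffn section")
    if attn.get("p", 1) != ffn.get("p", 1):
        raise ValueError("pipeline parallel size must match for attn and ffn")

    attn_world = prod(attn.get(k, 1) for k in "dtpc")
    # Assumed from the table: ffn world size = dp x pp x tp x ep.
    ffn_without_d = prod(ffn.get(k, 1) for k in "pte")
    if "d" not in ffn:
        if attn_world % ffn_without_d:
            raise ValueError("cannot derive an integer ffn data parallel size")
        ffn["d"] = attn_world // ffn_without_d
    if ffn["d"] * ffn_without_d != attn_world:
        raise ValueError("attn and ffn world sizes differ")
    return {"attn": attn, "ffn": ffn}


layout = parse_hybrid("megatron:(attn:d4p2t2c2|ffn:p2t4e2)")
print(layout["ffn"]["d"])  # 2, derived from the attn world size of 32
```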
- Fine-tuning Large MoE Models - Tutorial for Megatron backend
- Archon: PyTorch-Native Training Engine - Tutorial for Archon backend
- Megatron Performance Best Practice