perf: support fine-grained activation offloading#2279
Conversation
Exposes Megatron-Core fine_grained_activation_offloading and offload_modules through PolicyConfig so training can offload specific submodule activations (moe_act, core_attn, qkv_linear, mlp_norm, attn_norm) to CPU. Works for both dense and MoE models. Validation of module names is left to Megatron. Signed-off-by: sna <sna@nvidia.com>
8a2292d to
feee53e
Compare
|
/ok to test feee53e |
terrykong
left a comment
There was a problem hiding this comment.
Review Summary
Nice feature addition — well-scoped passthrough to Megatron for fine-grained activation offloading. A few doc corrections and suggestions below.
Performance Evidence
This PR adds a memory-reduction feature but the description doesn't include benchmark numbers. Could you share peak GPU memory (GiB) and tokens/sec with and without activation offloading for a representative config (e.g. a MoE model with ["moe_act"] offloading)? Even a single run would help users know what to expect. A reference to the upstream Megatron-LM feature guide is also welcome.
Documentation
No user-facing docs added under docs/. Consider adding a brief section or pointer to the upstream Megatron-LM activation offloading guide.
Tests (optional nit)
Zero test coverage for the new validation logic. The existing TestApplyPerformanceConfig class in tests/unit/models/megatron/test_megatron_setup.py would be a natural home for a happy-path test (valid config sets attributes) and error-path test (empty list raises ValueError).
Generated by Claude Code
| "fine_grained_activation_offloading is True." | ||
| ) | ||
| model_cfg.fine_grained_activation_offloading = True | ||
| model_cfg.offload_modules = offload_modules |
There was a problem hiding this comment.
nemo_rl/models/megatron/setup.py:533
Megatron-Bridge calls set_ideal_affinity_for_current_gpu() when this feature is enabled to optimize PCIe/DRAM transfer throughput via NUMA-aware CPU affinity. NeMo-RL doesn't call this, so users may not get the full performance benefit. Worth considering whether to add it here or document the gap.
Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Seonjin <sna@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Seonjin <sna@nvidia.com>
|
/ok to test c7aae0d |
A stray closing parenthesis after the raise ValueError block caused a SyntaxError, blocking the ruff/ruff-format pre-commit hooks in CI. Signed-off-by: sna <sna@nvidia.com>
|
/ok to test daab28b |
Signed-off-by: sna <sna@nvidia.com>
|
/ok to test d6cd6bc |
|
/claude review |
|
|
||
| model_cfg = MagicMock() | ||
| model_cfg.gated_linear_unit = True | ||
| config = { |
There was a problem hiding this comment.
Nit: the docstring says "defaults to []" but .get("offload_modules") returns None when the key is absent, not []. The test is correct — None fails the isinstance(..., list) check — but the docstring is misleading.
| config = { | |
| """When enabled but offload_modules key is absent, defaults to None → raises.""" |
| # Offload specific module activations to CPU to reduce peak GPU memory. | ||
| # Works with both dense and MoE models. Different from | ||
| # optimizer_cpu_offload which offloads optimizer states. | ||
| # Requires transformer_engine implementation. | ||
| fine_grained_activation_offloading: NotRequired[bool] | ||
| # Modules to offload when fine_grained_activation_offloading is True. | ||
| # Required (no default). Valid values: | ||
| # "attn_norm", "qkv_linear", "core_attn", "attn_proj", "mlp_norm", | ||
| # "expert_fc1", "moe_act". Note: "attn_proj" requires "core_attn". | ||
| # See: https://github.com/NVIDIA/Megatron-LM/blob/d30c3ae5469fe3f6a64d4fd2e63b6e7f7844ea81/megatron/core/transformer/transformer_config.py#L1440-L1448 | ||
| offload_modules: NotRequired[list[str]] |
There was a problem hiding this comment.
Per config-conventions: "Reflect the default in the exemplar YAMLs under examples/configs/*.yaml." These two new keys aren't present in any exemplar YAML (e.g. grpo_math_1B_megatron.yaml). Since they're NotRequired and opt-in, a commented-out entry with a brief note in one megatron exemplar would be sufficient to make the feature discoverable.
- Add fine_grained_activation_offloading and offload_modules to all megatron-capable exemplar configs (grpo_math_1B, grpo_math_1B_megatron, sft, dpo, distillation_math, distillation_math_megatron). - Sync tests/unit/reference_configs/*.yaml with the new keys so test_reference_configs_up_to_date passes. - Trim setup.py block comment to a single line per reviewer feedback. - Fix test docstring to reflect that .get() defaults to None. Signed-off-by: seonjinn <sna@nvidia.com>
|
/ok to test ed86517 |
|
/claude review |
|
|
||
| model_cfg = MagicMock() | ||
| model_cfg.gated_linear_unit = True | ||
| offload_modules = ["mlp", "moe_act"] |
There was a problem hiding this comment.
Nit: "mlp" is not in the documented valid options ("attn_norm", "qkv_linear", "core_attn", "attn_proj", "mlp_norm", "expert_fc1", "moe_act"). While Megatron validates downstream, the happy-path test should use valid module names to avoid confusion.
| offload_modules = ["mlp", "moe_act"] | |
| offload_modules = ["mlp_norm", "moe_act"] |
- Add NVTE_CPU_OFFLOAD_V1=1 note (TE >= 2.10.0) to TypedDict comment in policy/__init__.py so users see the env requirement up front rather than via a late Megatron-Bridge validation error. - Document the NUMA affinity gap in megatron/setup.py: Megatron-Bridge's standalone path calls set_ideal_affinity_for_current_gpu() when this feature is on; NeMo-RL does not, so this comment points users who care about offload bandwidth at the external workaround. Signed-off-by: seonjinn <sna@nvidia.com>
|
/ok to test d89ac84 |
NotRequired[list[str]] caused pydantic to reject `offload_modules: null` in YAML, breaking L1 run_vlm_grpo (and any recipe loading an exemplar megatron config when the feature is off). Wrap with Optional so the exemplar default `null` is accepted; the runtime validation in setup.py already raises if the feature is on with a non-list / empty value. Signed-off-by: seonjinn <sna@nvidia.com>
|
/ok to test aaece7b |
Signed-off-by: seonjinn <sna@nvidia.com>
|
/ok to test bcccabf |
Signed-off-by: seonjinn <sna@nvidia.com>
|
/ok to test 08d121d |
|
@terrykong Can you review this PR when you have a chance? |
|
@seonjinn see this comment to see if still relevant https://github.com/NVIDIA-NeMo/RL/pull/2279/changes#r3106448383 |
What does this PR do ?
Support fine-grained activation offloading for Megatron policy
This feature can offloads per-module activations to CPU to reduce peak GPU memory, distinct from optimizer_cpu_offload.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information