Skip to content

perf: support fine-grained activation offloading#2279

Open
seonjinn wants to merge 20 commits into
mainfrom
sj/fine-grained-activation-offload
Open

perf: support fine-grained activation offloading#2279
seonjinn wants to merge 20 commits into
mainfrom
sj/fine-grained-activation-offload

Conversation

@seonjinn
Copy link
Copy Markdown
Contributor

What does this PR do ?

Support fine-grained activation offloading for Megatron policy
This feature can offloads per-module activations to CPU to reduce peak GPU memory, distinct from optimizer_cpu_offload.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@seonjinn seonjinn requested review from a team as code owners April 17, 2026 04:44
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@seonjinn seonjinn requested review from a team as code owners April 17, 2026 05:11
Exposes Megatron-Core fine_grained_activation_offloading and
offload_modules through PolicyConfig so training can offload specific
submodule activations (moe_act, core_attn, qkv_linear, mlp_norm,
attn_norm) to CPU. Works for both dense and MoE models. Validation of
module names is left to Megatron.

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn force-pushed the sj/fine-grained-activation-offload branch from 8a2292d to feee53e Compare April 17, 2026 07:08
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test feee53e

@seonjinn seonjinn self-assigned this Apr 17, 2026
@seonjinn seonjinn requested a review from terrykong April 17, 2026 21:10
Copy link
Copy Markdown
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Nice feature addition — well-scoped passthrough to Megatron for fine-grained activation offloading. A few doc corrections and suggestions below.

Performance Evidence

This PR adds a memory-reduction feature but the description doesn't include benchmark numbers. Could you share peak GPU memory (GiB) and tokens/sec with and without activation offloading for a representative config (e.g. a MoE model with ["moe_act"] offloading)? Even a single run would help users know what to expect. A reference to the upstream Megatron-LM feature guide is also welcome.

Documentation

No user-facing docs added under docs/. Consider adding a brief section or pointer to the upstream Megatron-LM activation offloading guide.

Tests (optional nit)

Zero test coverage for the new validation logic. The existing TestApplyPerformanceConfig class in tests/unit/models/megatron/test_megatron_setup.py would be a natural home for a happy-path test (valid config sets attributes) and error-path test (empty list raises ValueError).

Generated by Claude Code

Comment thread nemo_rl/models/policy/__init__.py Outdated
Comment thread nemo_rl/models/megatron/setup.py Outdated
"fine_grained_activation_offloading is True."
)
model_cfg.fine_grained_activation_offloading = True
model_cfg.offload_modules = offload_modules
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nemo_rl/models/megatron/setup.py:533

Megatron-Bridge calls set_ideal_affinity_for_current_gpu() when this feature is enabled to optimize PCIe/DRAM transfer throughput via NUMA-aware CPU affinity. NeMo-RL doesn't call this, so users may not get the full performance benefit. Worth considering whether to add it here or document the gap.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seonjinn wdyt of this?

Comment thread nemo_rl/models/megatron/setup.py
seonjinn and others added 3 commits April 23, 2026 14:31
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin <sna@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin <sna@nvidia.com>
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test c7aae0d

seonjinn added 2 commits May 14, 2026 14:16
A stray closing parenthesis after the raise ValueError block caused a
SyntaxError, blocking the ruff/ruff-format pre-commit hooks in CI.

Signed-off-by: sna <sna@nvidia.com>
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test daab28b

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn requested a review from a team as a code owner May 15, 2026 18:52
@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label May 15, 2026
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test d6cd6bc

@seonjinn
Copy link
Copy Markdown
Contributor Author

/claude review


model_cfg = MagicMock()
model_cfg.gated_linear_unit = True
config = {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the docstring says "defaults to []" but .get("offload_modules") returns None when the key is absent, not []. The test is correct — None fails the isinstance(..., list) check — but the docstring is misleading.

Suggested change
config = {
"""When enabled but offload_modules key is absent, defaults to None → raises."""

Comment thread nemo_rl/models/policy/__init__.py Outdated
Comment on lines +240 to +250
# Offload specific module activations to CPU to reduce peak GPU memory.
# Works with both dense and MoE models. Different from
# optimizer_cpu_offload which offloads optimizer states.
# Requires transformer_engine implementation.
fine_grained_activation_offloading: NotRequired[bool]
# Modules to offload when fine_grained_activation_offloading is True.
# Required (no default). Valid values:
# "attn_norm", "qkv_linear", "core_attn", "attn_proj", "mlp_norm",
# "expert_fc1", "moe_act". Note: "attn_proj" requires "core_attn".
# See: https://github.com/NVIDIA/Megatron-LM/blob/d30c3ae5469fe3f6a64d4fd2e63b6e7f7844ea81/megatron/core/transformer/transformer_config.py#L1440-L1448
offload_modules: NotRequired[list[str]]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per config-conventions: "Reflect the default in the exemplar YAMLs under examples/configs/*.yaml." These two new keys aren't present in any exemplar YAML (e.g. grpo_math_1B_megatron.yaml). Since they're NotRequired and opt-in, a commented-out entry with a brief note in one megatron exemplar would be sufficient to make the feature discoverable.

- Add fine_grained_activation_offloading and offload_modules to all
  megatron-capable exemplar configs (grpo_math_1B, grpo_math_1B_megatron,
  sft, dpo, distillation_math, distillation_math_megatron).
- Sync tests/unit/reference_configs/*.yaml with the new keys so
  test_reference_configs_up_to_date passes.
- Trim setup.py block comment to a single line per reviewer feedback.
- Fix test docstring to reflect that .get() defaults to None.

Signed-off-by: seonjinn <sna@nvidia.com>
@seonjinn seonjinn requested a review from a team as a code owner May 23, 2026 07:10
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test ed86517

@seonjinn
Copy link
Copy Markdown
Contributor Author

/claude review


model_cfg = MagicMock()
model_cfg.gated_linear_unit = True
offload_modules = ["mlp", "moe_act"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "mlp" is not in the documented valid options ("attn_norm", "qkv_linear", "core_attn", "attn_proj", "mlp_norm", "expert_fc1", "moe_act"). While Megatron validates downstream, the happy-path test should use valid module names to avoid confusion.

Suggested change
offload_modules = ["mlp", "moe_act"]
offload_modules = ["mlp_norm", "moe_act"]

- Add NVTE_CPU_OFFLOAD_V1=1 note (TE >= 2.10.0) to TypedDict comment in
  policy/__init__.py so users see the env requirement up front rather
  than via a late Megatron-Bridge validation error.
- Document the NUMA affinity gap in megatron/setup.py: Megatron-Bridge's
  standalone path calls set_ideal_affinity_for_current_gpu() when this
  feature is on; NeMo-RL does not, so this comment points users who care
  about offload bandwidth at the external workaround.

Signed-off-by: seonjinn <sna@nvidia.com>
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test d89ac84

NotRequired[list[str]] caused pydantic to reject `offload_modules: null`
in YAML, breaking L1 run_vlm_grpo (and any recipe loading an exemplar
megatron config when the feature is off). Wrap with Optional so the
exemplar default `null` is accepted; the runtime validation in setup.py
already raises if the feature is on with a non-list / empty value.

Signed-off-by: seonjinn <sna@nvidia.com>
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test aaece7b

Signed-off-by: seonjinn <sna@nvidia.com>
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test bcccabf

Signed-off-by: seonjinn <sna@nvidia.com>
@seonjinn
Copy link
Copy Markdown
Contributor Author

/ok to test 08d121d

@seonjinn
Copy link
Copy Markdown
Contributor Author

@terrykong Can you review this PR when you have a chance?

@terrykong
Copy link
Copy Markdown
Collaborator

@seonjinn see this comment to see if still relevant https://github.com/NVIDIA-NeMo/RL/pull/2279/changes#r3106448383

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) Documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants