perf: support fine-grained activation offloading by seonjinn · Pull Request #2279 · NVIDIA-NeMo/RL

seonjinn · 2026-04-17T04:44:04Z

What does this PR do ?

Support fine-grained activation offloading for the Megatron policy.
This feature offloads per-module activations to CPU to reduce peak GPU memory; it is distinct from optimizer_cpu_offload.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

copy-pr-bot · 2026-04-17T04:44:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Exposes Megatron-Core fine_grained_activation_offloading and offload_modules through PolicyConfig so training can offload specific submodule activations (moe_act, core_attn, qkv_linear, mlp_norm, attn_norm) to CPU. Works for both dense and MoE models. Validation of module names is left to Megatron.

Signed-off-by: sna <sna@nvidia.com>
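For readers skimming the thread, the passthrough described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in nemo_rl/models/megatron/setup.py: the helper name `apply_activation_offloading`, the flat `policy_cfg` dict, and the first half of the error message are assumptions made for the example.

```python
from types import SimpleNamespace
from typing import Any


def apply_activation_offloading(policy_cfg: dict[str, Any], model_cfg: Any) -> None:
    """Hypothetical helper: copy the offloading settings onto a Megatron-style config."""
    if not policy_cfg.get("fine_grained_activation_offloading", False):
        return  # the feature is opt-in; leave model_cfg untouched

    offload_modules = policy_cfg.get("offload_modules")
    # Reject a missing key, a null value, a non-list, and an empty list alike.
    if not isinstance(offload_modules, list) or not offload_modules:
        raise ValueError(
            "offload_modules must be a non-empty list when "  # wording assumed
            "fine_grained_activation_offloading is True."
        )

    # Module names are not checked here; that is left to Megatron.
    model_cfg.fine_grained_activation_offloading = True
    model_cfg.offload_modules = offload_modules


# SimpleNamespace stands in for the real Megatron model config object.
model_cfg = SimpleNamespace()
apply_activation_offloading(
    {"fine_grained_activation_offloading": True, "offload_modules": ["moe_act"]},
    model_cfg,
)
print(model_cfg.offload_modules)  # ['moe_act']
```

Keeping the check to "non-empty list" mirrors the stated design choice of deferring name validation to Megatron.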

seonjinn · 2026-04-17T20:40:10Z

/ok to test feee53e

terrykong

Review Summary

Nice feature addition — well-scoped passthrough to Megatron for fine-grained activation offloading. A few doc corrections and suggestions below.

Performance Evidence

This PR adds a memory-reduction feature but the description doesn't include benchmark numbers. Could you share peak GPU memory (GiB) and tokens/sec with and without activation offloading for a representative config (e.g. a MoE model with ["moe_act"] offloading)? Even a single run would help users know what to expect. A reference to the upstream Megatron-LM feature guide is also welcome.

Documentation

No user-facing docs added under docs/. Consider adding a brief section or pointer to the upstream Megatron-LM activation offloading guide.

Tests (optional nit)

Zero test coverage for the new validation logic. The existing TestApplyPerformanceConfig class in tests/unit/models/megatron/test_megatron_setup.py would be a natural home for a happy-path test (valid config sets attributes) and error-path test (empty list raises ValueError).

Generated by Claude Code

terrykong · 2026-04-19T07:35:14Z

+                "fine_grained_activation_offloading is True."
+            )
+        model_cfg.fine_grained_activation_offloading = True
+        model_cfg.offload_modules = offload_modules


nemo_rl/models/megatron/setup.py:533

Megatron-Bridge calls set_ideal_affinity_for_current_gpu() when this feature is enabled to optimize PCIe/DRAM transfer throughput via NUMA-aware CPU affinity. NeMo-RL doesn't call this, so users may not get the full performance benefit. Worth considering whether to add it here or document the gap.

@seonjinn wdyt of this?

Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin <sna@nvidia.com>

seonjinn · 2026-05-14T21:05:54Z

/ok to test c7aae0d

A stray closing parenthesis after the raise ValueError block caused a SyntaxError, blocking the ruff/ruff-format pre-commit hooks in CI.

Signed-off-by: sna <sna@nvidia.com>

seonjinn · 2026-05-15T05:39:17Z

/ok to test daab28b

Signed-off-by: sna <sna@nvidia.com>

seonjinn · 2026-05-15T18:53:17Z

/ok to test d6cd6bc

seonjinn · 2026-05-23T07:02:27Z

/claude review

claude · 2026-05-23T07:04:12Z

+
+        model_cfg = MagicMock()
+        model_cfg.gated_linear_unit = True
+        config = {


Nit: the docstring says "defaults to []" but .get("offload_modules") returns None when the key is absent, not []. The test is correct — None fails the isinstance(..., list) check — but the docstring is misleading.

Suggested change

config = {

"""When enabled but offload_modules key is absent, defaults to None → raises."""
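The point of this nit can be checked in isolation. The snippet below is only an illustration of `dict.get` semantics; the `config` dict is a stand-in, not the fixture used in the actual test.

```python
# Stand-in config where the "offload_modules" key is absent.
config = {"fine_grained_activation_offloading": True}

value = config.get("offload_modules")      # no default supplied
print(value is None)                       # True
print(isinstance(value, list))             # False, so a list check rejects it

# Only an explicit default would produce an empty list instead.
print(config.get("offload_modules", []))   # []
```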

claude · 2026-05-23T07:04:16Z

+    # Offload specific module activations to CPU to reduce peak GPU memory.
+    # Works with both dense and MoE models. Different from
+    # optimizer_cpu_offload which offloads optimizer states.
+    # Requires transformer_engine implementation.
+    fine_grained_activation_offloading: NotRequired[bool]
+    # Modules to offload when fine_grained_activation_offloading is True.
+    # Required (no default). Valid values:
+    # "attn_norm", "qkv_linear", "core_attn", "attn_proj", "mlp_norm",
+    # "expert_fc1", "moe_act". Note: "attn_proj" requires "core_attn".
+    # See: https://github.com/NVIDIA/Megatron-LM/blob/d30c3ae5469fe3f6a64d4fd2e63b6e7f7844ea81/megatron/core/transformer/transformer_config.py#L1440-L1448
+    offload_modules: NotRequired[list[str]]


Per config-conventions: "Reflect the default in the exemplar YAMLs under examples/configs/*.yaml." These two new keys aren't present in any exemplar YAML (e.g. grpo_math_1B_megatron.yaml). Since they're NotRequired and opt-in, a commented-out entry with a brief note in one megatron exemplar would be sufficient to make the feature discoverable.

- Add fine_grained_activation_offloading and offload_modules to all megatron-capable exemplar configs (grpo_math_1B, grpo_math_1B_megatron, sft, dpo, distillation_math, distillation_math_megatron).
- Sync tests/unit/reference_configs/*.yaml with the new keys so test_reference_configs_up_to_date passes.
- Trim setup.py block comment to a single line per reviewer feedback.
- Fix test docstring to reflect that .get() defaults to None.

Signed-off-by: seonjinn <sna@nvidia.com>

seonjinn · 2026-05-23T07:10:56Z

/ok to test ed86517

seonjinn · 2026-05-23T07:11:37Z

/claude review

claude · 2026-05-23T07:12:53Z

+
+        model_cfg = MagicMock()
+        model_cfg.gated_linear_unit = True
+        offload_modules = ["mlp", "moe_act"]


Nit: "mlp" is not in the documented valid options ("attn_norm", "qkv_linear", "core_attn", "attn_proj", "mlp_norm", "expert_fc1", "moe_act"). While Megatron validates downstream, the happy-path test should use valid module names to avoid confusion.

Suggested change

-        offload_modules = ["mlp", "moe_act"]
+        offload_modules = ["mlp_norm", "moe_act"]
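A small pre-check like the one below could catch this kind of typo in test fixtures before Megatron sees it. It is a sketch only: NeMo-RL leaves name validation to Megatron, the helper `find_offload_module_problems` is hypothetical, and the name set is copied from the TypedDict comment quoted earlier in this thread rather than imported from Megatron.

```python
# Names copied from the TypedDict comment in this PR (not read from Megatron).
DOCUMENTED_OFFLOAD_MODULES = frozenset({
    "attn_norm", "qkv_linear", "core_attn", "attn_proj",
    "mlp_norm", "expert_fc1", "moe_act",
})


def find_offload_module_problems(offload_modules: list[str]) -> list[str]:
    """Hypothetical pre-check; returns readable problems (empty list if none)."""
    problems = [
        f"unknown module name: {name!r}"
        for name in offload_modules
        if name not in DOCUMENTED_OFFLOAD_MODULES
    ]
    # The comment in the PR notes that "attn_proj" requires "core_attn".
    if "attn_proj" in offload_modules and "core_attn" not in offload_modules:
        problems.append('"attn_proj" requires "core_attn"')
    return problems


print(find_offload_module_problems(["mlp", "moe_act"]))       # ["unknown module name: 'mlp'"]
print(find_offload_module_problems(["mlp_norm", "moe_act"]))  # []
```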

- Add NVTE_CPU_OFFLOAD_V1=1 note (TE >= 2.10.0) to TypedDict comment in policy/__init__.py so users see the env requirement up front rather than via a late Megatron-Bridge validation error.
- Document the NUMA affinity gap in megatron/setup.py: Megatron-Bridge's standalone path calls set_ideal_affinity_for_current_gpu() when this feature is on; NeMo-RL does not, so this comment points users who care about offload bandwidth at the external workaround.

Signed-off-by: seonjinn <sna@nvidia.com>

seonjinn · 2026-05-23T07:13:09Z

/ok to test d89ac84

NotRequired[list[str]] caused pydantic to reject `offload_modules: null` in YAML, breaking L1 run_vlm_grpo (and any recipe loading an exemplar megatron config when the feature is off). Wrap with Optional so the exemplar default `null` is accepted; the runtime validation in setup.py already raises if the feature is on with a non-list / empty value.

Signed-off-by: seonjinn <sna@nvidia.com>

seonjinn · 2026-05-23T07:39:12Z

/ok to test aaece7b

Signed-off-by: seonjinn <sna@nvidia.com>

seonjinn · 2026-05-23T18:25:38Z

/ok to test bcccabf

Signed-off-by: seonjinn <sna@nvidia.com>

seonjinn · 2026-05-26T01:48:59Z

/ok to test 08d121d

seonjinn · 2026-05-26T03:24:03Z

@terrykong Can you review this PR when you have a chance?

terrykong · 2026-05-26T03:44:16Z

@seonjinn see this comment to see if still relevant https://github.com/NVIDIA-NeMo/RL/pull/2279/changes#r3106448383

seonjinn requested review from a team as code owners April 17, 2026 04:44

seonjinn requested review from a team as code owners April 17, 2026 05:11

seonjinn force-pushed the sj/fine-grained-activation-offload branch from 8a2292d to feee53e Compare April 17, 2026 07:08

seonjinn self-assigned this Apr 17, 2026

seonjinn requested a review from terrykong April 17, 2026 21:10

terrykong reviewed Apr 19, 2026

seonjinn and others added 3 commits April 23, 2026 14:31

Update nemo_rl/models/policy/__init__.py

0b530c4

Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Seonjin <sna@nvidia.com>

Update nemo_rl/models/megatron/setup.py

d5df80a

Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Seonjin <sna@nvidia.com>

Merge remote-tracking branch 'origin/main' into sj/fine-grained-activ…

f742bf8

…ation-offload

seonjinn added 2 commits May 14, 2026 14:16

fix: remove stray paren in setup.py raising ValueError

e237987

Merge remote-tracking branch 'origin/main' into sj/fine-grained-activ…

4f4681c

…ation-offload

fix: pin NeMo Gym docs URL to v0.2.1 (latest 404)

06b4d4a

Signed-off-by: sna <sna@nvidia.com>

seonjinn requested a review from a team as a code owner May 15, 2026 18:52

github-actions Bot added the Documentation Improvements or additions to documentation label May 15, 2026

claude Bot reviewed May 23, 2026

seonjinn requested a review from a team as a code owner May 23, 2026 07:10

copy-pr-bot Bot had a problem deploying to nemo-ci May 23, 2026 07:11 Failure

copy-pr-bot Bot had a problem deploying to nemo-ci May 23, 2026 07:11 Error

claude Bot reviewed May 23, 2026

copy-pr-bot Bot had a problem deploying to nemo-ci May 23, 2026 07:14 Failure

Merge branch 'main' into sj/fine-grained-activation-offload

bcccabf

Signed-off-by: seonjinn <sna@nvidia.com>

Merge branch 'main' into sj/fine-grained-activation-offload

08d121d

Signed-off-by: seonjinn <sna@nvidia.com>
