fix(pt): recognize AOTInductor-wrapped CUDA OOM in AutoBatchSize #5418
OutisLi wants to merge 2 commits into deepmodeling:master from
Conversation
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
🧹 Nitpick comments (1)
deepmd/pt/utils/auto_batch_size.py (1)
79-96: Optional: consolidate the three `empty_cache()` + `return True` branches. Minor readability nit -- the three OOM-positive branches repeat the same side effect. You can fold them into a single exit point without changing behavior.
♻️ Suggested consolidation

```diff
-        if any(m in msg for msg in msgs for m in plain_oom_markers):
-            torch.cuda.empty_cache()
-            return True
-
-        # AOTInductor (.pt2) wraps the underlying CUDA OOM as a generic
-        # ``run_func_(...) API call failed at .../model_container_runner.cpp``.
-        # ...
-        aoti_wrapped = any(
-            "run_func_(" in msg and "model_container_runner" in msg for msg in msgs
-        )
-        if aoti_wrapped:
-            torch.cuda.empty_cache()
-            return True
-
-        return False
+        plain_oom = any(m in msg for msg in msgs for m in plain_oom_markers)
+        # AOTInductor (.pt2) wraps the underlying CUDA OOM as a generic
+        # ``run_func_(...) API call failed at .../model_container_runner.cpp``.
+        # The original "CUDA out of memory" text is printed to stderr only, so
+        # we match on the wrapper signature. If the root cause is not OOM,
+        # ``execute()`` will shrink to batch size 1 and raise ``OutOfMemoryError``.
+        aoti_wrapped = any(
+            "run_func_(" in msg and "model_container_runner" in msg for msg in msgs
+        )
+        if plain_oom or aoti_wrapped:
+            torch.cuda.empty_cache()
+            return True
+        return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt/utils/auto_batch_size.py` around lines 79 - 96, The three places that call torch.cuda.empty_cache() and return True (the plain OOM marker check using plain_oom_markers, the earlier OOM detection loop over msgs, and the AOTInductor wrapper check that sets aoti_wrapped) should be consolidated to a single exit point: compute the boolean condition variables (e.g., plain_match = any(m in msg for msg in msgs for m in plain_oom_markers), other_match = ... , aoti_wrapped = any("run_func_(" in msg and "model_container_runner" in msg for msg in msgs)), combine them (e.g., if plain_match or other_match or aoti_wrapped) then call torch.cuda.empty_cache() once and return True; update the surrounding function (the OOM detection logic in auto_batch_size) to use these names so the duplicated side effects are removed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d8e9df32-3cac-49f1-ae75-1e812e9778fb
📒 Files selected for processing (1)
deepmd/pt/utils/auto_batch_size.py
Pull request overview
Improves PyTorch AutoBatchSize OOM detection for .pt2 AOTInductor-packaged models by recognizing AOTInductor’s wrapped CUDA OOM failures, allowing batch size to shrink instead of crashing.
Changes:
- Expand OOM detection to scan the exception chain (`__cause__`/`__context__`) for known CUDA/cusolver OOM markers.
- Add detection for AOTInductor wrapper error signatures (`run_func_(` + `model_container_runner`) and treat them as OOM.
- Keep GPU cache cleanup (`torch.cuda.empty_cache()`) when an OOM is detected.
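The shrink-on-retry behavior these detections feed into can be sketched as follows. This is a hypothetical simplification, not deepmd's actual `AutoBatchSize` implementation; `is_oom_sketch`, `execute_with_auto_batch`, and their parameters are illustrative names only:

```python
def is_oom_sketch(e: BaseException) -> bool:
    # Stand-in for the widened is_oom_error: match a plain OOM marker
    # anywhere in the exception chain.
    exc = e
    while exc is not None:
        if "out of memory" in str(exc).lower():
            return True
        exc = exc.__cause__ or exc.__context__
    return False


def execute_with_auto_batch(run, start_batch=1024):
    """Halve the batch on detected OOM; re-raise once batch 1 still fails."""
    batch = start_batch
    while True:
        try:
            return run(batch)
        except RuntimeError as e:
            if not is_oom_sketch(e) or batch == 1:
                raise  # non-OOM bug, or bounded fallback exhausted
            batch //= 2  # shrink and retry
```

A `run` callable that only fits in memory below some batch size is retried with 1024, 512, ... until it succeeds, which is why misclassifying a non-OOM error as OOM only delays, but never hides, the real failure.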
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
## master #5418 +/- ##
==========================================
- Coverage 80.46% 80.46% -0.01%
==========================================
Files 823 823
Lines 86625 86645 +20
Branches 4139 4139
==========================================
+ Hits 69701 69717 +16
Misses 15651 15651
- Partials 1273 1277 +4

☔ View full report in Codecov by Sentry.
njzjz-bot
left a comment
Nice fix. I walked through the exception-chain handling and the AOTInductor wrapper detection in AutoBatchSize, and the fallback behavior still looks bounded/reasonable. The current CI matrix is green, so I'm happy with this as-is.
— OpenClaw 2026.4.22 (model: gpt-5.4)
When running `dp --pt-expt test` (or any path that goes through
`deepmd.pt_expt.infer.deep_eval`) against a `.pt2` AOTInductor
package, `AutoBatchSize` doubles the batch on every success. For
models with a large `sel` the exploration eventually saturates GPU
memory, and the CUDA caching allocator raises the usual
``CUDA out of memory`` from inside the AOTInductor runtime.
AOTInductor then rewraps that error as a generic
RuntimeError: run_func_(...) API call failed at
.../aoti_runner/model_container_runner.cpp, line 144
The original "CUDA out of memory" text is printed only to stderr,
so the old `is_oom_error` -- which keyed on a short list of
substrings in `e.args[0]` -- never matched. `execute()` therefore
did not shrink the batch; the exception propagated and the run
crashed on a GPU that was otherwise completely idle (as confirmed by
monitoring `nvidia-smi --query-compute-apps`, which showed dp itself
as the sole consumer holding tens of GiB just before the failure).
Widen `is_oom_error` to:
* walk the exception chain via `__cause__` / `__context__`, so that a
future PyTorch preserving the original OOM text is handled for free;
* keep matching the four plain CUDA OOM markers on every message in
the chain;
* additionally treat the AOTInductor wrapper signature
(`run_func_(` plus `model_container_runner`) as an OOM candidate.
If the AOTInductor wrapper ever hides a non-OOM failure, the batch
shrinker will halve down to 1 and then raise `OutOfMemoryError`, so
the fallback is bounded -- non-OOM bugs still surface with a clear
terminal error rather than being silently retried forever.
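The chain walk and wrapper match described above can be sketched standalone as follows. This is a minimal approximation of the widened `is_oom_error`, not the actual method in `deepmd/pt/utils/auto_batch_size.py` (which additionally short-circuits on `torch.cuda.OutOfMemoryError` and calls `torch.cuda.empty_cache()` on a hit); the helper names and the abbreviated marker list are illustrative:

```python
def iter_exception_chain(e):
    """Yield e and every exception reachable via __cause__/__context__."""
    seen = set()
    while e is not None and id(e) not in seen:
        seen.add(id(e))  # guard against cycles in the chain
        yield e
        e = e.__cause__ or e.__context__


def looks_like_oom(e) -> bool:
    # Marker list abbreviated for illustration.
    plain_oom_markers = ("CUDA out of memory", "CUDA error: out of memory")
    msgs = [str(exc) for exc in iter_exception_chain(e)]
    plain_oom = any(m in msg for msg in msgs for m in plain_oom_markers)
    # AOTInductor (.pt2) rewraps the OOM as a generic runner failure whose
    # message carries only the wrapper signature, not the OOM text.
    aoti_wrapped = any(
        "run_func_(" in msg and "model_container_runner" in msg for msg in msgs
    )
    return plain_oom or aoti_wrapped
```

Because the wrapper check matches on message text alone, it returns `True` even when the original OOM never appears in the chain, which is exactly the `.pt2` failure mode described in the commit message.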
🧹 Nitpick comments (1)
source/tests/pt/test_auto_batch_size.py (1)
14-43: Consider adding a negative case and a direct `OutOfMemoryError` case. The three new tests cover the chain-traversal and AOTI-wrapper paths well, but two coverage gaps remain that would harden against future regressions:
- The direct `torch.cuda.OutOfMemoryError` branch (the early return at the top of `is_oom_error`) isn't exercised -- instantiating one is a bit awkward (it requires CUDA), but a `mock.patch` on `torch.cuda.OutOfMemoryError`, or constructing a `RuntimeError`-like surrogate plus `isinstance` patching, can do it; alternatively a simple skip-if-no-CUDA path also works.
- There's no negative test asserting that an unrelated `RuntimeError` (e.g. `RuntimeError("shape mismatch")`) returns `False` and that `empty_cache` is not called. This guards against the marker list/AOTI heuristic accidentally widening into false positives, which would silently clear caches and shrink batch sizes on non-OOM bugs.

♻️ Suggested addition

```diff
+    @mock.patch("deepmd.pt.utils.auto_batch_size.torch.cuda.empty_cache")
+    def test_is_oom_error_non_oom_runtime_error(self, empty_cache) -> None:
+        auto_batch_size = AutoBatchSize(256, 2.0)
+        self.assertFalse(
+            auto_batch_size.is_oom_error(RuntimeError("shape mismatch"))
+        )
+        empty_cache.assert_not_called()
+
+    @mock.patch("deepmd.pt.utils.auto_batch_size.torch.cuda.empty_cache")
+    def test_is_oom_error_non_runtime_error(self, empty_cache) -> None:
+        auto_batch_size = AutoBatchSize(256, 2.0)
+        self.assertFalse(auto_batch_size.is_oom_error(ValueError("nope")))
+        empty_cache.assert_not_called()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/tests/pt/test_auto_batch_size.py` around lines 14 - 43, Add two tests to TestAutoBatchSize covering (1) direct torch.cuda.OutOfMemoryError and (2) a negative unrelated RuntimeError: create a test that patches torch.cuda.OutOfMemoryError (or mocks isinstance checks) and asserts AutoBatchSize.is_oom_error returns True and torch.cuda.empty_cache is called, and add another test that calls AutoBatchSize.is_oom_error(RuntimeError("shape mismatch")) asserting it returns False and that the patched torch.cuda.empty_cache was NOT called; reference the AutoBatchSize.is_oom_error method and the existing tests that patch "deepmd.pt.utils.auto_batch_size.torch.cuda.empty_cache" to mirror setup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 04fc2d82-1a39-480a-957e-57976382f46c
📒 Files selected for processing (2)
deepmd/pt/utils/auto_batch_size.py
source/tests/pt/test_auto_batch_size.py
```python
# AOTInductor (.pt2) wraps the underlying CUDA OOM as a generic
# ``run_func_(...) API call failed at .../model_container_runner.cpp``.
# https://github.com/deepmodeling/deepmd-kit/issues/4594
```
I think #4594 is cited in the wrong place.
That issue looks like a plain CUDA OOM report (RuntimeError: CUDA error: out of memory), but this comment block is specifically describing the AOTInductor/PT2 wrapper signature (run_func_(...) / model_container_runner.cpp). So #4594 seems to support the plain OOM markers above rather than this wrapper-specific branch.
Could you either move #4594 back to the plain OOM section, or replace it with a reference that actually contains the wrapper signature?
— OpenClaw 2026.4.22 (model: gpt-5.4)
is_oom_error-- which keyed on a short list of substrings ine.args[0]-- never matched.execute()therefore did not shrink the batch; the exception propagated and the run crashed on a GPU that was otherwise completely idle (as confirmed by monitoringnvidia-smi --query-compute-apps, which showed dp itself as the sole consumer holding tens of GiB just before the failure). Widenis_oom_errorto:__cause__/__context__, so that a future PyTorch preserving the original OOM text is handled for free;run_func_(plusmodel_container_runner) as an OOM candidate. If the AOTInductor wrapper ever hides a non-OOM failure, the batch shrinker will halve down to 1 and then raiseOutOfMemoryError, so the fallback is bounded -- non-OOM bugs still surface with a clear terminal error rather than being silently retried forever.Summary by CodeRabbit