Hey!
I ran into a CUDA OOM that surprised me a bit, because I expected an NVIDIA H100 80GB to be sufficient for this workload.
The run used my slightly modified bad-prompt dataset and a config.toml with almost entirely default settings.
Summary
The auto-benchmark selected batch size 128, but the run later failed with an out-of-memory error at "Obtaining residuals for good prompts".
When I manually reduced the batch size to 64, the run continued successfully.
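For anyone hitting the same thing, the manual workaround can be automated with a generic halve-on-OOM retry loop. This is just a sketch, not heretic code: run_with_backoff, step, and oom_exc are my own names, and in a real PyTorch run you would pass oom_exc=torch.cuda.OutOfMemoryError.

```python
def run_with_backoff(step, batch_size, min_batch=1, oom_exc=RuntimeError):
    """Call step(batch_size), halving the batch each time it OOMs.

    `step` stands in for whatever batched stage is failing (here,
    residual extraction). With PyTorch, pass
    oom_exc=torch.cuda.OutOfMemoryError so only OOMs trigger a retry.
    """
    while batch_size >= min_batch:
        try:
            return step(batch_size)
        except oom_exc:
            batch_size //= 2  # retry with half the batch
    raise RuntimeError(f"still OOM at the minimum batch size {min_batch}")
```

With the numbers above, a step that OOMs at 128 would be retried once at 64 and succeed.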
Observations
- Auto-selected batch size: 128
- Failure point: Obtaining residuals for good prompts
- Manual batch size of 64 works
- VRAM during the successful run:
  - peaked at ~74 GB
  - then settled to ~65 GB
Question
Could someone explain why the run would OOM after the auto-benchmark had already chosen batch size 128?
I’m guessing the residual extraction step has a higher memory requirement than the earlier benchmark path, but I’m still learning the framework and wanted to confirm whether that is expected.
Likely cause
The auto-selected batch size seems to be based on a lighter generation workload than the later residual-collection step.
Autotuning succeeds at batch size 128 during normal generation, but get_residuals_batched() appears to use much more VRAM because it enables output_hidden_states=True and builds per-layer residual tensors.
So the batch size chosen by the benchmark is valid for response generation, but not necessarily for residual extraction. The Qwen3.5 traceback also suggests extra temporary memory is needed in its linear-attention path, which may contribute to the OOM.
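A back-of-the-envelope estimate supports this. The layer count (64) and dtype (BF16) come from the log below; the sequence length and hidden size are assumed round numbers for illustration, not confirmed Qwen3.5-27B values:

```python
# Extra VRAM held by hidden states when output_hidden_states=True.
batch = 128        # auto-selected batch size
seq_len = 512      # assumed padded prompt length
hidden = 5120      # assumed hidden size
layers = 64        # reported in the log
bytes_per = 2      # BF16

# One hidden-state tensor per layer, plus the embedding output,
# stays alive for the whole forward pass.
extra_gib = (layers + 1) * batch * seq_len * hidden * bytes_per / 2**30
print(f"~{extra_gib:.1f} GiB of hidden states")
```

Even if the real dimensions differ, the point stands: this term scales linearly with batch size and simply is not present in the generation path that the benchmark measures.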
Environment
- GPU: NVIDIA H100 80GB
- Model: Qwen3.5-27B-BF16
- Config: mostly default config.toml
- Dataset: lightly modified bad prompt dataset
Error log below
Detected 1 CUDA device(s) (79.18 GB total VRAM):
- GPU 0: NVIDIA H100 80GB HBM3 (79.18 GB)
Loading model Qwen/Qwen3.5-27B...
- Trying dtype auto... Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading weights: 100%|██████████████████████████| 1184/1184 [00:08<00:00, 147.43it/s]
Ok
- LoRA adapters initialized (targets: out_proj, down_proj, o_proj)
- Transformer model with 64 layers
- Abliterable components:
- attn.o_proj: 1 modules per layer
- mlp.down_proj: 1 modules per layer
Resident system RAM: 1.71 GB
Allocated GPU VRAM: 51.00 GB
Reserved GPU VRAM: 51.18 GB
Loading good prompts from mlabonne/harmless_alpaca...
Loading bad prompts from ./bad_prompts...
Generating train split: 442 examples [00:00, 187734.92 examples/s]
Generating test split: 130 examples [00:00, 107758.80 examples/s]
Determining optimal batch size...
- Trying batch size 1... Ok (14 tokens/s)
- Trying batch size 2... Ok (26 tokens/s)
- Trying batch size 4... Ok (52 tokens/s)
- Trying batch size 8... Ok (77 tokens/s)
- Trying batch size 16... Ok (153 tokens/s)
- Trying batch size 32... Ok (268 tokens/s)
- Trying batch size 64... Ok (482 tokens/s)
- Trying batch size 128... Ok (584 tokens/s)
- Chosen batch size: 128
Checking for common response prefix...
Loading good evaluation prompts from mlabonne/harmless_alpaca...
- 100 prompts loaded
- Obtaining first-token probability distributions...
Loading bad evaluation prompts from ./bad_prompts...
Generating train split: 100%|███████████| 442/442 [00:00<00:00, 176998.51 examples/s]
Generating test split: 100%|████████████| 130/130 [00:00<00:00, 102646.75 examples/s]
- 130 prompts loaded
- Counting model refusals...
- Initial refusals: 121/130
Calculating per-layer refusal directions...
- Obtaining residuals for good prompts...
╭──────────────────────── Traceback (most recent call last) ────────────────────────╮
│ /root/code/heretic/.venv/bin/heretic:10 in │
│ /root/code/heretic/src/heretic/main.py:925 in main │
│ /root/code/heretic/src/heretic/main.py:439 in run │
│ /root/code/heretic/src/heretic/model.py:672 in get_residuals_batched │
│ /root/code/heretic/src/heretic/model.py:626 in get_residuals │
│ /root/code/heretic/src/heretic/model.py:580 in generate │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/peft/peft_model.py:2048 │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/generation/... │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
╰───────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB.
GPU 0 has a total capacity of 79.18 GiB of which 120.31 MiB is free.
Including non-PyTorch memory, this process has 79.05 GiB memory in use.
Of the allocated memory 76.69 GiB is allocated by PyTorch,
and 1.69 GiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
to avoid fragmentation.
See documentation for Memory Management:
https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
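If anyone wants to try the allocator hint from the error message before dropping the batch size: expandable_segments is a documented PyTorch allocator option, though it only helps when fragmentation is the problem. Since only ~1.69 GiB here is reserved-but-unallocated, I would not expect it to be sufficient on its own.

```shell
# Enable PyTorch's expandable-segments allocator before relaunching.
# This mitigates fragmentation; it cannot recover truly exhausted VRAM.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "PYTORCH_CUDA_ALLOC_CONF=$PYTORCH_CUDA_ALLOC_CONF"
```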