Hey!
I ran into a CUDA OOM that surprised me a bit, because I expected an NVIDIA H100 80GB to be sufficient for this workload.
The run used my slightly modified bad-prompt dataset and a config.toml with almost entirely default settings.
Summary
The auto-benchmark selected batch size 128, but the run later failed with an out-of-memory error at "Obtaining residuals for good prompts".
When I manually reduced the batch size to 64, the run continued successfully.
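For anyone hitting the same thing, the manual workaround can be automated with a generic halve-on-OOM retry loop. This is just a sketch, not heretic code: run_with_backoff, step, and oom_exc are my own names, and in a real PyTorch run you would pass oom_exc=torch.cuda.OutOfMemoryError.

```python
def run_with_backoff(step, batch_size, min_batch=1, oom_exc=RuntimeError):
    """Call step(batch_size), halving the batch each time it OOMs.

    `step` stands in for whatever batched stage is failing (here,
    residual extraction). With PyTorch, pass
    oom_exc=torch.cuda.OutOfMemoryError so only OOMs trigger a retry.
    """
    while batch_size >= min_batch:
        try:
            return step(batch_size)
        except oom_exc:
            batch_size //= 2  # retry with half the batch
    raise RuntimeError(f"still OOM at the minimum batch size {min_batch}")
```

With the numbers above, a step that OOMs at 128 would be retried once at 64 and succeed.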
Observations
- Auto-selected batch size: 128
- Failure point: Obtaining residuals for good prompts
- Manual batch size of 64 works
- VRAM during the successful run:
  - peaked at ~74 GB
  - then settled to ~65 GB
Question
Could someone explain why the run would OOM after the auto-benchmark had already chosen batch size 128?
I’m guessing the residual extraction step has a higher memory requirement than the earlier benchmark path, but I’m still learning the framework and wanted to confirm whether that is expected.
Likely cause
The auto-selected batch size seems to be based on a lighter generation workload than the later residual-collection step.
Autotuning succeeds at batch size 128 during normal generation, but get_residuals_batched() appears to use much more VRAM because it enables output_hidden_states=True and builds per-layer residual tensors.
So the batch size chosen by the benchmark is valid for response generation, but not necessarily for residual extraction. The Qwen3.5 traceback also suggests extra temporary memory is needed in its linear-attention path, which may contribute to the OOM.
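A back-of-the-envelope estimate supports this. The layer count (64) and dtype (BF16) come from the log below; the sequence length and hidden size are assumed round numbers for illustration, not confirmed Qwen3.5-27B values:

```python
# Extra VRAM held by hidden states when output_hidden_states=True.
batch = 128        # auto-selected batch size
seq_len = 512      # assumed padded prompt length
hidden = 5120      # assumed hidden size
layers = 64        # reported in the log
bytes_per = 2      # BF16

# One hidden-state tensor per layer, plus the embedding output,
# stays alive for the whole forward pass.
extra_gib = (layers + 1) * batch * seq_len * hidden * bytes_per / 2**30
print(f"~{extra_gib:.1f} GiB of hidden states")
```

Even if the real dimensions differ, the point stands: this term scales linearly with batch size and simply is not present in the generation path that the benchmark measures.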
Environment
- GPU: NVIDIA H100 80GB
- Model: Qwen3.5-27B-BF16
- Config: mostly default config.toml
- Dataset: lightly modified bad prompt dataset
Error log below
Detected 1 CUDA device(s) (79.18 GB total VRAM):
- GPU 0: NVIDIA H100 80GB HBM3 (79.18 GB)
Loading model Qwen/Qwen3.5-27B...
- Trying dtype auto... Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading weights: 100%|██████████████████████████| 1184/1184 [00:08<00:00, 147.43it/s]
Ok
- LoRA adapters initialized (targets: out_proj, down_proj, o_proj)
- Transformer model with 64 layers
- Abliterable components:
- attn.o_proj: 1 modules per layer
- mlp.down_proj: 1 modules per layer
Resident system RAM: 1.71 GB
Allocated GPU VRAM: 51.00 GB
Reserved GPU VRAM: 51.18 GB
Loading good prompts from mlabonne/harmless_alpaca...
Loading bad prompts from ./bad_prompts...
Generating train split: 442 examples [00:00, 187734.92 examples/s]
Generating test split: 130 examples [00:00, 107758.80 examples/s]
Determining optimal batch size...
- Trying batch size 1... Ok (14 tokens/s)
- Trying batch size 2... Ok (26 tokens/s)
- Trying batch size 4... Ok (52 tokens/s)
- Trying batch size 8... Ok (77 tokens/s)
- Trying batch size 16... Ok (153 tokens/s)
- Trying batch size 32... Ok (268 tokens/s)
- Trying batch size 64... Ok (482 tokens/s)
- Trying batch size 128... Ok (584 tokens/s)
- Chosen batch size: 128
Checking for common response prefix...
Loading good evaluation prompts from mlabonne/harmless_alpaca...
- 100 prompts loaded
- Obtaining first-token probability distributions...
Loading bad evaluation prompts from ./bad_prompts...
Generating train split: 100%|███████████| 442/442 [00:00<00:00, 176998.51 examples/s]
Generating test split: 100%|████████████| 130/130 [00:00<00:00, 102646.75 examples/s]
- 130 prompts loaded
- Counting model refusals...
- Initial refusals: 121/130
Calculating per-layer refusal directions...
- Obtaining residuals for good prompts...
╭──────────────────────── Traceback (most recent call last) ────────────────────────╮
│ /root/code/heretic/.venv/bin/heretic:10 in │
│ /root/code/heretic/src/heretic/main.py:925 in main │
│ /root/code/heretic/src/heretic/main.py:439 in run │
│ /root/code/heretic/src/heretic/model.py:672 in get_residuals_batched │
│ /root/code/heretic/src/heretic/model.py:626 in get_residuals │
│ /root/code/heretic/src/heretic/model.py:580 in generate │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/peft/peft_model.py:2048 │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/generation/... │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
│ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
╰───────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB.
GPU 0 has a total capacity of 79.18 GiB of which 120.31 MiB is free.
Including non-PyTorch memory, this process has 79.05 GiB memory in use.
Of the allocated memory 76.69 GiB is allocated by PyTorch,
and 1.69 GiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
to avoid fragmentation.
See documentation for Memory Management:
https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
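If anyone wants to try the allocator hint from the error message before dropping the batch size: expandable_segments is a documented PyTorch allocator option, though it only helps when fragmentation is the problem. Since only ~1.69 GiB here is reserved-but-unallocated, I would not expect it to be sufficient on its own.

```shell
# Enable PyTorch's expandable-segments allocator before relaunching.
# This mitigates fragmentation; it cannot recover truly exhausted VRAM.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "PYTORCH_CUDA_ALLOC_CONF=$PYTORCH_CUDA_ALLOC_CONF"
```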