
OOM Error - H100 80GB - Qwen3.5-27B #220

@Wakarimasen-Ai

Description


Hey!

I ran into a CUDA OOM that surprised me a bit, because I expected an NVIDIA H100 80GB to be sufficient for this workload.

The run was using my slightly modified bad prompt dataset and a config.toml with almost entirely default settings.

Summary

The auto-benchmark selected batch size 128, but the run later failed with an out-of-memory error at:

Obtaining residuals for good prompts

When I manually reduced the batch size to 64, the run continued successfully.
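For reference, this is roughly how I overrode it. The key name below is my guess at heretic's config schema (check the shipped config.toml for the actual name); 64 is simply the value that completed successfully:

```toml
# Hypothetical override -- the actual key name in heretic's
# config.toml may differ.
batch_size = 64
```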

Observations

  • Auto-selected batch size: 128
  • Failure point: Obtaining residuals for good prompts
  • Manual batch size of 64 works
  • VRAM during successful run:
    • peaked at ~74 GB
    • then settled to ~65 GB

Question

Could someone explain why the run would OOM after the auto-benchmark had already chosen batch size 128?

I’m guessing the residual extraction step has a higher memory requirement than the earlier benchmark path, but I’m still learning the framework and wanted to confirm whether that is expected.

Likely cause

The auto-selected batch size seems to be based on a lighter generation workload than the later residual-collection step.

Autotuning succeeds at batch size 128 during normal generation, but get_residuals_batched() appears to use much more VRAM because it enables output_hidden_states=True and builds per-layer residual tensors.

So the batch size chosen by the benchmark is valid for response generation, but not necessarily for residual extraction. The Qwen3.5 traceback also suggests extra temporary memory is needed in its linear-attention path, which may contribute to the OOM.
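A back-of-the-envelope estimate illustrates why retaining per-layer hidden states can dwarf plain generation. All dimensions below are assumptions for illustration except the layer count (64), which the load log reports; the hidden size and sequence length are guesses for a ~27B model:

```python
# Rough estimate of the extra VRAM consumed when output_hidden_states=True.
num_layers = 64          # from the log: "Transformer model with 64 layers"
hidden_size = 5120       # assumed hidden dimension for a ~27B model
seq_len = 512            # assumed prompt + generation length
batch_size = 128         # the auto-benchmarked batch size
bytes_per_elem = 2       # bf16 storage

# transformers returns one hidden-state tensor per layer plus the
# embedding output, i.e. num_layers + 1 tensors of shape
# (batch, seq, hidden), all of which stay resident during collection.
n_tensors = num_layers + 1
extra_bytes = n_tensors * batch_size * seq_len * hidden_size * bytes_per_elem
print(f"~{extra_bytes / 2**30:.1f} GiB of extra activations")
```

Under these assumptions that is roughly 40 GiB of activations on top of the ~51 GB of weights, while halving the batch size halves it, which is consistent with the run fitting at 64 but not at 128.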

Environment

  • GPU: NVIDIA H100 80GB
  • Model: Qwen3.5-27B-BF16
  • Config: mostly default config.toml
  • Dataset: lightly modified bad prompt dataset

Error log below

Detected 1 CUDA device(s) (79.18 GB total VRAM):

  • GPU 0: NVIDIA H100 80GB HBM3 (79.18 GB)

Loading model Qwen/Qwen3.5-27B...

  • Trying dtype auto... Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
    Loading weights: 100%|██████████████████████████| 1184/1184 [00:08<00:00, 147.43it/s]
    Ok
  • LoRA adapters initialized (targets: out_proj, down_proj, o_proj)
  • Transformer model with 64 layers
  • Abliterable components:
    • attn.o_proj: 1 modules per layer
    • mlp.down_proj: 1 modules per layer

Resident system RAM: 1.71 GB
Allocated GPU VRAM: 51.00 GB
Reserved GPU VRAM: 51.18 GB

Loading good prompts from mlabonne/harmless_alpaca...

  • 400 prompts loaded

Loading bad prompts from ./bad_prompts...
Generating train split: 442 examples [00:00, 187734.92 examples/s]
Generating test split: 130 examples [00:00, 107758.80 examples/s]

  • 442 prompts loaded

Determining optimal batch size...

  • Trying batch size 1... Ok (14 tokens/s)
  • Trying batch size 2... Ok (26 tokens/s)
  • Trying batch size 4... Ok (52 tokens/s)
  • Trying batch size 8... Ok (77 tokens/s)
  • Trying batch size 16... Ok (153 tokens/s)
  • Trying batch size 32... Ok (268 tokens/s)
  • Trying batch size 64... Ok (482 tokens/s)
  • Trying batch size 128... Ok (584 tokens/s)
  • Chosen batch size: 128

Checking for common response prefix...

  • None found

Loading good evaluation prompts from mlabonne/harmless_alpaca...

  • 100 prompts loaded
  • Obtaining first-token probability distributions...

Loading bad evaluation prompts from ./bad_prompts...
Generating train split: 100%|███████████| 442/442 [00:00<00:00, 176998.51 examples/s]
Generating test split: 100%|████████████| 130/130 [00:00<00:00, 102646.75 examples/s]

  • 130 prompts loaded
  • Counting model refusals...
  • Initial refusals: 121/130

Calculating per-layer refusal directions...

  • Obtaining residuals for good prompts...
    ╭──────────────────────── Traceback (most recent call last) ────────────────────────╮
    │ /root/code/heretic/.venv/bin/heretic:10 in │
    │ /root/code/heretic/src/heretic/main.py:925 in main │
    │ /root/code/heretic/src/heretic/main.py:439 in run │
    │ /root/code/heretic/src/heretic/model.py:672 in get_residuals_batched │
    │ /root/code/heretic/src/heretic/model.py:626 in get_residuals │
    │ /root/code/heretic/src/heretic/model.py:580 in generate │
    │ /root/code/heretic/.venv/lib/python3.12/site-packages/peft/peft_model.py:2048 │
    │ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/generation/... │
    │ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
    │ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
    │ /root/code/heretic/.venv/lib/python3.12/site-packages/transformers/models/qwen... │
    ╰───────────────────────────────────────────────────────────────────────────────────╯

OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB.
GPU 0 has a total capacity of 79.18 GiB of which 120.31 MiB is free.
Including non-PyTorch memory, this process has 79.05 GiB memory in use.
Of the allocated memory 76.69 GiB is allocated by PyTorch,
and 1.69 GiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
to avoid fragmentation.
See documentation for Memory Management:
https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
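One thing worth trying before rerunning: the allocator hint from the traceback can be set in the launching shell. `expandable_segments` is a documented PyTorch allocator option that mitigates fragmentation; it won't help if the workload genuinely needs more than 80 GB, but it can rescue near-miss cases like this one, where 1.69 GiB is reserved but unallocated:

```shell
# Documented PyTorch allocator setting; reduces OOMs caused by
# fragmentation of the CUDA caching allocator.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
```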
