Fix transformers 5.x compatibility + GPU inference optimisations#268

Open
markstrefford wants to merge 4 commits into lyogavin:main from markstrefford:main

Conversation


@markstrefford markstrefford commented Mar 9, 2026

Summary

Compatibility fixes for transformers 5.x, bitsandbytes 0.49+, PyTorch 2.10+

  • Made the optimum.bettertransformer import optional (optimum is no longer required to be installed)
  • Added the _is_stateful attribute for GenerationMixin compatibility with transformers 5.x
  • Fixed DynamicCache handling (the cache is no longer subscriptable in transformers 5.x)
  • Fixed pre-quantized 4-bit weight loading (check_quantized_param -> param_needs_quantization)
  • Fixed decoder layer output handling (layers return a tensor directly, not a tuple, in transformers 5.x)
  • Added position_embeddings support for new rotary embedding API
  • Updated requirements.txt to reflect actual dependencies
  • Added CUDA PyTorch install instructions to README
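The optional-import change in the first bullet follows the standard guarded-import pattern. A minimal sketch (the flag and helper names here are illustrative, not necessarily the ones used in the PR):

```python
# Guarded import: BetterTransformer support becomes opt-in rather than
# a hard dependency. If optimum is absent, we record that and fall back.
try:
    from optimum.bettertransformer import BetterTransformer
    HAS_BETTERTRANSFORMER = True
except ImportError:
    BetterTransformer = None
    HAS_BETTERTRANSFORMER = False

def maybe_to_bettertransformer(model):
    """Apply BetterTransformer only when optimum is installed;
    otherwise return the model unchanged (e.g. using default SDPA attention)."""
    if HAS_BETTERTRANSFORMER:
        return BetterTransformer.transform(model)
    return model
```

With this guard, importing the package no longer raises ImportError on machines without optimum, which is the behaviour the PR describes.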

GPU inference optimisations

  • Multi-layer GPU batching: loads multiple layers onto GPU simultaneously based on available VRAM, computes them back-to-back, and cleans up once per batch instead of per layer. Reduces clean_memory() calls from ~83 to ~9 per forward pass on a 70B model.
  • Model reuse between forward passes: reuses the model skeleton instead of deleting and recreating it every forward() call. Eliminates repeated BetterTransformer/SDPA detection and AutoModelForCausalLM.from_config() overhead.
  • Cached batch sizing: layer size estimation only performed once, not every forward pass.
  • New layers_per_batch parameter: "auto" (default), integer to override, or 1 for original behaviour. Fully backward-compatible.
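The "auto" mode of layers_per_batch can be understood as a simple division of usable VRAM by per-layer size. A pure-Python sketch of that sizing logic, under assumptions of this sketch (the function name, and the 0.8 safety headroom, are illustrative and not taken from the PR):

```python
def resolve_layers_per_batch(layers_per_batch, free_vram_bytes, layer_bytes,
                             num_layers, safety_fraction=0.8):
    """Decide how many layers to load onto the GPU per batch.

    - "auto": fit as many layers as free VRAM allows, with headroom
      left for activations (safety_fraction is an assumption here).
    - an integer: explicit override.
    - 1: reproduces the original layer-by-layer behaviour.
    """
    if layers_per_batch == "auto":
        usable = int(free_vram_bytes * safety_fraction)
        n = max(1, usable // layer_bytes)      # at least one layer per batch
        return min(n, num_layers)              # never more than the model has
    return max(1, int(layers_per_batch))
```

For example, with 8 GiB free and roughly 700 MiB per 4-bit layer, this yields batches of 9 layers, which is the order of magnitude behind the ~83 -> ~9 reduction in clean_memory() calls described above.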

Test plan

Completed

  • Tested with unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit on an RTX 3070 (8 GB VRAM)
  • Model loads and runs inference successfully through all 83 layers
  • Correct output generated ("Washington D.C." for a capital-of-the-US prompt)
  • layers_per_batch="auto" correctly auto-sizes based on free VRAM
  • Model reuse eliminates repeated init messages
  • Steady-state per-pass throughput roughly doubled

Remaining

  • Verify with a non-quantized (FP16) model
  • Verify with compression='4bit' (the prefetching-disabled path)
  • Verify on other architectures (ChatGLM, QWen, Mistral)

Environment

  • Python 3.11, PyTorch 2.10.0+cu126, transformers 5.3.0, bitsandbytes 0.49.2, Windows 11

🤖 Generated with Claude Code

markstrefford and others added 4 commits March 9, 2026 08:52
…handling

Prevents ImportError when optimum is not installed, allowing airllm to
work without BetterTransformer support.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…2.10+

- Add _is_stateful attribute for GenerationMixin compatibility
- Handle DynamicCache (no longer subscriptable in transformers 5.x)
- Fix pre-quantized 4-bit weight loading (check_quantized_param -> param_needs_quantization)
- Fix decoder layer output handling (returns tensor, not tuple in transformers 5.x)
- Add position_embeddings support for new rotary embedding API
- Update requirements.txt to reflect actual dependencies
- Add CUDA PyTorch install instructions to README

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Documents all six breaking changes found and how they were resolved,
useful as a reference for others hitting the same issues.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Load multiple layers onto GPU simultaneously based on available VRAM,
  compute them back-to-back, and clean up once per batch instead of per
  layer. Reduces clean_memory() calls from 83 to ~9 per forward pass.
- Reuse model skeleton between forward passes instead of deleting and
  recreating it every time. Eliminates repeated BetterTransformer/SDPA
  detection and model reconstruction overhead.
- Cache batch size calculation so layer file is only read once.
- New layers_per_batch parameter: "auto" (default), integer, or 1 for
  original behaviour.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@markstrefford markstrefford changed the title Fix compatibility with transformers 5.x, bitsandbytes 0.49+, PyTorch 2.10+ Fix transformers 5.x compatibility + GPU inference optimisations Mar 9, 2026