Fix transformers 5.x compatibility + GPU inference optimisations#268

Open
markstrefford wants to merge 4 commits into lyogavin:main from markstrefford:main

Conversation


@markstrefford markstrefford commented Mar 9, 2026

Summary

Compatibility fixes for transformers 5.x, bitsandbytes 0.49+, PyTorch 2.10+

  • Made the optimum.bettertransformer import optional (optimum is no longer required to be installed)
  • Added the _is_stateful attribute for GenerationMixin compatibility with transformers 5.x
  • Fixed DynamicCache handling (the cache is no longer subscriptable in transformers 5.x)
  • Fixed pre-quantized 4-bit weight loading (check_quantized_param -> param_needs_quantization)
  • Fixed decoder layer output handling (layers return a tensor directly, not a tuple, in transformers 5.x)
  • Added position_embeddings support for new rotary embedding API
  • Updated requirements.txt to reflect actual dependencies
  • Added CUDA PyTorch install instructions to README
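The optional-import change in the first bullet follows the standard guarded-import pattern. A minimal sketch (the flag and helper names here are illustrative, not necessarily the ones used in the PR):

```python
# Guarded import: BetterTransformer support becomes opt-in rather than
# a hard dependency. If optimum is absent, we record that and fall back.
try:
    from optimum.bettertransformer import BetterTransformer
    HAS_BETTERTRANSFORMER = True
except ImportError:
    BetterTransformer = None
    HAS_BETTERTRANSFORMER = False

def maybe_to_bettertransformer(model):
    """Apply BetterTransformer only when optimum is installed;
    otherwise return the model unchanged (e.g. using default SDPA attention)."""
    if HAS_BETTERTRANSFORMER:
        return BetterTransformer.transform(model)
    return model
```

With this guard, importing the package no longer raises ImportError on machines without optimum, which is the behaviour the PR describes.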

GPU inference optimisations

  • Multi-layer GPU batching: loads multiple layers onto GPU simultaneously based on available VRAM, computes them back-to-back, and cleans up once per batch instead of per layer. Reduces clean_memory() calls from ~83 to ~9 per forward pass on a 70B model.
  • Model reuse between forward passes: reuses the model skeleton instead of deleting and recreating it every forward() call. Eliminates repeated BetterTransformer/SDPA detection and AutoModelForCausalLM.from_config() overhead.
  • Cached batch sizing: layer size estimation only performed once, not every forward pass.
  • New layers_per_batch parameter: "auto" (default), integer to override, or 1 for original behaviour. Fully backward-compatible.
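The "auto" mode of layers_per_batch can be understood as a simple division of usable VRAM by per-layer size. A pure-Python sketch of that sizing logic, under assumptions of this sketch (the function name, and the 0.8 safety headroom, are illustrative and not taken from the PR):

```python
def resolve_layers_per_batch(layers_per_batch, free_vram_bytes, layer_bytes,
                             num_layers, safety_fraction=0.8):
    """Decide how many layers to load onto the GPU per batch.

    - "auto": fit as many layers as free VRAM allows, with headroom
      left for activations (safety_fraction is an assumption here).
    - an integer: explicit override.
    - 1: reproduces the original layer-by-layer behaviour.
    """
    if layers_per_batch == "auto":
        usable = int(free_vram_bytes * safety_fraction)
        n = max(1, usable // layer_bytes)      # at least one layer per batch
        return min(n, num_layers)              # never more than the model has
    return max(1, int(layers_per_batch))
```

For example, with 8 GiB free and roughly 700 MiB per 4-bit layer, this yields batches of 9 layers, which is the order of magnitude behind the ~83 -> ~9 reduction in clean_memory() calls described above.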

Test plan

Completed

  • Tested with unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit on an RTX 3070 (8 GB VRAM)
  • Model loads and runs inference successfully through all 83 layers
  • Correct output generated ("Washington D.C." for a capital-of-the-US prompt)
  • layers_per_batch="auto" correctly auto-sizes based on free VRAM
  • Model reuse eliminates repeated init messages
  • Steady-state per-pass throughput roughly doubled

Remaining

  • Verify with a non-quantized (FP16) model
  • Verify with compression='4bit' (the prefetching-disabled path)
  • Verify on other architectures (ChatGLM, QWen, Mistral)

Environment

  • Python 3.11, PyTorch 2.10.0+cu126, transformers 5.3.0, bitsandbytes 0.49.2, Windows 11

🤖 Generated with Claude Code

markstrefford and others added 4 commits March 9, 2026 08:52
…handling

Prevents ImportError when optimum is not installed, allowing airllm to
work without BetterTransformer support.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…2.10+

- Add _is_stateful attribute for GenerationMixin compatibility
- Handle DynamicCache (no longer subscriptable in transformers 5.x)
- Fix pre-quantized 4-bit weight loading (check_quantized_param -> param_needs_quantization)
- Fix decoder layer output handling (returns tensor, not tuple in transformers 5.x)
- Add position_embeddings support for new rotary embedding API
- Update requirements.txt to reflect actual dependencies
- Add CUDA PyTorch install instructions to README

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Documents all six breaking changes found and how they were resolved,
useful as a reference for others hitting the same issues.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Load multiple layers onto GPU simultaneously based on available VRAM,
  compute them back-to-back, and clean up once per batch instead of per
  layer. Reduces clean_memory() calls from 83 to ~9 per forward pass.
- Reuse model skeleton between forward passes instead of deleting and
  recreating it every time. Eliminates repeated BetterTransformer/SDPA
  detection and model reconstruction overhead.
- Cache batch size calculation so layer file is only read once.
- New layers_per_batch parameter: "auto" (default), integer, or 1 for
  original behaviour.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@markstrefford markstrefford changed the title Fix compatibility with transformers 5.x, bitsandbytes 0.49+, PyTorch 2.10+ Fix transformers 5.x compatibility + GPU inference optimisations Mar 9, 2026