Fix transformers 5.x compatibility + GPU inference optimisations #268
Open
markstrefford wants to merge 4 commits into lyogavin:main from
Conversation
…handling

Prevents ImportError when optimum is not installed, allowing airllm to work without BetterTransformer support.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
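The optional-import pattern this commit describes can be sketched roughly as follows. This is a minimal illustration, not airllm's actual code; the `maybe_to_bettertransformer` helper name and the module-level flag are assumptions.

```python
# Sketch: guard the optimum import so airllm still works without it.
# If optimum is missing, fall back to the plain model instead of raising.
try:
    from optimum.bettertransformer import BetterTransformer
    HAS_BETTERTRANSFORMER = True
except ImportError:
    BetterTransformer = None
    HAS_BETTERTRANSFORMER = False


def maybe_to_bettertransformer(model):
    """Apply BetterTransformer only when optimum is installed."""
    if HAS_BETTERTRANSFORMER:
        return BetterTransformer.transform(model)
    return model  # no optimum: return the model unchanged
```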
…2.10+

- Add _is_stateful attribute for GenerationMixin compatibility
- Handle DynamicCache (no longer subscriptable in transformers 5.x)
- Fix pre-quantized 4-bit weight loading (check_quantized_param -> param_needs_quantization)
- Fix decoder layer output handling (returns tensor, not tuple in transformers 5.x)
- Add position_embeddings support for new rotary embedding API
- Update requirements.txt to reflect actual dependencies
- Add CUDA PyTorch install instructions to README

Co-Authored-By: Claude Opus 4.6 <[email protected]>
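The DynamicCache change above can be handled by normalising cache access in one place. The sketch below is illustrative only: the helper name is invented, and the `key_cache`/`value_cache` attributes shown are from the transformers 4.x DynamicCache; the exact 5.x attribute layout may differ.

```python
# Sketch: read per-layer key/value states from either the legacy tuple
# cache or a DynamicCache-style object, which is no longer subscriptable
# in transformers 5.x.
def get_layer_kv(past_key_values, layer_idx):
    if past_key_values is None:
        return None
    # Legacy format: tuple/list of (key, value) pairs, one per layer.
    if isinstance(past_key_values, (tuple, list)):
        return past_key_values[layer_idx]
    # Cache object: use its attributes instead of indexing the cache itself.
    if hasattr(past_key_values, "key_cache"):
        return (past_key_values.key_cache[layer_idx],
                past_key_values.value_cache[layer_idx])
    raise TypeError(f"Unsupported cache type: {type(past_key_values)}")
```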
Documents all six breaking changes found and how they were resolved; useful as a reference for others hitting the same issues.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
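One of the breaking changes documented here (decoder layers returning a bare tensor rather than a tuple in transformers 5.x) can be absorbed with a small normaliser. This is a sketch with an invented helper name, not the PR's actual code.

```python
# Sketch: accept either layer-output shape across transformers versions.
def layer_hidden_states(layer_output):
    """Return hidden states whether the layer returned a tuple or a tensor."""
    if isinstance(layer_output, tuple):
        return layer_output[0]   # transformers 4.x: (hidden_states, ...)
    return layer_output          # transformers 5.x: bare tensor
```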
- Load multiple layers onto GPU simultaneously based on available VRAM, compute them back-to-back, and clean up once per batch instead of per layer. Reduces clean_memory() calls from 83 to ~9 per forward pass.
- Reuse model skeleton between forward passes instead of deleting and recreating it every time. Eliminates repeated BetterTransformer/SDPA detection and model reconstruction overhead.
- Cache batch size calculation so the layer file is only read once.
- New layers_per_batch parameter: "auto" (default), integer, or 1 for original behaviour.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
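The batching scheme described in this commit can be sketched as below. All names (`auto_layers_per_batch`, `run_layers_batched`, `load_layer`, `clean_memory`) and the 0.8 safety margin are illustrative assumptions, not airllm's real API.

```python
# Sketch of the "auto" batch sizing and batched-layer execution ideas.

def auto_layers_per_batch(layer_bytes, free_vram_bytes, safety_margin=0.8):
    """Guess how many layers fit in free VRAM (the "auto" setting)."""
    usable = int(free_vram_bytes * safety_margin)
    return max(1, usable // layer_bytes)


def run_layers_batched(layer_files, hidden, load_layer, clean_memory,
                       layers_per_batch=1):
    """Load layers in groups, compute back-to-back, clean up once per batch."""
    for start in range(0, len(layer_files), layers_per_batch):
        batch = [load_layer(f)
                 for f in layer_files[start:start + layers_per_batch]]
        for layer in batch:   # layers stay resident on GPU for the batch
            hidden = layer(hidden)
        del batch
        clean_memory()        # one cleanup per batch, not per layer
    return hidden
```

With layers_per_batch=1 this degenerates to the original one-layer-at-a-time behaviour, which is why the change can stay backward-compatible.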
Summary
Compatibility fixes for transformers 5.x, bitsandbytes 0.49+, PyTorch 2.10+
- Made the optimum.bettertransformer import optional (no longer required to be installed)
- Added _is_stateful attribute for GenerationMixin compatibility with transformers 5.x
- Fixed DynamicCache handling (no longer subscriptable in transformers 5.x)
- Fixed pre-quantized 4-bit weight loading (check_quantized_param -> param_needs_quantization)
- Added position_embeddings support for the new rotary embedding API
- Updated requirements.txt to reflect actual dependencies

GPU inference optimisations
- Batched layer loading reduces clean_memory() calls from ~83 to ~9 per forward pass on a 70B model.
- Reuses the model skeleton between each forward() call. Eliminates repeated BetterTransformer/SDPA detection and AutoModelForCausalLM.from_config() overhead.
- New layers_per_batch parameter: "auto" (default), an integer to override, or 1 for the original behaviour. Fully backward-compatible.

Test plan
- Tested with unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit on RTX 3070 (8GB VRAM)
- Verified layers_per_batch="auto" correctly auto-sizes based on free VRAM

Environment
🤖 Generated with Claude Code