feat(gguf): add Qwen3.5 (qwen3-next) hybrid MoE GGUF loader#2049

Open
emanueleDiVizio wants to merge 3 commits into EricLBuehler:master from emanueleDiVizio:feat/qwen35-gguf

Conversation


@emanueleDiVizio emanueleDiVizio commented Apr 2, 2026

Summary

This PR adds full GGUF support for the Qwen3.5 architecture (internally qwen3-next), including hybrid GDN (GatedDeltaNet) + attention layers and MoE variants.

Qwen3.5 is not a standard transformer: it combines recurrent GDN layers, attention blocks, and mixture-of-experts routing. Supporting it in GGUF required implementing both the architecture and its tensor layout/dtype semantics.

This enables inference of quantized Qwen3.5 models (e.g. Qwen3.5-35B-Instruct GGUF) on both Metal and CUDA backends.

Key changes

Model support

  • New quantized_qwen3_next.rs implementing:
    • Hybrid GDN + attention execution
    • MoE (including shared expert variant)
    • Dense (non-MoE) variant
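
The hybrid GDN + attention execution above can be sketched as a per-layer dispatch. The 4-layer interval below is an illustrative assumption for how attention layers are interleaved among GDN layers; the real loader would read the layer schedule from GGUF metadata.

```rust
// Sketch of hybrid layer dispatch: most layers are recurrent GDN
// (GatedDeltaNet) layers; a full-attention layer appears at a fixed
// interval. `FULL_ATTENTION_INTERVAL = 4` is a hypothetical value for
// illustration, not a constant taken from this PR.

#[derive(Debug, PartialEq)]
enum LayerKind {
    GatedDeltaNet, // recurrent layer: fixed-size state, no KV cache
    Attention,     // full attention: sequence-length KV cache
}

const FULL_ATTENTION_INTERVAL: usize = 4;

fn layer_kind(layer_idx: usize) -> LayerKind {
    // Every FULL_ATTENTION_INTERVAL-th layer (1-based) is full attention.
    if (layer_idx + 1) % FULL_ATTENTION_INTERVAL == 0 {
        LayerKind::Attention
    } else {
        LayerKind::GatedDeltaNet
    }
}

fn main() {
    let kinds: Vec<LayerKind> = (0..8).map(layer_kind).collect();
    println!("{:?}", kinds);
}
```

The same dispatch decides at load time which tensors to expect for a layer (GDN conv/state projections vs. attention Q/K/V) and at runtime which cache slot the layer touches.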

GGUF compatibility fixes

  • Correct conv1d weight layout transformation (kernel, dim → dim, 1, kernel)
  • RoPE computation uses model dtype instead of forced F32
  • QRmsNorm casts weights to input dtype (BF16 compatibility)
  • V-head expansion matches GGUF tensor layout
  • SharedExpert gate properly dequantized and reshaped
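
The conv1d relayout in the first bullet can be sketched with plain index math. A flat `Vec<f32>` stands in for the tensor type here; the real code operates on (de)quantized GGUF tensors, so names and signatures are illustrative only.

```rust
// Sketch of the conv1d weight relayout: GGUF stores the GDN short-
// convolution weight as (kernel, dim), while the runtime conv1d expects
// (dim, 1, kernel), i.e. one single-channel kernel per feature dim.
// Row-major layout is assumed throughout.

fn relayout_conv1d(w: &[f32], kernel: usize, dim: usize) -> Vec<f32> {
    assert_eq!(w.len(), kernel * dim);
    let mut out = vec![0.0f32; dim * kernel];
    for k in 0..kernel {
        for d in 0..dim {
            // source, row-major (kernel, dim): w[k][d]
            // dest, row-major (dim, 1, kernel): out[d][0][k]
            out[d * kernel + k] = w[k * dim + d];
        }
    }
    out
}

fn main() {
    // kernel = 2, dim = 3: each row is one kernel tap across all dims.
    let w = vec![
        1.0, 2.0, 3.0, // tap 0 over dims 0..3
        4.0, 5.0, 6.0, // tap 1 over dims 0..3
    ];
    let out = relayout_conv1d(&w, 2, 3);
    // dim 0 gets taps (1.0, 4.0), dim 1 gets (2.0, 5.0), dim 2 gets (3.0, 6.0)
    assert_eq!(out, vec![1.0, 4.0, 2.0, 5.0, 3.0, 6.0]);
}
```

Getting this transposition wrong produces a model that loads cleanly but emits garbage, since every feature dim convolves with the wrong taps; that is why it is listed as a compatibility fix rather than a feature.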

Runtime state

  • Hybrid cache:
    • GDN recurrent state (fixed size)
    • Attention KV cache (sequence-dependent)
  • Proper reset between requests
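
The two-part cache above can be sketched as a struct holding a fixed-size recurrent state per GDN layer and a growing KV buffer per attention layer. Field names, shapes, and the flat `Vec<f32>` storage are illustrative, not the PR's actual types.

```rust
// Sketch of the per-request hybrid cache: fixed-size GDN recurrent states
// plus sequence-dependent attention KV caches, with a reset between
// requests. All shapes are flattened for illustration.

struct HybridCache {
    // One fixed-size recurrent state per GDN layer.
    gdn_states: Vec<Vec<f32>>,
    // One (keys, values) buffer per attention layer; grows with seq len.
    kv_caches: Vec<(Vec<f32>, Vec<f32>)>,
}

impl HybridCache {
    fn new(num_gdn: usize, state_size: usize, num_attn: usize) -> Self {
        Self {
            gdn_states: vec![vec![0.0; state_size]; num_gdn],
            kv_caches: vec![(Vec::new(), Vec::new()); num_attn],
        }
    }

    // Reset between requests: zero the recurrent states (size is fixed),
    // drop the cached keys/values (size is per-sequence).
    fn reset(&mut self) {
        for s in &mut self.gdn_states {
            s.iter_mut().for_each(|x| *x = 0.0);
        }
        for (k, v) in &mut self.kv_caches {
            k.clear();
            v.clear();
        }
    }
}

fn main() {
    let mut cache = HybridCache::new(6, 4, 2);
    cache.gdn_states[0][0] = 1.5;
    cache.kv_caches[1].0.extend([0.1, 0.2]);
    cache.reset();
    assert_eq!(cache.gdn_states[0][0], 0.0);
    assert!(cache.kv_caches[1].0.is_empty());
}
```

The asymmetry is the key design point: GDN state is O(1) in sequence length so it is zeroed in place, while attention KV is O(seq_len) so it is freed outright.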

Testing

  • Verified on:
    • Qwen3.5-35B-Instruct GGUF (MoE)
    • Qwen3.5 dense GGUF variant
  • Tested on Metal and CUDA backends
  • Validated multi-turn conversation correctness

Add GGUF quantized model support for the Qwen3.5 architecture
(qwen3-next), which combines:
- Full attention layers with GDN (Gated DeltaNet) recurrent layers
- Mixture-of-Experts with shared experts
- Support for both MoE (Qwen3.5-35B) and dense variants

Key implementation details:
- QRmsNorm casts weights to input dtype for GGUF BF16 compatibility
- RoPE uses model dtype instead of F32 for GGUF BF16
- GDN conv1d weight layout transposed from (kernel, dim) to (dim, 1, kernel)
- V-head expansion uses tiled layout for both MoE and dense
- Local hybrid cache (GDN + attention KV) with proper cleanup
- SharedExpert gate dequantized and reshaped for Linear

Registers `Qwen35` and `Qwen35Moe` GGUF architecture variants.
