feat(gguf): add Qwen3.5 (qwen3-next) hybrid MoE GGUF loader#2049

Open
emanueleDiVizio wants to merge 3 commits into EricLBuehler:master from emanueleDiVizio:feat/qwen35-gguf

Conversation


@emanueleDiVizio emanueleDiVizio commented Apr 2, 2026

Summary

This PR adds full GGUF support for the Qwen3.5 architecture (internally qwen3-next), including hybrid GDN (GatedDeltaNet) + attention layers and MoE variants.

Qwen3.5 is not a standard transformer: it combines recurrent GDN layers, attention blocks, and mixture-of-experts routing. Supporting it in GGUF required implementing both the architecture and its tensor layout/dtype semantics.

This enables inference of quantized Qwen3.5 models (e.g. Qwen3.5-35B-Instruct GGUF) on both Metal and CUDA backends.

Key changes

Model support

  • New quantized_qwen3_next.rs implementing:
    • Hybrid GDN + attention execution
    • MoE (including shared expert variant)
    • Dense (non-MoE) variant
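
The hybrid GDN + attention execution above can be sketched as a per-layer dispatch. The 4-layer interval below is an illustrative assumption for how attention layers are interleaved among GDN layers; the real loader would read the layer schedule from GGUF metadata.

```rust
// Sketch of hybrid layer dispatch: most layers are recurrent GDN
// (GatedDeltaNet) layers; a full-attention layer appears at a fixed
// interval. `FULL_ATTENTION_INTERVAL = 4` is a hypothetical value for
// illustration, not a constant taken from this PR.

#[derive(Debug, PartialEq)]
enum LayerKind {
    GatedDeltaNet, // recurrent layer: fixed-size state, no KV cache
    Attention,     // full attention: sequence-length KV cache
}

const FULL_ATTENTION_INTERVAL: usize = 4;

fn layer_kind(layer_idx: usize) -> LayerKind {
    // Every FULL_ATTENTION_INTERVAL-th layer (1-based) is full attention.
    if (layer_idx + 1) % FULL_ATTENTION_INTERVAL == 0 {
        LayerKind::Attention
    } else {
        LayerKind::GatedDeltaNet
    }
}

fn main() {
    let kinds: Vec<LayerKind> = (0..8).map(layer_kind).collect();
    println!("{:?}", kinds);
}
```

The same dispatch decides at load time which tensors to expect for a layer (GDN conv/state projections vs. attention Q/K/V) and at runtime which cache slot the layer touches.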

GGUF compatibility fixes

  • Correct conv1d weight layout transformation (kernel, dim → dim, 1, kernel)
  • RoPE computation uses model dtype instead of forced F32
  • QRmsNorm casts weights to input dtype (BF16 compatibility)
  • V-head expansion matches GGUF tensor layout
  • SharedExpert gate properly dequantized and reshaped
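
The conv1d relayout in the first bullet can be sketched with plain index math. A flat `Vec<f32>` stands in for the tensor type here; the real code operates on (de)quantized GGUF tensors, so names and signatures are illustrative only.

```rust
// Sketch of the conv1d weight relayout: GGUF stores the GDN short-
// convolution weight as (kernel, dim), while the runtime conv1d expects
// (dim, 1, kernel), i.e. one single-channel kernel per feature dim.
// Row-major layout is assumed throughout.

fn relayout_conv1d(w: &[f32], kernel: usize, dim: usize) -> Vec<f32> {
    assert_eq!(w.len(), kernel * dim);
    let mut out = vec![0.0f32; dim * kernel];
    for k in 0..kernel {
        for d in 0..dim {
            // source, row-major (kernel, dim): w[k][d]
            // dest, row-major (dim, 1, kernel): out[d][0][k]
            out[d * kernel + k] = w[k * dim + d];
        }
    }
    out
}

fn main() {
    // kernel = 2, dim = 3: each row is one kernel tap across all dims.
    let w = vec![
        1.0, 2.0, 3.0, // tap 0 over dims 0..3
        4.0, 5.0, 6.0, // tap 1 over dims 0..3
    ];
    let out = relayout_conv1d(&w, 2, 3);
    // dim 0 gets taps (1.0, 4.0), dim 1 gets (2.0, 5.0), dim 2 gets (3.0, 6.0)
    assert_eq!(out, vec![1.0, 4.0, 2.0, 5.0, 3.0, 6.0]);
}
```

Getting this transposition wrong produces a model that loads cleanly but emits garbage, since every feature dim convolves with the wrong taps; that is why it is listed as a compatibility fix rather than a feature.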

Runtime state

  • Hybrid cache:
    • GDN recurrent state (fixed size)
    • Attention KV cache (sequence-dependent)
  • Proper reset between requests
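
The two-part cache above can be sketched as a struct holding a fixed-size recurrent state per GDN layer and a growing KV buffer per attention layer. Field names, shapes, and the flat `Vec<f32>` storage are illustrative, not the PR's actual types.

```rust
// Sketch of the per-request hybrid cache: fixed-size GDN recurrent states
// plus sequence-dependent attention KV caches, with a reset between
// requests. All shapes are flattened for illustration.

struct HybridCache {
    // One fixed-size recurrent state per GDN layer.
    gdn_states: Vec<Vec<f32>>,
    // One (keys, values) buffer per attention layer; grows with seq len.
    kv_caches: Vec<(Vec<f32>, Vec<f32>)>,
}

impl HybridCache {
    fn new(num_gdn: usize, state_size: usize, num_attn: usize) -> Self {
        Self {
            gdn_states: vec![vec![0.0; state_size]; num_gdn],
            kv_caches: vec![(Vec::new(), Vec::new()); num_attn],
        }
    }

    // Reset between requests: zero the recurrent states (size is fixed),
    // drop the cached keys/values (size is per-sequence).
    fn reset(&mut self) {
        for s in &mut self.gdn_states {
            s.iter_mut().for_each(|x| *x = 0.0);
        }
        for (k, v) in &mut self.kv_caches {
            k.clear();
            v.clear();
        }
    }
}

fn main() {
    let mut cache = HybridCache::new(6, 4, 2);
    cache.gdn_states[0][0] = 1.5;
    cache.kv_caches[1].0.extend([0.1, 0.2]);
    cache.reset();
    assert_eq!(cache.gdn_states[0][0], 0.0);
    assert!(cache.kv_caches[1].0.is_empty());
}
```

The asymmetry is the key design point: GDN state is O(1) in sequence length so it is zeroed in place, while attention KV is O(seq_len) so it is freed outright.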

Testing

  • Verified on:
    • Qwen3.5-35B-Instruct GGUF (MoE)
    • Qwen3.5 dense GGUF variant
  • Tested on Metal and CUDA backends
  • Validated multi-turn conversation correctness

Add GGUF quantized model support for the Qwen3.5 architecture
(qwen3-next), which combines:
- Full attention layers with GDN (Gated DeltaNet) recurrent layers
- Mixture-of-Experts with shared experts
- Support for both MoE (Qwen3.5-35B) and dense variants

Key implementation details:
- QRmsNorm casts weights to input dtype for GGUF BF16 compatibility
- RoPE uses model dtype instead of F32 for GGUF BF16
- GDN conv1d weight layout transposed from (kernel, dim) to (dim, 1, kernel)
- V-head expansion uses tiled layout for both MoE and dense
- Local hybrid cache (GDN + attention KV) with proper cleanup
- SharedExpert gate dequantized and reshaped for Linear

Registers `Qwen35` and `Qwen35Moe` GGUF architecture variants.
