fix(metal): GDN bfloat16, PA scheduler, error handling, MLX SDPA fixes by emanueleDiVizio · Pull Request #2047 · EricLBuehler/mistral.rs

emanueleDiVizio · 2026-04-02T17:15:59Z

Summary

This PR fixes multiple correctness, performance, and stability issues encountered while running mistral.rs on Apple Silicon (M-series) with real multi-user inference workloads (Qwen3.5 MoE + Mixtral).

The changes focus on:

Metal backend correctness (GDN + KV cache)
Scheduler behaviour under load (PagedAttention)
Robustness in concurrent serving scenarios
MLX integration improvements for attention kernels

Several of these issues only surface under concurrent decode or long-running sessions.

Key changes

Scheduler (from upstream PRs #2031/#2034)

Fix O(N²) thrashing in PagedAttention scheduler under mixed waiting/active workloads
Introduce FCFS priority ordering to prevent starvation
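
For reviewers unfamiliar with the ordering change, here is a minimal sketch of FCFS admission. It is not the scheduler code from this PR or from #2031/#2034; `Seq`, `admit_fcfs`, and the block budget are hypothetical names used only for illustration. The waiting queue is sorted once by arrival and admitted from the front in a single pass, stopping at the first sequence that does not fit so that a later, smaller request cannot overtake an earlier one.

```rust
use std::collections::VecDeque;

/// Hypothetical waiting sequence; not a mistral.rs type.
#[derive(Debug, Clone, PartialEq)]
pub struct Seq {
    pub id: u64,
    /// Monotonically increasing arrival stamp (assumption for this sketch).
    pub arrival: u64,
    /// Number of KV blocks the sequence needs before it can run.
    pub blocks_needed: usize,
}

/// Admit waiting sequences in first-come-first-served order.
///
/// The queue is sorted once by arrival, then drained from the front while
/// the free-block budget allows. The loop stops at the first sequence that
/// does not fit, so the oldest waiting request is never skipped.
pub fn admit_fcfs(waiting: &mut VecDeque<Seq>, mut free_blocks: usize) -> Vec<Seq> {
    // Stable sort: sequences with equal stamps keep their queue order.
    waiting.make_contiguous().sort_by_key(|s| s.arrival);

    let mut admitted = Vec::new();
    while let Some(front) = waiting.front() {
        if front.blocks_needed > free_blocks {
            break; // head of line does not fit; do not let later ones jump ahead
        }
        free_blocks -= front.blocks_needed;
        if let Some(seq) = waiting.pop_front() {
            admitted.push(seq);
        }
    }
    admitted
}
```

Stopping at the head of the line trades some utilisation for a starvation guarantee; a real scheduler may make a different trade-off.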

GDN / Metal

Fix dtype mismatch (bfloat vs bfloat16_t) in Metal kernels
Add per-sequence fallback for concurrent decode when recurrent offsets diverge
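
The per-sequence fallback can be pictured as a simple dispatch on the recurrent offsets. The sketch below is an illustration under assumptions, not the code in this PR: `DecodePath` and `choose_decode_path` are hypothetical names, and offsets are modelled as plain `usize` values. When every sequence in the batch shares the same offset, the batched path is taken; otherwise each sequence is decoded on its own.

```rust
/// Which decode path a batch takes in this sketch (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum DecodePath {
    /// All sequences share one recurrent offset, so one batched call suffices.
    Batched { offset: usize },
    /// Offsets diverge, so each sequence is decoded with its own offset.
    PerSequence(Vec<usize>),
}

/// Pick the decode path from the per-sequence recurrent offsets.
/// Returns `None` for an empty batch.
pub fn choose_decode_path(offsets: &[usize]) -> Option<DecodePath> {
    let first = *offsets.first()?;
    if offsets.iter().all(|&o| o == first) {
        Some(DecodePath::Batched { offset: first })
    } else {
        Some(DecodePath::PerSequence(offsets.to_vec()))
    }
}
```

For example, `choose_decode_path(&[7, 7, 7])` selects the batched path, while `choose_decode_path(&[7, 9])` falls back to per-sequence decode.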

Stability

Replace panic on client disconnect with error handling
Return error instead of panic on block allocation failure (race condition)
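
Both stability fixes follow the same pattern: a condition that used to abort the process becomes a `Result` the caller can handle. The sketch below illustrates that pattern only. It uses `std::sync::mpsc` as a stand-in for whatever channel type the server actually uses, and `ServeError`, `send_response`, and `blocks_for` are hypothetical names, not mistral.rs APIs.

```rust
use std::collections::HashMap;
use std::fmt;
use std::sync::mpsc::Sender;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ServeError {
    ClientDisconnected,
    MissingBlockAllocation { seq_id: u64 },
}

impl fmt::Display for ServeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ServeError::ClientDisconnected => {
                write!(f, "client disconnected before the response was sent")
            }
            ServeError::MissingBlockAllocation { seq_id } => {
                write!(f, "no block allocation found for sequence {}", seq_id)
            }
        }
    }
}

impl std::error::Error for ServeError {}

/// Send a response; a closed receiver becomes an error instead of a panic
/// (the panicking variant would be `tx.send(msg).unwrap()`).
pub fn send_response(tx: &Sender<String>, msg: String) -> Result<(), ServeError> {
    tx.send(msg).map_err(|_| ServeError::ClientDisconnected)
}

/// Look up the blocks allocated to a sequence; a missing entry becomes an
/// error instead of a panic (the panicking variant would index or unwrap).
pub fn blocks_for(
    table: &HashMap<u64, Vec<usize>>,
    seq_id: u64,
) -> Result<&[usize], ServeError> {
    table
        .get(&seq_id)
        .map(|blocks| blocks.as_slice())
        .ok_or(ServeError::MissingBlockAllocation { seq_id })
}
```

With this shape, a disconnected client or a lost allocation race fails only the affected request, and the caller decides whether to log, retry, or drop the sequence.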

Performance / Features

Increase Metal KV cache default max_seq_len (4K → 16K)
Add optional MLX SDPA backend with Metal flash attention (head_dim=256 support)
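
To give a sense of what the 4K → 16K default implies, here is a back-of-the-envelope sketch using the standard dense KV cache size formula. It assumes a cache sized in proportion to `max_seq_len`, which may not match how the Metal KV cache in mistral.rs actually allocates; `kv_cache_bytes` and all of its parameters are hypothetical, with only `head_dim = 256` taken from this PR.

```rust
/// Bytes needed for a dense K+V cache for one sequence:
/// 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes per element.
/// Hypothetical helper for illustration; not a mistral.rs function.
pub fn kv_cache_bytes(
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    seq_len: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}
```

Under that assumption, one layer with one KV head at `head_dim = 256` in a 2-byte dtype goes from `kv_cache_bytes(1, 1, 256, 4096, 2)` = 4,194,304 bytes to `kv_cache_bytes(1, 1, 256, 16384, 2)` = 16,777,216 bytes, a 4× increase that scales with the number of layers and KV heads.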

Test plan

Validated on Apple Silicon (M-series)
Tested with Qwen3.5 MoE (GDN) and Mixtral
Scheduler fixes verified under concurrent request load


emanueleDiVizio added 9 commits April 2, 2026 19:08

fix(metal): use bfloat instead of bfloat16_t in GDN Metal kernels

642243d

fix(metal): add include guard to float8.metal for PagedAttention

8d03002

feat(metal): increase default max_seq_len from 4K to 16K for Metal KV cache

ad3cb23

fix(paged_attention): fix O(N^2) thrashing + FCFS priority in PA scheduler

5e7dad2

Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling complexity when sequences are waiting, and add FCFS priority ordering to prevent starvation.

fix: don't panic when sending error response to disconnected client

27e79fa

fix: GDN concurrent decode per-sequence fallback when recurrent offsets diverge

a3c2db0

fix(paged_attention): return error instead of panic on missing block allocation

4e3e9e4

feat(metal): add MLX SDPA backend with steel flash attention for Metal prefill

f5098ed

Add an optional MLX SDPA backend using steel flash attention kernels for Metal prefill. Enable head_dim=256 support for models like Qwen3.5 that use larger attention head dimensions.

feat: derive Clone for Model (wraps Arc, cheap clone)

af3e7f0

emanueleDiVizio force-pushed the fix/metal-fixes branch from 3bbe0a9 to af3e7f0 Compare April 2, 2026 17:35

emanueleDiVizio wants to merge 9 commits into EricLBuehler:master from emanueleDiVizio:fix/metal-fixes
