Release v0.8.0 · EricLBuehler/mistral.rs

What's Changed

Tweaks to docs and readme by @EricLBuehler in #1854
Upgrade Metal standard from 3.0 to 3.1 by @lizzzcai in #1861
fix stable diffusion readme by @setoelkahfi in #1857
Use cudaforge for kernel build by @guoqingbao in #1856
Bump bytes from 1.11.0 to 1.11.1 by @dependabot[bot] in #1865
Fix accuracy of fused glu metal and cuda impls by @EricLBuehler in #1867
Bump time from 0.3.45 to 0.3.47 by @dependabot[bot] in #1868
Fix for ViT + flash attn case by @EricLBuehler in #1869
Parallel + I/O pipelined ISQ by @EricLBuehler in #1870
Fix gptoss sliding window case with prefix caching by @EricLBuehler in #1871
Change gguf files delimiter to ';' by @synek317 in #1873
GPT-OSS paged attention with sinks support, MoE prefill kernels across CUDA, Metal, and CPU by @EricLBuehler in #1872
Fix streaming sse hang on error event by @EricLBuehler in #1875
Support Qwen 3 Next by @EricLBuehler in #1864
Fix completions ignoring logprobs by @EricLBuehler in #1877
Fixes for Qwen 3 VL family by @EricLBuehler in #1878
Add new quant method: F8Q8 by @EricLBuehler in #1883
fix(docker): install git in CUDA builders for flash-attn-v3 CUTLASS fetch by @glaziermag in #1885
Bump to 0.7.1-alpha.1 by @EricLBuehler in #1880
fix(core): use unix seconds for streaming chunk created timestamp by @glaziermag in #1887
feat: tvos metal support by @setoelkahfi in #1891
Rewrite paged attention for block-level prefix caching with KV gather kernels by @EricLBuehler in #1890
Fix contiguous error with phi3 gguf by @EricLBuehler in #1892
fix(core): handle missing BOS token in calibration path by @glaziermag in #1895
feat: add optional save_file for url image generation response format by @setoelkahfi in #1893
fix(metal): load metallib from memory instead of temp file for sandbox compatibility by @EricLBuehler in #1898
fix(build): enable vendored Swagger UI for offline compilation by @EricLBuehler in #1899
fix(cuda): account for tensor storage offset in GDN kernel launches by @EricLBuehler in #1900
fix(cuda): account for tensor storage offset in moe kernel launches by @EricLBuehler in #1901
Implement GGUF for Mistral3 by @Cooksey99 in #1771
feat(rust sdk): deferred media prefixing, typed errors, and API cleanup, restructure examples by @EricLBuehler in #1904
feat(models): add Voxtral Mini 4B real-time speech recognition model by @EricLBuehler in #1905
fix(ci): add Metal and CUDA+NCCL compile checks by @EricLBuehler in #1907
fix(device_map): pre-allocate masks per device to reduce OOM pressure by @EricLBuehler in #1908
feat(pyo3): release GIL around blocking Runner operations to improve Python SDK by @EricLBuehler in #1909
feat(server-core): make utoipa-swagger-ui an optional feature by @EricLBuehler in #1910
fix(server-core): terminate SSE streams when response channel closes by @EricLBuehler in #1943
ci: disable docs deployment on forks by @haricot in #1942
fix: memory limit constants for 32-bit targets in attention and ISQ by @setoelkahfi in #1933
fix(gguf): verify_arch_any used AND logic instead of OR by @n-engine in #1916
fix(#1934): emulate negative step range in chat templates by @haricot in #1941
fix(vision): correct Qwen VL multi-turn image processing and thinking model token decoding by @EricLBuehler in #1950
Update MCP client documentation link in README by @naufraghi in #1935
feat(models): support Qwen 3.5 model family by @EricLBuehler in #1993
feat(cli): add --uqff-base-model and --uqff-repo-id flags to quantize command by @EricLBuehler in #1994
fix(cli): ensure readme matches older versions by @EricLBuehler in #1995
fix(isq): bits standardize format for numerical isq setting by @EricLBuehler in #1997
fix(metal): upgrade paged-attn to Metal 3.1 for native bfloat16 support by @ljchang in #2010
Small fix for Voxtral: load params.json before config.json if present by @jam10o-new in #1979
Fix UQFF loading for MoE models in Qwen2Loader by @glaziermag in #1977
fix(metal): auto-retry on iOS Metal background GPU permission error by @EricLBuehler in #2015
fix(ring): support Ring backend properly in more models by @EricLBuehler in #2016
fix(cache): set hybrid recurrent state_indices during prompt cache reset by @EricLBuehler in #2017
feat(quant): add MXFP4 ISQ with optimized decode kernels by @EricLBuehler in #2018
refactor(wrapper-crates): reduce duplicated builder and request glue by @EricLBuehler in #2019
fix(docs): duplicate entry in SUMMARY.md breaks docs build by @EricLBuehler in #2020
Implement the Gemma 4 model by @EricLBuehler in #2046

New Contributors

@lizzzcai made their first contribution in #1861
@setoelkahfi made their first contribution in #1857
@synek317 made their first contribution in #1873
@glaziermag made their first contribution in #1885
@n-engine made their first contribution in #1916
@naufraghi made their first contribution in #1935
@ljchang made their first contribution in #2010
@jam10o-new made their first contribution in #1979

Full Changelog: v0.7.0...v0.8.0
