What's Changed
- Tweaks to docs and readme by @EricLBuehler in #1854
- Upgrade Metal standard from 3.0 to 3.1 by @lizzzcai in #1861
- fix stable diffusion readme by @setoelkahfi in #1857
- Use cudaforge for kernel build by @guoqingbao in #1856
- Bump bytes from 1.11.0 to 1.11.1 by @dependabot[bot] in #1865
- Fix accuracy of fused glu metal and cuda impls by @EricLBuehler in #1867
- Bump time from 0.3.45 to 0.3.47 by @dependabot[bot] in #1868
- Fix for ViT + flash attn case by @EricLBuehler in #1869
- Parallel + I/O pipelined ISQ by @EricLBuehler in #1870
- Fix gptoss sliding window case with prefix caching by @EricLBuehler in #1871
- Change gguf files delimiter to ';' by @synek317 in #1873
- GPT-OSS paged attention with sinks support, MoE prefill kernels across CUDA, Metal, and CPU by @EricLBuehler in #1872
- Fix streaming sse hang on error event by @EricLBuehler in #1875
- Support Qwen 3 Next by @EricLBuehler in #1864
- Fix completions ignoring logprobs by @EricLBuehler in #1877
- Fixes for Qwen 3 VL family by @EricLBuehler in #1878
- Add new quant method: F8Q8 by @EricLBuehler in #1883
- fix(docker): install git in CUDA builders for flash-attn-v3 CUTLASS fetch by @glaziermag in #1885
- Bump to 0.7.1-alpha.1 by @EricLBuehler in #1880
- fix(core): use unix seconds for streaming chunk created timestamp by @glaziermag in #1887
- feat: tvos metal support by @setoelkahfi in #1891
- Rewrite paged attention for block-level prefix caching with KV gather kernels by @EricLBuehler in #1890
- Fix contiguous error with phi3 gguf by @EricLBuehler in #1892
- fix(core): handle missing BOS token in calibration path by @glaziermag in #1895
- feat: add optional save_file for url image generation response format by @setoelkahfi in #1893
- fix(metal): load metallib from memory instead of temp file for sandbox compatibility by @EricLBuehler in #1898
- fix(build): enable vendored Swagger UI for offline compilation by @EricLBuehler in #1899
- fix(cuda): account for tensor storage offset in GDN kernel launches by @EricLBuehler in #1900
- fix(cuda): account for tensor storage offset in moe kernel launches by @EricLBuehler in #1901
- Implement GGUF for Mistral3 by @Cooksey99 in #1771
- feat(rust sdk): deferred media prefixing, typed errors, and API cleanup, restructure examples by @EricLBuehler in #1904
- feat(models): add Voxtral Mini 4B real-time speech recognition model by @EricLBuehler in #1905
- fix(ci): add Metal and CUDA+NCCL compile checks by @EricLBuehler in #1907
- fix(device_map): pre-allocate masks per device to reduce OOM pressure by @EricLBuehler in #1908
- feat(pyo3): release GIL around blocking Runner operations to improve Python SDK by @EricLBuehler in #1909
- feat(server-core): make utoipa-swagger-ui an optional feature by @EricLBuehler in #1910
- fix(server-core): terminate SSE streams when response channel closes by @EricLBuehler in #1943
- ci: disable docs deployment on forks by @haricot in #1942
- fix: memory limit constants for 32-bit targets in attention and ISQ by @setoelkahfi in #1933
- fix(gguf): verify_arch_any used AND logic instead of OR by @n-engine in #1916
- fix(#1934): emulate negative step range in chat templates by @haricot in #1941
- fix(vision): correct Qwen VL multi-turn image processing and thinking model token decoding by @EricLBuehler in #1950
- Update MCP client documentation link in README by @naufraghi in #1935
- feat(models): support Qwen 3.5 model family by @EricLBuehler in #1993
- feat(cli): add --uqff-base-model and --uqff-repo-id flags to quantize command by @EricLBuehler in #1994
- fix(cli): ensure readme matches older versions by @EricLBuehler in #1995
- fix(isq): standardize bits format for numerical isq setting by @EricLBuehler in #1997
- fix(metal): upgrade paged-attn to Metal 3.1 for native bfloat16 support by @ljchang in #2010
- Small fix for Voxtral: load params.json before config.json if present by @jam10o-new in #1979
- Fix UQFF loading for MoE models in Qwen2Loader by @glaziermag in #1977
- fix(metal): auto-retry on iOS Metal background GPU permission error by @EricLBuehler in #2015
- fix(ring): properly support Ring backend in more models by @EricLBuehler in #2016
- fix(cache): set hybrid recurrent state_indices during prompt cache reset by @EricLBuehler in #2017
- feat(quant): add MXFP4 ISQ with optimized decode kernels by @EricLBuehler in #2018
- refactor(wrapper-crates): reduce duplicated builder and request glue by @EricLBuehler in #2019
- fix(docs): duplicate entry in SUMMARY.md breaks docs build by @EricLBuehler in #2020
- Implement the Gemma 4 model by @EricLBuehler in #2046
New Contributors
- @lizzzcai made their first contribution in #1861
- @setoelkahfi made their first contribution in #1857
- @synek317 made their first contribution in #1873
- @glaziermag made their first contribution in #1885
- @n-engine made their first contribution in #1916
- @naufraghi made their first contribution in #1935
- @ljchang made their first contribution in #2010
- @jam10o-new made their first contribution in #1979
Full Changelog: v0.7.0...v0.8.0