This release makes constrained generation much faster and more stable.
Aligning grammar advancement with the token decoding flow raised constrained throughput from 87 tokens/s to 94 tokens/s on Qwen3 4B, against an unconstrained baseline of 100 tokens/s. A new grammar compiler cache also removes most of the repeated setup cost: 10 consecutive calls dropped from 22.3s to 10.7s, close to the 10.1s unconstrained baseline.
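The compiler cache follows the usual compile-once, reuse-everywhere pattern: the compiled grammar is keyed by its source text, so only the first of several consecutive calls pays the compilation cost. A minimal sketch of the idea in Python (names are hypothetical, not the library's actual API):

```python
# Hypothetical sketch of a grammar compiler cache: compile each unique
# grammar source once and reuse the compiled artifact on later calls.
class GrammarCompilerCache:
    def __init__(self):
        self._cache = {}
        self.compile_calls = 0  # tracked only to demonstrate reuse

    def compile(self, grammar_source: str):
        # Return the cached artifact when the source was seen before.
        compiled = self._cache.get(grammar_source)
        if compiled is None:
            compiled = self._expensive_compile(grammar_source)
            self._cache[grammar_source] = compiled
        return compiled

    def _expensive_compile(self, grammar_source: str):
        # Stand-in for the real (slow) grammar compilation step.
        self.compile_calls += 1
        return ("compiled", grammar_source)

cache = GrammarCompilerCache()
for _ in range(10):  # ten consecutive calls with the same grammar
    artifact = cache.compile('root ::= "yes" | "no"')
```

After the loop, `cache.compile_calls` is 1: the grammar was compiled once and reused nine times, which is why repeated calls approach the unconstrained baseline.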
## What's Changed
- Update dependencies to latest versions (align public API with latest MLXLM changes)
- Align grammar advance with token iterator flow (faster generation)
- Add grammar compiler cache and reuse (faster subsequent generation)
- Normalize grammar mask length by @iSapozhnik (fixes VLM support)
- Skip advancing after grammar matcher terminates (no more warnings)
- Add more structural tags (Optional, Repeat, Plus, Star)
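The new quantifier tags follow standard repetition semantics: Optional matches zero or one occurrence, Star zero or more, Plus one or more, and Repeat a bounded count. A hypothetical sketch of those semantics (not the library's implementation), mapping each tag to (min, max) repetition bounds:

```python
# Hypothetical illustration of the quantifier tags' semantics:
# each tag constrains how many repetitions of a sub-pattern are allowed.
QUANTIFIERS = {
    "Optional": (0, 1),     # zero or one
    "Star":     (0, None),  # zero or more (no upper bound)
    "Plus":     (1, None),  # one or more (no upper bound)
}

def satisfies(tag: str, count: int, bounds=None) -> bool:
    """Check whether `count` repetitions satisfy the tag's bounds.
    `Repeat` takes explicit (min, max) bounds; the rest are fixed."""
    lo, hi = bounds if tag == "Repeat" else QUANTIFIERS[tag]
    return count >= lo and (hi is None or count <= hi)
```

For example, `satisfies("Plus", 0)` is False while `satisfies("Star", 0)` is True, and `satisfies("Repeat", 3, (2, 4))` is True.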
**Full Changelog**: 0.0.4...0.1.0