This release makes constrained generation much faster and more stable.
Aligning grammar advancement with the token decoding flow raised constrained throughput from 87 tokens/s to 94 tokens/s on Qwen3 4B, against an unconstrained baseline of 100 tokens/s. A new grammar compiler cache also removes most of the repeated setup cost: 10 consecutive calls dropped from 22.3s to 10.7s, close to the 10.1s unconstrained baseline.
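The compiler cache follows the usual compile-once, reuse-everywhere pattern: the compiled grammar is keyed by its source text, so only the first of several consecutive calls pays the compilation cost. A minimal sketch of the idea in Python (names are hypothetical, not the library's actual API):

```python
# Hypothetical sketch of a grammar compiler cache: compile each unique
# grammar source once and reuse the compiled artifact on later calls.
class GrammarCompilerCache:
    def __init__(self):
        self._cache = {}
        self.compile_calls = 0  # tracked only to demonstrate reuse

    def compile(self, grammar_source: str):
        # Return the cached artifact when the source was seen before.
        compiled = self._cache.get(grammar_source)
        if compiled is None:
            compiled = self._expensive_compile(grammar_source)
            self._cache[grammar_source] = compiled
        return compiled

    def _expensive_compile(self, grammar_source: str):
        # Stand-in for the real (slow) grammar compilation step.
        self.compile_calls += 1
        return ("compiled", grammar_source)

cache = GrammarCompilerCache()
for _ in range(10):  # ten consecutive calls with the same grammar
    artifact = cache.compile('root ::= "yes" | "no"')
```

After the loop, `cache.compile_calls` is 1: the grammar was compiled once and reused nine times, which is why repeated calls approach the unconstrained baseline.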
## What's Changed
- Update dependencies to latest versions (align public API with latest MLXLM changes)
- Align grammar advance with token iterator flow (faster generation)
- Add grammar compiler cache and reuse (faster subsequent generation)
- Normalize grammar mask length by @iSapozhnik (fixes VLM support)
- Skip advancing after grammar matcher terminates (no more warnings)
- Add more structural tags (Optional, Repeat, Plus, Star)
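The new quantifier tags follow standard repetition semantics: Optional matches zero or one occurrence, Star zero or more, Plus one or more, and Repeat a bounded count. A hypothetical sketch of those semantics (not the library's implementation), mapping each tag to (min, max) repetition bounds:

```python
# Hypothetical illustration of the quantifier tags' semantics:
# each tag constrains how many repetitions of a sub-pattern are allowed.
QUANTIFIERS = {
    "Optional": (0, 1),     # zero or one
    "Star":     (0, None),  # zero or more (no upper bound)
    "Plus":     (1, None),  # one or more (no upper bound)
}

def satisfies(tag: str, count: int, bounds=None) -> bool:
    """Check whether `count` repetitions satisfy the tag's bounds.
    `Repeat` takes explicit (min, max) bounds; the rest are fixed."""
    lo, hi = bounds if tag == "Repeat" else QUANTIFIERS[tag]
    return count >= lo and (hi is None or count <= hi)
```

For example, `satisfies("Plus", 0)` is False while `satisfies("Star", 0)` is True, and `satisfies("Repeat", 3, (2, 4))` is True.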
**Full Changelog**: 0.0.4...0.1.0