0.1.0

@petrukha-ivan petrukha-ivan released this 06 Apr 20:00
68a169b

This release makes constrained generation much faster and more stable.

By aligning grammar advancement with the token decoding flow, constrained generation throughput on Qwen3 4B improved from 87 tokens/s to 94 tokens/s, against a baseline of 100 tokens/s. Grammar compiler caching also removed most of the repeated setup cost: 10 consecutive calls now take 10.7s instead of 22.3s, close to the 10.1s baseline.
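To illustrate why caching helps on repeated calls, here is a minimal sketch of a compiler cache. It is not this library's implementation: `CompiledGrammar`, `GrammarCompilerCache`, and `slow_compile` are hypothetical names, and keying the cache on the grammar source plus vocabulary size is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CompiledGrammar:
    """Hypothetical stand-in for an expensive-to-build compiled grammar."""
    source: str
    vocab_size: int


class GrammarCompilerCache:
    """Reuses compiled grammars, keyed by (vocab_size, grammar source)."""

    def __init__(self, compile_fn: Callable[[str, int], CompiledGrammar]):
        self._compile_fn = compile_fn
        self._cache: Dict[Tuple[int, str], CompiledGrammar] = {}
        self.misses = 0  # number of times the real compile step ran

    def get(self, source: str, vocab_size: int) -> CompiledGrammar:
        key = (vocab_size, source)
        compiled = self._cache.get(key)
        if compiled is None:
            # Cache miss: pay the compilation cost once, then store the result.
            self.misses += 1
            compiled = self._compile_fn(source, vocab_size)
            self._cache[key] = compiled
        return compiled


def slow_compile(source: str, vocab_size: int) -> CompiledGrammar:
    # Placeholder for the real (expensive) grammar compilation step.
    return CompiledGrammar(source=source, vocab_size=vocab_size)


cache = GrammarCompilerCache(slow_compile)
for _ in range(10):
    grammar = cache.get('root ::= "yes" | "no"', 32000)
# Only the first of the 10 calls compiles; the other 9 reuse the cached result.
```

With this pattern, the setup cost is paid once per distinct grammar and vocabulary rather than once per call, which is the kind of saving the 10-call measurement reflects.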

What's Changed

  • Update dependencies to latest versions (align public API with latest MLXLM changes)
  • Align grammar advance with token iterator flow (faster generation)
  • Add grammar compiler cache and reuse (faster subsequent generation)
  • Normalize grammar mask length by @iSapozhnik (fixes VLM support)
  • Skip advancing after grammar matcher terminates (no more warnings)
  • Add more structural tags (Optional, Repeat, Plus, Star)
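Two of the fixes above can be sketched generically. This is an illustration under assumptions, not the library's code: `normalize_mask`, `ToyMatcher`, and `safe_advance` are hypothetical names. The sketch also assumes that a mask length mismatch arises when a model's logits dimension differs from the tokenizer's vocabulary size, and that padded positions should be disallowed.

```python
from typing import List


def normalize_mask(mask: List[bool], logits_len: int) -> List[bool]:
    """Pad with False (disallowed) or truncate so the mask matches the logits."""
    if len(mask) >= logits_len:
        return mask[:logits_len]
    return mask + [False] * (logits_len - len(mask))


class ToyMatcher:
    """Toy stand-in for a grammar matcher that accepts a fixed number of tokens."""

    def __init__(self, max_tokens: int):
        self.remaining = max_tokens
        self.advance_calls = 0  # counts how often advance() was actually invoked

    @property
    def is_terminated(self) -> bool:
        return self.remaining == 0

    def advance(self, token_id: int) -> bool:
        self.advance_calls += 1
        if self.is_terminated:
            return False
        self.remaining -= 1
        return True


def safe_advance(matcher: ToyMatcher, token_id: int) -> bool:
    """Skip advancing once the matcher has terminated."""
    if matcher.is_terminated:
        return False
    return matcher.advance(token_id)


padded = normalize_mask([True, False], 4)           # mask shorter than logits
truncated = normalize_mask([True, False, True], 2)  # mask longer than logits

matcher = ToyMatcher(max_tokens=2)
results = [safe_advance(matcher, t) for t in (11, 12, 13)]
# The third token arrives after termination, so advance() is never called for it.
```

The guard in `safe_advance` keeps the decoding loop from touching a finished matcher, which is one plausible way to avoid the warnings mentioned in the list.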

Full Changelog: 0.0.4...0.1.0