This is an agentic harness for building reproducible LLM experiments and benchmarks. I built it to automate the work I was putting into evaluating local models for language learning content.
You're welcome to download the source and build it, but note that this project is in a pre-release state: I am pushing straight to main and do not guarantee the stability of APIs or data formats. If and when I publish a release on GitHub, the development process and stability contracts will change.
Sidebar icons are from Game Icons by Lorc, Delapouite, and contributors, licensed under CC BY 3.0. The main logo (raccoon putting on gloves) was made using GPT Image 2.0 and a basic image editor.

