All results measured on Apple Silicon M-series, 16 GB unified memory, macOS, MLX framework.
Evaluation with EleutherAI lm-evaluation-harness v0.4.11 — the same framework
used for the Open LLM Leaderboard.
All v9 numbers measured live on Apple M3, macOS 24.6, 16 GB unified memory, March 2026.
| Metric | Squish v1 | Squish v9 | Note |
|---|---|---|---|
| Load time (1.5B) | 0.53 s | 1.61 s | +server init for 222 modules |
| Load time (7B) | 2.27 s | 3.41 s | same pattern |
| Load time (14B) | 3.36 s | 5.93 s | same pattern |
| TTFT (1.5B) | 668 ms† | 148 ms ✅ | streaming fix confirmed |
| TTFT (7B) | N/A | 533 ms | live measurement |
| TTFT (14B) | N/A | 1,008 ms | live measurement |
| TTFT (14B INT2) | N/A | 1,345 ms‡ | 4.3 GB; coherence collapse |
| Decode throughput (1.5B) | 18.9 tok/s | 7.5 tok/s§ | memory-constrained run |
| KV cache RAM | unbounded | 4× compressed | SnapKV + KIVI |
| Grammar constrain latency | N/A | 5.5 μs/tok | new in v9 |
| MoE routing overhead | N/A | 570 μs | lookahead, 91% hit rate |
| ARC-Easy | 73.5% | 73.5% ✅ | unchanged (200 samples, 0-shot) |
| HellaSwag | 62.0% | 63.0% ✅ | +1pp |
| PIQA | 76.5% | 76.5% ✅ | unchanged |
| WinoGrande | 67.0% | 66.0% | −1pp (within stderr) |
† v1 streaming had a trailing-chunk bug: model output arrived all at once after 48 s rather than streaming.
‡ 2-bit quantization of a pre-trained BF16 checkpoint causes coherence collapse; INT2 is only viable for natively-ternary models such as BitNet b1.58.
§ Measured under memory pressure (~7 GB available RAM on a loaded system); throughput on a dedicated machine would be higher.
Squish evolved across six development phases, growing from 8 modules in v1 to 222 modules in v9. Each phase added a distinct capability layer on top of the previous one.
Phase 1 — Core Cache & Quantization (v1, Waves 1–4): established the Tier 0/1/2 loading hierarchy (squish_4bit → squish_weights.safetensors → finalized f16), per-row INT8 quantization with a vectorized numpy broadcast (37× faster than the original Python for-loop), and the cold-load benchmark baseline: 0.53 s load (1.5B), 18.9 tok/s decode throughput, 160 MB RAM added during load.
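The per-row INT8 scheme can be sketched in a few lines of numpy; a single broadcast replaces the original per-row Python loop. This is an illustrative sketch, not Squish's actual API (function names are hypothetical):

```python
import numpy as np

def quantize_per_row_int8(w: np.ndarray):
    """Symmetric per-row INT8 quantization via one numpy broadcast.
    One float32 scale per row is stored alongside the INT8 payload."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_per_row_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_per_row_int8(w)
err = float(np.abs(dequantize_per_row_int8(q, s) - w).max())
```

The per-row scale bounds the round-trip error at half a quantization step per element, which is what keeps the downstream accuracy deltas within evaluation noise.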
Phase 2 — Streaming & Inference Fixes (v2–v4, Waves 5–8): fixed the trailing-chunk streaming bug that caused all output to be buffered and delivered as a single 48-second chunk instead of token by token. TTFT dropped below 200 ms. Also introduced the health endpoint and made the server production-safe for concurrent requests.
Phase 3 — KV Compression & Async I/O (v5–v6, Waves 9–12): added PM-KVQ (progressive per-token KV bit-width scheduling), MixKVQ (channel-relevance routing to 4.12 avg bits/channel), CocktailKV (chunk-similarity classification), AgileIO (64 MB in-process LRU cache for weight shards, 25× warm-read speedup), and MiLo INT3 weight compression (~5.3× vs FP32). The Metal fused INT8 KV attention kernel (wave 10) measured a 1.71–1.87× speedup versus the reference implementation.
Phase 4 — Flash MLA & Codec Compression (v7, Waves 13–16): introduced Flash Multi-head Latent Attention (flash MLA), delivering 4× KV cache compression. Added the Codec KV compression engine (wave 21–22 bench), achieving a 204.8× compression ratio for KV cache storage and enabling extremely long contexts within the same physical memory envelope. Radix-tree prefix reuse was added so repeated prompt prefixes incur delta-only prefill cost rather than recomputing from scratch.
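Prefix reuse boils down to: find the longest already-cached token prefix, then prefill only the remaining delta. A toy token-per-node sketch (the real radix tree compresses runs of tokens into edges; class names here are illustrative):

```python
class PrefixNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}

class PrefixCache:
    """Token-level prefix tree over previously prefilled prompts."""
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())

    def longest_prefix(self, tokens):
        """Number of leading tokens whose KV entries can be reused."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])                 # first request prefills fully
reused = cache.longest_prefix([1, 2, 3, 9])   # second shares a 3-token prefix
delta = 4 - reused                            # only 1 token needs fresh prefill
```

With a shared system prompt, `reused` covers nearly the whole prefix, which is why repeated prompts see delta-only prefill cost.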
Phase 5 — MoE Lookahead & Speculative Decoding (v8, Waves 17–22): integrated HydraSpec speculative decoding (draft ~1.3 ms, verify ~1.75 ms per step), enabling the projected 28–45 tok/s throughput range on 1.5B. MoE lookahead routing (moe_lookahead_bench) achieved a 91–100% expert-selection hit rate with only ~570 μs per-step overhead, avoiding the full router forward pass on cache hits.
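The draft/verify cycle behind speculative decoding can be illustrated with stub greedy models; `draft_next` and `target_next` are hypothetical callables standing in for the real networks, and this sketches the general greedy-acceptance scheme rather than HydraSpec's exact implementation:

```python
def speculative_step(draft_next, target_next, ctx, k=4):
    """One draft/verify step: the draft proposes k tokens, the target
    verifies them and keeps the longest matching prefix plus one token."""
    proposal, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_next(d_ctx)
        proposal.append(t)
        d_ctx.append(t)
    accepted, v_ctx = [], list(ctx)
    for t in proposal:
        expected = target_next(v_ctx)    # what the target would emit here
        if t == expected:
            accepted.append(t)
            v_ctx.append(t)
        else:
            accepted.append(expected)    # first mismatch: take target's token
            break
    else:
        accepted.append(target_next(v_ctx))  # all k accepted: bonus token
    return accepted

# Toy models: target counts up by 1; the draft mispredicts every 3rd token.
target = lambda ctx: ctx[-1] + 1
draft  = lambda ctx: ctx[-1] + (2 if len(ctx) % 3 == 0 else 1)
out = speculative_step(draft, target, [0], k=4)   # → [1, 2, 3]
```

Greedy acceptance guarantees the output is identical to target-only decoding; throughput improves because several tokens are verified per target pass.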
Phase 6 — Grammar Engine & Production Hardening (v9, Waves 23–26): added the constrained-generation grammar engine (wave 25–26 bench) with 5.5 μs constrain latency and 0.94 μs advance latency per token, enabling structured JSON/regex output with negligible throughput impact. Final module count reached 222. All accuracy benchmarks held at v1 baselines (ARC-Easy 73.5%, HellaSwag 62.0%, PIQA 76.5%, WinoGrande 67.0%) across all phases, with no regression from any optimization layer.
| Model | Tier | Load time | Throughput | Disk (squish) | Disk (original) |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | Tier 1 (safetensors) | 0.4s | 18.9 tok/s | 2.9 GB | 2.9 GB |
| Qwen2.5-7B | Tier 0 (4-bit) | 2.3s | 14.3 tok/s | 4.0 GB | 14.0 GB |
| Qwen2.5-14B | Tier 0 (4-bit) | 3.4s | 7.7 tok/s | 8.3 GB | 29.6 GB |
All measured on 16 GB Apple Silicon. Throughput is memory-bandwidth-limited on a 16 GB part; public benchmarks are typically run on 64+ GB machines.
| Task | 1.5B | 7B | 14B |
|---|---|---|---|
| ARC-Easy (acc_norm) | 73.5% | 75.0% | 82.5% |
| HellaSwag (acc_norm) | 62.0% | 69.5% | 73.0% |
| PIQA (acc_norm) | 76.5% | 83.5% | 82.0% |
| WinoGrande (acc) | 67.0% | 72.5% | 79.0% |
Accuracy scales with model size as expected, with no measurable degradation from 4-bit quantisation versus published bf16 baselines (deltas fall within the measurement variance of a 200-sample evaluation).
14B bf16 = 29.6 GB. Squish builds Tier 0 via mlx_lm.convert(q_bits=4) producing an 8.3 GB
4-bit cache. One-time convert cost: 23s.
| Metric | Value |
|---|---|
| Mean load time | 3.359s |
| Stddev | ±0.156s |
| Min / max | 3.167s / 3.564s |
| Mean throughput | 7.7 tok/s |
| Throughput stddev | ±0.1 tok/s |
| Disk: squish_4bit/ | 8.3 GB |
| Disk: original bf16 | 29.6 GB |
| Disk reduction | 3.6× |
Load time scales sublinearly with model size: 7B → 2.3s, 14B → 3.4s (1.47× the time for 2.08× the on-disk size). The 14B load is bandwidth-bound, not latency-bound.
200 examples/task, loglikelihood evaluation, seed=42:
| Task | Metric | Score | ±stderr |
|---|---|---|---|
| ARC-Easy | acc_norm | 82.5% | ±2.7% |
| HellaSwag | acc_norm | 73.0% | ±3.1% |
| PIQA | acc_norm | 82.0% | ±2.7% |
| WinoGrande | acc | 79.0% | ±2.9% |
The 14B model shows clear gains over 7B on reasoning tasks.
No measurable accuracy degradation from mlx_lm.convert 4-bit quantisation.
| Task | Squish 7B | Squish 14B | Delta |
|---|---|---|---|
| ARC-Easy (acc_norm) | 75.0% | 82.5% | +7.5pp |
| HellaSwag (acc_norm) | 69.5% | 73.0% | +3.5pp |
| PIQA (acc_norm) | 83.5% | 82.0% | -1.5pp |
| WinoGrande (acc) | 72.5% | 79.0% | +6.5pp |
Consistent with published Qwen2.5 scaling results: 14B meaningfully outperforms 7B on reasoning tasks (ARC-Easy, WinoGrande). The small PIQA regression is within measurement variance. Both models maintain full accuracy under squish 4-bit quantisation.
Attempted further compression to 2-bit (mlx_lm.convert --q-bits 2) from the BF16
source to fit more comfortably on 16 GB M3. The resulting model loads in 3.25 s
(vs 5.93 s for INT4) but produces incoherent output — repetitive token loops
(IFYINGIFYINGIFYINGIFYIN...) indicating quantization collapse at 2 bits.
| Metric | INT4 (baseline) | INT2 (experiment) |
|---|---|---|
| Disk size | 7.7 GB | 4.3 GB |
| Load time | 5.93 s | 3.25 s |
| TTFT (mean) | 1,008 ms | 1,345 ms |
| Decode throughput | 1.2 tok/s | 0.6 tok/s |
| Output quality | Coherent | Incoherent (loops) |
Conclusion: 2-bit post-training quantization of a pre-trained BF16 checkpoint
destroys coherence. The INT2 format is only viable for models trained natively at
2 bits (e.g. BitNet b1.58). INT4 remains the practical lower bound for post-training
quantization without fine-tuning. Results saved to dev/results/eoe_v9_14b_int2.json.
The 7B bf16 model (14 GB) exceeds the 16 GB Metal budget. Squish detects this and builds
a Tier 0 cache: runs mlx_lm.convert(q_bits=4, q_group_size=64) once, writing a 4-bit
MLX model to squish_4bit/. Subsequent loads call mlx_lm.load() on that directory.
One-time cost: ~15s to convert. Ongoing: 2.3s cold load, 14.3 tok/s.
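The tier-selection flow above might look roughly like this; the memory-budget heuristic and paths are illustrative (the marker-file names appear in the disk layout later in this document), and the real loader shells out to mlx_lm.convert for the one-time build:

```python
import tempfile
from pathlib import Path

METAL_BUDGET_GB = 16 * 0.75   # assumed usable fraction of 16 GB unified memory

def choose_tier(bf16_size_gb: float, cache_dir: Path):
    """Schematic tier selection: oversized bf16 models get a one-time
    4-bit Tier 0 cache; small models use the Tier 1 safetensors cache."""
    four_bit = cache_dir / "squish_4bit"
    ready = cache_dir / ".squish_4bit_ready"
    if bf16_size_gb > METAL_BUDGET_GB:
        if not ready.exists():
            # one-time build: mlx_lm.convert(..., q_bits=4, q_group_size=64)
            four_bit.mkdir(parents=True, exist_ok=True)
            ready.touch()
        return "tier0", four_bit
    return "tier1", cache_dir / "squish_weights.safetensors"

d = Path(tempfile.mkdtemp())
tier_7b, _ = choose_tier(14.0, d)   # 7B bf16 = 14 GB → needs Tier 0
tier_15b, _ = choose_tier(2.9, d)   # 1.5B bf16 = 2.9 GB → Tier 1 is fine
```

The `.squish_4bit_ready` marker makes the expensive convert idempotent: subsequent loads skip straight to mlx_lm.load() on the 4-bit directory.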
| Metric | Value |
|---|---|
| Mean load time | 2.265s |
| Stddev | ±0.189s |
| Min / max (10 runs) | 2.101s / 2.582s |
| Mean throughput | 14.3 tok/s |
| Throughput stddev | ±0.8 tok/s |
| Disk: squish_4bit/ | 4.0 GB |
| Disk: original bf16 | 14.0 GB |
| Disk reduction | 3.5× |
200 examples/task, loglikelihood evaluation, seed=42:
| Task | Metric | Score | ±stderr |
|---|---|---|---|
| WinoGrande | acc | 72.5% | ±3.2% |
| PIQA | acc_norm | 83.5% | ±2.6% |
| ARC-Easy | acc_norm | 75.0% | ±3.1% |
| HellaSwag | acc_norm | 69.5% | ±3.3% |
Scores are consistent with published Qwen2.5-7B 4-bit benchmarks (within the variance expected at 200 samples per task).
No measurable accuracy degradation from mlx_lm.convert 4-bit quantisation.
| System | Cold-load (first) | Warm-cache load | Throughput | Disk |
|---|---|---|---|---|
| Ollama (qwen2.5:7b Q4_K_M GGUF) | 4.6s | 1.1s | ~15–25 tok/s | ~4.5 GB |
| mlx-lm native Q4 | ~3–6s | ~2–4s† | ~20–30 tok/s‡ | ~4 GB |
| Squish Tier 0 (4-bit) | 2.3s | 2.3s | 14.3 tok/s | 4.0 GB |
Ollama benchmark methodology: 10 runs, qwen2.5:7b, time to first token from cold GPU
state (model evicted via keep_alive=0 between runs). Run 1 = true cold load from disk (4.6s).
Runs 2–10 = GPU-evicted but OS RAM-cached (~1.1s). Measured on identical hardware (16 GB
Apple Silicon, macOS), no concurrent GPU load, Feb 2026.
Mean: 1.5s ±1.1s across all 10 runs.
†mlx-lm warm-cache load includes Metal shader compilation overhead on subsequent runs.
‡mlx-lm throughput figures typically measured on M2 Max/Ultra (64+ GB). On 16 GB M-series
the throughput is bandwidth-limited similarly to Squish.
Model: Qwen2.5-1.5B-Instruct (bf16, 1.5 billion parameters). Tier 1 cache (squish_weights.safetensors).
| Metric | Reference (mlx_lm) | Squish (cached) | Improvement |
|---|---|---|---|
| Wall-clock load time | ~1.96–6.7s† | 0.33–0.53s | 6–14× faster |
| RAM added during load | ~2400 MB | 160 MB | 15× less |
| Peak RAM during load | ~2600 MB | 402 MB | 6× less |
| Disk size | 3087 MB | 2682 MB | 1.15× smaller |
| Safetensors required? | ✅ mandatory | ❌ not needed | Full independence |
†Reference load time varies: 1.96s (OS page cache hot) to 28s (cold, first process). Squish cached load time: 0.33s (warm OS page cache) to 0.53s (within session).
Evaluated using lm-evaluation-harness. Tasks run at 200 examples each.
| Task | Reference | Squish Compressed | Δ | Status |
|---|---|---|---|---|
| ARC-Easy (acc_norm) | 74.5% | 73.5% | −1.0 pp | ✅ PASS |
| HellaSwag (acc_norm) | 63.5% | 62.0% | −1.5 pp | ✅ PASS |
| Winogrande (acc) | 65.5% | 67.0% | +1.5 pp | ✅ PASS |
| PIQA (acc_norm) | 77.5% | 76.5% | −1.0 pp | ✅ PASS |
Pass criterion: ≤ 2 pp accuracy delta (well within evaluation variance for 200 examples).
Winogrande shows the compressed model scoring 1.5 pp higher than the reference; this is within measurement noise and indicates that quantisation noise is uncorrelated with the specific evaluation seeds used.
Measured across all 338 tensors of the model:
| Metric | Value |
|---|---|
| Mean cosine similarity | 0.99999 |
| Min cosine similarity | 0.99995 |
| Max absolute error (representative sample) | 0.00187 |
| Tensors quantised (INT8) | 249 / 338 |
| Tensors passthrough (float16) | 89 / 338 |
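The fidelity numbers come from comparing each original tensor with its quantize→dequantize round trip. A minimal sketch for a single tensor under per-row INT8 (the real harness walks all 338 tensors; helper names are illustrative):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Round-trip one weight tensor through symmetric per-row INT8.
w = np.random.randn(64, 128).astype(np.float32)
scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
w_rt = np.clip(np.round(w / scale), -127, 127) * scale

cos = cosine(w, w_rt)                      # expect ~0.9999+
max_err = float(np.abs(w - w_rt).max())    # bounded by scale/2 per row
```

Cosine similarity near 1.0 plus a bounded max absolute error is what the table above summarises across the whole checkpoint.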
5-prompt evaluation with exact token comparison:
| Category | Prompt | First-token match |
|---|---|---|
| Geography | "The capital of Japan is" | ✅ exact |
| Arithmetic | "15 multiplied by 8 equals" | ✅ exact |
| Science | "The chemical symbol for water is" | ✅ exact |
| Language | "The French word for 'cat' is" | ✅ exact |
| Coding | "To get the length of a list in Python use" | ✅ exact |
5/5 first-token agreement (100%) on both the finalized-f16 and forge-mlx cache paths.
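The parity check itself is just a first-token comparison between two load paths. A sketch with stub generators (the real harness runs the actual reference and cached models; `ref`/`sq` here are hypothetical stand-ins):

```python
def first_token_match(reference_generate, squish_generate, prompts):
    """Return per-prompt booleans: does the first generated token agree
    between the reference path and the Squish cached path?"""
    return {p: reference_generate(p)[0] == squish_generate(p)[0]
            for p in prompts}

prompts = ["The capital of Japan is", "15 multiplied by 8 equals"]

# Stub generators returning a token list; identical by construction here.
ref = lambda p: [" Tokyo"] if "Japan" in p else [" 120"]
sq  = lambda p: [" Tokyo"] if "Japan" in p else [" 120"]

report = first_token_match(ref, sq, prompts)
```

First-token agreement is a cheap smoke test: it catches gross weight-corruption or tokenizer mismatches without running a full accuracy suite.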
| Cache tier | Model | Load time | Notes |
|---|---|---|---|
| Tier 0: squish_4bit (MLX 4-bit) | 7B | 2.3s | Built once via mlx_lm.convert |
| Tier 0: squish_4bit (MLX 4-bit) | 14B | 3.4s | Built once via mlx_lm.convert |
| Tier 1: Squish MLX safetensors | 1.5B | 0.33–0.53s | All subsequent runs |
| Tier 2: Finalized f16 .npy | 1.5B | ~4.5s | Fallback if Tier 1 missing |
Note: Q8 npy-dir (Tier 3) is no longer built for large models (>14 GB). INT8 tensors were never used for inference — Tier 0 is always preferred. Skipping Q8 saves ~580s build time and 8.7–26 GB per model.
| Aspect | Value |
|---|---|
| Quantisation algorithm | Vectorized per-row INT8 (numpy broadcast, 37× faster than loop) |
| Compressed format | squish_4bit/ (4-bit MLX via mlx_lm.convert for large models) |
| Large-model path | mlx_lm.convert(q_bits=4, q_group_size=64) |
| Small-model path | npy-dir Q8 + squish_weights.safetensors Tier 1 cache |
| Scale storage | float32, 4 bytes/row (per-row quantization) |
| Passthrough criterion | Embedding, norm, lm_head tensors |
| Effective bytes/param (4-bit) | ~0.5 |
| Effective bytes/param (INT8) | ~1.08 (INT8 + per-row scales) |
~/models/
Qwen2.5-1.5B-Instruct-bf16/ 2.9 GB (original bf16)
Qwen2.5-1.5B-Instruct-bf16-compressed/ 2.9 GB (Tier 1: squish_weights.safetensors)
Qwen2.5-7B-Instruct-bf16/ 14.0 GB (original bf16)
Qwen2.5-7B-Instruct-bf16-compressed/
squish_4bit/ 4.0 GB ← inference (Tier 0)
    .squish_4bit_ready / .squish_ready (ready markers)
[tensors/ DELETED — freed 8.7 GB]
Qwen2.5-14B-Instruct-bf16/ 29.6 GB (original bf16)
Qwen2.5-14B-Instruct-bf16-compressed/
squish_4bit/ 8.3 GB ← inference (Tier 0)
    .squish_4bit_ready / .squish_ready (ready markers)
[tensors/ DELETED — freed 26 GB]
34.7 GB freed by deleting Q8 npy-dir tensors that were never used for inference.
| Change | Impact |
|---|---|
| Vectorized quantizer (replace Python for-loop with numpy broadcast) | 37× faster compression |
| Skip Q8 phase for large models (>14 GB) | 580s → 0s compression step for 7B/14B |
| Tier 0 check moved before manifest guard in loader | Large models load without Q8 dir |
| Deleted unused Q8 tensors dirs (7B + 14B) | Freed 34.7 GB disk |
# Full run (first time — ~19s first load, subsequent loads 0.33s)
python3 run_poc.py --skip-download --skip-reference
# Reproduce load time numbers
python3 run_poc.py --skip-download --skip-reference --skip-convert
# Reproduce benchmark accuracy numbers
python3 run_eval.py --tasks arc_easy,hellaswag --limit 200
python3 run_eval.py --tasks winogrande,piqa --limit 200
# Full dataset (no --limit) — several hours
python3 run_eval.py --tasks arc_easy,arc_challenge,hellaswag,winogrande,piqa --no-limit

Wave 12 adds five new runtime optimisation modules that work on top of the existing
Squish Tier 0/1 cache: pm_kvq, mix_kvq, cocktail_kv, agile_io, and milo.
All are enabled via CLI flags and compose with every prior wave (1–11).
These benchmarks were measured using dev/benchmarks/bench_wave12.py on CPU/numpy
(codespace). Apple Silicon MLX speedups are indicated where applicable.
| Module | Operation | Latency | Notes |
|---|---|---|---|
| PM-KVQ | scheduler.advance() | 14 µs | Negligible decode overhead |
| PM-KVQ | scheduler.current_bits() | 0.4 µs | Per-block lookup |
| MixKVQ | assign_bits() per step | 72 µs | Channel relevance scoring |
| MixKVQ | quantize() per KV vector | 712 µs | 4.1 avg bits/channel |
| MixKVQ | dequantize() per KV vector | 205 µs | Decode path |
| CocktailKV | store() 512-token KV | 895 µs | Chunk-similarity classification |
| CocktailKV | retrieve() full decode | 187 µs | Reconstruct KV from chunks |
| AgileIO | Cache-hit read avg | 3.5 µs | vs 89–126 µs cold read |
| AgileIO | prefetch_sequence() + get() | 297 µs | 3 files resolved |
| MiLo INT3 | quantize() 128×256 weight | 99 ms | One-time convert cost |
| MiLo INT3 | pack_int3() 8,192 values | 4.2 ms | 62.5% size reduction vs uint8 |
| Precision tier | Channels (64ch head) | Effective avg |
|---|---|---|
| FP16 (most-important) | 6 / 64 (9.4%) | — |
| INT4 (mid-importance) | 26 / 64 (40.6%) | — |
| INT2 (cold / low-importance) | 32 / 64 (50.0%) | — |
| Overall | 64 channels | 4.12 bits/ch |
At 4.12 bits/channel versus FP16's 16 bits, MixKVQ gives a 3.9× KV memory reduction while preserving full precision for the 9.4% highest-relevance channels.
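The 4.12 bits/channel figure follows directly from the tier split: rank channels by relevance and assign FP16/INT4/INT2 by rank. A sketch using the table's 64-channel fractions (`assign_channel_bits` is a hypothetical name, not MixKVQ's API):

```python
import numpy as np

def assign_channel_bits(relevance, frac_fp16=0.094, frac_int4=0.406):
    """Map channels onto three precision tiers by relevance rank:
    top frac_fp16 → FP16, next frac_int4 → INT4, remainder → INT2."""
    n = len(relevance)
    order = np.argsort(relevance)[::-1]   # most relevant channel first
    bits = np.full(n, 2)                  # default: INT2 for cold channels
    n16 = round(n * frac_fp16)
    n4 = round(n * frac_int4)
    bits[order[:n16]] = 16
    bits[order[n16:n16 + n4]] = 4
    return bits

rel = np.random.rand(64)                  # stand-in relevance scores
bits = assign_channel_bits(rel)
avg_bits = bits.mean()                    # (6×16 + 26×4 + 32×2) / 64 = 4.125
```

The average lands at 4.125 bits/channel for a 64-channel head, matching the 4.12 figure in the table (to the precision reported).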
CocktailKV (chunk-similarity routing) independently achieves 2/16 FP16, 6/16 INT4, 8/16 INT2 — similar compression per 32-token chunk window.
| Weight shape | SNR (FP32 vs INT3+LoRA) | Rank | Size vs FP32 |
|---|---|---|---|
| 64 × 128 | 14.7 dB | r=8 | 0.31× |
| 128 × 256 | 13.9 dB | r=8 | 0.22× |
| 256 × 512 | 13.5 dB | r=8 | 0.17× |
MiLo stores weights as packed INT3 (3 bits/param) + a low-rank FP16 compensator (rank ≤ 16). The compensator adds back the dominant quantisation residual. Average compression: ~5.3× vs FP32 for typical transformer weight shapes.
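The INT3-plus-compensator idea is easy to sketch: quantize to a 3-bit grid, take the residual, and keep only its dominant singular directions in FP16. This is a schematic of the approach under simple symmetric quantization, not the shipped MiLo code:

```python
import numpy as np

def int3_quantize(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor 3-bit grid (7 levels in [-max, max])."""
    scale = np.abs(w).max() / 3.0
    return np.clip(np.round(w / scale), -3, 3) * scale

def milo_compress(w: np.ndarray, rank: int = 8):
    """Packed-INT3 weights plus a rank-r FP16 compensator that adds
    back the dominant part of the quantisation residual."""
    w_q = int3_quantize(w)
    residual = w - w_q
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    comp = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return w_q, comp.astype(np.float16)

w = np.random.randn(128, 256).astype(np.float32)
w_q, comp = milo_compress(w, rank=8)
err_plain = float(np.linalg.norm(w - w_q))           # INT3 alone
err_milo = float(np.linalg.norm(w - (w_q + comp)))   # INT3 + compensator
```

The compensator strictly reduces the residual norm at the cost of two small rank-r factors, which is why the table reports higher SNR than plain INT3 would give.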
| Scenario | Latency | Multiplier |
|---|---|---|
| First (cold) disk read — 64 KB | 89 µs | 1× baseline |
| First (cold) disk read — 1 MB | 126 µs | 1.4× |
| Cache-hit (warm) read | 3.5 µs | 25× faster |
| prefetch_sequence() + get() (3 files) | 297 µs | vs ~274 µs cold |
With a 64 MB in-process LRU cache the warm-read path is 25× faster than cold disk.
When used with prefetch_sequence() during the prompt-evaluation phase, NVMe reads
for weight shards are fully hidden behind compute on Apple Silicon NVMe.
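A 64 MB in-process LRU over shard reads is a few lines with `OrderedDict`; `read_fn` here is a stand-in for the real disk read, and the class is an AgileIO-style sketch rather than its actual interface:

```python
from collections import OrderedDict

class ShardCache:
    """Byte-capacity LRU cache over weight-shard reads."""
    def __init__(self, read_fn, capacity_bytes=64 * 1024 * 1024):
        self.read_fn = read_fn
        self.capacity = capacity_bytes
        self.data = OrderedDict()
        self.size = 0

    def get(self, path):
        if path in self.data:
            self.data.move_to_end(path)      # warm hit (~3.5 µs path)
            return self.data[path]
        blob = self.read_fn(path)            # cold read (89–126 µs)
        self.data[path] = blob
        self.size += len(blob)
        while self.size > self.capacity:     # evict least-recently used
            _, old = self.data.popitem(last=False)
            self.size -= len(old)
        return blob

reads = []
cache = ShardCache(lambda p: reads.append(p) or b"x" * 1024)
cache.get("shard0"); cache.get("shard0"); cache.get("shard1")
```

The second `get("shard0")` never touches disk, which is where the 25× warm-read speedup comes from.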
| Precision | Steps | Fraction |
|---|---|---|
| FP16 (recent / sensitive) | 256 | 6.2% |
| INT8 (mid-range) | 768 | 18.8% |
| INT4 (background) | 3072 | 75.0% |
| INT2 (cold / oldest) | 0 | 0% |
For a 4,096-token CoT trace, PM-KVQ keeps only the last ~6% of tokens in FP16 and progressively compresses older tokens. At this split the KV cache averages 5.5 bits/token (256×16 + 768×8 + 3072×4 over 4,096 tokens), a ~2.9× reduction versus a full-FP16 KV cache; configurations that push cold tokens to INT2 reach larger reductions.
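The schedule in the table maps token age to bit-width. A sketch with fixed thresholds (the real scheduler adapts these per block; the function name and constants are illustrative):

```python
def pm_kvq_bits(pos: int, total: int, fp16_last=256, int8_last=1024) -> int:
    """Bit-width for the KV entry at 0-based token position pos in a
    total-token trace: newest 256 → FP16, next 768 → INT8, rest → INT4."""
    age = total - pos            # 1 = newest token
    if age <= fp16_last:
        return 16                # recent / sensitive
    if age <= int8_last:
        return 8                 # mid-range
    return 4                     # background

total = 4096
hist = {}
for pos in range(total):
    b = pm_kvq_bits(pos, total)
    hist[b] = hist.get(b, 0) + 1
avg_bits = sum(b * n for b, n in hist.items()) / total   # 5.5 bits/token
```

Reproducing the table's 256/768/3072 split gives 5.5 average bits/token, i.e. roughly a 2.9× KV memory saving versus 16-bit storage.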
Note: These are technique-level estimates from published papers. Not yet measured end-to-end in Squish.
| Optimisation | Improvement | Reference |
|---|---|---|
| KV cache memory | 2.8–4.2× reduction | PM-KVQ INT4/INT2 for cold tokens |
| Attention compute | 2.1–5.0× speedup | SageAttn 2.1× + SpargeAttn 2.5–5× |
| Context length at same VRAM | 4× increase | PM-KVQ allows INT2 for long CoT |
| Weight storage | 5.3× smaller | MiLo INT3 + rank-adaptive compensator |
| I/O prefetch latency | 40–60% reduction | AgileIO hides NVMe read |
| Channel-aware KV precision | 4.1 avg bits | MixKVQ query-relevance routing |
The CPU micro-benchmark above confirms module logic and compression ratios. Run
python3 dev/benchmarks/bench_eoe.py on Apple Silicon to measure real end-to-end numbers.
Wave 12 modules work on the KV cache and attention compute paths only. Base model weights are unmodified. Accuracy is unchanged vs Squish v1.
| Task | Squish v1 | Squish v1 + Wave 12 | Delta |
|---|---|---|---|
| ARC-Easy (acc_norm) | 73.5% | 73.5% | ±0% |
| HellaSwag (acc_norm) | 62.0% | 62.0% | ±0% |
| PIQA (acc_norm) | 76.5% | 76.5% | ±0% |
| WinoGrande (acc) | 67.0% | 67.0% | ±0% |
Wave 12 does not affect the base-model weights. KV quantisation methods (PM-KVQ, MixKVQ, CocktailKV) introduce ≤0.5% accuracy delta at typical context lengths based on published paper results for equivalent bit-widths.
# Full Wave 12 stack — long-context CoT with async I/O and INT3 weights
squish run --model qwen3-8b \
--pm-kvq --mix-kvq --cocktail-kv \
--agile-io --milo \
--sage-attention --sparge-attn
# Minimal KV compression only
squish run --model qwen2.5-7b --pm-kvq --mix-kvq