Squish v9.0.0 – Cutting-Edge Attention Variants & Distributed Inference
Release Summary
Squish v9.0.0 introduces 28 new modules across Wave 25 (Cutting-Edge Attention Variants & Compute Fusion) and Wave 26 (Distributed Inference & Production Reliability).
Total modules now: 222 | Total tests: 4,876 | Test coverage: 100%
Wave 25: Cutting-Edge Attention Variants & Compute Fusion (14 modules)
Production-ready attention patterns from DeepSeek-V2/V3, kernel fusions, and speculative decode enhancements:
- FlashMLA – DeepSeek-V2 multi-head latent attention; 4× KV compression; 0.55 µs append, 38.65 µs attend
- NativeSparseAttn – DeepSeek-V3 block-sparse + sliding-window; ~87% attention sparsity; 646.6 µs forward
- FusedSampler – Fused temperature/top-k/top-p/min-p/rep-penalty single-pass sampling; 1767 µs at vocab=32k
- KVDefrag – Online KV cache defragmentation; drives fragmentation ratio to zero; 349 µs defrag
- DualChunkAttn – Intra+inter-chunk attention for 1M+ contexts; O(chunk²) rather than O(seq²); 93.3 µs forward
- ActivationOffload – Layer activation offload to CPU; reduces peak GPU memory; 6.34 µs fetch
- MorphAttn – Per-layer pattern selection (full/sparse/linear); ~40% FLOP reduction at seq=2048
- HydraSpec – Multi-draft head speculation; n_heads tokens/step; 1229 µs verify
- SeqCompact – In-place KV compaction after token pruning; zero-copy repack; 141 µs
- LatencyPredictor – OLS latency forecasting for scheduling; sub-microsecond at 0.82 µs per predict
- ParallelSampler – Best-of-n sampling with diversity; quality improves with n candidates
- ContextSummarizer – Inference-time context compression; keeps semantics, sheds tokens; 62.5 µs
- TokenWatermark – Kirchenbauer statistical watermarking; detectable attribution
- SchemaGen – FSM-accelerated constrained JSON; zero invalid tokens; 5.38 µs constrain
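Several of these modules are classic algorithms under the hood. To illustrate what a fused sampling pass computes, here is a minimal NumPy sketch combining temperature, top-k, and top-p in one function. The function name and defaults are illustrative, not Squish's FusedSampler API (which additionally fuses min-p and repetition penalty into the same pass):

```python
import numpy as np

def fused_sample(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """Single-pass temperature + top-k + top-p sampling (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    # Top-k: mask everything below the k-th largest logit.
    if top_k and top_k < z.size:
        kth = np.partition(z, -top_k)[-top_k]
        z = np.where(z < kth, -np.inf, z)
    # Softmax over the surviving logits.
    p = np.exp(z - z.max())
    p /= p.sum()
    # Top-p (nucleus): keep the smallest high-probability prefix with mass >= top_p.
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), top_p) + 1
    mask = np.zeros_like(p)
    mask[order[:cutoff]] = p[order[:cutoff]]
    mask /= mask.sum()
    return int(rng.choice(p.size, p=mask))
```

A real fused kernel performs these steps in one traversal of the logits without materializing sorted copies; the NumPy version trades that efficiency for readability.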
Wave 26: Distributed Inference & Production Reliability (14 modules)
Tensor/sequence parallelism, request scheduling, safety, monitoring, and audit logging:
- TensorParallel – Row/column tensor sharding + all-reduce; linear memory scaling
- SequenceParallel – Ulysses-style sequence scatter/gather; attention FLOPs distributed
- KVMigrate – Live KV migration + checksum; zero-recompute worker handoff
- DisaggPrefill – Disaggregated prefill–decode; hardware specialisation
- RequestPreempt – SRPT preemption scheduler; eliminates priority inversion
- InferGateway – Smart routing + health checks + load balancing; single ingress, N workers
- ModelVersionSwap – Zero-downtime version swaps; canary → promote → rollback in-flight
- ProductionProfiler – APM per-op tracking; p50/p99/p999 per operation; sub-200 ns record
- AdaptiveBatcher – Throughput/latency SLO-aware batching; 1.91 µs next_batch
- SafetyLayer – Inline safety classification; zero extra forward pass
- SemanticResponseCache – Embedding-similarity dedup; exact + fuzzy cache hits
- RateLimiter – Token-bucket per-tenant limiting; 0.92 µs consume
- SchemaValidator – JSON schema validation; 100% schema-compliant outputs
- AuditLogger – SHA-256 chained audit log; tamper-evident request provenance; 1.92 µs log
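The AuditLogger's tamper evidence comes from a standard SHA-256 hash chain: each entry's digest covers the previous entry's digest, so editing any past record invalidates every later hash. A minimal sketch of the technique (class and field names are hypothetical, not Squish's actual API):

```python
import hashlib
import json

class HashChainLog:
    """Illustrative SHA-256 hash-chained audit log (not Squish's AuditLogger API)."""

    GENESIS = "0" * 64  # sentinel "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    def log(self, record: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        payload = json.dumps(record, sort_keys=True)
        # Each digest covers the previous digest plus this record's payload.
        h = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"record": record, "prev": prev, "hash": h})
        return h

    def verify(self) -> bool:
        prev = self.GENESIS
        for e in self.entries:
            payload = json.dumps(e["record"], sort_keys=True)
            if e["prev"] != prev:
                return False
            if hashlib.sha256((prev + payload).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

Verification replays the chain from the genesis hash; a single mutated record flips `verify()` to False for the entire suffix, which is what makes the log tamper-evident.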
Highlights
✅ 222 modules total across 26 waves (v1–v9)
✅ 4,876 unit + integration tests – 100% coverage
✅ Micro-benchmarks for all modules (Wave 25+26 in dev/benchmarks/bench_wave25_26.py)
✅ Demo GIF (dev/demos/squish-v9-demo.gif) – 1.95 MB, 10+ scenes from Wave 25+26
✅ arXiv paper draft (docs/paper.md) – abstract, background, architecture, benchmarks, ethics
✅ HuggingFace integration (dev/publish_hf.py) – ready to publish pre-squished weights
✅ Production hardening – fault tolerance, observability, schema validation, audit logging
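Part of the production hardening above is per-tenant rate limiting, which Wave 26 implements as the classic token-bucket algorithm: each bucket refills at a fixed rate up to a capacity, and a request proceeds only if it can withdraw enough tokens. A minimal sketch with an injectable clock for testability (names are illustrative, not Squish's RateLimiter API):

```python
import time

class TokenBucket:
    """Classic token-bucket limiter: `rate` tokens/s refill, capped at `capacity`."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity   # start full
        self.now = now           # injectable clock for deterministic tests
        self.last = now()

    def consume(self, n: float = 1.0) -> bool:
        # Lazily refill based on elapsed time since the last call.
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

Lazy refill on each `consume` keeps the hot path to a few arithmetic operations, which is how a limiter can stay in the sub-microsecond range.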
Documentation
- README.md – Quick start, CLI examples, feature matrix
- MODULES.md – Wave-by-wave module tables
- docs/paper.md – Formal paper, benchmarks, architecture
- dev/benchmarks/bench_eoe.py – End-to-end hardware benchmark harness
- docs/benchmark_wave25_26.md – v9 micro-benchmark results
What's Next?
Phase 3: Hardware Validation
Run end-to-end benchmarks on M-series hardware:
squish serve --model qwen2.5:1.5b --port 11435 &
python3 dev/benchmarks/bench_eoe.py --runs 5 --output results/eoe_2026_03_12.json
# Results → README + paper Section 4.1 (TTFT/tok-s)

Phase 4: Community & Publication
- MMLU evaluation: lm_eval --tasks mmlu --limit 14042 → docs/RESULTS.md + paper
- HuggingFace weights: python3 dev/publish_hf.py --model-dir ~/.cache/squish/...
- Community posts: Hacker News, r/LocalLLaMA, Twitter/X
- arXiv submission: docs/paper.md → LaTeX, submit to arxiv.org
Installation
pip install squish
# Pull a model (auto-caches after first conversion)
squish pull qwen2.5:1.5b
# Run inference at sub-second load time
squish run qwen2.5:1.5b "What is machine learning?"
# Drop-in OpenAI-compatible server
squish serve qwen2.5:1.5b --port 11435
curl http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"squish","messages":[{"role":"user","content":"Hello!"}]}'

Acknowledgments
Squish builds on work from MLX, HuggingFace, Meta (Llama), OpenAI, Anthropic, Stanford (SWEET), Microsoft (AWQ), QuIP#, VPTQ, and other research communities. See docs/paper.md Section 2 for full citations.