Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
Updated May 24, 2026 - Python
vMLX - JANGTQ Uber Compressed MLX Models - L2 Disk Cache (survives restart) + L1 Paged (super fast TTFT) + Hybrid SSM Scheduler + Continuous Batching + etc!
KV Cache with PagedAttention vs PagedAttention + TurboQuant - experiments across token sizes comparing memory, latency, and accuracy.
[MLSys-26] FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
Clean from-scratch inference engine for shannon-prime-lattice. NTT-based attention, two-node CRT-sharded inference path, KSTE-encoded KV state.
Clean from-scratch math core for shannon-prime-lattice: KSTE encoder, Friedman sieve, ARM (HRR in CRT cyclotomic ring), CRT NTT primitives, Position-as-Arithmetic.
Prime Power Transformer: A Number-Theoretic Architecture for Compute
Umbrella for the decentralized cooperative AI training/inference architecture built on the prime-factored coordinate lattice and the dominance order. Theory + Systems + Roadmap papers, contracts, offload pattern.