Memory-Fluid LLM Inference Engine
Run models larger than your VRAM — at full GPU speed.
Large language models rarely fit in consumer VRAM. A 70B-parameter model at FP16 needs ~140 GB of GPU memory; even quantized to Q4 it still needs ~35 GB — more than an RTX 4090's 24 GB.
Current solutions:
- CPU offloading → 10–50× slower inference
- Model parallelism → requires multiple expensive GPUs
- Aggressive quantization → degrades output quality
Air.rs treats VRAM as a streaming cache, not a storage device. Instead of loading the entire model into GPU memory, it streams layers from NVMe → RAM → VRAM in a triple-buffered pipeline that hides PCIe transfer latency behind kernel execution.
```
┌──────────────────────────────────────────────────────────────┐
│                       Air.rs Pipeline                        │
│                                                              │
│  NVMe SSD ──mmap──→ System RAM ──PCIe DMA──→ VRAM            │
│  (model.gguf)       (page cache)            (ping-pong buf)  │
│                                                              │
│  While GPU executes layer N,                                 │
│  PCIe is already uploading layer N+1,                        │
│  and NVMe is prefetching layer N+2.                          │
└──────────────────────────────────────────────────────────────┘
```
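The overlap in the diagram amounts to a three-stage pipeline where each stage holds at most one chunk at a time. A minimal simulation of that shape using bounded channels — stage and function names here are illustrative, not the Air.rs API:

```rust
use std::sync::mpsc;
use std::thread;

// Three stages joined by capacity-1 channels: while the "GPU" stage
// consumes layer N, the "PCIe" stage can hold N+1 and the "NVMe"
// stage can already be producing N+2 — mirroring the diagram above.
fn run_pipeline(num_layers: usize) -> Vec<usize> {
    let (to_upload_tx, to_upload_rx) = mpsc::sync_channel::<usize>(1);
    let (to_exec_tx, to_exec_rx) = mpsc::sync_channel::<usize>(1);

    // Stage 1: "NVMe" prefetches layer chunks into RAM.
    let prefetch = thread::spawn(move || {
        for layer in 0..num_layers {
            to_upload_tx.send(layer).unwrap();
        }
        // Dropping the sender closes the channel and drains the pipeline.
    });

    // Stage 2: "PCIe" uploads each chunk while stage 1 reads ahead.
    let upload = thread::spawn(move || {
        for layer in to_upload_rx {
            to_exec_tx.send(layer).unwrap();
        }
    });

    // Stage 3: "GPU" executes layers strictly in order as they arrive.
    let mut executed = Vec::new();
    for layer in to_exec_rx {
        executed.push(layer);
    }
    prefetch.join().unwrap();
    upload.join().unwrap();
    executed
}
```

The bounded (capacity-1) channels are what provide backpressure: no stage can run more than one chunk ahead, so peak memory stays constant regardless of model depth.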
Result: Run 70B+ models on a single consumer GPU at near-native speed.
- 🚀 Layer-Streamed Inference — only one transformer block is in VRAM at a time
- 🔁 Triple-Buffer Pipeline — overlaps NVMe reads, PCIe transfers, and GPU kernels
- 📄 Native GGUF Support — directly memory-maps quantized model files with zero parsing overhead
- 🗺️ 4KB Page-Aligned DMA — transfers are snapped to OS page boundaries for optimal throughput
- 💾 KV-Cache Shuttle — swaps attention caches between RAM and VRAM per-layer
- 🔌 OpenAI-Compatible API — drop-in `/v1/chat/completions` endpoint via Axum
- 🐍 Python Bindings — optional PyO3 module for Python integration
- ⚡ Fused Kernels — candle-core CUDA backend with cudarc 0.13
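The page-alignment rule from the feature list can be illustrated with a pair of helpers. This is a sketch; `dma_bounds` is a hypothetical name, not the actual manifest API:

```rust
const PAGE_SIZE: u64 = 4096;

// Round a byte offset down to the nearest 4 KiB page boundary.
fn align_down(offset: u64) -> u64 {
    offset & !(PAGE_SIZE - 1)
}

// Round a byte offset up to the nearest 4 KiB page boundary.
fn align_up(offset: u64) -> u64 {
    (offset + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

// Expand a tensor's byte range so a DMA transfer starts and ends
// exactly on OS page edges (hypothetical helper for illustration).
fn dma_bounds(start: u64, len: u64) -> (u64, u64) {
    (align_down(start), align_up(start + len))
}
```

Snapping both ends outward means a transfer may carry a few extra bytes of neighboring data, but every transfer begins and ends on a page boundary, which is what keeps `mmap`-backed reads and DMA uploads efficient.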
```
src/
├── main.rs          # Entry point
├── lib.rs           # Module declarations, constants
├── loader.rs        # GGUF parser — extracts tensor offsets from file metadata
├── manifest.rs      # Execution planner — groups tensors into page-aligned chunks
├── uploader.rs      # Transfer engine — async triple-buffered NVMe→VRAM pipeline
├── orchestrator.rs  # Tensor hydrator — maps VRAM pointers into Candle tensors
├── generator.rs     # Inference loop — layer-streamed token generation
├── kv_cache.rs      # KV-cache manager — shuttles attention state RAM↔VRAM
├── api.rs           # OpenAI-compatible HTTP API (Axum)
└── python.rs        # Optional PyO3 bindings
```
| Requirement | Version |
|---|---|
| Rust | 1.75+ (2021 edition) |
| CUDA Toolkit | 12.x |
| NVIDIA GPU | Compute capability 7.0+ (Turing/Ampere/Ada/Hopper) |
| MSVC (Windows) | Visual Studio 2022 Build Tools |
| OS | Windows 10/11, Linux (Ubuntu 22.04+) |
On Windows, use the provided build script, which auto-configures the MSVC and CUDA environment:

```powershell
.\build_air.ps1
```

Otherwise, ensure the CUDA Toolkit is installed and `nvcc` is on `PATH`, then:

```bash
cargo build --release --features cuda
cargo run --release --features cuda
```

Air.rs exposes an OpenAI-compatible API. Once running, send requests like:
```bash
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-70b-q4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```

1. **Load** — `loader.rs` parses the GGUF file header to extract the exact byte offset of every tensor
2. **Plan** — `manifest.rs` groups tensors into layer chunks with 4KB-aligned DMA boundaries
3. **Stream** — `uploader.rs` runs an async pipeline: `madvise()` prefetches the next chunk into the OS page cache while the current chunk is DMA'd to VRAM via `htod_sync_copy`
4. **Execute** — `orchestrator.rs` wraps the raw VRAM buffer into Candle tensors using pointer arithmetic (the "magic trick" of offset calculation)
5. **Cache** — `kv_cache.rs` downloads the attention KV-cache back to RAM after each layer, then re-uploads it when that layer is needed again
6. **Repeat** — the pipeline runs layer-by-layer, token-by-token, never exceeding one layer's worth of VRAM
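The per-layer loop described above can be sketched in miniature. Everything here (`KvSlot`, `upload_layer`, `forward`, and the toy arithmetic) is illustrative stand-in code, not the actual Air.rs types:

```rust
// A layer's attention state, parked in RAM between uses.
struct KvSlot {
    host_copy: Vec<f32>,
}

// One token step: stream each layer's weights in, restore its KV state,
// execute the block, and let the weight buffer be recycled for the next
// layer — so only one layer's weights are "resident" at a time.
fn generate_token(num_layers: usize, mut hidden: Vec<f32>, kv: &mut Vec<KvSlot>) -> Vec<f32> {
    for layer in 0..num_layers {
        // 1. In the real pipeline this upload already happened while the
        //    previous layer was executing (ping-pong buffering).
        let weights = upload_layer(layer);
        // 2. Shuttle this layer's KV cache back from RAM.
        let kv_state = &mut kv[layer].host_copy;
        // 3. Execute the block, appending new attention state.
        hidden = forward(&weights, &hidden, kv_state);
        // 4. `weights` drops here — the buffer is free for layer + 1.
    }
    hidden
}

// Stand-in for a DMA'd weight chunk.
fn upload_layer(layer: usize) -> Vec<f32> {
    vec![layer as f32; 4]
}

// Stand-in for a transformer block: records KV state, transforms hidden.
fn forward(weights: &[f32], hidden: &[f32], kv_state: &mut Vec<f32>) -> Vec<f32> {
    kv_state.push(hidden[0]);
    hidden.iter().zip(weights).map(|(h, w)| h + w).collect()
}
```

The point of the sketch is the memory shape, not the math: the weight buffer lives only for one iteration, while the KV slots persist across tokens in host RAM.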
⚠️ Alpha — Core pipeline architecture is implemented and compiles. Kernel fusion, full inference loop, and benchmarks are in active development.
- [x] GGUF loader with exact byte-offset tensor mapping
- [x] Page-aligned DMA manifest builder
- [x] Triple-buffered async transfer engine
- [x] VRAM pointer → Candle tensor hydration
- [x] KV-cache RAM↔VRAM shuttle
- [x] OpenAI-compatible API scaffolding
- [ ] Full transformer block kernel execution
- [ ] Token sampling with temperature/top-p
- [ ] GBNF grammar-constrained generation
- [ ] Multi-GPU support (NVLink/PCIe)
- [ ] Benchmarks vs llama.cpp, vLLM, exllama
- candle — Rust ML framework with CUDA support
- llama.cpp — GGUF format and quantization reference
- AirLLM — original layer-streaming concept in Python
MIT © Sunay Hegde
