Memory-Fluid LLM Inference Engine
Run models larger than your VRAM — at full GPU speed.
Large language models rarely fit in consumer VRAM. A 70B-parameter model at FP16 needs ~140 GB of GPU memory; even quantized to Q4 it still needs ~35 GB — more than an RTX 4090's 24 GB.
Current solutions:
- CPU offloading → 10–50× slower inference
- Model parallelism → requires multiple expensive GPUs
- Aggressive quantization → degrades output quality
Air.rs treats VRAM as a streaming cache, not a storage device. Instead of loading the entire model into GPU memory, it streams layers from NVMe → RAM → VRAM in a triple-buffered pipeline that hides PCIe transfer latency behind kernel execution.
```
┌──────────────────────────────────────────────────────────────┐
│                       Air.rs Pipeline                        │
│                                                              │
│  NVMe SSD ──mmap──→ System RAM ──PCIe DMA──→ VRAM            │
│  (model.gguf)       (page cache)            (ping-pong buf)  │
│                                                              │
│  While GPU executes layer N,                                 │
│  PCIe is already uploading layer N+1,                        │
│  and NVMe is prefetching layer N+2.                          │
└──────────────────────────────────────────────────────────────┘
```
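The overlap in the diagram amounts to a three-stage pipeline where each stage holds at most one chunk at a time. A minimal simulation of that shape using bounded channels — stage and function names here are illustrative, not the Air.rs API:

```rust
use std::sync::mpsc;
use std::thread;

// Three stages joined by capacity-1 channels: while the "GPU" stage
// consumes layer N, the "PCIe" stage can hold N+1 and the "NVMe"
// stage can already be producing N+2 — mirroring the diagram above.
fn run_pipeline(num_layers: usize) -> Vec<usize> {
    let (to_upload_tx, to_upload_rx) = mpsc::sync_channel::<usize>(1);
    let (to_exec_tx, to_exec_rx) = mpsc::sync_channel::<usize>(1);

    // Stage 1: "NVMe" prefetches layer chunks into RAM.
    let prefetch = thread::spawn(move || {
        for layer in 0..num_layers {
            to_upload_tx.send(layer).unwrap();
        }
        // Dropping the sender closes the channel and drains the pipeline.
    });

    // Stage 2: "PCIe" uploads each chunk while stage 1 reads ahead.
    let upload = thread::spawn(move || {
        for layer in to_upload_rx {
            to_exec_tx.send(layer).unwrap();
        }
    });

    // Stage 3: "GPU" executes layers strictly in order as they arrive.
    let mut executed = Vec::new();
    for layer in to_exec_rx {
        executed.push(layer);
    }
    prefetch.join().unwrap();
    upload.join().unwrap();
    executed
}
```

The bounded (capacity-1) channels are what provide backpressure: no stage can run more than one chunk ahead, so peak memory stays constant regardless of model depth.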
Result: Run 70B+ models on a single consumer GPU at near-native speed.
- 🚀 Layer-Streamed Inference — only one transformer block is in VRAM at a time
- 🔁 Triple-Buffer Pipeline — overlaps NVMe reads, PCIe transfers, and GPU kernels
- 📄 Native GGUF Support — directly memory-maps quantized model files with zero parsing overhead
- 🗺️ 4KB Page-Aligned DMA — transfers are snapped to OS page boundaries for optimal throughput
- 💾 KV-Cache Shuttle — swaps attention caches between RAM and VRAM per-layer
- 🔌 OpenAI-Compatible API — drop-in `/v1/chat/completions` endpoint via Axum
- 🐍 Python Bindings — optional PyO3 module for Python integration
- ⚡ Fused Kernels — candle-core CUDA backend with cudarc 0.13
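The page-alignment rule from the feature list can be illustrated with a pair of helpers. This is a sketch; `dma_bounds` is a hypothetical name, not the actual manifest API:

```rust
const PAGE_SIZE: u64 = 4096;

// Round a byte offset down to the nearest 4 KiB page boundary.
fn align_down(offset: u64) -> u64 {
    offset & !(PAGE_SIZE - 1)
}

// Round a byte offset up to the nearest 4 KiB page boundary.
fn align_up(offset: u64) -> u64 {
    (offset + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

// Expand a tensor's byte range so a DMA transfer starts and ends
// exactly on OS page edges (hypothetical helper for illustration).
fn dma_bounds(start: u64, len: u64) -> (u64, u64) {
    (align_down(start), align_up(start + len))
}
```

Snapping both ends outward means a transfer may carry a few extra bytes of neighboring data, but every transfer begins and ends on a page boundary, which is what keeps `mmap`-backed reads and DMA uploads efficient.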
```
src/
├── main.rs          # Entry point
├── lib.rs           # Module declarations, constants
├── loader.rs        # GGUF parser — extracts tensor offsets from file metadata
├── manifest.rs      # Execution planner — groups tensors into page-aligned chunks
├── uploader.rs      # Transfer engine — async triple-buffered NVMe→VRAM pipeline
├── orchestrator.rs  # Tensor hydrator — maps VRAM pointers into Candle tensors
├── generator.rs     # Inference loop — layer-streamed token generation
├── kv_cache.rs      # KV-cache manager — shuttles attention state RAM↔VRAM
├── api.rs           # OpenAI-compatible HTTP API (Axum)
└── python.rs        # Optional PyO3 bindings
```
| Requirement | Version |
|---|---|
| Rust | 1.75+ (2021 edition) |
| CUDA Toolkit | 12.x |
| NVIDIA GPU | Compute capability 7.0+ (Turing/Ampere/Ada/Hopper) |
| MSVC (Windows) | Visual Studio 2022 Build Tools |
| OS | Windows 10/11, Linux (Ubuntu 22.04+) |
On Windows, use the provided build script, which auto-configures the MSVC and CUDA environment:

```powershell
.\build_air.ps1
```

Otherwise, ensure the CUDA Toolkit is installed and `nvcc` is on `PATH`, then:

```bash
cargo build --release --features cuda
cargo run --release --features cuda
```

Air.rs exposes an OpenAI-compatible API. Once running, send requests like:
```bash
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-70b-q4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```

1. **Load** — `loader.rs` parses the GGUF file header to extract the exact byte offset of every tensor
2. **Plan** — `manifest.rs` groups tensors into layer chunks with 4KB-aligned DMA boundaries
3. **Stream** — `uploader.rs` runs an async pipeline: `madvise()` prefetches the next chunk into the OS page cache while the current chunk is DMA'd to VRAM via `htod_sync_copy`
4. **Execute** — `orchestrator.rs` wraps the raw VRAM buffer into Candle tensors using pointer arithmetic (the "magic trick" of offset calculation)
5. **Cache** — `kv_cache.rs` downloads the attention KV-cache back to RAM after each layer, then re-uploads it when that layer is needed again
6. **Repeat** — the pipeline runs layer-by-layer, token-by-token, never exceeding one layer's worth of VRAM
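The per-layer loop described above can be sketched in miniature. Everything here (`KvSlot`, `upload_layer`, `forward`, and the toy arithmetic) is illustrative stand-in code, not the actual Air.rs types:

```rust
// A layer's attention state, parked in RAM between uses.
struct KvSlot {
    host_copy: Vec<f32>,
}

// One token step: stream each layer's weights in, restore its KV state,
// execute the block, and let the weight buffer be recycled for the next
// layer — so only one layer's weights are "resident" at a time.
fn generate_token(num_layers: usize, mut hidden: Vec<f32>, kv: &mut Vec<KvSlot>) -> Vec<f32> {
    for layer in 0..num_layers {
        // 1. In the real pipeline this upload already happened while the
        //    previous layer was executing (ping-pong buffering).
        let weights = upload_layer(layer);
        // 2. Shuttle this layer's KV cache back from RAM.
        let kv_state = &mut kv[layer].host_copy;
        // 3. Execute the block, appending new attention state.
        hidden = forward(&weights, &hidden, kv_state);
        // 4. `weights` drops here — the buffer is free for layer + 1.
    }
    hidden
}

// Stand-in for a DMA'd weight chunk.
fn upload_layer(layer: usize) -> Vec<f32> {
    vec![layer as f32; 4]
}

// Stand-in for a transformer block: records KV state, transforms hidden.
fn forward(weights: &[f32], hidden: &[f32], kv_state: &mut Vec<f32>) -> Vec<f32> {
    kv_state.push(hidden[0]);
    hidden.iter().zip(weights).map(|(h, w)| h + w).collect()
}
```

The point of the sketch is the memory shape, not the math: the weight buffer lives only for one iteration, while the KV slots persist across tokens in host RAM.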
⚠️ Alpha — Core pipeline architecture is implemented and compiles. Kernel fusion, full inference loop, and benchmarks are in active development.
- [x] GGUF loader with exact byte-offset tensor mapping
- [x] Page-aligned DMA manifest builder
- [x] Triple-buffered async transfer engine
- [x] VRAM pointer → Candle tensor hydration
- [x] KV-cache RAM↔VRAM shuttle
- [x] OpenAI-compatible API scaffolding
- [ ] Full transformer block kernel execution
- [ ] Token sampling with temperature/top-p
- [ ] GBNF grammar-constrained generation
- [ ] Multi-GPU support (NVLink/PCIe)
- [ ] Benchmarks vs llama.cpp, vLLM, exllama
- candle — Rust ML framework with CUDA support
- llama.cpp — GGUF format and quantization reference
- AirLLM — original layer-streaming concept in Python
MIT © Sunay Hegde
