| I want to... | Go to... |
|---|---|
| Install mistral.rs | Installation Guide |
| Understand cargo features | Cargo Features |
| Run a model | CLI Reference |
| Use the HTTP API | HTTP Server |
| Create & publish UQFF models | UQFF Guide |
| Fix an error | Troubleshooting |
| Configure environment | Configuration |
| Check model support | Supported Models |
| Build agents | Agentic Features Guide |
Getting Started:
- Installation Guide - Install mistral.rs on your system
- Cargo Features - Complete cargo features reference
- CLI Reference - Complete CLI command reference
- CLI TOML Configuration - Configure via TOML files
- Create & Publish UQFF Models - Quantize models and upload to Hugging Face
- Troubleshooting - Common issues and solutions
APIs & SDKs:
- Python SDK - Python package documentation
- Python Installation - Python SDK installation guide
- Rust SDK - Rust crate documentation
- HTTP Server - OpenAI-compatible HTTP API
- OpenResponses API - Stateful conversation API
Models:
- Supported Models - Complete model list and compatibility
- Multimodal Models - Multimodal model overview
- Image Generation - Diffusion models
- Embeddings - Embedding model overview
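The HTTP Server linked above exposes an OpenAI-compatible API. As a minimal sketch (the model id is a placeholder invented here, not a value from these docs), the request body such a server accepts follows the standard chat-completions shape:

```python
import json

# Minimal OpenAI-style chat-completions request body. Any
# OpenAI-compatible server (such as the mistral.rs HTTP server)
# accepts this shape; "local-model" is a placeholder id.
payload = {
    "model": "local-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 64,
}

# Serialize exactly as an HTTP client would before POSTing it to
# the server's /v1/chat/completions endpoint.
body = json.dumps(payload)
print(body)
```

With a server running, this body would be POSTed to `/v1/chat/completions`; see the HTTP Server guide and CLI Reference for the exact flags to start one.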
Text Models:
- DeepSeek V2 | DeepSeek V3
- Gemma 2 | Gemma 3 | Gemma 3n | Gemma 4
- GLM4 | GLM-4.7-Flash | GLM-4.7
- Qwen 3 | Qwen 3 Next | SmolLM3 | GPT-OSS
Multimodal Models:
- Idefics 2 | Idefics 3
- LLaVA | Llama 3.2 Vision | Llama 4
- MiniCPM-O 2.6 | Mistral 3
- Phi 3.5 MoE | Phi 3.5 Vision | Phi 4 Multimodal
- Qwen 2-VL | Qwen 3 VL | Qwen 3.5
Quantization:
- Quantization Overview - All supported quantization methods
- ISQ (In-Situ Quantization) - Quantize models at load time
- UQFF Format - Pre-quantized model format | Layout
- Topology - Per-layer quantization and device mapping
- Importance Matrix - Improve ISQ accuracy
Adapters & Architecture:
- Adapter Models - LoRA and X-LoRA support
- LoRA/X-LoRA Examples
- Non-Granular Scalings - X-LoRA optimization
- AnyMoE - Create MoE models from dense models
- MatFormer - Dynamic model sizing
Performance:
- Device Mapping - Multi-GPU and CPU offloading
- PagedAttention - Efficient KV cache management
- Speculative Decoding - Accelerate generation with draft models
- Flash Attention - Accelerated attention
- MLA - Multi-head Latent Attention
- Distributed Inference
Tools & Agents:
- Agentic Features Guide - Web search, tool callbacks, agents, MCP, tool dispatch
- Tool Calling - Function calling support
- Web Search - Integrated web search
- Chat Templates - Template customization
- Sampling Options - Generation parameters
- TOML Selector - Model selection syntax
- Multi-Model Support - Load multiple models
- MCP Client - Connect to external tools
- MCP Server - Serve models over MCP
- MCP Configuration
- MCP Transports
- MCP Advanced Usage
Reference:
- Configuration - Environment variables and server defaults
- Engine Internals - Engine behaviors and recovery
- Supported Models - Complete compatibility tables
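The Tool Calling guide above covers function calling. As an illustrative sketch (the `get_weather` function and its parameters are invented for this example), a tool definition in the OpenAI-style schema that OpenAI-compatible servers accept looks like:

```python
import json

# OpenAI-style tool definition: a JSON Schema description of a
# callable function. "get_weather" and its parameters are
# invented for illustration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# The tools list travels alongside the messages in the request body;
# the model may then respond with a tool call naming "get_weather".
request = {"model": "local-model", "messages": [], "tools": tools}
print(json.dumps(request, indent=2))
```

The Tool Calling and MCP guides describe how the server dispatches the resulting tool calls.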
See the main README for contribution guidelines.
