Rust implementation of Voxtral-4B-TTS — Mistral AI's 4B-parameter text-to-speech model. Runs on macOS (Apple Silicon via MLX) and Linux (CPU or CUDA via libtorch). No Python required. Includes both a CLI and an API server, ready for use in an agent harness.
Download the platform-specific zip from GitHub Releases:
| Platform | Asset |
|---|---|
| macOS (Apple Silicon) | voxtral-tts-macos-aarch64.zip |
| Linux x86_64 (CPU) | voxtral-tts-linux-x86_64.zip |
| Linux x86_64 (CUDA) | voxtral-tts-linux-x86_64-cuda.zip |
| Linux ARM64 (CPU) | voxtral-tts-linux-aarch64.zip |
| Linux ARM64 (CUDA) | voxtral-tts-linux-aarch64-cuda.zip |
```bash
# Example: macOS Apple Silicon
curl -LO https://github.com/second-state/voxtral_tts_rs/releases/latest/download/voxtral-tts-macos-aarch64.zip
unzip voxtral-tts-macos-aarch64.zip
cd voxtral-tts-macos-aarch64
```

Download the model:

```bash
bash <(curl -sSf https://raw.githubusercontent.com/second-state/voxtral_tts_rs/main/scripts/download_model.sh)
```

This downloads `consolidated.safetensors` (8 GB), `params.json`, `tekken.json`, and 20 voice embeddings into `models/voxtral-4b-tts/`.
The release zip includes pre-converted voice embeddings (`.safetensors`). Copy them into the model directory:

```bash
cp voice_embedding/*.safetensors models/voxtral-4b-tts/voice_embedding/
```

Synthesize speech from the CLI:

```bash
./voxtral-tts models/voxtral-4b-tts \
  --text "Hello, this is Voxtral TTS!" \
  --voice neutral_female \
  --output output.wav
```

Or start the API server and call it over HTTP:

```bash
./voxtral-tts-server models/voxtral-4b-tts --port 8080
```

```bash
curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello world","voice":"alloy"}' \
  -o output.wav
```

CLI usage:

```
voxtral-tts <MODEL_DIR> --text "..." [OPTIONS]
```
| Flag | Default | Description |
|---|---|---|
| `<MODEL_DIR>` | (required) | Path to model directory |
| `--text`, `-t` | (required) | Text to synthesize |
| `--voice`, `-v` | `neutral_female` | Voice name or OpenAI alias |
| `--output`, `-o` | `output.wav` | Output WAV file path |
| `--temperature` | `0.7` | Sampling temperature (higher = more variation) |
| `--max-tokens` | `4096` | Maximum generation tokens |
| `--reference-audio` | — | Voice reference audio file (for voice cloning) |
| `--list-voices` | — | Print available voices and exit |
Examples:

```bash
# English with a casual voice
./voxtral-tts models/voxtral-4b-tts --text "Hey, what's up?" --voice casual_male -o casual.wav

# French
./voxtral-tts models/voxtral-4b-tts --text "Bonjour le monde!" --voice fr_female -o bonjour.wav

# List all voices
./voxtral-tts models/voxtral-4b-tts --list-voices --text ""
```

Server usage:

```
voxtral-tts-server <MODEL_DIR> [OPTIONS]
```
| Flag | Default | Description |
|---|---|---|
| `<MODEL_DIR>` | (required) | Path to model directory |
| `--host` | `127.0.0.1` | Bind host address |
| `--port` | `8080` | Bind port |
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check — returns `{"status":"ok"}` |
| `/v1/models` | GET | List available models |
| `/v1/audio/speech` | POST | Generate speech (OpenAI-compatible) |
Request body:

```json
{
  "model": "voxtral-4b-tts",
  "input": "Text to synthesize",
  "voice": "neutral_female",
  "response_format": "wav",
  "speed": 1.0,
  "stream": false
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `input` | string | (required) | Text to synthesize (max 4096 chars) |
| `model` | string | `voxtral-4b-tts` | Model name |
| `voice` | string | `alloy` | Voice name or OpenAI alias |
| `response_format` | string | `wav` | Output format: `wav`, `pcm`, `mp3`, `flac`, `ogg`, `opus` |
| `speed` | float | `1.0` | Speed multiplier (0.25–4.0, reserved) |
| `stream` | bool | `false` | Enable SSE streaming |
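The field limits above can be validated client-side before sending. A minimal sketch — the `build_speech_request` helper and its defaults are illustrative, not part of this crate; only the limits come from the table:

```python
import json

# Allowed output formats, from the request-body table above.
FORMATS = {"wav", "pcm", "mp3", "flac", "ogg", "opus"}

def build_speech_request(text, voice="alloy", response_format="wav",
                         speed=1.0, stream=False):
    """Validate fields against the documented limits and return a JSON body."""
    if not text or len(text) > 4096:
        raise ValueError("input must be 1-4096 characters")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be within 0.25-4.0")
    return json.dumps({
        "model": "voxtral-4b-tts",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "stream": stream,
    })

body = build_speech_request("Hello world")
```

POST the resulting body to `/v1/audio/speech` with `Content-Type: application/json`.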
Non-streaming returns binary audio with the appropriate content type (audio/wav, audio/pcm, audio/mpeg, audio/flac, or audio/ogg).
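The `pcm` format returns headerless 16-bit little-endian samples, so to play them you can wrap them in a WAV container yourself. A sketch using the stdlib `wave` module — the 24 kHz mono parameters follow the codec output described below; the function name is illustrative:

```python
import wave

def pcm_to_wav(pcm_bytes: bytes, path: str,
               sample_rate: int = 24000, channels: int = 1) -> None:
    """Wrap raw 16-bit little-endian PCM in a WAV container."""
    with wave.open(path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)          # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)

# Example: 0.1 s of silence at 24 kHz mono -> 2400 frames
pcm_to_wav(b"\x00\x00" * 2400, "silence.wav")
```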
Streaming (`"stream": true`) returns Server-Sent Events with base64-encoded PCM chunks:

```
data: {"type":"speech.audio.delta","delta":"<base64 16-bit LE PCM>"}
data: {"type":"speech.audio.delta","delta":"<base64 16-bit LE PCM>"}
data: {"type":"speech.audio.done"}
```
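The delta events can be reassembled into raw PCM by base64-decoding each `delta` field in order. A minimal parser for the event format above, demonstrated against an inline sample rather than a live connection:

```python
import base64
import json

def decode_sse_audio(sse_text: str) -> bytes:
    """Concatenate base64 PCM payloads from speech.audio.delta events."""
    pcm = bytearray()
    for line in sse_text.splitlines():
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        if event["type"] == "speech.audio.delta":
            pcm += base64.b64decode(event["delta"])
        elif event["type"] == "speech.audio.done":
            break
    return bytes(pcm)

# Two 4-byte chunks of silence, then the done event.
chunk = base64.b64encode(b"\x00\x00\x00\x00").decode()
sample = (f'data: {{"type":"speech.audio.delta","delta":"{chunk}"}}\n'
          f'data: {{"type":"speech.audio.delta","delta":"{chunk}"}}\n'
          'data: {"type":"speech.audio.done"}\n')
print(len(decode_sse_audio(sample)))  # 8 bytes of 16-bit LE PCM
```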
Examples:

```bash
# Non-streaming WAV
curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello world","voice":"alloy"}' \
  -o output.wav

# Streaming
curl -N -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello world","voice":"alloy","stream":true}'
```

| Voice | Language | Gender |
|---|---|---|
| `casual_female`, `casual_male` | English | F, M |
| `cheerful_female` | English | F |
| `neutral_female`, `neutral_male` | English | F, M |
| `fr_male`, `fr_female` | French | M, F |
| `es_male`, `es_female` | Spanish | M, F |
| `de_male`, `de_female` | German | M, F |
| `pt_male`, `pt_female` | Portuguese | M, F |
| `it_male`, `it_female` | Italian | M, F |
| `nl_male`, `nl_female` | Dutch | M, F |
| `ar_male` | Arabic | M |
| `hi_male`, `hi_female` | Hindi | M, F |
| Alias | Maps to |
|---|---|
| `alloy` | `neutral_female` |
| `echo` | `casual_male` |
| `fable` | `cheerful_female` |
| `onyx` | `neutral_male` |
| `nova` | `casual_female` |
| `shimmer` | `fr_female` |
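In a client it can be convenient to resolve the OpenAI aliases locally, e.g. to display the underlying voice. A small lookup built from the table above — the helper is illustrative; the server accepts either name directly:

```python
# OpenAI alias -> Voxtral voice, per the alias table above.
OPENAI_ALIASES = {
    "alloy": "neutral_female",
    "echo": "casual_male",
    "fable": "cheerful_female",
    "onyx": "neutral_male",
    "nova": "casual_female",
    "shimmer": "fr_female",
}

def resolve_voice(name: str) -> str:
    """Map an OpenAI alias to its Voxtral voice; pass native names through."""
    return OPENAI_ALIASES.get(name, name)

print(resolve_voice("alloy"))    # neutral_female
print(resolve_voice("es_male"))  # es_male (already a native voice name)
```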
| Platform | Requirements |
|---|---|
| macOS (Apple Silicon) | Xcode Command Line Tools, CMake, Rust 1.75+ |
| Linux (CPU) | GCC/Clang, Rust 1.75+ |
| Linux (CUDA) | NVIDIA driver 535+, CUDA toolkit, Rust 1.75+ |
macOS (MLX):

```bash
git clone https://github.com/second-state/voxtral_tts_rs.git
cd voxtral_tts_rs
git submodule update --init --recursive
cargo build --release --no-default-features --features mlx
```

Linux (libtorch):

```bash
git clone https://github.com/second-state/voxtral_tts_rs.git
cd voxtral_tts_rs

# Download libtorch (auto-detects x86_64 or aarch64)
bash scripts/download_libtorch.sh cpu    # CPU only
bash scripts/download_libtorch.sh cuda   # CUDA 12.6

# Build
export LIBTORCH=$(pwd)/libtorch
export LIBTORCH_BYPASS_VERSION_CHECK=1
cargo build --release
```

The model ships voice embeddings as PyTorch `.pt` files. Convert them to `.safetensors` (required for MLX, optional for libtorch):
```bash
pip install torch safetensors
python3 -c "
import torch, os
from safetensors.torch import save_file
d = 'models/voxtral-4b-tts/voice_embedding'
for f in os.listdir(d):
    if f.endswith('.pt'):
        t = torch.load(os.path.join(d, f), map_location='cpu', weights_only=True)
        save_file({'embedding': t}, os.path.join(d, f.replace('.pt', '.safetensors')))
        print(f'Converted {f}')
"
```

Run the test suite with:

```bash
cargo test
```

The model has three components totalling 4B parameters:
```
Text ──> Tekken Tokenizer ──> Token IDs
                                  │
                                  v
Voice Embedding ──> Backbone Decoder (3.4B, 26 layers) ──> Hidden States
                                  │
                                  v
        Flow-Matching Transformer (390M) ──> 37 Audio Codes/Frame
                                  │
                                  v
        Voxtral Codec Decoder (300M) ──> 24kHz Mono Waveform
```
| Component | Parameters | Architecture |
|---|---|---|
| Backbone Decoder | 3.4B | 26-layer Mistral transformer, dim=3072, 32 heads (8 KV), SwiGLU, RoPE |
| Flow-Matching Transformer | 390M | 3-layer bidirectional transformer, Euler ODE (7 steps), CFG |
| Voxtral Codec Decoder | 300M | 4 conv+transformer blocks, strides [1,2,2,2], 240-channel output |
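As a quick sanity check of the figures in the table: with dim=3072 and 32 query heads, each head is 96-dimensional; 8 KV heads give a 4:1 grouped-query ratio; and the three components sum to roughly the advertised 4B parameters. Back-of-envelope only, using the numbers above:

```python
# Figures from the component table above.
dim, q_heads, kv_heads = 3072, 32, 8
backbone, flow, codec = 3.4e9, 390e6, 300e6

head_dim = dim // q_heads        # 96 dims per attention head
gqa_ratio = q_heads // kv_heads  # 4 query heads share each KV head
total = backbone + flow + codec  # ~4.09e9 parameters

print(head_dim, gqa_ratio, round(total / 1e9, 2))  # 96 4 4.09
```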
| Variable | Description |
|---|---|
| `RUST_LOG` | Log verbosity: `error`, `warn`, `info` (default), `debug`, `trace` |
| `LIBTORCH` | Path to libtorch directory (Linux/tch backend only) |
| `LIBTORCH_BYPASS_VERSION_CHECK` | Set to `1` to skip the libtorch version check |
Benchmarked on Apple M4 Max (MLX backend, Metal GPU). Long text (>400 chars) is automatically split into sentence chunks, each generated independently.
| Test | Wall (s) | Audio (s) | RTF |
|---|---|---|---|
| CLI: Short English (neutral_female) | 9.88 | 1.36 | 7.26 |
| CLI: Medium English (neutral_male) | 29.56 | 6.24 | 4.74 |
| CLI: French (fr_female) | 24.59 | 5.04 | 4.88 |
| CLI: Long text multi-chunk | 170.44 | 40.56 | 4.20 |
| API: Short (alloy) | 9.44 | 2.16 | 4.37 |
| API: Medium (neutral_female) | 33.36 | 8.08 | 4.13 |
| API: Spanish (es_male) | 11.16 | 2.56 | 4.36 |
| API: Long text multi-chunk | 165.35 | 40.24 | 4.11 |
| API: Short MP3 (alloy) | 7.57 | — | — |
RTF = real-time factor (wall time / audio duration). Lower is better; RTF < 1 means faster than real-time.
- ~3.0 frames/s sustained generation speed (~0.33s per frame)
- ~3.3s fixed prefill overhead per chunk (dominates short text RTF)
- Best RTF on long text (~4.1x) where prefill cost is amortized across hundreds of frames
- Short text has higher RTF (7.3x) because the 3.3s prefill dominates the 1.4s of audio
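The two constants above imply a rough single-chunk latency model: wall time ≈ fixed prefill + audio duration × steady-state RTF, so RTF(audio) ≈ steady RTF + prefill / audio seconds. A sketch with the measured constants — treat both as approximate, and note it ignores the per-chunk prefill of multi-chunk runs:

```python
# Measured constants from the performance notes above (approximate).
PREFILL_S = 3.3    # fixed prefill overhead per chunk, seconds
STEADY_RTF = 4.1   # asymptotic RTF once prefill is amortized

def predicted_rtf(audio_seconds: float) -> float:
    """Estimate wall-time / audio-duration for a single-chunk generation."""
    return STEADY_RTF + PREFILL_S / audio_seconds

print(round(predicted_rtf(1.36), 1))  # short clip: ~6.5 (measured 7.26)
print(round(predicted_rtf(40.0), 1))  # long text: ~4.2 (measured ~4.1-4.2)
```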
Apache-2.0
The Voxtral model weights are licensed under the Mistral AI Non-Production License.