second-state/voxtral_tts_rs

Rust CLI and API server for Voxtral TTS from Mistral

A Rust implementation of Voxtral-4B-TTS — Mistral AI's 4B-parameter text-to-speech model. Runs on macOS (Apple Silicon via MLX) and Linux (CPU or CUDA via libtorch), with no Python required. Ships both a CLI and an API server, ready for use in agent harnesses.

Quick Start

1. Download the release

Download the platform-specific zip from GitHub Releases:

| Platform | Asset |
|---|---|
| macOS (Apple Silicon) | voxtral-tts-macos-aarch64.zip |
| Linux x86_64 (CPU) | voxtral-tts-linux-x86_64.zip |
| Linux x86_64 (CUDA) | voxtral-tts-linux-x86_64-cuda.zip |
| Linux ARM64 (CPU) | voxtral-tts-linux-aarch64.zip |
| Linux ARM64 (CUDA) | voxtral-tts-linux-aarch64-cuda.zip |

# Example: macOS Apple Silicon
curl -LO https://github.com/second-state/voxtral_tts_rs/releases/latest/download/voxtral-tts-macos-aarch64.zip
unzip voxtral-tts-macos-aarch64.zip
cd voxtral-tts-macos-aarch64

2. Download the model

bash <(curl -sSf https://raw.githubusercontent.com/second-state/voxtral_tts_rs/main/scripts/download_model.sh)

This downloads consolidated.safetensors (8 GB), params.json, tekken.json, and 20 voice embeddings into models/voxtral-4b-tts/.

3. Copy voice embeddings to the model folder

The release zip includes pre-converted voice embeddings (.safetensors). Copy them into the model directory:

cp voice_embedding/*.safetensors models/voxtral-4b-tts/voice_embedding/

4. Generate speech (CLI)

./voxtral-tts models/voxtral-4b-tts \
    --text "Hello, this is Voxtral TTS!" \
    --voice neutral_female \
    --output output.wav

5. Start the API server

./voxtral-tts-server models/voxtral-4b-tts --port 8080
curl -X POST http://localhost:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input":"Hello world","voice":"alloy"}' \
    -o output.wav

CLI Reference

voxtral-tts <MODEL_DIR> --text "..." [OPTIONS]

| Flag | Default | Description |
|---|---|---|
| `<MODEL_DIR>` | (required) | Path to the model directory |
| --text, -t | (required) | Text to synthesize |
| --voice, -v | neutral_female | Voice name or OpenAI alias |
| --output, -o | output.wav | Output WAV file path |
| --temperature | 0.7 | Sampling temperature (higher = more variation) |
| --max-tokens | 4096 | Maximum generation tokens |
| --reference-audio | — | Voice reference audio file (for voice cloning) |
| --list-voices | — | Print available voices and exit |

Examples:

# English with a casual voice
./voxtral-tts models/voxtral-4b-tts --text "Hey, what's up?" --voice casual_male -o casual.wav

# French
./voxtral-tts models/voxtral-4b-tts --text "Bonjour le monde!" --voice fr_female -o bonjour.wav

# List all voices
./voxtral-tts models/voxtral-4b-tts --list-voices --text ""

API Server Reference

voxtral-tts-server <MODEL_DIR> [OPTIONS]

| Flag | Default | Description |
|---|---|---|
| `<MODEL_DIR>` | (required) | Path to the model directory |
| --host | 127.0.0.1 | Bind host address |
| --port | 8080 | Bind port |

Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check — returns {"status":"ok"} |
| /v1/models | GET | List available models |
| /v1/audio/speech | POST | Generate speech (OpenAI-compatible) |

POST /v1/audio/speech

Request body:

{
    "model": "voxtral-4b-tts",
    "input": "Text to synthesize",
    "voice": "neutral_female",
    "response_format": "wav",
    "speed": 1.0,
    "stream": false
}

| Field | Type | Default | Description |
|---|---|---|---|
| input | string | (required) | Text to synthesize (max 4096 chars) |
| model | string | voxtral-4b-tts | Model name |
| voice | string | alloy | Voice name or OpenAI alias |
| response_format | string | wav | Output format: wav, pcm, mp3, flac, ogg, opus |
| speed | float | 1.0 | Speed multiplier (0.25–4.0, reserved) |
| stream | bool | false | Enable SSE streaming |
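For programmatic clients, the request body can be assembled and validated before sending. The helper below is a hypothetical Python sketch (not part of this project) that enforces the field constraints from the table above:

```python
import json

MAX_INPUT_CHARS = 4096  # documented limit on the input field


def build_speech_request(text, voice="neutral_female", response_format="wav",
                         speed=1.0, stream=False):
    """Build a JSON body for POST /v1/audio/speech (illustrative helper)."""
    if not text:
        raise ValueError("input text is required")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be in [0.25, 4.0]")
    return json.dumps({
        "model": "voxtral-4b-tts",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "stream": stream,
    })


body = build_speech_request("Hello world", voice="alloy")
```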

Non-streaming returns binary audio with the appropriate content type (audio/wav, audio/pcm, audio/mpeg, audio/flac, or audio/ogg).

Streaming ("stream": true) returns Server-Sent Events with base64 PCM chunks:

data: {"type":"speech.audio.delta","delta":"<base64 16-bit LE PCM>"}
data: {"type":"speech.audio.delta","delta":"<base64 16-bit LE PCM>"}
data: {"type":"speech.audio.done"}
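A client consuming the stream only needs to collect the base64 deltas and concatenate the decoded bytes. The following is an illustrative Python sketch of that logic — not a full SSE client — and it assumes exactly the event shapes shown above:

```python
import base64
import json


def decode_sse_audio(sse_text):
    """Collect base64 PCM deltas from a speech SSE stream into one buffer."""
    pcm = bytearray()
    for line in sse_text.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank lines and comments
        event = json.loads(line[len("data: "):])
        if event["type"] == "speech.audio.delta":
            pcm.extend(base64.b64decode(event["delta"]))
        elif event["type"] == "speech.audio.done":
            break
    return bytes(pcm)  # raw 16-bit little-endian PCM at 24 kHz


# Example with a fabricated delta:
chunk = base64.b64encode(b"\x00\x01\x02\x03").decode()
stream = (f'data: {{"type":"speech.audio.delta","delta":"{chunk}"}}\n'
          'data: {"type":"speech.audio.done"}\n')
pcm = decode_sse_audio(stream)
```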

Examples:

# Non-streaming WAV
curl -X POST http://localhost:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input":"Hello world","voice":"alloy"}' \
    -o output.wav

# Streaming
curl -N -X POST http://localhost:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input":"Hello world","voice":"alloy","stream":true}'

Voices

20 Preset Voices

| Voice | Language | Gender |
|---|---|---|
| casual_female, casual_male | English | F, M |
| cheerful_female | English | F |
| neutral_female, neutral_male | English | F, M |
| fr_male, fr_female | French | M, F |
| es_male, es_female | Spanish | M, F |
| de_male, de_female | German | M, F |
| pt_male, pt_female | Portuguese | M, F |
| it_male, it_female | Italian | M, F |
| nl_male, nl_female | Dutch | M, F |
| ar_male | Arabic | M |
| hi_male, hi_female | Hindi | M, F |

OpenAI Voice Aliases

| Alias | Maps to |
|---|---|
| alloy | neutral_female |
| echo | casual_male |
| fable | cheerful_female |
| onyx | neutral_male |
| nova | casual_female |
| shimmer | fr_female |
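The alias table amounts to a simple lookup. The sketch below is illustrative only; it assumes unrecognized names fall through as preset voice names, which may not match the server's actual fallback behavior:

```python
# Alias table copied from the section above.
OPENAI_ALIASES = {
    "alloy": "neutral_female",
    "echo": "casual_male",
    "fable": "cheerful_female",
    "onyx": "neutral_male",
    "nova": "casual_female",
    "shimmer": "fr_female",
}


def resolve_voice(name):
    """Map an OpenAI alias to a preset voice; pass other names through."""
    return OPENAI_ALIASES.get(name, name)
```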

Build from Source

Prerequisites

| Platform | Requirements |
|---|---|
| macOS (Apple Silicon) | Xcode Command Line Tools, CMake, Rust 1.75+ |
| Linux (CPU) | GCC/Clang, Rust 1.75+ |
| Linux (CUDA) | NVIDIA driver 535+, CUDA toolkit, Rust 1.75+ |

macOS (MLX backend)

git clone https://github.com/second-state/voxtral_tts_rs.git
cd voxtral_tts_rs
git submodule update --init --recursive
cargo build --release --no-default-features --features mlx

Linux (libtorch backend)

git clone https://github.com/second-state/voxtral_tts_rs.git
cd voxtral_tts_rs

# Download libtorch (auto-detects x86_64 or aarch64)
bash scripts/download_libtorch.sh cpu      # CPU only
bash scripts/download_libtorch.sh cuda     # CUDA 12.6

# Build
export LIBTORCH=$(pwd)/libtorch
export LIBTORCH_BYPASS_VERSION_CHECK=1
cargo build --release

Convert voice embeddings

The model ships voice embeddings as PyTorch .pt files. Convert them to .safetensors (required for MLX, optional for libtorch):

pip install torch safetensors
python3 -c "
import torch, os
from safetensors.torch import save_file
d = 'models/voxtral-4b-tts/voice_embedding'
for f in os.listdir(d):
    if f.endswith('.pt'):
        t = torch.load(os.path.join(d, f), map_location='cpu', weights_only=True)
        save_file({'embedding': t}, os.path.join(d, f.replace('.pt', '.safetensors')))
        print(f'Converted {f}')
"

Run tests

cargo test

Architecture

The model has three components totalling roughly 4B parameters:

Text ──> Tekken Tokenizer ──> Token IDs
                                  │
                                  v
Voice Embedding ──> Backbone Decoder (3.4B, 26 layers) ──> Hidden States
                                                               │
                                                               v
                    Flow-Matching Transformer (390M) ──> 37 Audio Codes/Frame
                                                               │
                                                               v
                    Voxtral Codec Decoder (300M) ──> 24kHz Mono Waveform

| Component | Parameters | Architecture |
|---|---|---|
| Backbone Decoder | 3.4B | 26-layer Mistral transformer, dim=3072, 32 heads (8 KV), SwiGLU, RoPE |
| Flow-Matching Transformer | 390M | 3-layer bidirectional transformer, Euler ODE (7 steps), CFG |
| Voxtral Codec Decoder | 300M | 4 conv+transformer blocks, strides [1,2,2,2], 240-channel output |
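As a quick sanity check, the component sizes listed above sum to roughly the advertised 4B parameters:

```python
# Component sizes from the table above, in billions of parameters.
components = {
    "Backbone Decoder": 3.4,
    "Flow-Matching Transformer": 0.39,
    "Voxtral Codec Decoder": 0.30,
}
total = sum(components.values())
print(f"total ≈ {total:.2f}B")  # ~4.09B, rounded to "4B" in the text
```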

Environment Variables

| Variable | Description |
|---|---|
| RUST_LOG | Log verbosity: error, warn, info (default), debug, trace |
| LIBTORCH | Path to libtorch directory (Linux/tch backend only) |
| LIBTORCH_BYPASS_VERSION_CHECK | Set to 1 to skip the libtorch version check |

Performance

Benchmarked on Apple M4 Max (MLX backend, Metal GPU). Long text (>400 chars) is automatically split into sentence chunks, each generated independently.
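The actual splitting logic lives in the Rust code; purely as an illustration, greedy sentence packing under a 400-character limit might look like:

```python
import re


def chunk_text(text, limit=400):
    """Pack sentences into chunks of at most `limit` characters.

    Illustrative sketch only — the splitting heuristics in voxtral_tts_rs
    may differ. A single sentence longer than `limit` is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```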

| Test | Wall (s) | Audio (s) | RTF |
|---|---|---|---|
| CLI: Short English (neutral_female) | 9.88 | 1.36 | 7.26 |
| CLI: Medium English (neutral_male) | 29.56 | 6.24 | 4.74 |
| CLI: French (fr_female) | 24.59 | 5.04 | 4.88 |
| CLI: Long text multi-chunk | 170.44 | 40.56 | 4.20 |
| API: Short (alloy) | 9.44 | 2.16 | 4.37 |
| API: Medium (neutral_female) | 33.36 | 8.08 | 4.13 |
| API: Spanish (es_male) | 11.16 | 2.56 | 4.36 |
| API: Long text multi-chunk | 165.35 | 40.24 | 4.11 |
| API: Short MP3 (alloy) | 7.57 | | |

RTF = real-time factor (wall time / audio duration). Lower is better; RTF < 1 means faster than real-time.

  • ~3.0 frames/s sustained generation speed (~0.33s per frame)
  • ~3.3s fixed prefill overhead per chunk (dominates short text RTF)
  • Best RTF on long text (~4.1x) where prefill cost is amortized across hundreds of frames
  • Short text has higher RTF (7.3x) because the 3.3s prefill dominates the 1.4s of audio
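The RTF column follows directly from the wall and audio columns; for example:

```python
def rtf(wall_s, audio_s):
    """Real-time factor: wall-clock generation time / audio duration."""
    return wall_s / audio_s


# Reproduce two rows of the benchmark table above.
print(round(rtf(9.88, 1.36), 2))     # short English CLI run
print(round(rtf(170.44, 40.56), 2))  # long multi-chunk CLI run
```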

License

Apache-2.0

The Voxtral model weights are licensed under Mistral AI Non-Production License.
