AI speech models for Apple Silicon, powered by MLX Swift and CoreML.
On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.
Install via Homebrew or add as a Swift Package dependency.
Documentation · HuggingFace Models · Blog
- Qwen3-ASR — Speech-to-text / speech recognition (automatic speech recognition, 52 languages)
- Parakeet TDT — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
- Qwen3-ForcedAligner — Word-level timestamp alignment (audio + text → timestamps)
- Qwen3-TTS — Text-to-speech synthesis (highest quality, streaming, custom speakers, 10 languages)
- CosyVoice TTS — Text-to-speech with streaming, voice cloning, multi-speaker dialogue, and emotion tags (9 languages, DiT flow matching, CAM++ speaker encoder)
- Kokoro TTS — On-device text-to-speech (82M params, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
- Qwen3-TTS CoreML — Text-to-speech (0.6B, CoreML 6-model pipeline, W8A16, iOS/macOS)
- Qwen3.5-Chat — On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
- PersonaPlex — Full-duplex speech-to-speech conversation (7B, audio in → audio out, 18 voice presets)
- DeepFilterNet3 — Speech enhancement / noise suppression (2.1M params, real-time 48kHz)
- FireRedVAD — Offline voice activity detection (DFSMN, CoreML, 100+ languages, 97.6% F1)
- Silero VAD — Streaming voice activity detection (32ms chunks, sub-millisecond latency)
- Pyannote VAD — Offline voice activity detection (10s windows, multi-speaker overlap)
- Speaker Diarization — Who spoke when (Pyannote segmentation + activity-based speaker chaining, or end-to-end Sortformer on Neural Engine)
- Speaker Embeddings — Speaker verification and identification (WeSpeaker ResNet34, 256-dim vectors)
Papers: Qwen3-ASR (Alibaba), Qwen3-TTS (Alibaba), Qwen3 (Alibaba), Parakeet TDT (NVIDIA), CosyVoice 3 (Alibaba), Kokoro (StyleTTS 2), PersonaPlex (NVIDIA), Mimi (Kyutai), Sortformer (NVIDIA)
See Roadmap discussion for what's planned — comments and suggestions welcome!
- 20 Mar 2026 — We Beat Whisper Large v3 with a 600M Model Running Entirely on Your Mac
- 26 Feb 2026 — Speaker Diarization and Voice Activity Detection on Apple Silicon — Native Swift with MLX
- 23 Feb 2026 — NVIDIA PersonaPlex 7B on Apple Silicon — Full-Duplex Speech-to-Speech in Native Swift with MLX
- 12 Feb 2026 — Qwen3-ASR Swift: On-Device ASR + TTS for Apple Silicon — Architecture and Benchmarks
| Model | Task | Streaming | Languages | Sizes |
|---|---|---|---|---|
| Qwen3-ASR-0.6B | Speech → Text | No | 52 languages | 4-bit 680 MB · 8-bit 1.0 GB · CoreML 180 MB |
| Qwen3-ASR-1.7B | Speech → Text | No | 52 languages | 4-bit 2.1 GB · 8-bit 3.2 GB |
| Parakeet-TDT-0.6B | Speech → Text | No | 25 European languages | CoreML INT8 500 MB |
| Qwen3-ForcedAligner-0.6B | Audio + Text → Timestamps | No | Multi | 4-bit 979 MB · 8-bit 1.4 GB · CoreML INT4 630 MB · CoreML INT8 1.0 GB |
| Qwen3-TTS-0.6B Base | Text → Speech | Yes (~120ms) | 10 languages | 4-bit 1.7 GB · 8-bit 2.4 GB · CoreML 1.0 GB |
| Qwen3-TTS-0.6B CustomVoice | Text → Speech | Yes (~120ms) | 10 languages | 4-bit 1.7 GB |
| Qwen3-TTS-1.7B Base | Text → Speech | Yes (~120ms) | 10 languages | 4-bit 3.2 GB · 8-bit 4.8 GB |
| CosyVoice3-0.5B | Text → Speech | Yes (~150ms) | 9 languages | 4-bit 1.2 GB |
| Kokoro-82M | Text → Speech | No | 10 languages | CoreML ~170 MB |
| Qwen3.5-0.8B Chat | Text → Text (LLM) | Yes (streaming) | Multi | MLX INT4 418 MB · CoreML INT8 981 MB |
| PersonaPlex-7B | Speech → Speech | Yes (~2s chunks) | EN | 4-bit 4.9 GB · 8-bit 9.1 GB |
| FireRedVAD | Voice Activity Detection | No (offline) | 100+ languages | CoreML ~1.2 MB |
| Silero-VAD-v5 | Voice Activity Detection | Yes (32ms chunks) | Language-agnostic | MLX · CoreML ~1.2 MB |
| Pyannote-Segmentation-3.0 | VAD + Speaker Segmentation | No (10s windows) | Language-agnostic | MLX ~5.7 MB |
| DeepFilterNet3 | Speech Enhancement | Yes (10ms frames) | Language-agnostic | CoreML FP16 ~4.2 MB |
| WeSpeaker-ResNet34-LM | Speaker Embedding (256-dim) | No | Language-agnostic | MLX · CoreML ~25 MB |
| CAM++ | Speaker Embedding (192-dim) | No | Language-agnostic | CoreML ~14 MB |
| Sortformer | Speaker Diarization (end-to-end) | Yes (chunked) | Language-agnostic | CoreML ~240 MB |
Weight memory is the GPU (MLX) or ANE (CoreML) memory consumed by model parameters. Peak inference includes KV caches, activations, and intermediate tensors.
| Model | Weight Memory | Peak Inference |
|---|---|---|
| Qwen3-ASR-0.6B (4-bit, MLX) | 675 MB | ~2.2 GB |
| Qwen3-ASR-0.6B (INT8, CoreML) | 180 MB | ~400 MB |
| Qwen3-ASR-1.7B (8-bit, MLX) | 2,349 MB | ~4 GB |
| Parakeet-TDT-0.6B (CoreML) | 315 MB | ~400 MB |
| Qwen3-ForcedAligner-0.6B (4-bit, MLX) | 933 MB | ~1.5 GB |
| Qwen3-TTS-1.7B (4-bit, MLX) | 2,300 MB | ~4–6 GB |
| Qwen3-TTS-0.6B (4-bit, MLX) | 977 MB | ~2 GB |
| CosyVoice3-0.5B (4-bit, MLX) | 732 MB | ~2.5 GB |
| Kokoro-82M (CoreML) | 170 MB | ~200 MB |
| Qwen3.5-Chat-0.8B (INT4, MLX) | 418 MB | ~700 MB |
| Qwen3.5-Chat-0.8B (INT8, CoreML) | 981 MB | ~1.2 GB |
| PersonaPlex-7B (8-bit, MLX) | 9,100 MB | ~11 GB |
| PersonaPlex-7B (4-bit, MLX) | 4,900 MB | ~6.5 GB |
| Silero-VAD-v5 (MLX) | 1.2 MB | ~5 MB |
| Silero-VAD-v5 (CoreML) | 0.7 MB | ~3 MB |
| Pyannote-Segmentation-3.0 (MLX) | 6 MB | ~20 MB |
| DeepFilterNet3 (CoreML FP16) | 4.2 MB | ~10 MB |
| WeSpeaker-ResNet34-LM (MLX) | 25 MB | ~50 MB |
- Qwen3-TTS: Best quality, streaming (~120ms), 9 built-in speakers, 10 languages, batch synthesis
- CosyVoice TTS: Streaming (~150ms), 9 languages, voice cloning (CAM++ speaker encoder), multi-speaker dialogue ([S1] ... [S2] ...), inline emotion/style tags ((happy), (whispers)), DiT flow matching + HiFi-GAN vocoder
- Kokoro TTS: Lightweight iOS-ready TTS (82M params), CoreML/Neural Engine, 54 voices, 10 languages, end-to-end model
- PersonaPlex: Full-duplex speech-to-speech (audio in → audio out), streaming (~2s chunks), 18 voice presets, based on Moshi architecture
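
The [S1] ... [S2] dialogue markup listed for CosyVoice above can be pictured as a sequence of speaker-tagged turns. The sketch below is a hypothetical illustration in plain Swift; `DialogueTurn` and `parseDialogue` are names invented here, not part of the CosyVoiceTTS module, which may parse tags differently.

```swift
import Foundation

/// One dialogue turn: a speaker label such as "S1" and the text spoken.
/// Hypothetical type for illustration only.
struct DialogueTurn: Equatable {
    let speaker: String
    let text: String
}

/// Splits "[S1] Hello [S2] Hi" into per-speaker turns.
/// Text that appears before the first tag is ignored.
func parseDialogue(_ input: String) -> [DialogueTurn] {
    var turns: [DialogueTurn] = []
    // Every turn starts with "[": split there, then look for the closing "]".
    for piece in input.components(separatedBy: "[") {
        guard let close = piece.firstIndex(of: "]") else { continue }
        let speaker = String(piece[..<close])
        let text = piece[piece.index(after: close)...]
            .trimmingCharacters(in: .whitespacesAndNewlines)
        guard !speaker.isEmpty, !text.isEmpty else { continue }
        turns.append(DialogueTurn(speaker: speaker, text: text))
    }
    return turns
}

for turn in parseDialogue("[S1] Hello there! [S2] Hey, how are you?") {
    print("\(turn.speaker): \(turn.text)")
}
```

Each turn could then be synthesized with the voice assigned to its speaker label and the resulting audio concatenated in order.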
Requires native ARM Homebrew (/opt/homebrew). Rosetta/x86_64 Homebrew is not supported.
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech

Then use:
audio transcribe recording.wav
audio speak "Hello world"
audio speak "Hello world" --engine coreml # CoreML (Neural Engine)
audio speak "Hallo Welt" --engine cosyvoice --language german
audio respond --input question.wav --transcript

For interactive voice conversation with microphone input, see PersonaPlexDemo.
Add to your Package.swift:
dependencies: [
.package(url: "https://github.com/soniqo/speech-swift", branch: "main")
]

Import the module you need:
import Qwen3ASR // Speech recognition (MLX)
import ParakeetASR // Speech recognition (CoreML)
import Qwen3TTS // Text-to-speech (Qwen3)
import CosyVoiceTTS // Text-to-speech (streaming)
import KokoroTTS // Text-to-speech (CoreML, iOS-ready)
import Qwen3Chat // On-device LLM chat (CoreML)
import PersonaPlex // Speech-to-speech (full-duplex)
import SpeechVAD // Voice activity detection (pyannote + Silero)
import SpeechEnhancement // Noise suppression (DeepFilterNet3)
import AudioCommon // Shared utilities

- Swift 5.9+
- macOS 14+ or iOS 17+
- Apple Silicon (M1/M2/M3/M4)
- Xcode 15+ (with Metal Toolchain; run xcodebuild -downloadComponent MetalToolchain if missing)
git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build

This compiles the Swift package and the MLX Metal shader library in one step. The Metal library (mlx.metallib) is required for GPU inference; without it you'll get a "Failed to load the default metallib" error at runtime.
For debug builds: make debug. To run unit tests: make test.
PersonaPlexDemo is a ready-to-run macOS voice assistant — tap to talk, get spoken responses in real-time. Uses microphone input with Silero VAD for automatic speech detection, Qwen3-ASR for transcription, and PersonaPlex 7B for speech-to-speech generation. Multi-turn conversation with 18 voice presets and inner monologue transcript display.
make build # from repo root — builds everything including MLX metallib
cd Examples/PersonaPlexDemo
# See Examples/PersonaPlexDemo/README.md for .app bundle instructions

RTF ~0.94 on M2 Max (faster than real-time). Models download automatically on first run (~5.5 GB PersonaPlex + ~400 MB ASR).
- iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS, speak and hear it back). Device and simulator.
- PersonaPlexDemo — Conversational voice assistant (mic input, VAD, multi-turn). macOS.
- SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.
Build and run — see each demo's README for instructions.
import Qwen3ASR
// Default: 0.6B model
let model = try await Qwen3ASRModel.fromPretrained()
// Or use the larger 1.7B model for better accuracy
let model = try await Qwen3ASRModel.fromPretrained(
modelId: "aufklarer/Qwen3-ASR-1.7B-MLX-8bit"
)
// Audio can be any sample rate — automatically resampled to 16kHz internally
let transcription = model.transcribe(audio: audioSamples, sampleRate: 16000)
print(transcription)

Hybrid mode: CoreML encoder on Neural Engine + MLX text decoder on GPU. Lower power, and the GPU stays free during the encoder pass.
import Qwen3ASR
let encoder = try await CoreMLASREncoder.fromPretrained()
let model = try await Qwen3ASRModel.fromPretrained()
let text = try model.transcribe(audio: audioSamples, sampleRate: 16000, coremlEncoder: encoder)

INT8 (180 MB, default) and INT4 (90 MB) variants available. INT8 recommended (cosine similarity > 0.999 vs FP32).
import ParakeetASR
let model = try await ParakeetASRModel.fromPretrained()
let transcription = model.transcribe(audio: audioSamples, sampleRate: 16000)

Runs on Neural Engine via CoreML, which frees the GPU for concurrent workloads. 25 European languages, ~315 MB.
make build # or: swift build -c release && ./scripts/build_mlx_metallib.sh release
# Default (Qwen3-ASR 0.6B, MLX)
.build/release/audio transcribe audio.wav
# Use 1.7B model
.build/release/audio transcribe audio.wav --model 1.7B
# CoreML encoder (Neural Engine + MLX decoder)
.build/release/audio transcribe --engine qwen3-coreml audio.wav
# Parakeet TDT (CoreML, Neural Engine)
.build/release/audio transcribe --engine parakeet audio.wav

import Qwen3ASR
let aligner = try await Qwen3ForcedAligner.fromPretrained()
// Downloads ~979 MB on first run
let aligned = aligner.align(
audio: audioSamples,
text: "Can you guarantee that the replacement part will be shipped tomorrow?",
sampleRate: 24000
)
for word in aligned {
print("[\(String(format: "%.2f", word.startTime))s - \(String(format: "%.2f", word.endTime))s] \(word.text)")
}

swift build -c release
# Align with provided text
.build/release/audio align audio.wav --text "Hello world"
# Transcribe first, then align
.build/release/audio align audio.wav

Output:
[0.12s - 0.45s] Can
[0.45s - 0.72s] you
[0.72s - 1.20s] guarantee
...
Non-autoregressive — single forward pass, no sampling loop. See Forced Aligner for architecture details.
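
Word-level timestamps like the ones printed above map naturally onto subtitle cues. The following is a minimal sketch, assuming each aligned word exposes text, startTime, and endTime in seconds as in the loop above; `AlignedWord`, `srtTimestamp`, and `srt(from:)` are hypothetical names defined here for illustration, not part of the library.

```swift
import Foundation

/// Hypothetical stand-in for one aligned word (text plus start/end in seconds).
struct AlignedWord {
    let text: String
    let startTime: Double
    let endTime: Double
}

/// Formats seconds as an SRT timestamp, HH:MM:SS,mmm.
func srtTimestamp(_ seconds: Double) -> String {
    let totalMs = Int((max(0, seconds) * 1000).rounded())
    let ms = totalMs % 1000
    let s = (totalMs / 1000) % 60
    let m = (totalMs / 60_000) % 60
    let h = totalMs / 3_600_000
    return String(format: "%02d:%02d:%02d,%03d", h, m, s, ms)
}

/// Emits one numbered SRT cue per word, separated by blank lines.
func srt(from words: [AlignedWord]) -> String {
    words.enumerated().map { index, word in
        "\(index + 1)\n\(srtTimestamp(word.startTime)) --> \(srtTimestamp(word.endTime))\n\(word.text)\n"
    }.joined(separator: "\n")
}

let words = [
    AlignedWord(text: "Can", startTime: 0.12, endTime: 0.45),
    AlignedWord(text: "you", startTime: 0.45, endTime: 0.72),
]
print(srt(from: words))
```

Real subtitles would usually group several words per cue; one cue per word keeps the sketch short.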
import Qwen3TTS
import AudioCommon // for WAVWriter
let model = try await Qwen3TTSModel.fromPretrained()
// Downloads ~1.7 GB on first run (model + codec weights)
let audio = model.synthesize(text: "Hello world", language: "english")
// Output is 24kHz mono float samples
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)

make build
.build/release/audio speak "Hello world" --output output.wav --language english

The CustomVoice model variant supports 9 built-in speaker voices and natural language instructions for tone/style control. Load it by passing the CustomVoice model ID:
import Qwen3TTS
// Load the CustomVoice model (downloads ~1.7 GB on first run)
let model = try await Qwen3TTSModel.fromPretrained(
modelId: TTSModelVariant.customVoice.rawValue
)
// Synthesize with a specific speaker
let audio = model.synthesize(text: "Hello world", language: "english", speaker: "vivian")
// List available speakers
print(model.availableSpeakers) // ["aiden", "dylan", "eric", ...]

CLI:
# Use CustomVoice model with a speaker
.build/release/audio speak "Hello world" --model customVoice --speaker vivian --output vivian.wav
# List available speakers
.build/release/audio speak --model customVoice --list-speakers

Clone a speaker's voice from a reference audio file. Two modes:

ICL mode (recommended): encodes reference audio into codec tokens with transcript. Higher quality, reliable EOS:
let (model, encoder) = try await Qwen3TTSModel.fromPretrainedWithEncoder()
let refAudio = try AudioFileLoader.load(url: referenceURL, targetSampleRate: 24000)
let audio = model.synthesizeWithVoiceCloneICL(
text: "Hello world",
referenceAudio: refAudio,
referenceSampleRate: 24000,
referenceText: "Exact transcript of reference audio.",
language: "english",
codecEncoder: encoder
)

X-vector mode: speaker embedding only; no transcript needed, but lower quality:
let audio = model.synthesizeWithVoiceClone(
text: "Hello world",
referenceAudio: refAudio,
referenceSampleRate: 24000,
language: "english"
)

CLI:
.build/release/audio speak "Hello world" --voice-sample reference.wav --output cloned.wav

The CustomVoice model accepts a natural language instruct parameter to control speaking style, tone, emotion, and pacing. The instruction is prepended to the model input in ChatML format.
// Cheerful tone
let audio = model.synthesize(
text: "Welcome to our store!",
language: "english",
speaker: "ryan",
instruct: "Speak in a cheerful, upbeat tone"
)
// Slow and serious
let audio = model.synthesize(
text: "We regret to inform you...",
language: "english",
speaker: "aiden",
instruct: "Read this slowly and solemnly"
)
// Whispering
let audio = model.synthesize(
text: "Can you keep a secret?",
language: "english",
speaker: "vivian",
instruct: "Whisper this softly"
)

CLI:
# With style instruction
.build/release/audio speak "Good morning!" --model customVoice --speaker ryan \
--instruct "Speak in a cheerful, upbeat tone" --output cheerful.wav
# Default instruct ("Speak naturally.") is applied automatically when using CustomVoice
.build/release/audio speak "Hello world" --model customVoice --speaker ryan --output natural.wav

When no --instruct is provided with the CustomVoice model, "Speak naturally." is applied automatically to prevent rambling output. The Base model does not support instruct.
Synthesize multiple texts in a single batched forward pass for higher throughput:
let texts = ["Good morning everyone.", "The weather is nice today.", "Please open the window."]
let audioList = model.synthesizeBatch(texts: texts, language: "english", maxBatchSize: 4)
// audioList[i] is 24kHz mono float samples for texts[i]
for (i, audio) in audioList.enumerated() {
try WAVWriter.write(samples: audio, sampleRate: 24000, to: URL(fileURLWithPath: "output_\(i).wav"))
}

# Create a file with one text per line
printf "Hello world.\nGoodbye world.\n" > texts.txt
.build/release/audio speak --batch-file texts.txt --output output.wav --batch-size 4
# Produces output_0.wav, output_1.wav, ...

Batch mode amortizes model weight loads across items. Expect ~1.5-2.5x throughput improvement for B=4 on Apple Silicon. Best results when texts produce similar-length audio.
let config = SamplingConfig(temperature: 0.9, topK: 50, repetitionPenalty: 1.05)
let audio = model.synthesize(text: "Hello", language: "english", sampling: config)

Emit audio chunks incrementally for low first-packet latency:
let stream = model.synthesizeStream(
text: "Hello, this is streaming synthesis.",
language: "english",
streaming: .lowLatency // ~120ms to first audio chunk
)
for try await chunk in stream {
// chunk.samples: [Float] PCM @ 24kHz
// chunk.isFinal: true on last chunk
playAudio(chunk.samples)
}

CLI:
# Default streaming (3-frame first chunk, ~225ms latency)
.build/release/audio speak "Hello world" --stream
# Low-latency (1-frame first chunk, ~120ms latency)
.build/release/audio speak "Hello world" --stream --first-chunk-frames 1

For an interactive voice assistant with microphone input, see PersonaPlexDemo: tap to talk, multi-turn conversation with automatic speech detection.
import PersonaPlex
import AudioCommon // for WAVWriter, AudioFileLoader
let model = try await PersonaPlexModel.fromPretrained()
// Downloads ~5.5 GB on first run (temporal 4-bit + depformer + Mimi codec + voice presets)
let audio = try AudioFileLoader.load(url: inputURL, targetSampleRate: 24000)
let (response, textTokens) = model.respond(userAudio: audio, voice: .NATM0)
// response: 24kHz mono float samples
// textTokens: model's inner monologue (SentencePiece token IDs)
try WAVWriter.write(samples: response, sampleRate: 24000, to: outputURL)

PersonaPlex generates text tokens alongside audio (the model's internal reasoning). Decode them with the built-in SentencePiece decoder:
let decoder = try SentencePieceDecoder(modelPath: "tokenizer_spm_32k_3.model")
let transcript = decoder.decode(textTokens)
print(transcript) // e.g. "Sure, I can help you with that..."

// Receive audio chunks as they're generated (~2s per chunk)
let stream = model.respondStream(userAudio: audio, voice: .NATM0)
for try await chunk in stream {
playAudio(chunk.samples) // play immediately, 24kHz mono
// chunk.textTokens has this chunk's text; final chunk has all tokens
if chunk.isFinal { break }
}

18 voice presets available:
- Natural Female: NATF0, NATF1, NATF2, NATF3
- Natural Male: NATM0, NATM1, NATM2, NATM3
- Variety Female: VARF0, VARF1, VARF2, VARF3, VARF4
- Variety Male: VARM0, VARM1, VARM2, VARM3, VARM4
The system prompt steers the model's conversational behavior. Pass any custom prompt as a plain string:
// Custom system prompt (tokenized automatically)
let response = model.respond(
userAudio: audio,
voice: .NATM0,
systemPrompt: "You enjoy having a good conversation."
)
// Or use a preset
let response = model.respond(
userAudio: audio,
voice: .NATM0,
systemPromptTokens: SystemPromptPreset.customerService.tokens
)

Available presets: focused (default), assistant, customerService, teacher.
make build
# Basic speech-to-speech
.build/release/audio respond --input question.wav --output response.wav
# With transcript (decodes inner monologue text)
.build/release/audio respond --input question.wav --transcript
# JSON output (audio path, transcript, latency metrics)
.build/release/audio respond --input question.wav --json
# Custom system prompt text
.build/release/audio respond --input question.wav --system-prompt-text "You enjoy having a good conversation."
# Choose a voice and system prompt preset
.build/release/audio respond --input question.wav --voice NATF1 --system-prompt focused
# Tune sampling parameters
.build/release/audio respond --input question.wav --audio-temp 0.6 --repetition-penalty 1.5
# Enable text entropy early stopping (stops if text collapses)
.build/release/audio respond --input question.wav --entropy-threshold 1.0 --entropy-window 5
# List available voices and prompts
.build/release/audio respond --list-voices
.build/release/audio respond --list-prompts

import CosyVoiceTTS
import AudioCommon // for WAVWriter
let model = try await CosyVoiceTTSModel.fromPretrained()
// Downloads ~1.9 GB on first run (LLM + DiT + HiFi-GAN weights)
let audio = model.synthesize(text: "Hello, how are you today?", language: "english")
// Output is 24kHz mono float samples
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)

// Streaming: receive audio chunks as they're generated (~150ms to first chunk)
for try await chunk in model.synthesizeStream(text: "Hello, how are you today?", language: "english") {
// chunk.audio: [Float], chunk.sampleRate: Int
playAudio(chunk.audio) // play immediately
}

Clone a speaker's voice using the CAM++ speaker encoder (192-dim, CoreML Neural Engine):
import CosyVoiceTTS
import AudioCommon
let model = try await CosyVoiceTTSModel.fromPretrained()
let speaker = try await CamPlusPlusSpeaker.fromPretrained()
// Downloads ~14 MB CAM++ CoreML model on first use
let refAudio = try AudioFileLoader.load(url: referenceURL, targetSampleRate: 16000)
let embedding = try speaker.embed(audio: refAudio, sampleRate: 16000)
// embedding: [Float] of length 192
let audio = model.synthesize(
text: "Hello in a cloned voice!",
language: "english",
speakerEmbedding: embedding
)

make build
# Basic synthesis
.build/release/audio speak "Hello world" --engine cosyvoice --language english --output output.wav
# Voice cloning (downloads CAM++ speaker encoder on first use)
.build/release/audio speak "Hello world" --engine cosyvoice --voice-sample reference.wav --output cloned.wav
# Multi-speaker dialogue with voice cloning
.build/release/audio speak "[S1] Hello there! [S2] Hey, how are you?" \
--engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav
# Inline emotion/style tags
.build/release/audio speak "(excited) Wow, amazing! (sad) But I have to go..." \
--engine cosyvoice -o emotion.wav
# Combined: dialogue + emotions + voice cloning
.build/release/audio speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
--engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav
# Custom style instruction
.build/release/audio speak "Hello world" --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav
# Streaming synthesis
.build/release/audio speak "Hello world" --engine cosyvoice --language english --stream --output output.wav

import KokoroTTS
import AudioCommon // for WAVWriter
let tts = try await KokoroTTSModel.fromPretrained()
// Downloads ~170 MB on first run (CoreML models + voice embeddings + dictionaries)
let audio = try tts.synthesize(text: "Hello world", voice: "af_heart")
// Output is 24kHz mono float samples
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)

54 preset voices across 10 languages. End-to-end CoreML model, non-autoregressive, no sampling loop. Runs on Neural Engine, frees the GPU entirely.
make build
# Basic synthesis
.build/release/audio kokoro "Hello world" --voice af_heart --output hello.wav
# Choose language
.build/release/audio kokoro "Bonjour le monde" --voice ff_siwis --language fr --output bonjour.wav
# List available voices
.build/release/audio kokoro --list-voices

6-model autoregressive pipeline (TextProjector → CodeDecoder → MultiCodeDecoder → SpeechDecoder) running on CoreML. W8A16 palettized weights.
.build/release/audio qwen3-tts-coreml "Hello, how are you?" --output hello.wav
.build/release/audio qwen3-tts-coreml "Guten Tag" --language german --output guten.wav

import Qwen3Chat
let chat = try await Qwen3ChatModel.fromPretrained()
// Downloads ~318 MB on first run (INT4 CoreML model + tokenizer)
// Single response
let response = try chat.generate("What is Swift?", systemPrompt: "Answer briefly.")
print(response)
// Streaming tokens
let stream = chat.chatStream("Tell me a joke", systemPrompt: "Be funny.")
for try await token in stream {
print(token, terminator: "")
}

Qwen3-0.6B INT4 quantized for CoreML. Runs on Neural Engine with ~2 tok/s on iPhone, ~15 tok/s on M-series. Supports multi-turn conversation with KV cache, thinking mode (<think> tokens), and configurable sampling (temperature, top-k, top-p, repetition penalty).
Silero VAD v5 processes 32ms audio chunks with sub-millisecond latency — ideal for real-time speech detection from microphones or streams.
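
Per-chunk speech probabilities are commonly turned into segments with two thresholds: speech starts when the probability rises above an onset value and ends only when it falls below a lower offset value. The sketch below illustrates that hysteresis idea in plain Swift; it is not the SpeechVAD implementation. The 0.6/0.4 defaults are borrowed from the --onset/--offset CLI example later in this README rather than from the library's defaults, and `SpeechSpan` and `speechSpans` are hypothetical names.

```swift
/// A detected speech span in seconds. Hypothetical type for illustration.
struct SpeechSpan: Equatable {
    let start: Double
    let end: Double
}

/// Converts per-chunk speech probabilities into spans using hysteresis:
/// open a span at `onset`, close it only once the probability drops below `offset`.
func speechSpans(
    from probabilities: [Float],
    onset: Float = 0.6,
    offset: Float = 0.4,
    chunkSeconds: Double = 0.032   // 512 samples at 16 kHz
) -> [SpeechSpan] {
    var spans: [SpeechSpan] = []
    var startChunk: Int? = nil
    for (index, probability) in probabilities.enumerated() {
        if startChunk == nil {
            if probability >= onset { startChunk = index }
        } else if probability < offset, let start = startChunk {
            spans.append(SpeechSpan(start: Double(start) * chunkSeconds,
                                    end: Double(index) * chunkSeconds))
            startChunk = nil
        }
    }
    // Close a span that is still open when the audio ends.
    if let start = startChunk {
        spans.append(SpeechSpan(start: Double(start) * chunkSeconds,
                                end: Double(probabilities.count) * chunkSeconds))
    }
    return spans
}

let probs: [Float] = [0.1, 0.7, 0.5, 0.45, 0.3, 0.2, 0.9]
print(speechSpans(from: probs, chunkSeconds: 1.0))
```

The gap between the two thresholds keeps a span from flickering on and off when the probability hovers near a single cut-off. Production detectors typically also enforce minimum speech and silence durations, which are omitted here.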
import SpeechVAD
let vad = try await SileroVADModel.fromPretrained()
// Or use CoreML (Neural Engine, lower power):
// let vad = try await SileroVADModel.fromPretrained(engine: .coreml)
// Streaming: process 512-sample chunks (32ms @ 16kHz)
let prob = vad.processChunk(samples) // → 0.0...1.0
vad.resetState() // call between different audio streams
// Or detect all segments at once
let segments = vad.detectSpeech(audio: audioSamples, sampleRate: 16000)
for seg in segments {
print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}

let processor = StreamingVADProcessor(model: vad)
// Feed audio of any length — events emitted as speech is confirmed
let events = processor.process(samples: audioBuffer)
for event in events {
switch event {
case .speechStarted(let time):
print("Speech started at \(time)s")
case .speechEnded(let segment):
print("Speech: \(segment.startTime)s - \(segment.endTime)s")
}
}
// Flush at end of stream
let final = processor.flush()

make build
# Streaming Silero VAD (32ms chunks)
.build/release/audio vad-stream audio.wav
# CoreML backend (Neural Engine)
.build/release/audio vad-stream audio.wav --engine coreml
# With custom thresholds
.build/release/audio vad-stream audio.wav --onset 0.6 --offset 0.4
# JSON output
.build/release/audio vad-stream audio.wav --json
# Batch pyannote VAD (10s sliding windows)
.build/release/audio vad audio.wav

import SpeechVAD
let pipeline = try await DiarizationPipeline.fromPretrained()
// Or use CoreML embeddings (Neural Engine, frees GPU):
// let pipeline = try await DiarizationPipeline.fromPretrained(embeddingEngine: .coreml)
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
print("Speaker \(seg.speakerId): [\(seg.startTime)s - \(seg.endTime)s]")
}
print("\(result.numSpeakers) speakers detected")

let model = try await WeSpeakerModel.fromPretrained()
// Or: let model = try await WeSpeakerModel.fromPretrained(engine: .coreml)
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized
// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

Extract only a specific speaker's segments using a reference recording:
let pipeline = try await DiarizationPipeline.fromPretrained()
let targetEmb = pipeline.embeddingModel.embed(audio: enrollmentAudio, sampleRate: 16000)
let segments = pipeline.extractSpeaker(
audio: meetingAudio, sampleRate: 16000,
targetEmbedding: targetEmb
)

NVIDIA Sortformer predicts per-frame speaker activity for up to 4 speakers directly, with no embedding or clustering needed. Runs on Neural Engine.
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000, config: .default)
for seg in result.segments {
print("Speaker \(seg.speakerId): [\(seg.startTime)s - \(seg.endTime)s]")
}

make build
# Pyannote diarization (default)
.build/release/audio diarize meeting.wav
# Sortformer diarization (CoreML, Neural Engine)
.build/release/audio diarize meeting.wav --engine sortformer
# CoreML embeddings (Neural Engine, pyannote only)
.build/release/audio diarize meeting.wav --embedding-engine coreml
# JSON output
.build/release/audio diarize meeting.wav --json
# Extract a specific speaker (pyannote only)
.build/release/audio diarize meeting.wav --target-speaker enrollment.wav
# Speaker embedding
.build/release/audio embed-speaker enrollment.wav --json
.build/release/audio embed-speaker enrollment.wav --engine coreml

See Speaker Diarization for architecture details.
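
Speaker verification with embeddings comes down to comparing two vectors by cosine similarity. The sketch below shows the computation in plain Swift for illustration; it is not the library's `WeSpeakerModel.cosineSimilarity`, and `isSameSpeaker` with its `threshold` parameter is a hypothetical helper whose cut-off would need tuning on real data.

```swift
/// Cosine similarity between two equal-length vectors, in [-1, 1].
/// For L2-normalized embeddings this reduces to the dot product.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must have the same dimension")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = (normA * normB).squareRoot()
    return denominator > 0 ? dot / denominator : 0
}

/// Hypothetical decision rule: same speaker if similarity reaches the threshold.
func isSameSpeaker(_ a: [Float], _ b: [Float], threshold: Float) -> Bool {
    cosineSimilarity(a, b) >= threshold
}

print(cosineSimilarity([1, 0], [1, 0]))   // identical direction
print(cosineSimilarity([1, 0], [0, 1]))   // orthogonal
```

Higher thresholds reduce false accepts at the cost of more false rejects, so the value is usually chosen from a held-out set of same-speaker and different-speaker pairs.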
import SpeechEnhancement
import AudioCommon // for WAVWriter
let enhancer = try await SpeechEnhancer.fromPretrained()
// Downloads ~4.3 MB on first run (Core ML FP16 model + auxiliary data)
let cleanAudio = try enhancer.enhance(audio: noisyAudio, sampleRate: 48000)
try WAVWriter.write(samples: cleanAudio, sampleRate: 48000, to: outputURL)

make build
# Basic noise removal
.build/release/audio denoise noisy.wav
# Custom output path
.build/release/audio denoise noisy.wav --output clean.wav

See Speech Enhancement for architecture details.
All models conform to shared protocols (SpeechRecognitionModel, SpeechGenerationModel, SpeechEnhancementModel, etc.) and can be composed into pipelines:
import SpeechEnhancement
import Qwen3ASR
let enhancer = try await SpeechEnhancer.fromPretrained()
let asr = try await Qwen3ASRModel.fromPretrained()
// Enhance at 48kHz, then transcribe at 16kHz
let clean = try enhancer.enhance(audio: noisyAudio, sampleRate: 48000)
let clean16k = AudioResampler.resample(clean, from: 48000, to: 16000)
let text = asr.transcribe(audio: clean16k, sampleRate: 16000)

import SpeechVAD
import Qwen3ASR
import Qwen3TTS
let vad = try await SileroVADModel.fromPretrained()
let asr = try await Qwen3ASRModel.fromPretrained()
let tts = try await Qwen3TTSModel.fromPretrained()
// Detect speech segments, transcribe, re-synthesize
let segments = vad.detectSpeech(audio: audio, sampleRate: 16000)
for seg in segments {
let chunk = Array(audio[Int(seg.startTime * 16000)..<Int(seg.endTime * 16000)])
let text = asr.transcribe(audio: chunk, sampleRate: 16000)
let speech = tts.synthesize(text: text, language: "english")
// speech: 24kHz mono float samples
}

import SpeechVAD
import Qwen3ASR
let pipeline = try await DiarizationPipeline.fromPretrained()
let asr = try await Qwen3ASRModel.fromPretrained()
let result = pipeline.diarize(audio: meetingAudio, sampleRate: 16000)
for seg in result.segments {
let chunk = Array(meetingAudio[Int(seg.startTime * 16000)..<Int(seg.endTime * 16000)])
let text = asr.transcribe(audio: chunk, sampleRate: 16000)
print("Speaker \(seg.speakerId) [\(seg.startTime)s-\(seg.endTime)s]: \(text)")
}

See Shared Protocols for the full protocol reference.
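
The pipelines above cut audio by converting segment times to sample indices. If a segment's end time ever rounds past the end of the buffer, a raw range subscript would trap, so a clamped slice is a safer building block. This is a minimal sketch; `clampedSlice` is a hypothetical helper, not part of AudioCommon.

```swift
/// Returns the samples between `start` and `end` seconds,
/// clamping both bounds to the valid index range of `audio`.
func clampedSlice(
    _ audio: [Float],
    from start: Double,
    to end: Double,
    sampleRate: Int
) -> ArraySlice<Float> {
    // Non-finite times cannot be converted to indices; treat them as empty.
    guard start.isFinite, end.isFinite else { return audio[0..<0] }
    let rawLower = Int(start * Double(sampleRate))
    let rawUpper = Int(end * Double(sampleRate))
    let lower = min(max(0, rawLower), audio.count)
    let upper = min(max(lower, rawUpper), audio.count)
    return audio[lower..<upper]
}

// 16 samples at a toy rate of 4 Hz, i.e. 4 seconds of audio.
let samples = [Float](repeating: 0, count: 16)
print(clampedSlice(samples, from: 1.0, to: 2.0, sampleRate: 4).count)
print(clampedSlice(samples, from: 3.0, to: 10.0, sampleRate: 4).count)
```

Wrapping the result in Array(...) gives a zero-based buffer, matching how the chunks are built in the examples above.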
A standalone HTTP server exposes all models via REST and WebSocket endpoints. Models are loaded lazily on first request.
swift build -c release
.build/release/audio-server --port 8080
# Transcribe audio
curl -X POST http://localhost:8080/transcribe --data-binary @audio.wav -H "Content-Type: audio/wav"
# Text-to-speech
curl -X POST http://localhost:8080/speak -H "Content-Type: application/json" \
-d '{"text": "Hello world", "engine": "cosyvoice"}' -o output.wav
# Speech-to-speech (PersonaPlex)
curl -X POST http://localhost:8080/respond --data-binary @question.wav -o response.wav
# Speech enhancement
curl -X POST http://localhost:8080/enhance --data-binary @noisy.wav -o clean.wav
# Preload all models on startup
.build/release/audio-server --preload --port 8080

The primary WebSocket endpoint implements the OpenAI Realtime API protocol: all messages are JSON with a type field, and audio is base64-encoded PCM16 24kHz mono.
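
Since the models in this package work with float samples while the WebSocket protocol carries base64 PCM16, a client needs a conversion step on both sides. The sketch below assumes little-endian 16-bit samples and a scale factor of 32767; both are assumptions to check against the server, and the function names are hypothetical.

```swift
import Foundation

/// Encodes float samples in [-1, 1] as little-endian PCM16, then base64.
func encodePCM16Base64(_ samples: [Float]) -> String {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        // Replace NaN/inf with silence and clamp to the valid range.
        let safe = sample.isFinite ? max(-1.0, min(1.0, sample)) : 0
        let value = Int16((safe * 32767.0).rounded())
        let bits = UInt16(bitPattern: value)
        data.append(UInt8(bits & 0xFF))   // low byte first (little-endian)
        data.append(UInt8(bits >> 8))
    }
    return data.base64EncodedString()
}

/// Decodes base64 little-endian PCM16 back into float samples.
func decodePCM16Base64(_ base64: String) -> [Float]? {
    guard let data = Data(base64Encoded: base64), data.count % 2 == 0 else { return nil }
    let bytes = [UInt8](data)
    var samples: [Float] = []
    samples.reserveCapacity(bytes.count / 2)
    for i in stride(from: 0, to: bytes.count, by: 2) {
        let bits = UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8)
        samples.append(Float(Int16(bitPattern: bits)) / 32767.0)
    }
    return samples
}

let payload = encodePCM16Base64([0, 1, -1])
print(payload)
```

The encoded string would go in the audio field of an input_audio_buffer.append message, and the decoder would be applied to each response.audio.delta chunk.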
Client → Server events:
| Event | Description |
|---|---|
| session.update | Configure engine, language, audio format |
| input_audio_buffer.append | Send base64 PCM16 audio chunk |
| input_audio_buffer.commit | Transcribe accumulated audio (ASR) |
| input_audio_buffer.clear | Clear audio buffer |
| response.create | Request TTS synthesis |
Server → Client events:
| Event | Description |
|---|---|
| session.created | Session initialized |
| session.updated | Configuration confirmed |
| input_audio_buffer.committed | Audio committed for transcription |
| conversation.item.input_audio_transcription.completed | ASR result |
| response.audio.delta | Base64 PCM16 audio chunk (TTS) |
| response.audio.done | Audio streaming complete |
| response.done | Response complete with metadata |
| error | Error with type and message |
const ws = new WebSocket('ws://localhost:8080/v1/realtime');
// ASR: send audio, get transcription
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64PCM16 }));
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// → receives: conversation.item.input_audio_transcription.completed
// TTS: send text, get streamed audio
ws.send(JSON.stringify({
type: 'response.create',
response: { modalities: ['audio', 'text'], instructions: 'Hello world' }
}));
// → receives: response.audio.delta (base64 chunks), response.audio.done, response.done

An example HTML client is at Examples/websocket-client.html; open it in a browser while the server is running.
The server is a separate AudioServer module and audio-server executable — it does not add Hummingbird/WebSocket to the main audio CLI.
| Model | Backend | RTF | 10s audio processed in |
|---|---|---|---|
| Qwen3-ASR-0.6B (4-bit) | MLX | ~0.06 | ~0.6s |
| Qwen3-ASR-0.6B (INT8) | CoreML + MLX | ~0.09 | ~0.9s |
| Qwen3-ASR-1.7B (8-bit) | MLX | ~0.11 | ~1.1s |
| Parakeet-TDT-0.6B (INT8) | CoreML (Neural Engine) | ~0.09 cold, ~0.03 warm | ~0.9s / ~0.3s |
| Whisper-large-v3 | whisper.cpp (Q5_0) | ~0.10 | ~1.0s |
| Whisper-small | whisper.cpp (Q5_0) | ~0.04 | ~0.4s |
| Model | Framework | 20s audio | RTF |
|---|---|---|---|
| Qwen3-ForcedAligner-0.6B (4-bit) | MLX Swift (debug) | ~365ms | ~0.018 |
Single non-autoregressive forward pass — no sampling loop. Audio encoder dominates (~328ms), decoder single-pass is ~37ms. 55x faster than real-time.
| Model | Framework | Short (1s) | Medium (3s) | Long (6s) | Streaming First-Packet |
|---|---|---|---|---|---|
| Qwen3-TTS-0.6B (4-bit) | MLX Swift (release) | 1.6s (RTF 1.2) | 2.3s (RTF 0.7) | 3.9s (RTF 0.7) | ~120ms (1-frame) |
| Kokoro-82M | CoreML (Neural Engine) | ~1.4s (RTFx 0.7) | ~4.3s (RTFx 0.7) | ~8.6s (RTFx 0.7) | N/A (non-autoregressive) |
| Apple AVSpeechSynthesizer | AVFoundation | 0.08s | 0.08s | 0.17s (RTF 0.02) | N/A |
Qwen3-TTS generates natural, expressive speech with prosody and emotion. It runs faster than real-time on medium and long utterances (RTF ~0.7), while short 1s utterances come in at RTF ~1.2. Streaming synthesis delivers the first audio chunk in ~120ms. Kokoro-82M runs entirely on Neural Engine with an end-to-end model (RTFx ~0.7), ideal for iOS. Apple's built-in TTS is faster but produces robotic, monotone speech.
| Model | Framework | ms/step | RTF | Notes |
|---|---|---|---|---|
| PersonaPlex-7B (8-bit) | MLX Swift (release) | ~112ms | ~1.4 | Recommended — coherent responses, 30% faster than 4-bit |
| PersonaPlex-7B (4-bit) | MLX Swift (release) | ~158ms | ~1.97 | Not recommended — garbled output quality |
Use 8-bit. INT8 is both faster (112 ms/step vs 158 ms/step) and produces coherent full-duplex responses. INT4 quantization degrades generation quality, producing incoherent speech ("I go tea my coffee brewing..."). INT8 runs at ~112ms/step on M2 Max — above the 80ms real-time threshold but close to usable for streaming, and the output quality difference is decisive.
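The RTF column in the PersonaPlex table can be reproduced from the ms/step figures. The sketch below assumes that each generation step has to finish within the 80 ms real-time threshold quoted above, so RTF is step time divided by that budget. The function names are hypothetical.

```javascript
// Real-time budget per generation step, taken from the 80ms threshold quoted above.
const FRAME_BUDGET_MS = 80;

// RTF for a step-based model: time spent per step divided by the per-step budget.
function stepRTF(msPerStep, frameBudgetMs = FRAME_BUDGET_MS) {
  return msPerStep / frameBudgetMs;
}

// A model keeps up with live audio only when each step fits inside the budget.
function keepsUpWithRealTime(msPerStep, frameBudgetMs = FRAME_BUDGET_MS) {
  return stepRTF(msPerStep, frameBudgetMs) <= 1.0;
}

console.log(stepRTF(112).toFixed(2)); // 8-bit: prints "1.40"
console.log(stepRTF(158).toFixed(2)); // 4-bit: prints "1.98"
console.log(keepsUpWithRealTime(112)); // prints false
```

Under this reading, a step time of 80 ms or less would be needed for the model to keep pace with live audio.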
| Model | Backend | Per-call Latency | RTF | Notes |
|---|---|---|---|---|
| Silero-VAD-v5 | MLX | ~2.1ms / chunk | 0.065 | GPU (Metal) |
| Silero-VAD-v5 | CoreML | ~0.27ms / chunk | 0.008 | Neural Engine, 7.7x faster |
| WeSpeaker ResNet34-LM | MLX | ~310ms / 20s audio | 0.016 | GPU (Metal) |
| WeSpeaker ResNet34-LM | CoreML | ~430ms / 20s audio | 0.021 | Neural Engine, frees GPU |
Silero VAD CoreML runs on the Neural Engine at 7.7x the speed of MLX, making it ideal for always-on microphone input. WeSpeaker MLX is faster on GPU, but CoreML frees the GPU for concurrent workloads (TTS, ASR). Both backends produce equivalent results.
| Model | Backend | Duration | Latency | RTF |
|---|---|---|---|---|
| DeepFilterNet3 (FP16) | CoreML | 5s | 0.65s | 0.13 |
| DeepFilterNet3 (FP16) | CoreML | 10s | 1.2s | 0.12 |
| DeepFilterNet3 (FP16) | CoreML | 20s | 4.8s | 0.24 |
RTF = Real-Time Factor (lower is better, < 1.0 = faster than real-time). GRU cost scales ~O(n²).
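As a worked example of the definition above, the sketch below computes RTF as processing time divided by audio duration and checks it against the DeepFilterNet3 rows. The helper names realTimeFactor and speedup are hypothetical.

```javascript
// RTF = processing time / audio duration (both in seconds).
function realTimeFactor(processingSeconds, audioSeconds) {
  if (!(audioSeconds > 0)) throw new RangeError('audioSeconds must be positive');
  return processingSeconds / audioSeconds;
}

// How many times faster than real-time a given RTF is.
function speedup(rtf) {
  if (!(rtf > 0)) throw new RangeError('rtf must be positive');
  return 1 / rtf;
}

// DeepFilterNet3 rows from the table above: [audio duration, latency].
const rows = [[5, 0.65], [10, 1.2], [20, 4.8]];
for (const [duration, latency] of rows) {
  console.log(`${duration}s audio: RTF ${realTimeFactor(latency, duration).toFixed(2)}`);
}
// prints RTF 0.13, 0.12 and 0.24 for the three rows
```

The same helper explains the other tables: an RTF of 0.018 for the forced aligner corresponds to speedup(0.018), roughly 55x faster than real-time.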
Both backends produce equivalent results. Choose based on your workload:
| | MLX | CoreML |
|---|---|---|
| Hardware | GPU (Metal shaders) | Neural Engine + CPU |
| Best for | Maximum throughput, single-model workloads | Multi-model pipelines, background tasks |
| Power | Higher GPU utilization | Lower power, frees GPU |
| Latency | Faster for large models (WeSpeaker) | Faster for small models (Silero VAD) |
Desktop inference: MLX is the default — fastest single-model performance on Apple Silicon. Switch to CoreML when running multiple models concurrently (e.g., VAD + ASR + TTS) to avoid GPU contention, or for battery-sensitive workloads on laptops.
CoreML models are available for Qwen3-ASR encoder, Silero VAD, and WeSpeaker. For Qwen3-ASR, use --engine qwen3-coreml (hybrid: CoreML encoder on ANE + MLX text decoder on GPU). For VAD/embeddings, pass engine: .coreml at construction time — inference API is identical.
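The guidance above amounts to a small decision rule. The sketch below encodes it as a function; chooseBackend and its option names are hypothetical and are not part of the speech-swift API, and the rule is only as precise as the prose it summarizes.

```javascript
// Pick a backend following the guidance above:
// MLX by default, CoreML when several models run concurrently
// or when the workload is battery-sensitive.
function chooseBackend({ concurrentModels = 1, batterySensitive = false } = {}) {
  if (concurrentModels > 1 || batterySensitive) {
    return 'coreml'; // Neural Engine + CPU, leaves the GPU free
  }
  return 'mlx'; // GPU (Metal), fastest single-model path
}

console.log(chooseBackend()); // prints "mlx"
console.log(chooseBackend({ concurrentModels: 3 })); // prints "coreml"
console.log(chooseBackend({ batterySensitive: true })); // prints "coreml"
```

In practice the choice is also limited by which models ship a CoreML variant, as listed above.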
ASR — Word Error Rate (details)
| Model | WER% (LibriSpeech test-clean) | RTF |
|---|---|---|
| Qwen3-ASR 1.7B 8-bit | 2.35 | 0.090 |
| Qwen3-ASR 1.7B 4-bit | 2.57 | 0.045 |
| Parakeet TDT INT8 | 2.74 | 0.089 |
| Qwen3-ASR 0.6B 8-bit | 2.80 | 0.025 |
Qwen3-ASR 1.7B 8-bit beats Whisper Large v3 Turbo (2.5%) at comparable size. Multilingual: 10 languages benchmarked on FLEURS.
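For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation, not the benchmark's scoring code; the text normalization (lowercasing and whitespace splitting) is an assumption, and real benchmarks usually normalize punctuation and numerals as well.

```javascript
// Assumed normalization: lowercase, then split on whitespace.
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a word-level Levenshtein distance.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new RangeError('reference must contain at least one word');

  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One dropped word out of six reference words.
console.log((100 * wordErrorRate('the cat sat on the mat', 'the cat sat on mat')).toFixed(2));
// prints "16.67"
```

A WER of 2.35% therefore means roughly 2 to 3 word-level errors per 100 reference words.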
TTS — Round-Trip Intelligibility (details)
| Engine | WER% | RTF |
|---|---|---|
| CosyVoice3 | 3.25 | 0.59 |
| Qwen3-TTS 1.7B | 3.47 | 0.79 |
| Kokoro-82M | 3.90 | 0.17 |
VAD — Speech Detection (details)
| Engine | F1% (FLEURS) | RTF |
|---|---|---|
| FireRedVAD | 99.12 | 0.007 |
| Silero CoreML | 95.13 | 0.022 |
| Pyannote MLX | 94.86 | 0.358 |
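F1 is the harmonic mean of precision and recall. The sketch below computes it from per-frame speech/non-speech labels; frame-level scoring is an assumption, since the table above does not describe how the benchmark aligns predictions with the reference. The function name f1Score is hypothetical.

```javascript
// F1 from two equal-length arrays of frame labels (truthy = speech).
function f1Score(predicted, reference) {
  if (predicted.length !== reference.length) {
    throw new RangeError('label arrays must have the same length');
  }
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (let i = 0; i < predicted.length; i++) {
    const p = Boolean(predicted[i]);
    const r = Boolean(reference[i]);
    if (p && r) truePositives++;
    else if (p && !r) falsePositives++;
    else if (!p && r) falseNegatives++;
  }
  const denominator = 2 * truePositives + falsePositives + falseNegatives;
  // With no speech in either array there is nothing to detect; report 1 by convention.
  if (denominator === 0) return 1;
  // Equivalent to 2 * precision * recall / (precision + recall).
  return (2 * truePositives) / denominator;
}

// 2 true positives, 1 false positive, 1 false negative.
console.log(f1Score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]).toFixed(4)); // prints "0.6667"
```

Because F1 ignores true negatives, long stretches of correctly detected silence do not inflate the score.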
Models: ASR Model, TTS Model, CosyVoice TTS, Kokoro TTS, Parakeet TDT, PersonaPlex, FireRedVAD
Inference: Qwen3-ASR, Parakeet TDT, TTS, Forced Aligner, FireRedVAD, Silero VAD, Speaker Diarization, Speech Enhancement
Audio: Streaming Playback, Voice Pipeline
Benchmarks: ASR WER, TTS Round-Trip, VAD Detection
Reference: Shared Protocols
Model weights are cached locally in ~/Library/Caches/qwen3-speech/.
CLI — override with an environment variable:
export QWEN3_CACHE_DIR=/path/to/cache
Swift API: all fromPretrained() methods accept cacheDir and offlineMode:
// Custom cache directory (sandboxed apps, iOS containers)
let model = try await ParakeetASRModel.fromPretrained(
cacheDir: myAppModelsDir)
// Offline mode — skip network when weights are already cached
let ttsModel = try await KokoroTTSModel.fromPretrained(offlineMode: true)
See docs/inference/cache-and-offline.md for full details.
If you see Failed to load the default metallib at runtime, the Metal shader library is missing. Run make build (or ./scripts/build_mlx_metallib.sh release after a manual swift build) to compile it. If the Metal Toolchain is missing, install it first:
xcodebuild -downloadComponent MetalToolchain
Unit tests (config, sampling, text preprocessing, timestamp correction) run without model downloads:
swift test --filter "Qwen3TTSConfigTests|SamplingTests|CosyVoiceTTSConfigTests|CamPlusPlusMelExtractorTests|PersonaPlexTests|ForcedAlignerTests/testText|ForcedAlignerTests/testTimestamp|ForcedAlignerTests/testLIS|SileroVADTests/testSilero|SileroVADTests/testReflection|SileroVADTests/testProcess|SileroVADTests/testReset|SileroVADTests/testDetect|SileroVADTests/testStreaming|SileroVADTests/testVADEvent|KokoroTTSTests"
Integration tests require model weights (downloaded automatically on first run):
# TTS round-trip: synthesize text, save WAV, transcribe back with ASR
swift test --filter TTSASRRoundTripTests
# ASR only: transcribe test audio
swift test --filter Qwen3ASRIntegrationTests
# Forced Aligner E2E: word-level timestamps (~979 MB download)
swift test --filter ForcedAlignerTests/testForcedAlignerE2E
# PersonaPlex E2E: speech-to-speech pipeline (~5.5 GB download)
PERSONAPLEX_E2E=1 swift test --filter PersonaPlexE2ETests
Note: the MLX Metal library must be built before running tests that use MLX operations. See MLX Metal Library for instructions.
| Model | Languages |
|---|---|
| Qwen3-ASR | 52 languages (CN, EN, Cantonese, DE, FR, ES, JA, KO, RU, + 22 Chinese dialects, ...) |
| Parakeet TDT | 25 European languages (BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, HR, HU, IT, LT, LV, MT, NL, PL, PT, RO, RU, SK, SL, SV, UK) |
| Qwen3-TTS | EN, CN, DE, JA, ES, FR, KO, RU, IT, PT (+ Beijing/Sichuan dialects via CustomVoice) |
| CosyVoice TTS | CN, EN, JA, KO, DE, ES, FR, IT, RU |
| Kokoro TTS | EN (US/UK), ES, FR, HI, IT, JA, PT, CN, KO, DE |
| PersonaPlex | EN |
| | speech-swift (Qwen3-ASR) | whisper.cpp | Apple SFSpeechRecognizer | Google Cloud Speech |
|---|---|---|---|---|
| Runtime | On-device (MLX/CoreML) | On-device (CPU/GPU) | On-device or cloud | Cloud only |
| Languages | 52 | 100+ | ~70 (on-device: limited) | 125+ |
| RTF (10s audio, M2 Max) | 0.06 (17x real-time) | 0.10 (Whisper-large-v3) | N/A | N/A |
| Streaming | No (batch) | No (batch) | Yes | Yes |
| Custom models | Yes (swap HuggingFace weights) | Yes (GGML models) | No | No |
| Swift API | Native async/await | C++ with Swift bridge | Native | REST/gRPC |
| Privacy | Fully on-device | Fully on-device | Depends on config | Data sent to cloud |
| Word timestamps | Yes (Forced Aligner) | Yes | Limited | Yes |
| Cost | Free (Apache 2.0) | Free (MIT) | Free (on-device) | Pay per minute |
| | speech-swift (Qwen3-TTS) | speech-swift (Kokoro) | Apple AVSpeechSynthesizer | ElevenLabs / Cloud TTS |
|---|---|---|---|---|
| Quality | Neural, expressive | Neural, natural | Robotic, monotone | Neural, highest quality |
| Runtime | On-device (MLX) | On-device (CoreML) | On-device | Cloud only |
| Streaming | Yes (~120ms first chunk) | No (single pass, ~45ms) | No | Yes |
| Voice cloning | Yes | No | No | Yes |
| Voices | 9 built-in + clone any | 50 preset voices | ~50 system voices | 1000+ |
| Languages | 10 | 10 | 60+ | 30+ |
| iOS support | macOS only | iOS + macOS | iOS + macOS | Any (API) |
| Cost | Free (Apache 2.0) | Free (Apache 2.0) | Free | Pay per character |
- Privacy-critical apps — medical, legal, enterprise where audio cannot leave the device
- Offline use — no internet connection needed after initial model download
- Cost-sensitive — no per-minute or per-character API charges
- Apple Silicon optimization — built specifically for M-series GPU (Metal) and Neural Engine
- Full pipeline — combine ASR + TTS + VAD + diarization + enhancement in a single Swift package
Does speech-swift work on iOS? Kokoro TTS, Qwen3.5-Chat (CoreML), Silero VAD, Parakeet ASR, DeepFilterNet3, and WeSpeaker all run on iOS 17+ via CoreML on the Neural Engine. MLX-based models (Qwen3-ASR, Qwen3-TTS, Qwen3.5-Chat MLX, PersonaPlex) require macOS 14+ on Apple Silicon.
Does it require an internet connection? Only for the initial model download from HuggingFace (automatic, cached in ~/Library/Caches/qwen3-speech/). After that, all inference runs fully offline with no network access.
How does speech-swift compare to Whisper? Qwen3-ASR-0.6B achieves RTF 0.06 on M2 Max, taking about 40% less time than Whisper-large-v3 via whisper.cpp (RTF 0.10), with comparable accuracy across 52 languages. speech-swift provides a native Swift async/await API, while whisper.cpp requires a C++ bridge.
Can I use it in a commercial app? Yes. speech-swift is licensed under Apache 2.0. The underlying model weights have their own licenses (check each model's HuggingFace page).
What Apple Silicon chips are supported? All M-series chips: M1, M2, M3, M4 and their Pro/Max/Ultra variants. Requires macOS 14+ (Sonoma) or iOS 17+.
How much memory does it need? From ~3 MB (Silero VAD) to ~6.5 GB (PersonaPlex 7B). Kokoro TTS uses ~500 MB, Qwen3-ASR ~2.2 GB. See the Memory Requirements table for full details.
Can I run multiple models simultaneously? Yes. Use CoreML models on the Neural Engine alongside MLX models on the GPU to avoid contention — for example, Silero VAD (CoreML) + Qwen3-ASR (MLX) + Qwen3-TTS (MLX).
Is there a REST API? Yes. The audio-server binary exposes all models via HTTP REST and WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at /v1/realtime.
We welcome contributions! Whether it's a bug fix, new model integration, or documentation improvement — PRs are appreciated.
To get started:
- Fork the repo and create a feature branch
- make build to compile (requires Xcode + Metal Toolchain)
- make test to run the test suite
- Open a PR against main
Apache 2.0