Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,8 @@ Project skills in `.claude/skills/`:

- `Sources/Qwen3ASR/` — Speech-to-text (Qwen3-ASR)
- `Sources/ParakeetASR/` — Speech-to-text (Parakeet TDT, CoreML)
- `Sources/ParakeetStreamingASR/` — Streaming speech-to-text (Parakeet EOU 120M, CoreML)
- `Sources/OmnilingualASR/` — Speech-to-text (Meta wav2vec2 CTC, 1,672 languages, CoreML 300M + MLX 300M/1B/3B/7B)
- `Sources/Qwen3TTS/` — Text-to-speech (Qwen3-TTS)
- `Sources/CosyVoiceTTS/` — Text-to-speech (CosyVoice3, streaming)
- `Sources/KokoroTTS/` — Text-to-speech (Kokoro-82M, CoreML, iOS-ready)
Expand All @@ -67,8 +69,8 @@ Project skills in `.claude/skills/`:
- `Sources/SpeechVAD/` — VAD (Silero + Pyannote), speaker diarization, speaker embedding (WeSpeaker)
- `Sources/SpeechEnhancement/` — Noise suppression (DeepFilterNet3, CoreML)
- `Sources/Qwen3Chat/` — On-device LLM chat (Qwen3.5-0.8B, MLX + CoreML, INT4/INT8)
- `Sources/MLXCommon/` — Shared MLX utilities (weight loading, quantized layers, memory estimation)
- `Sources/AudioCommon/` — Audio I/O, protocols, HuggingFace downloader
- `Sources/MLXCommon/` — Shared MLX utilities (weight loading, quantized layers, memory estimation, `SDPA` multi-head attention helper)
- `Sources/AudioCommon/` — Audio I/O, protocols, HuggingFace downloader, shared `SentencePieceModel` protobuf reader
- `Sources/AudioCLILib/` — CLI commands
- `Sources/AudioCLI/` — CLI entry point (`audio` binary)
- `Tests/` — Unit and integration tests
Expand Down
26 changes: 26 additions & 0 deletions Package.swift
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,10 @@ let package = Package(
name: "ParakeetStreamingASR",
targets: ["ParakeetStreamingASR"]
),
.library(
name: "OmnilingualASR",
targets: ["OmnilingualASR"]
),
.library(
name: "SpeechCore",
targets: ["SpeechCore"]
Expand Down Expand Up @@ -170,6 +174,16 @@ let package = Package(
"AudioCommon",
]
),
.target(
name: "OmnilingualASR",
dependencies: [
"AudioCommon",
"MLXCommon",
.product(name: "MLX", package: "mlx-swift"),
.product(name: "MLXNN", package: "mlx-swift"),
.product(name: "MLXFast", package: "mlx-swift")
]
),
.binaryTarget(
name: "CSpeechCore",
url: "https://github.com/soniqo/speech-core/releases/download/v0.0.5/SpeechCore.xcframework.zip",
Expand Down Expand Up @@ -217,6 +231,7 @@ let package = Package(
"SpeechEnhancement",
"ParakeetASR",
"ParakeetStreamingASR",
"OmnilingualASR",
"KokoroTTS",
"AudioCommon",
.product(name: "MLX", package: "mlx-swift"),
Expand Down Expand Up @@ -293,6 +308,17 @@ let package = Package(
.copy("Resources/test_audio.wav")
]
),
.testTarget(
name: "OmnilingualASRTests",
dependencies: ["OmnilingualASR", "AudioCommon"],
resources: [
.copy("Resources/test_audio.wav"),
.copy("Resources/fleurs_en.wav"),
.copy("Resources/fleurs_hi.wav"),
.copy("Resources/fleurs_fr.wav"),
.copy("Resources/fleurs_ar.wav")
]
),
.testTarget(
name: "AudioCommonTests",
dependencies: [
Expand Down
9 changes: 8 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs

- **Qwen3-ASR** — Speech-to-text / speech recognition (automatic speech recognition, 52 languages)
- **Parakeet TDT** — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
- **Omnilingual ASR** — Speech-to-text via CoreML (Meta wav2vec2 + CTC, **1,672 languages** across 150+ families, 5/10 s windows, INT8 palettized, ANE)
- **Qwen3-ForcedAligner** — Word-level timestamp alignment (audio + text → timestamps)
- **Qwen3-TTS** — Text-to-speech synthesis (highest quality, streaming, custom speakers, 10 languages)
- **CosyVoice TTS** — Text-to-speech with streaming, voice cloning, multi-speaker dialogue, and emotion tags (9 languages, DiT flow matching, CAM++ speaker encoder)
Expand Down Expand Up @@ -97,7 +98,7 @@ struct DictateView: View {

`SpeechUI` ships only `TranscriptionView` (finals + partials) and `TranscriptionStore` (streaming ASR adapter). Use AVFoundation for audio visualization and playback.

Available SPM products: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`, `ParakeetStreamingASR`, `KokoroTTS`, `CosyVoiceTTS`, `PersonaPlex`, `SpeechVAD`, `SpeechEnhancement`, `Qwen3Chat`, `SpeechCore`, `SpeechUI`, `AudioCommon`.
Available SPM products: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`, `ParakeetStreamingASR`, `OmnilingualASR`, `KokoroTTS`, `CosyVoiceTTS`, `PersonaPlex`, `SpeechVAD`, `SpeechEnhancement`, `Qwen3Chat`, `SpeechCore`, `SpeechUI`, `AudioCommon`.

## Models

Expand All @@ -107,6 +108,10 @@ Available SPM products: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`,
| Qwen3-ASR-1.7B | Speech → Text | No | 52 languages | [4-bit](https://huggingface.co/aufklarer/Qwen3-ASR-1.7B-MLX-4bit) 2.1 GB · [8-bit](https://huggingface.co/aufklarer/Qwen3-ASR-1.7B-MLX-8bit) 3.2 GB |
| Parakeet-TDT-0.6B | Speech → Text | No | 25 European languages | [CoreML INT8](https://huggingface.co/aufklarer/Parakeet-TDT-v3-CoreML-INT8) 500 MB |
| Parakeet-EOU-120M | Speech → Text | Yes (streaming + EOU) | 25 European languages | [CoreML INT8](https://huggingface.co/aufklarer/Parakeet-EOU-120M-CoreML-INT8) ~120 MB |
| Omnilingual-ASR-CTC-300M | Speech → Text | No (5/10 s window) | [1,672 languages](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py) | [CoreML INT8 (10s)](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-300M-CoreML-INT8-10s) 312 MB · [CoreML INT8 (5s)](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-300M-CoreML-INT8) 312 MB · [MLX 4-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-300M-MLX-4bit) 193 MB · [MLX 8-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-300M-MLX-8bit) 342 MB |
| Omnilingual-ASR-CTC-1B | Speech → Text | No | [1,672 languages](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py) | [MLX 4-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-1B-MLX-4bit) 549 MB · [MLX 8-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-1B-MLX-8bit) 1006 MB |
| Omnilingual-ASR-CTC-3B | Speech → Text | No | [1,672 languages](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py) | [MLX 4-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-3B-MLX-4bit) 1709 MB · [MLX 8-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-3B-MLX-8bit) 3159 MB |
| Omnilingual-ASR-CTC-7B | Speech → Text | No | [1,672 languages](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py) | [MLX 4-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-7B-MLX-4bit) 3.55 GB · [MLX 8-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-7B-MLX-8bit) 6.63 GB |
| Qwen3-ForcedAligner-0.6B | Audio + Text → Timestamps | No | Multi | [4-bit](https://huggingface.co/aufklarer/Qwen3-ForcedAligner-0.6B-4bit) 979 MB · [8-bit](https://huggingface.co/aufklarer/Qwen3-ForcedAligner-0.6B-8bit) 1.4 GB · [CoreML INT4](https://huggingface.co/aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT4) 630 MB · [CoreML INT8](https://huggingface.co/aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT8) 1.0 GB |
| Qwen3-TTS-0.6B Base | Text → Speech | Yes (~120ms) | 10 languages | [4-bit](https://huggingface.co/aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit) 1.7 GB · [8-bit](https://huggingface.co/aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-8bit) 2.4 GB · [CoreML](https://huggingface.co/aufklarer/Qwen3-TTS-CoreML) 1.0 GB |
| Qwen3-TTS-0.6B CustomVoice | Text → Speech | Yes (~120ms) | 10 languages | [4-bit](https://huggingface.co/aufklarer/Qwen3-TTS-12Hz-0.6B-CustomVoice-MLX-4bit) 1.7 GB |
Expand Down Expand Up @@ -134,6 +139,7 @@ Weight memory is the GPU (MLX) or ANE (CoreML) memory consumed by model paramete
| Qwen3-ASR-1.7B (8-bit, MLX) | 2,349 MB | ~4 GB |
| Parakeet-TDT-0.6B (CoreML) | 315 MB | ~400 MB |
| Parakeet-EOU-120M (CoreML) | ~120 MB | ~200 MB |
| Omnilingual-ASR-CTC-300M (CoreML INT8) | 312 MB | ~600 MB |
| Qwen3-ForcedAligner-0.6B (4-bit, MLX) | 933 MB | ~1.5 GB |
| Qwen3-TTS-1.7B (4-bit, MLX) | 2,300 MB | ~4–6 GB |
| Qwen3-TTS-0.6B (4-bit, MLX) | 977 MB | ~2 GB |
Expand Down Expand Up @@ -1274,6 +1280,7 @@ PERSONAPLEX_E2E=1 swift test --filter PersonaPlexE2ETests
|-------|-----------|
| Qwen3-ASR | 52 languages (CN, EN, Cantonese, DE, FR, ES, JA, KO, RU, + 22 Chinese dialects, ...) |
| Parakeet TDT | 25 European languages (BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, HR, HU, IT, LT, LV, MT, NL, PL, PT, RO, RU, SK, SL, SV, UK) |
| Omnilingual ASR | **1,672 languages** across 150+ families. Includes all of Meta's "served directly" set (~1,100) plus 500+ low-resource languages added via community data. Selected: EN, ES, FR, DE, ZH, JA, KO, AR, HI, BN, UR, FA, HE, TR, RU, UK, PL, NL, IT, PT, SW, YO, IG, HA, AM, ZU, XH, VI, TH, ID, MS, TL, MY, KM, LO, NE, SI, GU, MR, PA, TA, TE, KN, ML, OR, AS, KA, HY, AZ, KK, UZ, MN, BO, DZ, ... (full list: [`lang_ids.py`](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py)). The CTC head is language-agnostic — no language hint is required at inference time. |
| Qwen3-TTS | EN, CN, DE, JA, ES, FR, KO, RU, IT, PT (+ Beijing/Sichuan dialects via CustomVoice) |
| CosyVoice TTS | CN, EN, JA, KO, DE, ES, FR, IT, RU |
| Kokoro TTS | EN (US/UK), ES, FR, HI, IT, JA, PT, CN, KO, DE |
Expand Down
9 changes: 8 additions & 1 deletion README_de.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ Spracherkennung, -synthese und -verständnis auf dem Gerät für Mac und iOS. L

- **Qwen3-ASR** — Sprache-zu-Text / Spracherkennung (automatische Spracherkennung, 52 Sprachen)
- **Parakeet TDT** — Sprache-zu-Text über CoreML (Neural Engine, NVIDIA FastConformer + TDT-Decoder, 25 Sprachen)
- **Omnilingual ASR** — Sprache-zu-Text über CoreML (Meta wav2vec2 + CTC, **1.672 Sprachen** aus 150+ Sprachfamilien, 5/10 s Fenster, INT8 palettiert, ANE)
- **Qwen3-ForcedAligner** — Wortgenaue Zeitstempel-Zuordnung (Audio + Text → Zeitstempel)
- **Qwen3-TTS** — Sprachsynthese (höchste Qualität, Streaming, benutzerdefinierte Sprecher, 10 Sprachen)
- **CosyVoice TTS** — Sprachsynthese mit Streaming, Stimmklonen, Mehrsprecherdialog und Emotions-Tags (9 Sprachen, DiT Flow Matching, CAM++ Sprecherencoder)
Expand Down Expand Up @@ -97,7 +98,7 @@ struct DictateView: View {

`SpeechUI` enthält nur `TranscriptionView` (Endgültige + Teilergebnisse) und `TranscriptionStore` (Streaming-ASR-Adapter). Für Audio-Visualisierung und -Wiedergabe verwenden Sie AVFoundation.

Verfügbare SPM-Produkte: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`, `ParakeetStreamingASR`, `KokoroTTS`, `CosyVoiceTTS`, `PersonaPlex`, `SpeechVAD`, `SpeechEnhancement`, `Qwen3Chat`, `SpeechCore`, `SpeechUI`, `AudioCommon`.
Verfügbare SPM-Produkte: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`, `ParakeetStreamingASR`, `OmnilingualASR`, `KokoroTTS`, `CosyVoiceTTS`, `PersonaPlex`, `SpeechVAD`, `SpeechEnhancement`, `Qwen3Chat`, `SpeechCore`, `SpeechUI`, `AudioCommon`.

## Modelle

Expand All @@ -107,6 +108,10 @@ Verfügbare SPM-Produkte: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR
| Qwen3-ASR-1.7B | Sprache → Text | Nein | 52 Sprachen | [4-bit](https://huggingface.co/aufklarer/Qwen3-ASR-1.7B-MLX-4bit) 2.1 GB · [8-bit](https://huggingface.co/aufklarer/Qwen3-ASR-1.7B-MLX-8bit) 3.2 GB |
| Parakeet-TDT-0.6B | Sprache → Text | Nein | 25 europäische Sprachen | [CoreML INT8](https://huggingface.co/aufklarer/Parakeet-TDT-v3-CoreML-INT8) 500 MB |
| Parakeet-EOU-120M | Sprache → Text | Ja (Streaming + EOU) | 25 europäische Sprachen | [CoreML INT8](https://huggingface.co/aufklarer/Parakeet-EOU-120M-CoreML-INT8) ~120 MB |
| Omnilingual-ASR-CTC-300M | Sprache → Text | Nein (5/10 s Fenster) | [1.672 Sprachen](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py) | [CoreML INT8 (10s)](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-300M-CoreML-INT8-10s) 312 MB · [CoreML INT8 (5s)](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-300M-CoreML-INT8) 312 MB · [MLX 4-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-300M-MLX-4bit) 193 MB · [MLX 8-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-300M-MLX-8bit) 342 MB |
| Omnilingual-ASR-CTC-1B | Sprache → Text | Nein | [1.672 Sprachen](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py) | [MLX 4-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-1B-MLX-4bit) 549 MB · [MLX 8-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-1B-MLX-8bit) 1006 MB |
| Omnilingual-ASR-CTC-3B | Sprache → Text | Nein | [1.672 Sprachen](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py) | [MLX 4-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-3B-MLX-4bit) 1709 MB · [MLX 8-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-3B-MLX-8bit) 3159 MB |
| Omnilingual-ASR-CTC-7B | Sprache → Text | Nein | [1.672 Sprachen](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py) | [MLX 4-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-7B-MLX-4bit) 3.55 GB · [MLX 8-bit](https://huggingface.co/aufklarer/Omnilingual-ASR-CTC-7B-MLX-8bit) 6.63 GB |
| Qwen3-ForcedAligner-0.6B | Audio + Text → Zeitstempel | Nein | Mehrsprachig | [4-bit](https://huggingface.co/aufklarer/Qwen3-ForcedAligner-0.6B-4bit) 979 MB · [8-bit](https://huggingface.co/aufklarer/Qwen3-ForcedAligner-0.6B-8bit) 1.4 GB · [CoreML INT4](https://huggingface.co/aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT4) 630 MB · [CoreML INT8](https://huggingface.co/aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT8) 1.0 GB |
| Qwen3-TTS-0.6B Base | Text → Sprache | Ja (~120ms) | 10 Sprachen | [4-bit](https://huggingface.co/aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit) 1.7 GB · [8-bit](https://huggingface.co/aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-8bit) 2.4 GB · [CoreML](https://huggingface.co/aufklarer/Qwen3-TTS-CoreML) 1.0 GB |
| Qwen3-TTS-0.6B CustomVoice | Text → Sprache | Ja (~120ms) | 10 Sprachen | [4-bit](https://huggingface.co/aufklarer/Qwen3-TTS-12Hz-0.6B-CustomVoice-MLX-4bit) 1.7 GB |
Expand Down Expand Up @@ -134,6 +139,7 @@ Gewichtsspeicher ist der GPU- (MLX) oder ANE-Speicher (CoreML), der von Modellpa
| Qwen3-ASR-1.7B (8-bit, MLX) | 2,349 MB | ~4 GB |
| Parakeet-TDT-0.6B (CoreML) | 315 MB | ~400 MB |
| Parakeet-EOU-120M (CoreML) | ~120 MB | ~200 MB |
| Omnilingual-ASR-CTC-300M (CoreML INT8) | 312 MB | ~600 MB |
| Qwen3-ForcedAligner-0.6B (4-bit, MLX) | 933 MB | ~1.5 GB |
| Qwen3-TTS-1.7B (4-bit, MLX) | 2,300 MB | ~4–6 GB |
| Qwen3-TTS-0.6B (4-bit, MLX) | 977 MB | ~2 GB |
Expand Down Expand Up @@ -1272,6 +1278,7 @@ PERSONAPLEX_E2E=1 swift test --filter PersonaPlexE2ETests
|--------|----------|
| Qwen3-ASR | 52 Sprachen (CN, EN, Kantonesisch, DE, FR, ES, JA, KO, RU, + 22 chinesische Dialekte, ...) |
| Parakeet TDT | 25 europäische Sprachen (BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, HR, HU, IT, LT, LV, MT, NL, PL, PT, RO, RU, SK, SL, SV, UK) |
| Omnilingual ASR | **1.672 Sprachen** aus über 150 Sprachfamilien. Umfasst Metas gesamten "direkt bedienten" Satz (~1.100) sowie 500+ ressourcenarme Sprachen, die durch Community-Daten hinzugefügt wurden. Auswahl: EN, ES, FR, DE, ZH, JA, KO, AR, HI, BN, UR, FA, HE, TR, RU, UK, PL, NL, IT, PT, SW, YO, IG, HA, AM, ZU, XH, VI, TH, ID, MS, TL, MY, KM, LO, NE, SI, GU, MR, PA, TA, TE, KN, ML, OR, AS, KA, HY, AZ, KK, UZ, MN, BO, DZ, ... (vollständige Liste: [`lang_ids.py`](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py)). Der CTC-Kopf ist sprachagnostisch — zur Inferenzzeit ist kein Sprachhinweis erforderlich. |
| Qwen3-TTS | EN, CN, DE, JA, ES, FR, KO, RU, IT, PT (+ Peking-/Sichuan-Dialekte über CustomVoice) |
| CosyVoice TTS | CN, EN, JA, KO, DE, ES, FR, IT, RU |
| Kokoro TTS | EN (US/UK), ES, FR, HI, IT, JA, PT, CN, KO, DE |
Expand Down
Loading
Loading