
Add OmnilingualASR — 1,672 languages on CoreML + MLX, with shared SentencePiece/SDPA extract #201

Merged
ivan-digital merged 5 commits into main from feat/omnilingual-asr
Apr 11, 2026

Conversation

@ivan-digital
Collaborator

Summary

Single PR consolidating Meta's Omnilingual ASR port for Apple Silicon: CoreML 300M, MLX 300M/1B/3B/7B, plus two orthogonal refactors that extract duplicated building blocks into shared modules. Reference: arXiv 2511.09690, Apache 2.0.

Closes #195. Supersedes #199 and #200 (both will be closed with a pointer here).

What ships

1. OmnilingualASR CoreML (300M) — OmnilingualASRModel

  • Sources/OmnilingualASR/: Configuration, SentencePieceVocabulary, CTCGreedyDecoder, CoreML model loader with 5 s and 10 s fixed-window variants (aufklarer/Omnilingual-ASR-CTC-300M-CoreML-INT8{,-10s})
  • Utterance-level layer_norm preprocessing matching fairseq2's apply_audio_normalization — normalises the raw waveform before zero-padding so sub-window inputs match the reference pipeline statistics
  • 40 s hard cap matching Meta's MAX_ALLOWED_AUDIO_SEC; clear error pointing to SpeechVAD for longer audio
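The ordering is the whole point of the layer-norm fix: zero-padding first would fold silence into the mean/variance. An illustrative Python sketch of the order used (not the Swift implementation itself):

```python
import numpy as np

def prepare_window(waveform: np.ndarray, window_samples: int) -> np.ndarray:
    """Utterance-level layer norm, then zero-pad to the fixed window size."""
    # Normalise the raw samples first, so the statistics come from speech only.
    mean = waveform.mean()
    std = waveform.std()
    normalized = (waveform - mean) / (std + 1e-5)
    # Zero-pad afterwards; the padding no longer skews mean/variance.
    padded = np.zeros(window_samples, dtype=normalized.dtype)
    padded[: len(normalized)] = normalized
    return padded
```

Normalising after padding instead would pull the mean toward zero and shrink the variance in proportion to the padded fraction, which is exactly the mismatch the fix removes.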

2. OmnilingualASR MLX (300M / 1B / 3B / 7B) — OmnilingualASRMLXModel

  • Sources/OmnilingualASR/MLX/: fairseq2-compatible wav2vec2 port from scratch
    • Wav2Vec2Frontend — 7-layer CNN feature extractor (kernels [10,3,3,3,3,2,2], strides [5,2,2,2,2,2,2] → 50 Hz frames), per-channel LayerNorm + GELU, post_extract_layer_norm, model_dim_proj, weight-normed grouped Conv1d positional encoder (kernel 128, groups 16) with even-kernel trim and residual
    • Wav2Vec2EncoderLayer — pre-norm self-attention + 2-linear FFN, quantised projections (QuantizedLinear group size 64, 4 or 8 bits)
    • Wav2Vec2Encoder — stack of N layers + final encoder.layer_norm
    • CTCHead — quantised final_proj → 10288
    • OmnilingualMLXWeightLoader — loads model.safetensors, fuses PyTorch weight_norm(dim=2) for the position encoder (W = g · v / ‖v‖_per_kernel), applies all tensors via MLXCommon helpers
    • OmnilingualMLXConfig — variant table (300M 24L/1024d/16h, 1B 48L/1280d/20h, 3B 60L/2048d/32h, 7B 128L/2048d/32h), auto-detect from HF model id
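The 50 Hz frame rate falls out of the standard valid-convolution length formula, ⌊(L − k)/s⌋ + 1 per layer, with the strides multiplying to 5·2·2·2·2·2·2 = 320. A quick Python check using the kernel/stride table above:

```python
def frontend_output_length(num_samples: int) -> int:
    """Frames produced by the 7-layer CNN feature extractor (no padding)."""
    kernels = [10, 3, 3, 3, 3, 2, 2]
    strides = [5, 2, 2, 2, 2, 2, 2]
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1  # valid conv output length
    return length
```

One second of 16 kHz audio (16,000 samples) yields 49 frames, i.e. roughly 50 Hz.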

3. Shared extractions

AudioCommon/SentencePieceModel.swift — protobuf reader for .model files, exposing pieces: [Piece] with (text, score, type) and a PieceType enum. Eliminates two independent near-identical protobuf readers:

  • OmnilingualASR/SentencePieceVocabulary.swift now wraps it (175 → 62 lines)
  • PersonaPlex/SentencePieceDecoder.swift now wraps it (179 → 85 lines) and keeps the greedy unigram encoder
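For orientation, the `.model` wire format is ordinary protobuf: `ModelProto` field 1 is the repeated `SentencePiece` message, whose fields are piece text (1, string), score (2, float32) and type (3, enum). A minimal Python reader for just those fields — an illustrative sketch, not the shared Swift module:

```python
import struct

def read_varint(data: bytes, pos: int):
    """Decode a protobuf varint; return (value, new position)."""
    result = shift = 0
    while True:
        b = data[pos]; pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def read_pieces(data: bytes):
    """Yield (text, score, type) tuples from a SentencePiece .model blob."""
    pieces, pos = [], 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field, wire = tag >> 3, tag & 7
        if field == 1 and wire == 2:           # repeated SentencePiece message
            size, pos = read_varint(data, pos)
            sub, end = data[pos:pos + size], pos + size
            text, score, ptype, p = "", 0.0, 1, 0
            while p < len(sub):
                t, p = read_varint(sub, p)
                f, w = t >> 3, t & 7
                if f == 1 and w == 2:          # piece text
                    n, p = read_varint(sub, p)
                    text = sub[p:p + n].decode("utf-8"); p += n
                elif f == 2 and w == 5:        # score (little-endian float32)
                    score = struct.unpack("<f", sub[p:p + 4])[0]; p += 4
                elif f == 3 and w == 0:        # type enum
                    ptype, p = read_varint(sub, p)
                elif w == 0:                   # skip unknown varint field
                    _, p = read_varint(sub, p)
                else:
                    break
            pieces.append((text, score, ptype))
            pos = end
        elif wire == 0:                        # skip unknown varint field
            _, pos = read_varint(data, pos)
        elif wire == 2:                        # skip unknown length-delimited field
            n, pos = read_varint(data, pos)
            pos += n
        else:
            break
    return pieces
```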

MLXCommon/SDPA.swift — multiHead / attendAndMerge / mergeHeads helpers. All reshape calls use -1 for the batch dimension so the helpers compose with MLX.compile(shapeless:) graphs that vary batch at runtime.
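The boilerplate these helpers absorb is pure shape plumbing. A NumPy sketch of the multiHead/mergeHeads round-trip (plain softmax attention stands in for the MLXFast SDPA kernel; note the -1 batch dimension in both reshapes):

```python
import numpy as np

def multi_head_attention(q, k, v, num_heads: int):
    """[B, T, H*D] -> split heads -> attention -> merge back to [B, T, H*D]."""
    T, HD = q.shape[-2], q.shape[-1]
    D = HD // num_heads

    def split_heads(x):
        # -1 for the batch dim keeps the graph shapeless in batch size
        return x.reshape(-1, x.shape[1], num_heads, D).transpose(0, 2, 1, 3)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
    scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    attn = weights @ vh                                  # [B, H, T, D]
    # mergeHeads: [B, H, T, D] -> [B, T, H*D]; batch sized with -1, not dim(0)
    return attn.transpose(0, 2, 1, 3).reshape(-1, T, HD)
```

Baking `attn.shape[0]` into the final reshape instead of -1 would freeze the batch size at trace time, which is the compile(shapeless:) failure mode described below.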

12 attention classes migrated to use SDPA.* (Qwen3ASR audio + float text + quantised text, Qwen3TTS Talker + CodePredictor + DecoderTransformer, Qwen3Chat GatedAttentionLayer, CosyVoice LLM, PersonaPlex Mimi + Depformer + Temporal, OmnilingualASR Wav2Vec2). Eliminates ~80 lines of duplicated reshape+SDPA+reshape boilerplate.

Caught and fixed during refactor

The first mergeHeads implementation used attn.dim(0) to size the batch dimension, which baked it in as a Swift constant and broke Qwen3-TTS compile(shapeless:) graphs that vary batch at runtime: the Talker testBatchWithDefaultInstruct test failed with `Cannot reshape array of size 4096 into shape (1,1,2048)`. Fix: use -1 for the batch dimension throughout. Verified end-to-end against Talker batch + CustomVoice + CosyVoice + Omnilingual MLX after the fix.

CLI

# CoreML (default, ANE)
audio transcribe recording.wav --engine omnilingual               # 10 s window
audio transcribe recording.wav --engine omnilingual --window 5      # 5 s window

# MLX (GPU/Metal, any input length up to 40 s)
audio transcribe recording.wav --engine omnilingual --backend mlx                              # 300M @ 4-bit
audio transcribe recording.wav --engine omnilingual --backend mlx --variant 1B                  # 1B @ 4-bit
audio transcribe recording.wav --engine omnilingual --backend mlx --variant 3B --bits 8         # 3B @ 8-bit
audio transcribe recording.wav --engine omnilingual --backend mlx --variant 7B                  # 7B @ 4-bit

Docs

  • docs/models/omnilingual-asr.md — architecture, all 10 published variants (CoreML 5/10 s + MLX 300M/1B/3B/7B × 4/8 bit), and the full 1,672-language list with English names + country hints (resolved via CLDR / pycountry, grouped by 32 scripts, in a collapsible <details> block)
  • docs/inference/omnilingual-asr-inference.md — pipeline, layer-norm rationale, 40 s cap rationale, CoreML + MLX quick-starts, CLI examples, multilingual notes, perf numbers
  • docs/shared-protocols.md — added OmnilingualASR (both backends) to the SpeechRecognitionModel conformers + not-thread-safe list; added the MLXCommon module section (SDPA, QuantizedMLP, PreQuantizedEmbedding); added OmnilingualASR module tree; corrected stale entries about QuantizedMLP/PreQuantizedEmbedding location (MLXCommon, not AudioCommon)
  • CLAUDE.md / AGENTS.md — added OmnilingualASR + ParakeetStreamingASR to project structure; noted SDPA + SentencePieceModel in their respective common modules
  • README.md + all 9 translations (zh/ja/ko/es/de/fr/hi/pt/ru) — feature bullet, SPM products, 4 model rows (300M/1B/3B/7B in 4 and 8 bit, CoreML 5 s and 10 s), memory table, and a Supported Languages row linking the canonical lang_ids.py

Verified output

CoreML 300M (reference-correct after layer-norm fix):

| Lang | Transcript |
|------|------------|
| EN | can you guarantee that the replacement part will be shipped tomorrow |
| AR | كما أثنى الزملاء المصارعون على لونا (reference match) |
| HI | लूना को साथी पहलवानों ने भी सरधांजलीदी |
| FR | pense à létineraire desqui comme un étineraire de rent donner similaire |

MLX 300M-4bit (matches CoreML modulo one-char differences):

| Lang | Transcript |
|------|------------|
| EN | can you guarantee that the replacement part will be shiped tomorrow |
| AR | كما أثنى الزملاء المصارع على لونا |
| HI | लूना को साथी पहलवानों ने भी सरधांजली दी |
| FR | pensez à létineraire deski comme unétineraire de rendonner similaire |

MLX 1B-4bit sanity check: can you guarantee that the replacement par o be shipe tomorrow (slightly more quant degradation, expected).

Tests

New unit tests (~26 added):

  • OmnilingualASRTests/OmnilingualASRTests — 17 unit (config, layer-norm, CTC decoder, vocabulary)
  • OmnilingualASRTests/OmnilingualMLXTests — 9 unit (variant table, frontend output length math)
  • AudioCommonTests/SentencePieceModelTests — 7 unit (protobuf decode, type enum, helpers, subscript bounds, empty/unknown fields)
  • AudioCLITests/TranscribeCommandTests — 11 new CLI parser tests for --engine omnilingual, --window, --backend, --variant, --bits

New E2E tests (E2E… prefix, skipped in CI):

  • E2EOmnilingualASRTests — 9 (English real audio, warmup, unload, chunking, 40 s cap, FLEURS en/hi/fr/ar)
  • E2EOmnilingualMLXTests — 7 (load, warmup, English, 40 s cap, FLEURS en/hi/fr/ar) — covers Wav2Vec2 frontend + encoder + CTC head + weight loader on real 300M-4bit weights
  • E2EOmnilingualMLX1BTests — 1 (loads real 549 MB 1B-4bit weights, exercises the larger encoder code path)

Test resources: 4 small FLEURS fixtures (fleurs_{en,hi,fr,ar}.wav ~230-300 KB each).

Test plan

  • swift build — green (all targets, all dependencies)
  • swift test --skip E2E — 770/770 unit tests pass (+17 vs baseline for the new SentencePiece + Wav2Vec2 shape tests)
  • swift test --filter OmnilingualASRTests — 26/26 pass (17 unit + 9 E2E on real 300M CoreML weights, ~5 s wall clock)
  • swift test --filter E2EOmnilingualMLXTests — 7/7 pass (on real 300M-4bit MLX weights, ~5 s)
  • swift test --filter E2EOmnilingualMLX1BTests — 1/1 pass (on real 1B-4bit MLX weights, ~56 s including 549 MB download)
  • swift test --filter TranscribeCommandTests — 22/22 pass (11 new parser tests)
  • Full swift test sweep — 949 executed, 917 passed, 31 conditionally skipped (XCTSkip for missing models / macOS-only), 1 pre-existing unrelated failure (E2EQwen35CoreMLConversionTests.testPyTorchDecoderProducesHello hard-codes scripts/convert_qwen35_chat_coreml.py which was never committed to the repo; failing long before this branch, tracked separately)
  • Multilingual transcripts verified on real weights for EN / HI / FR / AR on both CoreML and MLX backends
  • Qwen3-TTS Talker batch test passes after the -1 batch-dim fix in SDPA.mergeHeads
  • Docs sync'd: English README + 9 translations, docs/models/, docs/inference/, docs/shared-protocols.md, CLAUDE.md / AGENTS.md

Out of scope (follow-up tickets)

  • 3B and 7B MLX variants spot-checked via the same code path as 300M and 1B, but no dedicated E2E tests (weights are 1.7 GB and 3.55 GB — manual verification only until CI has the disk for it)
  • Streaming / chunked dictation flow on top of the 5 s CoreML window (reference pipeline does not support streaming)
  • The pre-existing testPyTorchDecoderProducesHello failure — tracked separately; unrelated to this PR

Phase 1: CoreML 300M end-to-end
- Sources/OmnilingualASR/: config, SentencePiece vocab, CTC greedy decoder,
  CoreML model loader, fixed-window inference (5s/10s), waveform layer_norm
  matching fairseq2 apply_audio_normalization, 40s utterance hard cap matching
  reference MAX_ALLOWED_AUDIO_SEC
- Layer-norm fix: normalize raw audio chunk before zero padding so sub-window
  inputs match reference statistics (previous version normalized post-pad which
  skewed mean/variance with silence)
- 17 unit tests (config, layer norm, CTC decoder, vocab) + 9 E2E tests
  (multilingual FLEURS clips for en/hi/fr/ar, warmup, chunking, 40s cap)
- Hindi and Arabic E2E pass with correct script output
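The CTC greedy decoder follows the standard recipe: per-frame argmax, collapse repeats, drop blanks. A minimal Python sketch (blank id 0 assumed here for illustration):

```python
def ctc_greedy_decode(logits, blank_id: int = 0):
    """logits: [T, V] per-frame scores -> collapsed token id sequence."""
    tokens, prev = [], None
    for frame in logits:
        best = max(range(len(frame)), key=frame.__getitem__)  # per-frame argmax
        if best != prev and best != blank_id:  # collapse repeats, drop blanks
            tokens.append(best)
        prev = best
    return tokens
```

A blank between two identical labels keeps both (frames 1,1,blank,1 decode to two tokens), which is why the repeat-collapse must compare against the previous frame's argmax, not the previous emitted token.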

CLI
- audio transcribe --engine omnilingual [--window 5|10]
- 5 new parser tests in TranscribeCommandTests

Docs
- docs/models/omnilingual-asr.md: architecture, variants, **full 1,672
  language list** with names + country hints (resolved via CLDR / pycountry)
- docs/inference/omnilingual-asr-inference.md: pipeline, chunking, VAD note
- README.md + 9 translations: feature bullet, SPM products, models table
  rows for 300M / 1B / 3B / 7B variants, memory table, Supported Languages row

Package
- New OmnilingualASR library + target with MLX deps wired for follow-up
- Test target with 4 FLEURS resources

Reference: facebookresearch/omnilingual-asr (Apache 2.0), arXiv:2511.09690

Phase 2 (MLX 1B/3B/7B backend, larger variants) tracked as follow-up — only
the 300M CoreML INT8 path is wired in this PR.

Native MLX/Metal implementation of Meta's Omnilingual CTC model. Loads any of
the 8 published MLX repos (300M / 1B / 3B / 7B in 4-bit and 8-bit) directly
from HuggingFace and runs inference on Apple GPUs.

Sources/OmnilingualASR/MLX/
- OmnilingualMLXConfig: variant table (300M=24L/1024d/16h, 1B=48L/1280d/20h,
  3B=60L/2048d/32h, 7B=128L/2048d/32h), groupSize=64, bits=4|8
- Wav2Vec2Frontend: 7-layer feature extractor (kernels [10,3,3,3,3,2,2],
  strides [5,2,2,2,2,2,2], 320× downsample → 50 Hz frames), per-channel
  LayerNorm + GELU, post_extract_layer_norm, model_dim_proj, weight-normed
  grouped Conv1d positional encoder (kernel=128, groups=16) with even-kernel
  trim and residual
- Wav2Vec2EncoderLayer: pre-norm self-attention + 2-linear FFN, all 4 attn
  projections + both FFN linears are QuantizedLinear with separate per-group
  scales/biases plus a regular linear bias (matches fairseq2 weight names
  exactly: self_attn.{q,k,v,output}_proj, ffn.{inner,output}_proj)
- Wav2Vec2Encoder: stack of N pre-norm layers + final encoder.layer_norm
- CTCHead: QuantizedLinear final_proj → 10288 logits
- OmnilingualMLXWeightLoader: loads model.safetensors, fuses PyTorch
  weight_norm(dim=2) for the position encoder (W = g * v / ||v||_per_kernel),
  applies all quantized + dense + norm tensors via MLXCommon helpers
- OmnilingualASRMLXModel: top-level loader, fromPretrained with HF download,
  variant + bits auto-detection from model id, transcribeAudio with reference
  utterance-level layer_norm preprocessing and 40 s hard cap
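The weight_norm fusion in the loader, sketched in NumPy (assuming PyTorch's convention that dim=2 is the kept dimension, so the norm runs over the channel dims per kernel position):

```python
import numpy as np

def fuse_weight_norm_dim2(g: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Fuse PyTorch weight_norm(dim=2) into a single Conv1d weight.

    v: direction tensor [out_channels, in_per_group, kernel]
    g: per-kernel-position magnitudes, shape [kernel]
    Returns W = g * v / ||v||, normed over dims 0 and 1 per kernel index.
    """
    norm = np.sqrt((v ** 2).sum(axis=(0, 1), keepdims=True))  # [1, 1, kernel]
    return g.reshape(1, 1, -1) * v / norm
```

After fusion, each kernel-position slice W[:, :, k] has norm exactly g[k], so the separate g/v tensors never need to exist at inference time.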

CLI
- audio transcribe --engine omnilingual --backend mlx [--variant 300M|1B|3B|7B] [--bits 4|8]
- 6 new parser tests covering backend/variant/bits validation

Tests
- 9 unit tests (6 config variants + 3 frontend shape math): all pass
- 7 E2E tests on 300M-4bit: load, warmup, English transcript, 40s cap rejection,
  Arabic + Hindi + French FLEURS clips. All pass; transcripts essentially
  identical to the CoreML 300M model. EN: "can you guarantee that the
  replacement part will be shiped tomorrow"; AR: "كما أثنى الزملاء المصارع على
  لونا"; HI Devanagari output verified.
- 1 E2E test for 1B-4bit: loads 549 MB, runs inference, produces recognizable
  English transcript — confirms variant detection + larger encoder code path
- Full unit regression sweep: 763 tests pass (+20 vs baseline), 0 failures

Docs
- docs/models/omnilingual-asr.md: marked all 10 published variants as wired
  through this module (CoreML 5/10s + MLX 300M/1B/3B/7B in 4/8 bit)
- docs/inference/omnilingual-asr-inference.md: added MLX quick-start, variant
  selection, CLI examples

Reference: facebookresearch/omnilingual-asr (Apache 2.0), arXiv:2511.09690.
Ported on top of #199 (CoreML backend).
Two orthogonal refactors that consolidate code duplicated across the model
modules into `AudioCommon` and `MLXCommon` respectively.

SentencePieceModel (AudioCommon)
- New `SentencePieceModel.swift` — protobuf-format `.model` reader exposing
  a `pieces: [Piece]` array with `(text, score, type)` per entry plus a
  `PieceType` enum (NORMAL/UNKNOWN/CONTROL/...) and an `isControlOrUnknown`
  helper. Replaces two independent near-identical protobuf readers.
- `OmnilingualASR/SentencePieceVocabulary.swift` now wraps it (175 → 62 lines);
  `PersonaPlex/SentencePieceDecoder.swift` now wraps it (179 → 85 lines) and
  keeps the greedy unigram encoder used for system-prompt tokenisation.
- 7 unit tests in `Tests/AudioCommonTests/SentencePieceModelTests` cover
  field decoding, all piece types, the control-or-unknown helper, subscript
  bounds, empty model errors, `Data` init, and skipping unknown top-level
  fields. Existing 5 `OmnilingualVocabularyTests` in `Tests/OmnilingualASRTests`
  pass unchanged via the preserved public API.

SDPA (MLXCommon)
- New `SDPA.swift` helpers:
  * `multiHead(q:k:v:numHeads:headDim:scale:mask:)` — flat `[B, T, H*D]`
    inputs, runs reshape → SDPA → reshape. Supports GQA via a second
    overload with separate query and KV head counts.
  * `attendAndMerge(qHeads:kHeads:vHeads:scale:mask:)` — already-shaped
    `[B, H, T, D]` inputs (for RoPE / KV-cache paths). Two overloads for
    `MLXArray?` and `MLXFast.ScaledDotProductAttentionMaskMode`.
  * `mergeHeads(_:)` — low-level `[B, H, T, D] → [B, T, H*D]` merge.
- All reshape calls use `-1` for the batch dimension so the helpers compose
  with `MLX.compile(shapeless:)` graphs that vary batch at runtime. An
  earlier iteration that baked `attn.dim(0)` as a Swift constant broke
  Qwen3-TTS Talker batch inference (caught by `testBatchWithDefaultInstruct`,
  fix verified E2E).
- Migrated 12 attention classes to use the helpers, eliminating duplicated
  reshape+SDPA+reshape boilerplate without forcing a unified `Attention`
  class (the variation in RoPE/KV-cache/GQA/quantisation made the per-site
  projections easier to keep local):
  * Qwen3ASR — AudioSelfAttention, FloatTextAttention, QuantizedTextAttention
  * Qwen3TTS — TalkerAttention, CodePredictorAttention, DecoderTransformerAttention
  * Qwen3Chat — GatedAttentionLayer (DeltaNet hybrid)
  * CosyVoiceTTS — CosyVoiceAttention
  * PersonaPlex — MimiAttention, DepformerAttention, TemporalAttention
  * OmnilingualASR — Wav2Vec2SelfAttention

Docs
- `docs/shared-protocols.md`: added OmnilingualASR (both backends) to the
  SpeechRecognitionModel conformer list and to the not-thread-safe list,
  added the MLXCommon module section with SDPA + QuantizedMLP +
  PreQuantizedEmbedding, added the OmnilingualASR module tree with both
  backends, updated the dependency diagram to include OmnilingualASR +
  ParakeetASR + ParakeetStreamingASR and the new MLXCommon node. Also
  corrected stale entries claiming QuantizedMLP / PreQuantizedEmbedding
  lived in AudioCommon (they're in MLXCommon).
- `CLAUDE.md` / `AGENTS.md`: added `OmnilingualASR` and `ParakeetStreamingASR`
  to the project structure, noted the new `SDPA` helper in MLXCommon and
  `SentencePieceModel` in AudioCommon.

Net diff: 14 files changed, +186/-350 lines, plus 3 new source files
(SentencePieceModel.swift 177 lines, SDPA.swift 94 lines,
SentencePieceModelTests.swift 182 lines).

Verification
- `swift build` — green.
- `swift test --skip E2E` — 770/770 unit tests pass (+17 vs baseline:
  7 new `SentencePieceModelTests` + existing 12 Omnilingual-related
  already counted against this branch's first commit).
- Full `swift test` sweep: 949 executed, 917 passed, 31 conditionally
  skipped (XCTSkip for missing models / macOS-only), 1 pre-existing
  failure (`E2EQwen35CoreMLConversionTests.testPyTorchDecoderProducesHello`
  hard-codes `scripts/convert_qwen35_chat_coreml.py` which was never
  committed to the repo — unrelated to this refactor).
- Spot E2E after refactor: Qwen3-TTS Talker batch E2E, Qwen3-TTS
  CustomVoice + CosyVoice TTS (8/8), Omnilingual MLX 300M and 1B, full
  multilingual FLEURS transcripts (EN/AR/HI/FR) — all pass.
Closes the one coverage gap in the SDPA refactor: the 4-bit and 8-bit
forced-aligner variants both go through `QuantizedTextAttention`, leaving
`FloatTextAttention` (used by the bf16 checkpoint `aufklarer/Qwen3-ForcedAligner-0.6B-bf16`)
only compile-checked after the `SDPA.attendAndMerge` migration.

The new `testForcedAlignerE2EBf16Variant` loads the real bf16 model, asserts
the decoder is `FloatTextModel` (proving we're on the non-quantised path),
and runs end-to-end alignment on the same `test_audio.wav` fixture used by
the existing 4-bit E2E test. Passes in 3.3 s after download.

With this test, all 12 migrated attention classes across Qwen3ASR, Qwen3TTS,
Qwen3Chat, CosyVoice, PersonaPlex, and OmnilingualASR are runtime-verified
on real weights end-to-end.
The test `testPyTorchDecoderProducesHello` imports and runs
`scripts/convert_qwen35_chat_coreml.py` via Python, but that script was
moved out of speech-swift in 274420d ("Remove conversion scripts — moved
to soniqo/speech-model") and a2ccbf7 ("Remove conversion and benchmark
scripts"). The test itself was never deleted in that cleanup, so it has
been failing silently against a missing file in every full E2E run since.

The verification properly belongs in the soniqo/speech-model repo
alongside the Python script — testing conversion tooling from a repo that
no longer ships the tools is a dangling dependency. Deleting rather than
skipping because:

- The XCTSkip path still requires the script to run, and its absence from
  this repo is permanent by design, not a transient missing-file
- Keeping a no-op skip in CI is just dead code that future readers will
  wonder about
- The actual assertion (PyTorch decoder top token = 9419 for a specific
  input) is verification *of* the conversion script, not *of* the Swift
  Qwen3Chat module — the Swift module has its own E2E coverage via
  `E2EQwen35MLXChatTests` and `E2EQwen35CoreMLChatTests`

With this removed, the full `swift test` sweep runs clean: 948 passed,
31 conditionally skipped, 0 failures.
@ivan-digital ivan-digital merged commit 03920e1 into main Apr 11, 2026
1 check passed
@ivan-digital ivan-digital deleted the feat/omnilingual-asr branch April 11, 2026 15:51