Add OmnilingualASR — 1,672 languages on CoreML + MLX, with shared SentencePiece/SDPA extract #201
Merged
ivan-digital merged 5 commits into main on Apr 11, 2026
Conversation
Phase 1: CoreML 300M end-to-end
- Sources/OmnilingualASR/: config, SentencePiece vocab, CTC greedy decoder, CoreML model loader, fixed-window inference (5 s / 10 s), waveform layer_norm matching fairseq2 apply_audio_normalization, 40 s utterance hard cap matching the reference MAX_ALLOWED_AUDIO_SEC
- Layer-norm fix: normalize the raw audio chunk before zero padding so sub-window inputs match reference statistics (the previous version normalized post-pad, which skewed the mean/variance with silence)
- 17 unit tests (config, layer norm, CTC decoder, vocab) + 9 E2E tests (multilingual FLEURS clips for en/hi/fr/ar, warmup, chunking, 40 s cap)
- Hindi and Arabic E2E pass with correct script output
CLI
- audio transcribe --engine omnilingual [--window 5|10]
- 5 new parser tests in TranscribeCommandTests
Docs
- docs/models/omnilingual-asr.md: architecture, variants, **full 1,672 language list** with names + country hints (resolved via CLDR / pycountry)
- docs/inference/omnilingual-asr-inference.md: pipeline, chunking, VAD note
- README.md + 9 translations: feature bullet, SPM products, models table rows for 300M / 1B / 3B / 7B variants, memory table, Supported Languages row
Package
- New OmnilingualASR library + target with MLX deps wired for follow-up
- Test target with 4 FLEURS resources
Reference: facebookresearch/omnilingual-asr (Apache 2.0), arXiv:2511.09690
Phase 2 (MLX 1B/3B/7B backend, larger variants) tracked as follow-up — only the 300M CoreML INT8 path is wired in this PR.
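The layer-norm fix above is the key numerical detail: statistics must be computed on the raw chunk, not on the zero-padded window. A minimal plain-Swift sketch of the idea (function name and epsilon are illustrative, not the module's API):

```swift
// Standardise the raw chunk FIRST, then zero-pad to the fixed window.
// Padding before normalisation would pull the mean toward zero and
// shrink the variance, skewing sub-window inputs (the bug this fixes).
func normalizeThenPad(_ samples: [Float], to length: Int, eps: Float = 1e-5) -> [Float] {
    let n = Float(samples.count)
    let mean = samples.reduce(0, +) / n
    let variance = samples.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / n
    let std = (variance + eps).squareRoot()
    var out = samples.map { ($0 - mean) / std }
    out.append(contentsOf: Array(repeating: 0, count: max(0, length - out.count)))
    return out
}
```

With a 5 s window at 16 kHz, `length` would be 80,000 samples; the padded zeros never contaminate the chunk's mean and variance.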
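For context, CTC greedy decoding reduces per-frame logits to tokens by taking the argmax of each frame, collapsing adjacent repeats, and dropping blanks. A self-contained sketch (blank index and array shapes are illustrative):

```swift
// Greedy CTC: argmax per frame, collapse repeated symbols, drop blanks.
// logits is [frames][vocab]; returns the collapsed token-id sequence.
func ctcGreedyDecode(logits: [[Float]], blank: Int = 0) -> [Int] {
    var out: [Int] = []
    var prev = -1
    for frame in logits {
        // argmax over the vocabulary for this frame
        let best = frame.indices.max(by: { frame[$0] < frame[$1] })!
        if best != blank && best != prev { out.append(best) }
        prev = best
    }
    return out
}
```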
…ariants
Native MLX/Metal implementation of Meta's Omnilingual CTC model. Loads any of
the 8 published MLX repos (300M / 1B / 3B / 7B in 4-bit and 8-bit) directly
from HuggingFace and runs inference on Apple GPUs.
Sources/OmnilingualASR/MLX/
- OmnilingualMLXConfig: variant table (300M=24L/1024d/16h, 1B=48L/1280d/20h,
3B=60L/2048d/32h, 7B=128L/2048d/32h), groupSize=64, bits=4|8
- Wav2Vec2Frontend: 7-layer feature extractor (kernels [10,3,3,3,3,2,2],
strides [5,2,2,2,2,2,2], 320× downsample → 50 Hz frames), per-channel
LayerNorm + GELU, post_extract_layer_norm, model_dim_proj, weight-normed
grouped Conv1d positional encoder (kernel=128, groups=16) with even-kernel
trim and residual
- Wav2Vec2EncoderLayer: pre-norm self-attention + 2-linear FFN, all 4 attn
projections + both FFN linears are QuantizedLinear with separate per-group
scales/biases plus a regular linear bias (matches fairseq2 weight names
exactly: self_attn.{q,k,v,output}_proj, ffn.{inner,output}_proj)
- Wav2Vec2Encoder: stack of N pre-norm layers + final encoder.layer_norm
- CTCHead: QuantizedLinear final_proj → 10288 logits
- OmnilingualMLXWeightLoader: loads model.safetensors, fuses PyTorch
weight_norm(dim=2) for the position encoder (W = g * v / ||v||_per_kernel),
applies all quantized + dense + norm tensors via MLXCommon helpers
- OmnilingualASRMLXModel: top-level loader, fromPretrained with HF download,
variant + bits auto-detection from model id, transcribeAudio with reference
utterance-level layer_norm preprocessing and 40 s hard cap
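As a sanity check on the frontend geometry, each unpadded Conv1d layer maps length L to (L - kernel) / stride + 1, so the frame count for a given sample count is easy to verify. A sketch using the kernels and strides listed above:

```swift
// Output frame count of the 7-layer CNN feature extractor, assuming
// unpadded convolutions: L -> (L - kernel) / stride + 1 per layer.
// Overall stride is 5*2*2*2*2*2*2 = 320, i.e. ~50 Hz frames at 16 kHz.
func frontendFrameCount(samples: Int) -> Int {
    let kernels = [10, 3, 3, 3, 3, 2, 2]
    let strides = [5, 2, 2, 2, 2, 2, 2]
    var length = samples
    for (k, s) in zip(kernels, strides) {
        length = (length - k) / s + 1
    }
    return length
}
```

One second of 16 kHz audio (16,000 samples) yields 49 frames, slightly under 16000/320 = 50 because of the kernels' edge effects.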
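The weight_norm fusion in the loader collapses PyTorch's (g, v) parametrisation into a single dense tensor. A toy sketch over a flattened per-slice view (the real tensors are 3-D and the norm runs per kernel slice, dim=2):

```swift
// Fuse weight_norm: for each slice v_i, W_i = g_i * v_i / ||v_i||.
// Here each v[i] stands in for one (flattened) kernel slice.
func fuseWeightNorm(g: [Float], v: [[Float]]) -> [[Float]] {
    var fused: [[Float]] = []
    for (i, slice) in v.enumerated() {
        let norm = slice.map { $0 * $0 }.reduce(0, +).squareRoot()
        fused.append(slice.map { g[i] * $0 / norm })
    }
    return fused
}
```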
CLI
- audio transcribe --engine omnilingual --backend mlx [--variant 300M|1B|3B|7B] [--bits 4|8]
- 6 new parser tests covering backend/variant/bits validation
Tests
- 9 unit tests (6 config variants + 3 frontend shape math): all pass
- 7 E2E tests on 300M-4bit: load, warmup, English transcript, 40s cap rejection,
Arabic + Hindi + French FLEURS clips. All pass; transcripts essentially
identical to the CoreML 300M model. EN: "can you guarantee that the
replacement part will be shiped tomorrow"; AR: "كما أثنى الزملاء المصارع على
لونا"; HI Devanagari output verified.
- 1 E2E test for 1B-4bit: loads 549 MB, runs inference, produces recognizable
English transcript — confirms variant detection + larger encoder code path
- Full unit regression sweep: 763 tests pass (+20 vs baseline), 0 failures
Docs
- docs/models/omnilingual-asr.md: marked all 10 published variants as wired
through this module (CoreML 5/10s + MLX 300M/1B/3B/7B in 4/8 bit)
- docs/inference/omnilingual-asr-inference.md: added MLX quick-start, variant
selection, CLI examples
Reference: facebookresearch/omnilingual-asr (Apache 2.0), arXiv:2511.09690.
Ported on top of #199 (CoreML backend).
Two orthogonal refactors that consolidate code duplicated across the model
modules into `AudioCommon` and `MLXCommon` respectively.
SentencePieceModel (AudioCommon)
- New `SentencePieceModel.swift` — protobuf-format `.model` reader exposing
a `pieces: [Piece]` array with `(text, score, type)` per entry plus a
`PieceType` enum (NORMAL/UNKNOWN/CONTROL/...) and an `isControlOrUnknown`
helper. Replaces two independent near-identical protobuf readers.
- `OmnilingualASR/SentencePieceVocabulary.swift` now wraps it (175 → 62 lines);
`PersonaPlex/SentencePieceDecoder.swift` now wraps it (179 → 85 lines) and
keeps the greedy unigram encoder used for system-prompt tokenisation.
- 7 unit tests in `Tests/AudioCommonTests/SentencePieceModelTests` cover
field decoding, all piece types, the control-or-unknown helper, subscript
bounds, empty model errors, `Data` init, and skipping unknown top-level
fields. Existing 5 `OmnilingualVocabularyTests` in `Tests/OmnilingualASRTests`
pass unchanged via the preserved public API.
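A protobuf reader of this kind is mostly varint plumbing; the sketch below shows the core primitive (names and error handling are illustrative, not the `SentencePieceModel` API):

```swift
// Decode one protobuf varint starting at `offset`, advancing the
// offset past the consumed bytes. Returns nil on truncated or
// over-long input.
func readVarint(_ data: [UInt8], at offset: inout Int) -> UInt64? {
    var result: UInt64 = 0
    var shift: UInt64 = 0
    while offset < data.count {
        let byte = data[offset]
        offset += 1
        result |= UInt64(byte & 0x7F) << shift
        if byte & 0x80 == 0 { return result }
        shift += 7
        if shift >= 64 { return nil }  // malformed: varint too long
    }
    return nil  // ran off the end mid-varint
}
```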
SDPA (MLXCommon)
- New `SDPA.swift` helpers:
* `multiHead(q:k:v:numHeads:headDim:scale:mask:)` — flat `[B, T, H*D]`
inputs, runs reshape → SDPA → reshape. Supports GQA via a second
overload with separate query and KV head counts.
* `attendAndMerge(qHeads:kHeads:vHeads:scale:mask:)` — already-shaped
`[B, H, T, D]` inputs (for RoPE / KV-cache paths). Two overloads for
`MLXArray?` and `MLXFast.ScaledDotProductAttentionMaskMode`.
* `mergeHeads(_:)` — low-level `[B, H, T, D] → [B, T, H*D]` merge.
- All reshape calls use `-1` for the batch dimension so the helpers compose
with `MLX.compile(shapeless:)` graphs that vary batch at runtime. An
earlier iteration that baked `attn.dim(0)` as a Swift constant broke
Qwen3-TTS Talker batch inference (caught by `testBatchWithDefaultInstruct`,
fix verified E2E).
- Migrated 12 attention classes to use the helpers, eliminating duplicated
reshape+SDPA+reshape boilerplate without forcing a unified `Attention`
class (the variation in RoPE/KV-cache/GQA/quantisation made the per-site
projections easier to keep local):
* Qwen3ASR — AudioSelfAttention, FloatTextAttention, QuantizedTextAttention
* Qwen3TTS — TalkerAttention, CodePredictorAttention, DecoderTransformerAttention
* Qwen3Chat — GatedAttentionLayer (DeltaNet hybrid)
* CosyVoiceTTS — CosyVoiceAttention
* PersonaPlex — MimiAttention, DepformerAttention, TemporalAttention
* OmnilingualASR — Wav2Vec2SelfAttention
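The shape contract of the merge helper can be illustrated with plain nested arrays. This is only a stand-in for the [B, H, T, D] → [B, T, H*D] merge; the real helper operates on MLXArray reshapes with `-1` for the batch axis, as described above:

```swift
// Merge attention heads: x[b][h][t][d] -> out[b][t][h*D + d].
// Plain-array stand-in for the MLXArray reshape/transpose helper.
func mergeHeads(_ x: [[[[Float]]]]) -> [[[Float]]] {
    x.map { batch -> [[Float]] in
        let heads = batch.count
        let time = batch[0].count
        return (0..<time).map { t in
            (0..<heads).flatMap { h in batch[h][t] }
        }
    }
}
```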
Docs
- `docs/shared-protocols.md`: added OmnilingualASR (both backends) to the
SpeechRecognitionModel conformer list and to the not-thread-safe list,
added the MLXCommon module section with SDPA + QuantizedMLP +
PreQuantizedEmbedding, added the OmnilingualASR module tree with both
backends, updated the dependency diagram to include OmnilingualASR +
ParakeetASR + ParakeetStreamingASR and the new MLXCommon node. Also
corrected stale entries claiming QuantizedMLP / PreQuantizedEmbedding
lived in AudioCommon (they're in MLXCommon).
- `CLAUDE.md` / `AGENTS.md`: added `OmnilingualASR` and `ParakeetStreamingASR`
to the project structure, noted the new `SDPA` helper in MLXCommon and
`SentencePieceModel` in AudioCommon.
Net diff: 14 files changed, +186/-350 lines, plus 3 new source files
(SentencePieceModel.swift 177 lines, SDPA.swift 94 lines,
SentencePieceModelTests.swift 182 lines).
Verification
- `swift build` — green.
- `swift test --skip E2E` — 770/770 unit tests pass (+17 vs baseline:
  7 new `SentencePieceModelTests`; the 12 existing Omnilingual-related
  tests were already counted against this branch's first commit).
- Full `swift test` sweep: 949 executed, 917 passed, 31 conditionally
skipped (XCTSkip for missing models / macOS-only), 1 pre-existing
failure (`E2EQwen35CoreMLConversionTests.testPyTorchDecoderProducesHello`
hard-codes `scripts/convert_qwen35_chat_coreml.py` which was never
committed to the repo — unrelated to this refactor).
- Spot E2E after refactor: Qwen3-TTS Talker batch E2E, Qwen3-TTS
CustomVoice + CosyVoice TTS (8/8), Omnilingual MLX 300M and 1B, full
multilingual FLEURS transcripts (EN/AR/HI/FR) — all pass.
Closes the one coverage gap in the SDPA refactor: the 4-bit and 8-bit forced-aligner variants both go through `QuantizedTextAttention`, leaving `FloatTextAttention` (used by the bf16 checkpoint `aufklarer/Qwen3-ForcedAligner-0.6B-bf16`) only compile-checked after the `SDPA.attendAndMerge` migration.

The new `testForcedAlignerE2EBf16Variant` loads the real bf16 model, asserts the decoder is `FloatTextModel` (proving we're on the non-quantised path), and runs end-to-end alignment on the same `test_audio.wav` fixture used by the existing 4-bit E2E test. Passes in 3.3 s after download.

With this test, all 12 migrated attention classes across Qwen3ASR, Qwen3TTS, Qwen3Chat, CosyVoice, PersonaPlex, and OmnilingualASR are runtime-verified on real weights end-to-end.
The test `testPyTorchDecoderProducesHello` imports and runs `scripts/convert_qwen35_chat_coreml.py` via Python, but that script was moved out of speech-swift in 274420d ("Remove conversion scripts — moved to soniqo/speech-model") and a2ccbf7 ("Remove conversion and benchmark scripts"). The test itself was never deleted in that cleanup, so it has been failing silently against a missing file in every full E2E run since. The verification properly belongs in the soniqo/speech-model repo alongside the Python script — testing conversion tooling from a repo that no longer ships the tools is a dangling dependency.

Deleting rather than skipping because:
- The XCTSkip path still requires the script to run, and its absence from this repo is permanent by design, not a transient missing file
- Keeping a no-op skip in CI is just dead code that future readers will wonder about
- The actual assertion (PyTorch decoder top token = 9419 for a specific input) is verification *of* the conversion script, not *of* the Swift Qwen3Chat module — the Swift module has its own E2E coverage via `E2EQwen35MLXChatTests` and `E2EQwen35CoreMLChatTests`

With this removed, the full `swift test` sweep runs clean: 948 passed, 31 conditionally skipped, 0 failures.
Summary
Single PR consolidating Meta's Omnilingual ASR port for Apple Silicon: CoreML 300M, MLX 300M/1B/3B/7B, plus two orthogonal refactors that extract duplicated building blocks into shared modules. Reference: arXiv 2511.09690, Apache 2.0.
Closes #195. Supersedes #199 and #200 (both will be closed with a pointer here).
What ships
1. OmnilingualASR CoreML (300M) — `OmnilingualASRModel`
- `Sources/OmnilingualASR/`: `Configuration`, `SentencePieceVocabulary`, `CTCGreedyDecoder`, CoreML model loader with 5 s and 10 s fixed-window variants (`aufklarer/Omnilingual-ASR-CTC-300M-CoreML-INT8{,-10s}`)
- `layer_norm` preprocessing matching fairseq2's `apply_audio_normalization` — normalises the raw waveform before zero-padding so sub-window inputs match the reference pipeline statistics
- 40 s hard cap matching the reference `MAX_ALLOWED_AUDIO_SEC`; clear error pointing to `SpeechVAD` for longer audio

2. OmnilingualASR MLX (300M / 1B / 3B / 7B) — `OmnilingualASRMLXModel`
- `Sources/OmnilingualASR/MLX/`: fairseq2-compatible wav2vec2 port from scratch
- `Wav2Vec2Frontend` — 7-layer CNN feature extractor (kernels `[10,3,3,3,3,2,2]`, strides `[5,2,2,2,2,2,2]` → 50 Hz frames), per-channel LayerNorm + GELU, `post_extract_layer_norm`, `model_dim_proj`, weight-normed grouped `Conv1d` positional encoder (kernel 128, groups 16) with even-kernel trim and residual
- `Wav2Vec2EncoderLayer` — pre-norm self-attention + 2-linear FFN, quantised projections (`QuantizedLinear` group size 64, 4 or 8 bits)
- `Wav2Vec2Encoder` — stack of N layers + final `encoder.layer_norm`
- `CTCHead` — quantised `final_proj → 10288`
- `OmnilingualMLXWeightLoader` — loads `model.safetensors`, fuses PyTorch `weight_norm(dim=2)` for the position encoder (W = g · v / ‖v‖_per_kernel), applies all tensors via `MLXCommon` helpers
- `OmnilingualMLXConfig` — variant table (300M 24L/1024d/16h, 1B 48L/1280d/20h, 3B 60L/2048d/32h, 7B 128L/2048d/32h), auto-detect from HF model id

3. Shared extractions
- `AudioCommon/SentencePieceModel.swift` — protobuf reader for `.model` files, exposing `pieces: [Piece]` with `(text, score, type)` and a `PieceType` enum. Eliminates two independent near-identical protobuf readers:
  * `OmnilingualASR/SentencePieceVocabulary.swift` now wraps it (175 → 62 lines)
  * `PersonaPlex/SentencePieceDecoder.swift` now wraps it (179 → 85 lines) and keeps the greedy unigram encoder
- `MLXCommon/SDPA.swift` — `multiHead` / `attendAndMerge` / `mergeHeads` helpers. All reshape calls use `-1` for the batch dimension so the helpers compose with `MLX.compile(shapeless:)` graphs that vary batch at runtime.
- 12 attention classes migrated to use `SDPA.*` (Qwen3ASR audio + float text + quantised text, Qwen3TTS Talker + CodePredictor + DecoderTransformer, Qwen3Chat GatedAttentionLayer, CosyVoice LLM, PersonaPlex Mimi + Depformer + Temporal, OmnilingualASR Wav2Vec2). Eliminates ~80 lines of duplicated reshape+SDPA+reshape boilerplate.

Caught and fixed during refactor
The first `mergeHeads` implementation used `attn.dim(0)` to size the batch dim, which baked it in as a Swift constant and broke Qwen3-TTS `compile(shapeless:)` graphs that vary batch at runtime: the Talker's `testBatchWithDefaultInstruct` fails with `Cannot reshape array of size 4096 into shape (1,1,2048)`. Fix: use `-1` for batch throughout. Verified end-to-end against Talker batch + CustomVoice + CosyVoice + Omnilingual MLX after the fix.

CLI
Docs
- `docs/models/omnilingual-asr.md` — architecture, all 10 published variants (CoreML 5/10 s + MLX 300M/1B/3B/7B × 4/8 bit), and the full 1,672-language list with English names + country hints (resolved via CLDR / pycountry, grouped by 32 scripts, in a collapsible `<details>` block)
- `docs/inference/omnilingual-asr-inference.md` — pipeline, layer-norm rationale, 40 s cap rationale, CoreML + MLX quick-starts, CLI examples, multilingual notes, perf numbers
- `docs/shared-protocols.md` — added OmnilingualASR (both backends) to the SpeechRecognitionModel conformers + not-thread-safe list; added the `MLXCommon` module section (SDPA, QuantizedMLP, PreQuantizedEmbedding); added the OmnilingualASR module tree; corrected stale entries about `QuantizedMLP` / `PreQuantizedEmbedding` location (MLXCommon, not AudioCommon)
- `CLAUDE.md` / `AGENTS.md` — added OmnilingualASR + ParakeetStreamingASR to the project structure; noted SDPA + SentencePieceModel in their respective common modules
- `README.md` + all 9 translations (zh/ja/ko/es/de/fr/hi/pt/ru) — feature bullet, SPM products, 4 model rows (300M/1B/3B/7B in 4 and 8 bit, CoreML 5 s and 10 s), memory table, and a Supported Languages row linking the canonical `lang_ids.py`

Verified output
CoreML 300M (reference-correct after layer-norm fix):
- EN: can you guarantee that the replacement part will be shipped tomorrow
- AR: كما أثنى الزملاء المصارعون على لونا (reference match)
- HI: लूना को साथी पहलवानों ने भी सरधांजलीदी
- FR: pense à létineraire desqui comme un étineraire de rent donner similaire

MLX 300M-4bit (matches CoreML modulo one-char differences):
- EN: can you guarantee that the replacement part will be shiped tomorrow
- AR: كما أثنى الزملاء المصارع على لونا
- HI: लूना को साथी पहलवानों ने भी सरधांजली दी
- FR: pensez à létineraire deski comme unétineraire de rendonner similaire

MLX 1B-4bit sanity check:
- EN: can you guarantee that the replacement par o be shipe tomorrow (slightly more quant degradation, expected)

Tests
New unit tests (~26 added):
- `OmnilingualASRTests/OmnilingualASRTests` — 17 unit (config, layer-norm, CTC decoder, vocabulary)
- `OmnilingualASRTests/OmnilingualMLXTests` — 9 unit (variant table, frontend output-length math)
- `AudioCommonTests/SentencePieceModelTests` — 7 unit (protobuf decode, type enum, helpers, subscript bounds, empty/unknown fields)
- `AudioCLITests/TranscribeCommandTests` — 11 new CLI parser tests for `--engine omnilingual`, `--window`, `--backend`, `--variant`, `--bits`

New E2E tests (`E2E…` prefix, skipped in CI):
- `E2EOmnilingualASRTests` — 9 (English real audio, warmup, unload, chunking, 40 s cap, FLEURS en/hi/fr/ar)
- `E2EOmnilingualMLXTests` — 7 (load, warmup, English, 40 s cap, FLEURS en/hi/fr/ar) — covers Wav2Vec2 frontend + encoder + CTC head + weight loader on real 300M-4bit weights
- `E2EOmnilingualMLX1BTests` — 1 (loads real 549 MB 1B-4bit weights, exercises the larger encoder code path)

Test resources: 4 small FLEURS fixtures (`fleurs_{en,hi,fr,ar}.wav`, ~230-300 KB each).

Test plan
- `swift build` — green (all targets, all dependencies)
- `swift test --skip E2E` — 770/770 unit tests pass (+17 vs baseline for the new SentencePiece + Wav2Vec2 shape tests)
- `swift test --filter OmnilingualASRTests` — 26/26 pass (17 unit + 9 E2E on real 300M CoreML weights, ~5 s wall clock)
- `swift test --filter E2EOmnilingualMLXTests` — 7/7 pass (on real 300M-4bit MLX weights, ~5 s)
- `swift test --filter E2EOmnilingualMLX1BTests` — 1/1 pass (on real 1B-4bit MLX weights, ~56 s including the 549 MB download)
- `swift test --filter TranscribeCommandTests` — 22/22 pass (11 new parser tests)
- Full `swift test` sweep — 949 executed, 917 passed, 31 conditionally skipped (`XCTSkip` for missing models / macOS-only), 1 pre-existing unrelated failure (`E2EQwen35CoreMLConversionTests.testPyTorchDecoderProducesHello` hard-codes `scripts/convert_qwen35_chat_coreml.py`, which was never committed to the repo; failing long before this branch, tracked separately)
- `-1` batch-dim fix in `SDPA.mergeHeads`

Out of scope (follow-up tickets)
- `testPyTorchDecoderProducesHello` failure — tracked separately; unrelated to this PR