
Add OmnilingualASR — 1,672 languages on CoreML + MLX, with shared SentencePiece/SDPA extract #201

Merged
ivan-digital merged 5 commits into main from feat/omnilingual-asr
Apr 11, 2026

Conversation

@ivan-digital
Collaborator

Summary

Single PR consolidating Meta's Omnilingual ASR port for Apple Silicon: CoreML 300M, MLX 300M/1B/3B/7B, plus two orthogonal refactors that extract duplicated building blocks into shared modules. Reference: arXiv 2511.09690, Apache 2.0.

Closes #195. Supersedes #199 and #200 (both will be closed with a pointer here).

What ships

1. OmnilingualASR CoreML (300M) — OmnilingualASRModel

  • Sources/OmnilingualASR/: Configuration, SentencePieceVocabulary, CTCGreedyDecoder, CoreML model loader with 5 s and 10 s fixed-window variants (aufklarer/Omnilingual-ASR-CTC-300M-CoreML-INT8{,-10s})
  • Utterance-level layer_norm preprocessing matching fairseq2's apply_audio_normalization — normalises the raw waveform before zero-padding so sub-window inputs match the reference pipeline statistics
  • 40 s hard cap matching Meta's MAX_ALLOWED_AUDIO_SEC; clear error pointing to SpeechVAD for longer audio
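The ordering is the whole point of the layer-norm fix: zero-padding first would fold silence into the mean/variance. An illustrative Python sketch of the order used (not the Swift implementation itself):

```python
import numpy as np

def prepare_window(waveform: np.ndarray, window_samples: int) -> np.ndarray:
    """Utterance-level layer norm, then zero-pad to the fixed window size."""
    # Normalise the raw samples first, so the statistics come from speech only.
    mean = waveform.mean()
    std = waveform.std()
    normalized = (waveform - mean) / (std + 1e-5)
    # Zero-pad afterwards; the padding no longer skews mean/variance.
    padded = np.zeros(window_samples, dtype=normalized.dtype)
    padded[: len(normalized)] = normalized
    return padded
```

Normalising after padding instead would pull the mean toward zero and shrink the variance in proportion to the padded fraction, which is exactly the mismatch the fix removes.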

2. OmnilingualASR MLX (300M / 1B / 3B / 7B) — OmnilingualASRMLXModel

  • Sources/OmnilingualASR/MLX/: fairseq2-compatible wav2vec2 port from scratch
    • Wav2Vec2Frontend — 7-layer CNN feature extractor (kernels [10,3,3,3,3,2,2], strides [5,2,2,2,2,2,2] → 50 Hz frames), per-channel LayerNorm + GELU, post_extract_layer_norm, model_dim_proj, weight-normed grouped Conv1d positional encoder (kernel 128, groups 16) with even-kernel trim and residual
    • Wav2Vec2EncoderLayer — pre-norm self-attention + 2-linear FFN, quantised projections (QuantizedLinear group size 64, 4 or 8 bits)
    • Wav2Vec2Encoder — stack of N layers + final encoder.layer_norm
    • CTCHead — quantised final_proj → 10288
    • OmnilingualMLXWeightLoader — loads model.safetensors, fuses PyTorch weight_norm(dim=2) for the position encoder (W = g · v / ‖v‖_per_kernel), applies all tensors via MLXCommon helpers
    • OmnilingualMLXConfig — variant table (300M 24L/1024d/16h, 1B 48L/1280d/20h, 3B 60L/2048d/32h, 7B 128L/2048d/32h), auto-detect from HF model id
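The 50 Hz frame rate falls out of the standard valid-convolution length formula, ⌊(L − k)/s⌋ + 1 per layer, with the strides multiplying to 5·2·2·2·2·2·2 = 320. A quick Python check using the kernel/stride table above:

```python
def frontend_output_length(num_samples: int) -> int:
    """Frames produced by the 7-layer CNN feature extractor (no padding)."""
    kernels = [10, 3, 3, 3, 3, 2, 2]
    strides = [5, 2, 2, 2, 2, 2, 2]
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1  # valid conv output length
    return length
```

One second of 16 kHz audio (16,000 samples) yields 49 frames, i.e. roughly 50 Hz.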

3. Shared extractions

AudioCommon/SentencePieceModel.swift — protobuf reader for .model files, exposing pieces: [Piece] with (text, score, type) and a PieceType enum. Eliminates two independent near-identical protobuf readers:

  • OmnilingualASR/SentencePieceVocabulary.swift now wraps it (175 → 62 lines)
  • PersonaPlex/SentencePieceDecoder.swift now wraps it (179 → 85 lines) and keeps the greedy unigram encoder
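For orientation, the `.model` wire format is ordinary protobuf: `ModelProto` field 1 is the repeated `SentencePiece` message, whose fields are piece text (1, string), score (2, float32) and type (3, enum). A minimal Python reader for just those fields — an illustrative sketch, not the shared Swift module:

```python
import struct

def read_varint(data: bytes, pos: int):
    """Decode a protobuf varint; return (value, new position)."""
    result = shift = 0
    while True:
        b = data[pos]; pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def read_pieces(data: bytes):
    """Yield (text, score, type) tuples from a SentencePiece .model blob."""
    pieces, pos = [], 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field, wire = tag >> 3, tag & 7
        if field == 1 and wire == 2:           # repeated SentencePiece message
            size, pos = read_varint(data, pos)
            sub, end = data[pos:pos + size], pos + size
            text, score, ptype, p = "", 0.0, 1, 0
            while p < len(sub):
                t, p = read_varint(sub, p)
                f, w = t >> 3, t & 7
                if f == 1 and w == 2:          # piece text
                    n, p = read_varint(sub, p)
                    text = sub[p:p + n].decode("utf-8"); p += n
                elif f == 2 and w == 5:        # score (little-endian float32)
                    score = struct.unpack("<f", sub[p:p + 4])[0]; p += 4
                elif f == 3 and w == 0:        # type enum
                    ptype, p = read_varint(sub, p)
                elif w == 0:                   # skip unknown varint field
                    _, p = read_varint(sub, p)
                else:
                    break
            pieces.append((text, score, ptype))
            pos = end
        elif wire == 0:                        # skip unknown varint field
            _, pos = read_varint(data, pos)
        elif wire == 2:                        # skip unknown length-delimited field
            n, pos = read_varint(data, pos)
            pos += n
        else:
            break
    return pieces
```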

MLXCommon/SDPA.swift — multiHead / attendAndMerge / mergeHeads helpers. All reshape calls use -1 for the batch dimension so the helpers compose with MLX.compile(shapeless:) graphs that vary batch at runtime.
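The boilerplate these helpers absorb is pure shape plumbing. A NumPy sketch of the multiHead/mergeHeads round-trip (plain softmax attention stands in for the MLXFast SDPA kernel; note the -1 batch dimension in both reshapes):

```python
import numpy as np

def multi_head_attention(q, k, v, num_heads: int):
    """[B, T, H*D] -> split heads -> attention -> merge back to [B, T, H*D]."""
    T, HD = q.shape[-2], q.shape[-1]
    D = HD // num_heads

    def split_heads(x):
        # -1 for the batch dim keeps the graph shapeless in batch size
        return x.reshape(-1, x.shape[1], num_heads, D).transpose(0, 2, 1, 3)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
    scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    attn = weights @ vh                                  # [B, H, T, D]
    # mergeHeads: [B, H, T, D] -> [B, T, H*D]; batch sized with -1, not dim(0)
    return attn.transpose(0, 2, 1, 3).reshape(-1, T, HD)
```

Baking `attn.shape[0]` into the final reshape instead of -1 would freeze the batch size at trace time, which is the compile(shapeless:) failure mode described below.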

12 attention classes migrated to use SDPA.* (Qwen3ASR audio + float text + quantised text, Qwen3TTS Talker + CodePredictor + DecoderTransformer, Qwen3Chat GatedAttentionLayer, CosyVoice LLM, PersonaPlex Mimi + Depformer + Temporal, OmnilingualASR Wav2Vec2). Eliminates ~80 lines of duplicated reshape+SDPA+reshape boilerplate.

Caught and fixed during refactor

The first mergeHeads implementation used attn.dim(0) to size the batch dimension, which baked it in as a Swift constant and broke Qwen3-TTS compile(shapeless:) graphs that vary batch at runtime: the Talker testBatchWithDefaultInstruct test failed with `Cannot reshape array of size 4096 into shape (1,1,2048)`. Fix: use -1 for the batch dimension throughout. Verified end-to-end against Talker batch + CustomVoice + CosyVoice + Omnilingual MLX after the fix.

CLI

# CoreML (default, ANE)
audio transcribe recording.wav --engine omnilingual               # 10 s window
audio transcribe recording.wav --engine omnilingual --window 5      # 5 s window

# MLX (GPU/Metal, any input length up to 40 s)
audio transcribe recording.wav --engine omnilingual --backend mlx                              # 300M @ 4-bit
audio transcribe recording.wav --engine omnilingual --backend mlx --variant 1B                  # 1B @ 4-bit
audio transcribe recording.wav --engine omnilingual --backend mlx --variant 3B --bits 8         # 3B @ 8-bit
audio transcribe recording.wav --engine omnilingual --backend mlx --variant 7B                  # 7B @ 4-bit

Docs

  • docs/models/omnilingual-asr.md — architecture, all 10 published variants (CoreML 5/10 s + MLX 300M/1B/3B/7B × 4/8 bit), and the full 1,672-language list with English names + country hints (resolved via CLDR / pycountry, grouped by 32 scripts, in a collapsible <details> block)
  • docs/inference/omnilingual-asr-inference.md — pipeline, layer-norm rationale, 40 s cap rationale, CoreML + MLX quick-starts, CLI examples, multilingual notes, perf numbers
  • docs/shared-protocols.md — added OmnilingualASR (both backends) to the SpeechRecognitionModel conformers + not-thread-safe list; added the MLXCommon module section (SDPA, QuantizedMLP, PreQuantizedEmbedding); added OmnilingualASR module tree; corrected stale entries about QuantizedMLP/PreQuantizedEmbedding location (MLXCommon, not AudioCommon)
  • CLAUDE.md / AGENTS.md — added OmnilingualASR + ParakeetStreamingASR to project structure; noted SDPA + SentencePieceModel in their respective common modules
  • README.md + all 9 translations (zh/ja/ko/es/de/fr/hi/pt/ru) — feature bullet, SPM products, 4 model rows (300M/1B/3B/7B in 4 and 8 bit, CoreML 5 s and 10 s), memory table, and a Supported Languages row linking the canonical lang_ids.py

Verified output

CoreML 300M (reference-correct after layer-norm fix):

| Lang | Transcript |
|------|------------|
| EN | can you guarantee that the replacement part will be shipped tomorrow |
| AR | كما أثنى الزملاء المصارعون على لونا (reference match) |
| HI | लूना को साथी पहलवानों ने भी सरधांजलीदी |
| FR | pense à létineraire desqui comme un étineraire de rent donner similaire |

MLX 300M-4bit (matches CoreML modulo one-char differences):

| Lang | Transcript |
|------|------------|
| EN | can you guarantee that the replacement part will be shiped tomorrow |
| AR | كما أثنى الزملاء المصارع على لونا |
| HI | लूना को साथी पहलवानों ने भी सरधांजली दी |
| FR | pensez à létineraire deski comme unétineraire de rendonner similaire |

MLX 1B-4bit sanity check: can you guarantee that the replacement par o be shipe tomorrow (slightly more quant degradation, expected).

Tests

New unit tests (~26 added):

  • OmnilingualASRTests/OmnilingualASRTests — 17 unit (config, layer-norm, CTC decoder, vocabulary)
  • OmnilingualASRTests/OmnilingualMLXTests — 9 unit (variant table, frontend output length math)
  • AudioCommonTests/SentencePieceModelTests — 7 unit (protobuf decode, type enum, helpers, subscript bounds, empty/unknown fields)
  • AudioCLITests/TranscribeCommandTests — 11 new CLI parser tests for --engine omnilingual, --window, --backend, --variant, --bits

New E2E tests (E2E… prefix, skipped in CI):

  • E2EOmnilingualASRTests — 9 (English real audio, warmup, unload, chunking, 40 s cap, FLEURS en/hi/fr/ar)
  • E2EOmnilingualMLXTests — 7 (load, warmup, English, 40 s cap, FLEURS en/hi/fr/ar) — covers Wav2Vec2 frontend + encoder + CTC head + weight loader on real 300M-4bit weights
  • E2EOmnilingualMLX1BTests — 1 (loads real 549 MB 1B-4bit weights, exercises the larger encoder code path)

Test resources: 4 small FLEURS fixtures (fleurs_{en,hi,fr,ar}.wav ~230-300 KB each).

Test plan

  • swift build — green (all targets, all dependencies)
  • swift test --skip E2E — 770/770 unit tests pass (+17 vs baseline for the new SentencePiece + Wav2Vec2 shape tests)
  • swift test --filter OmnilingualASRTests — 26/26 pass (17 unit + 9 E2E on real 300M CoreML weights, ~5 s wall clock)
  • swift test --filter E2EOmnilingualMLXTests — 7/7 pass (on real 300M-4bit MLX weights, ~5 s)
  • swift test --filter E2EOmnilingualMLX1BTests — 1/1 pass (on real 1B-4bit MLX weights, ~56 s including 549 MB download)
  • swift test --filter TranscribeCommandTests — 22/22 pass (11 new parser tests)
  • Full swift test sweep — 949 executed, 917 passed, 31 conditionally skipped (XCTSkip for missing models / macOS-only), 1 pre-existing unrelated failure (E2EQwen35CoreMLConversionTests.testPyTorchDecoderProducesHello hard-codes scripts/convert_qwen35_chat_coreml.py which was never committed to the repo; failing long before this branch, tracked separately)
  • Multilingual transcripts verified on real weights for EN / HI / FR / AR on both CoreML and MLX backends
  • Qwen3-TTS Talker batch test passes after the -1 batch-dim fix in SDPA.mergeHeads
  • Docs sync'd: English README + 9 translations, docs/models/, docs/inference/, docs/shared-protocols.md, CLAUDE.md / AGENTS.md

Out of scope (follow-up tickets)

  • 3B and 7B MLX variants spot-checked via the same code path as 300M and 1B, but no dedicated E2E tests (weights are 1.7 GB and 3.55 GB — manual verification only until CI has the disk for it)
  • Streaming / chunked dictation flow on top of the 5 s CoreML window (reference pipeline does not support streaming)
  • The pre-existing testPyTorchDecoderProducesHello failure — tracked separately; unrelated to this PR

Phase 1: CoreML 300M end-to-end
- Sources/OmnilingualASR/: config, SentencePiece vocab, CTC greedy decoder,
  CoreML model loader, fixed-window inference (5s/10s), waveform layer_norm
  matching fairseq2 apply_audio_normalization, 40s utterance hard cap matching
  reference MAX_ALLOWED_AUDIO_SEC
- Layer-norm fix: normalize raw audio chunk before zero padding so sub-window
  inputs match reference statistics (previous version normalized post-pad which
  skewed mean/variance with silence)
- 17 unit tests (config, layer norm, CTC decoder, vocab) + 9 E2E tests
  (multilingual FLEURS clips for en/hi/fr/ar, warmup, chunking, 40s cap)
- Hindi and Arabic E2E pass with correct script output
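The CTC greedy decoder follows the standard recipe: per-frame argmax, collapse repeats, drop blanks. A minimal Python sketch (blank id 0 assumed here for illustration):

```python
def ctc_greedy_decode(logits, blank_id: int = 0):
    """logits: [T, V] per-frame scores -> collapsed token id sequence."""
    tokens, prev = [], None
    for frame in logits:
        best = max(range(len(frame)), key=frame.__getitem__)  # per-frame argmax
        if best != prev and best != blank_id:  # collapse repeats, drop blanks
            tokens.append(best)
        prev = best
    return tokens
```

A blank between two identical labels keeps both (frames 1,1,blank,1 decode to two tokens), which is why the repeat-collapse must compare against the previous frame's argmax, not the previous emitted token.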

CLI
- audio transcribe --engine omnilingual [--window 5|10]
- 5 new parser tests in TranscribeCommandTests

Docs
- docs/models/omnilingual-asr.md: architecture, variants, **full 1,672
  language list** with names + country hints (resolved via CLDR / pycountry)
- docs/inference/omnilingual-asr-inference.md: pipeline, chunking, VAD note
- README.md + 9 translations: feature bullet, SPM products, models table
  rows for 300M / 1B / 3B / 7B variants, memory table, Supported Languages row

Package
- New OmnilingualASR library + target with MLX deps wired for follow-up
- Test target with 4 FLEURS resources

Reference: facebookresearch/omnilingual-asr (Apache 2.0), arXiv:2511.09690

Phase 2 (MLX 1B/3B/7B backend, larger variants) tracked as follow-up — only
the 300M CoreML INT8 path is wired in this PR.

Native MLX/Metal implementation of Meta's Omnilingual CTC model. Loads any of
the 8 published MLX repos (300M / 1B / 3B / 7B in 4-bit and 8-bit) directly
from HuggingFace and runs inference on Apple GPUs.

Sources/OmnilingualASR/MLX/
- OmnilingualMLXConfig: variant table (300M=24L/1024d/16h, 1B=48L/1280d/20h,
  3B=60L/2048d/32h, 7B=128L/2048d/32h), groupSize=64, bits=4|8
- Wav2Vec2Frontend: 7-layer feature extractor (kernels [10,3,3,3,3,2,2],
  strides [5,2,2,2,2,2,2], 320× downsample → 50 Hz frames), per-channel
  LayerNorm + GELU, post_extract_layer_norm, model_dim_proj, weight-normed
  grouped Conv1d positional encoder (kernel=128, groups=16) with even-kernel
  trim and residual
- Wav2Vec2EncoderLayer: pre-norm self-attention + 2-linear FFN, all 4 attn
  projections + both FFN linears are QuantizedLinear with separate per-group
  scales/biases plus a regular linear bias (matches fairseq2 weight names
  exactly: self_attn.{q,k,v,output}_proj, ffn.{inner,output}_proj)
- Wav2Vec2Encoder: stack of N pre-norm layers + final encoder.layer_norm
- CTCHead: QuantizedLinear final_proj → 10288 logits
- OmnilingualMLXWeightLoader: loads model.safetensors, fuses PyTorch
  weight_norm(dim=2) for the position encoder (W = g * v / ||v||_per_kernel),
  applies all quantized + dense + norm tensors via MLXCommon helpers
- OmnilingualASRMLXModel: top-level loader, fromPretrained with HF download,
  variant + bits auto-detection from model id, transcribeAudio with reference
  utterance-level layer_norm preprocessing and 40 s hard cap
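The weight_norm fusion in the loader, sketched in NumPy (assuming PyTorch's convention that dim=2 is the kept dimension, so the norm runs over the channel dims per kernel position):

```python
import numpy as np

def fuse_weight_norm_dim2(g: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Fuse PyTorch weight_norm(dim=2) into a single Conv1d weight.

    v: direction tensor [out_channels, in_per_group, kernel]
    g: per-kernel-position magnitudes, shape [kernel]
    Returns W = g * v / ||v||, normed over dims 0 and 1 per kernel index.
    """
    norm = np.sqrt((v ** 2).sum(axis=(0, 1), keepdims=True))  # [1, 1, kernel]
    return g.reshape(1, 1, -1) * v / norm
```

After fusion, each kernel-position slice W[:, :, k] has norm exactly g[k], so the separate g/v tensors never need to exist at inference time.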

CLI
- audio transcribe --engine omnilingual --backend mlx [--variant 300M|1B|3B|7B] [--bits 4|8]
- 6 new parser tests covering backend/variant/bits validation

Tests
- 9 unit tests (6 config variants + 3 frontend shape math): all pass
- 7 E2E tests on 300M-4bit: load, warmup, English transcript, 40s cap rejection,
  Arabic + Hindi + French FLEURS clips. All pass; transcripts essentially
  identical to the CoreML 300M model. EN: "can you guarantee that the
  replacement part will be shiped tomorrow"; AR: "كما أثنى الزملاء المصارع على
  لونا"; HI Devanagari output verified.
- 1 E2E test for 1B-4bit: loads 549 MB, runs inference, produces recognizable
  English transcript — confirms variant detection + larger encoder code path
- Full unit regression sweep: 763 tests pass (+20 vs baseline), 0 failures

Docs
- docs/models/omnilingual-asr.md: marked all 10 published variants as wired
  through this module (CoreML 5/10s + MLX 300M/1B/3B/7B in 4/8 bit)
- docs/inference/omnilingual-asr-inference.md: added MLX quick-start, variant
  selection, CLI examples

Reference: facebookresearch/omnilingual-asr (Apache 2.0), arXiv:2511.09690.
Ported on top of #199 (CoreML backend).
Two orthogonal refactors that consolidate code duplicated across the model
modules into `AudioCommon` and `MLXCommon` respectively.

SentencePieceModel (AudioCommon)
- New `SentencePieceModel.swift` — protobuf-format `.model` reader exposing
  a `pieces: [Piece]` array with `(text, score, type)` per entry plus a
  `PieceType` enum (NORMAL/UNKNOWN/CONTROL/...) and an `isControlOrUnknown`
  helper. Replaces two independent near-identical protobuf readers.
- `OmnilingualASR/SentencePieceVocabulary.swift` now wraps it (175 → 62 lines);
  `PersonaPlex/SentencePieceDecoder.swift` now wraps it (179 → 85 lines) and
  keeps the greedy unigram encoder used for system-prompt tokenisation.
- 7 unit tests in `Tests/AudioCommonTests/SentencePieceModelTests` cover
  field decoding, all piece types, the control-or-unknown helper, subscript
  bounds, empty model errors, `Data` init, and skipping unknown top-level
  fields. Existing 5 `OmnilingualVocabularyTests` in `Tests/OmnilingualASRTests`
  pass unchanged via the preserved public API.

SDPA (MLXCommon)
- New `SDPA.swift` helpers:
  * `multiHead(q:k:v:numHeads:headDim:scale:mask:)` — flat `[B, T, H*D]`
    inputs, runs reshape → SDPA → reshape. Supports GQA via a second
    overload with separate query and KV head counts.
  * `attendAndMerge(qHeads:kHeads:vHeads:scale:mask:)` — already-shaped
    `[B, H, T, D]` inputs (for RoPE / KV-cache paths). Two overloads for
    `MLXArray?` and `MLXFast.ScaledDotProductAttentionMaskMode`.
  * `mergeHeads(_:)` — low-level `[B, H, T, D] → [B, T, H*D]` merge.
- All reshape calls use `-1` for the batch dimension so the helpers compose
  with `MLX.compile(shapeless:)` graphs that vary batch at runtime. An
  earlier iteration that baked `attn.dim(0)` as a Swift constant broke
  Qwen3-TTS Talker batch inference (caught by `testBatchWithDefaultInstruct`,
  fix verified E2E).
- Migrated 12 attention classes to use the helpers, eliminating duplicated
  reshape+SDPA+reshape boilerplate without forcing a unified `Attention`
  class (the variation in RoPE/KV-cache/GQA/quantisation made the per-site
  projections easier to keep local):
  * Qwen3ASR — AudioSelfAttention, FloatTextAttention, QuantizedTextAttention
  * Qwen3TTS — TalkerAttention, CodePredictorAttention, DecoderTransformerAttention
  * Qwen3Chat — GatedAttentionLayer (DeltaNet hybrid)
  * CosyVoiceTTS — CosyVoiceAttention
  * PersonaPlex — MimiAttention, DepformerAttention, TemporalAttention
  * OmnilingualASR — Wav2Vec2SelfAttention

Docs
- `docs/shared-protocols.md`: added OmnilingualASR (both backends) to the
  SpeechRecognitionModel conformer list and to the not-thread-safe list,
  added the MLXCommon module section with SDPA + QuantizedMLP +
  PreQuantizedEmbedding, added the OmnilingualASR module tree with both
  backends, updated the dependency diagram to include OmnilingualASR +
  ParakeetASR + ParakeetStreamingASR and the new MLXCommon node. Also
  corrected stale entries claiming QuantizedMLP / PreQuantizedEmbedding
  lived in AudioCommon (they're in MLXCommon).
- `CLAUDE.md` / `AGENTS.md`: added `OmnilingualASR` and `ParakeetStreamingASR`
  to the project structure, noted the new `SDPA` helper in MLXCommon and
  `SentencePieceModel` in AudioCommon.

Net diff: 14 files changed, +186/-350 lines, plus 3 new source files
(SentencePieceModel.swift 177 lines, SDPA.swift 94 lines,
SentencePieceModelTests.swift 182 lines).

Verification
- `swift build` — green.
- `swift test --skip E2E` — 770/770 unit tests pass (+17 vs baseline:
  7 new `SentencePieceModelTests` + existing 12 Omnilingual-related
  already counted against this branch's first commit).
- Full `swift test` sweep: 949 executed, 917 passed, 31 conditionally
  skipped (XCTSkip for missing models / macOS-only), 1 pre-existing
  failure (`E2EQwen35CoreMLConversionTests.testPyTorchDecoderProducesHello`
  hard-codes `scripts/convert_qwen35_chat_coreml.py` which was never
  committed to the repo — unrelated to this refactor).
- Spot E2E after refactor: Qwen3-TTS Talker batch E2E, Qwen3-TTS
  CustomVoice + CosyVoice TTS (8/8), Omnilingual MLX 300M and 1B, full
  multilingual FLEURS transcripts (EN/AR/HI/FR) — all pass.
Closes the one coverage gap in the SDPA refactor: the 4-bit and 8-bit
forced-aligner variants both go through `QuantizedTextAttention`, leaving
`FloatTextAttention` (used by the bf16 checkpoint `aufklarer/Qwen3-ForcedAligner-0.6B-bf16`)
only compile-checked after the `SDPA.attendAndMerge` migration.

The new `testForcedAlignerE2EBf16Variant` loads the real bf16 model, asserts
the decoder is `FloatTextModel` (proving we're on the non-quantised path),
and runs end-to-end alignment on the same `test_audio.wav` fixture used by
the existing 4-bit E2E test. Passes in 3.3 s after download.

With this test, all 12 migrated attention classes across Qwen3ASR, Qwen3TTS,
Qwen3Chat, CosyVoice, PersonaPlex, and OmnilingualASR are runtime-verified
on real weights end-to-end.
The test `testPyTorchDecoderProducesHello` imports and runs
`scripts/convert_qwen35_chat_coreml.py` via Python, but that script was
moved out of speech-swift in 274420d ("Remove conversion scripts — moved
to soniqo/speech-model") and a2ccbf7 ("Remove conversion and benchmark
scripts"). The test itself was never deleted in that cleanup, so it has
been failing silently against a missing file in every full E2E run since.

The verification properly belongs in the soniqo/speech-model repo
alongside the Python script — testing conversion tooling from a repo that
no longer ships the tools is a dangling dependency. Deleting rather than
skipping because:

- The XCTSkip path still requires the script to run, and its absence from
  this repo is permanent by design, not a transient missing-file
- Keeping a no-op skip in CI is just dead code that future readers will
  wonder about
- The actual assertion (PyTorch decoder top token = 9419 for a specific
  input) is verification *of* the conversion script, not *of* the Swift
  Qwen3Chat module — the Swift module has its own E2E coverage via
  `E2EQwen35MLXChatTests` and `E2EQwen35CoreMLChatTests`

With this removed, the full `swift test` sweep runs clean: 948 passed,
31 conditionally skipped, 0 failures.
@ivan-digital ivan-digital merged commit 03920e1 into main Apr 11, 2026
1 check passed
@ivan-digital ivan-digital deleted the feat/omnilingual-asr branch April 11, 2026 15:51