feat: add OmniVoice TTS model support by leegang · Pull Request #147 · Blaizzy/mlx-audio-swift

leegang · 2026-04-09T01:14:53Z

Add complete OmniVoice support for mlx-community/OmniVoice-bf16, mirroring the Python omnivoice-infer CLI functionality.

Features:

Three modes: auto voice, voice design, and voice cloning
Full CLI integration with all 11 OmniVoice parameters (--instruct, --num_step, --guidance_scale, --speed, --duration, --t_shift, --denoise, --postprocess_output, --layer_penalty_factor, --position_temperature, --class_temperature)
OmniVoiceGenerateParameters with fast/high-quality presets
Complete test suite (8 tests passing)
Comprehensive documentation

Model Architecture:

Qwen3 LLM backbone (28 layers, 1024 hidden, 16 heads)
8 audio codebooks with hierarchical weighting
24kHz sample rate output
100% parameter compatibility with Python CLI

Files:

Sources/MLXAudioTTS/Models/OmniVoice/ (model implementation)
Sources/MLXAudioTTS/TTSModel.swift (factory registration)
Sources/Tools/mlx-audio-swift-tts/App.swift (CLI integration)
Tests/OmniVoiceTests.swift (test suite)
Documentation and testing scripts

Add complete OmniVoice support for mlx-community/OmniVoice-bf16, mirroring the Python omnivoice-infer CLI functionality. Features: - Three modes: auto voice, voice design, and voice cloning - Full CLI integration with all 11 OmniVoice parameters (--instruct, --num_step, --guidance_scale, --speed, --duration, --t_shift, --denoise, --postprocess_output, --layer_penalty_factor, --position_temperature, --class_temperature) - OmniVoiceGenerateParameters with fast/high-quality presets - Complete test suite (8 tests passing) - Comprehensive documentation Model Architecture: - Qwen3 LLM backbone (28 layers, 1024 hidden, 16 heads) - 8 audio codebooks with hierarchical weighting - 24kHz sample rate output - 100% parameter compatibility with Python CLI Files: - Sources/MLXAudioTTS/Models/OmniVoice/ (model implementation) - Sources/MLXAudioTTS/TTSModel.swift (factory registration) - Sources/Tools/mlx-audio-swift-tts/App.swift (CLI integration) - Tests/OmniVoiceTests.swift (test suite) - Documentation and testing scripts Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

lucasnewman

@leegang Thanks for the contribution!

Your change should only include the model itself, a README for the model, and any relevant tests. You shouldn't need to modify the base TTS protocol -- use sensible defaults for the model and specify a custom generate() overload if you want to allow additional customization.

Your changes shouldn't include any one-off testing scripts, Markdown files, or other detritus at the root level.

If you can fix those issues up I'm happy to review further.

Remove one-off testing scripts, markdown documentation, and other detritus per maintainer review. Keep only: - Model implementation (Sources/MLXAudioTTS/Models/OmniVoice/) - Model README (Sources/MLXAudioTTS/Models/OmniVoice/README.md) - Tests (Tests/OmniVoiceTests.swift) - Necessary code changes (TTSModel.swift, App.swift) Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Implement complete diffusion-based generation for OmniVoice TTS model, replacing the previous TODO placeholder that only returned random noise. Changes: - OmniVoice.swift: Full diffusion iterative unmasking generation loop with classifier-free guidance, Gumbel sampling, and layer penalty - Audio tokenizer: DAC + RVQ codec for encoding/decoding discrete tokens - Text tokenizer integration via HuggingFace AutoTokenizer - Prompt construction with style tokens, text wrapping, and voice modes - Audio post-processing with peak normalization and fade in/out Qwen3Model changes: - Added public init for Qwen3Configuration - Added forwardWithEmbeddings() for custom embedding injection - Added getEmbeddings() for token embedding lookup Build verified. All 13 OmniVoice tests pass. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

lucasnewman · 2026-04-10T19:09:15Z

+                language: args.language,
+                // OmniVoice-specific parameters
+                instruct: args.instruct,
+                numStep: args.numStep,


Can we just use sensible defaults here? The CLI tool isn't supposed to be a comprehensive test bed with every option for every model -- it's not really tractable. The API surface for the model can support a generate() overload with all of these options easily, though, so your own app can use them.

lucasnewman · 2026-04-10T19:10:15Z


+        // Configure OmniVoice-specific parameters if the model is OmniVoice
+        if let omnivoiceModel = loadedModel as? OmniVoiceModel {
+            omnivoiceModel.setGenerationConfig(


This should just be an generate*() overload on the model that takes the custom generation config.

lucasnewman · 2026-04-10T19:12:51Z

@leegang Overall this is looking good. I think you should just have a generate*() overload that accepts your custom generation parameters structure to make the API as simple as possible. See my comment about the CLI tool changes as well. If we can fix those up we can get this in.

Use sensible defaults for OmniVoice model generation. The CLI tool isn't a comprehensive test bed for every model's options. Users who need fine-grained control can use generate() overloads in their own app. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Remove instance-level configuration in favor of passing OmniVoiceGenerateParameters directly to generate(). This provides a cleaner API where users can customize generation per-call without mutating model state. - Add generate(text:voice:refAudio:refText:language:ovParameters:) overload - Remove setGenerationConfig() and all instance-level config vars - Update tests to use the new generate() overload - CLI uses sensible defaults via the standard protocol method Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Changed audioEmbeddings from a single flattened Embedding to an array of [Embedding] (one per codebook) to match the checkpoint structure. Removed unused codebookLayerOffsets since we no longer shift token IDs. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Changed audioHeads from a single flattened Linear to an array of [Linear] (one per codebook) to match the checkpoint structure. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

The mlx-community/OmniVoice-bf16 safetensors uses 'backbone.' prefix for LLM weights, but Qwen3Model expects 'model.' prefix. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…beddings The mlx-community/OmniVoice-bf16 safetensors stores weights with: - Per-codebook audio_embeddings.N.weight and audio_heads.N.weight that need concatenation into single tables - model. prefix for LLM weights that maps to llm. in our wrapper - backbone. prefix that maps to llm. prefix Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

The safetensors stores audio_embeddings.0.weight through audio_embeddings.7.weight and audio_heads.0.weight through audio_heads.7.weight as separate arrays, matching our model's [Embedding] and [Linear] array structure. No concatenation needed. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Qwen3Model's inner model was 'private let' which prevented MLX's weight loader from traversing into it. Change to @ModuleInfo(key: "model") fileprivate so weights at path 'llm.model.*' are properly matched. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

The @ModuleInfo key was "backbone" but sanitize() maps LLM weights to "llm." prefix. Changed key to "llm" so the weight loader finds the nested Qwen3Model correctly. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

The HiggsAudioV2 audio tokenizer has a complex architecture with acoustic_encoder, acoustic_decoder, quantizer, and fc2 that doesn't match our simplified codec wrapper. Skip loading these weights for now and return a config-only tokenizer. The main OmniVoice model should now load successfully. Audio tokenizer encode/decode remain TODO and will throw clear errors. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…coder and RVQ Implement the complete audio tokenizer for OmniVoice: - DAC-style acoustic encoder (Conv1d + downsampling blocks + Snake activations) - DAC-style acoustic decoder (ConvTranspose1d + upsampling blocks + Snake) - Residual Vector Quantization (RVQ) with 9 codebooks of 1024 entries - Snake activation with learnable alpha parameters - fc2 projection layer between quantizer and decoder This enables voice cloning mode by encoding reference audio to discrete tokens and decoding generated tokens back to waveforms. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…tizers') Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…tecture - OmniVoiceDACDownBlock: 3 ResidualUnits(dilation 1,3,9) + Snake + Conv - OmniVoiceDACUpBlock: Snake + ConvTranspose + 3 ResidualUnits(dilation 1,3,9) - Fix quantizer key mismatch (quantizers vs quantizer) Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…oice acoustic encoder Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

- Add OmniVoiceConvTranspose1d with [in, out, kernel] weight shape (no weight norm) - Replace DACVAEWNConvTranspose1d in OmniVoiceDACUpBlock Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…oice acoustic decoder Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…izer input projection with checkpoint requirements

…nnecessary transpose

…ivision by zero and improve stability

…ranspose1d The DAC decoder in HiggsAudioV2TokenizerModel removes the final Tanh (_adjust_dac_decoder replaces it with Identity) and sets output_padding = stride % 2 on each ConvTranspose1d. The Swift port was missing both, causing severe output length drift and saturated (noisy) waveforms.

- Dump raw vocoder output and post-processed audio in generateAudio - Dump audio tokenizer encode->decode roundtrip in encode() - Add testAudioTokenizerRoundTrip test for Xcode execution

The padding for conv1 in DacResidualUnit was hard-coded to 3, which is only correct for dilation=1. Python reference uses , so for dilation=3 the padding should be 9, and for dilation=9 it should be 27. This under-padding caused boundary artifacts, length mismatches, and cascading feature corruption through the encoder/decoder, resulting in pure noise from the vocoder decode path.

Python reference uses full attention (not causal) for the conditional path and the unconditional target region. The default causal mask prevented target tokens from attending to each other during diffusion, which produced near-silent output from the vocoder. - forward() now accepts an optional MLXFast.ScaledDotProductAttentionMaskMode - generateAudio() builds a per-batch attention mask: * conditional: full attention over condLength * unconditional: target region full attention, prefix diagonal only

MLX scaled_dot_product_attention converts bool masks internally, but there may be a bug in the bool->additive conversion path on Metal. Switch to explicit float masks where True=0 and False=-inf to rule out any mask interpretation issues.

…sk bugs Instead of concatenating cond and uncond into a single batch and using a custom attention mask (which triggers fatal dtype errors and potential Metal kernel bugs), run two separate forward passes per diffusion step. This is mathematically equivalent to the Python reference but avoids all MLX scaled_dot_product_attention mask handling issues entirely.

…in bfloat16 Python reference (Blaizzy/mlx-audio PR #630) shows snake must be computed in float32: alpha near zero in float16/bfloat16 makes 1/(alpha+eps) = inf, and inf*sin²(0) = NaN. This was corrupting both the vocoder encoder and decoder, causing pure noise output. Also adds step-0 logits NaN/inf diagnostics and replaces remaining mask tokens with 0 after diffusion, matching Python behavior.

…t correctly The tokenizer was loading raw checkpoint weights without sanitizing, so the quantizer codebook weights (keyed 'weight' in Python nn.Embedding) could not match Swift's 'embed' key, leaving the codebook randomly initialized. This caused the vocoder to output pure noise. Adds sanitize() matching Python HiggsAudioTokenizer.sanitize, including: - codebook.weight -> codebook.embed rename - alpha transpose for shape correction - filtering of unused keys (decoder_semantic, fc1, VQ bookkeeping) Skips 3D conv weight transposes because Swift's OmniVoiceConv1d and OmniVoiceConvTranspose1d already handle PyTorch->MLX layout conversion internally.

Swift's OmniVoiceConv1d/ConvTranspose1d keep data in NCL [B,C,L] format. snakeAlpha's alpha is [1,1,C] (for NLC), which causes a broadcast fatal error when multiplied against NCL inputs. Instead of transposing alpha at load time (which may fail if update() rejects shape changes), we now reshape it at runtime inside callAsFunction to [1,C,1] so it broadcasts correctly over NCL feature maps. Also removes the alpha transpose from tokenizer sanitize since runtime reshape handles both [1,1,C] and [1,C,1] checkpoints.

OmniVoiceConv1d initialized weights as [out,k,in] (MLX format) but then transposed them to [out,in,k] inside callAsFunction before passing to MLX.conv1d. This permuted every conv1d weight in the vocoder encoder/decoder, causing pure noise output even with correctly-loaded checkpoint weights. Remove the spurious transpose so the weight layout matches what checkpoint and MLX.conv1d both expect.

…e source

… loading

…bfloat16 precision issues Adds .asType(.float32) on inputs, weights, and biases inside OmniVoiceConv1d and OmniVoiceConvTranspose1d, then casts the output back to the original dtype. This tests whether MLX Metal's bfloat16 convolution paths are silently collapsing the signal. Also adds diagnostic prints to the actual AcousticDecoder class.

OmniVoiceDACResidualUnit calculated padding based on dilation but never passed dilation to OmniVoiceConv1d, so all dilated convolutions in the encoder/decoder residual units ran with dilation=1. This completely broke the vocoder for a checkpoint trained with dilations 1,3,9, causing pure noise output despite correct weights and shapes.

…precision hypothesis

…precision hypothesis Since ref_direct (vocoder-only path) sounds like compressed human voice but raw TTS output sounds like instruments, the vocoder is likely correct and the TTS model (diffusion token generation) is producing wrong tokens. This forces both the tokenizer and the TTS model to load weights in float32 to rule out bfloat16 precision issues in the LLM backbone or audio heads causing token distribution corruption.

- Exclude refAudio from unconditional inputIds for proper CFG - Add tanh to acousticDecoder final conv output for stability - Fix encoder diagnostic log labels

…t-lm version - Replace hand-written Qwen3 attention/model with official mlx-swift-lm code - Keeps TTS extensions: forwardWithEmbeddings, getEmbeddings, audio decode helpers - Update Qwen3Configuration to match upstream structure while preserving sampleRate/eosTokenId

- Revert incorrect refLen slicing on uncondInputIds that caused reshape crash - Add per-layer diagnostic prints for embeds/hidden/logits to isolate 0.6% accuracy

…decode path - Replace direct fc2() call with proper matmul using transposed weight - Encode uses matmul(zNLC, fc2.weight) to project 256→1024 - Decode now uses matmul(zNLC, fc2.weight.T) to reverse projection 1024→256 - This fixes the root cause of synthesized audio being pure noise Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…nd document known checkpoint limitations

bestmrlee-max and others added 2 commits April 9, 2026 08:54

Merge branch 'main' into feature/omnivoice-support

7f07b50

lucasnewman requested changes Apr 9, 2026

View reviewed changes

bestmrlee-max and others added 2 commits April 9, 2026 22:11

lucasnewman reviewed Apr 10, 2026

View reviewed changes

leegang closed this Apr 11, 2026

leegang reopened this Apr 11, 2026

bestmrlee-max and others added 5 commits April 11, 2026 14:17

Merge branch 'Blaizzy:main' into feature/omnivoice-support

448c05b

fix: resolve audio_heads shape mismatch for OmniVoice checkpoint

50c6ae7

Changed audioHeads from a single flattened Linear to an array of [Linear] (one per codebook) to match the checkpoint structure. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

leegang marked this pull request as draft April 12, 2026 10:07

bestmrlee-max and others added 14 commits April 12, 2026 18:14

fix: add backbone key to llm ModuleInfo for checkpoint compatibility

40b571f

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

fix: map backbone. weight prefix to model. for OmniVoice

8fbb180

The mlx-community/OmniVoice-bf16 safetensors uses 'backbone.' prefix for LLM weights, but Qwen3Model expects 'model.' prefix. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

fix: map unprefixed LLM weights to llm.model. path

8e27d2d

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

fix: match @ModuleInfo key to sanitized weight prefix

f06d3fb

The @ModuleInfo key was "backbone" but sanitize() maps LLM weights to "llm." prefix. Changed key to "llm" so the weight loader finds the nested Qwen3Model correctly. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

fix: match OmniVoiceAudioTokenizer quantizer key to checkpoint ('quan…

c0ff54b

…tizers') Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

fix: add missing encoder tail layers (snake1, conv1, conv2) for OmniV…

2557b62

…oice acoustic encoder Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

fix: use PyTorch-weight-layout ConvTranspose1d for OmniVoice decoder

ca33c2c

- Add OmniVoiceConvTranspose1d with [in, out, kernel] weight shape (no weight norm) - Replace DACVAEWNConvTranspose1d in OmniVoiceDACUpBlock Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

fix: add missing decoder tail layers (snake1, conv1, conv2) for OmniV…

2693828

…oice acoustic decoder Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

bestmrlee-max added 4 commits April 14, 2026 01:41

refactor: update OmniVoice residual unit architecture and align quant…

004ea6f

…izer input projection with checkpoint requirements

fix: correct OmniVoice projection matrix multiplication by removing u…

4c53b57

…nnecessary transpose

fix: update activation function calculation in OmniVoice to prevent d…

92a4c25

…ivision by zero and improve stability

leegang force-pushed the feature/omnivoice-support branch from f40d325 to 29419ec Compare April 14, 2026 07:15

bestmrlee-max and others added 25 commits April 14, 2026 15:54

diag(omnivoice): add temporary wav dumps to isolate noise source

b3af751

- Dump raw vocoder output and post-processed audio in generateAudio - Dump audio tokenizer encode->decode roundtrip in encode() - Add testAudioTokenizerRoundTrip test for Xcode execution

diag(omnivoice): add step-0 logits/predTokens/scores prints

6238b41

fix(omnivoice): resolve MLXArray dtype inference compilation errors

558dc6e

fix(omnivoice): cast attention mask to bfloat16 to match Q/K/V dtype

3addac3

diag(omnivoice): add layer-by-layer decode diagnostics to locate nois…

ed20fce

…e source

diag(omnivoice): add post-load weight statistics to verify checkpoint…

fe1e20e

… loading

diag(omnivoice): add runtime decoder weight shape checks inside decode()

4e61770

test(omnivoice): force tokenizer weights to float32 to test bfloat16 …

f93cff2

…precision hypothesis

fix(OmniVoice): correct CFG uncond input and add decoder tanh

cc74a73

- Exclude refAudio from unconditional inputIds for proper CFG - Add tanh to acousticDecoder final conv output for stability - Fix encoder diagnostic log labels

fix(OmniVoice): rollback broken uncond slice and add step diagnostics

904ef5f

- Revert incorrect refLen slicing on uncondInputIds that caused reshape crash - Add per-layer diagnostic prints for embeds/hidden/logits to isolate 0.6% accuracy

fix(OmniVoice): add fc2.bias in decode path, remove incorrect tanh, a…

4cfed0a

…nd document known checkpoint limitations

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add OmniVoice TTS model support#147

feat: add OmniVoice TTS model support#147
leegang wants to merge 113 commits intoBlaizzy:mainfrom
leegang:feature/omnivoice-support

leegang commented Apr 9, 2026

Uh oh!

lucasnewman left a comment

Uh oh!

lucasnewman Apr 10, 2026

Uh oh!

lucasnewman Apr 10, 2026 •

edited

Loading

Uh oh!

lucasnewman commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

leegang commented Apr 9, 2026

Uh oh!

lucasnewman left a comment

Choose a reason for hiding this comment

Uh oh!

lucasnewman Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

lucasnewman Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lucasnewman commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lucasnewman Apr 10, 2026 •

edited

Loading