
Add Cohere Transcribe STT support #129

Open
beshkenadze wants to merge 9 commits into Blaizzy:main from beshkenadze:draft/cohere-transcribe-experimental

Conversation

@beshkenadze
Contributor

@beshkenadze beshkenadze commented Mar 27, 2026

Summary

  • add Cohere Transcribe model support, CLI wiring, and Cohere-specific tests
  • fix the Swift decoder prompt token order to match the canonical Cohere/Python layout and restore punctuation/casing on transcription output
  • add Swift-side support for loading quantized Cohere checkpoints via quantization / quantization_config
  • publish quantized Cohere MLX repos for 8-bit, 6-bit, and 4-bit variants

Quantized model repos

Root causes addressed

1. Text quality regression

Cohere Transcribe uses prompt control tokens to steer formatting. The Swift tokenizer originally built the decoder prompt in the wrong order:

<|startofcontext|> <|startoftranscript|> <|en|> <|en|> <|pnc|> <|notimestamp|> <|nodiarize|> <|noitn|> <|emo:undefined|>

The canonical order used by the Python implementation is:

<|startofcontext|> <|startoftranscript|> <|emo:undefined|> <|en|> <|en|> <|pnc|> <|noitn|> <|notimestamp|> <|nodiarize|>

That mismatch preserved lexical content but degraded casing and punctuation. The Swift tokenizer now matches the working Python path.
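The canonical prompt order can be pinned down with a small sketch. The control-token strings and their order are taken from the PR text above; the builder function itself is illustrative, not the actual mlx-audio implementation.

```python
# Canonical Cohere Transcribe decoder-prompt order (per this PR).
# `build_decoder_prompt` is a hypothetical helper, not the real API.

def build_decoder_prompt(lang: str = "en") -> list[str]:
    """Assemble the control-token prefix in the canonical Python-side order."""
    return [
        "<|startofcontext|>",
        "<|startoftranscript|>",
        "<|emo:undefined|>",   # emotion tag comes right after the transcript marker
        f"<|{lang}|>",         # source language
        f"<|{lang}|>",         # target language (same for plain transcription)
        "<|pnc|>",             # punctuation and casing enabled
        "<|noitn|>",           # no inverse text normalization
        "<|notimestamp|>",     # no timestamps
        "<|nodiarize|>",       # no speaker diarization
    ]

print(build_decoder_prompt())
```

A regression test that asserts this exact ordering (as `cohereTokenizerBuildsPromptTokens` does on the Swift side) is what catches this class of bug.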

2. Quantized Cohere checkpoints were not loadable in Swift

Unlike the other STT models in this repo, Cohere did not decode quantization / quantization_config from config.json and did not call quantize(model: ...) before loading packed weights. This PR adds the same quantization-aware loading path used by the other Swift STT models.

It also tightens Cohere conv-weight normalization so both the original converted fp16 checkpoint and the locally re-saved quantized checkpoints load with the correct 1D convolution layout.

Files changed for quantization support

  • Sources/MLXAudioSTT/Models/CohereTranscribe/CohereTranscribe.swift
  • Sources/MLXAudioSTT/Models/CohereTranscribe/CohereTranscribeConfig.swift
  • Tests/MLXAudioSTTTests.swift

Validation

  • swift test --filter cohereConfigDecoding
  • swift test --filter cohereTokenizerBuildsPromptTokens
  • xcodebuild -scheme mlx-audio-swift-stt -configuration Release -destination "platform=macOS" -derivedDataPath .build/xcode build
  • real transcription benchmark on Tests/media/conversational_a.wav
  • local quantized export + load verification for 8-bit / 6-bit / 4-bit Cohere checkpoints

Benchmark (Tests/media/conversational_a.wav, warm run)

| Variant | Gen TPS | Total time | Peak memory | Quality note |
| --- | --- | --- | --- | --- |
| fp16 | 146.6 | 0.764 s | 5.40 GB | baseline |
| 8-bit | 352.9 | 0.460 s | 2.87 GB | matches fp16 on this sample |
| 6-bit | 362.5 | 0.461 s | 2.42 GB | punctuation regression ("bush-curious") |
| 4-bit | 394.6 | 0.436 s | 1.96 GB | lexical regression (Kaldi → Khaldi) |

Recommendation

  • 8-bit is the best quantized trade-off on the repo sample: ~2.4x fp16 generation throughput with ~47% lower peak memory and no observed text regression on this sample.
  • 6-bit and 4-bit are faster/smaller, but both show output degradation on the same audio clip, so they should be considered more experimental.
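The headline figures in the recommendation follow directly from the benchmark table:

```python
# Derive the 8-bit recommendation numbers from the benchmark table above.
fp16_tps, q8_tps = 146.6, 352.9     # generation tokens/sec
fp16_mem, q8_mem = 5.40, 2.87       # peak memory, GB

speedup = q8_tps / fp16_tps         # ~2.4x fp16 throughput
mem_saving = 1 - q8_mem / fp16_mem  # ~47% lower peak memory
print(f"{speedup:.2f}x faster, {mem_saving:.0%} less memory")
```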

Notes

  • this PR remains a draft because broader Cohere evaluation is still ongoing, even though the repo-sample text-quality issue is fixed
  • the quantized repos are uploaded and linked above so reviewers can pull the exact artifacts used in this benchmark

@beshkenadze
Contributor Author

@lucasnewman I'm not sure about the quality of the model itself, but we can keep it if we want to support it.

@beshkenadze beshkenadze marked this pull request as ready for review March 29, 2026 19:07
@beshkenadze
Contributor Author

@lucasnewman now we have full parity with the Python version :)

@beshkenadze
Contributor Author

Also closes #130

@Newarr

Newarr commented Mar 30, 2026

We're looking to use this in OpenOats (macOS meeting transcription). Ran Cohere against Whisper Large v3 Turbo on Apple Silicon via the Python mlx-audio side.

Started with 8 samples where Cohere hit 0.0% WER on French and Spanish. Kept adding samples to see if that held. It didn't at scale, but neither did Whisper's early leads. At 695 samples (647 English):

| Language | n | Cohere avg WER | Whisper avg WER | Cohere median | Whisper median |
| --- | --- | --- | --- | --- | --- |
| English | 647 | 5.55% | 5.56% | 4.00% | 3.57% |
| Polish | 22 | 7.3% | 7.3% | 4.3% | 4.3% |
| Spanish | 22 | 2.5% | 1.8% | 0.0% | 0.0% |
| French | 2 | 0.0% | 1.6% | – | – |
| German | 2 | 1.6% | 7.8% | – | – |
| Avg latency | – | 0.23 s | 0.47 s | – | – |

On English they're the same model to two decimal places. Cohere is 2x faster, and that held across every test we ran.
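For readers unfamiliar with the metric, WER in tables like the one above is conventionally word-level edit distance over reference length. This is a generic sketch of that computation, not the harness used for these numbers:

```python
# Word error rate via Levenshtein edit distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```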

Can help test on meeting audio once this lands.

@Benjoyo

Benjoyo commented Mar 30, 2026

Thanks @beshkenadze for this - can we use an 8-bit or 4-bit quantized model as well? I can only find your fp16 model on the Hub (https://huggingface.co/beshkenadze/cohere-transcribe-03-2026-mlx-fp16). Does quantization work, and if so, can you point me at the right script/command? Thanks

@beshkenadze
Contributor Author

@Benjoyo I've uploaded quantized models and updated the benchmark results as well.

beshkenadze added a commit to beshkenadze/mlx-audio-swift that referenced this pull request Mar 31, 2026
- Add Cohere Transcribe STT model implementation
- Wire into CLI and docs
- Add Cohere Transcribe tests
- Fix: use max(dim-1, 1) in ParakeetAudio normalization (div-by-zero guard)
- Fix: add textProcessor param and kokoro case to TTSModel factory
- Improve test integration via MLXAUDIO_TEST_MODEL_DIR env var
@beshkenadze
Contributor Author

@Benjoyo @Newarr feel free to use my fork while we are waiting for merge into upstream.
