Skip to content

Latest commit

 

History

History
88 lines (63 loc) · 3.38 KB

File metadata and controls

88 lines (63 loc) · 3.38 KB

Parakeet TDT ASR

CoreML-based automatic speech recognition using NVIDIA's Parakeet-TDT 0.6B v3.

Architecture

FastConformer Encoder: 24-layer conformer with depthwise-separable convolutions and 8x subsampling. Input: 128-dim mel spectrogram at 16kHz. Output: 1024-dim encoded representations.

TDT Decoder: Token-and-Duration Transducer — an extension of RNN-T that adds a duration prediction head alongside the standard token prediction head.

  • Prediction network: 2-layer LSTM (640 hidden) predicts from previously emitted tokens
  • Joint network: Combines encoder and prediction outputs, produces dual logits:
    • Token logits (8193 classes: 8192 SentencePiece tokens + 1 blank)
    • Duration logits (5 classes: [0, 1, 2, 3, 4] frames to advance)

TDT Decoding Algorithm

Standard RNN-T always advances 1 encoder frame on blank. TDT enhances this:

t = 0
while t < encoded_length:
    (token_logits, dur_logits) = joint(encoder[t], decoder_state)
    token = argmax(token_logits)
    if token == blank:
        t += 1
    else:
        emit(token)
        duration = duration_bins[argmax(dur_logits)]
        t += max(duration, 1)
        decoder_state = lstm(token, decoder_state)

This variable-rate alignment is more accurate and efficient than fixed-rate RNN-T.

CoreML Models

The model is split into 3 CoreML sub-models for optimal compute unit placement:

Model Compute Quantization Purpose
encoder.mlmodelc CPU + Neural Engine INT8 palettized Mel → 1024-dim encoded features
decoder.mlmodelc CPU + Neural Engine None LSTM prediction network
joint.mlmodelc CPU + Neural Engine None Dual-head token + duration logits

Default quantization:

  • INT8 (aufklarer/Parakeet-TDT-v3-CoreML-INT8) — ~50% size reduction, best quality/speed balance

Mel preprocessing (pre-emphasis, STFT, mel filterbank, normalization) is done in Swift using Accelerate/vDSP — no CoreML preprocessor model needed. Decoder and joint are small enough that quantization isn't necessary.

Audio Preprocessing

  • Sample rate: 16kHz
  • Mel bins: 128
  • n_fft: 512, hop: 160, win: 400
  • Window: Hanning
  • Pre-emphasis: 0.97 (applied in Swift via vDSP)

CLI Usage

# Transcribe with Parakeet (CoreML, fast on Neural Engine)
audio transcribe --engine parakeet audio.wav

# Transcribe with Qwen3 (MLX, default)
audio transcribe audio.wav
audio transcribe --engine qwen3 audio.wav

Performance

On Apple Silicon with Neural Engine (M2 Max, 20s audio):

  • Cold: RTF ~0.12 (first inference includes CoreML compilation)
  • Warm: RTF ~0.03 (subsequent inferences, ~32x real-time)
  • Mel preprocessing uses vectorized Accelerate/vDSP (pre-emphasis, normalization)
  • Decode loop uses memcpy, memset, and vDSP argmax to minimize scalar overhead
  • Encoder runs on Neural Engine for maximum throughput

Model Weights

Thread Safety

  • ParakeetConfigSendable (immutable value type)
  • ParakeetVocabularySendable (immutable dictionary)
  • ParakeetASRModel — NOT thread-safe (LSTM decoder state). Create separate instances for concurrent use.
  • CoreML MLModel.prediction() is thread-safe for stateless models, but the TDT decode loop manages LSTM h/c state as local variables (safe per-call, not re-entrant).