Add streaming ASR with Parakeet EOU 120M + DictateDemo #188

Merged
ivan-digital merged 29 commits into main from feat/streaming-asr
Apr 11, 2026

Conversation

ivan-digital (Collaborator) commented Apr 5, 2026

Summary

Streaming ASR module using Parakeet EOU 120M on CoreML with end-of-utterance detection. Includes macOS dictation demo app.

ParakeetStreamingASR module

  • Cache-aware FastConformer-RNNT encoder processing 320ms chunks
  • RNNT greedy decoder with EOU token detection (ID 1024)
  • AsyncStream<PartialTranscript> API with persistent encoder/decoder state
  • Mel preprocessing matching NeMo reference: symmetric Hann window, zero center padding, centered FFT frame, power/4 scaling, no normalization
  • Pre-encode mel cache (manual prepend, 9 frames context)
  • Multi-utterance: no state reset after EOU, token offset tracking
  • Float32 decoder LSTM and joint inputs
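The NeMo-matching preprocessing steps listed above can be sketched in plain Python. The window and FFT sizes (win_length 400, n_fft 512) are assumptions consistent with the 56-zero centering figure given in a later commit; the real implementation lives in StreamingMelPreprocessor.

```python
import math

def hann_symmetric(n):
    # Symmetric Hann window: denominator n - 1, so both endpoints are zero.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def center_window(window, n_fft):
    # torch.stft centers the window inside the FFT frame with
    # (n_fft - win_length) // 2 zeros on each side.
    pad = (n_fft - len(window)) // 2
    return [0.0] * pad + window + [0.0] * (n_fft - len(window) - pad)

win = hann_symmetric(400)            # 25 ms window at 16 kHz (assumed)
frame = center_window(win, 512)      # 56 zeros on each side
assert len(frame) == 512
assert frame[:56] == [0.0] * 56 and frame[-56:] == [0.0] * 56
assert abs(win[0]) < 1e-12 and abs(win[-1]) < 1e-12
```

Zero center padding of the audio (constant, not reflect) and skipping per-feature normalization follow the same principle: reproduce exactly what the checkpoint saw in training.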

DictateDemo (macOS)

  • Window app with mic button, VAD indicator, live transcript
  • Silero VAD (CoreML) for speech/silence detection
  • Multi-sentence display with EOU-committed lines
  • Debug overlay (audio RMS, chunks processed, partials count)
  • Raw mic audio (no normalization needed with correct mel)

Model (aufklarer/Parakeet-EOU-120M-CoreML-INT8)

  • Encoder: INT8, 102 MB, input [1,128,73] (9 cache + 64 chunk)
  • Decoder: 7.5 MB, float32 h/c
  • Joint: 2.7 MB, float32 inputs
  • Total: 112 MB

Known limitations

  • English only (Parakeet EOU 120M is English-only)
  • Pre-encode cache is manual mel prepend (not model-managed loopback)
  • Model sensitivity varies with mic distance/volume

Test plan

  • 17 unit tests (config, vocab, mel, cache shapes, types)
  • 7 E2E tests (batch, streaming, session, latency, warmup, memory, model loading)
  • 6 demo E2E tests (streaming, VAD, multi-utterance, latency, real audio)
  • Mel reference test vs NeMo (max diff 0.0001)
  • Live mic transcription: multi-utterance dictation

Closes #186

Streaming ASR with end-of-utterance detection using Parakeet EOU 120M
on CoreML. Cache-aware FastConformer-RNNT encoder processes 320ms
chunks with persistent encoder/decoder state between calls.

Module structure:
- Configuration, Vocabulary, StreamingMelPreprocessor
- RNNTGreedyDecoder (no duration bins, EOU token detection)
- StreamingSession with AsyncStream<PartialTranscript> API
- SpeechRecognitionModel + ModelMemoryManageable conformance

17/17 unit tests pass. E2E inference produces blank tokens —
needs investigation into mel preprocessing match with NeMo,
pre_cache encoder input, and decoder output transposition.

- Add pre_cache mel-level context input to encoder
- Encoder output now [B,T,D] — use simple memcpy for frame extraction
- Decoder output now [B,U,D] — remove transpose workarounds
- Use shape[1] for frame count (was shape[2] for old [B,D,T] layout)

Remaining: batch transcription produces empty text because per-chunk
mel normalization differs from NeMo's whole-utterance normalization.
Need to port proven MelPreprocessor from ParakeetASR.

- Replace custom mel preprocessor with proven NeMo-matching DSP from
  ParakeetASR (Slaney filterbank, whole-utterance normalization, vDSP)
- Output float32 mel (encoder expects float32)
- Fix mel truncation to use actual array shape, not melLength
- 6/7 E2E tests pass: streaming transcription, latency (RTF 0.056),
  session, warmup, model loading, memory management

EOU token fires during pushAudio, returning the transcript as a
final partial and clearing accumulated tokens. The batch path was
discarding pushAudio results, so finalize() found empty tokens.

All 24 tests pass (17 unit + 7 E2E). Streaming latency RTF 0.056.
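For reference, the RTF (real-time factor) figures quoted throughout are processing time divided by audio duration; a quick illustration (the timings below are made-up, not measurements from this PR):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    # Real-time factor: values below 1.0 mean faster than real time.
    return processing_seconds / audio_seconds

# An RTF of 0.056 means 10 s of audio is processed in 0.56 s,
# i.e. roughly 18x real-time throughput.
assert abs(rtf(0.56, 10.0) - 0.056) < 1e-12
assert round(1.0 / rtf(0.56, 10.0)) == 18
```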

Menu bar app using Parakeet EOU 120M for real-time voice dictation.
Live partial transcripts update as you speak, EOU auto-commits
sentences, paste to frontmost app via Cmd+Shift+V.

- StreamingRecorder: 16kHz mic capture with audio level
- DictateViewModel: streaming session management, paste-to-app
- Menu bar dropdown + floating HUD with live transcript

ivan-digital changed the title from "Add streaming ASR with Parakeet EOU 120M (WIP)" to "Add streaming ASR with Parakeet EOU 120M" on Apr 5, 2026

- Move audio processing off audio thread via Sendable ASRProcessor
- Buffer audio and drain on background queue (was blocking audio thread)
- Don't clear tokens on EOU — keep accumulating for continuous dictation
- EOU marks sentence boundary, not hard reset

Token accumulation without reset caused partials to repeat the same
text forever. Restore allTokens.removeAll() on EOU — each utterance
is a fresh segment. Encoder/decoder LSTM state persists across EOU.

Removed unsafe cache reset (caused SIGBUS from concurrent memset).
Model self-recovers after ~10-20s silence naturally.

…Timer

- Add SpeechVAD for speech/silence indicator (CoreML Silero)
- Multi-sentence display: each EOU-committed sentence on own line
- DispatchSourceTimer for reliable processQueue scheduling
- Auto-load models on app launch
- VAD is UI-only — all audio feeds to encoder for cache continuity

Known: streaming per-chunk mel normalization limits real-time
transcription quality vs batch mode. Needs investigation into
running mel normalization or shared mel context across chunks.

- Switch from @Observable to ObservableObject + @Published (Combine)
  to fix the menu bar and HUD windows not re-rendering on property changes
- Add extractRaw() mel mode (no normalization) for streaming — fixes
  blank-token issue caused by per-chunk normalization mismatch
- Add extractStreaming() with running Welford normalization
- Auto-open HUD window on recording start
- Batch transcribeAudio uses whole-utterance normalization (all E2E pass)
- Streaming produces live partials and EOU-committed finals

Known limitation: after EOU commits a sentence, encoder cache needs
time to re-engage for next utterance. First sentence transcribes well,
subsequent sentences may be delayed or missed.
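The running normalization that extractStreaming() briefly used can be sketched with Welford's online mean/variance algorithm. The class below is an illustrative stand-in, not the actual Swift API (and per later commits, raw unnormalized mel is what the EOU model ultimately needs):

```python
class RunningNorm:
    """Welford's online algorithm: one-pass running mean and variance,
    so streaming chunks can be normalized without seeing the whole utterance."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x: float) -> float:
        var = self.m2 / self.n if self.n > 1 else 1.0
        return (x - self.mean) / (var ** 0.5 + 1e-5)

rn = RunningNorm()
for v in [1.0, 2.0, 3.0, 4.0]:
    rn.update(v)
assert abs(rn.mean - 2.5) < 1e-12      # running mean of 1..4
assert abs(rn.m2 / rn.n - 1.25) < 1e-12  # population variance
```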

After end-of-utterance, zero all encoder caches (preCache,
cacheLastChannel, cacheLastTime), reset decoder LSTM, re-prime
with blank token. Each utterance starts with fresh state.

Correct the streaming pipeline to match NeMo's cache-aware design:

- Pre-encode cache is mel-level, handled externally by the caller:
  save last 9 mel frames from each chunk, prepend to next chunk's
  mel input. Encoder input is [1, 128, 43] (9 cache + 34 chunk).
- Remove pre_cache as separate encoder input (was incorrect)
- Use extractRaw() (no normalization) for all paths — EOU model
  was trained with normalize: "NA"
- Full session reset after EOU: encoder caches, decoder LSTM,
  pre-encode mel cache, accumulated tokens
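A minimal sketch of the caller-managed pre-encode cache described above, assuming 128 mel bins and the 34-frame chunk used at this point in the history (later re-exported to 64 frames); names are illustrative:

```python
PRE_CACHE_FRAMES = 9  # mel frames of left context carried between chunks

class MelPreCache:
    """Keeps the last 9 mel frames of each chunk and prepends them
    to the next chunk, so the encoder sees [9 cache + chunk] frames."""

    def __init__(self, n_mels: int = 128):
        # Before the first chunk the cache is silence (zeros).
        self.cache = [[0.0] * n_mels for _ in range(PRE_CACHE_FRAMES)]

    def prepend(self, chunk):
        # chunk: list of mel frames, each a list of n_mels values;
        # assumed to hold at least PRE_CACHE_FRAMES frames.
        framed = self.cache + chunk
        self.cache = chunk[-PRE_CACHE_FRAMES:]
        return framed

pc = MelPreCache()
first = pc.prepend([[0.1] * 128 for _ in range(34)])
assert len(first) == 43                   # matches encoder input [1, 128, 43]
second = pc.prepend([[0.2] * 128 for _ in range(34)])
assert second[:9] == [[0.1] * 128] * 9    # previous chunk's tail as context
```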

24/24 tests pass (17 unit + 7 E2E). RTF 0.103 (10x real-time).

WIP: menu bar button action not triggering consistently.
The streaming pipeline works (24/24 E2E tests pass) but the
SwiftUI MenuBarExtra UI has rendering/interaction issues.

6/6 tests pass:
- Streaming session produces text from real audio
- Multi-utterance with silence gaps (no crash)
- VAD detects speech vs silence correctly
- Chunk latency RTF 0.006 (166x real-time)

The streaming pipeline works end-to-end. Menu bar UI has
separate SwiftUI rendering issues to debug.

The EOU model was trained with NeMo's streaming config which uses:
- Symmetric Hann window (divides by N-1, not N)
- Zero center padding (not reflect padding)
- No per-feature normalization

Our mel used periodic Hann + reflect padding, producing mel values
the model didn't recognize. Fixed extractRaw() to match NeMo.

Result: mic audio now transcribes correctly across all chunk sizes.
24/24 main tests pass + demo mic test produces text.

- Add mic audio normalization (mic levels too low for EOU model)
- Normalize before VAD and ASR for consistent levels
- Switch demo from MenuBarExtra to regular Window (more reliable)
- EOU reset: only clear tokens, keep encoder/decoder state
- Debug audio capture to WAV for pipeline validation
- Added testDebugWavLoading and testDebugAudioSmallChunks tests

Findings:
- EOU model is English-only (blanks on Russian speech)
- Periodic Hann + reflect padding works in live demo
- Model produces text from live mic ("hello what are you")
- Needs mel A/B test vs NeMo reference in models repo

…adding

Three mel computation bugs that caused the EOU model to produce blanks:

1. Window centering: torch.stft centers the Hann window in the FFT frame
   with (n_fft - win_length) / 2 = 56 zeros on each side. Our code put
   the window at offset 0 with 112 trailing zeros — wrong phase/magnitude.
2. Symmetric Hann window: NeMo uses periodic=False (divides by N-1).
3. Zero center padding: NeMo uses pad_mode='constant', not reflect.

Validated against NeMo reference: max mel difference 0.0001 (was 2.26).
Demo transcribes live mic: "what is the name of the screen"

- Don't clear tokens on EOU — accumulate for continuous dictation
- UI extracts delta text (new since last commit) for sentence display
- Track lastCommittedText to compute deltas from full accumulated text

Live mic transcription works ("what is the name of the screen")
but is intermittent — some runs produce tokens, others don't.
Needs investigation into CoreML inference determinism.

The EOU model needs continuous encoder/decoder context across
utterance boundaries. Resetting state after EOU kills the model's
ability to re-engage on subsequent speech.

Fixes:
- Remove guard !eouDetected in pushAudio (was killing session)
- Keep all encoder caches, decoder LSTM, tokens across EOU
- Use eouTokenOffset to decode only new tokens per segment
- Demo: no session replacement, just append FINAL sentences

Result: multi-utterance dictation works — "what", "do you",
"want to go" transcribed as three separate FINALs from one session.

The normalization was causing more problems than it solved.
With the correct 64 mel frame chunk size, the model handles
raw mic levels (rms 0.03-0.06) correctly.

Multi-utterance dictation working: "when you come home",
"feel you jump the street" — two separate FINALs from raw mic.

Model re-exported with NeMo's actual 320ms config:
chunk_size=[57,64], shift_size=[25,32], pre_encode_cache=[0,9]

vDSP_fft_zrip scales output by 2x vs standard DFT. The power
spectrum (magnitude squared) was 4x larger than torch.stft output,
adding +1.386 bias to all log-mel values. The model was trained
with unscaled DFT, so this mismatch reduced sensitivity.

Fix: multiply power spectrum by 0.25 in extractRaw().

Result: significantly improved sensitivity on normal-volume speech.
4 sentences transcribed from raw mic without normalization.
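The scaling bug is easy to see numerically: a 2x amplitude factor from vDSP_fft_zrip becomes 4x in the power spectrum, which is a constant ln(4) ≈ 1.386 offset in log space (the power value below is illustrative):

```python
import math

# vDSP_fft_zrip output is 2x a textbook DFT, so |X|^2 is 4x too large.
true_power = 0.37
vdsp_power = 4.0 * true_power

# In log-mel space that is a constant additive bias of ln(4):
assert abs(math.log(vdsp_power) - math.log(true_power) - math.log(4.0)) < 1e-12
assert abs(math.log(4.0) - 1.386) < 1e-3

# The fix: scale the power spectrum by 0.25 before the mel filterbank.
assert abs(0.25 * vdsp_power - true_power) < 1e-12
```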

Matches FluidAudio's implementation:
- Decoder LSTM state (h, c) as float32 (was float16)
- Joint network inputs as float32
- Encoder frame copy converts float16→float32
- argmax/logSoftmax operate on float32 directly

Reduces accumulated rounding errors over long streaming sessions.

- Encoder now has separate pre_cache [1,128,9] input and
  new_pre_cache output — model handles concatenation internally
- Remove manual mel prepending and cache rotation code
- Chunk samples = (melFrames-1) * hopLength = 10080 → exactly 64 mel frames
- All discrepancies with reference implementation resolved:
  symmetric Hann window, zero padding, centered FFT, FFT/4 scaling,
  float32 decoder/joint, separate pre_cache, 64 mel frames
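The chunk-size arithmetic above checks out, assuming the standard 16 kHz / 10 ms hop (hop_length = 160 samples) and a centered STFT, which produces 1 + samples // hop frames:

```python
hop_length = 160   # 10 ms hop at 16 kHz (assumed)
mel_frames = 64

chunk_samples = (mel_frames - 1) * hop_length
assert chunk_samples == 10080

# A centered STFT over n samples produces 1 + n // hop frames,
# so this chunk size yields exactly 64 mel frames, no more, no less.
assert 1 + chunk_samples // hop_length == mel_frames
```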

The model-managed pre_cache (returning audio_signal tail as new_pre_cache)
caused hallucination because it's not the encoder's actual subsampling cache.
Revert to manual mel prepending which produces correct transcription.

Keep the (melFrames-1)*hopLength chunk size fix for exact 64 mel frames.

ivan-digital changed the title from "Add streaming ASR with Parakeet EOU 120M" to "Add streaming ASR with Parakeet EOU 120M + DictateDemo" on Apr 9, 2026

The float32 decoder/joint changes caused dtype mismatch with the
v4 model. Reverted to exact working state from commit 3d94bfb.

Live mic still shows 'sim' hallucination despite E2E tests passing.
Root cause: model I/O spec differs from reference implementation.
Need to re-export model matching reference spec exactly:
- Separate pre_cache input with model-managed new_pre_cache output
- Encoder output [1,D,T] not [1,T,D]
- Float32 decoder h/c (not float16)
- Argmax baked into joint model

- Encoder: separate pre_cache, [B,D,T] output, new_pre_cache loopback
- Decoder: float32 h/c, names targets/h_in/c_in, output [1,640,1]
- Joint: argmax baked in, token_id output, [B,D,1] inputs
- RNNT decoder: strided [B,D,T] frame copy, maxSymbolsPerStep=2

WIP: batch transcription produces empty text — need to debug
encoder output handling (8 frames vs 4, encoded_length mismatch)

- Convert decoder output to float32 before joint (was float16 mismatch)
- Decode only last 4 frames of 8-frame encoder output (skip pre-cache)
- Add debug logging for encoder/joint output analysis

WIP: model still outputs blank. Python test passes with same model.
Likely mel computation difference between Swift and Python.

Revert to v4 model export (manual mel prepend, [B,T,D] encoder output,
float16 decoder/joint, raw logits) which passes all 7 E2E tests.

The v7 export (separate pre_cache, [B,D,T], float32, baked argmax)
produces blanks due to audio_length mismatch in the tracing wrapper
(full_length = audio_length + pre_cache_size causes wrong valid frame
calculation). This needs further investigation.

V4 model uploaded to HuggingFace. CoreML cache cleared.

- New docs/models/parakeet-streaming-asr.md and docs/inference/parakeet-streaming-asr-inference.md
- DictateDemo updates: VAD-driven force-finalize, regression tests, run-loop fix for menu-bar tracking mode
- Decoder fp16->fp32 cast and chunking fixes in StreamingSession/RNNTGreedyDecoder
- All 10 READMEs: add Parakeet-EOU-120M to model and memory tables, DictateDemo to demo apps, streaming model and inference links to architecture section

ivan-digital merged commit accef28 into main on Apr 11, 2026
1 check passed
ivan-digital deleted the feat/streaming-asr branch on April 11, 2026 at 06:01
Development

Successfully merging this pull request may close these issues.

Add streaming ASR with end-of-utterance detection (Parakeet EOU, CoreML)