Add streaming ASR with Parakeet EOU 120M + DictateDemo #188
Merged
ivan-digital merged 29 commits into main on Apr 11, 2026
Conversation
Streaming ASR with end-of-utterance detection using Parakeet EOU 120M on CoreML. The cache-aware FastConformer-RNNT encoder processes 320 ms chunks with persistent encoder/decoder state between calls.

Module structure:
- Configuration, Vocabulary, StreamingMelPreprocessor
- RNNTGreedyDecoder (no duration bins, EOU token detection)
- StreamingSession with AsyncStream<PartialTranscript> API
- SpeechRecognitionModel + ModelMemoryManageable conformance

17/17 unit tests pass. E2E inference produces blank tokens — needs investigation into the mel preprocessing match with NeMo, the pre_cache encoder input, and decoder output transposition.
- Add pre_cache mel-level context input to encoder
- Encoder output now [B,T,D] — use simple memcpy for frame extraction
- Decoder output now [B,U,D] — remove transpose workarounds
- Use shape[1] for frame count (was shape[2] for old [B,D,T] layout)

Remaining: batch transcription produces empty text because per-chunk mel normalization differs from NeMo's whole-utterance normalization. Need to port the proven MelPreprocessor from ParakeetASR.
- Replace custom mel preprocessor with the proven NeMo-matching DSP from ParakeetASR (Slaney filterbank, whole-utterance normalization, vDSP)
- Output float32 mel (the encoder expects float32)
- Fix mel truncation to use the actual array shape, not melLength

6/7 E2E tests pass: streaming transcription, latency (RTF 0.056), session, warmup, model loading, memory management.
The EOU token now fires during pushAudio, returning the transcript as a final partial and clearing accumulated tokens. The batch path was discarding pushAudio results, so finalize() found no tokens. All 24 tests pass (17 unit + 7 E2E); streaming latency RTF 0.056.
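The commit-on-EOU behavior described above can be modeled in a few lines. This is a minimal Python sketch of the token-accumulation logic, not the actual Swift implementation; the class, token ids, and method names (`push_chunk`, `EOU_TOKEN`, `BLANK`) are all hypothetical:

```python
EOU_TOKEN = 1024  # hypothetical end-of-utterance token id
BLANK = 1025      # hypothetical RNNT blank id

class Session:
    def __init__(self):
        self.tokens = []   # tokens accumulated for the current utterance
        self.finals = []   # committed FINAL segments

    def push_chunk(self, new_tokens):
        """Accumulate decoded tokens; when EOU fires, commit a FINAL
        segment and clear — each utterance is a fresh segment."""
        for t in new_tokens:
            if t == EOU_TOKEN:
                if self.tokens:
                    self.finals.append(list(self.tokens))
                self.tokens.clear()
            elif t != BLANK:
                self.tokens.append(t)
        return list(self.tokens)  # current partial transcript tokens

    def finalize(self):
        """Return leftover tokens. Before the fix, pushAudio results were
        discarded, so this always came back empty."""
        leftover = list(self.tokens)
        self.tokens.clear()
        return leftover

s = Session()
s.push_chunk([5, 7, BLANK, EOU_TOKEN])  # EOU fires mid-push → FINAL [5, 7]
s.push_chunk([9])                        # next utterance begins
```

The key fix is that the EOU commit happens inside `push_chunk` itself, so the batch path no longer has to wait for `finalize()` to surface text.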
Menu bar app using Parakeet EOU 120M for real-time voice dictation. Live partial transcripts update as you speak, EOU auto-commits sentences, and paste goes to the frontmost app via Cmd+Shift+V.
- StreamingRecorder: 16 kHz mic capture with audio level
- DictateViewModel: streaming session management, paste-to-app
- Menu bar dropdown + floating HUD with live transcript
- Move audio processing off the audio thread via a Sendable ASRProcessor
- Buffer audio and drain on a background queue (was blocking the audio thread)
- Don't clear tokens on EOU — keep accumulating for continuous dictation
- EOU marks a sentence boundary, not a hard reset
Token accumulation without reset caused partials to repeat the same text forever. Restore allTokens.removeAll() on EOU — each utterance is a fresh segment. Encoder/decoder LSTM state persists across EOU. Removed unsafe cache reset (caused SIGBUS from concurrent memset). Model self-recovers after ~10-20s silence naturally.
…Timer
- Add SpeechVAD for speech/silence indicator (CoreML Silero)
- Multi-sentence display: each EOU-committed sentence on its own line
- DispatchSourceTimer for reliable processQueue scheduling
- Auto-load models on app launch
- VAD is UI-only — all audio feeds to the encoder for cache continuity

Known: streaming per-chunk mel normalization limits real-time transcription quality vs batch mode. Needs investigation into running mel normalization or a shared mel context across chunks.
- Switch from @Observable to ObservableObject + @Published (Combine) to fix the menu bar and HUD window not re-rendering on property changes
- Add extractRaw() mel mode (no normalization) for streaming — fixes the blank-token issue caused by the per-chunk normalization mismatch
- Add extractStreaming() with running Welford normalization
- Auto-open the HUD window on recording start
- Batch transcribeAudio uses whole-utterance normalization (all E2E pass)
- Streaming produces live partials and EOU-committed finals

Known limitation: after EOU commits a sentence, the encoder cache needs time to re-engage for the next utterance. The first sentence transcribes well; subsequent sentences may be delayed or missed.
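The running Welford normalization mentioned above maintains a per-feature mean and variance incrementally, so streaming chunks can be normalized without seeing the whole utterance. A minimal single-feature Python sketch of the idea (the class and method names are illustrative, not the extractStreaming() API):

```python
import math

class RunningNorm:
    """Running mean/variance via Welford's algorithm, updated one
    value at a time — the idea behind streaming mel normalization."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Sample std with a small epsilon to avoid division by zero
        std = math.sqrt(self.m2 / max(self.n - 1, 1)) if self.n > 1 else 1.0
        return (x - self.mean) / (std + 1e-5)

rn = RunningNorm()
for v in [1.0, 2.0, 3.0, 4.0]:
    rn.update(v)
```

Unlike per-chunk normalization, the statistics only ever grow more accurate as audio arrives, which is why early chunks behave differently from late ones — the mismatch the commit above works around with extractRaw().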
After end-of-utterance, zero all encoder caches (preCache, cacheLastChannel, cacheLastTime), reset decoder LSTM, re-prime with blank token. Each utterance starts with fresh state.
Correct the streaming pipeline to match NeMo's cache-aware design:
- The pre-encode cache is mel-level and handled externally by the caller: save the last 9 mel frames from each chunk and prepend them to the next chunk's mel input. Encoder input is [1, 128, 43] (9 cache + 34 chunk).
- Remove pre_cache as a separate encoder input (was incorrect)
- Use extractRaw() (no normalization) for all paths — the EOU model was trained with normalize: "NA"
- Full session reset after EOU: encoder caches, decoder LSTM, pre-encode mel cache, accumulated tokens

24/24 tests pass (17 unit + 7 E2E). RTF 0.103 (10x real-time).
WIP: menu bar button action not triggering consistently. The streaming pipeline works (24/24 E2E tests pass) but the SwiftUI MenuBarExtra UI has rendering/interaction issues.
6/6 tests pass:
- Streaming session produces text from real audio
- Multi-utterance with silence gaps (no crash)
- VAD detects speech vs silence correctly
- Chunk latency RTF 0.006 (166x real-time)

The streaming pipeline works end-to-end. The menu bar UI has separate SwiftUI rendering issues to debug.
The EOU model was trained with NeMo's streaming config, which uses:
- Symmetric Hann window (divides by N-1, not N)
- Zero center padding (not reflect padding)
- No per-feature normalization

Our mel used periodic Hann + reflect padding, producing mel values the model didn't recognize. Fixed extractRaw() to match NeMo. Result: mic audio now transcribes correctly across all chunk sizes. 24/24 main tests pass and the demo mic test produces text.
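The symmetric-vs-periodic distinction is a one-character difference in the denominator but shifts every window sample. A small Python sketch of both variants:

```python
import math

def hann(N, periodic):
    """Hann window. periodic=True divides by N (the usual STFT default);
    periodic=False (symmetric) divides by N - 1, as NeMo's config uses."""
    denom = N if periodic else N - 1
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / denom))
            for n in range(N)]

sym = hann(400, periodic=False)  # symmetric: both endpoints are zero
per = hann(400, periodic=True)   # periodic: last sample is nonzero
```

With N = 400 the two windows differ by only ~0.25 % per sample, yet as the commit above shows, the EOU model is sensitive enough that the mismatch (combined with reflect padding) produced unrecognizable mel values.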
- Add mic audio normalization (mic levels too low for EOU model)
- Normalize before VAD and ASR for consistent levels
- Switch demo from MenuBarExtra to regular Window (more reliable)
- EOU reset: only clear tokens, keep encoder/decoder state
- Debug audio capture to WAV for pipeline validation
- Added testDebugWavLoading and testDebugAudioSmallChunks tests
Findings:
- EOU model is English-only (blanks on Russian speech)
- Periodic Hann + reflect padding works in live demo
- Model produces text from live mic ("hello what are you")
- Needs mel A/B test vs NeMo reference in models repo
…adding

Three mel computation bugs caused the EOU model to produce blanks:
1. Window centering: torch.stft centers the Hann window in the FFT frame with (n_fft - win_length) / 2 = 56 zeros on each side. Our code put the window at offset 0 with 112 trailing zeros — wrong phase/magnitude.
2. Symmetric Hann window: NeMo uses periodic=False (divides by N-1).
3. Zero center padding: NeMo uses pad_mode='constant', not reflect.

Validated against the NeMo reference: max mel difference 0.0001 (was 2.26). The demo transcribes live mic: "what is the name of the screen".
- Don't clear tokens on EOU — accumulate for continuous dictation
- UI extracts delta text (new since last commit) for sentence display
- Track lastCommittedText to compute deltas from full accumulated text
Live mic transcription works ("what is the name of the screen")
but is intermittent — some runs produce tokens, others don't.
Needs investigation into CoreML inference determinism.
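The delta-text extraction from the two commits above (lastCommittedText → new text only) reduces to a prefix check. A Python sketch; the function name and fallback behavior are illustrative, not the DictateDemo API:

```python
def delta_text(accumulated, last_committed):
    """Return only the text added since the last committed sentence,
    assuming the accumulated transcript grows append-only. If the
    prefix no longer matches (e.g. after a session reset), fall back
    to the full accumulated text."""
    if accumulated.startswith(last_committed):
        return accumulated[len(last_committed):].strip()
    return accumulated

full = "what is the name of the screen"
committed = "what is the name"
new_part = delta_text(full, committed)  # text since the last commit
```

After displaying `new_part`, the caller would update `last_committed` to the full accumulated text so the next EOU yields only the next sentence.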
The EOU model needs continuous encoder/decoder context across utterance boundaries. Resetting state after EOU kills the model's ability to re-engage on subsequent speech.

Fixes:
- Remove guard !eouDetected in pushAudio (was killing the session)
- Keep all encoder caches, decoder LSTM, and tokens across EOU
- Use eouTokenOffset to decode only new tokens per segment
- Demo: no session replacement, just append FINAL sentences

Result: multi-utterance dictation works — "what", "do you", "want to go" transcribed as three separate FINALs from one session.
The normalization was causing more problems than it solved. With the correct 64-mel-frame chunk size, the model handles raw mic levels (RMS 0.03–0.06) correctly. Multi-utterance dictation works: "when you come home", "feel you jump the street" — two separate FINALs from the raw mic. Model re-exported with NeMo's actual 320 ms config: chunk_size=[57,64], shift_size=[25,32], pre_encode_cache=[0,9].
vDSP_fft_zrip scales output by 2x vs standard DFT. The power spectrum (magnitude squared) was 4x larger than torch.stft output, adding +1.386 bias to all log-mel values. The model was trained with unscaled DFT, so this mismatch reduced sensitivity. Fix: multiply power spectrum by 0.25 in extractRaw(). Result: significantly improved sensitivity on normal-volume speech. 4 sentences transcribed from raw mic without normalization.
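The arithmetic behind the +1.386 bias follows directly from the 2x amplitude scaling: doubling the spectrum quadruples the power, and log of 4 is ~1.386. A quick Python check of the numbers in the commit above:

```python
import math

# vDSP_fft_zrip returns spectra scaled by 2 relative to a textbook DFT.
scale = 2.0
power_ratio = scale ** 2          # magnitude squared → 4x power
log_bias = math.log(power_ratio)  # constant offset added to every log-mel value

# The extractRaw() fix: multiply the power spectrum by 0.25,
# exactly cancelling the 4x before the log is taken.
corrected_ratio = power_ratio * 0.25
```

Because the bias is a constant additive offset in log-mel space, a model trained with whole-utterance normalization would absorb it into the mean; with normalize: "NA" it lands directly on the features, which is why it degraded sensitivity here.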
Matches FluidAudio's implementation:
- Decoder LSTM state (h, c) as float32 (was float16)
- Joint network inputs as float32
- Encoder frame copy converts float16 → float32
- argmax/logSoftmax operate on float32 directly

Reduces accumulated rounding errors over long streaming sessions.
- Encoder now has a separate pre_cache [1,128,9] input and a new_pre_cache output — the model handles concatenation internally
- Remove manual mel prepending and cache rotation code
- Chunk samples = (melFrames - 1) * hopLength = 10080 → exactly 64 mel frames
- All discrepancies with the reference implementation resolved: symmetric Hann window, zero padding, centered FFT, FFT/4 scaling, float32 decoder/joint, separate pre_cache, 64 mel frames
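The chunk-size formula above comes from how centered framing counts frames: with center padding, frames = samples // hop + 1, so picking samples = (frames - 1) * hop yields an exact frame count. A quick check, assuming the standard 160-sample hop (10 ms at 16 kHz, which matches the 10080 figure in the commit):

```python
HOP_LENGTH = 160   # assumed: 10 ms hop at 16 kHz
MEL_FRAMES = 64    # target mel frames per chunk

# With center padding, a centered STFT produces samples // hop + 1 frames,
# so (frames - 1) * hop samples gives exactly MEL_FRAMES frames.
chunk_samples = (MEL_FRAMES - 1) * HOP_LENGTH
frames = chunk_samples // HOP_LENGTH + 1
```

An off-by-one here (e.g. feeding frames * hop samples) would produce 65 mel frames and silently misalign the encoder's fixed [1, 128, 9] cache geometry.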
This reverts commit 0a2aa16.
The model-managed pre_cache (returning audio_signal tail as new_pre_cache) caused hallucination because it's not the encoder's actual subsampling cache. Revert to manual mel prepending which produces correct transcription. Keep the (melFrames-1)*hopLength chunk size fix for exact 64 mel frames.
The float32 decoder/joint changes caused a dtype mismatch with the v4 model. Reverted to the exact working state from commit 3d94bfb. Live mic still shows a 'sim' hallucination despite E2E tests passing. Root cause: the model I/O spec differs from the reference implementation. Need to re-export the model matching the reference spec exactly:
- Separate pre_cache input with a model-managed new_pre_cache output
- Encoder output [1,D,T], not [1,T,D]
- Float32 decoder h/c (not float16)
- Argmax baked into the joint model
- Encoder: separate pre_cache, [B,D,T] output, new_pre_cache loopback
- Decoder: float32 h/c, names targets/h_in/c_in, output [1,640,1]
- Joint: argmax baked in, token_id output, [B,D,1] inputs
- RNNT decoder: strided [B,D,T] frame copy, maxSymbolsPerStep=2

WIP: batch transcription produces empty text — need to debug encoder output handling (8 frames vs 4, encoded_length mismatch).
- Convert decoder output to float32 before the joint (was a float16 mismatch)
- Decode only the last 4 frames of the 8-frame encoder output (skip the pre-cache)
- Add debug logging for encoder/joint output analysis

WIP: the model still outputs blank. The Python test passes with the same model, so the likely cause is a mel computation difference between Swift and Python.
Revert to v4 model export (manual mel prepend, [B,T,D] encoder output, float16 decoder/joint, raw logits) which passes all 7 E2E tests. The v7 export (separate pre_cache, [B,D,T], float32, baked argmax) produces blanks due to audio_length mismatch in the tracing wrapper (full_length = audio_length + pre_cache_size causes wrong valid frame calculation). This needs further investigation. V4 model uploaded to HuggingFace. CoreML cache cleared.
- New docs/models/parakeet-streaming-asr.md and docs/inference/parakeet-streaming-asr-inference.md
- DictateDemo updates: VAD-driven force-finalize, regression tests, run-loop fix for menu-bar tracking mode
- Decoder fp16 → fp32 cast and chunking fixes in StreamingSession/RNNTGreedyDecoder
- All 10 READMEs: add Parakeet-EOU-120M to the model and memory tables, DictateDemo to the demo apps, and streaming model/inference links to the architecture section
Summary
Streaming ASR module using Parakeet EOU 120M on CoreML with end-of-utterance detection. Includes macOS dictation demo app.
ParakeetStreamingASR module
- AsyncStream<PartialTranscript> API with persistent encoder/decoder state

DictateDemo (macOS)
Model (aufklarer/Parakeet-EOU-120M-CoreML-INT8)
Known limitations
Test plan
Closes #186