Add streaming ASR with Parakeet EOU 120M + DictateDemo #188

Merged
ivan-digital merged 29 commits into main from feat/streaming-asr
Apr 11, 2026

Conversation

ivan-digital (Collaborator) commented Apr 5, 2026

Summary

Streaming ASR module using Parakeet EOU 120M on CoreML with end-of-utterance detection. Includes macOS dictation demo app.

ParakeetStreamingASR module

  • Cache-aware FastConformer-RNNT encoder processing 320ms chunks
  • RNNT greedy decoder with EOU token detection (ID 1024)
  • AsyncStream<PartialTranscript> API with persistent encoder/decoder state
  • Mel preprocessing matching NeMo reference: symmetric Hann window, zero center padding, centered FFT frame, power/4 scaling, no normalization
  • Pre-encode mel cache (manual prepend, 9 frames context)
  • Multi-utterance: no state reset after EOU, token offset tracking
  • Float32 decoder LSTM and joint inputs
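The NeMo-matching preprocessing steps listed above can be sketched in plain Python. The window and FFT sizes (win_length 400, n_fft 512) are assumptions consistent with the 56-zero centering figure given in a later commit; the real implementation lives in StreamingMelPreprocessor.

```python
import math

def hann_symmetric(n):
    # Symmetric Hann window: denominator n - 1, so both endpoints are zero.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def center_window(window, n_fft):
    # torch.stft centers the window inside the FFT frame with
    # (n_fft - win_length) // 2 zeros on each side.
    pad = (n_fft - len(window)) // 2
    return [0.0] * pad + window + [0.0] * (n_fft - len(window) - pad)

win = hann_symmetric(400)            # 25 ms window at 16 kHz (assumed)
frame = center_window(win, 512)      # 56 zeros on each side
assert len(frame) == 512
assert frame[:56] == [0.0] * 56 and frame[-56:] == [0.0] * 56
assert abs(win[0]) < 1e-12 and abs(win[-1]) < 1e-12
```

Zero center padding of the audio (constant, not reflect) and skipping per-feature normalization follow the same principle: reproduce exactly what the checkpoint saw in training.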

DictateDemo (macOS)

  • Window app with mic button, VAD indicator, live transcript
  • Silero VAD (CoreML) for speech/silence detection
  • Multi-sentence display with EOU-committed lines
  • Debug overlay (audio RMS, chunks processed, partials count)
  • Raw mic audio (no normalization needed with correct mel)

Model (aufklarer/Parakeet-EOU-120M-CoreML-INT8)

  • Encoder: INT8, 102 MB, input [1,128,73] (9 cache + 64 chunk)
  • Decoder: 7.5 MB, float32 h/c
  • Joint: 2.7 MB, float32 inputs
  • Total: 112 MB

Known limitations

  • English only (Parakeet EOU 120M is English-only)
  • Pre-encode cache is manual mel prepend (not model-managed loopback)
  • Model sensitivity varies with mic distance/volume

Test plan

  • 17 unit tests (config, vocab, mel, cache shapes, types)
  • 7 E2E tests (batch, streaming, session, latency, warmup, memory, model loading)
  • 6 demo E2E tests (streaming, VAD, multi-utterance, latency, real audio)
  • Mel reference test vs NeMo (max diff 0.0001)
  • Live mic transcription: multi-utterance dictation

Closes #186

Streaming ASR with end-of-utterance detection using Parakeet EOU 120M
on CoreML. Cache-aware FastConformer-RNNT encoder processes 320ms
chunks with persistent encoder/decoder state between calls.

Module structure:
- Configuration, Vocabulary, StreamingMelPreprocessor
- RNNTGreedyDecoder (no duration bins, EOU token detection)
- StreamingSession with AsyncStream<PartialTranscript> API
- SpeechRecognitionModel + ModelMemoryManageable conformance

17/17 unit tests pass. E2E inference produces blank tokens —
needs investigation into mel preprocessing match with NeMo,
pre_cache encoder input, and decoder output transposition.

- Add pre_cache mel-level context input to encoder
- Encoder output now [B,T,D] — use simple memcpy for frame extraction
- Decoder output now [B,U,D] — remove transpose workarounds
- Use shape[1] for frame count (was shape[2] for old [B,D,T] layout)

Remaining: batch transcription produces empty text because per-chunk
mel normalization differs from NeMo's whole-utterance normalization.
Need to port proven MelPreprocessor from ParakeetASR.

- Replace custom mel preprocessor with proven NeMo-matching DSP from
  ParakeetASR (Slaney filterbank, whole-utterance normalization, vDSP)
- Output float32 mel (encoder expects float32)
- Fix mel truncation to use actual array shape, not melLength
- 6/7 E2E tests pass: streaming transcription, latency (RTF 0.056),
  session, warmup, model loading, memory management

EOU token fires during pushAudio, returning the transcript as a
final partial and clearing accumulated tokens. The batch path was
discarding pushAudio results, so finalize() found empty tokens.

All 24 tests pass (17 unit + 7 E2E). Streaming latency RTF 0.056.
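For reference, the RTF (real-time factor) figures quoted throughout are processing time divided by audio duration; a quick illustration (the timings below are made-up, not measurements from this PR):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    # Real-time factor: values below 1.0 mean faster than real time.
    return processing_seconds / audio_seconds

# An RTF of 0.056 means 10 s of audio is processed in 0.56 s,
# i.e. roughly 18x real-time throughput.
assert abs(rtf(0.56, 10.0) - 0.056) < 1e-12
assert round(1.0 / rtf(0.56, 10.0)) == 18
```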

Menu bar app using Parakeet EOU 120M for real-time voice dictation.
Live partial transcripts update as you speak, EOU auto-commits
sentences, paste to frontmost app via Cmd+Shift+V.

- StreamingRecorder: 16kHz mic capture with audio level
- DictateViewModel: streaming session management, paste-to-app
- Menu bar dropdown + floating HUD with live transcript

ivan-digital changed the title from "Add streaming ASR with Parakeet EOU 120M (WIP)" to "Add streaming ASR with Parakeet EOU 120M" on Apr 5, 2026

- Move audio processing off audio thread via Sendable ASRProcessor
- Buffer audio and drain on background queue (was blocking audio thread)
- Don't clear tokens on EOU — keep accumulating for continuous dictation
- EOU marks sentence boundary, not hard reset

Token accumulation without reset caused partials to repeat the same
text forever. Restore allTokens.removeAll() on EOU — each utterance
is a fresh segment. Encoder/decoder LSTM state persists across EOU.

Removed unsafe cache reset (caused SIGBUS from concurrent memset).
Model self-recovers after ~10-20s silence naturally.

…Timer

- Add SpeechVAD for speech/silence indicator (CoreML Silero)
- Multi-sentence display: each EOU-committed sentence on own line
- DispatchSourceTimer for reliable processQueue scheduling
- Auto-load models on app launch
- VAD is UI-only — all audio feeds to encoder for cache continuity

Known: streaming per-chunk mel normalization limits real-time
transcription quality vs batch mode. Needs investigation into
running mel normalization or shared mel context across chunks.

- Switch from @Observable to ObservableObject + @Published (Combine)
  to fix the menu bar and HUD windows not re-rendering on property changes
- Add extractRaw() mel mode (no normalization) for streaming — fixes
  blank-token issue caused by per-chunk normalization mismatch
- Add extractStreaming() with running Welford normalization
- Auto-open HUD window on recording start
- Batch transcribeAudio uses whole-utterance normalization (all E2E pass)
- Streaming produces live partials and EOU-committed finals

Known limitation: after EOU commits a sentence, encoder cache needs
time to re-engage for next utterance. First sentence transcribes well,
subsequent sentences may be delayed or missed.
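The running normalization that extractStreaming() briefly used can be sketched with Welford's online mean/variance algorithm. The class below is an illustrative stand-in, not the actual Swift API (and per later commits, raw unnormalized mel is what the EOU model ultimately needs):

```python
class RunningNorm:
    """Welford's online algorithm: one-pass running mean and variance,
    so streaming chunks can be normalized without seeing the whole utterance."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x: float) -> float:
        var = self.m2 / self.n if self.n > 1 else 1.0
        return (x - self.mean) / (var ** 0.5 + 1e-5)

rn = RunningNorm()
for v in [1.0, 2.0, 3.0, 4.0]:
    rn.update(v)
assert abs(rn.mean - 2.5) < 1e-12      # running mean of 1..4
assert abs(rn.m2 / rn.n - 1.25) < 1e-12  # population variance
```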

After end-of-utterance, zero all encoder caches (preCache,
cacheLastChannel, cacheLastTime), reset decoder LSTM, re-prime
with blank token. Each utterance starts with fresh state.

Correct the streaming pipeline to match NeMo's cache-aware design:

- Pre-encode cache is mel-level, handled externally by the caller:
  save last 9 mel frames from each chunk, prepend to next chunk's
  mel input. Encoder input is [1, 128, 43] (9 cache + 34 chunk).
- Remove pre_cache as separate encoder input (was incorrect)
- Use extractRaw() (no normalization) for all paths — EOU model
  was trained with normalize: "NA"
- Full session reset after EOU: encoder caches, decoder LSTM,
  pre-encode mel cache, accumulated tokens
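A minimal sketch of the caller-managed pre-encode cache described above, assuming 128 mel bins and the 34-frame chunk used at this point in the history (later re-exported to 64 frames); names are illustrative:

```python
PRE_CACHE_FRAMES = 9  # mel frames of left context carried between chunks

class MelPreCache:
    """Keeps the last 9 mel frames of each chunk and prepends them
    to the next chunk, so the encoder sees [9 cache + chunk] frames."""

    def __init__(self, n_mels: int = 128):
        # Before the first chunk the cache is silence (zeros).
        self.cache = [[0.0] * n_mels for _ in range(PRE_CACHE_FRAMES)]

    def prepend(self, chunk):
        # chunk: list of mel frames, each a list of n_mels values;
        # assumed to hold at least PRE_CACHE_FRAMES frames.
        framed = self.cache + chunk
        self.cache = chunk[-PRE_CACHE_FRAMES:]
        return framed

pc = MelPreCache()
first = pc.prepend([[0.1] * 128 for _ in range(34)])
assert len(first) == 43                   # matches encoder input [1, 128, 43]
second = pc.prepend([[0.2] * 128 for _ in range(34)])
assert second[:9] == [[0.1] * 128] * 9    # previous chunk's tail as context
```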

24/24 tests pass (17 unit + 7 E2E). RTF 0.103 (10x real-time).

WIP: menu bar button action not triggering consistently.
The streaming pipeline works (24/24 E2E tests pass) but the
SwiftUI MenuBarExtra UI has rendering/interaction issues.

6/6 tests pass:
- Streaming session produces text from real audio
- Multi-utterance with silence gaps (no crash)
- VAD detects speech vs silence correctly
- Chunk latency RTF 0.006 (166x real-time)

The streaming pipeline works end-to-end. Menu bar UI has
separate SwiftUI rendering issues to debug.

The EOU model was trained with NeMo's streaming config which uses:
- Symmetric Hann window (divides by N-1, not N)
- Zero center padding (not reflect padding)
- No per-feature normalization

Our mel used periodic Hann + reflect padding, producing mel values
the model didn't recognize. Fixed extractRaw() to match NeMo.

Result: mic audio now transcribes correctly across all chunk sizes.
24/24 main tests pass + demo mic test produces text.

- Add mic audio normalization (mic levels too low for EOU model)
- Normalize before VAD and ASR for consistent levels
- Switch demo from MenuBarExtra to regular Window (more reliable)
- EOU reset: only clear tokens, keep encoder/decoder state
- Debug audio capture to WAV for pipeline validation
- Added testDebugWavLoading and testDebugAudioSmallChunks tests

Findings:
- EOU model is English-only (blanks on Russian speech)
- Periodic Hann + reflect padding works in live demo
- Model produces text from live mic ("hello what are you")
- Needs mel A/B test vs NeMo reference in models repo

…adding

Three mel computation bugs that caused the EOU model to produce blanks:

1. Window centering: torch.stft centers the Hann window in the FFT frame
   with (n_fft - win_length) / 2 = 56 zeros on each side. Our code put
   the window at offset 0 with 112 trailing zeros — wrong phase/magnitude.
2. Symmetric Hann window: NeMo uses periodic=False (divides by N-1).
3. Zero center padding: NeMo uses pad_mode='constant', not reflect.

Validated against NeMo reference: max mel difference 0.0001 (was 2.26).
Demo transcribes live mic: "what is the name of the screen"

- Don't clear tokens on EOU — accumulate for continuous dictation
- UI extracts delta text (new since last commit) for sentence display
- Track lastCommittedText to compute deltas from full accumulated text

Live mic transcription works ("what is the name of the screen")
but is intermittent — some runs produce tokens, others don't.
Needs investigation into CoreML inference determinism.

The EOU model needs continuous encoder/decoder context across
utterance boundaries. Resetting state after EOU kills the model's
ability to re-engage on subsequent speech.

Fixes:
- Remove guard !eouDetected in pushAudio (was killing session)
- Keep all encoder caches, decoder LSTM, tokens across EOU
- Use eouTokenOffset to decode only new tokens per segment
- Demo: no session replacement, just append FINAL sentences

Result: multi-utterance dictation works — "what", "do you",
"want to go" transcribed as three separate FINALs from one session.

The normalization was causing more problems than it solved.
With the correct 64 mel frame chunk size, the model handles
raw mic levels (rms 0.03-0.06) correctly.

Multi-utterance dictation working: "when you come home",
"feel you jump the street" — two separate FINALs from raw mic.

Model re-exported with NeMo's actual 320ms config:
chunk_size=[57,64], shift_size=[25,32], pre_encode_cache=[0,9]

vDSP_fft_zrip scales output by 2x vs standard DFT. The power
spectrum (magnitude squared) was 4x larger than torch.stft output,
adding +1.386 bias to all log-mel values. The model was trained
with unscaled DFT, so this mismatch reduced sensitivity.

Fix: multiply power spectrum by 0.25 in extractRaw().

Result: significantly improved sensitivity on normal-volume speech.
4 sentences transcribed from raw mic without normalization.
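The scaling bug is easy to see numerically: a 2x amplitude factor from vDSP_fft_zrip becomes 4x in the power spectrum, which is a constant ln(4) ≈ 1.386 offset in log space (the power value below is illustrative):

```python
import math

# vDSP_fft_zrip output is 2x a textbook DFT, so |X|^2 is 4x too large.
true_power = 0.37
vdsp_power = 4.0 * true_power

# In log-mel space that is a constant additive bias of ln(4):
assert abs(math.log(vdsp_power) - math.log(true_power) - math.log(4.0)) < 1e-12
assert abs(math.log(4.0) - 1.386) < 1e-3

# The fix: scale the power spectrum by 0.25 before the mel filterbank.
assert abs(0.25 * vdsp_power - true_power) < 1e-12
```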

Matches FluidAudio's implementation:
- Decoder LSTM state (h, c) as float32 (was float16)
- Joint network inputs as float32
- Encoder frame copy converts float16→float32
- argmax/logSoftmax operate on float32 directly

Reduces accumulated rounding errors over long streaming sessions.

- Encoder now has separate pre_cache [1,128,9] input and
  new_pre_cache output — model handles concatenation internally
- Remove manual mel prepending and cache rotation code
- Chunk samples = (melFrames-1) * hopLength = 10080 → exactly 64 mel frames
- All discrepancies with reference implementation resolved:
  symmetric Hann window, zero padding, centered FFT, FFT/4 scaling,
  float32 decoder/joint, separate pre_cache, 64 mel frames
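The chunk-size arithmetic above checks out, assuming the standard 16 kHz / 10 ms hop (hop_length = 160 samples) and a centered STFT, which produces 1 + samples // hop frames:

```python
hop_length = 160   # 10 ms hop at 16 kHz (assumed)
mel_frames = 64

chunk_samples = (mel_frames - 1) * hop_length
assert chunk_samples == 10080

# A centered STFT over n samples produces 1 + n // hop frames,
# so this chunk size yields exactly 64 mel frames, no more, no less.
assert 1 + chunk_samples // hop_length == mel_frames
```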

The model-managed pre_cache (returning audio_signal tail as new_pre_cache)
caused hallucination because it's not the encoder's actual subsampling cache.
Revert to manual mel prepending which produces correct transcription.

Keep the (melFrames-1)*hopLength chunk size fix for exact 64 mel frames.

ivan-digital changed the title from "Add streaming ASR with Parakeet EOU 120M" to "Add streaming ASR with Parakeet EOU 120M + DictateDemo" on Apr 9, 2026

The float32 decoder/joint changes caused dtype mismatch with the
v4 model. Reverted to exact working state from commit 3d94bfb.

Live mic still shows 'sim' hallucination despite E2E tests passing.
Root cause: model I/O spec differs from reference implementation.
Need to re-export model matching reference spec exactly:
- Separate pre_cache input with model-managed new_pre_cache output
- Encoder output [1,D,T] not [1,T,D]
- Float32 decoder h/c (not float16)
- Argmax baked into joint model

- Encoder: separate pre_cache, [B,D,T] output, new_pre_cache loopback
- Decoder: float32 h/c, names targets/h_in/c_in, output [1,640,1]
- Joint: argmax baked in, token_id output, [B,D,1] inputs
- RNNT decoder: strided [B,D,T] frame copy, maxSymbolsPerStep=2

WIP: batch transcription produces empty text — need to debug
encoder output handling (8 frames vs 4, encoded_length mismatch)

- Convert decoder output to float32 before joint (was float16 mismatch)
- Decode only last 4 frames of 8-frame encoder output (skip pre-cache)
- Add debug logging for encoder/joint output analysis

WIP: model still outputs blank. Python test passes with same model.
Likely mel computation difference between Swift and Python.

Revert to v4 model export (manual mel prepend, [B,T,D] encoder output,
float16 decoder/joint, raw logits) which passes all 7 E2E tests.

The v7 export (separate pre_cache, [B,D,T], float32, baked argmax)
produces blanks due to audio_length mismatch in the tracing wrapper
(full_length = audio_length + pre_cache_size causes wrong valid frame
calculation). This needs further investigation.

V4 model uploaded to HuggingFace. CoreML cache cleared.

- New docs/models/parakeet-streaming-asr.md and docs/inference/parakeet-streaming-asr-inference.md
- DictateDemo updates: VAD-driven force-finalize, regression tests, run-loop fix for menu-bar tracking mode
- Decoder fp16->fp32 cast and chunking fixes in StreamingSession/RNNTGreedyDecoder
- All 10 READMEs: add Parakeet-EOU-120M to model and memory tables, DictateDemo to demo apps, streaming model and inference links to architecture section

ivan-digital merged commit accef28 into main on Apr 11, 2026
1 check passed
ivan-digital deleted the feat/streaming-asr branch on April 11, 2026 at 06:01
Development

Successfully merging this pull request may close these issues.

Add streaming ASR with end-of-utterance detection (Parakeet EOU, CoreML)