Commit 275d528 — Merge pull request #6 from TrevorS/perf/gpu-penalty-mask

2 parents: 21ac025 + 3b12811

26 files changed: 2824 additions & 268 deletions

CLAUDE.md

Lines changed: 9 additions & 2 deletions

````diff
@@ -16,26 +16,30 @@ cargo bench # Criterion micro-benchmarks (no
 ```
 
 Python scripts in `scripts/` are linted with:
+
 ```bash
 uvx ruff format --check scripts/
 uvx ruff check scripts/
 ```
 
 Pre-commit (runs both Rust and Python checks):
+
 ```bash
 make pre-commit
 ```
 
 ## Profiling & Benchmarks
 
 Model weights required. Run inside Docker for CUDA:
+
 ```bash
 make profile-chrome MODEL_DIR=test_data/models/1.7B-CustomVoice
 make profile-flamegraph MODEL_DIR=test_data/models/1.7B-CustomVoice
 make audit-gpu-syncs
 ```
 
 E2E benchmarks:
+
 ```bash
 cargo run --release --features cuda,cli --bin e2e_bench -- \
   --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 2 --streaming
@@ -47,11 +51,12 @@ Three-stage TTS pipeline, all in `src/`:
 
 1. **TalkerModel** (`models/talker.rs`) — 28-layer transformer generating semantic tokens from text. Uses MRoPE, KV caching. 0.6B models: hidden=1024, 1.7B models: hidden=2048.
 
-2. **CodePredictor** (`models/code_predictor.rs`) — 5-layer transformer generating 15 acoustic codes per semantic token. Always hidden=1024; 1.7B models use `small_to_mtp_projection` to bridge from talker's 2048-dim space. Called every frame during generation.
+1. **CodePredictor** (`models/code_predictor.rs`) — 5-layer transformer generating 15 acoustic codes per semantic token. Always hidden=1024; 1.7B models use `small_to_mtp_projection` to bridge from talker's 2048-dim space. Called every frame during generation.
 
-3. **Decoder12Hz** (`models/codec/decoder_12hz.rs`) — ConvNeXt + transposed convolution decoder converting 16-codebook codes to 24kHz audio. Always F32.
+1. **Decoder12Hz** (`models/codec/decoder_12hz.rs`) — ConvNeXt + transposed convolution decoder converting 16-codebook codes to 24kHz audio. Always F32.
 
 The generation loop (`lib.rs::generate_codes`) ties them together:
+
 ```
 For each frame:
 1. CodePredictor generates 15 acoustic codes from last_hidden + semantic embedding
@@ -73,6 +78,7 @@ For each frame:
 ## Model Variants
 
 Five variants, auto-detected from `config.json`:
+
 - **Base** (0.6B, 1.7B): Voice cloning via ECAPA-TDNN speaker encoder. ICL mode uses speech encoder + reference text.
 - **CustomVoice** (0.6B, 1.7B): 9 preset speakers (Ryan, Serena, etc.) via discrete speaker token IDs.
 - **VoiceDesign** (1.7B only): Text-described voices via instruct prompt with ChatML framing.
@@ -93,6 +99,7 @@ Five variants, auto-detected from `config.json`:
 ## Codec Token IDs
 
 Generation uses codec vocabulary (0–3071), not text vocabulary:
+
 - EOS: 2150 (generation stops here)
 - BOS: 2149, PAD: 2148
 - Speakers: Ryan=3061, Serena=3066, etc.
````
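The codec token-ID convention in that last hunk can be sketched directly. The constant values come from the list above; the function name and the flat-iterator loop shape are illustrative, not the crate's actual API:

```rust
// Codec vocabulary special tokens (values from the CLAUDE.md list above).
const CODEC_EOS: u32 = 2150;
const CODEC_BOS: u32 = 2149;
const CODEC_PAD: u32 = 2148;

/// Collects semantic tokens until EOS, skipping framing tokens.
/// `stream` stands in for the per-frame sampler output; this helper is
/// a hypothetical illustration, not part of the qwen3-tts-rs API.
fn collect_until_eos(stream: impl IntoIterator<Item = u32>) -> Vec<u32> {
    stream
        .into_iter()
        .take_while(|&t| t != CODEC_EOS)          // generation stops at EOS
        .filter(|&t| t != CODEC_BOS && t != CODEC_PAD) // drop framing tokens
        .collect()
}

fn main() {
    let generated = collect_until_eos([2149, 17, 42, 99, 2150, 7]);
    assert_eq!(generated, vec![17, 42, 99]); // EOS cuts off the trailing 7
}
```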

Cargo.toml

Lines changed: 1 addition & 0 deletions

```diff
@@ -67,6 +67,7 @@ audioadapter-buffers = "2.0.0"
 
 # Profiling (optional)
 tracing-chrome = { version = "0.7", optional = true }
+half = "2.7.1"
 
 [dev-dependencies]
 criterion = "0.8"
```

Makefile

Lines changed: 9 additions & 1 deletion

```diff
@@ -1,4 +1,4 @@
-.PHONY: lint fmt pre-commit pre-commit-install profile-chrome profile-flamegraph profile-nsys audit-gpu-syncs
+.PHONY: lint fmt pre-commit pre-commit-install profile-chrome profile-flamegraph profile-nsys audit-gpu-syncs test-kernel count-kernels
 
 MODEL_DIR ?= test_data
 
@@ -33,3 +33,11 @@ profile-nsys:
 
 audit-gpu-syncs:
 	@bash scripts/audit-gpu-syncs.sh
+
+# ── Kernel Development ──────────────────────────────────────────────────
+
+test-kernel:
+	@bash scripts/test-kernel.sh $(NAME)
+
+count-kernels:
+	@bash scripts/count-kernels.sh $(MODEL_DIR)
```

README.md

Lines changed: 17 additions & 8 deletions

```diff
@@ -6,6 +6,15 @@ All code in this repo was written with [Claude Code](https://claude.ai/code). Th
 
 ## Changelog
 
+### 0.4.0
+
+- Pre-allocated KV cache with InplaceOp2 (zero-copy CUDA writes, no Tensor::cat)
+- GPU-side repetition penalty mask (incremental slice_assign, eliminates growing CPU transfer)
+- Deferred acoustic codes transfer (single bulk GPU→CPU at end of generation)
+- Fused residual + RMSNorm CUDA kernel
+- GPU→CPU syncs reduced from 3/frame to 1/frame (4-byte EOS check)
+- Non-streaming RTF: 0.48–0.67 across all variants (97-100% of theoretical throughput)
+
 ### 0.3.0
 
 - GPU-side sampling: batched argmax, on-device top-k/top-p/repetition penalty
@@ -52,16 +61,16 @@ Thanks to [u/rngesius](https://www.reddit.com/r/LocalLLaMA/comments/1qqvb79/comm
 Benchmarked on an NVIDIA DGX Spark (GB10 Blackwell, ARM Cortex-X925, 120 GB unified memory).
 Default generation parameters, seed 42, 2 warmup + 3 timed iterations.
 
-| Model | RTF (short) | RTF (long) | Tok/s | TTFA | Memory |
-|-------|-------------|------------|-------|------|--------|
-| **0.6B Base (CUDA BF16)** | **0.56** | **0.68** | 22.2 | 448 ms | 814 MB |
-| **1.7B Base (CUDA BF16)** | **0.72** | **0.74** | 17.3 | 590 ms | 761 MB |
-| **1.7B CustomVoice (CUDA BF16)** | **0.72** | **0.75** | 17.3 | 585 ms | 761 MB |
-| **1.7B VoiceDesign (CUDA BF16)** | **0.72** | **0.75** | 17.3 | 585 ms | 761 MB |
-| 1.7B CustomVoice (CPU F32) | 5.39 | 6.48 | 2.1 | | 9.1 GB |
+| Model | RTF (short) | RTF (long) | Tok/s | Memory |
+|-------|-------------|------------|-------|--------|
+| **0.6B Base (CUDA BF16)** | **0.48** | **0.50** | 25.9 | 767 MB |
+| **1.7B Base (CUDA BF16)** | **0.65** | **0.65** | 19.4 | 767 MB |
+| **1.7B CustomVoice (CUDA BF16)** | **0.64** | **0.67** | 19.2 | 772 MB |
+| **1.7B VoiceDesign (CUDA BF16)** | **0.64** | **0.66** | 19.3 | 770 MB |
+| 1.7B CustomVoice (CPU F32) | 5.39 | 6.48 | 2.1 | 9.1 GB |
 
 RTF (real-time factor) = wall-clock / audio duration. **< 1.0 is faster than real-time.**
-TTFA = time to first audio chunk via streaming.
+Non-streaming results shown above. Streaming adds ~8-12% overhead with TTFA ~444 ms (0.6B) / ~580 ms (1.7B).
 
 See [docs/BENCHMARKS.md](docs/BENCHMARKS.md) for full results, test corpus, micro-benchmarks, and reproduction instructions.
 
```
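The RTF metric the README table reports is simple enough to compute directly. A minimal sketch (the helper name is ours, not the crate's; the figures are taken from the 0.6B long-input non-streaming row):

```rust
/// Real-time factor: wall-clock seconds per second of audio produced.
/// RTF < 1.0 means synthesis runs faster than real-time playback.
fn rtf(wall_clock_s: f64, audio_duration_s: f64) -> f64 {
    wall_clock_s / audio_duration_s
}

fn main() {
    // 0.6B Base (CUDA BF16), long input: 23.02 s wall clock, 45.68 s audio.
    let r = rtf(23.02, 45.68);
    assert!(r < 1.0, "faster than real-time");
    println!("RTF = {:.2}", r); // ≈ 0.50, matching the benchmark table
}
```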

docs/BENCHMARKS.md

Lines changed: 83 additions & 30 deletions

```diff
@@ -4,7 +4,7 @@ Performance measurements for `qwen3-tts-rs` inference across CPU and GPU.
 
 All results use default generation parameters
 (temperature=0.9, top_k=50, top_p=0.9, repetition_penalty=1.05, seed=42).
-2 warmup runs, 3 timed iterations, streaming mode enabled for TTFA measurement.
+2 warmup runs, 3 timed iterations.
 
 ## Test Hardware
 
@@ -31,41 +31,80 @@ Real-time factor (RTF) = wall-clock time / audio duration. **Lower is better; <
 
 Each cell shows the average of 3 timed iterations after 2 warmup runs, executed in isolation (no concurrent GPU workloads).
 
-### 0.6B Base — CUDA (BF16)
+### Non-Streaming (batch synthesis)
+
+Uses `synthesize_with_timing` — the optimized `generate_codes` path with GPU-side
+penalty mask and deferred acoustic codes transfer.
+
+#### 0.6B Base — CUDA (BF16)
+
+| Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+|------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+| Short | 13 | 1.82 sec | 3.76 sec | **0.49** | 25.8 | 756 MB | 12ms (1%) | 1671ms (92%) | 140ms (8%) |
+| Medium | 53 | 8.19 sec | 17.04 sec | **0.48** | 26.0 | 761 MB | 12ms (0%) | 7672ms (94%) | 504ms (6%) |
+| Long | 115 | 23.02 sec | 45.68 sec | **0.50** | 24.8 | 767 MB | 12ms (0%) | 21622ms (94%) | 1384ms (6%) |
+
+#### 1.7B Base — CUDA (BF16)
+
+| Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+|------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+| Short | 13 | 2.22 sec | 3.44 sec | **0.64** | 19.4 | 756 MB | 21ms (1%) | 2065ms (93%) | 129ms (6%) |
+| Medium | 53 | 11.22 sec | 17.60 sec | **0.64** | 19.6 | 761 MB | 22ms (0%) | 10672ms (95%) | 521ms (5%) |
+| Long | 115 | 29.82 sec | 45.68 sec | **0.65** | 19.2 | 767 MB | 22ms (0%) | 28409ms (95%) | 1382ms (5%) |
+
+#### 1.7B CustomVoice — CUDA (BF16)
+
+| Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+|------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+| Short | 13 | 3.02 sec | 4.72 sec | **0.64** | 19.6 | 756 MB | 22ms (1%) | 2834ms (94%) | 161ms (5%) |
+| Medium | 53 | 20.06 sec | 31.12 sec | **0.64** | 19.4 | 763 MB | 21ms (0%) | 19094ms (95%) | 945ms (5%) |
+| Long | 115 | 45.60 sec | 68.00 sec | **0.67** | 18.6 | 772 MB | 22ms (0%) | 43535ms (95%) | 2040ms (4%) |
+
+#### 1.7B VoiceDesign — CUDA (BF16)
+
+| Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+|------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+| Short | 13 | 3.13 sec | 4.88 sec | **0.64** | 19.5 | 756 MB | 22ms (1%) | 2938ms (94%) | 165ms (5%) |
+| Medium | 53 | 13.52 sec | 21.12 sec | **0.64** | 19.5 | 761 MB | 22ms (0%) | 12867ms (95%) | 626ms (5%) |
+| Long | 115 | 42.14 sec | 62.96 sec | **0.67** | 18.7 | 770 MB | 23ms (0%) | 40215ms (95%) | 1896ms (4%) |
+
+### Streaming (with TTFA)
+
+Uses `synthesize_streaming` — yields audio chunks incrementally. Both paths now
+use GPU-side penalty mask. Streaming is ~8-12% slower than non-streaming due to
+incremental decode overhead and per-frame `to_vec1` for the frame buffer.
+
+#### 0.6B Base — CUDA (BF16)
 
 | Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
 |------|-------|------------|----------------|-----|------|-------|--------|
-| Short | 13 | 2.30 sec | 4.08 sec | **0.56** | 448 ms | 22.2 | 814 MB |
-| Medium | 53 | 10.08 sec | 17.84 sec | **0.57** | 452 ms | 22.1 | 817 MB |
-| Long | 115 | 110.63 sec | 163.84 sec | **0.68** | 456 ms | 18.5 | 841 MB |
-
-> Note: The 0.6B Base model generates significantly more frames per word than 1.7B models,
-> producing longer audio from the same text. The RTF increase on the long input reflects
-> the higher frame count (2048 frames vs ~529 for 1.7B).
+| Short | 13 | 2.05 sec | 3.76 sec | **0.55** | 443 ms | 22.9 | 814 MB |
+| Medium | 53 | 9.38 sec | 17.04 sec | **0.55** | 444 ms | 22.7 | 817 MB |
+| Long | 115 | 26.01 sec | 45.68 sec | **0.57** | 445 ms | 22.0 | 820 MB |
 
-### 1.7B Base — CUDA (BF16)
+#### 1.7B Base — CUDA (BF16)
 
 | Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
 |------|-------|------------|----------------|-----|------|-------|--------|
-| Short | 13 | 2.25 sec | 3.12 sec | **0.72** | 590 ms | 17.3 | 761 MB |
-| Medium | 53 | 13.24 sec | 18.32 sec | **0.72** | 592 ms | 17.3 | 765 MB |
-| Long | 115 | 31.12 sec | 42.32 sec | **0.74** | 591 ms | 17.0 | 771 MB |
+| Short | 13 | 2.45 sec | 3.44 sec | **0.71** | 576 ms | 17.6 | 762 MB |
+| Medium | 53 | 12.37 sec | 17.60 sec | **0.70** | 579 ms | 17.8 | 765 MB |
+| Long | 115 | 32.94 sec | 45.68 sec | **0.72** | 576 ms | 17.3 | 768 MB |
 
-### 1.7B CustomVoice — CUDA (BF16)
+#### 1.7B CustomVoice — CUDA (BF16)
 
 | Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
 |------|-------|------------|----------------|-----|------|-------|--------|
-| Short | 13 | 2.65 sec | 3.68 sec | **0.72** | 585 ms | 17.3 | 761 MB |
-| Medium | 53 | 24.11 sec | 33.12 sec | **0.73** | 588 ms | 17.2 | 766 MB |
-| Long | 115 | 45.18 sec | 60.32 sec | **0.75** | 590 ms | 16.7 | 769 MB |
+| Short | 13 | 3.34 sec | 4.72 sec | **0.71** | 582 ms | 17.7 | 762 MB |
+| Medium | 53 | 22.25 sec | 31.12 sec | **0.72** | 581 ms | 17.5 | 767 MB |
+| Long | 115 | 50.52 sec | 68.00 sec | **0.74** | 585 ms | 16.8 | 773 MB |
 
-### 1.7B VoiceDesign — CUDA (BF16)
+#### 1.7B VoiceDesign — CUDA (BF16)
 
 | Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
 |------|-------|------------|----------------|-----|------|-------|--------|
-| Short | 13 | 3.01 sec | 4.16 sec | **0.72** | 585 ms | 17.3 | 761 MB |
-| Medium | 53 | 14.73 sec | 20.48 sec | **0.72** | 585 ms | 17.4 | 764 MB |
-| Long | 115 | 53.78 sec | 71.36 sec | **0.75** | 590 ms | 16.6 | 778 MB |
+| Short | 13 | 3.50 sec | 4.88 sec | **0.72** | 584 ms | 17.4 | 762 MB |
+| Medium | 53 | 15.04 sec | 21.12 sec | **0.71** | 582 ms | 17.6 | 765 MB |
+| Long | 115 | 46.46 sec | 62.96 sec | **0.74** | 582 ms | 16.9 | 771 MB |
 
 ### CPU (F32, no MKL/BLAS)
 
@@ -77,24 +116,38 @@ Each cell shows the average of 3 timed iterations after 2 warmup runs, executed
 
 ### Summary
 
+**Non-streaming** (batch synthesis — optimized `generate_codes` path):
+
 | Metric | CPU (1.7B) | 0.6B Base | 1.7B Base | 1.7B CustomVoice | 1.7B VoiceDesign |
 |--------|----------:|---------:|---------:|----------------:|----------------:|
-| RTF (avg) | 5.96 | **0.60** | 0.73 | 0.73 | 0.73 |
-| Tokens/sec | 2.1 | **20.9** | 17.2 | 17.1 | 17.1 |
-| TTFA || **452ms** | 591ms | 588ms | 587ms |
-| Peak memory | 9.1 GB | 841 MB | 771 MB | 769 MB | 778 MB |
+| RTF (avg) | 5.96 | **0.49** | 0.64 | 0.65 | 0.65 |
+| Tokens/sec | 2.1 | **25.5** | **19.4** | 19.2 | 19.2 |
+| Peak memory | 9.1 GB | 767 MB | 767 MB | 772 MB | 770 MB |
+
+**Streaming** (incremental chunks with TTFA):
+
+| Metric | 0.6B Base | 1.7B Base | 1.7B CustomVoice | 1.7B VoiceDesign |
+|--------|---------:|---------:|----------------:|----------------:|
+| RTF (avg) | **0.55** | 0.71 | 0.72 | 0.72 |
+| Tokens/sec | **22.5** | 17.6 | 17.3 | 17.3 |
+| TTFA | **444 ms** | 577 ms | 583 ms | 583 ms |
+| Peak memory | 820 MB | 768 MB | 773 MB | 771 MB |
+
+**CUDA delivers faster-than-real-time synthesis** across all text lengths and
+all model variants. Non-streaming is ~8-12% faster than streaming due to
+deferred acoustic codes transfer in the `generate_codes` path; both paths
+use the GPU-side penalty mask.
+
+The 0.6B model is ~30% faster than 1.7B variants, at the cost of reduced
+voice quality.
 
-**CUDA delivers faster-than-real-time synthesis** across all text lengths.
 CPU is ~6x slower than real-time without BLAS acceleration — expected for
 a 1.7B parameter model in F32. Enabling MKL (x86) or Accelerate (macOS)
 would improve CPU performance significantly.
 
-TTFA (time to first audio) via streaming is stable at ~590ms (1.7B) or ~450ms (0.6B)
+TTFA (time to first audio) via streaming is stable at ~580ms (1.7B) or ~444ms (0.6B)
 regardless of input length, making the streaming API suitable for interactive use cases.
 
-The 0.6B model is ~20% faster than 1.7B variants with lower TTFA, at the cost
-of reduced voice quality.
-
 ## Micro-Benchmarks
 
 Component-level benchmarks run via [Criterion](https://bheisler.github.io/criterion.rs/book/).
```
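The GPU-side repetition penalty mask this PR introduces keeps a per-token "seen" flag on-device and updates it incrementally, instead of re-uploading the growing token history each frame. A CPU reference of the penalty math only (standard divide-positive/multiply-negative repetition penalty; the function name and flat-slice layout are ours for illustration):

```rust
/// CPU reference of repetition-penalty application. Tokens flagged in
/// `seen` get their logit divided (if positive) or multiplied (if
/// negative) by `penalty`, pushing repeats down. In the real pipeline
/// the `seen` mask lives on the GPU and is updated via slice_assign.
fn apply_repetition_penalty(logits: &mut [f32], seen: &[bool], penalty: f32) {
    for (logit, &was_seen) in logits.iter_mut().zip(seen) {
        if was_seen {
            *logit = if *logit > 0.0 {
                *logit / penalty
            } else {
                *logit * penalty
            };
        }
    }
}

fn main() {
    let mut logits = vec![2.0, -1.0, 0.5];
    let seen = vec![true, true, false];
    apply_repetition_penalty(&mut logits, &seen, 2.0);
    assert_eq!(logits, vec![1.0, -2.0, 0.5]); // unseen token untouched
}
```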

docs/CUSTOM_CUDA_KERNELS_PLAN.md

Lines changed: 89 additions & 0 deletions (new file)
# Custom CUDA Kernels Plan

## Current State

- **19.2 tok/s** (short), **18.3 tok/s** (long) on DGX A100
- ~95% of theoretical throughput given current kernel launch pattern
- ~625-740 CUDA kernel launches per talker decode step (28 layers × ~24 kernels/layer)
- Decode is memory-bandwidth bound at batch=1
- Target: **25-27 tok/s** (~40% improvement)
## Phase 1: Fused Residual + RMSNorm (estimated 15-25% speedup)

**Why first:** Executes 33 times per frame (28 talker + 5 code predictor layers). Currently
3 separate kernel launches per norm (residual add → variance reduction → normalize+scale).
Fusing to 1 kernel eliminates 66 launches/frame and halves memory traffic.

**Approach:** Use the existing `candle-layer-norm` crate, which provides fused RMSNorm CUDA
kernels for candle. If it doesn't support residual-add fusion, extend it or write our own PTX kernel.

**Steps:**
1. Add `candle-layer-norm` dependency, feature-gated behind `cuda`
2. Create `FusedRmsNorm` wrapper matching candle's `RmsNorm` interface
3. Wire into `DecoderLayer::forward()` in `transformer.rs`
4. Wire into `CodePredictor` layers
5. Unit test: compare fused vs sequential output on random tensors
6. Benchmark with e2e_bench

**Files:**
- `Cargo.toml`
- `src/models/transformer.rs` — DecoderLayer norm calls
- `src/models/code_predictor.rs` — CodePredictor norm calls
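The sequential baseline that step 5's unit test compares against can be sketched on the CPU. A minimal reference, assuming the usual pre-norm layout `y = rms_norm(x + residual) * weight` (shapes, eps, and the function name are illustrative):

```rust
/// Sequential CPU reference for the fused op: residual add, then RMSNorm
/// with a learned scale. The fused CUDA kernel must reproduce this output
/// bit-for-bit within float tolerance.
fn residual_rms_norm(x: &[f32], residual: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    // Step 1 (kernel launch #1 today): residual add.
    let h: Vec<f32> = x.iter().zip(residual).map(|(a, b)| a + b).collect();
    // Step 2 (launch #2): mean of squares reduction.
    let mean_sq = h.iter().map(|v| v * v).sum::<f32>() / h.len() as f32;
    // Step 3 (launch #3): normalize and scale — fusion collapses all three.
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    h.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

fn main() {
    // x + residual = [2, 2]; RMS = 2, so the normalized output is ≈ [1, 1].
    let y = residual_rms_norm(&[1.0, 2.0], &[1.0, 0.0], &[1.0, 1.0], 1e-6);
    assert!((y[0] - 1.0).abs() < 1e-4 && (y[1] - 1.0).abs() < 1e-4);
}
```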
## Phase 2: Fused SwiGLU MLP (estimated 5-10% speedup)

**Why:** The MLP does gate_proj → silu → up_proj → mul → down_proj. The silu+mul step is 2
kernel launches per layer that can become 1, and gate_proj and up_proj share their input (one load).

**Approach:** Write a custom PTX kernel via candle's `get_or_load_custom_func()`:
- Fused op: element-wise `silu(a) * b` (matmuls stay in cuBLAS)
- Reduces 2 kernel launches per layer to 1 (×33 layers = 33 fewer launches/frame)

**Steps:**
1. Write `kernels/fused_silu_mul.cu` — element-wise `silu(a) * b`
2. Compile to PTX, embed via `include_str!`
3. Implement as `CustomOp2` in `src/models/fused_ops.rs`
4. Replace `Activation::Silu` + `Tensor::mul` in MLP::forward
5. Unit test: compare against sequential silu+mul
6. Benchmark

**Files:**
- `kernels/fused_silu_mul.cu` (new)
- `src/models/fused_ops.rs` (new)
- `src/models/transformer.rs` — MLP::forward
- `build.rs` or inline PTX string
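The element-wise op being fused is small enough to state exactly. A CPU reference for step 5's comparison, where `silu(x) = x * sigmoid(x)` (function name ours; the CUDA kernel computes the same thing in one launch over the whole tensor):

```rust
/// CPU reference for the fused SwiGLU element-wise op: silu(a) * b.
/// `a` is the gate_proj output, `b` the up_proj output; the matmuls
/// producing them stay in cuBLAS, only this step is fused.
fn silu_mul(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x / (1.0 + (-x).exp())) * y) // silu(x) = x·sigmoid(x)
        .collect()
}

fn main() {
    let out = silu_mul(&[0.0, 1.0], &[3.0, 2.0]);
    assert_eq!(out[0], 0.0); // silu(0) = 0
    // silu(1) = sigmoid(1) ≈ 0.731, times 2.
    assert!((out[1] - 2.0 / (1.0 + (-1.0f32).exp())).abs() < 1e-6);
}
```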
## Phase 3: Fused RoPE (estimated 2-5% speedup)

**Why:** RoPE does cos/sin computation + element-wise ops as separate kernels. Fusing
saves memory round-trips. Runs twice per layer (Q and K) × 28 layers = 56 calls/frame.

**Steps:**
1. Write `kernels/fused_rope.cu` — combined cos/sin rotation
2. Implement as `CustomOp1`
3. Replace multi-step RoPE in `Attention::forward`
4. Unit test + benchmark

**Files:**
- `kernels/fused_rope.cu` (new)
- `src/models/fused_ops.rs` — add RoPE op
- `src/models/transformer.rs` — Attention::forward
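The rotation the fused kernel applies is a plain 2D rotation on paired head-dim elements. A CPU reference for the unit test in step 4 (pairing layout and function name are assumptions for illustration; the kernel applies this across the whole Q/K tensor in one launch):

```rust
/// CPU reference for RoPE on one element pair of a head dimension:
/// (x0, x1) -> (x0·cosθ - x1·sinθ, x0·sinθ + x1·cosθ),
/// where θ depends on the position and the pair's frequency.
fn rope_pair(x0: f32, x1: f32, theta: f32) -> (f32, f32) {
    let (sin, cos) = theta.sin_cos();
    (x0 * cos - x1 * sin, x0 * sin + x1 * cos)
}

fn main() {
    // Rotation by 0 is the identity.
    assert_eq!(rope_pair(1.0, 2.0, 0.0), (1.0, 2.0));
    // Rotation by π/2 maps (1, 0) to (0, 1), up to float error.
    let (a, b) = rope_pair(1.0, 0.0, std::f32::consts::FRAC_PI_2);
    assert!(a.abs() < 1e-6 && (b - 1.0).abs() < 1e-6);
}
```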
## Iteration Protocol

After each phase:
1. `cargo test --lib` (must pass)
2. `cargo clippy --lib -- -D warnings`
3. e2e_bench with 3 iterations
4. Record in `docs/PERFORMANCE_JOURNAL.md`
5. Chrome trace to verify kernel count reduction
6. Commit + PR
## Expected Cumulative Impact

| After Phase | Estimated tok/s | Kernel launches/frame |
|-------------|----------------|-----------------------|
| Baseline | 18-19 | ~700 |
| 1 (Fused RmsNorm) | 22-24 | ~634 |
| 2 (Fused SwiGLU) | 24-26 | ~601 |
| 3 (Fused RoPE) | 25-27 | ~545 |
