Commit 275d528 — Merge pull request #6 from TrevorS/perf/gpu-penalty-mask

2 parents: 21ac025 + 3b12811

26 files changed: 2824 additions & 268 deletions

CLAUDE.md

Lines changed: 9 additions & 2 deletions

````diff
@@ -16,26 +16,30 @@ cargo bench # Criterion micro-benchmarks (no
 ```
 
 Python scripts in `scripts/` are linted with:
+
 ```bash
 uvx ruff format --check scripts/
 uvx ruff check scripts/
 ```
 
 Pre-commit (runs both Rust and Python checks):
+
 ```bash
 make pre-commit
 ```
 
 ## Profiling & Benchmarks
 
 Model weights required. Run inside Docker for CUDA:
+
 ```bash
 make profile-chrome MODEL_DIR=test_data/models/1.7B-CustomVoice
 make profile-flamegraph MODEL_DIR=test_data/models/1.7B-CustomVoice
 make audit-gpu-syncs
 ```
 
 E2E benchmarks:
+
 ```bash
 cargo run --release --features cuda,cli --bin e2e_bench -- \
   --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 2 --streaming
@@ -47,11 +51,12 @@ Three-stage TTS pipeline, all in `src/`:
 
 1. **TalkerModel** (`models/talker.rs`) — 28-layer transformer generating semantic tokens from text. Uses MRoPE, KV caching. 0.6B models: hidden=1024, 1.7B models: hidden=2048.
 
-2. **CodePredictor** (`models/code_predictor.rs`) — 5-layer transformer generating 15 acoustic codes per semantic token. Always hidden=1024; 1.7B models use `small_to_mtp_projection` to bridge from talker's 2048-dim space. Called every frame during generation.
+1. **CodePredictor** (`models/code_predictor.rs`) — 5-layer transformer generating 15 acoustic codes per semantic token. Always hidden=1024; 1.7B models use `small_to_mtp_projection` to bridge from talker's 2048-dim space. Called every frame during generation.
 
-3. **Decoder12Hz** (`models/codec/decoder_12hz.rs`) — ConvNeXt + transposed convolution decoder converting 16-codebook codes to 24kHz audio. Always F32.
+1. **Decoder12Hz** (`models/codec/decoder_12hz.rs`) — ConvNeXt + transposed convolution decoder converting 16-codebook codes to 24kHz audio. Always F32.
 
 The generation loop (`lib.rs::generate_codes`) ties them together:
+
 ```
 For each frame:
 1. CodePredictor generates 15 acoustic codes from last_hidden + semantic embedding
@@ -73,6 +78,7 @@ For each frame:
 ## Model Variants
 
 Five variants, auto-detected from `config.json`:
+
 - **Base** (0.6B, 1.7B): Voice cloning via ECAPA-TDNN speaker encoder. ICL mode uses speech encoder + reference text.
 - **CustomVoice** (0.6B, 1.7B): 9 preset speakers (Ryan, Serena, etc.) via discrete speaker token IDs.
 - **VoiceDesign** (1.7B only): Text-described voices via instruct prompt with ChatML framing.
@@ -93,6 +99,7 @@ Five variants, auto-detected from `config.json`:
 ## Codec Token IDs
 
 Generation uses codec vocabulary (0–3071), not text vocabulary:
+
 - EOS: 2150 (generation stops here)
 - BOS: 2149, PAD: 2148
 - Speakers: Ryan=3061, Serena=3066, etc.
````
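The codec token-ID convention in that last hunk can be sketched directly. The constant values come from the list above; the function name and the flat-iterator loop shape are illustrative, not the crate's actual API:

```rust
// Codec vocabulary special tokens (values from the CLAUDE.md list above).
const CODEC_EOS: u32 = 2150;
const CODEC_BOS: u32 = 2149;
const CODEC_PAD: u32 = 2148;

/// Collects semantic tokens until EOS, skipping framing tokens.
/// `stream` stands in for the per-frame sampler output; this helper is
/// a hypothetical illustration, not part of the qwen3-tts-rs API.
fn collect_until_eos(stream: impl IntoIterator<Item = u32>) -> Vec<u32> {
    stream
        .into_iter()
        .take_while(|&t| t != CODEC_EOS)          // generation stops at EOS
        .filter(|&t| t != CODEC_BOS && t != CODEC_PAD) // drop framing tokens
        .collect()
}

fn main() {
    let generated = collect_until_eos([2149, 17, 42, 99, 2150, 7]);
    assert_eq!(generated, vec![17, 42, 99]); // EOS cuts off the trailing 7
}
```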

Cargo.toml

Lines changed: 1 addition & 0 deletions

```diff
@@ -67,6 +67,7 @@ audioadapter-buffers = "2.0.0"
 
 # Profiling (optional)
 tracing-chrome = { version = "0.7", optional = true }
+half = "2.7.1"
 
 [dev-dependencies]
 criterion = "0.8"
```

Makefile

Lines changed: 9 additions & 1 deletion

```diff
@@ -1,4 +1,4 @@
-.PHONY: lint fmt pre-commit pre-commit-install profile-chrome profile-flamegraph profile-nsys audit-gpu-syncs
+.PHONY: lint fmt pre-commit pre-commit-install profile-chrome profile-flamegraph profile-nsys audit-gpu-syncs test-kernel count-kernels
 
 MODEL_DIR ?= test_data
 
@@ -33,3 +33,11 @@ profile-nsys:
 
 audit-gpu-syncs:
 	@bash scripts/audit-gpu-syncs.sh
+
+# ── Kernel Development ──────────────────────────────────────────────────
+
+test-kernel:
+	@bash scripts/test-kernel.sh $(NAME)
+
+count-kernels:
+	@bash scripts/count-kernels.sh $(MODEL_DIR)
```

README.md

Lines changed: 17 additions & 8 deletions

```diff
@@ -6,6 +6,15 @@ All code in this repo was written with [Claude Code](https://claude.ai/code). Th
 
 ## Changelog
 
+### 0.4.0
+
+- Pre-allocated KV cache with InplaceOp2 (zero-copy CUDA writes, no Tensor::cat)
+- GPU-side repetition penalty mask (incremental slice_assign, eliminates growing CPU transfer)
+- Deferred acoustic codes transfer (single bulk GPU→CPU at end of generation)
+- Fused residual + RMSNorm CUDA kernel
+- GPU→CPU syncs reduced from 3/frame to 1/frame (4-byte EOS check)
+- Non-streaming RTF: 0.48–0.67 across all variants (97-100% of theoretical throughput)
+
 ### 0.3.0
 
 - GPU-side sampling: batched argmax, on-device top-k/top-p/repetition penalty
@@ -52,16 +61,16 @@ Thanks to [u/rngesius](https://www.reddit.com/r/LocalLLaMA/comments/1qqvb79/comm
 Benchmarked on an NVIDIA DGX Spark (GB10 Blackwell, ARM Cortex-X925, 120 GB unified memory).
 Default generation parameters, seed 42, 2 warmup + 3 timed iterations.
 
-| Model | RTF (short) | RTF (long) | Tok/s | TTFA | Memory |
-|-------|-------------|------------|-------|------|--------|
-| **0.6B Base (CUDA BF16)** | **0.56** | **0.68** | 22.2 | 448 ms | 814 MB |
-| **1.7B Base (CUDA BF16)** | **0.72** | **0.74** | 17.3 | 590 ms | 761 MB |
-| **1.7B CustomVoice (CUDA BF16)** | **0.72** | **0.75** | 17.3 | 585 ms | 761 MB |
-| **1.7B VoiceDesign (CUDA BF16)** | **0.72** | **0.75** | 17.3 | 585 ms | 761 MB |
-| 1.7B CustomVoice (CPU F32) | 5.39 | 6.48 | 2.1 | | 9.1 GB |
+| Model | RTF (short) | RTF (long) | Tok/s | Memory |
+|-------|-------------|------------|-------|--------|
+| **0.6B Base (CUDA BF16)** | **0.48** | **0.50** | 25.9 | 767 MB |
+| **1.7B Base (CUDA BF16)** | **0.65** | **0.65** | 19.4 | 767 MB |
+| **1.7B CustomVoice (CUDA BF16)** | **0.64** | **0.67** | 19.2 | 772 MB |
+| **1.7B VoiceDesign (CUDA BF16)** | **0.64** | **0.66** | 19.3 | 770 MB |
+| 1.7B CustomVoice (CPU F32) | 5.39 | 6.48 | 2.1 | 9.1 GB |
 
 RTF (real-time factor) = wall-clock / audio duration. **< 1.0 is faster than real-time.**
-TTFA = time to first audio chunk via streaming.
+Non-streaming results shown above. Streaming adds ~8-12% overhead with TTFA ~444 ms (0.6B) / ~580 ms (1.7B).
 
 See [docs/BENCHMARKS.md](docs/BENCHMARKS.md) for full results, test corpus, micro-benchmarks, and reproduction instructions.
 
```
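The RTF metric the README table reports is simple enough to compute directly. A minimal sketch (the helper name is ours, not the crate's; the figures are taken from the 0.6B long-input non-streaming row):

```rust
/// Real-time factor: wall-clock seconds per second of audio produced.
/// RTF < 1.0 means synthesis runs faster than real-time playback.
fn rtf(wall_clock_s: f64, audio_duration_s: f64) -> f64 {
    wall_clock_s / audio_duration_s
}

fn main() {
    // 0.6B Base (CUDA BF16), long input: 23.02 s wall clock, 45.68 s audio.
    let r = rtf(23.02, 45.68);
    assert!(r < 1.0, "faster than real-time");
    println!("RTF = {:.2}", r); // ≈ 0.50, matching the benchmark table
}
```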

docs/BENCHMARKS.md

Lines changed: 83 additions & 30 deletions

```diff
@@ -4,7 +4,7 @@ Performance measurements for `qwen3-tts-rs` inference across CPU and GPU.
 
 All results use default generation parameters
 (temperature=0.9, top_k=50, top_p=0.9, repetition_penalty=1.05, seed=42).
-2 warmup runs, 3 timed iterations, streaming mode enabled for TTFA measurement.
+2 warmup runs, 3 timed iterations.
 
 ## Test Hardware
 
@@ -31,41 +31,80 @@ Real-time factor (RTF) = wall-clock time / audio duration. **Lower is better; <
 
 Each cell shows the average of 3 timed iterations after 2 warmup runs, executed in isolation (no concurrent GPU workloads).
 
-### 0.6B Base — CUDA (BF16)
+### Non-Streaming (batch synthesis)
+
+Uses `synthesize_with_timing` — the optimized `generate_codes` path with GPU-side
+penalty mask and deferred acoustic codes transfer.
+
+#### 0.6B Base — CUDA (BF16)
+
+| Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+|------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+| Short | 13 | 1.82 sec | 3.76 sec | **0.49** | 25.8 | 756 MB | 12ms (1%) | 1671ms (92%) | 140ms (8%) |
+| Medium | 53 | 8.19 sec | 17.04 sec | **0.48** | 26.0 | 761 MB | 12ms (0%) | 7672ms (94%) | 504ms (6%) |
+| Long | 115 | 23.02 sec | 45.68 sec | **0.50** | 24.8 | 767 MB | 12ms (0%) | 21622ms (94%) | 1384ms (6%) |
+
+#### 1.7B Base — CUDA (BF16)
+
+| Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+|------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+| Short | 13 | 2.22 sec | 3.44 sec | **0.64** | 19.4 | 756 MB | 21ms (1%) | 2065ms (93%) | 129ms (6%) |
+| Medium | 53 | 11.22 sec | 17.60 sec | **0.64** | 19.6 | 761 MB | 22ms (0%) | 10672ms (95%) | 521ms (5%) |
+| Long | 115 | 29.82 sec | 45.68 sec | **0.65** | 19.2 | 767 MB | 22ms (0%) | 28409ms (95%) | 1382ms (5%) |
+
+#### 1.7B CustomVoice — CUDA (BF16)
+
+| Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+|------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+| Short | 13 | 3.02 sec | 4.72 sec | **0.64** | 19.6 | 756 MB | 22ms (1%) | 2834ms (94%) | 161ms (5%) |
+| Medium | 53 | 20.06 sec | 31.12 sec | **0.64** | 19.4 | 763 MB | 21ms (0%) | 19094ms (95%) | 945ms (5%) |
+| Long | 115 | 45.60 sec | 68.00 sec | **0.67** | 18.6 | 772 MB | 22ms (0%) | 43535ms (95%) | 2040ms (4%) |
+
+#### 1.7B VoiceDesign — CUDA (BF16)
+
+| Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+|------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+| Short | 13 | 3.13 sec | 4.88 sec | **0.64** | 19.5 | 756 MB | 22ms (1%) | 2938ms (94%) | 165ms (5%) |
+| Medium | 53 | 13.52 sec | 21.12 sec | **0.64** | 19.5 | 761 MB | 22ms (0%) | 12867ms (95%) | 626ms (5%) |
+| Long | 115 | 42.14 sec | 62.96 sec | **0.67** | 18.7 | 770 MB | 23ms (0%) | 40215ms (95%) | 1896ms (4%) |
+
+### Streaming (with TTFA)
+
+Uses `synthesize_streaming` — yields audio chunks incrementally. Both paths now
+use GPU-side penalty mask. Streaming is ~8-12% slower than non-streaming due to
+incremental decode overhead and per-frame `to_vec1` for the frame buffer.
+
+#### 0.6B Base — CUDA (BF16)
 
 | Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
 |------|-------|------------|----------------|-----|------|-------|--------|
-| Short | 13 | 2.30 sec | 4.08 sec | **0.56** | 448 ms | 22.2 | 814 MB |
-| Medium | 53 | 10.08 sec | 17.84 sec | **0.57** | 452 ms | 22.1 | 817 MB |
-| Long | 115 | 110.63 sec | 163.84 sec | **0.68** | 456 ms | 18.5 | 841 MB |
-
-> Note: The 0.6B Base model generates significantly more frames per word than 1.7B models,
-> producing longer audio from the same text. The RTF increase on the long input reflects
-> the higher frame count (2048 frames vs ~529 for 1.7B).
+| Short | 13 | 2.05 sec | 3.76 sec | **0.55** | 443 ms | 22.9 | 814 MB |
+| Medium | 53 | 9.38 sec | 17.04 sec | **0.55** | 444 ms | 22.7 | 817 MB |
+| Long | 115 | 26.01 sec | 45.68 sec | **0.57** | 445 ms | 22.0 | 820 MB |
 
-### 1.7B Base — CUDA (BF16)
+#### 1.7B Base — CUDA (BF16)
 
 | Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
 |------|-------|------------|----------------|-----|------|-------|--------|
-| Short | 13 | 2.25 sec | 3.12 sec | **0.72** | 590 ms | 17.3 | 761 MB |
-| Medium | 53 | 13.24 sec | 18.32 sec | **0.72** | 592 ms | 17.3 | 765 MB |
-| Long | 115 | 31.12 sec | 42.32 sec | **0.74** | 591 ms | 17.0 | 771 MB |
+| Short | 13 | 2.45 sec | 3.44 sec | **0.71** | 576 ms | 17.6 | 762 MB |
+| Medium | 53 | 12.37 sec | 17.60 sec | **0.70** | 579 ms | 17.8 | 765 MB |
+| Long | 115 | 32.94 sec | 45.68 sec | **0.72** | 576 ms | 17.3 | 768 MB |
 
-### 1.7B CustomVoice — CUDA (BF16)
+#### 1.7B CustomVoice — CUDA (BF16)
 
 | Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
 |------|-------|------------|----------------|-----|------|-------|--------|
-| Short | 13 | 2.65 sec | 3.68 sec | **0.72** | 585 ms | 17.3 | 761 MB |
-| Medium | 53 | 24.11 sec | 33.12 sec | **0.73** | 588 ms | 17.2 | 766 MB |
-| Long | 115 | 45.18 sec | 60.32 sec | **0.75** | 590 ms | 16.7 | 769 MB |
+| Short | 13 | 3.34 sec | 4.72 sec | **0.71** | 582 ms | 17.7 | 762 MB |
+| Medium | 53 | 22.25 sec | 31.12 sec | **0.72** | 581 ms | 17.5 | 767 MB |
+| Long | 115 | 50.52 sec | 68.00 sec | **0.74** | 585 ms | 16.8 | 773 MB |
 
-### 1.7B VoiceDesign — CUDA (BF16)
+#### 1.7B VoiceDesign — CUDA (BF16)
 
 | Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
 |------|-------|------------|----------------|-----|------|-------|--------|
-| Short | 13 | 3.01 sec | 4.16 sec | **0.72** | 585 ms | 17.3 | 761 MB |
-| Medium | 53 | 14.73 sec | 20.48 sec | **0.72** | 585 ms | 17.4 | 764 MB |
-| Long | 115 | 53.78 sec | 71.36 sec | **0.75** | 590 ms | 16.6 | 778 MB |
+| Short | 13 | 3.50 sec | 4.88 sec | **0.72** | 584 ms | 17.4 | 762 MB |
+| Medium | 53 | 15.04 sec | 21.12 sec | **0.71** | 582 ms | 17.6 | 765 MB |
+| Long | 115 | 46.46 sec | 62.96 sec | **0.74** | 582 ms | 16.9 | 771 MB |
 
 ### CPU (F32, no MKL/BLAS)
 
@@ -77,24 +116,38 @@ Each cell shows the average of 3 timed iterations after 2 warmup runs, executed
 
 ### Summary
 
+**Non-streaming** (batch synthesis — optimized `generate_codes` path):
+
 | Metric | CPU (1.7B) | 0.6B Base | 1.7B Base | 1.7B CustomVoice | 1.7B VoiceDesign |
 |--------|----------:|---------:|---------:|----------------:|----------------:|
-| RTF (avg) | 5.96 | **0.60** | 0.73 | 0.73 | 0.73 |
-| Tokens/sec | 2.1 | **20.9** | 17.2 | 17.1 | 17.1 |
-| TTFA || **452ms** | 591ms | 588ms | 587ms |
-| Peak memory | 9.1 GB | 841 MB | 771 MB | 769 MB | 778 MB |
+| RTF (avg) | 5.96 | **0.49** | 0.64 | 0.65 | 0.65 |
+| Tokens/sec | 2.1 | **25.5** | **19.4** | 19.2 | 19.2 |
+| Peak memory | 9.1 GB | 767 MB | 767 MB | 772 MB | 770 MB |
+
+**Streaming** (incremental chunks with TTFA):
+
+| Metric | 0.6B Base | 1.7B Base | 1.7B CustomVoice | 1.7B VoiceDesign |
+|--------|---------:|---------:|----------------:|----------------:|
+| RTF (avg) | **0.55** | 0.71 | 0.72 | 0.72 |
+| Tokens/sec | **22.5** | 17.6 | 17.3 | 17.3 |
+| TTFA | **444 ms** | 577 ms | 583 ms | 583 ms |
+| Peak memory | 820 MB | 768 MB | 773 MB | 771 MB |
+
+**CUDA delivers faster-than-real-time synthesis** across all text lengths and
+all model variants. Non-streaming is ~8-12% faster than streaming due to
+deferred acoustic codes transfer in the `generate_codes` path; both paths
+use the GPU-side penalty mask.
+
+The 0.6B model is ~30% faster than 1.7B variants, at the cost of reduced
+voice quality.
 
-**CUDA delivers faster-than-real-time synthesis** across all text lengths.
 CPU is ~6x slower than real-time without BLAS acceleration — expected for
 a 1.7B parameter model in F32. Enabling MKL (x86) or Accelerate (macOS)
 would improve CPU performance significantly.
 
-TTFA (time to first audio) via streaming is stable at ~590ms (1.7B) or ~450ms (0.6B)
+TTFA (time to first audio) via streaming is stable at ~580ms (1.7B) or ~444ms (0.6B)
 regardless of input length, making the streaming API suitable for interactive use cases.
 
-The 0.6B model is ~20% faster than 1.7B variants with lower TTFA, at the cost
-of reduced voice quality.
-
 ## Micro-Benchmarks
 
 Component-level benchmarks run via [Criterion](https://bheisler.github.io/criterion.rs/book/).
```
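The GPU-side repetition penalty mask this PR introduces keeps a per-token "seen" flag on-device and updates it incrementally, instead of re-uploading the growing token history each frame. A CPU reference of the penalty math only (standard divide-positive/multiply-negative repetition penalty; the function name and flat-slice layout are ours for illustration):

```rust
/// CPU reference of repetition-penalty application. Tokens flagged in
/// `seen` get their logit divided (if positive) or multiplied (if
/// negative) by `penalty`, pushing repeats down. In the real pipeline
/// the `seen` mask lives on the GPU and is updated via slice_assign.
fn apply_repetition_penalty(logits: &mut [f32], seen: &[bool], penalty: f32) {
    for (logit, &was_seen) in logits.iter_mut().zip(seen) {
        if was_seen {
            *logit = if *logit > 0.0 {
                *logit / penalty
            } else {
                *logit * penalty
            };
        }
    }
}

fn main() {
    let mut logits = vec![2.0, -1.0, 0.5];
    let seen = vec![true, true, false];
    apply_repetition_penalty(&mut logits, &seen, 2.0);
    assert_eq!(logits, vec![1.0, -2.0, 0.5]); // unseen token untouched
}
```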

docs/CUSTOM_CUDA_KERNELS_PLAN.md

Lines changed: 89 additions & 0 deletions (new file)
# Custom CUDA Kernels Plan

## Current State

- **19.2 tok/s** (short), **18.3 tok/s** (long) on DGX A100
- ~95% of theoretical throughput given current kernel launch pattern
- ~625-740 CUDA kernel launches per talker decode step (28 layers × ~24 kernels/layer)
- Decode is memory-bandwidth bound at batch=1
- Target: **25-27 tok/s** (~40% improvement)
## Phase 1: Fused Residual + RMSNorm (estimated 15-25% speedup)

**Why first:** Executes 33 times per frame (28 talker + 5 code predictor layers). Currently
3 separate kernel launches per norm (residual add → variance reduction → normalize+scale).
Fusing to 1 kernel eliminates 66 launches/frame and halves memory traffic.

**Approach:** Use the existing `candle-layer-norm` crate, which provides fused RMSNorm CUDA
kernels for candle. If it doesn't support residual-add fusion, extend it or write our own PTX kernel.

**Steps:**
1. Add `candle-layer-norm` dependency, feature-gated behind `cuda`
2. Create `FusedRmsNorm` wrapper matching candle's `RmsNorm` interface
3. Wire into `DecoderLayer::forward()` in `transformer.rs`
4. Wire into `CodePredictor` layers
5. Unit test: compare fused vs sequential output on random tensors
6. Benchmark with e2e_bench

**Files:**
- `Cargo.toml`
- `src/models/transformer.rs` — DecoderLayer norm calls
- `src/models/code_predictor.rs` — CodePredictor norm calls
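The sequential baseline that step 5's unit test compares against can be sketched on the CPU. A minimal reference, assuming the usual pre-norm layout `y = rms_norm(x + residual) * weight` (shapes, eps, and the function name are illustrative):

```rust
/// Sequential CPU reference for the fused op: residual add, then RMSNorm
/// with a learned scale. The fused CUDA kernel must reproduce this output
/// bit-for-bit within float tolerance.
fn residual_rms_norm(x: &[f32], residual: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    // Step 1 (kernel launch #1 today): residual add.
    let h: Vec<f32> = x.iter().zip(residual).map(|(a, b)| a + b).collect();
    // Step 2 (launch #2): mean of squares reduction.
    let mean_sq = h.iter().map(|v| v * v).sum::<f32>() / h.len() as f32;
    // Step 3 (launch #3): normalize and scale — fusion collapses all three.
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    h.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

fn main() {
    // x + residual = [2, 2]; RMS = 2, so the normalized output is ≈ [1, 1].
    let y = residual_rms_norm(&[1.0, 2.0], &[1.0, 0.0], &[1.0, 1.0], 1e-6);
    assert!((y[0] - 1.0).abs() < 1e-4 && (y[1] - 1.0).abs() < 1e-4);
}
```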
## Phase 2: Fused SwiGLU MLP (estimated 5-10% speedup)

**Why:** The MLP does gate_proj → silu → up_proj → mul → down_proj. The silu+mul step is 2
kernel launches per layer that can become 1, and gate_proj and up_proj share their input (one load).

**Approach:** Write a custom PTX kernel via candle's `get_or_load_custom_func()`:
- Fused op: element-wise `silu(a) * b` (matmuls stay in cuBLAS)
- Reduces 2 kernel launches per layer to 1 (×33 layers = 33 fewer launches/frame)

**Steps:**
1. Write `kernels/fused_silu_mul.cu` — element-wise `silu(a) * b`
2. Compile to PTX, embed via `include_str!`
3. Implement as `CustomOp2` in `src/models/fused_ops.rs`
4. Replace `Activation::Silu` + `Tensor::mul` in MLP::forward
5. Unit test: compare against sequential silu+mul
6. Benchmark

**Files:**
- `kernels/fused_silu_mul.cu` (new)
- `src/models/fused_ops.rs` (new)
- `src/models/transformer.rs` — MLP::forward
- `build.rs` or inline PTX string
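The element-wise op being fused is small enough to state exactly. A CPU reference for step 5's comparison, where `silu(x) = x * sigmoid(x)` (function name ours; the CUDA kernel computes the same thing in one launch over the whole tensor):

```rust
/// CPU reference for the fused SwiGLU element-wise op: silu(a) * b.
/// `a` is the gate_proj output, `b` the up_proj output; the matmuls
/// producing them stay in cuBLAS, only this step is fused.
fn silu_mul(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x / (1.0 + (-x).exp())) * y) // silu(x) = x·sigmoid(x)
        .collect()
}

fn main() {
    let out = silu_mul(&[0.0, 1.0], &[3.0, 2.0]);
    assert_eq!(out[0], 0.0); // silu(0) = 0
    // silu(1) = sigmoid(1) ≈ 0.731, times 2.
    assert!((out[1] - 2.0 / (1.0 + (-1.0f32).exp())).abs() < 1e-6);
}
```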
## Phase 3: Fused RoPE (estimated 2-5% speedup)

**Why:** RoPE does cos/sin computation + element-wise ops as separate kernels. Fusing
saves memory round-trips. Runs twice per layer (Q and K) × 28 layers = 56 calls/frame.

**Steps:**
1. Write `kernels/fused_rope.cu` — combined cos/sin rotation
2. Implement as `CustomOp1`
3. Replace multi-step RoPE in `Attention::forward`
4. Unit test + benchmark

**Files:**
- `kernels/fused_rope.cu` (new)
- `src/models/fused_ops.rs` — add RoPE op
- `src/models/transformer.rs` — Attention::forward
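The rotation the fused kernel applies is a plain 2D rotation on paired head-dim elements. A CPU reference for the unit test in step 4 (pairing layout and function name are assumptions for illustration; the kernel applies this across the whole Q/K tensor in one launch):

```rust
/// CPU reference for RoPE on one element pair of a head dimension:
/// (x0, x1) -> (x0·cosθ - x1·sinθ, x0·sinθ + x1·cosθ),
/// where θ depends on the position and the pair's frequency.
fn rope_pair(x0: f32, x1: f32, theta: f32) -> (f32, f32) {
    let (sin, cos) = theta.sin_cos();
    (x0 * cos - x1 * sin, x0 * sin + x1 * cos)
}

fn main() {
    // Rotation by 0 is the identity.
    assert_eq!(rope_pair(1.0, 2.0, 0.0), (1.0, 2.0));
    // Rotation by π/2 maps (1, 0) to (0, 1), up to float error.
    let (a, b) = rope_pair(1.0, 0.0, std::f32::consts::FRAC_PI_2);
    assert!(a.abs() < 1e-6 && (b - 1.0).abs() < 1e-6);
}
```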
## Iteration Protocol

After each phase:
1. `cargo test --lib` (must pass)
2. `cargo clippy --lib -- -D warnings`
3. e2e_bench with 3 iterations
4. Record in `docs/PERFORMANCE_JOURNAL.md`
5. Chrome trace to verify kernel count reduction
6. Commit + PR
## Expected Cumulative Impact

| After Phase | Estimated tok/s | Kernel launches/frame |
|-------------|----------------|-----------------------|
| Baseline | 18-19 | ~700 |
| 1 (Fused RmsNorm) | 22-24 | ~634 |
| 2 (Fused SwiGLU) | 24-26 | ~601 |
| 3 (Fused RoPE) | 25-27 | ~545 |
