@@ -4,7 +4,7 @@ Performance measurements for `qwen3-tts-rs` inference across CPU and GPU.

All results use default generation parameters
(temperature=0.9, top_k=50, top_p=0.9, repetition_penalty=1.05, seed=42).
- 2 warmup runs, 3 timed iterations, streaming mode enabled for TTFA measurement.
+ 2 warmup runs, 3 timed iterations.

## Test Hardware

@@ -31,41 +31,80 @@ Real-time factor (RTF) = wall-clock time / audio duration. **Lower is better; <

Each cell shows the average of 3 timed iterations after 2 warmup runs, executed in isolation (no concurrent GPU workloads).
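
The measurement protocol described above (warmup runs, then the mean of several timed iterations) and the RTF formula can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: `mean_wall_clock` and `real_time_factor` are hypothetical helper names, not part of `qwen3-tts-rs`, and the closure stands in for whatever synthesis call is being measured.

```rust
use std::time::{Duration, Instant};

/// Real-time factor: wall-clock time divided by audio duration.
/// Values below 1.0 mean synthesis runs faster than real time.
/// (Hypothetical helper, not an API of the benchmarked crate.)
fn real_time_factor(wall_clock: Duration, audio: Duration) -> f64 {
    wall_clock.as_secs_f64() / audio.as_secs_f64()
}

/// Runs `warmup` untimed calls, then returns the mean wall-clock time
/// over `timed` calls. The closure is a stand-in for the synthesis call.
fn mean_wall_clock<F: FnMut()>(mut run: F, warmup: usize, timed: usize) -> Duration {
    assert!(timed > 0, "need at least one timed iteration");
    // Warmup runs are executed but not measured.
    for _ in 0..warmup {
        run();
    }
    // Accumulate the elapsed time of each timed iteration.
    let mut total = Duration::ZERO;
    for _ in 0..timed {
        let start = Instant::now();
        run();
        total += start.elapsed();
    }
    total / timed as u32
}
```

With the reported settings this would be called as `mean_wall_clock(run, 2, 3)`. As a sanity check on the formula, 8.19 s of wall-clock time over 17.04 s of audio gives an RTF of about 0.48, which agrees with the Medium row of the non-streaming 0.6B Base table.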

- ### 0.6B Base — CUDA (BF16)
+ ### Non-Streaming (batch synthesis)
+
+ Uses `synthesize_with_timing` — the optimized `generate_codes` path with GPU-side
+ penalty mask and deferred acoustic codes transfer.
+
+ #### 0.6B Base — CUDA (BF16)
+
+ | Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+ |------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+ | Short | 13 | 1.82 sec | 3.76 sec | **0.49** | 25.8 | 756 MB | 12ms (1%) | 1671ms (92%) | 140ms (8%) |
+ | Medium | 53 | 8.19 sec | 17.04 sec | **0.48** | 26.0 | 761 MB | 12ms (0%) | 7672ms (94%) | 504ms (6%) |
+ | Long | 115 | 23.02 sec | 45.68 sec | **0.50** | 24.8 | 767 MB | 12ms (0%) | 21622ms (94%) | 1384ms (6%) |
+
+ #### 1.7B Base — CUDA (BF16)
+
+ | Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+ |------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+ | Short | 13 | 2.22 sec | 3.44 sec | **0.64** | 19.4 | 756 MB | 21ms (1%) | 2065ms (93%) | 129ms (6%) |
+ | Medium | 53 | 11.22 sec | 17.60 sec | **0.64** | 19.6 | 761 MB | 22ms (0%) | 10672ms (95%) | 521ms (5%) |
+ | Long | 115 | 29.82 sec | 45.68 sec | **0.65** | 19.2 | 767 MB | 22ms (0%) | 28409ms (95%) | 1382ms (5%) |
+
+ #### 1.7B CustomVoice — CUDA (BF16)
+
+ | Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+ |------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+ | Short | 13 | 3.02 sec | 4.72 sec | **0.64** | 19.6 | 756 MB | 22ms (1%) | 2834ms (94%) | 161ms (5%) |
+ | Medium | 53 | 20.06 sec | 31.12 sec | **0.64** | 19.4 | 763 MB | 21ms (0%) | 19094ms (95%) | 945ms (5%) |
+ | Long | 115 | 45.60 sec | 68.00 sec | **0.67** | 18.6 | 772 MB | 22ms (0%) | 43535ms (95%) | 2040ms (4%) |
+
+ #### 1.7B VoiceDesign — CUDA (BF16)
+
+ | Text | Words | Wall Clock | Audio Duration | RTF | Tok/s | Memory | Prefill | Generate | Decode |
+ |------|-------|------------|----------------|-----|-------|--------|---------|----------|--------|
+ | Short | 13 | 3.13 sec | 4.88 sec | **0.64** | 19.5 | 756 MB | 22ms (1%) | 2938ms (94%) | 165ms (5%) |
+ | Medium | 53 | 13.52 sec | 21.12 sec | **0.64** | 19.5 | 761 MB | 22ms (0%) | 12867ms (95%) | 626ms (5%) |
+ | Long | 115 | 42.14 sec | 62.96 sec | **0.67** | 18.7 | 770 MB | 23ms (0%) | 40215ms (95%) | 1896ms (4%) |
+
+ ### Streaming (with TTFA)
+
+ Uses `synthesize_streaming` — yields audio chunks incrementally. Both paths now
+ use GPU-side penalty mask. Streaming is ~8-12% slower than non-streaming due to
+ incremental decode overhead and per-frame `to_vec1` for the frame buffer.
+
+ #### 0.6B Base — CUDA (BF16)

| Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
|------|-------|------------|----------------|-----|------|-------|--------|
- | Short | 13 | 2.30 sec | 4.08 sec | **0.56** | 448 ms | 22.2 | 814 MB |
- | Medium | 53 | 10.08 sec | 17.84 sec | **0.57** | 452 ms | 22.1 | 817 MB |
- | Long | 115 | 110.63 sec | 163.84 sec | **0.68** | 456 ms | 18.5 | 841 MB |
-
- > Note: The 0.6B Base model generates significantly more frames per word than 1.7B models,
- > producing longer audio from the same text. The RTF increase on the long input reflects
- > the higher frame count (2048 frames vs ~529 for 1.7B).
+ | Short | 13 | 2.05 sec | 3.76 sec | **0.55** | 443 ms | 22.9 | 814 MB |
+ | Medium | 53 | 9.38 sec | 17.04 sec | **0.55** | 444 ms | 22.7 | 817 MB |
+ | Long | 115 | 26.01 sec | 45.68 sec | **0.57** | 445 ms | 22.0 | 820 MB |

- ### 1.7B Base — CUDA (BF16)
+ #### 1.7B Base — CUDA (BF16)

| Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
|------|-------|------------|----------------|-----|------|-------|--------|
- | Short | 13 | 2.25 sec | 3.12 sec | **0.72** | 590 ms | 17.3 | 761 MB |
- | Medium | 53 | 13.24 sec | 18.32 sec | **0.72** | 592 ms | 17.3 | 765 MB |
- | Long | 115 | 31.12 sec | 42.32 sec | **0.74** | 591 ms | 17.0 | 771 MB |
+ | Short | 13 | 2.45 sec | 3.44 sec | **0.71** | 576 ms | 17.6 | 762 MB |
+ | Medium | 53 | 12.37 sec | 17.60 sec | **0.70** | 579 ms | 17.8 | 765 MB |
+ | Long | 115 | 32.94 sec | 45.68 sec | **0.72** | 576 ms | 17.3 | 768 MB |

- ### 1.7B CustomVoice — CUDA (BF16)
+ #### 1.7B CustomVoice — CUDA (BF16)

| Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
|------|-------|------------|----------------|-----|------|-------|--------|
- | Short | 13 | 2.65 sec | 3.68 sec | **0.72** | 585 ms | 17.3 | 761 MB |
- | Medium | 53 | 24.11 sec | 33.12 sec | **0.73** | 588 ms | 17.2 | 766 MB |
- | Long | 115 | 45.18 sec | 60.32 sec | **0.75** | 590 ms | 16.7 | 769 MB |
+ | Short | 13 | 3.34 sec | 4.72 sec | **0.71** | 582 ms | 17.7 | 762 MB |
+ | Medium | 53 | 22.25 sec | 31.12 sec | **0.72** | 581 ms | 17.5 | 767 MB |
+ | Long | 115 | 50.52 sec | 68.00 sec | **0.74** | 585 ms | 16.8 | 773 MB |

- ### 1.7B VoiceDesign — CUDA (BF16)
+ #### 1.7B VoiceDesign — CUDA (BF16)

| Text | Words | Wall Clock | Audio Duration | RTF | TTFA | Tok/s | Memory |
|------|-------|------------|----------------|-----|------|-------|--------|
- | Short | 13 | 3.01 sec | 4.16 sec | **0.72** | 585 ms | 17.3 | 761 MB |
- | Medium | 53 | 14.73 sec | 20.48 sec | **0.72** | 585 ms | 17.4 | 764 MB |
- | Long | 115 | 53.78 sec | 71.36 sec | **0.75** | 590 ms | 16.6 | 778 MB |
+ | Short | 13 | 3.50 sec | 4.88 sec | **0.72** | 584 ms | 17.4 | 762 MB |
+ | Medium | 53 | 15.04 sec | 21.12 sec | **0.71** | 582 ms | 17.6 | 765 MB |
+ | Long | 115 | 46.46 sec | 62.96 sec | **0.74** | 582 ms | 16.9 | 771 MB |

### CPU (F32, no MKL/BLAS)

@@ -77,24 +116,38 @@ Each cell shows the average of 3 timed iterations after 2 warmup runs, executed

### Summary

+ **Non-streaming** (batch synthesis — optimized `generate_codes` path):
+
| Metric | CPU (1.7B) | 0.6B Base | 1.7B Base | 1.7B CustomVoice | 1.7B VoiceDesign |
|--------|----------:|---------:|---------:|----------------:|----------------:|
- | RTF (avg) | 5.96 | **0.60** | 0.73 | 0.73 | 0.73 |
- | Tokens/sec | 2.1 | **20.9** | 17.2 | 17.1 | 17.1 |
- | TTFA | — | **452ms** | 591ms | 588ms | 587ms |
- | Peak memory | 9.1 GB | 841 MB | 771 MB | 769 MB | 778 MB |
+ | RTF (avg) | 5.96 | **0.49** | 0.64 | 0.65 | 0.65 |
+ | Tokens/sec | 2.1 | **25.5** | **19.4** | 19.2 | 19.2 |
+ | Peak memory | 9.1 GB | 767 MB | 767 MB | 772 MB | 770 MB |
+
+ **Streaming** (incremental chunks with TTFA):
+
+ | Metric | 0.6B Base | 1.7B Base | 1.7B CustomVoice | 1.7B VoiceDesign |
+ |--------|---------:|---------:|----------------:|----------------:|
+ | RTF (avg) | **0.55** | 0.71 | 0.72 | 0.72 |
+ | Tokens/sec | **22.5** | 17.6 | 17.3 | 17.3 |
+ | TTFA | **444 ms** | 577 ms | 583 ms | 583 ms |
+ | Peak memory | 820 MB | 768 MB | 773 MB | 771 MB |
+
+ **CUDA delivers faster-than-real-time synthesis** across all text lengths and
+ all model variants. Non-streaming is ~8-12% faster than streaming due to
+ deferred acoustic codes transfer in the `generate_codes` path; both paths
+ use the GPU-side penalty mask.
+
+ The 0.6B model is ~30% faster than 1.7B variants, at the cost of reduced
+ voice quality.

- **CUDA delivers faster-than-real-time synthesis** across all text lengths.
CPU is ~6x slower than real-time without BLAS acceleration — expected for
a 1.7B parameter model in F32. Enabling MKL (x86) or Accelerate (macOS)
would improve CPU performance significantly.

- TTFA (time to first audio) via streaming is stable at ~590ms (1.7B) or ~450ms (0.6B)
+ TTFA (time to first audio) via streaming is stable at ~580ms (1.7B) or ~444ms (0.6B)
regardless of input length, making the streaming API suitable for interactive use cases.
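
To make the TTFA metric concrete, the sketch below times how long a stream of audio chunks takes to yield its first chunk. It is an assumption-laden illustration: the real return type of `synthesize_streaming` is not shown in this document, so any iterator of `Vec<f32>` sample buffers stands in for it, and `measure_ttfa` is a hypothetical helper name.

```rust
use std::time::{Duration, Instant};

/// Drains a stream of audio chunks and reports
/// (time to first chunk, total elapsed time, total sample count).
/// The iterator is a stand-in for a streaming synthesis session; it should
/// be lazy so that generation work happens inside the timed loop.
fn measure_ttfa<I>(chunks: I) -> (Option<Duration>, Duration, usize)
where
    I: IntoIterator<Item = Vec<f32>>,
{
    let start = Instant::now();
    let mut ttfa = None;
    let mut samples = 0;
    for chunk in chunks {
        // Record the elapsed time only when the first chunk arrives.
        if ttfa.is_none() {
            ttfa = Some(start.elapsed());
        }
        samples += chunk.len();
    }
    (ttfa, start.elapsed(), samples)
}
```

Because the first value depends only on when the first chunk arrives, a measurement of this shape would be expected to stay roughly constant as the input text grows, which is consistent with the stable TTFA figures reported here. An empty stream yields `None` for TTFA rather than a misleading zero.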

- The 0.6B model is ~20% faster than 1.7B variants with lower TTFA, at the cost
- of reduced voice quality.
-
## Micro-Benchmarks

Component-level benchmarks run via [Criterion](https://bheisler.github.io/criterion.rs/book/).