ComfyUI nodes for VoxCPM2 — tokenizer-free, diffusion autoregressive Text-to-Speech.
2B parameters, 30 languages, 48kHz audio output, voice design, controllable cloning, and LoRA training.
VoxCPM2 is a tokenizer-free Text-to-Speech model trained on over 2 million hours of multilingual speech data. Built on a MiniCPM-4 backbone with AudioVAE V2, it outputs 48kHz studio-quality audio and supports 30 languages with no language tag needed.
This custom node provides two inference nodes and a full LoRA training pipeline, all integrated directly into ComfyUI — based on the original ComfyUI-VoxCPM by @wildminder.
Key Features:
- 30-Language Multilingual — Input text in any supported language, no language tag needed
- Voice Design — Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace)
- Controllable Voice Cloning — Clone any voice from a short reference clip, with optional style guidance
- Ultimate Cloning — Provide reference audio + transcript for maximum fidelity reproduction
- 48kHz Studio-Quality Output — Accepts 16kHz reference audio, outputs 48kHz via AudioVAE V2's built-in super-resolution
- LoRA Support — Load fine-tuned LoRA checkpoints for specific voice styles
- Native LoRA Training — Train LoRA adapters directly within ComfyUI
- Automatic Model Management — Models are downloaded and managed by ComfyUI to save VRAM
- Torch Compile — Optional `torch.compile` optimization for faster inference
- ASR Auto-Transcription — Auto-transcribe reference audio using SenseVoiceSmall (requires `funasr`)
- Reference Audio Denoiser — Optional ZipEnhancer denoising for cleaner cloning (requires `modelscope`)
- Loudness Normalization — Auto-normalize output loudness when denoiser is active
- Audio Duration Validation — Rejects reference audio over 50 seconds to prevent quality issues
Search for `ComfyUI-VoxCPM2` in ComfyUI Manager and click "Install", or install manually:

1. Clone into your `ComfyUI/custom_nodes/` directory:

   ```
   git clone https://github.com/Saganaki22/ComfyUI-VoxCPM2.git
   ```

2. Install dependencies:

   ```
   cd ComfyUI-VoxCPM2
   pip install -r requirements.txt
   ```

   Python 3.13+ on Windows: if `pip install` fails on `editdistance` with a `pdm.backend` import error or a C4819 encoding warning, run:

   ```
   pip install pdm-backend
   set CL=/utf-8        # CMD
   # PowerShell: $env:CL="/utf-8"
   pip install -r requirements.txt
   ```

3. Restart ComfyUI. Nodes appear under the `audio/tts` category.
The model is downloaded automatically on first use to `ComfyUI/models/tts/VoxCPM/`.
| Model | Parameters | Sample Rate | Description | Hugging Face |
|---|---|---|---|---|
| VoxCPM2 | 2B | 48kHz | Latest release. 30 languages, voice design, controllable cloning. | openbmb/VoxCPM2 |
Text-to-speech with optional voice design. No reference audio needed.
| Input | Type | Default | Description |
|---|---|---|---|
| `model_name` | Combo | — | Select the VoxCPM2 model |
| `lora_name` | Combo | None | LoRA checkpoint from `models/loras` |
| `voice_description` | String | — | Voice design prompt (e.g. "A young woman, gentle and sweet voice"). Auto-wrapped in parentheses and prepended to the text |
| `text` | String | — | Target text to synthesize |
| `cfg_value` | Float | 2.0 | Classifier-Free Guidance scale (1.0–10.0) |
| `inference_timesteps` | Int | 10 | Diffusion steps. More = better quality, slower |
| `max_tokens` | Int | 4096 | Max generation length (64–8192) |
| `normalize_text` | Toggle | Normalize | Auto-process numbers, abbreviations, punctuation |
| `seed` | Int | -1 | Reproducibility seed (-1 = random) |
| `force_offload` | Toggle | Auto | Force VRAM offload after generation |
| `dtype` | Combo | auto | Model dtype: auto (native bf16; fp16 on older GPUs), bf16, fp16 |
| `device` | Combo | cuda | Inference device (cuda, mps, cpu) |
| `enable_asr` | Toggle | Off | Auto-transcribe reference audio with SenseVoiceSmall ASR. Requires `funasr`. First run downloads the model (~400MB) |
| `retry_max_attempts` | Int | 3 | Auto-retries on bad generation (0–10) |
| `retry_threshold` | Float | 6.0 | Threshold for detecting bad generations |
| `torch_compile` | Toggle | Standard | Enable `torch.compile` optimization |
Voice cloning with controllable and ultimate modes.
| Input | Type | Default | Description |
|---|---|---|---|
| `model_name` | Combo | — | Select the VoxCPM2 model |
| `lora_name` | Combo | None | LoRA checkpoint from `models/loras` |
| `voice_description` | String | — | Style control (e.g. "slightly faster, cheerful tone"). Auto-wrapped in parentheses and prepended to the text |
| `text` | String | — | Target text to synthesize |
| `reference_audio` | Audio | Required | Reference audio for voice cloning (max 50 seconds) |
| `prompt_text` | String | — | Transcript of the reference audio. Provide it for Ultimate Cloning (highest fidelity); leave empty for Controllable Cloning, or enable `enable_asr` to auto-transcribe |
| `cfg_value` | Float | 2.0 | Classifier-Free Guidance scale (1.0–10.0) |
| `inference_timesteps` | Int | 10 | Diffusion steps. More = better quality, slower |
| `max_tokens` | Int | 4096 | Max generation length (64–8192) |
| `normalize_text` | Toggle | Normalize | Auto-process numbers, abbreviations, punctuation |
| `enable_denoiser` | Toggle | Off | Denoise reference audio before cloning with ZipEnhancer. Requires `modelscope`. Output loudness is auto-normalized to -20 LUFS |
| `seed` | Int | -1 | Reproducibility seed (-1 = random) |
| `force_offload` | Toggle | Auto | Force VRAM offload after generation |
| `dtype` | Combo | auto | Model dtype: auto (native bf16; fp16 on older GPUs), bf16, fp16 |
| `device` | Combo | cuda | Inference device (cuda, mps, cpu) |
| `enable_asr` | Toggle | Off | Auto-transcribe reference audio with SenseVoiceSmall ASR. Requires `funasr`. Ignored when `prompt_text` is provided. First run downloads the model (~400MB) |
| `retry_max_attempts` | Int | 3 | Auto-retries on bad generation (0–10) |
| `retry_threshold` | Float | 6.0 | Threshold for detecting bad generations |
| `torch_compile` | Toggle | Standard | Enable `torch.compile` optimization |
1. Add the VoxCPM2 TTS node to your workflow.
2. Type your text in the `text` field.
3. Optionally describe a voice in `voice_description` (e.g. "A deep male voice, calm and authoritative").
4. Queue the prompt.
The `voice_description` field lets you create any voice without reference audio:
- "A young woman, gentle and sweet voice"
- "An old man with a gravelly, slow voice"
- "A child, excited and energetic"
The description is automatically wrapped in parentheses and prepended to your text, matching the VoxCPM2 API format `(description)text`.
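Under the hood, this wrapping is plain string concatenation; a minimal sketch (a hypothetical `build_prompt` helper, not the node's actual code):

```python
def build_prompt(voice_description: str, text: str) -> str:
    """Prepend a parenthesized voice description, matching the
    VoxCPM2 ``(description)text`` prompt format."""
    desc = voice_description.strip()
    if not desc:
        # No description given: synthesize the bare text with a model-chosen voice.
        return text
    return f"({desc}){text}"
```

For example, `build_prompt("A young woman, gentle and sweet voice", "Hello.")` yields `(A young woman, gentle and sweet voice)Hello.`.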
1. Add the VoxCPM2 Voice Clone node.
2. Connect a `Load Audio` node to `reference_audio`.
3. Enter your target text in `text`.
4. Optionally add style guidance in `voice_description` (e.g. "slightly faster, cheerful tone").
5. Leave `prompt_text` empty. Optionally enable `enable_asr` to auto-transcribe the reference audio.
1. Same as above, but also provide the exact transcript of the reference audio in `prompt_text`.
2. The model uses audio-continuation cloning to reproduce every vocal nuance.
3. If you don't have a transcript, enable `enable_asr` — it will auto-transcribe and enter Ultimate mode automatically.
Enable `enable_asr` on either node to automatically transcribe reference audio using the SenseVoiceSmall model. The first run downloads the model (~400MB). Requires `pip install funasr`.
When `enable_asr` is on:

- If `prompt_text` is empty, ASR runs and fills it automatically (enters Ultimate Cloning)
- If `prompt_text` is already provided, ASR is skipped and the manual transcript is used instead
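That precedence rule can be sketched in plain Python (hypothetical names; the node's internal logic is equivalent in effect, not necessarily in code):

```python
from typing import Callable, Tuple

def resolve_prompt_text(
    prompt_text: str,
    enable_asr: bool,
    transcribe: Callable[[], str],
) -> Tuple[str, str]:
    """Return (transcript, cloning_mode) following the precedence above."""
    if prompt_text.strip():
        # A manual transcript always wins; ASR is skipped entirely.
        return prompt_text, "ultimate"
    if enable_asr:
        # Empty transcript with ASR on: transcribe and enter Ultimate mode.
        return transcribe(), "ultimate"
    # No transcript at all: fall back to Controllable Cloning.
    return "", "controllable"
```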
Enable `enable_denoiser` on the Voice Clone node to clean up noisy reference audio before cloning. Uses ZipEnhancer via ModelScope. Requires `pip install modelscope`.
When the denoiser is active, the output audio loudness is automatically normalized to -20 LUFS for consistent volume.
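For reference, hitting a -20 LUFS target amounts to applying a linear gain derived from the measured loudness. Measuring LUFS itself requires a loudness library such as pyloudnorm; the helper below only computes the gain factor and is illustrative, not the node's actual code:

```python
def lufs_gain(measured_lufs: float, target_lufs: float = -20.0) -> float:
    """Linear gain factor that shifts audio from measured_lufs to target_lufs."""
    gain_db = target_lufs - measured_lufs
    # Amplitude gain: every +20 dB is a 10x amplitude increase.
    return 10.0 ** (gain_db / 20.0)
```

Audio measured at -26 LUFS, for instance, needs roughly a 2x amplitude gain to reach -20 LUFS.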
Reference audio is validated on upload — audio longer than 50 seconds will be rejected with an error. For best results, use 5–15 seconds of clean, continuous speech.
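The duration check boils down to comparing sample count against sample rate; a minimal sketch (hypothetical helper, not the node's code):

```python
MAX_REF_SECONDS = 50.0

def check_reference_duration(num_samples: int, sample_rate: int) -> float:
    """Return the clip duration in seconds, rejecting clips over the 50s limit."""
    seconds = num_samples / sample_rate
    if seconds > MAX_REF_SECONDS:
        raise ValueError(
            f"Reference audio is {seconds:.1f}s; maximum is {MAX_REF_SECONDS:.0f}s."
        )
    return seconds
```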
1. Place `.safetensors` LoRA files in `ComfyUI/models/loras/`.
2. Select your LoRA in the `lora_name` dropdown.
Train custom LoRA adapters directly in ComfyUI using the training nodes (VoxCPM2 Train Config, VoxCPM2 Dataset Maker, VoxCPM2 LoRA Trainer).
See the full LoRA Training Guide for details.
- Use clean, high-quality reference audio (5–15 seconds of continuous speech)
- For Ultimate Cloning, provide an accurate verbatim transcript in `prompt_text`
- Punctuation in the transcript helps the model capture intonation
- `cfg_value` (default 2.0): raise for closer adherence to the prompt, lower for more natural variation
- `inference_timesteps` (default 10): 5–10 for fast drafts, 15–25 for higher quality
- `normalize_text`: keep ON for natural-language input; turn OFF only for phoneme input like `{HH AH0 L OW1}`
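The tuning advice above can be captured as two starting-point presets (illustrative values drawn from the ranges stated, not shipped defaults):

```python
# Hypothetical presets for the VoxCPM2 nodes; tweak per voice and language.
PRESETS = {
    # Fast draft: fewer diffusion steps, default guidance.
    "draft":   {"cfg_value": 2.0, "inference_timesteps": 8},
    # Higher quality: more steps, slightly stronger prompt adherence.
    "quality": {"cfg_value": 2.5, "inference_timesteps": 20},
}
```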
Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
Chinese Dialects: Sichuan, Cantonese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, Southern Min
- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended
- Performance varies across languages depending on training data availability
- Occasional instability with very long or highly expressive inputs
- Use for impersonation, fraud, or disinformation is strictly forbidden. AI-generated content should be clearly labeled.
The VoxCPM model and its components are subject to the Apache-2.0 License provided by OpenBMB.
- @wildminder for the original ComfyUI-VoxCPM this project is based on
- OpenBMB & ModelBest for creating and open-sourcing VoxCPM
- The ComfyUI team for their powerful and extensible platform