ComfyUI nodes for VoxCPM2 — tokenizer-free, diffusion autoregressive Text-to-Speech.
2B parameters, 30 languages, 48kHz audio output, voice design, controllable cloning, and LoRA training.
VoxCPM2 is a tokenizer-free Text-to-Speech model trained on over 2 million hours of multilingual speech data. Built on a MiniCPM-4 backbone with AudioVAE V2, it outputs 48kHz studio-quality audio and supports 30 languages with no language tag needed.
This custom node provides two inference nodes and a full LoRA training pipeline, all integrated directly into ComfyUI — based on the original ComfyUI-VoxCPM by @wildminder.
Key Features:
- 30-Language Multilingual — Input text in any supported language, no language tag needed
- Voice Design — Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace)
- Controllable Voice Cloning — Clone any voice from a short reference clip, with optional style guidance
- Ultimate Cloning — Provide reference audio + transcript for maximum fidelity reproduction
- 48kHz Studio-Quality Output — Accepts 16kHz reference audio, outputs 48kHz via AudioVAE V2's built-in super-resolution
- LoRA Support — Load fine-tuned LoRA checkpoints for specific voice styles
- Native LoRA Training — Train LoRA adapters directly within ComfyUI
- Automatic Model Management — Models are downloaded and managed by ComfyUI to save VRAM
- Torch Compile — Optional `torch.compile` optimization for faster inference
- ASR Auto-Transcription — Auto-transcribe reference audio using SenseVoiceSmall (requires `funasr`)
- Reference Audio Denoiser — Optional ZipEnhancer denoising for cleaner cloning (requires `modelscope`)
- Loudness Normalization — Auto-normalize output loudness when denoiser is active
- Audio Duration Validation — Rejects reference audio over 50 seconds to prevent quality issues
Search for `ComfyUI-VoxCPM2` in ComfyUI Manager and click "Install", or install manually:

1. Clone into your `ComfyUI/custom_nodes/` directory:

   ```
   git clone https://github.com/Saganaki22/ComfyUI-VoxCPM2.git
   ```

2. Install dependencies:

   ```
   cd ComfyUI-VoxCPM2
   pip install -r requirements.txt
   ```

   Python 3.13+ on Windows: if `pip install` fails on `editdistance` with a `pdm.backend` import error or a C4819 encoding warning, run:

   ```
   pip install pdm-backend
   set CL=/utf-8        # CMD
   # PowerShell: $env:CL="/utf-8"
   pip install -r requirements.txt
   ```

3. Restart ComfyUI. Nodes appear under the `audio/tts` category.
The model is downloaded automatically on first use to `ComfyUI/models/tts/VoxCPM/`.
| Model | Parameters | Sample Rate | Description | Hugging Face |
|---|---|---|---|---|
| VoxCPM2 | 2B | 48kHz | Latest release. 30 languages, voice design, controllable cloning. | openbmb/VoxCPM2 |
Text-to-speech with optional voice design. No reference audio needed.
| Input | Type | Default | Description |
|---|---|---|---|
| `model_name` | Combo | — | Select the VoxCPM2 model |
| `lora_name` | Combo | None | LoRA checkpoint from `models/loras` |
| `voice_description` | String | — | Voice design prompt (e.g. "A young woman, gentle and sweet voice"). Auto-wrapped in parentheses and prepended to the text |
| `text` | String | — | Target text to synthesize |
| `cfg_value` | Float | 2.0 | Classifier-Free Guidance scale (1.0–10.0) |
| `inference_timesteps` | Int | 10 | Diffusion steps. More = better quality, slower |
| `max_tokens` | Int | 4096 | Max generation length (64–8192) |
| `normalize_text` | Toggle | Normalize | Auto-process numbers, abbreviations, punctuation |
| `seed` | Int | -1 | Reproducibility seed (-1 = random) |
| `force_offload` | Toggle | Auto | Force VRAM offload after generation |
| `dtype` | Combo | auto | Model dtype: auto (native bf16; fp16 on older GPUs), bf16, fp16 |
| `device` | Combo | cuda | Inference device (cuda, mps, cpu) |
| `enable_asr` | Toggle | Off | Auto-transcribe reference audio with SenseVoiceSmall ASR. Requires `funasr`. First run downloads the model (~400MB) |
| `retry_max_attempts` | Int | 3 | Auto-retries on bad generation (0–10) |
| `retry_threshold` | Float | 6.0 | Threshold for detecting bad generations |
| `torch_compile` | Toggle | Standard | Enable `torch.compile` optimization |
Voice cloning with controllable and ultimate modes.
| Input | Type | Default | Description |
|---|---|---|---|
| `model_name` | Combo | — | Select the VoxCPM2 model |
| `lora_name` | Combo | None | LoRA checkpoint from `models/loras` |
| `voice_description` | String | — | Style control (e.g. "slightly faster, cheerful tone"). Auto-wrapped in parentheses and prepended to the text |
| `text` | String | — | Target text to synthesize |
| `reference_audio` | Audio | Required | Reference audio for voice cloning (max 50 seconds) |
| `prompt_text` | String | — | Transcript of the reference audio. Provide it for Ultimate Cloning (highest fidelity); leave empty for Controllable Cloning, or enable `enable_asr` to auto-transcribe |
| `cfg_value` | Float | 2.0 | Classifier-Free Guidance scale (1.0–10.0) |
| `inference_timesteps` | Int | 10 | Diffusion steps. More = better quality, slower |
| `max_tokens` | Int | 4096 | Max generation length (64–8192) |
| `normalize_text` | Toggle | Normalize | Auto-process numbers, abbreviations, punctuation |
| `enable_denoiser` | Toggle | Off | Denoise reference audio before cloning with ZipEnhancer. Requires `modelscope`. Output loudness is auto-normalized to -20 LUFS |
| `seed` | Int | -1 | Reproducibility seed (-1 = random) |
| `force_offload` | Toggle | Auto | Force VRAM offload after generation |
| `dtype` | Combo | auto | Model dtype: auto (native bf16; fp16 on older GPUs), bf16, fp16 |
| `device` | Combo | cuda | Inference device (cuda, mps, cpu) |
| `enable_asr` | Toggle | Off | Auto-transcribe reference audio with SenseVoiceSmall ASR. Requires `funasr`. Ignored when `prompt_text` is provided. First run downloads the model (~400MB) |
| `retry_max_attempts` | Int | 3 | Auto-retries on bad generation (0–10) |
| `retry_threshold` | Float | 6.0 | Threshold for detecting bad generations |
| `torch_compile` | Toggle | Standard | Enable `torch.compile` optimization |
1. Add the VoxCPM2 TTS node to your workflow.
2. Type your text in the `text` field.
3. Optionally describe a voice in `voice_description` (e.g. "A deep male voice, calm and authoritative").
4. Queue the prompt.
The `voice_description` field lets you create any voice without reference audio:
- "A young woman, gentle and sweet voice"
- "An old man with a gravelly, slow voice"
- "A child, excited and energetic"
The description is automatically wrapped in parentheses and prepended to your text, matching the VoxCPM2 API format `(description)text`.
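Under the hood, this wrapping is plain string concatenation; a minimal sketch (a hypothetical `build_prompt` helper, not the node's actual code):

```python
def build_prompt(voice_description: str, text: str) -> str:
    """Prepend a parenthesized voice description, matching the
    VoxCPM2 ``(description)text`` prompt format."""
    desc = voice_description.strip()
    if not desc:
        # No description given: synthesize the bare text with a model-chosen voice.
        return text
    return f"({desc}){text}"
```

For example, `build_prompt("A young woman, gentle and sweet voice", "Hello.")` yields `(A young woman, gentle and sweet voice)Hello.`.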
1. Add the VoxCPM2 Voice Clone node.
2. Connect a `Load Audio` node to `reference_audio`.
3. Enter your target text in `text`.
4. Optionally add style guidance in `voice_description` (e.g. "slightly faster, cheerful tone").
5. Leave `prompt_text` empty. Optionally enable `enable_asr` to auto-transcribe the reference audio.
1. Same as above, but also provide the exact transcript of the reference audio in `prompt_text`.
2. The model uses audio-continuation cloning to reproduce every vocal nuance.
3. If you don't have a transcript, enable `enable_asr` — it will auto-transcribe and enter Ultimate mode automatically.
Enable `enable_asr` on either node to automatically transcribe reference audio using the SenseVoiceSmall model. The first run downloads the model (~400MB). Requires `pip install funasr`.
When `enable_asr` is on:

- If `prompt_text` is empty, ASR runs and fills it automatically (enters Ultimate Cloning)
- If `prompt_text` is already provided, ASR is skipped and the manual transcript is used instead
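That precedence rule can be sketched in plain Python (hypothetical names; the node's internal logic is equivalent in effect, not necessarily in code):

```python
from typing import Callable, Tuple

def resolve_prompt_text(
    prompt_text: str,
    enable_asr: bool,
    transcribe: Callable[[], str],
) -> Tuple[str, str]:
    """Return (transcript, cloning_mode) following the precedence above."""
    if prompt_text.strip():
        # A manual transcript always wins; ASR is skipped entirely.
        return prompt_text, "ultimate"
    if enable_asr:
        # Empty transcript with ASR on: transcribe and enter Ultimate mode.
        return transcribe(), "ultimate"
    # No transcript at all: fall back to Controllable Cloning.
    return "", "controllable"
```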
Enable `enable_denoiser` on the Voice Clone node to clean up noisy reference audio before cloning. Uses ZipEnhancer via ModelScope. Requires `pip install modelscope`.
When the denoiser is active, the output audio loudness is automatically normalized to -20 LUFS for consistent volume.
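For reference, hitting a -20 LUFS target amounts to applying a linear gain derived from the measured loudness. Measuring LUFS itself requires a loudness library such as pyloudnorm; the helper below only computes the gain factor and is illustrative, not the node's actual code:

```python
def lufs_gain(measured_lufs: float, target_lufs: float = -20.0) -> float:
    """Linear gain factor that shifts audio from measured_lufs to target_lufs."""
    gain_db = target_lufs - measured_lufs
    # Amplitude gain: every +20 dB is a 10x amplitude increase.
    return 10.0 ** (gain_db / 20.0)
```

Audio measured at -26 LUFS, for instance, needs roughly a 2x amplitude gain to reach -20 LUFS.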
Reference audio is validated on upload — audio longer than 50 seconds will be rejected with an error. For best results, use 5–15 seconds of clean, continuous speech.
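The duration check boils down to comparing sample count against sample rate; a minimal sketch (hypothetical helper, not the node's code):

```python
MAX_REF_SECONDS = 50.0

def check_reference_duration(num_samples: int, sample_rate: int) -> float:
    """Return the clip duration in seconds, rejecting clips over the 50s limit."""
    seconds = num_samples / sample_rate
    if seconds > MAX_REF_SECONDS:
        raise ValueError(
            f"Reference audio is {seconds:.1f}s; maximum is {MAX_REF_SECONDS:.0f}s."
        )
    return seconds
```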
1. Place `.safetensors` LoRA files in `ComfyUI/models/loras/`.
2. Select your LoRA in the `lora_name` dropdown.
Train custom LoRA adapters directly in ComfyUI using the training nodes (VoxCPM2 Train Config, VoxCPM2 Dataset Maker, VoxCPM2 LoRA Trainer).
See the full LoRA Training Guide for details.
- Use clean, high-quality reference audio (5–15 seconds of continuous speech)
- For Ultimate Cloning, provide an accurate verbatim transcript in `prompt_text`
- Punctuation in the transcript helps the model capture intonation
- `cfg_value` (default 2.0): raise for closer adherence to the prompt, lower for more natural variation
- `inference_timesteps` (default 10): 5–10 for fast drafts, 15–25 for higher quality
- `normalize_text`: keep ON for natural-language input; turn OFF only for phoneme input like `{HH AH0 L OW1}`
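The tuning advice above can be captured as two starting-point presets (illustrative values drawn from the ranges stated, not shipped defaults):

```python
# Hypothetical presets for the VoxCPM2 nodes; tweak per voice and language.
PRESETS = {
    # Fast draft: fewer diffusion steps, default guidance.
    "draft":   {"cfg_value": 2.0, "inference_timesteps": 8},
    # Higher quality: more steps, slightly stronger prompt adherence.
    "quality": {"cfg_value": 2.5, "inference_timesteps": 20},
}
```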
Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
Chinese Dialects: Sichuan, Cantonese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, Southern Min
- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended
- Performance varies across languages depending on training data availability
- Occasional instability with very long or highly expressive inputs
- Use for impersonation, fraud, or disinformation is strictly forbidden. AI-generated content should be clearly labeled.
The VoxCPM model and its components are subject to the Apache-2.0 License provided by OpenBMB.
- @wildminder for the original ComfyUI-VoxCPM this project is based on
- OpenBMB & ModelBest for creating and open-sourcing VoxCPM
- The ComfyUI team for their powerful and extensible platform