Vietnamese Text-to-Speech system powered by Fish Speech v1.5 with voice cloning and multilingual support.
- Zero-shot & Few-shot TTS: Generate high-quality speech with just 10-30 seconds of reference audio
- Multilingual Support: English, Chinese, Japanese, German, French, Arabic, Russian, Dutch, Italian, Portuguese, Vietnamese, and more
- Cross-lingual Voice Cloning: Clone voices across different languages
- Vietnamese Optimization: Special utilities and optimizations for the Vietnamese language
- High Quality: Top-ranked performance on TTS-Arena2 benchmark
- Easy to Use: Simple WebUI and REST API
- Python 3.10 or higher
- 8GB RAM minimum
- 5GB disk space for models and dependencies
- macOS (with MPS), Linux, or Windows (WSL)
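The Python version and disk space requirements above can be checked from Python before installing anything. This is a minimal sketch using only the standard library; it is not the content of `verify_setup.sh`, and the function name `check_requirements` is hypothetical. RAM is not checked because the standard library has no portable way to query it.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)      # "Python 3.10 or higher"
MIN_FREE_DISK_GB = 5      # "5GB disk space for models and dependencies"


def check_requirements(path="."):
    """Return a dict mapping requirement name to True (satisfied) or False."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "disk": free_gb >= MIN_FREE_DISK_GB,
    }


# Report each requirement for the current directory
for name, ok in check_requirements().items():
    print(f"{name}: {'ok' if ok else 'NOT satisfied'}")
```

Pass the directory where the models will be stored as `path` if it lives on a different disk.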
- Clone the repository:
    git clone https://github.com/yourusername/text2speech-py.git
    cd text2speech-py
- Create and activate a virtual environment:
    python -m venv .venv
    source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Verify setup:
    ./verify_setup.sh
./start_webui.sh
Then open your browser to http://localhost:7860
./start_api.sh
API will be available at http://localhost:8000
This project includes special utilities for Vietnamese language optimization:
from vietnamese_tts_helper import VietnameseTTSHelper
helper = VietnameseTTSHelper()
# Prepare Vietnamese text
text = "Xin chào, đây là hệ thống chuyển đổi văn bản thành giọng nói."
prepared = helper.prepare_vietnamese_text(text)
# Split long text
chunks = helper.split_long_text(prepared, max_length=150)
# Get recommended parameters
params = helper.get_recommended_parameters()
See VIETNAMESE_GUIDE.md for a detailed Vietnamese usage guide.
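To illustrate what splitting long text into chunks can look like, here is a minimal sketch that packs whole sentences into chunks of at most `max_length` characters. It is an assumption about the general technique, not the actual implementation of `VietnameseTTSHelper.split_long_text`; the function name `split_by_sentences` is hypothetical.

```python
import re


def split_by_sentences(text, max_length=150):
    """Greedily pack whole sentences into chunks of at most max_length characters.

    A single sentence longer than max_length is kept whole rather than cut mid-word.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_length:
            # Adding this sentence would overflow, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries rather than at a fixed character count keeps each chunk a complete phrase, which tends to give more natural prosody when chunks are synthesized separately.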
import requests
response = requests.post(
    "http://localhost:8000/v1/tts",
    json={
        "text": "Hello, this is a test.",
        "language": "en",
    },
)
response.raise_for_status()  # Fail early instead of saving an error body as audio
with open("output.wav", "wb") as f:
    f.write(response.content)

import requests
# Send the reference audio as multipart form data; the file is closed when the block exits
with open("voice_sample.wav", "rb") as reference_audio:
    response = requests.post(
        "http://localhost:8000/v1/tts",
        files={"reference_audio": reference_audio},
        data={
            "text": "This will sound like the reference voice.",
            "language": "en",
        },
    )
response.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(response.content)

- Format: WAV (16-bit, 44.1 kHz) or high-quality MP3 (320 kbps)
- Duration: 10-30 seconds (15-20 seconds optimal)
- Content: Natural speech, avoid monotone reading
- Environment: Quiet room, minimal background noise
- Microphone distance: 15-20cm from mouth
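The format and duration guidelines above can be checked automatically for WAV files. This is a minimal sketch using the standard-library `wave` module, which only reads uncompressed PCM WAV, so MP3 references are out of scope here. The function name `check_reference_wav` is hypothetical and not part of this project.

```python
import wave


def check_reference_wav(path, min_seconds=10.0, max_seconds=30.0):
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        sample_width_bits = wav.getsampwidth() * 8
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    if sample_width_bits != 16:
        problems.append(f"expected 16-bit samples, got {sample_width_bits}-bit")
    if sample_rate != 44100:
        problems.append(f"expected 44100 Hz, got {sample_rate} Hz")
    if not min_seconds <= duration <= max_seconds:
        problems.append(
            f"duration {duration:.1f}s is outside {min_seconds:.0f}-{max_seconds:.0f}s"
        )
    return problems
```

For example, `check_reference_wav("voice_sample.wav")` returns an empty list when the file meets all three criteria. Content and recording-environment quality still need to be judged by ear.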
- Upload reference audio in any language
- Enter text in target language
- The model will maintain voice characteristics while speaking in the new language
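The cross-lingual steps above map onto the same multipart request used for voice cloning: the reference audio can be in one language and the `language` field names the target. The sketch below only builds the request arguments, so it can be inspected without a running server. The helper name `build_cross_lingual_request` is hypothetical, and the language code `"vi"` in the usage comment is an assumption, since only `"en"` appears in the examples.

```python
API_URL = "http://localhost:8000/v1/tts"  # endpoint taken from the API examples


def build_cross_lingual_request(text, target_language, reference_file):
    """Return keyword arguments suitable for requests.post().

    reference_file is an open binary file object holding the reference audio,
    which may be in a different language from target_language.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "url": API_URL,
        "files": {"reference_audio": reference_file},
        "data": {"text": text, "language": target_language},
    }


# Usage (requires the API server to be running and `import requests`):
# with open("voice_sample.wav", "rb") as ref:
#     kwargs = build_cross_lingual_request("Xin chào", "vi", ref)  # "vi" is assumed
#     response = requests.post(**kwargs)
#     response.raise_for_status()
```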
Test different devices to find the fastest for your system:
python benchmark_devices.py
This will test CPU and MPS (if available) and recommend the optimal device.
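The idea behind a device benchmark can be sketched as timing the same workload once per device and picking the fastest. The code below is not the content of `benchmark_devices.py`; the names `time_callable` and `fastest_device` are hypothetical, and each workload is assumed to be a zero-argument callable that synthesizes the same text on one device.

```python
import time


def time_callable(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def fastest_device(workloads, repeats=3):
    """workloads maps a device name to a zero-argument callable.

    Returns (name of the fastest device, dict of all timings in seconds).
    """
    timings = {name: time_callable(fn, repeats) for name, fn in workloads.items()}
    return min(timings, key=timings.get), timings
```

Taking the best of several runs reduces the effect of one-off warm-up costs such as model loading. On backends that execute asynchronously, the workload should wait for the device to finish before returning, otherwise the measured time will be too short.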
.
├── benchmark_devices.py # Device performance testing
├── vietnamese_tts_helper.py # Vietnamese language utilities
├── examples.py # Usage examples
├── test_tts.py # Installation test script
├── start_webui.sh # WebUI launcher
├── start_api.sh # API server launcher
├── fish-speech/ # Fish Speech engine
│ ├── checkpoints/ # Model files
│ └── tools/ # Utilities
└── .venv/ # Virtual environment
# Install development dependencies
pip install -r requirements-dev.txt
# Run code quality checks
ruff format .
ruff check .
mypy *.py
# Run tests
pytest
See CONTRIBUTING.md for detailed development guidelines.
# Check setup
./verify_setup.sh
# View detailed logs
cd fish-speech
source ../.venv/bin/activate
python tools/run_webui.py --llama-checkpoint-path checkpoints/fish-speech-1.5
Check if model files exist in fish-speech/checkpoints/fish-speech-1.5/. If not, download them:
cd fish-speech
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='fishaudio/fish-speech-1.5', local_dir='checkpoints/fish-speech-1.5')"
Edit the startup script to change the port, or stop the process using that port.
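Before editing a startup script, it can help to confirm that the port is really taken. This is a minimal sketch using the standard-library `socket` module; the function name `is_port_in_use` is hypothetical and not part of this project.

```python
import socket


def is_port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


# Ports used by the WebUI and the API server in this README
for port in (7860, 8000):
    print(f"port {port}: {'in use' if is_port_in_use(port) else 'free'}")
```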
- Vietnamese Guide - Usage guide written in Vietnamese
- Vietnamese Optimization - Optimization for Vietnamese
- Contributing Guidelines - How to contribute
- Changelog - Version history
This project is licensed under the MIT License - see the LICENSE file for details.
Fish Speech components:
- Code: Apache License 2.0
- Models: CC-BY-NC-SA-4.0 License
- Fish Audio for the Fish Speech engine
- All contributors who help improve this project
- Open an issue for bug reports
- Check discussions for questions
- See CONTRIBUTING.md for contribution guidelines
Made with care for Vietnamese TTS applications