Hey, love the project — Speaches is hands down the cleanest self-hosted OpenAI-compatible speech server out there. Been running it and it's solid.
Feature request: Weighted voice blending for Kokoro TTS, e.g.:
{ "voice": "af_sarah(1)+am_adam(1)+am_onyx(0.5)" }
This is a well-established technique in the Kokoro ecosystem — weighted averaging of style vectors to create custom voice personas without any training. Several projects already implement it:
- Kokoro-FastAPI — `voice1(weight)+voice2(weight)` syntax, most widely adopted. Worth noting it supports both CPU (ONNX) and GPU (PyTorch) paths, so the lightweight CPU argument isn't exclusive to `kokoro-onnx`.
- RealtimeTTS — formula-based blended voice cache using `KPipeline`
- kokoro-tts CLI — `voice1:60,voice2:40` syntax
- Community experimentation — voice extrapolation and interpolation via linear models
The interesting design question: these implementations all use the official `kokoro` PyTorch package (`KPipeline` from hexgrad) rather than `kokoro-onnx`. Blending is trivial with PyTorch tensors — it's just weighted averaging before synthesis.
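For illustration, the core operation is only a few lines (a minimal sketch, not code from any of the projects above; the dict-of-tensors input and function name are my assumptions):

```python
import torch

def blend_voices(voices: dict[str, torch.Tensor], weights: dict[str, float]) -> torch.Tensor:
    # Normalize weights to sum to 1, then take the weighted average of the
    # style-vector packs (assumes all packs share the same shape).
    total = sum(weights.values())
    stacked = torch.stack([voices[name] * (w / total) for name, w in weights.items()])
    return stacked.sum(dim=0)

# blended = blend_voices(loaded_packs, {"af_sarah": 1.0, "am_adam": 1.0, "am_onyx": 0.5})
```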
I understand Speaches uses `kokoro-onnx` for good reason (lighter footprint, ARM compatibility). A few possible paths forward:
- Add blending to the existing ONNX path — load voice arrays from the npz file, weighted-average them with numpy, and pass the blended array to `kokoro-onnx` (may need `kokoro-onnx` to accept raw arrays); a rough numpy sketch follows this list
- Add a `KPipeline` executor as an optional backend — for GPU users who want blending + native PyTorch performance, alongside the existing ONNX executor for CPU/lightweight deployments. Kokoro-FastAPI already proves this dual CPU/GPU approach works well in production.
- Something else entirely — you know the codebase better than anyone
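For the first option, the blending itself is straightforward (a hedged sketch; the npz key layout is assumed, and whether `kokoro-onnx` can consume the result directly is the open question noted above):

```python
import numpy as np

def blend_from_npz(npz_path: str, weights: dict[str, float]) -> np.ndarray:
    # Assumes the voices file maps voice names to style-vector arrays.
    total = sum(weights.values())
    with np.load(npz_path) as packs:
        stacked = np.stack([packs[name] * (w / total) for name, w in weights.items()])
    return stacked.sum(axis=0)

# blended = blend_from_npz("voices.npz", {"af_sarah": 1.0, "am_adam": 1.0, "am_onyx": 0.5})
# Passing `blended` onward depends on kokoro-onnx accepting raw arrays.
```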
Would love to hear your thoughts on the right approach. Happy to contribute a PR if there's a direction you'd prefer.