Hey, love the project — Speaches is hands down the cleanest self-hosted OpenAI-compatible speech server out there. Been running it and it's solid.
Feature request: Weighted voice blending for Kokoro TTS, e.g.:
{ "voice": "af_sarah(1)+am_adam(1)+am_onyx(0.5)" }
This is a well-established technique in the Kokoro ecosystem — weighted averaging of style vectors to create custom voice personas without any training. Several projects already implement it:
- Kokoro-FastAPI — `voice1(weight)+voice2(weight)` syntax, most widely adopted. Worth noting it supports both CPU (ONNX) and GPU (PyTorch) paths, so the lightweight CPU argument isn't exclusive to `kokoro-onnx`.
- RealtimeTTS — formula-based blended voice cache using `KPipeline`
- kokoro-tts CLI — `voice1:60,voice2:40` syntax
- Community experimentation — voice extrapolation and interpolation via linear models
The interesting design question: these implementations all use the official `kokoro` PyTorch package (`KPipeline` from hexgrad) rather than `kokoro-onnx`. Blending is trivial with PyTorch tensors — it's just weighted averaging before synthesis.
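For illustration, the core operation is only a few lines (a minimal sketch, not code from any of the projects above; the dict-of-tensors input and function name are my assumptions):

```python
import torch

def blend_voices(voices: dict[str, torch.Tensor], weights: dict[str, float]) -> torch.Tensor:
    # Normalize weights to sum to 1, then take the weighted average of the
    # style-vector packs (assumes all packs share the same shape).
    total = sum(weights.values())
    stacked = torch.stack([voices[name] * (w / total) for name, w in weights.items()])
    return stacked.sum(dim=0)

# blended = blend_voices(loaded_packs, {"af_sarah": 1.0, "am_adam": 1.0, "am_onyx": 0.5})
```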
I understand Speaches uses `kokoro-onnx` for good reason (lighter footprint, ARM compatibility). A few possible paths forward:
- Add blending to the existing ONNX path — load voice arrays from the npz file, weighted-average them with numpy, and pass the blended array to `kokoro-onnx` (may need `kokoro-onnx` to accept raw arrays); a rough numpy sketch follows this list
- Add a `KPipeline` executor as an optional backend — for GPU users who want blending + native PyTorch performance, alongside the existing ONNX executor for CPU/lightweight deployments. Kokoro-FastAPI already proves this dual CPU/GPU approach works well in production.
- Something else entirely — you know the codebase better than anyone
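For the first option, the blending itself is straightforward (a hedged sketch; the npz key layout is assumed, and whether `kokoro-onnx` can consume the result directly is the open question noted above):

```python
import numpy as np

def blend_from_npz(npz_path: str, weights: dict[str, float]) -> np.ndarray:
    # Assumes the voices file maps voice names to style-vector arrays.
    total = sum(weights.values())
    with np.load(npz_path) as packs:
        stacked = np.stack([packs[name] * (w / total) for name, w in weights.items()])
    return stacked.sum(axis=0)

# blended = blend_from_npz("voices.npz", {"af_sarah": 1.0, "am_adam": 1.0, "am_onyx": 0.5})
# Passing `blended` onward depends on kokoro-onnx accepting raw arrays.
```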
Would love to hear your thoughts on the right approach. Happy to contribute a PR if there's a direction you'd prefer.