Qwen3TTS: cache reference audio embeddings across voice clone calls #113
Closed
Oliver2213 wants to merge 2 commits into Blaizzy:main from
When generating multiple outputs with the same cloned voice, Qwen3TTSModel recomputes identical work on every call: speaker embedding extraction, codec encoding, reference text tokenization, TTS special token embeddings, and codec embedding construction.

This adds instance-level caching for all five on Qwen3TTSModel. Results are computed on first use and reused on subsequent calls with the same reference audio. The cache is keyed on refAudio.shape and is invalidated automatically when the reference audio changes. A public clearRefCache() method is provided for explicit cleanup.

Co-written with Claude while working on something else, but this looks fine to me. Happy to do the same for other models if caching like this would benefit them.
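For readers skimming the thread, the caching pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `VoiceCloner`, `RefArtifacts`, `artifacts(forSamples:shape:)`, and the placeholder computations are hypothetical stand-ins, and plain Swift arrays are used in place of the real tensor types. Only the shape-keyed lookup and the `clearRefCache()` name mirror the description.

```swift
// Hypothetical stand-in for the values derived from the reference audio.
struct RefArtifacts {
    let speakerEmbedding: [Float]
    let codecCodes: [Int]
}

// Hypothetical model wrapper; not the real Qwen3TTSModel.
final class VoiceCloner {
    private var cachedShape: [Int]?
    private var cachedArtifacts: RefArtifacts?
    // Counts how often the expensive path runs (for demonstration only).
    private(set) var computeCount = 0

    func artifacts(forSamples samples: [Float], shape: [Int]) -> RefArtifacts {
        // Cache hit: the stored shape matches the incoming shape.
        if let s = cachedShape, s == shape, let hit = cachedArtifacts {
            return hit
        }
        // Cache miss: run the placeholder "expensive" work and store it.
        computeCount += 1
        let mean = samples.reduce(0, +) / Float(max(samples.count, 1))
        let fresh = RefArtifacts(
            speakerEmbedding: [mean],
            codecCodes: samples.map { Int(($0 * 100).rounded()) }
        )
        cachedShape = shape
        cachedArtifacts = fresh
        return fresh
    }

    // Explicit cleanup, mirroring the clearRefCache() method named above.
    func clearRefCache() {
        cachedShape = nil
        cachedArtifacts = nil
    }
}
```

In this sketch, two calls with the same shape run the expensive path once, and calling `clearRefCache()` forces the next call to recompute.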
Collaborator
@Oliver2213 Thanks! This patch is going to have issues with concurrent access to the cache, and the shape-based cache key isn't reliable -- I put up a modified version of it in #125 that should be safer and works for what you're trying to do.
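To illustrate the two concerns raised here: two different clips of equal length share the same shape, so a shape-only key can return stale results, and unsynchronized mutation of a cache from several threads is a data race. The sketch below is one possible way to address both, assuming a hypothetical `SafeRefCache` type that compares the actual samples and guards state with an `NSLock`. It is not the approach taken in #125, whose contents are not shown in this thread.

```swift
import Foundation

// Hypothetical cache; names and design are assumptions for illustration.
final class SafeRefCache {
    private let lock = NSLock()
    private var key: [Float]?
    private var value: [Float]?
    // Demo counter; read it only when no other thread is using the cache.
    private(set) var computeCount = 0

    func embedding(for samples: [Float], compute: ([Float]) -> [Float]) -> [Float] {
        // Holding the lock across compute serializes callers, which is simple
        // but blocks concurrent requests while the work runs.
        lock.lock()
        defer { lock.unlock() }
        // Compare sample content, not just shape, so equal-length clips differ.
        if let k = key, k == samples, let v = value {
            return v
        }
        let v = compute(samples)
        key = samples
        value = v
        computeCount += 1
        return v
    }
}

// Placeholder for the expensive embedding step (hypothetical).
func fakeEmbed(_ samples: [Float]) -> [Float] {
    return samples.map { $0 * 2 }
}
```

Comparing full sample arrays costs time linear in the clip length. A content hash would be a cheaper key, at the price of handling possible collisions.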
Author
@lucasnewman, thanks a bunch for fixing this up and refiling. I definitely missed those when reading the diff.