feat: add Apple Silicon (MPS) support for macOS ARM64 #1869
Introduce a device abstraction layer (cosyvoice/utils/device.py) that unifies CUDA, MPS, and CPU device management. Replace all hardcoded CUDA-specific code paths in the inference pipeline with device-agnostic alternatives, enabling CosyVoice to run natively on Apple Silicon Macs.

Key changes:
- Device abstraction: get_device(), get_stream_context(), get_autocast_context(), empty_cache()
- model.py: Replace CUDA device init, streams, AMP, and cache clearing across CosyVoiceModel, CosyVoice2Model, CosyVoice3Model
- cosyvoice.py: MPS-aware feature gates (TRT/vLLM require CUDA, JIT/fp16 require any GPU)
- frontend.py: CoreMLExecutionProvider support for ONNX Runtime
- common.py: Guard torch.cuda.manual_seed_all for non-CUDA environments
- requirements.txt: Remove CUDA-only index URLs, loosen PyTorch version
- setup_macos.sh: One-command setup script for Apple Silicon

Co-Authored-By: Claude Opus 4.6 <[email protected]>
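As a rough illustration of the device selection this commit describes, here is a minimal sketch that assumes the cuda > mps > cpu priority stated in the PR description. The function names `select_device_type` and `detect_device_type` are hypothetical and are not taken from cosyvoice/utils/device.py; only `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are real PyTorch calls.

```python
def select_device_type(cuda_available: bool, mps_available: bool) -> str:
    """Pick a backend name using the cuda > mps > cpu priority (hypothetical helper)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device_type() -> str:
    """Probe PyTorch for available backends; report "cpu" if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Guard the attribute lookup in case the installed build lacks the MPS backend module.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return select_device_type(torch.cuda.is_available(), mps_ok)
```

Keeping the priority rule in a pure function makes it checkable on any machine, and because CUDA is tested first, a CUDA host resolves to the same device as before.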
|
I tried your branch on my M1 Pro but got an error.
I've fixed it by patching:
It no longer fails here, but the voice is still not generated. Does it work for you?
|
@tedbeer Apologies for the long silence; I should have replied much sooner. Thanks for trying the branch on your M1 Pro and reporting back with a patch.

You're hitting

On why your patch silences the output: the inline comment on that line ("f0_predictor precision is crucial for causal inference, move self.f0_predictor to cpu if necessary") flags this code as precision-sensitive. Dropping it to float32 lets execution proceed, but the predicted f0 is degraded enough that the downstream

Plan: follow the intent of that existing comment and add an MPS-aware CPU fallback; move

I'll push the fix to this branch and ping you here once it's verified on my M2.
…licon MPS does not support float64 (Apple Silicon hardware limitation), causing CausalHiFTGenerator.inference to fail on M-series Macs. Following the intent of the existing inline comment, move f0_predictor and its input to CPU for this precision-sensitive step, then bring the result back to the original device.

Device move and dtype cast are done as two separate .to() calls: a combined .to(device, dtype) attempts the float64 cast while the tensor is still on MPS, which raises TypeError.

This preserves the float64 precision the causal inference path requires, matching the CUDA behavior.

Verified on Apple Silicon (MPS) with Fun-CosyVoice3-0.5B streaming zero-shot inference.

Reported by @tedbeer in FunAudioLLM#1869.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
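The two-step move described in this commit message can be sketched as follows. This is an illustration under assumptions, not the actual patch: `run_float64_on_cpu` is a hypothetical helper, and a plain `torch.nn.Module` stands in for f0_predictor. The point it shows is the ordering: leave the accelerator first, widen to float64 only on CPU, and narrow again before moving back.

```python
import torch


def run_float64_on_cpu(module: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run `module` in float64 on CPU and return the result on x's device and dtype.

    Hypothetical helper. Note that nn.Module.to() modifies the module in place,
    so `module` stays on CPU in float64 after this call.
    """
    orig_device, orig_dtype = x.device, x.dtype

    # Step 1: move to CPU. Step 2: cast to float64. Kept as separate .to() calls
    # so the float64 cast never happens while the data is still on the accelerator.
    module.to("cpu")
    module.to(torch.float64)
    x_cpu = x.to("cpu")
    x_f64 = x_cpu.to(torch.float64)

    with torch.no_grad():
        out = module(x_f64)

    # Narrow to the original dtype on CPU first, then move back to the original device.
    return out.to(orig_dtype).to(orig_device)
```

The cost is one host round trip per call, which the commit accepts in exchange for keeping float64 precision on this step.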
|
Pushed a fix in 0cb2c1e.

Root cause: on MPS,

The fix runs

I verified this on Apple Silicon (MPS) with Fun-CosyVoice3-0.5B streaming zero-shot inference: no TypeError, and the output is non-silent (~7.8s of speech, RMS ~0.12).

One thing I could not reproduce: with your float32 patch, on my machine the output was not silent; it produced audio comparable to the float64-on-CPU path. So the "voice still is not generated" symptom may be environment-specific. Could you pull the latest branch and retry? If it's still silent, sharing your torch / torchaudio and macOS versions would help track down the difference.
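For anyone wanting to repeat the non-silence check, the duration and RMS figures quoted above can be computed along these lines. This is a sketch under assumptions: the helper names, the `rms_floor` threshold, and the expectation of mono samples scaled to [-1, 1] are all hypothetical, and the sample rate must come from the model's actual output.

```python
import math
from typing import Sequence, Tuple


def audio_stats(samples: Sequence[float], sample_rate: int) -> Tuple[float, float]:
    """Return (duration_seconds, rms) for mono samples scaled to [-1, 1]."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if len(samples) == 0:
        return 0.0, 0.0
    duration = len(samples) / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return duration, rms


def is_non_silent(samples: Sequence[float], sample_rate: int, rms_floor: float = 1e-3) -> bool:
    """Treat audio as non-silent when its RMS exceeds rms_floor (an assumed threshold)."""
    _, rms = audio_stats(samples, sample_rate)
    return rms > rms_floor
```

A generated waveform can be flattened to a Python list (for example with `tensor.flatten().tolist()`) before being passed in.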
|
I did not say that silence is generated. There is no output at all because of other errors. I pulled the latest version with your changes, and it also fails. I'm trying
|
When I switched "Streaming Inference" to "Yes", it shows only the first error:
The interface shows a progress indicator, but it does not finish even after 50 minutes.
|
OK, I found out how to make it work from a random page on the internet: add
|
However, "Cross-Lingual Clone" still does not work with the same prompt having
|
Thanks for the detailed traces — they isolate the issue cleanly. The MPS fix is working: your trace reaches generator.py:721 and the original TypeError is gone. The two remaining errors are unrelated to MPS:
I've opened two separate upstream PRs for these, independent of this MPS PR:
I verified the webui fix end-to-end on Apple Silicon (MPS): launched the gradio webui with Fun-CosyVoice3-0.5B and ran zero-shot over HTTP, with no assertion and 7.36s of audio generated.

Immediate workaround with this branch as-is: use the CosyVoice2-0.5B model, which has no <|endofprompt|> requirement.
Summary
- Introduce a device abstraction layer (cosyvoice/utils/device.py) that unifies CUDA, MPS (Apple Silicon), and CPU device management

Changes

New files
- cosyvoice/utils/device.py: Unified device detection (get_device()), stream context, autocast, cache management, and random seed utilities
- requirements-cuda.txt: Separated CUDA-specific PyPI index URLs for Linux GPU environments
- setup_macos.sh: One-command setup script for Apple Silicon

Modified files
- cosyvoice/cli/model.py: Replace CUDA device init, streams (torch.cuda.stream), AMP (torch.cuda.amp.autocast), and cache clearing across CosyVoiceModel, CosyVoice2Model, CosyVoice3Model
- cosyvoice/cli/cosyvoice.py: MPS-aware feature gates (TRT/vLLM require CUDA; JIT/fp16 work on any GPU including MPS)
- cosyvoice/cli/frontend.py: Add CoreMLExecutionProvider fallback for ONNX Runtime on Apple Silicon
- cosyvoice/utils/common.py: Guard torch.cuda.manual_seed_all for non-CUDA environments
- requirements.txt: Remove CUDA-only index URLs, loosen PyTorch version pin (>=2.3.1)
- README.md: Add macOS Apple Silicon setup instructions

Design decisions
- cuda > mps > cpu: CUDA environments are unaffected

Platform support matrix

Test plan
- device.py: All functions tested on MPS (device detection, stream context, autocast with float16, cache clear, seed)
- common.py: set_all_random_seed() does not crash without CUDA; fade_in_out() works on MPS tensors
- model.py: All 3 model classes import correctly; no hardcoded CUDA references remain (except intentional load_trt assert)
- cosyvoice.py: Feature gates correctly disable TRT/vLLM on MPS while keeping JIT/fp16
- frontend.py: Device abstraction and CoreML provider fallback verified
- git clone

🤖 Generated with Claude Code
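The feature gates listed for cosyvoice.py (TRT/vLLM require CUDA, JIT/fp16 work on any GPU including MPS) could look roughly like the sketch below. The `FeatureFlags` dataclass and `gate_features` function are hypothetical and only illustrate the rule; the flag names and the choice to silently disable an unsupported feature rather than raise are assumptions, not confirmed behavior of the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureFlags:
    """Optional accelerations a caller may request (names are illustrative)."""
    load_jit: bool = False
    load_trt: bool = False
    load_vllm: bool = False
    fp16: bool = False


def gate_features(requested: FeatureFlags, device_type: str) -> FeatureFlags:
    """Return the subset of requested features that the given device can honor."""
    has_cuda = device_type == "cuda"
    has_gpu = device_type in ("cuda", "mps")
    return FeatureFlags(
        load_jit=requested.load_jit and has_gpu,     # any GPU, including MPS
        load_trt=requested.load_trt and has_cuda,    # TensorRT is CUDA-only
        load_vllm=requested.load_vllm and has_cuda,  # vLLM gate is CUDA-only here
        fp16=requested.fp16 and has_gpu,             # any GPU, including MPS
    )
```

For example, requesting every feature on "mps" yields JIT and fp16 only, while the same request on "cuda" is returned unchanged.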