feat: add Apple Silicon (MPS) support for macOS ARM64 by jasagiri · Pull Request #1869 · FunAudioLLM/CosyVoice

jasagiri · 2026-04-05T13:23:00Z

Summary

Introduce a device abstraction layer (cosyvoice/utils/device.py) that unifies CUDA, MPS (Apple Silicon), and CPU device management
Replace all hardcoded CUDA-specific code paths in the inference pipeline with device-agnostic alternatives
Enable CosyVoice to run natively on Apple Silicon Macs (M1/M2/M3/M4) via PyTorch MPS backend

Changes

New files

cosyvoice/utils/device.py — Unified device detection (get_device()), stream context, autocast, cache management, and random seed utilities
requirements-cuda.txt — Separated CUDA-specific PyPI index URLs for Linux GPU environments
setup_macos.sh — One-command setup script for Apple Silicon
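
As a rough illustration of what such a device layer can look like, here is a minimal sketch. The function names get_device, get_autocast_context, and empty_cache come from the commit message in this PR; the signatures and bodies below are assumptions and not the actual contents of cosyvoice/utils/device.py. It also assumes a PyTorch build whose torch.autocast accepts device_type="mps".

```python
import contextlib

import torch


def get_device() -> torch.device:
    """Pick a device with the priority cuda > mps > cpu."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # absent on very old PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def get_autocast_context(device: torch.device, fp16: bool):
    """Return a float16 autocast context on GPU backends, otherwise a no-op."""
    if fp16 and device.type in ("cuda", "mps"):
        return torch.autocast(device_type=device.type, dtype=torch.float16)
    return contextlib.nullcontext()


def empty_cache(device: torch.device) -> None:
    """Release cached allocator memory on the backend that owns `device`."""
    if device.type == "cuda":
        torch.cuda.empty_cache()
    elif device.type == "mps":
        torch.mps.empty_cache()
    # CPU has no caching allocator to clear, so nothing to do.
```

A caller would typically write `with get_autocast_context(device, fp16): ...` around the forward pass, so the same code runs unchanged on all three backends.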

Modified files

cosyvoice/cli/model.py — Replace CUDA device init, streams (torch.cuda.stream), AMP (torch.cuda.amp.autocast), and cache clearing across CosyVoiceModel, CosyVoice2Model, CosyVoice3Model
cosyvoice/cli/cosyvoice.py — MPS-aware feature gates: TRT/vLLM require CUDA, JIT/fp16 work on any GPU including MPS
cosyvoice/cli/frontend.py — Add CoreMLExecutionProvider fallback for ONNX Runtime on Apple Silicon
cosyvoice/utils/common.py — Guard torch.cuda.manual_seed_all for non-CUDA environments
requirements.txt — Remove CUDA-only index URLs, loosen PyTorch version pin (>=2.3.1)
README.md — Add macOS Apple Silicon setup instructions
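
The ONNX Runtime provider fallback in frontend.py can be pictured roughly as follows. This is a hedged sketch: choose_onnx_providers is a hypothetical helper, and the CUDA > CoreML > CPU ordering is an assumption, not something the PR description states. The provider names themselves are the standard ONNX Runtime identifiers.

```python
from typing import List, Sequence


def choose_onnx_providers(available: Sequence[str]) -> List[str]:
    """Order execution providers by preference, always ending with CPU.

    `available` is expected to be the list returned by
    onnxruntime.get_available_providers().
    """
    preferred = ["CUDAExecutionProvider", "CoreMLExecutionProvider"]
    chosen = [name for name in preferred if name in available]
    chosen.append("CPUExecutionProvider")  # CPU is the universal fallback
    return chosen


# Typical use (not executed here, requires onnxruntime and a model file):
#   import onnxruntime
#   providers = choose_onnx_providers(onnxruntime.get_available_providers())
#   session = onnxruntime.InferenceSession("model.onnx", providers=providers)
```

Keeping the selection as a pure function of the available provider list makes it easy to check the fallback order without Apple hardware.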

Design decisions

Device priority: cuda > mps > cpu — CUDA environments are unaffected
TensorRT/vLLM: Remain CUDA-only (no ARM64 builds exist) — gracefully disabled with warning on MPS
JIT/fp16: Enabled on MPS since PyTorch MPS supports both
Training: Out of scope — DeepSpeed/DDP do not support MPS. This PR focuses on inference only
Zero behavioral change on CUDA: All abstractions are transparent passthrough when CUDA is available
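
The feature gates described above can be sketched as a small pure function. FeatureFlags and resolve_features are hypothetical names used only for illustration; the real gating lives in cosyvoice/cli/cosyvoice.py and may be structured differently.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureFlags:
    load_jit: bool
    load_trt: bool
    load_vllm: bool
    fp16: bool


def resolve_features(device_type: str, load_jit: bool = False, load_trt: bool = False,
                     load_vllm: bool = False, fp16: bool = False) -> FeatureFlags:
    """Disable requested features that the selected device cannot support."""
    is_cuda = device_type == "cuda"
    has_gpu = device_type in ("cuda", "mps")
    if (load_trt or load_vllm) and not is_cuda:
        warnings.warn(f"TensorRT/vLLM require CUDA; disabled on {device_type}")
    if (load_jit or fp16) and not has_gpu:
        warnings.warn(f"JIT/fp16 require a GPU backend; disabled on {device_type}")
    return FeatureFlags(
        load_jit=load_jit and has_gpu,
        load_trt=load_trt and is_cuda,
        load_vllm=load_vllm and is_cuda,
        fp16=fp16 and has_gpu,
    )
```

On CUDA every requested flag passes through unchanged, which is the "zero behavioral change" property; on MPS only the TensorRT and vLLM requests are dropped.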

Platform support matrix

Feature	CUDA	MPS (Apple Silicon)	CPU
Inference	✅	✅	✅
fp16	✅	✅	❌
JIT	✅	✅	❌
TensorRT	✅	❌	❌
vLLM	✅	❌	❌
Training	✅	❌	❌

Test plan

device.py: All functions tested on MPS (device detection, stream context, autocast with float16, cache clear, seed)
common.py: set_all_random_seed() does not crash without CUDA; fade_in_out() works on MPS tensors
model.py: All 3 model classes import correctly; no hardcoded CUDA references remain (except intentional load_trt assert)
cosyvoice.py: Feature gates correctly disable TRT/vLLM on MPS while keeping JIT/fp16
frontend.py: Device abstraction and CoreML provider fallback verified
Clean clone test: All checks pass from fresh git clone
End-to-end inference with model weights (requires pretrained model download)

🤖 Generated with Claude Code

Introduce a device abstraction layer (cosyvoice/utils/device.py) that unifies CUDA, MPS, and CPU device management. Replace all hardcoded CUDA-specific code paths in the inference pipeline with device-agnostic alternatives, enabling CosyVoice to run natively on Apple Silicon Macs.

Key changes:
- Device abstraction: get_device(), get_stream_context(), get_autocast_context(), empty_cache()
- model.py: Replace CUDA device init, streams, AMP, and cache clearing across CosyVoiceModel, CosyVoice2Model, CosyVoice3Model
- cosyvoice.py: MPS-aware feature gates (TRT/vLLM require CUDA, JIT/fp16 require any GPU)
- frontend.py: CoreMLExecutionProvider support for ONNX Runtime
- common.py: Guard torch.cuda.manual_seed_all for non-CUDA environments
- requirements.txt: Remove CUDA-only index URLs, loosen PyTorch version
- setup_macos.sh: One-command setup script for Apple Silicon

Co-Authored-By: Claude Opus 4.6 <[email protected]>

tedbeer · 2026-04-27T16:01:05Z

I tried your branch on my M1 Pro, but I got an error:

File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/hifigan/generator.py", line 716, in inference
self.f0_predictor.to(torch.float64)
...
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

I've fixed it by patching:

diff --git a/cosyvoice/hifigan/generator.py b/cosyvoice/hifigan/generator.py
index bbc2a21..ec075c4 100644
--- a/cosyvoice/hifigan/generator.py
+++ b/cosyvoice/hifigan/generator.py
@@ -713,8 +713,8 @@ class CausalHiFTGenerator(HiFTGenerator):
     @torch.inference_mode()
     def inference(self, speech_feat: torch.Tensor, finalize: bool = True) -> torch.Tensor:
         # mel->f0 NOTE f0_predictor precision is crucial for causal inference, move self.f0_predictor to cpu if necessary
-        self.f0_predictor.to(torch.float64)
-        f0 = self.f0_predictor(speech_feat.to(torch.float64), finalize=finalize).to(speech_feat)
+        self.f0_predictor.to(torch.float32)
+        f0 = self.f0_predictor(speech_feat.to(torch.float32), finalize=finalize).to(speech_feat)
         # f0->source
         s = self.f0_upsamp(f0[:, None]).transpose(1, 2)  # bs,n,t
         s, _, _ = self.m_source(s)

It no longer fails at this point, but no voice is generated. Does it work for you?

jasagiri · 2026-05-15T20:00:11Z

@tedbeer Apologies for the long silence — I should have replied much sooner. Thanks for trying the branch on your M1 Pro and reporting back with a patch.

You're hitting CausalHiFTGenerator.inference (the CosyVoice 2 streaming path), which I hadn't exercised on MPS — my testing focused on the non-causal path, so the float64 cast slipped through. Good catch.

On why your patch silences the output: the inline comment on that line ("f0_predictor precision is crucial for causal inference, move self.f0_predictor to cpu if necessary") flags this code as precision-sensitive. Dropping it to float32 lets execution proceed, but the predicted f0 is degraded enough that the downstream m_source / decoder produces near-silence. MPS itself cannot do float64 (hardware limitation, not a driver issue), so float32-on-MPS isn't a safe fix here.

Plan: follow the intent of that existing comment and add an MPS-aware CPU fallback — move f0_predictor and its input to CPU for this call, then move the result back to MPS. That keeps precision intact and matches how the CUDA path already runs (float64). The f0 sequence is short, so the CPU round-trip shouldn't be audible.

I'll push the fix to this branch and ping you here once it's verified on my M2.

@tedbeer

…licon MPS does not support float64 (Apple Silicon hardware limitation), causing CausalHiFTGenerator.inference to fail on M-series Macs.

Following the intent of the existing inline comment, move f0_predictor and its input to CPU for this precision-sensitive step, then bring the result back to the original device.

Device move and dtype cast are done as two separate .to() calls: a combined .to(device, dtype) attempts the float64 cast while the tensor is still on MPS, which raises TypeError.

This preserves the float64 precision the causal inference path requires, matching the CUDA behavior.

Verified on Apple Silicon (MPS) with Fun-CosyVoice3-0.5B streaming zero-shot inference.

Reported by @tedbeer in FunAudioLLM#1869.

Co-Authored-By: Claude Opus 4.7 <[email protected]>

jasagiri · 2026-05-16T08:39:48Z

Pushed a fix in 0cb2c1e.

Root cause: on MPS, CausalHiFTGenerator.inference casts f0_predictor to float64, which the MPS backend cannot represent at all — hence the TypeError.

The fix runs f0_predictor on CPU in float64 for this precision-sensitive step, then moves the result back to the original device, following the intent of the existing inline comment. One subtlety: the device move and the dtype cast must be two separate .to() calls — a combined .to(device="cpu", dtype=torch.float64) still attempts the float64 cast on the still-MPS tensor and raises the same TypeError.
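
The pattern described above can be sketched as a small helper. run_in_float64_on_cpu is a hypothetical name for illustration, not a function in the repository, and the committed change in generator.py may differ in detail. The sketch assumes the module returns a single tensor.

```python
import torch
from torch import nn


def run_in_float64_on_cpu(module: nn.Module, x: torch.Tensor, **kwargs) -> torch.Tensor:
    """Run `module` on CPU in float64 and return the result on x's device and dtype.

    Note: this moves `module` itself to CPU/float64 in place, mirroring the
    original code, which also casts f0_predictor in place.
    """
    # Two separate .to() calls: move to CPU first, then cast to float64,
    # so the float64 cast never happens while the data is still on MPS.
    module.to("cpu").to(torch.float64)
    x64 = x.to("cpu").to(torch.float64)
    out = module(x64, **kwargs)
    # Tensor.to(other) matches both the device and the dtype of `other`.
    return out.to(x)
```

On a CUDA or CPU machine the helper is still correct, just slower than necessary on CUDA, which is why a real implementation would likely take this path only when the input lives on MPS.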

I verified this on Apple Silicon (MPS) with Fun-CosyVoice3-0.5B streaming zero-shot inference: no TypeError, and the output is non-silent (~7.8s of speech, RMS ~0.12).

One thing I could not reproduce: with your float32 patch, on my machine the output was not silent — it produced audio comparable to the float64-on-CPU path. So the "voice still is not generated" symptom may be environment-specific. Could you pull the latest branch and retry? If it's still silent, sharing your torch / torchaudio and macOS versions would help track down the difference.

tedbeer · 2026-05-16T21:23:41Z

I did not say that silence is generated. There is no output at all because of other errors. I pulled the latest version with your changes, and it also fails. I'm trying 3s Rapid Clone: I provided the text to generate, uploaded and truncated a short voice sample (prompt audio 3-4 seconds long), and provided the required Prompt Text and Instruct Text. After I click Generate Audio, it fails within several seconds. The stack trace:

Exception in thread Thread-10 (llm_job):
Traceback (most recent call last):
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 122, in llm_job
    for i in token_generator:
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
    response = gen.send(None)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/llm/llm.py", line 479, in inference
    assert 151646 in text, '<|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!'
AssertionError: <|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!
  0%|                                                                                                                                | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/queueing.py", line 624, in process_events
    response = await route_utils.call_process_api(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/route_utils.py", line 323, in call_process_api
    output = await app.get_blocks().process_api(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/blocks.py", line 2018, in process_api
    result = await self.call_function(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/blocks.py", line 1579, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/utils.py", line 691, in async_iteration
    return await anext(iterator)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/utils.py", line 685, in __anext__
    return await anyio.to_thread.run_sync(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/anyio/to_thread.py", line 63, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2518, in run_sync_in_worker_thread
    return await future
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1002, in run
    result = context.run(func, *args)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/utils.py", line 668, in run_sync_iterator_async
    return next(iterator)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/utils.py", line 829, in gen_wrapper
    response = next(iterator)
  File "/Users/tedbeer/Dev/github/CosyVoice/webui-en.py", line 104, in generate_audio
    for i in cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_wav, stream=stream, speed=speed):
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/cosyvoice.py", line 103, in inference_zero_shot
    for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 378, in tts
    this_tts_speech = self.token2wav(token=this_tts_speech_token,
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 444, in token2wav
    tts_speech, _ = self.hift.inference(speech_feat=tts_mel, finalize=finalize)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/hifigan/generator.py", line 721, in inference
    f0 = self.f0_predictor(speech_feat.to("cpu").to(torch.float64), finalize=finalize).to(speech_feat)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/hifigan/f0_predictor.py", line 97, in forward
    x = self.condnet[0](x)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/transformer/convolution.py", line 185, in forward
    x = super(CausalConv1d, self).forward(x)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 371, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 366, in _conv_forward
    return F.conv1d(
RuntimeError: Calculated padded input size per channel: (3). Kernel size: (4). Kernel size can't be greater than actual input size

tedbeer · 2026-05-16T22:32:17Z

When I switch "Streaming Inference" to "Yes", it shows only the first error:

Traceback (most recent call last):
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 122, in llm_job
    for i in token_generator:
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
    response = gen.send(None)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/llm/llm.py", line 479, in inference
    assert 151646 in text, '<|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!'
AssertionError: <|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!

The interface shows a progress indicator, but it does not finish even after 50 minutes.

tedbeer · 2026-05-17T12:43:35Z

OK, I found out how to make it work from a random page on the internet: add <|endofprompt|> at the end of the prompt text. Now "3s Rapid Clone" works and successfully generates audio.
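
For anyone scripting this workaround rather than typing the token by hand, a minimal sketch follows. ensure_end_of_prompt is a hypothetical helper, not part of CosyVoice; it only automates the manual step described above.

```python
END_OF_PROMPT = "<|endofprompt|>"


def ensure_end_of_prompt(prompt_text: str) -> str:
    """Append the <|endofprompt|> marker unless the text already contains it."""
    if END_OF_PROMPT in prompt_text:
        return prompt_text
    return prompt_text + END_OF_PROMPT
```

The result would then be passed as the prompt text to zero-shot inference in place of the raw string.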

tedbeer · 2026-05-17T19:15:36Z

However, "Cross-Lingual Clone" still does not work with the same prompt containing <|endofprompt|>. I don't think that is a problem with this PR, though.

Traceback (most recent call last):
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 122, in llm_job
    for i in token_generator:
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
    response = gen.send(None)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/llm/llm.py", line 479, in inference
    assert 151646 in text, '<|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!'
AssertionError: <|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!

jasagiri · 2026-05-18T05:50:57Z

Thanks for the detailed traces — they isolate the issue cleanly.

The MPS fix is working: your trace reaches generator.py:721 and the original TypeError is gone. The two remaining errors are unrelated to MPS:

"AssertionError: <|endofprompt|> not detected" — webui.py predates CosyVoice2/3 and never inserts the <|endofprompt|> token CosyVoice3's LLM requires. The conv1d error is a knock-on effect: the LLM thread dies on that assertion and an empty token sequence reaches the vocoder.
The 50-minute hang is the same root cause — the LLM thread is dead, the main thread waits forever for tokens.
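
The failure mode in both items (a worker thread dies, and the consumer either gets nothing or waits indefinitely) is a general producer/consumer hazard. The sketch below is not CosyVoice's code and makes no claim about how model.py hands tokens between threads; collect_tokens and its queue-based design are assumptions chosen to show one common way to surface a worker exception in the waiting thread instead of hanging.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the token stream


def _produce(q: "queue.Queue", make_tokens: Callable[[], Iterable[int]]) -> None:
    try:
        for token in make_tokens():
            q.put(token)
    except Exception as exc:
        q.put(exc)  # hand the failure to the consumer instead of dying silently
    finally:
        q.put(_DONE)  # always signal completion, even after an error


def collect_tokens(make_tokens: Callable[[], Iterable[int]], timeout: float = 5.0) -> List[int]:
    """Collect tokens from a worker thread, re-raising any worker exception."""
    q: "queue.Queue" = queue.Queue()
    worker = threading.Thread(target=_produce, args=(q, make_tokens), daemon=True)
    worker.start()
    tokens: List[int] = []
    while True:
        item = q.get(timeout=timeout)  # raises queue.Empty rather than waiting forever
        if item is _DONE:
            break
        if isinstance(item, Exception):
            raise item
        tokens.append(item)
    worker.join()
    return tokens
```

With this shape, an assertion failing inside the generator reaches the caller as the same AssertionError, so a UI can report it instead of showing a progress indicator that never completes.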

I've opened two separate upstream PRs for these, independent of this MPS PR:

feat: support CosyVoice3 in webui.py #1888 — webui.py CosyVoice3 support
fix: read wav via soundfile to avoid torchcodec dependency #1887 — soundfile-based wav loading (torchaudio 2.8+ changed the load/info paths webui.py relied on)

I verified the webui fix end-to-end on Apple Silicon (MPS): launched the gradio webui with Fun-CosyVoice3-0.5B and ran zero-shot over HTTP — no assertion, 7.36s of audio generated.

Immediate workaround with this branch as-is: use the CosyVoice2-0.5B model, which has no <|endofprompt|> requirement.

jasagiri mentioned this pull request Apr 5, 2026

feat: add Apple Silicon (MPS) support jasagiri/CosyVoice#2

Merged

jasagiri force-pushed the feat/apple-silicon branch from 5d2a2c9 to 029f931 Compare April 5, 2026 13:25

jasagiri force-pushed the feat/apple-silicon branch from 029f931 to fb21fd2 Compare April 5, 2026 13:44

jasagiri wants to merge 2 commits into FunAudioLLM:main from jasagiri:feat/apple-silicon

2 participants
