feat: add Apple Silicon (MPS) support for macOS ARM64 #1869

Open
jasagiri wants to merge 2 commits into FunAudioLLM:main from jasagiri:feat/apple-silicon

jasagiri commented Apr 5, 2026

Summary

  • Introduce a device abstraction layer (cosyvoice/utils/device.py) that unifies CUDA, MPS (Apple Silicon), and CPU device management
  • Replace all hardcoded CUDA-specific code paths in the inference pipeline with device-agnostic alternatives
  • Enable CosyVoice to run natively on Apple Silicon Macs (M1/M2/M3/M4) via PyTorch MPS backend

Changes

New files

  • cosyvoice/utils/device.py — Unified device detection (get_device()), stream context, autocast, cache management, and random seed utilities
  • requirements-cuda.txt — Separated CUDA-specific PyPI index URLs for Linux GPU environments
  • setup_macos.sh — One-command setup script for Apple Silicon
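
For readers who have not opened cosyvoice/utils/device.py, the sketch below shows roughly what helpers of this kind could look like. It is not the code from this PR: only the names get_device, get_stream_context and empty_cache come from the commit message, while the signatures (including the `stream` parameter) and the function bodies are assumptions made for illustration.

```python
import contextlib

import torch


def get_device() -> torch.device:
    """Pick a backend in the order cuda > mps > cpu (assumed implementation)."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def get_stream_context(device: torch.device, stream=None):
    """Return a CUDA stream context on CUDA and a no-op context elsewhere."""
    if device.type == "cuda" and stream is not None:
        return torch.cuda.stream(stream)
    # MPS and CPU have no user-managed streams here, so do nothing.
    return contextlib.nullcontext()


def empty_cache(device: torch.device) -> None:
    """Release cached allocator memory on backends that keep such a cache."""
    if device.type == "cuda":
        torch.cuda.empty_cache()
    elif device.type == "mps" and hasattr(torch, "mps"):
        torch.mps.empty_cache()
    # On CPU there is nothing to clear.
```

With helpers shaped like this, calling code can write `with get_stream_context(device, stream):` once and behave the same on all three backends.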

Modified files

  • cosyvoice/cli/model.py — Replace CUDA device init, streams (torch.cuda.stream), AMP (torch.cuda.amp.autocast), and cache clearing across CosyVoiceModel, CosyVoice2Model, CosyVoice3Model
  • cosyvoice/cli/cosyvoice.py — MPS-aware feature gates: TRT/vLLM require CUDA, JIT/fp16 work on any GPU including MPS
  • cosyvoice/cli/frontend.py — Add CoreMLExecutionProvider fallback for ONNX Runtime on Apple Silicon
  • cosyvoice/utils/common.py — Guard torch.cuda.manual_seed_all for non-CUDA environments
  • requirements.txt — Remove CUDA-only index URLs, loosen PyTorch version pin (>=2.3.1)
  • README.md — Add macOS Apple Silicon setup instructions
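
The ONNX Runtime provider fallback mentioned for frontend.py can be pictured with the following sketch. The preference order and the helper name `choose_providers` are assumptions made for illustration, not the code in this PR; the provider names themselves are the standard ONNX Runtime identifiers.

```python
from typing import List, Sequence

# Assumed preference order: CUDA first, then CoreML on Apple Silicon, then CPU.
PREFERRED_PROVIDERS = (
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
)


def choose_providers(available: Sequence[str]) -> List[str]:
    """Keep only the preferred providers that this ONNX Runtime build offers."""
    chosen = [name for name in PREFERRED_PROVIDERS if name in available]
    # Fall back to CPU if none of the preferred providers is reported.
    return chosen or ["CPUExecutionProvider"]
```

In practice the `available` argument would come from `onnxruntime.get_available_providers()`, and the returned list would be passed as `providers=` when constructing an `onnxruntime.InferenceSession`.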

Design decisions

  • Device priority: cuda > mps > cpu — CUDA environments are unaffected
  • TensorRT/vLLM: Remain CUDA-only (no ARM64 builds exist) — gracefully disabled with warning on MPS
  • JIT/fp16: Enabled on MPS since PyTorch MPS supports both
  • Training: Out of scope — DeepSpeed/DDP do not support MPS. This PR focuses on inference only
  • Zero behavioral change on CUDA: All abstractions are transparent passthrough when CUDA is available
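
The feature-gate rules above (TRT/vLLM need CUDA, JIT/fp16 need any GPU) can be sketched as a small pure function. The `FeatureFlags` dataclass and `apply_feature_gates` are hypothetical names used only for this illustration; the actual gating in cosyvoice/cli/cosyvoice.py may be structured differently.

```python
import warnings
from dataclasses import dataclass


@dataclass
class FeatureFlags:
    load_jit: bool
    load_trt: bool
    load_vllm: bool
    fp16: bool


def apply_feature_gates(device_type: str, requested: FeatureFlags) -> FeatureFlags:
    """Disable features the current backend cannot support."""
    is_cuda = device_type == "cuda"
    is_gpu = device_type in ("cuda", "mps")
    if not is_cuda and (requested.load_trt or requested.load_vllm):
        # Graceful degradation: warn instead of raising.
        warnings.warn(f"TensorRT/vLLM require CUDA; disabling them on {device_type}")
    return FeatureFlags(
        load_jit=requested.load_jit and is_gpu,
        load_trt=requested.load_trt and is_cuda,
        load_vllm=requested.load_vllm and is_cuda,
        fp16=requested.fp16 and is_gpu,
    )
```

On CUDA the function returns the requested flags unchanged, which matches the "transparent passthrough" goal stated above.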

Platform support matrix

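Feature   | CUDA | MPS (Apple Silicon) | CPU
Inference | yes  | yes                 | yes
fp16      | yes  | yes                 | no
JIT       | yes  | yes                 | no
TensorRT  | yes  | no                  | no
vLLM      | yes  | no                  | no
Training  | yes  | no                  |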

Test plan

  • device.py: All functions tested on MPS (device detection, stream context, autocast with float16, cache clear, seed)
  • common.py: set_all_random_seed() does not crash without CUDA; fade_in_out() works on MPS tensors
  • model.py: All 3 model classes import correctly; no hardcoded CUDA references remain (except intentional load_trt assert)
  • cosyvoice.py: Feature gates correctly disable TRT/vLLM on MPS while keeping JIT/fp16
  • frontend.py: Device abstraction and CoreML provider fallback verified
  • Clean clone test: All checks pass from fresh git clone
  • End-to-end inference with model weights (requires pretrained model download)

🤖 Generated with Claude Code

Introduce a device abstraction layer (cosyvoice/utils/device.py) that
unifies CUDA, MPS, and CPU device management. Replace all hardcoded
CUDA-specific code paths in the inference pipeline with device-agnostic
alternatives, enabling CosyVoice to run natively on Apple Silicon Macs.

Key changes:
- Device abstraction: get_device(), get_stream_context(),
  get_autocast_context(), empty_cache()
- model.py: Replace CUDA device init, streams, AMP, and cache clearing
  across CosyVoiceModel, CosyVoice2Model, CosyVoice3Model
- cosyvoice.py: MPS-aware feature gates (TRT/vLLM require CUDA,
  JIT/fp16 require any GPU)
- frontend.py: CoreMLExecutionProvider support for ONNX Runtime
- common.py: Guard torch.cuda.manual_seed_all for non-CUDA environments
- requirements.txt: Remove CUDA-only index URLs, loosen PyTorch version
- setup_macos.sh: One-command setup script for Apple Silicon

Co-Authored-By: Claude Opus 4.6 <[email protected]>

jasagiri force-pushed the feat/apple-silicon branch from 029f931 to fb21fd2 on April 5, 2026 13:44

tedbeer commented Apr 27, 2026

I've tried your branch on my M1 Pro, but I got an error.

File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/hifigan/generator.py", line 716, in inference
self.f0_predictor.to(torch.float64)
...
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

I've fixed it by patching:

diff --git a/cosyvoice/hifigan/generator.py b/cosyvoice/hifigan/generator.py
index bbc2a21..ec075c4 100644
--- a/cosyvoice/hifigan/generator.py
+++ b/cosyvoice/hifigan/generator.py
@@ -713,8 +713,8 @@ class CausalHiFTGenerator(HiFTGenerator):
     @torch.inference_mode()
     def inference(self, speech_feat: torch.Tensor, finalize: bool = True) -> torch.Tensor:
         # mel->f0 NOTE f0_predictor precision is crucial for causal inference, move self.f0_predictor to cpu if necessary
-        self.f0_predictor.to(torch.float64)
-        f0 = self.f0_predictor(speech_feat.to(torch.float64), finalize=finalize).to(speech_feat)
+        self.f0_predictor.to(torch.float32)
+        f0 = self.f0_predictor(speech_feat.to(torch.float32), finalize=finalize).to(speech_feat)
         # f0->source
         s = self.f0_upsamp(f0[:, None]).transpose(1, 2)  # bs,n,t
         s, _, _ = self.m_source(s)

It no longer fails here, but voice is still not generated. Does it work for you?

jasagiri (Author) commented

@tedbeer Apologies for the long silence — I should have replied much sooner. Thanks for trying the branch on your M1 Pro and reporting back with a patch.

You're hitting CausalHiFTGenerator.inference (the CosyVoice 2 streaming path), which I hadn't exercised on MPS — my testing focused on the non-causal path, so the float64 cast slipped through. Good catch.

On why your patch silences the output: the inline comment on that line ("f0_predictor precision is crucial for causal inference, move self.f0_predictor to cpu if necessary") flags this code as precision-sensitive. Dropping it to float32 lets execution proceed, but the predicted f0 is degraded enough that the downstream m_source / decoder produces near-silence. MPS itself cannot do float64 (hardware limitation, not a driver issue), so float32-on-MPS isn't a safe fix here.

Plan: follow the intent of that existing comment and add an MPS-aware CPU fallback — move f0_predictor and its input to CPU for this call, then move the result back to MPS. That keeps precision intact and matches how the CUDA path already runs (float64). The f0 sequence is short, so the CPU round-trip shouldn't be audible.

I'll push the fix to this branch and ping you here once it's verified on my M2.

…licon

MPS does not support float64 (Apple Silicon hardware limitation), causing
CausalHiFTGenerator.inference to fail on M-series Macs. Following the
intent of the existing inline comment, move f0_predictor and its input
to CPU for this precision-sensitive step, then bring the result back to
the original device.

Device move and dtype cast are done as two separate .to() calls: a
combined .to(device, dtype) attempts the float64 cast while the tensor
is still on MPS, which raises TypeError.

This preserves the float64 precision the causal inference path requires,
matching the CUDA behavior. Verified on Apple Silicon (MPS) with
Fun-CosyVoice3-0.5B streaming zero-shot inference.

Reported by @tedbeer in FunAudioLLM#1869.

Co-Authored-By: Claude Opus 4.7 <[email protected]>

jasagiri (Author) commented

Pushed a fix in 0cb2c1e.

Root cause: on MPS, CausalHiFTGenerator.inference casts f0_predictor to float64, which the MPS backend cannot represent at all — hence the TypeError.

The fix runs f0_predictor on CPU in float64 for this precision-sensitive step, then moves the result back to the original device, following the intent of the existing inline comment. One subtlety: the device move and the dtype cast must be two separate .to() calls — a combined .to(device="cpu", dtype=torch.float64) still attempts the float64 cast on the still-MPS tensor and raises the same TypeError.

I verified this on Apple Silicon (MPS) with Fun-CosyVoice3-0.5B streaming zero-shot inference: no TypeError, and the output is non-silent (~7.8s of speech, RMS ~0.12).

One thing I could not reproduce: with your float32 patch, on my machine the output was not silent — it produced audio comparable to the float64-on-CPU path. So the "voice still is not generated" symptom may be environment-specific. Could you pull the latest branch and retry? If it's still silent, sharing your torch / torchaudio and macOS versions would help track down the difference.
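
The two-step move described in this comment can be sketched as a standalone helper. `predict_f0_on_cpu` is a hypothetical name, and the body is a minimal reconstruction from the description and the traceback, not a copy of commit 0cb2c1e.

```python
import torch


def predict_f0_on_cpu(f0_predictor: torch.nn.Module,
                      speech_feat: torch.Tensor,
                      **kwargs) -> torch.Tensor:
    """Run a precision-sensitive module on CPU in float64, then return the
    result on the input's original device and dtype."""
    # Move first, cast second: per the comment above, a combined
    # .to(device="cpu", dtype=torch.float64) on an MPS tensor raised the
    # same TypeError, so the two steps are kept separate.
    f0_predictor.to("cpu")
    f0_predictor.to(torch.float64)
    x = speech_feat.to("cpu").to(torch.float64)
    f0 = f0_predictor(x, **kwargs)
    # Tensor.to(other) matches both the device and the dtype of `other`.
    return f0.to(speech_feat)
```

Inside CausalHiFTGenerator.inference this would be called with `finalize=finalize` forwarded through `**kwargs`. Note that the predictor stays on CPU afterwards, so later calls do not pay the module transfer again.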

tedbeer commented May 16, 2026

I did not say that silence was generated. There was no output at all because of other errors. So I pulled the latest version with your changes, and it also fails. I'm trying 3s Rapid Clone: I provided the text to generate, uploaded and trimmed a short voice sample (a prompt audio 3-4 seconds long), and filled in the required Prompt Text and Instruct Text. A few seconds after I click Generate Audio, it fails. The stack trace:

Exception in thread Thread-10 (llm_job):
Traceback (most recent call last):
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 122, in llm_job
    for i in token_generator:
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
    response = gen.send(None)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/llm/llm.py", line 479, in inference
    assert 151646 in text, '<|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!'
AssertionError: <|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!
  0%|                                                                                                                                | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/queueing.py", line 624, in process_events
    response = await route_utils.call_process_api(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/route_utils.py", line 323, in call_process_api
    output = await app.get_blocks().process_api(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/blocks.py", line 2018, in process_api
    result = await self.call_function(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/blocks.py", line 1579, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/utils.py", line 691, in async_iteration
    return await anext(iterator)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/utils.py", line 685, in __anext__
    return await anyio.to_thread.run_sync(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/anyio/to_thread.py", line 63, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2518, in run_sync_in_worker_thread
    return await future
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1002, in run
    result = context.run(func, *args)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/utils.py", line 668, in run_sync_iterator_async
    return next(iterator)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/gradio/utils.py", line 829, in gen_wrapper
    response = next(iterator)
  File "/Users/tedbeer/Dev/github/CosyVoice/webui-en.py", line 104, in generate_audio
    for i in cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_wav, stream=stream, speed=speed):
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/cosyvoice.py", line 103, in inference_zero_shot
    for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 378, in tts
    this_tts_speech = self.token2wav(token=this_tts_speech_token,
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 444, in token2wav
    tts_speech, _ = self.hift.inference(speech_feat=tts_mel, finalize=finalize)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/hifigan/generator.py", line 721, in inference
    f0 = self.f0_predictor(speech_feat.to("cpu").to(torch.float64), finalize=finalize).to(speech_feat)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/hifigan/f0_predictor.py", line 97, in forward
    x = self.condnet[0](x)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/transformer/convolution.py", line 185, in forward
    x = super(CausalConv1d, self).forward(x)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 371, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 366, in _conv_forward
    return F.conv1d(
RuntimeError: Calculated padded input size per channel: (3). Kernel size: (4). Kernel size can't be greater than actual input size

tedbeer commented May 16, 2026

When I switched "Streaming Inference" to "Yes", it shows only the first error:

Traceback (most recent call last):
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 122, in llm_job
    for i in token_generator:
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
    response = gen.send(None)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/llm/llm.py", line 479, in inference
    assert 151646 in text, '<|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!'
AssertionError: <|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!

The interface shows a progress indicator, but it does not finish even after 50 minutes.

tedbeer commented May 17, 2026

OK, I found out how to make it work from a random page on the internet: add <|endofprompt|> at the end of the prompt text. Now "3s Rapid Clone" works and successfully generates audio.
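
The manual workaround amounts to the small helper sketched below. `ensure_end_of_prompt` is a hypothetical name, and appending the token at the end simply mirrors what was done by hand here; it is not taken from the CosyVoice code base.

```python
END_OF_PROMPT = "<|endofprompt|>"


def ensure_end_of_prompt(prompt_text: str) -> str:
    """Append the <|endofprompt|> marker unless the text already contains it."""
    if END_OF_PROMPT in prompt_text:
        return prompt_text
    return prompt_text + END_OF_PROMPT
```

Calling this on the Prompt Text before it is passed to inference_zero_shot would have the same effect as typing the marker into the web UI field.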

tedbeer commented May 17, 2026

"Cross-Lingual Clone" still does not work with the same prompt containing <|endofprompt|>, but I don't think that is a problem with this PR.

Traceback (most recent call last):
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/cli/model.py", line 122, in llm_job
    for i in token_generator:
  File "/Users/tedbeer/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
    response = gen.send(None)
  File "/Users/tedbeer/Dev/github/CosyVoice/cosyvoice/llm/llm.py", line 479, in inference
    assert 151646 in text, '<|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!'
AssertionError: <|endofprompt|> not detected in CosyVoice3 text or prompt_text, check your input!

jasagiri (Author) commented

Thanks for the detailed traces — they isolate the issue cleanly.

The MPS fix is working: your trace reaches generator.py:721 and the original TypeError is gone. The two remaining errors are unrelated to MPS:

  1. "AssertionError: <|endofprompt|> not detected" — webui.py predates CosyVoice2/3 and never inserts the <|endofprompt|> token CosyVoice3's LLM requires. The conv1d error is a knock-on effect: the LLM thread dies on that assertion and an empty token sequence reaches the vocoder.
  2. The 50-minute hang is the same root cause — the LLM thread is dead, the main thread waits forever for tokens.

I've opened two separate upstream PRs for these, independent of this MPS PR:

I verified the webui fix end-to-end on Apple Silicon (MPS): launched the gradio webui with Fun-CosyVoice3-0.5B and ran zero-shot over HTTP — no assertion, 7.36s of audio generated.

Immediate workaround with this branch as-is: use the CosyVoice2-0.5B model, which has no <|endofprompt|> requirement.
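
The hang described above is a general producer/consumer hazard: if the producer thread dies, a consumer that waits for tokens never wakes up. The sketch below shows one common way to avoid it, by forwarding the worker's exception through the queue and always sending an end marker. It is a generic illustration using only the standard library; the names `run_producer` and `consume` are hypothetical, and this is not the fix from either upstream PR.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_producer(produce, out: queue.Queue) -> None:
    """Run the generator in a worker thread and forward any failure."""
    try:
        for item in produce():
            out.put(item)
    except Exception as exc:
        # Hand the exception to the consumer instead of dying silently.
        out.put(exc)
    finally:
        out.put(_DONE)


def consume(produce, timeout: float = 5.0):
    """Yield items from the worker, re-raising its exception in the caller."""
    out: queue.Queue = queue.Queue()
    worker = threading.Thread(target=run_producer, args=(produce, out), daemon=True)
    worker.start()
    while True:
        # The timeout bounds the wait even if the worker stalls.
        item = out.get(timeout=timeout)
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

With this shape, an AssertionError raised inside the token generator surfaces in the calling thread within the timeout instead of leaving a progress indicator spinning.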
