Fix Qwen Omni quantization model issue for long form audio generation #1698
lvliang-intel wants to merge 15 commits into main from lvl/fix_omni_long_audio
Conversation
Pull request overview
This PR fixes long-form audio quality degradation for Qwen Omni models by excluding the talker module from quantization (keeping it in float16/BF16), while ensuring the model can still be exported correctly (including save-time handling for fused MoE expert tensors).
Changes:
- Exclude `talker` blocks from default quantization block discovery for Qwen2.5-Omni and Qwen3-Omni-MoE.
- Add MoE skip-prefix support so `talker.*` MoE modules remain fused during quantization, and expand fused expert tensors at save time.
- Adjust missing-tensor copying/WOQ behavior to preserve `talker.*` tensors by exact key and prevent unintended quantization.
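The skip-prefix mechanic described above boils down to a name filter over module paths. Below is an illustrative sketch, not the exact auto_round API: `MOE_SKIP_PREFIXES` mirrors the constant this PR adds, but the `should_skip()` helper and the module names are hypothetical.

```python
# Hedged sketch: keep modules under excluded prefixes (e.g. the talker)
# fused and unquantized. should_skip() and the module names are made up
# for illustration; only MOE_SKIP_PREFIXES mirrors the PR.
MOE_SKIP_PREFIXES = ("talker.",)

def should_skip(module_name: str, skip_prefixes=MOE_SKIP_PREFIXES) -> bool:
    """Return True if the module lies under an excluded prefix."""
    return any(module_name.startswith(p) for p in skip_prefixes)

modules = [
    "thinker.model.layers.0.mlp.experts",
    "talker.model.layers.0.mlp.experts",
]
# Only modules outside the skip prefixes are unfused and quantized.
to_unfuse = [m for m in modules if not should_skip(m)]
print(to_unfuse)  # only the thinker module remains
```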
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `auto_round/special_model_handler.py` | Removes `talker` from default block lists for Qwen Omni models and documents the rationale. |
| `auto_round/modeling/fused_moe/replace_modules.py` | Introduces `MOE_SKIP_PREFIXES` and wires it into MoE preparation. |
| `auto_round/modeling/fused_moe/moe_experts_interface.py` | Adds `skip_prefixes` to skip unfusing modules under excluded prefixes. |
| `auto_round/compressors/shard_writer.py` | Expands fused 3D expert params into per-expert 2D tensors at save time for skipped prefixes. |
| `auto_round/utils/missing_tensors.py` | Ensures `talker.*` tensors are treated as truly missing when absent and scopes WOQ quantization to `block_name_to_quantize`. |
| `auto_round/utils/common.py` | Adds a `collapse_ignore_layers()` helper to reduce per-layer ignore config churn. |
| `auto_round/compressors/base.py` | Filters predefined ignore layers to quantized blocks and collapses numbered ignore layers into regex patterns. |
| `auto_round/modeling/fused_moe/qwen3_omni.py` | Drops the talker replacement path; keeps only the thinker replacement and documents the save-time conversion. |
| `test/test_cpu/models/test_omni_model.py` | Updates assertions to reflect thinker-only default quantization and no talker replacement. |
| `test/test_cpu/utils/test_shard_writer.py` | Adds tests for save-time expansion of fused experts under skipped prefixes. |
| `test/test_cpu/utils/test_missing_tensors.py` | Adds tests ensuring missing talker tensors are copied exactly and never WOQ-quantized. |
| `.claude/skills/readme.md` | Adds a contributor skill documentation index. |
| `.claude/skills/review-pr/SKILL.md` | Adds a structured, project-specific PR review checklist. |
| `.claude/skills/add-vlm-model/SKILL.md` | Adds workflow documentation for integrating new VLMs. |
| `.claude/skills/add-quantization-datatype/SKILL.md` | Adds workflow documentation for adding quantization datatypes. |
| `.claude/skills/add-inference-backend/SKILL.md` | Adds workflow documentation for adding inference backends. |
| `.claude/skills/add-export-format/SKILL.md` | Adds workflow documentation for adding export formats. |
| `.claude/skills/adapt-new-llm/SKILL.md` | Adds workflow documentation for adapting new LLM architectures. |
| `.claude/skills/adapt-new-diffusion-model/SKILL.md` | Adds workflow documentation for adapting new diffusion architectures. |
Docstring context under review:

```
- Optionally: visual encoder blocks, audio encoder layers

talker is excluded by default because quantizing it has been observed to
degrade audio quality in long-form generation .
```

The docstring has an extra space before the period in "long-form generation ." which reads as a typo. Please remove the stray space for consistent documentation formatting.

```diff
-degrade audio quality in long-form generation .
+degrade audio quality in long-form generation.
```
Code under review in `auto_round/compressors/shard_writer.py`:

```python
            result.append((f"{prefix}.{i}.{split_name}.weight", chunk[i].clone()))
    else:
        for i in range(num_experts):
            result.append((f"{prefix}.{i}.{attr_name}.weight", tensor[i].clone()))
```

`_expand_fused_experts()` clones each expert slice before passing it into `_add_tensor()`, but `_add_tensor()` immediately detaches and copies to CPU. The clone creates an extra full-size device-side copy and can significantly increase peak RAM/VRAM for large MoE weights. Prefer to avoid `clone()` here (e.g., rely on `detach().cpu()` / `contiguous()` inside `_add_tensor()`, or just ensure the slice is contiguous) so save-time expansion doesn't double memory usage.

Suggested change:

```diff
-            result.append((f"{prefix}.{i}.{split_name}.weight", chunk[i].clone()))
-    else:
-        for i in range(num_experts):
-            result.append((f"{prefix}.{i}.{attr_name}.weight", tensor[i].clone()))
+            result.append((f"{prefix}.{i}.{split_name}.weight", chunk[i]))
+    else:
+        for i in range(num_experts):
+            result.append((f"{prefix}.{i}.{attr_name}.weight", tensor[i]))
```
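The memory point above can be demonstrated directly: indexing a tensor (`tensor[i]`) returns a view that shares storage, while `.clone()` materializes a separate full-size buffer before the shard writer makes its own CPU copy. A small sketch with made-up shapes; the `_expand_fused_experts`/`_add_tensor` internals are not reproduced here.

```python
import torch

# A fused MoE weight: [num_experts, out_features, in_features] (shapes
# are illustrative).
fused = torch.randn(4, 8, 16)

view = fused[0]            # a view: shares storage with `fused`
copy = fused[0].clone()    # materializes a separate full-size buffer

# The view aliases the fused tensor's storage; the clone does not.
assert view.data_ptr() == fused.data_ptr()
assert copy.data_ptr() != fused.data_ptr()

# A later detach().cpu() (as _add_tensor() does) is what produces the
# tensor that actually gets saved, so the clone's extra device-side
# buffer is redundant.
cpu_tensor = view.detach().cpu()
assert cpu_tensor.shape == (8, 16)
```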
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Description
Problem
The talker module, which should have been kept in float16, was being quantized. This caused severe audio quality degradation in long-form audio generation.
Fix
Exclude the talker module from quantization so it retains float16 precision.
Type of Change
Related Issues
https://huggingface.co/Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound/discussions/1
Checklist Before Submitting