Fix Qwen Omni quantization model issue for long form audio generation #1698

Open

lvliang-intel wants to merge 15 commits into main from lvl/fix_omni_long_audio

Conversation

@lvliang-intel
Contributor

Description

Problem

The talker module was being quantized even though it should have been kept in float16. This caused
severe audio quality degradation for long-form audio generation.

Fix

Exclude the talker part from quantization to maintain float16 precision.
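A minimal sketch of the idea, using hypothetical names (`filter_quantizable_blocks` and the prefix list are illustrative; the actual AutoRound helpers differ):

```python
def filter_quantizable_blocks(block_names, excluded_prefixes=("talker.",)):
    """Drop any block whose name falls under an excluded prefix so it
    stays in float16/BF16 instead of being quantized.

    Hypothetical helper sketching the fix; not the real AutoRound API.
    """
    return [
        name for name in block_names
        if not any(name == p.rstrip(".") or name.startswith(p)
                   for p in excluded_prefixes)
    ]

blocks = ["thinker.model.layers", "talker.model.layers", "talker.codec"]
print(filter_quantizable_blocks(blocks))  # ['thinker.model.layers']
```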

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

https://huggingface.co/Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound/discussions/1

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Copilot AI review requested due to automatic review settings April 17, 2026 02:00
@lvliang-intel lvliang-intel force-pushed the lvl/fix_omni_long_audio branch from 9414d33 to d964c82 Compare April 17, 2026 02:04

Copilot AI left a comment

Pull request overview

This PR fixes long-form audio quality degradation for Qwen Omni models by excluding the talker module from quantization (keeping it in float16/BF16), while ensuring the model can still be exported correctly (including save-time handling for fused MoE expert tensors).

Changes:

  • Exclude talker blocks from default quantization block discovery for Qwen2.5-Omni and Qwen3-Omni-MoE.
  • Add MoE skip-prefix support so talker.* MoE modules remain fused during quantization, and expand fused expert tensors at save time.
  • Adjust missing-tensor copying/WOQ behavior to preserve talker.* tensors by exact key and prevent unintended quantization.
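
The skip-prefix behavior can be sketched as follows (the `MOE_SKIP_PREFIXES` value and helper name here are illustrative; the real logic lives in `replace_modules.py` and `moe_experts_interface.py`):

```python
# Hypothetical constant mirroring the MOE_SKIP_PREFIXES idea: MoE modules
# whose names fall under these prefixes keep their fused expert layout
# during quantization instead of being unfused.
MOE_SKIP_PREFIXES = ("talker.",)

def should_skip_unfuse(module_name, skip_prefixes=MOE_SKIP_PREFIXES):
    """Return True if a module should stay fused (i.e. not be unfused)."""
    return any(module_name.startswith(p) for p in skip_prefixes)

print(should_skip_unfuse("talker.model.layers.0.mlp"))   # True
print(should_skip_unfuse("thinker.model.layers.0.mlp"))  # False
```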

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Summary per file:

  • auto_round/special_model_handler.py: Removes talker from the default block lists for Qwen Omni models and documents the rationale.
  • auto_round/modeling/fused_moe/replace_modules.py: Introduces MOE_SKIP_PREFIXES and wires it into MoE preparation.
  • auto_round/modeling/fused_moe/moe_experts_interface.py: Adds skip_prefixes to skip unfusing modules under excluded prefixes.
  • auto_round/compressors/shard_writer.py: Expands fused 3D expert params into per-expert 2D tensors at save time for skipped prefixes.
  • auto_round/utils/missing_tensors.py: Ensures talker.* tensors are treated as truly missing when absent and scopes WOQ quantization to block_name_to_quantize.
  • auto_round/utils/common.py: Adds a collapse_ignore_layers() helper to reduce per-layer ignore config churn.
  • auto_round/compressors/base.py: Filters predefined ignore layers to quantized blocks and collapses numbered ignore layers into regex patterns.
  • auto_round/modeling/fused_moe/qwen3_omni.py: Drops the talker replacement path; keeps only thinker replacement and documents save-time conversion.
  • test/test_cpu/models/test_omni_model.py: Updates assertions to reflect thinker-only default quantization and no talker replacement.
  • test/test_cpu/utils/test_shard_writer.py: Adds tests for save-time expansion of fused experts under skipped prefixes.
  • test/test_cpu/utils/test_missing_tensors.py: Adds tests ensuring missing talker tensors are copied exactly and never WOQ-quantized.
  • .claude/skills/readme.md: Adds a contributor skill documentation index.
  • .claude/skills/review-pr/SKILL.md: Adds a structured PR review checklist (project-specific).
  • .claude/skills/add-vlm-model/SKILL.md: Adds workflow documentation for integrating new VLMs.
  • .claude/skills/add-quantization-datatype/SKILL.md: Adds workflow documentation for adding quantization datatypes.
  • .claude/skills/add-inference-backend/SKILL.md: Adds workflow documentation for adding inference backends.
  • .claude/skills/add-export-format/SKILL.md: Adds workflow documentation for adding export formats.
  • .claude/skills/adapt-new-llm/SKILL.md: Adds workflow documentation for adapting new LLM architectures.
  • .claude/skills/adapt-new-diffusion-model/SKILL.md: Adds workflow documentation for adapting new diffusion architectures.
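
The collapse_ignore_layers() behavior described above (collapsing numbered ignore layers into regex patterns) can be sketched like this; the implementation below is a hypothetical illustration, not the actual helper from auto_round/utils/common.py:

```python
import re
from collections import defaultdict

def collapse_ignore_layers(layer_names):
    """Collapse ignore entries that differ only by a numeric layer index
    into one regex, e.g. talker.layers.0.mlp, talker.layers.1.mlp ->
    talker\\.layers\\.\\d+\\.mlp. Hypothetical sketch of the helper's idea.
    """
    groups = defaultdict(set)
    for name in layer_names:
        # Normalize the numeric index so sibling layers land in one group.
        groups[re.sub(r"\.\d+\.", ".<N>.", name)].add(name)
    collapsed = []
    for key, members in groups.items():
        if len(members) > 1:
            # Escape literal dots, then restore the index as a \d+ wildcard.
            collapsed.append(re.escape(key).replace(re.escape("<N>"), r"\d+"))
        else:
            collapsed.extend(members)
    return sorted(collapsed)

names = ["talker.layers.0.mlp", "talker.layers.1.mlp", "thinker.embed"]
print(collapse_ignore_layers(names))
```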

- Optionally: visual encoder blocks, audio encoder layers

talker is excluded by default because quantizing it has been observed to
degrade audio quality in long-form generation .

Copilot AI Apr 17, 2026


The docstring has an extra space before the period in "long-form generation ." which reads as a typo. Please remove the stray space for consistent documentation formatting.

Suggested change
degrade audio quality in long-form generation .
degrade audio quality in long-form generation.

Comment on lines +146 to +149
result.append((f"{prefix}.{i}.{split_name}.weight", chunk[i].clone()))
else:
for i in range(num_experts):
result.append((f"{prefix}.{i}.{attr_name}.weight", tensor[i].clone()))

Copilot AI Apr 17, 2026


_expand_fused_experts() clones each expert slice before passing it into _add_tensor(), but _add_tensor() immediately detaches and copies to CPU. The clone creates an extra full-size device-side copy and can significantly increase peak RAM/VRAM for large MoE weights. Prefer to avoid clone here (e.g., rely on detach().cpu() / contiguous() inside _add_tensor or just ensure the slice is contiguous) so save-time expansion doesn’t double memory usage.

Suggested change
result.append((f"{prefix}.{i}.{split_name}.weight", chunk[i].clone()))
else:
for i in range(num_experts):
result.append((f"{prefix}.{i}.{attr_name}.weight", tensor[i].clone()))
result.append((f"{prefix}.{i}.{split_name}.weight", chunk[i]))
else:
for i in range(num_experts):
result.append((f"{prefix}.{i}.{attr_name}.weight", tensor[i]))
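
The memory-friendlier expansion suggested above can be sketched as follows. This is an illustrative stand-in using NumPy (the real code operates on torch tensors in shard_writer.py, and the function name and signature here are hypothetical): slicing a contiguous fused array yields a view, so no full-size extra copy is made before the save path takes over.

```python
import numpy as np

def expand_fused_experts(prefix, attr_name, fused):
    """Expand a fused [num_experts, out_dim, in_dim] array into per-expert
    2D weights at save time. Slices are returned as contiguous views
    rather than explicit copies (the clone() the review flags), so
    expansion does not double peak memory for large MoE weights.
    """
    num_experts = fused.shape[0]
    return [
        (f"{prefix}.{i}.{attr_name}.weight", np.ascontiguousarray(fused[i]))
        for i in range(num_experts)
    ]

fused = np.zeros((4, 8, 16), dtype=np.float16)  # 4 experts, each 8x16
tensors = expand_fused_experts("talker.mlp.experts", "down_proj", fused)
print(tensors[0][0])  # talker.mlp.experts.0.down_proj.weight
```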

@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
