- Add temporal layer: our previous claim that the temporal layer severely impairs lip-sync accuracy was incorrect; the issue was actually caused by a bug in the code implementation. We have corrected our paper and updated the code. After incorporating the temporal layer, LatentSync 1.5 demonstrates significantly improved temporal consistency compared to version 1.0.
- Improve performance on Chinese videos: many issues reported poor performance on Chinese videos, so we added Chinese data to the training of the new model version.
- Reduce the VRAM requirement of the stage2 training to 20 GB through the following optimizations (minimal sketches of each follow after this list):
  - Implement gradient checkpointing in the U-Net, VAE, SyncNet and VideoMAE.
  - Replace xFormers with PyTorch's native implementation of FlashAttention-2.
  - Clear the CUDA cache after loading checkpoints.
  - The stage2 training only requires training the temporal layer and audio cross-attention layer, which significantly reduces the VRAM requirement compared to the previous full-parameter fine-tuning.

  Now you can train LatentSync on a single RTX 3090! Start the stage2 training with `configs/unet/stage2_efficient.yaml`.
- Other code optimizations:
  - Remove the dependency on xFormers and Triton.
  - Upgrade the diffusers version to `0.32.2`.
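The following are minimal sketches of the four VRAM optimizations above; they illustrate the techniques under stated assumptions and are not the repo's actual training code. First, gradient checkpointing: diffusers models (the U-Net and VAE) and transformers models (VideoMAE) each expose a helper for it, so a small utility can cover all of them; a custom module like SyncNet would instead wrap its expensive blocks with `torch.utils.checkpoint.checkpoint`.

```python
def enable_gradient_checkpointing(*modules):
    """Recompute activations in the backward pass instead of storing them,
    which is the largest single VRAM saving during training."""
    for module in modules:
        if hasattr(module, "enable_gradient_checkpointing"):
            # diffusers models (U-Net, VAE) expose this helper.
            module.enable_gradient_checkpointing()
        elif hasattr(module, "gradient_checkpointing_enable"):
            # transformers models (e.g. VideoMAE) use this name instead.
            module.gradient_checkpointing_enable()

# Usage (module names are illustrative):
# enable_gradient_checkpointing(unet, vae, syncnet, videomae)
```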
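Second, the attention swap: since PyTorch 2.2, `torch.nn.functional.scaled_dot_product_attention` dispatches to a fused FlashAttention-2 kernel for half-precision CUDA tensors, so an xFormers memory-efficient attention call can be replaced with a one-liner. A sketch, not the repo's actual attention processor:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim). PyTorch picks the
    # FlashAttention-2 backend automatically when the inputs allow it.
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 8, 256, 64, device="cuda", dtype=torch.float16)
out = attention(q, k, v)

# On PyTorch >= 2.3 you can force the flash backend to verify it is used:
# from torch.nn.attention import SDPBackend, sdpa_kernel
# with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
#     out = attention(q, k, v)
```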
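Third, clearing the CUDA cache after loading checkpoints: the deserialized weight copy and the allocator's cached blocks can otherwise hold on to memory into training. A sketch with an illustrative `load_checkpoint` helper (not a function from the repo):

```python
import gc
import torch

def load_checkpoint(model, path):
    """Load weights onto the model, then release the temporary CPU copy
    and any cached CUDA blocks before training starts."""
    state_dict = torch.load(path, map_location="cpu")
    model.load_state_dict(state_dict)
    del state_dict
    gc.collect()
    torch.cuda.empty_cache()
```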
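Finally, restricting stage2 training to the temporal layer and audio cross-attention layer amounts to freezing every other parameter before building the optimizer. The keyword strings below are assumptions for illustration; match them to the actual module names:

```python
import torch

def trainable_parameters(model, keywords=("temporal", "audio")):
    """Freeze all parameters except those whose names contain one of the
    keywords, and return the trainable ones for the optimizer."""
    params = []
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keywords)
        if param.requires_grad:
            params.append(param)
    return params

# Usage (model name is illustrative):
# optimizer = torch.optim.AdamW(trainable_parameters(unet), lr=1e-5)
```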
| Original video | Lip-synced video |
| :---: | :---: |
| demo1_input.mp4 | demo1_output.mp4 |
| demo2_input.mp4 | demo2_output.mp4 |
| demo3_input.mp4 | demo3_output.mp4 |
| demo4_input.mp4 | demo4_output.mp4 |