- Add temporal layer: our previous claim that the temporal layer severely impairs lip-sync accuracy was incorrect; the issue was actually caused by a bug in the code implementation. We have corrected our paper and updated the code. After incorporating the temporal layer, LatentSync 1.5 demonstrates significantly improved temporal consistency compared to version 1.0.
- Improve performance on Chinese videos: many issues reported poor performance on Chinese videos, so we added Chinese data to the training of the new model version.
- Reduce the VRAM requirement of the stage2 training to 20 GB through the following optimizations (minimal sketches of each follow after this list):
  - Implement gradient checkpointing in the U-Net, VAE, SyncNet and VideoMAE.
  - Replace xFormers with PyTorch's native implementation of FlashAttention-2.
  - Clear the CUDA cache after loading checkpoints.
  - The stage2 training only requires training the temporal layer and audio cross-attention layer, which significantly reduces the VRAM requirement compared to the previous full-parameter fine-tuning.

  Now you can train LatentSync on a single RTX 3090! Start the stage2 training with `configs/unet/stage2_efficient.yaml`.
- Other code optimizations:
  - Remove the dependency on xFormers and Triton.
  - Upgrade the diffusers version to `0.32.2`.
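The following are minimal sketches of the four VRAM optimizations above; they illustrate the techniques under stated assumptions and are not the repo's actual training code. First, gradient checkpointing: diffusers models (the U-Net and VAE) and transformers models (VideoMAE) each expose a helper for it, so a small utility can cover all of them; a custom module like SyncNet would instead wrap its expensive blocks with `torch.utils.checkpoint.checkpoint`.

```python
def enable_gradient_checkpointing(*modules):
    """Recompute activations in the backward pass instead of storing them,
    which is the largest single VRAM saving during training."""
    for module in modules:
        if hasattr(module, "enable_gradient_checkpointing"):
            # diffusers models (U-Net, VAE) expose this helper.
            module.enable_gradient_checkpointing()
        elif hasattr(module, "gradient_checkpointing_enable"):
            # transformers models (e.g. VideoMAE) use this name instead.
            module.gradient_checkpointing_enable()

# Usage (module names are illustrative):
# enable_gradient_checkpointing(unet, vae, syncnet, videomae)
```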
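Second, the attention swap: since PyTorch 2.2, `torch.nn.functional.scaled_dot_product_attention` dispatches to a fused FlashAttention-2 kernel for half-precision CUDA tensors, so an xFormers memory-efficient attention call can be replaced with a one-liner. A sketch, not the repo's actual attention processor:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim). PyTorch picks the
    # FlashAttention-2 backend automatically when the inputs allow it.
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 8, 256, 64, device="cuda", dtype=torch.float16)
out = attention(q, k, v)

# On PyTorch >= 2.3 you can force the flash backend to verify it is used:
# from torch.nn.attention import SDPBackend, sdpa_kernel
# with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
#     out = attention(q, k, v)
```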
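Third, clearing the CUDA cache after loading checkpoints: the deserialized weight copy and the allocator's cached blocks can otherwise hold on to memory into training. A sketch with an illustrative `load_checkpoint` helper (not a function from the repo):

```python
import gc
import torch

def load_checkpoint(model, path):
    """Load weights onto the model, then release the temporary CPU copy
    and any cached CUDA blocks before training starts."""
    state_dict = torch.load(path, map_location="cpu")
    model.load_state_dict(state_dict)
    del state_dict
    gc.collect()
    torch.cuda.empty_cache()
```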
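Finally, restricting stage2 training to the temporal layer and audio cross-attention layer amounts to freezing every other parameter before building the optimizer. The keyword strings below are assumptions for illustration; match them to the actual module names:

```python
import torch

def trainable_parameters(model, keywords=("temporal", "audio")):
    """Freeze all parameters except those whose names contain one of the
    keywords, and return the trainable ones for the optimizer."""
    params = []
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keywords)
        if param.requires_grad:
            params.append(param)
    return params

# Usage (model name is illustrative):
# optimizer = torch.optim.AdamW(trainable_parameters(unet), lr=1e-5)
```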
| Original video | Lip-synced video |
| :---: | :---: |
| demo1_input.mp4 | demo1_output.mp4 |
| demo2_input.mp4 | demo2_output.mp4 |
| demo3_input.mp4 | demo3_output.mp4 |
| demo4_input.mp4 | demo4_output.mp4 |