LatentSync 1.5

What's new in LatentSync 1.5?

  1. Added temporal layer: our previous claim that the temporal layer severely impairs lip-sync accuracy was incorrect; the issue was actually caused by a bug in the code implementation. We have corrected our paper and updated the code. After incorporating the temporal layer, LatentSync 1.5 demonstrates significantly improved temporal consistency compared to version 1.0.

  2. Improved performance on Chinese videos: many users reported poor performance on Chinese videos, so we added Chinese data to the training of the new model version.

  3. Reduced the VRAM requirement of stage2 training to 20 GB through the following optimizations:

    1. Implemented gradient checkpointing in the U-Net, VAE, SyncNet, and VideoMAE.
    2. Replaced xFormers with PyTorch's native implementation of FlashAttention-2.
    3. Cleared the CUDA cache after loading checkpoints.
    4. Stage2 training now only trains the temporal layer and the audio cross-attention layer, which significantly reduces the VRAM requirement compared to the previous full-parameter fine-tuning.

    Now you can train LatentSync on a single RTX 3090! Start the stage2 training with configs/unet/stage2_efficient.yaml.

  4. Other code optimizations:

    1. Removed the dependency on xFormers and Triton.
    2. Upgraded diffusers to version 0.32.2.
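The xFormers replacement in optimization 2 can be sketched as follows. This is a minimal illustration, not the repository's actual attention module: on PyTorch ≥ 2.0, `torch.nn.functional.scaled_dot_product_attention` dispatches to FlashAttention-2 automatically on supported GPUs, so an external xFormers dependency is no longer needed.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim).
    # On CUDA with supported dtypes/shapes this uses the FlashAttention-2
    # kernel; elsewhere it falls back to an efficient math implementation.
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 8, 16, 64)
out = attention(q, k, v)

# Numerically equivalent to the explicit softmax formulation:
manual = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(64), dim=-1) @ v
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Because the call is built into PyTorch, it also removes the Triton dependency mentioned in the other code optimizations.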
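The selective fine-tuning and gradient checkpointing described above can be sketched like this. The module names (`temporal`, `audio_cross_attn`, `spatial`) are hypothetical stand-ins for illustration; the real LatentSync U-Net uses its own layer names.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyUNet(nn.Module):
    """Toy stand-in for a U-Net with frozen and trainable sub-layers."""
    def __init__(self):
        super().__init__()
        self.spatial = nn.Linear(32, 32)           # pretrained, stays frozen
        self.temporal = nn.Linear(32, 32)          # trained in stage2
        self.audio_cross_attn = nn.Linear(32, 32)  # trained in stage2

model = TinyUNet()

# Freeze everything, then re-enable only the layers stage2 trains.
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if name.startswith(("temporal", "audio_cross_attn")):
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Gradient checkpointing: drop intermediate activations in the forward
# pass and recompute them during backward, trading compute for VRAM.
x = torch.randn(4, 32, requires_grad=True)
y = checkpoint(model.temporal, x, use_reentrant=False)
y.sum().backward()
```

Freezing parameters this way shrinks both the gradient buffers and the optimizer state, which is where most of the VRAM saving over full-parameter fine-tuning comes from.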

LatentSync 1.5 Demo

Original video    | Lip-synced video
demo1_input.mp4   | demo1_output.mp4
demo2_input.mp4   | demo2_output.mp4
demo3_input.mp4   | demo3_output.mp4
demo4_input.mp4   | demo4_output.mp4