If you find this project helpful, consider buying me a coffee.
Unofficial LatentSync 1.5 implementation for ComfyUI on Windows and WSL 2.
This node provides advanced lip-sync capabilities in ComfyUI using ByteDance's LatentSync 1.5 model. It lets you lip-sync a video to an audio input, with improved temporal consistency and better performance across a wider range of languages.
- Temporal Layer Improvements: Corrected implementation now provides significantly improved temporal consistency compared to version 1.0
- Better Chinese Language Support: Performance on Chinese videos is now substantially improved through additional training data
- Reduced VRAM Requirements: Now only requires 20GB VRAM (can run on an RTX 3090) through various optimizations:
  - Gradient checkpointing in the U-Net, VAE, SyncNet, and VideoMAE
  - Native PyTorch FlashAttention-2 implementation, no xFormers dependency (see the attention sketch after this list)
  - More efficient CUDA cache management
  - Focused training of the temporal and audio cross-attention layers only
- Code Optimizations:
  - Removed dependencies on xFormers and Triton
  - Upgraded to diffusers 0.32.2
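As a rough illustration of the xFormers-free attention path (not the node's actual code), PyTorch 2.x exposes FlashAttention-style kernels through its built-in `scaled_dot_product_attention`; the tensor shapes below are placeholders:

```python
import torch
import torch.nn.functional as F

# Placeholder shapes: (batch, heads, sequence length, head dim).
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# PyTorch picks an efficient backend (FlashAttention when eligible) on its own,
# so no xFormers or Triton install is needed for memory-efficient attention.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```

Relying on this built-in call is the general mechanism that lets a diffusion U-Net drop the extra xFormers and Triton dependencies.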
Before installing this node, you must install the following in order:
- ComfyUI installed and working
- FFmpeg installed on your system:
  - Windows: Download an FFmpeg build and add it to your system PATH
Only proceed with installation after confirming all prerequisites are installed and working.
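If you want a quick sanity check that FFmpeg is reachable from your environment, a small script along these lines works (purely illustrative, not part of the node):

```python
import shutil
import subprocess

# Confirm FFmpeg is discoverable on the system PATH.
if shutil.which("ffmpeg") is None:
    raise SystemExit("FFmpeg not found on PATH - install it before continuing.")

# Print the detected FFmpeg version as a final sanity check.
subprocess.run(["ffmpeg", "-version"], check=True)
```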
- Clone this repository into your ComfyUI custom_nodes directory and install the requirements:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-LatentSyncWrapper.git
cd ComfyUI-LatentSyncWrapper
pip install -r requirements.txt
```
The requirements.txt file pulls in the following dependencies:

```
diffusers>=0.32.2
transformers
huggingface-hub
omegaconf
einops
opencv-python
mediapipe
face-alignment
decord
ffmpeg-python
safetensors
soundfile
```
On first use, the node will automatically download required model files from HuggingFace:
- LatentSync 1.5 UNet model
- Whisper model for audio processing
- You can also manually download the models from the HuggingFace repo: https://huggingface.co/ByteDance/LatentSync-1.5
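If you prefer to script the manual download, the `huggingface-hub` package from the requirements can fetch the whole repo; the `./checkpoints` target below is an assumption based on the directory layout shown next:

```python
from huggingface_hub import snapshot_download

# Fetch the LatentSync 1.5 weights; local_dir assumes the ./checkpoints
# layout shown below.
snapshot_download(
    repo_id="ByteDance/LatentSync-1.5",
    local_dir="./checkpoints",
)
```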
After successful installation and model download, your checkpoint directory structure should look like this:
```
./checkpoints/
|-- .cache/
|-- auxiliary/
|-- whisper/
|   `-- tiny.pt
|-- config.json
|-- latentsync_unet.pt (~5GB)
|-- stable_syncnet.pt (~1.6GB)
```
Make sure all these files are present for proper functionality. The main model files are:
- `latentsync_unet.pt`: The primary LatentSync 1.5 model
- `stable_syncnet.pt`: The SyncNet model for lip-sync supervision
- `whisper/tiny.pt`: The Whisper model for audio processing
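A quick way to confirm the layout is a check like the one below, with paths taken from the tree above (purely illustrative):

```python
from pathlib import Path

# Main model files, following the checkpoint layout shown above.
checkpoints = Path("./checkpoints")
required = [
    checkpoints / "latentsync_unet.pt",
    checkpoints / "stable_syncnet.pt",
    checkpoints / "whisper" / "tiny.pt",
]

missing = [path for path in required if not path.exists()]
if missing:
    print("Missing model files:", ", ".join(str(p) for p in missing))
else:
    print("All main model files are present.")
```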
- Select an input video file with the AceNodes video loader
- Load an audio file using the ComfyUI audio loader
- (Optional) Set a seed value for reproducible results
- (Optional) Adjust the lips_expression parameter to control lip movement intensity
- (Optional) Modify the inference_steps parameter to balance quality and speed
- Connect to the LatentSync1.5 node
- Run the workflow
The processed video will be saved in ComfyUI's output directory.
- `video_path`: Path to the input video file
- `audio`: Audio input from the AceNodes audio loader
- `seed`: Random seed for reproducible results (default: 1247)
- `lips_expression`: Controls the expressiveness of lip movements (default: 1.5)
  - Higher values (2.0-3.0): More pronounced lip movements, better for expressive speech
  - Lower values (1.0-1.5): Subtler lip movements, better for calm speech
  - This parameter affects the model's guidance scale, balancing between natural movement and lip-sync accuracy (see the sketch after this list)
- `inference_steps`: Number of denoising steps during inference (default: 20)
  - Higher values (30-50): Better quality results but slower processing
  - Lower values (10-15): Faster processing but potentially lower quality
  - The default of 20 usually provides a good balance between quality and speed
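The node's internal wiring isn't reproduced here, but since lips_expression feeds the guidance scale, a generic classifier-free-guidance step illustrates why larger values exaggerate the audio-conditioned motion. The function and tensors below are purely illustrative, not the node's code:

```python
import torch

def guided_noise_prediction(noise_uncond: torch.Tensor,
                            noise_cond: torch.Tensor,
                            guidance_scale: float) -> torch.Tensor:
    """Generic classifier-free guidance step (illustrative only).

    A larger guidance_scale pushes the prediction further toward the
    audio-conditioned branch, which is roughly why higher lips_expression
    values yield more pronounced lip movements.
    """
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy tensors standing in for U-Net noise predictions.
uncond = torch.zeros(1, 4, 8, 8)
cond = torch.ones(1, 4, 8, 8)
print(guided_noise_prediction(uncond, cond, guidance_scale=1.5).mean())  # tensor(1.5000)
```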
- For speeches or presentations where clear lip movements are important, try increasing the lips_expression value to 2.0-2.5
- For casual conversations, the default value of 1.5 usually works well
- If lip movements appear unnatural or exaggerated, try lowering the lips_expression value
- Different values may work better for different languages and speech patterns
- If you need higher quality results and have time to wait, increase inference_steps to 30-50
- For quicker previews or less critical applications, reduce inference_steps to 10-15
- Works best with clear, frontal face videos
- Currently does not support anime/cartoon faces
- Video should be at 25 FPS (it will be converted automatically; a manual pre-conversion sketch follows this list)
- Face should be visible throughout the video
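The node handles the frame-rate conversion itself, but if you want to pre-convert a clip to 25 FPS, the bundled ffmpeg-python dependency can do it; the file names here are placeholders:

```python
import ffmpeg

# Re-encode a clip to 25 FPS; input/output names are placeholders.
(
    ffmpeg
    .input("input.mp4")
    .output("input_25fps.mp4", r=25)
    .run(overwrite_output=True)
)
```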
This is an unofficial implementation based on:
- LatentSync 1.5 by ByteDance Research
- ComfyUI
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.