Official PyTorch Implementation of FLOAT: Flow Matching for Audio-driven Talking Portrait Video Generation
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
Taekyung Ki, Dongchan Min, Gyeongsu Chae
Project Page: https://deepbrainai-research.github.io/float/
Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
TL;DR: FLOAT is a flow-matching-based, audio-driven talking portrait video generation method that can enhance speech-driven emotional motion.
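For readers unfamiliar with flow matching, the sketch below illustrates the core sampling idea in a motion latent space: a learned vector field is integrated from Gaussian noise to motion latents with a few Euler steps, with classifier-free guidance on the audio condition. This is a minimal conceptual sketch, not the released model; the `vector_field` interface, latent dimensions, and conditioning tensors (`ref_latent`, `audio_feat`) are hypothetical placeholders.

```python
# Minimal conceptual sketch of flow-matching sampling in a motion latent space.
# The vector field predictor and conditioning interface below are hypothetical
# placeholders, not the released FLOAT architecture.
import torch

@torch.no_grad()
def sample_motion_latents(vector_field, ref_latent, audio_feat,
                          num_frames=50, latent_dim=512, nfe=10, a_cfg_scale=2.0):
    """Integrate a learned vector field from noise to motion latents with Euler steps."""
    x = torch.randn(1, num_frames, latent_dim)        # start from Gaussian noise
    dt = 1.0 / nfe
    for i in range(nfe):
        t = torch.full((1,), i * dt)
        # classifier-free guidance on the audio condition (hypothetical interface)
        v_cond = vector_field(x, t, ref_latent, audio_feat)
        v_uncond = vector_field(x, t, ref_latent, None)
        v = v_uncond + a_cfg_scale * (v_cond - v_uncond)
        x = x + dt * v                                 # Euler step along the flow
    return x                                           # motion latents for the decoder
```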
| Result 1 | Result 2 |
|---|---|
| result1.mp4 | results2.mp4 |

| Result 3 | Result 4 |
|---|---|
| result3.mp4 | result4.mp4 |
Our method runs faster than current diffusion-based methods with fewer sampling steps and lower memory cost. For more details, please refer to the paper.
- [2025.02.17] The inference code and checkpoints are released under a Non-commercial License.
- [2024.12.03] Selected as a HuggingFace Daily Paper on December 3, 2024.
- [2024.12.02] The paper is publicly available on ArXiv.
```bash
# 1. Create Conda Environment
conda create -n FLOAT python=3.8.5
conda activate FLOAT

# 2. Install torch and requirements
sh environments.sh
# or manual installation
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
- Tested on Linux with A100 and V100 GPUs.
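Before running inference, you may want to confirm that the pinned PyTorch build sees your GPU. A quick sanity check like the following (not part of the repository) can help:

```python
# Quick environment sanity check (not part of the repository).
import torch
import torchvision
import torchaudio

print("torch:", torch.__version__)            # expected 2.0.1+cu118
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```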
- Download the checkpoints automatically:
  ```bash
  sh download_checkpoints.sh
  ```
  or download the checkpoints manually from this google-drive.
- The checkpoints should be organized as follows:
```
./checkpoints
|-- checkpoints_here
|-- float.pth                                        # main model
|-- wav2vec2-base-960h/                              # audio encoder
|   |-- .gitattributes
|   |-- config.json
|   |-- feature_extractor_config.json
|   |-- model.safetensors
|   |-- preprocessor_config.json
|   |-- pytorch_model.bin
|   |-- README.md
|   |-- special_tokens_map.json
|   |-- tf_model.h5
|   |-- tokenizer_config.json
|   '-- vocab.json
'-- wav2vec-english-speech-emotion-recognition/      # emotion encoder
    |-- .gitattributes
    |-- config.json
    |-- preprocessor_config.json
    |-- pytorch_model.bin
    |-- README.md
    '-- training_args.bin
```
- The Wav2Vec-based models can be found at the following links: wav2vec2-base-960h and wav2vec-english-speech-emotion-recognition.
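If inference fails with missing-file errors, a small script like the one below can verify that the layout described above is in place. This is a convenience sketch, not part of the repository; the file list mirrors the tree shown above.

```python
# Convenience sketch: verify the checkpoint layout described above.
from pathlib import Path

CKPT_ROOT = Path("./checkpoints")
expected = [
    "float.pth",                                                 # main model
    "wav2vec2-base-960h/config.json",                            # audio encoder
    "wav2vec2-base-960h/pytorch_model.bin",
    "wav2vec-english-speech-emotion-recognition/config.json",    # emotion encoder
    "wav2vec-english-speech-emotion-recognition/pytorch_model.bin",
]
missing = [p for p in expected if not (CKPT_ROOT / p).exists()]
print("All checkpoints found." if not missing else f"Missing files: {missing}")
```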
- Pre-processing (❗ Important ❗ for better quality): please read the notes below.
- FLOAT is trained on frontal head-pose distributions. Non-frontal images may lead to suboptimal results.
- The performance of talking portrait methods often depends on their training pre-processing strategies, e.g., the field-of-view. The inference code includes an automatic face-cropping function, which may introduce black padding regions. You can manually disable cropping in generate.py; however, this may lead to suboptimal performance.
- If your audio contains heavy background music, please use ClearVoice to extract the vocals for better performance. (A short audio pre-processing sketch follows this list.)
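Since the Wav2Vec 2.0 encoders above expect 16 kHz mono speech, resampling your input audio beforehand is a reasonable precaution. The snippet below is a hedged convenience sketch assuming librosa and soundfile are available (they are not guaranteed to be in requirements.txt); the paths are placeholders.

```python
# Convenience sketch: resample input audio to 16 kHz mono, the format expected
# by Wav2Vec 2.0 encoders. Paths are placeholders; this is not part of generate.py.
import librosa
import soundfile as sf

wav, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)  # resample + downmix
sf.write("path/to/audio_16k.wav", wav, 16000)                     # save for inference
```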
- Generating video 1 (Emotion from Audio)

  You can generate a video whose emotion is inferred from the audio by omitting `--emo`. You can adjust the intensity of the emotion using `--e_cfg_scale` (default 1). For a more emotion-intensive video, try larger values from 5 to 10 for `--e_cfg_scale` (an intensity-sweep sketch follows this example).

  ```bash
  CUDA_VISIBLE_DEVICES=0 python generate.py \
      --ref_path path/to/reference/image \
      --aud_path path/to/audio \
      --seed 15 \
      --a_cfg_scale 2 \
      --e_cfg_scale 1 \
      --ckpt_path ./checkpoints/float.pth \
      --no_crop   # [optional] skip cropping
  ```
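To compare different emotion intensities for the audio-driven case, you can sweep `--e_cfg_scale` with a small driver script like the one below. This is a convenience sketch that simply re-invokes generate.py with the flags documented above; the paths are placeholders.

```python
# Convenience sketch: sweep the emotion intensity (--e_cfg_scale) for the
# audio-driven case. Paths are placeholders; generate.py is invoked as documented above.
import subprocess

for e_cfg in [1, 5, 10]:  # 1 = default, 5-10 = stronger emotion
    subprocess.run([
        "python", "generate.py",
        "--ref_path", "path/to/reference/image",
        "--aud_path", "path/to/audio",
        "--seed", "15",
        "--a_cfg_scale", "2",
        "--e_cfg_scale", str(e_cfg),
        "--ckpt_path", "./checkpoints/float.pth",
    ], check=True)
    # Note: check how generate.py names its outputs so successive runs do not overwrite each other.
```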
- Generating video 2 (Redirecting Emotion)

  You can generate a video with a different emotion by specifying `--emo`. It supports seven basic emotions: ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']. You can adjust the intensity of the emotion using `--e_cfg_scale` (default 1). For a more emotion-intensive video, try larger values from 5 to 10 for `--e_cfg_scale`. A sketch that loops over all seven emotions follows the example clip below.

  ```bash
  # --emo supports ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
  CUDA_VISIBLE_DEVICES=0 python generate.py \
      --ref_path path/to/reference/image \
      --aud_path path/to/audio \
      --emo 'happy' \
      --seed 15 \
      --a_cfg_scale 2 \
      --e_cfg_scale 1 \
      --ckpt_path ./checkpoints/float.pth \
      --no_crop   # [optional] skip cropping
  ```
emotion_redirection_1.mp4
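To render the same reference image and audio under each of the seven emotions, a small driver script such as the following can call generate.py repeatedly. This is a convenience sketch; it assumes generate.py accepts the flags shown above, and the paths are placeholders.

```python
# Convenience sketch: render the same reference/audio pair under all seven emotions.
# Assumes generate.py accepts the flags documented above; paths are placeholders.
import subprocess

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

for emo in EMOTIONS:
    subprocess.run([
        "python", "generate.py",
        "--ref_path", "path/to/reference/image",
        "--aud_path", "path/to/audio",
        "--emo", emo,
        "--seed", "15",
        "--a_cfg_scale", "2",
        "--e_cfg_scale", "5",   # larger values give stronger emotion (try 5-10)
        "--ckpt_path", "./checkpoints/float.pth",
    ], check=True)
```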
- Running example and results

  ```bash
  CUDA_VISIBLE_DEVICES=0 python generate.py \
      --ref_path assets/sam_altman.webp \
      --aud_path assets/aud-sample-vs-1.wav \
      --seed 15 \
      --a_cfg_scale 2 \
      --e_cfg_scale 1 \
      --ckpt_path ./checkpoints/float.pth
  ```
| Before Crop | After Crop | Result |
|---|---|---|
|  |  | sam_altman_result.mp4 |
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. You may not use this work for commercial purposes and may use it only for research purposes. For any commercial inquiries or collaboration opportunities, please contact [email protected].
This repository is a research demonstration implementation and is provided as a one-time code drop. For any research-related inquiries, please contact the first author Taekyung Ki. This work was done during the first author's South Korean Alternative Military Service at DeepBrain AI. This repository includes only the inference code; the training code will not be released.
```bibtex
@article{ki2024float,
  title={FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait},
  author={Ki, Taekyung and Min, Dongchan and Chae, Gyeongsu},
  journal={arXiv preprint arXiv:2412.01064},
  year={2024}
}
```
- StyleLipSync: Style-based Personalized Lip-sync Video Generation
- StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
- Export3D: Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation
The source images and audio are collected from the internet and from other baselines, such as SadTalker, EMO, VASA-1, Hallo, LivePortrait, Loopy, and others. We appreciate their valuable contributions to this field. We employ the Wav2Vec 2.0-based speech emotion recognizer by Rob Field. We appreciate this good work.