Official PyTorch Implementation of FLOAT: Flow Matching for Audio-driven Talking Portrait Video Generation
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
Taekyung Ki, Dongchan Min, Gyeongsu Chae
Project Page: https://deepbrainai-research.github.io/float/
Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
TL;DR: FLOAT is a flow-matching-based, audio-driven talking portrait video generation method that can enhance speech-driven emotional motion.
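For readers unfamiliar with flow matching, the sketch below illustrates the core sampling idea in a motion latent space: a learned vector field is integrated from Gaussian noise to motion latents with a few Euler steps, with classifier-free guidance on the audio condition. This is a minimal conceptual sketch, not the released model; the `vector_field` interface, latent dimensions, and conditioning tensors (`ref_latent`, `audio_feat`) are hypothetical placeholders.

```python
# Minimal conceptual sketch of flow-matching sampling in a motion latent space.
# The vector field predictor and conditioning interface below are hypothetical
# placeholders, not the released FLOAT architecture.
import torch

@torch.no_grad()
def sample_motion_latents(vector_field, ref_latent, audio_feat,
                          num_frames=50, latent_dim=512, nfe=10, a_cfg_scale=2.0):
    """Integrate a learned vector field from noise to motion latents with Euler steps."""
    x = torch.randn(1, num_frames, latent_dim)        # start from Gaussian noise
    dt = 1.0 / nfe
    for i in range(nfe):
        t = torch.full((1,), i * dt)
        # classifier-free guidance on the audio condition (hypothetical interface)
        v_cond = vector_field(x, t, ref_latent, audio_feat)
        v_uncond = vector_field(x, t, ref_latent, None)
        v = v_uncond + a_cfg_scale * (v_cond - v_uncond)
        x = x + dt * v                                 # Euler step along the flow
    return x                                           # motion latents for the decoder
```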
| Result 1 | Result 2 |
|---|---|
| result1.mp4 | results2.mp4 |

| Result 3 | Result 4 |
|---|---|
| result3.mp4 | result4.mp4 |
Our method runs faster than current diffusion-based methods with fewer sampling steps and lower memory cost. For more details, please refer to the paper.
- [2025.02.17] The inference code and checkpoints are released under a Non-commercial License.
- [2024.12.03] Selected as a HuggingFace Daily Paper on December 3, 2024.
- [2024.12.02] The paper is publicly available on ArXiv.
```bash
# 1. Create Conda Environment
conda create -n FLOAT python=3.8.5
conda activate FLOAT

# 2. Install torch and requirements
sh environments.sh
# or manual installation
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
- Tested on Linux with A100 and V100 GPUs.
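Before running inference, you may want to confirm that the pinned PyTorch build sees your GPU. A quick sanity check like the following (not part of the repository) can help:

```python
# Quick environment sanity check (not part of the repository).
import torch
import torchvision
import torchaudio

print("torch:", torch.__version__)            # expected 2.0.1+cu118
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```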
- Download the checkpoints automatically:
  ```bash
  sh download_checkpoints.sh
  ```
  or download the checkpoints manually from this google-drive.
- The checkpoints should be organized as follows:
```
./checkpoints
|-- checkpoints_here
|-- float.pth                                        # main model
|-- wav2vec2-base-960h/                              # audio encoder
|   |-- .gitattributes
|   |-- config.json
|   |-- feature_extractor_config.json
|   |-- model.safetensors
|   |-- preprocessor_config.json
|   |-- pytorch_model.bin
|   |-- README.md
|   |-- special_tokens_map.json
|   |-- tf_model.h5
|   |-- tokenizer_config.json
|   '-- vocab.json
'-- wav2vec-english-speech-emotion-recognition/      # emotion encoder
    |-- .gitattributes
    |-- config.json
    |-- preprocessor_config.json
    |-- pytorch_model.bin
    |-- README.md
    '-- training_args.bin
```
- The Wav2Vec-based models can be found at the following links: wav2vec2-base-960h and wav2vec-english-speech-emotion-recognition.
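If inference fails with missing-file errors, a small script like the one below can verify that the layout described above is in place. This is a convenience sketch, not part of the repository; the file list mirrors the tree shown above.

```python
# Convenience sketch: verify the checkpoint layout described above.
from pathlib import Path

CKPT_ROOT = Path("./checkpoints")
expected = [
    "float.pth",                                                 # main model
    "wav2vec2-base-960h/config.json",                            # audio encoder
    "wav2vec2-base-960h/pytorch_model.bin",
    "wav2vec-english-speech-emotion-recognition/config.json",    # emotion encoder
    "wav2vec-english-speech-emotion-recognition/pytorch_model.bin",
]
missing = [p for p in expected if not (CKPT_ROOT / p).exists()]
print("All checkpoints found." if not missing else f"Missing files: {missing}")
```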
- Pre-processing (❗ Important ❗ for better quality): please read the notes below.
- FLOAT is trained on frontal head-pose distributions. Non-frontal images may lead to suboptimal results.
- The performance of talking portrait methods often depends on their training pre-processing strategies, e.g., the field-of-view. The inference code includes an automatic face-cropping function, which may introduce black padding regions. You can manually disable cropping in generate.py; however, this may lead to suboptimal performance.
- If your audio contains heavy background music, please use ClearVoice to extract the vocals for better performance. (A short audio pre-processing sketch follows this list.)
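Since the Wav2Vec 2.0 encoders above expect 16 kHz mono speech, resampling your input audio beforehand is a reasonable precaution. The snippet below is a hedged convenience sketch assuming librosa and soundfile are available (they are not guaranteed to be in requirements.txt); the paths are placeholders.

```python
# Convenience sketch: resample input audio to 16 kHz mono, the format expected
# by Wav2Vec 2.0 encoders. Paths are placeholders; this is not part of generate.py.
import librosa
import soundfile as sf

wav, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)  # resample + downmix
sf.write("path/to/audio_16k.wav", wav, 16000)                     # save for inference
```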
- Generating video 1 (Emotion from Audio)

  You can generate a video whose emotion is inferred from the audio by omitting `--emo`. You can adjust the intensity of the emotion using `--e_cfg_scale` (default 1). For a more emotion-intensive video, try larger values from 5 to 10 for `--e_cfg_scale` (an intensity-sweep sketch follows this example).

  ```bash
  CUDA_VISIBLE_DEVICES=0 python generate.py \
      --ref_path path/to/reference/image \
      --aud_path path/to/audio \
      --seed 15 \
      --a_cfg_scale 2 \
      --e_cfg_scale 1 \
      --ckpt_path ./checkpoints/float.pth \
      --no_crop   # [optional] skip cropping
  ```
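To compare different emotion intensities for the audio-driven case, you can sweep `--e_cfg_scale` with a small driver script like the one below. This is a convenience sketch that simply re-invokes generate.py with the flags documented above; the paths are placeholders.

```python
# Convenience sketch: sweep the emotion intensity (--e_cfg_scale) for the
# audio-driven case. Paths are placeholders; generate.py is invoked as documented above.
import subprocess

for e_cfg in [1, 5, 10]:  # 1 = default, 5-10 = stronger emotion
    subprocess.run([
        "python", "generate.py",
        "--ref_path", "path/to/reference/image",
        "--aud_path", "path/to/audio",
        "--seed", "15",
        "--a_cfg_scale", "2",
        "--e_cfg_scale", str(e_cfg),
        "--ckpt_path", "./checkpoints/float.pth",
    ], check=True)
    # Note: check how generate.py names its outputs so successive runs do not overwrite each other.
```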
- Generating video 2 (Redirecting Emotion)

  You can generate a video with a different emotion by specifying `--emo`. It supports seven basic emotions: ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']. You can adjust the intensity of the emotion using `--e_cfg_scale` (default 1). For a more emotion-intensive video, try larger values from 5 to 10 for `--e_cfg_scale`. A sketch that loops over all seven emotions follows the example clip below.

  ```bash
  # --emo supports ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
  CUDA_VISIBLE_DEVICES=0 python generate.py \
      --ref_path path/to/reference/image \
      --aud_path path/to/audio \
      --emo 'happy' \
      --seed 15 \
      --a_cfg_scale 2 \
      --e_cfg_scale 1 \
      --ckpt_path ./checkpoints/float.pth \
      --no_crop   # [optional] skip cropping
  ```
emotion_redirection_1.mp4
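To render the same reference image and audio under each of the seven emotions, a small driver script such as the following can call generate.py repeatedly. This is a convenience sketch; it assumes generate.py accepts the flags shown above, and the paths are placeholders.

```python
# Convenience sketch: render the same reference/audio pair under all seven emotions.
# Assumes generate.py accepts the flags documented above; paths are placeholders.
import subprocess

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

for emo in EMOTIONS:
    subprocess.run([
        "python", "generate.py",
        "--ref_path", "path/to/reference/image",
        "--aud_path", "path/to/audio",
        "--emo", emo,
        "--seed", "15",
        "--a_cfg_scale", "2",
        "--e_cfg_scale", "5",   # larger values give stronger emotion (try 5-10)
        "--ckpt_path", "./checkpoints/float.pth",
    ], check=True)
```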
- Running example and results

  ```bash
  CUDA_VISIBLE_DEVICES=0 python generate.py \
      --ref_path assets/sam_altman.webp \
      --aud_path assets/aud-sample-vs-1.wav \
      --seed 15 \
      --a_cfg_scale 2 \
      --e_cfg_scale 1 \
      --ckpt_path ./checkpoints/float.pth
  ```
| Before Crop | After Crop | Result |
|---|---|---|
|  |  | sam_altman_result.mp4 |
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. You may not use this work for commercial purposes and may use it only for research purposes. For any commercial inquiries or collaboration opportunities, please contact [email protected].
This repository is a research demonstration implementation and is provided as a one-time code drop. For any research-related inquiries, please contact the first author Taekyung Ki. This work was done during the first author's South Korean Alternative Military Service at DeepBrain AI. This repository includes only the inference code; the training code will not be released.
```bibtex
@article{ki2024float,
  title={FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait},
  author={Ki, Taekyung and Min, Dongchan and Chae, Gyeongsu},
  journal={arXiv preprint arXiv:2412.01064},
  year={2024}
}
```
- StyleLipSync: Style-based Personalized Lip-sync Video Generation
- StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
- Export3D: Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation
The source images and audio are collected from the internet and from other baselines, such as SadTalker, EMO, VASA-1, Hallo, LivePortrait, Loopy, and others. We appreciate their valuable contributions to this field. We employ the Wav2Vec 2.0-based speech emotion recognizer by Rob Field. We appreciate this good work.