FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Official PyTorch Implementation of FLOAT: Flow Matching for Audio-driven Talking Portrait Video Generation

(preview video)

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
Taekyung Ki, Dongchan Min, Gyeongsu Chae

Project Page: https://deepbrainai-research.github.io/float/

Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.

TL;DR: FLOAT is a flow-matching-based, audio-driven talking portrait video generation method that can enhance speech-driven emotional motion.
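For intuition, the sketch below shows how flow-matching sampling in a motion latent space typically works: starting from Gaussian noise, a learned vector field is integrated over a few Euler steps, conditioned on audio features and the reference identity. The `vector_field` callable, tensor shapes, and step count are hypothetical placeholders for illustration only, not the API of this repository.

    import torch

    @torch.no_grad()
    def sample_motion_latents(vector_field, audio_feats, ref_latent, num_steps=10):
        """Euler integration of a learned vector field from noise to motion latents.

        `vector_field(x, t, audio_feats, ref_latent)` is a hypothetical stand-in
        for the transformer-based predictor described in the paper.
        """
        x = torch.randn_like(ref_latent)              # start from Gaussian noise
        ts = torch.linspace(0.0, 1.0, num_steps + 1)  # uniform time grid on [0, 1]
        for i in range(num_steps):
            t, t_next = ts[i], ts[i + 1]
            v = vector_field(x, t, audio_feats, ref_latent)  # predicted velocity dx/dt
            x = x + (t_next - t) * v                         # Euler step toward data
        return x                                             # motion latents to decode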

Generation Results

Result 1: result1.mp4
Result 2: results2.mp4
Result 3: result3.mp4
Result 4: result4.mp4

Our method runs faster than current diffusion-based methods with fewer sampling steps and lower memory cost. For more details, please refer to the paper.

Updates

Getting Started

Requirements

# 1. Create Conda Environment
conda create -n FLOAT python=3.8.5
conda activate FLOAT

# 2. Install torch and requirements
sh environments.sh

# or manual installation
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
  • Tested on Linux with A100 and V100 GPUs.
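As an optional sanity check after installation (assuming the CUDA 11.8 wheels above), you can verify the library versions and GPU visibility:

    # Optional post-install check; versions should match the pinned wheels above.
    import torch, torchvision, torchaudio

    print("torch:", torch.__version__)            # expected 2.0.1+cu118
    print("torchvision:", torchvision.__version__)
    print("torchaudio:", torchaudio.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))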

Preparing checkpoints

  1. Download checkpoints automatically

    sh download_checkpoints.sh

    or download the checkpoints manually from this Google Drive.

  2. The checkpoints should be organized as follows:

    ./checkpoints
    |-- checkpoints_here
    |-- float.pth                                       # main model
    |-- wav2vec2-base-960h/                             # audio encoder
    |   |-- .gitattributes
    |   |-- config.json
    |   |-- feature_extractor_config.json
    |   |-- model.safetensors
    |   |-- preprocessor_config.json
    |   |-- pytorch_model.bin
    |   |-- README.md
    |   |-- special_tokens_map.json
    |   |-- tf_model.h5
    |   |-- tokenizer_config.json
    |   '-- vocab.json
    '-- wav2vec-english-speech-emotion-recognition/     # emotion encoder
        |-- .gitattributes
        |-- config.json
        |-- preprocessor_config.json
        |-- pytorch_model.bin
        |-- README.md
        '-- training_args.bin
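As a quick, optional check that the layout above is in place before running inference, a small script like the following can be used (paths taken from the tree above; the file list is not exhaustive):

    # Verify that the main checkpoint files listed above exist under ./checkpoints.
    from pathlib import Path

    ckpt_root = Path("./checkpoints")
    expected = [
        "float.pth",
        "wav2vec2-base-960h/config.json",
        "wav2vec2-base-960h/pytorch_model.bin",
        "wav2vec-english-speech-emotion-recognition/config.json",
        "wav2vec-english-speech-emotion-recognition/pytorch_model.bin",
    ]
    missing = [p for p in expected if not (ckpt_root / p).exists()]
    print("All checkpoints found." if not missing else f"Missing: {missing}")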

Generating Talking Portrait Video from a Single Image and Audio

  1. Pre-processing. ❗ Important for better quality ❗ Please read the notes below.
  • FLOAT is trained on frontal head-pose distributions. Non-frontal images may lead to suboptimal results.
  • The performance of talking portrait methods often depends on their training preprocessing strategies, e.g., the field of view. The inference code includes an automatic face-cropping function, which may introduce black padding regions. You can manually disable cropping in generate.py; however, this may lead to suboptimal performance.
  • If your audio contains heavy background music, please use ClearVoice to extract the vocals for better performance (a resampling and batch-generation sketch follows the running example below).
  2. Generating video 1 (Emotion from Audio)

    You can generate a video whose emotion is inferred from the audio by omitting --emo. You can adjust the intensity of the emotion using --e_cfg_scale (default 1). For a more emotionally intense video, try larger values of --e_cfg_scale, e.g., from 5 to 10.

    CUDA_VISIBLE_DEVICES=0 python generate.py \
        --ref_path path/to/reference/image \
        --aud_path path/to/audio \
        --seed 15 \
        --a_cfg_scale 2 \
        --e_cfg_scale 1 \
        --ckpt_path ./checkpoints/float.pth \
        --no_crop                    # [optional] skip cropping
  3. Generating video 2 (Redirecting Emotion). You can generate a video with a different emotion by specifying --emo. It supports seven basic emotions: ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']. You can adjust the intensity of the emotion using --e_cfg_scale (default 1). For a more emotionally intense video, try larger values of --e_cfg_scale, e.g., from 5 to 10.

    CUDA_VISIBLE_DEVICES=0 python generate.py \
        --ref_path path/to/reference/image \
        --aud_path path/to/audio \
        --emo 'happy' \
        --seed 15 \
        --a_cfg_scale 2 \
        --e_cfg_scale 1 \
        --ckpt_path ./checkpoints/float.pth \
        --no_crop                   # [optional] skip cropping

    emotion_redirection_1.mp4

  4. Running example and results

    CUDA_VISIBLE_DEVICES=0 python generate.py \
        --ref_path assets/sam_altman.webp \
        --aud_path assets/aud-sample-vs-1.wav \
        --seed 15 \
        --a_cfg_scale 2 \
        --e_cfg_scale 1 \
        --ckpt_path ./checkpoints/float.pth
    (Images: before crop, after crop, result)
    sam_altman_result.mp4
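If you want to drive several audio clips against one reference image, the hedged sketch below resamples each clip to 16 kHz mono (the sample rate expected by wav2vec2-based encoders; generate.py may already handle resampling internally) and then invokes generate.py with the flags documented above. The my_audio folder and output naming are hypothetical; set CUDA_VISIBLE_DEVICES in your environment as in the examples above.

    # Hedged batch-driver sketch: resample audio, then call generate.py per clip.
    import subprocess
    from pathlib import Path

    import torchaudio

    def to_16k_mono(src: str, dst: str) -> None:
        wav, sr = torchaudio.load(src)
        wav = wav.mean(dim=0, keepdim=True)                   # mixdown to mono
        wav = torchaudio.functional.resample(wav, sr, 16000)  # resample to 16 kHz
        torchaudio.save(dst, wav, 16000)

    ref_image = "assets/sam_altman.webp"
    for aud in sorted(Path("my_audio").glob("*.wav")):        # hypothetical folder
        clean = aud.with_name(aud.stem + "_16k.wav")
        to_16k_mono(str(aud), str(clean))
        subprocess.run([
            "python", "generate.py",
            "--ref_path", ref_image,
            "--aud_path", str(clean),
            "--seed", "15",
            "--a_cfg_scale", "2",
            "--e_cfg_scale", "1",
            "--ckpt_path", "./checkpoints/float.pth",
        ], check=True)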

❗License❗

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. You may not use this work for commercial purposes and may use it only for research purposes. For any commercial inquiries or collaboration opportunities, please contact [email protected].

Development

This repository is a research demonstration implementation and is provided as a one-time code drop. For any research-related inquiries, please contact the first author Taekyung Ki. This work was done during the first author's South Korean Alternative Military Service at DeepBrain AI. This repository includes only the inference code; the training code will not be released.

Citation

@article{ki2024float,
  title={FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait},
  author={Ki, Taekyung and Min, Dongchan and Chae, Gyeongsu},
  journal={arXiv preprint arXiv:2412.01064},
  year={2024}
}

Related Works

Acknowledgements

The source images and audio are collected from the internet and from other baselines, such as SadTalker, EMO, VASA-1, Hallo, LivePortrait, Loopy, and others. We appreciate their valuable contributions to this field. We employ the Wav2Vec2.0-based speech emotion recognizer by Rob Field and appreciate this work.
