Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, Bolei Zhou.
Generating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. Previous studies often synthesize pose movement in a holistic manner, where poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics in speech and the hierarchical structures of human gestures can be naturally described at multiple granularities and associated together. To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation. In HA2G, a Hierarchical Audio Learner extracts audio representations across semantic granularities. A Hierarchical Pose Inferer subsequently renders the entire human pose gradually in a hierarchical manner. To enhance the quality of synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment for better audio representations. Extensive experiments and human evaluation demonstrate that the proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
- [2023/01/31] An evaluation bug in the BC metric has been reported (L424 of scripts/train.py and L539 of scripts/train_expressive.py). In the BC evaluation results reported in the main paper, the mean pose vectors were not added back to recover the correct skeleton. We will update the quantitative results in a revised arXiv version.
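To illustrate the kind of step the note above refers to, here is a minimal sketch that assumes poses are stored as offsets from a single mean pose vector. The function and argument names are hypothetical and are not taken from scripts/train.py or scripts/train_expressive.py; the actual code at the cited lines may differ.

```python
from typing import List, Sequence


def add_mean_pose(
    normalized_frames: Sequence[Sequence[float]],
    mean_pose: Sequence[float],
) -> List[List[float]]:
    """Return a copy of the frames with the mean pose added back to each one.

    Hypothetical helper: it assumes every frame is a flat vector of pose
    values stored relative to `mean_pose`.
    """
    restored = []
    for index, frame in enumerate(normalized_frames):
        if len(frame) != len(mean_pose):
            raise ValueError(
                f"frame {index} has {len(frame)} values, expected {len(mean_pose)}"
            )
        # Element-wise addition recovers the un-normalized pose values.
        restored.append([value + mean for value, mean in zip(frame, mean_pose)])
    return restored


if __name__ == "__main__":
    # Two toy frames with three values each (made-up numbers).
    frames = [[0.0, 0.5, -0.5], [1.0, 0.0, 0.0]]
    mean = [1.0, 2.0, 3.0]
    print(add_mean_pose(frames, mean))
```

Any metric computed on skeleton geometry should be evaluated on the restored frames rather than on the normalized ones.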
This project is developed and tested on Ubuntu 18.04, Python 3.6, PyTorch 1.10.2 and CUDA version 11.3. Since the repository is developed based on Gesture Generation from Trimodal Context of Yoon et al., the environment requirements, installation and dataset preparation process generally follow theirs.
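Before installing, it can help to compare your local versions against the tested ones listed above. The following is a small sketch, not part of the repository; the helper name describe_environment is our own, and it only reports PyTorch details if PyTorch is already installed.

```python
import importlib.util
import platform
from typing import Dict, Optional


def describe_environment() -> Dict[str, Optional[str]]:
    """Collect the Python, PyTorch and CUDA versions visible to this interpreter."""
    info: Dict[str, Optional[str]] = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
    }
    # Import torch only when it is installed, so the check also works
    # in a fresh environment.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = str(torch.__version__)
        # torch.version.cuda is None for CPU-only builds.
        info["cuda"] = torch.version.cuda
    return info


if __name__ == "__main__":
    for name, version in describe_environment().items():
        print(f"{name}: {version}")
```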
- Clone this repository:
  git clone https://github.com/alvinliu0/HA2G.git
- Install required python packages:
  pip install -r requirements.txt
- Install Gentle for audio-transcript alignment. Download the source code from Gentle GitHub and install the library via install.sh. You can then import the gentle library by specifying the path to the library at line 27 of scripts/synthesize.py.
- Download the pretrained fasttext model from here and put crawl-300d-2M-subword.bin and crawl-300d-2M-subword.vec in data/fasttext/.
- Download the pretrained co-speech gesture models, which include the following:
  - TED Expressive Dataset Auto-Encoder, which is used to evaluate the FGD metric;
  - TED Gesture Dataset Pretrained Model, which is the HA2G model trained on the TED Gesture Dataset;
  - TED Expressive Dataset Pretrained Model, which is the HA2G model trained on the TED Expressive Dataset.
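As a quick sanity check for the fasttext step in the list above, the following sketch reports which of the two expected files are not yet present in data/fasttext/. The helper name missing_files is hypothetical; the file names and the directory are the ones given in the instructions.

```python
import os
from typing import Iterable, List

# File names taken from the installation instructions.
FASTTEXT_FILES = ("crawl-300d-2M-subword.bin", "crawl-300d-2M-subword.vec")


def missing_files(directory: str, names: Iterable[str] = FASTTEXT_FILES) -> List[str]:
    """Return the names that do not exist as regular files in `directory`."""
    return [
        name
        for name in names
        if not os.path.isfile(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    missing = missing_files(os.path.join("data", "fasttext"))
    print("missing:", ", ".join(missing) if missing else "none")
```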
Download the preprocessed TED Expressive dataset (16GB) and extract the ZIP file into data/ted_expressive_dataset.
You can find the details of the TED Expressive dataset here. The dataset pre-processing is extended from youtube-gesture-dataset. Our dataset adds 3D upper-body keypoint annotations, including fine-grained fingers.
Our codebase also supports training and inference on the TED Gesture dataset of Yoon et al. Download the preprocessed TED Gesture dataset (16GB) and extract the ZIP file into data/ted_gesture_dataset. Please refer to here for the details of the TED Gesture dataset.
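If you prefer to extract the archives from Python, a minimal sketch using the standard zipfile module could look like the following. The helper name extract_dataset and the archive file name in the usage comment are hypothetical, and whether the downloaded ZIP already contains a top-level folder is not stated here, so check the extracted layout afterwards.

```python
import os
import zipfile
from typing import List


def extract_dataset(zip_path: str, target_dir: str) -> List[str]:
    """Extract `zip_path` into `target_dir` and return the archive member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Example call (the archive name is a placeholder, use your downloaded file):
# extract_dataset("ted_expressive_dataset.zip", "data/ted_expressive_dataset")
```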
We also provide the pretrained models and training logs for better reproducibility and further research in this community. Note that since this work was done during an internship at SenseTime Research, only the original training logs are provided, while the original pretrained models are unavailable. Instead, we provide newly trained models along with the corresponding training logs. The new models outperform the evaluation results reported in the paper.
Pretrained models contain:
- TED Gesture Dataset Pretrained Model, which is the HA2G model trained on the TED Gesture Dataset;
- TED Expressive Dataset Pretrained Model, which is the HA2G model trained on the TED Expressive Dataset.
Training logs contain:
- ted_gesture_original.log, which is the original HA2G training log on TED Gesture dataset;
- ted_gesture_new.log, which is the newly trained HA2G log on TED Gesture dataset;
- ted_expressive_original.log, which is the original HA2G training log on TED Expressive dataset;
- ted_expressive_new.log, which is the newly trained HA2G log on TED Expressive dataset.
Generate gestures from a clip in the TED Gesture test set using baseline models:
python scripts/synthesize.py from_db_clip [trained model path] [number of samples to generate]
For example:
python scripts/synthesize.py from_db_clip output/train_multimodal_context/multimodal_context_checkpoint_best.bin 10
Generate gestures from a clip in the TED Gesture test set using HA2G models:
python scripts/synthesize_hierarchy.py from_db_clip [trained model path] [number of samples to generate]
For example:
python scripts/synthesize_hierarchy.py from_db_clip TED-Gesture-output/train_hierarchy/ted_gesture_hierarchy_checkpoint_best.bin 10
Generate gestures from a clip in the TED Expressive test set using HA2G models:
python scripts/synthesize_expressive_hierarchy.py from_db_clip [trained model path] [number of samples to generate]
For example:
python scripts/synthesize_expressive_hierarchy.py from_db_clip TED-Expressive-output/train_hierarchy/ted_expressive_hierarchy_checkpoint_best.bin 10
The first run takes several minutes to cache the dataset. After that, it runs quickly.
You can find synthesized results in output/generation_results. There are MP4, WAV, and PKL files for visualized output, audio, and pickled raw results, respectively. Speaker IDs are randomly selected for each generation. The following shows sample MP4 files.
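To browse the generated files programmatically, a small sketch such as the following groups them by type and loads one pickled result. Both helper names are our own, and the internal structure of the PKL files is not documented here, so the loader simply returns whatever object was pickled.

```python
import os
import pickle
from collections import defaultdict
from typing import Any, Dict, List


def group_results(result_dir: str) -> Dict[str, List[str]]:
    """Group MP4, WAV and PKL file names in `result_dir` by extension."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(os.listdir(result_dir)):
        extension = os.path.splitext(name)[1].lower().lstrip(".")
        if extension in ("mp4", "wav", "pkl"):
            groups[extension].append(name)
    return dict(groups)


def load_raw_result(path: str) -> Any:
    """Load one pickled raw result.

    Only unpickle files that you generated yourself, because unpickling
    untrusted data can execute arbitrary code.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    result_dir = os.path.join("output", "generation_results")
    if os.path.isdir(result_dir):
        for extension, names in group_results(result_dir).items():
            print(extension, len(names))
```

Printing type(load_raw_result(path)) on one file is a reasonable first step for discovering what the raw results contain.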
Train the proposed HA2G model on TED Gesture Dataset:
python scripts/train.py --config=config/hierarchy.yml
And the baseline models on TED Gesture Dataset:
python scripts/train.py --config=config/seq2seq.yml
python scripts/train.py --config=config/speech2gesture.yml
python scripts/train.py --config=config/joint_embed.yml
python scripts/train.py --config=config/multimodal_context.yml
For the TED Expressive Dataset, you can train the HA2G model by:
python scripts/train_expressive.py --config=config_expressive/hierarchy.yml
And the baseline models on TED Expressive Dataset:
python scripts/train.py --config=config_expressive/seq2seq.yml
python scripts/train.py --config=config_expressive/speech2gesture.yml
python scripts/train.py --config=config_expressive/joint_embed.yml
python scripts/train.py --config=config_expressive/multimodal_context.yml
Caching the TED training set (lmdb_train) takes tens of minutes on your first run. Model checkpoints and sample results will be saved in subdirectories of the ./TED-Gesture-output and ./TED-Expressive-output folders.
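If you want to queue several of the training runs above one after another, a sketch along these lines builds the same command lines from a list of config names. The helper names are hypothetical and not part of the repository; the script paths, config directories and config names are the ones shown in the commands above.

```python
import subprocess
import sys
from typing import List, Sequence

# Config names of the baseline models listed above.
BASELINES = ("seq2seq", "speech2gesture", "joint_embed", "multimodal_context")


def build_train_commands(
    script: str, config_dir: str, names: Sequence[str] = BASELINES
) -> List[List[str]]:
    """Build one argument list per config, using the current Python interpreter."""
    return [
        [sys.executable, script, f"--config={config_dir}/{name}.yml"]
        for name in names
    ]


def run_all(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print every command and, unless dry_run is set, run them sequentially."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the queue as soon as one run fails.
            subprocess.run(list(command), check=True)


if __name__ == "__main__":
    # Dry run only: prints the commands without starting any training.
    run_all(build_train_commands("scripts/train.py", "config"))
```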
Note on reproducibility: unfortunately, we did not fix a random seed, so you will not be able to reproduce exactly the FGD reported in the paper. However, several runs with different random seeds mostly fell within a similar FGD range.
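If you want your own runs to be repeatable with each other, one common approach is to seed every random number generator before training. The sketch below is not part of this repository and will not reproduce the numbers in the paper; the function name seed_everything is our own, and some GPU operations can remain nondeterministic even with fixed seeds.

```python
import importlib.util
import random


def seed_everything(seed: int) -> None:
    """Seed Python's RNG and, when installed, the NumPy and PyTorch RNGs."""
    random.seed(seed)
    if importlib.util.find_spec("numpy") is not None:
        import numpy as np

        np.random.seed(seed)
    if importlib.util.find_spec("torch") is not None:
        import torch

        torch.manual_seed(seed)
        # Silently ignored when CUDA is not available.
        torch.cuda.manual_seed_all(seed)


if __name__ == "__main__":
    seed_everything(1234)
    first = random.random()
    seed_everything(1234)
    # The same seed yields the same first draw.
    print(first == random.random())
```

Calling such a function once at the start of a training script, before any data loader or model is created, is the usual placement.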
You can train the autoencoder used for FGD yourself. However, please note that FGD values will change if you train the autoencoder anew. We recommend sticking to the checkpoint that we shared.
- For the TED Gesture Dataset, we use the pretrained Auto-Encoder model provided by Yoon et al. for better reproducibility (the ckpt in the train_h36m_gesture_autoencoder folder).
- For the TED Expressive Dataset, the pretrained Auto-Encoder model is provided here. If you want to train the autoencoder anew, you can run the following training script:
  python scripts/train_feature_extractor_expressive.py --config=config_expressive/gesture_autoencoder.yml
  The model checkpoints will be saved in ./TED-Expressive-output/AE-cos1e-3.
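For context on why the autoencoder matters: FGD (Fréchet Gesture Distance) compares feature statistics of real and generated gestures in the autoencoder's latent space, so a different autoencoder yields a different feature space and different values. Assuming the standard Fréchet distance between two Gaussians, and writing (μ_r, Σ_r) and (μ_g, Σ_g) for the mean and covariance of the latent features of real and generated gestures (the symbols are our own notation), it can be sketched as:

```latex
\mathrm{FGD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated gestures are statistically closer to the real ones in that feature space.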
This project is released under the GPL-3.0 license; please see details here.
If you find our work useful, please kindly cite as:
@inproceedings{liu2022learning,
title={Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation},
author={Liu, Xian and Wu, Qianyi and Zhou, Hang and Xu, Yinghao and Qian, Rui and Lin, Xinyi and Zhou, Xiaowei and Wu, Wayne and Dai, Bo and Zhou, Bolei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10462--10472},
year={2022}
}
If you are interested in Audio-Driven Co-Speech Gesture Generation, we also recommend checking out our other related works:
- Audio-Driven Co-Speech Gesture Video Generation, ANGIE.
- Taming Diffusion Model for Co-Speech Gesture, DiffGesture.
- The codebase is developed based on Gesture Generation from Trimodal Context of Yoon et al.