This paper describes the ReprGesture entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) challenge 2022. The GENEA challenge provides the processed datasets and performs crowdsourced evaluations to compare the performance of different gesture generation systems. In this paper, we explore an automatic gesture generation system based on multimodal representation learning. We use WavLM features for audio, FastText features for text, and position and rotation matrix features for gesture. Each modality is projected into two distinct subspaces: modality-invariant and modality-specific. To learn the commonalities shared across modalities and to capture the characteristics of the modality-specific representations, a gradient reversal layer (GRL) based adversarial classifier and modality reconstruction decoders are used during training. The gesture decoder then generates proper gestures from all of these representations together with rhythm-related features from the audio.
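For reference, below is a minimal, hypothetical PyTorch sketch of the representation-learning idea described above. It is not the exact ReprGesture implementation; the class names (GradReverse, ReprSketch), layer choices and dimensions are illustrative. Each modality's features, assumed to be already projected to a common dimension, are mapped to a shared modality-invariant subspace and a private modality-specific subspace; a modality classifier behind a gradient reversal layer and per-modality reconstruction decoders supply the adversarial (domain) and reconstruction training signals.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies the gradient by -lambda in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.lambd, None

class ReprSketch(nn.Module):
    # Hypothetical sketch: one shared projector (modality-invariant subspace), one private
    # projector per modality (modality-specific subspace), a modality classifier behind the
    # gradient reversal layer, and per-modality reconstruction decoders.
    def __init__(self, feat_dim=256, n_modalities=3):
        super().__init__()
        self.shared = nn.Linear(feat_dim, feat_dim)             # invariant projection, weights shared by all modalities
        self.private = nn.ModuleList(nn.Linear(feat_dim, feat_dim) for _ in range(n_modalities))  # specific projections
        self.modality_clf = nn.Linear(feat_dim, n_modalities)   # adversarial modality (domain) classifier
        self.recon = nn.ModuleList(nn.Linear(2 * feat_dim, feat_dim) for _ in range(n_modalities))  # reconstruction decoders

    def forward(self, feats, lambd=1.0):
        # feats: list of per-modality features, each of shape (batch, feat_dim)
        invariant = [self.shared(f) for f in feats]
        specific = [p(f) for p, f in zip(self.private, feats)]
        # The classifier sees gradient-reversed invariant features, which pushes the shared
        # projector toward representations the classifier cannot tell apart (modality-invariant).
        domain_logits = [self.modality_clf(GradReverse.apply(h, lambd)) for h in invariant]
        reconstructions = [d(torch.cat([h, s], dim=-1)) for d, h, s in zip(self.recon, invariant, specific)]
        return invariant, specific, domain_logits, reconstructions
The gesture decoder (not shown) would then consume these representations together with the rhythm-related audio features, as described in the abstract above.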
Ground Truth (GT) / ReprGesture / without WavLM / without GAN loss / without domain loss / without Repr
Ablation_study.4.mp4
Ground Truth (GT) / ReprGesture / with diff loss / with text emotion / with diff loss and text emotion
Additional_experiments.8.mp4
However, these experiments did not yield good results, so they were not included in the final submitted system or in the paper.
More time to talk than to listen.
Ground Truth (GT) / ReprGesture / without WavLM / without GAN loss / without reconstruction loss / without domain loss / without Repr
Ablation_study.new2.mp4
The distribution of speaker IDs:
We noted that the data in the training, validation and test sets were extremely imbalanced across speakers, so we only used the data from the speaker with identity "1" for training.
Due to the poor quality of the hand motion capture, we only used 18 joints corresponding to the upper body, without hands or fingers.
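As an illustration, here is a hypothetical sketch of the speaker-filtering step. The speaker_of mapping (recording name to speaker ID) is an assumption about how the challenge metadata might be read, and this is not the exact ReprGesture preprocessing; the reduction to 18 upper-body joints is part of the gesture preprocessing and is not shown.
import shutil
from pathlib import Path

def keep_speaker_files(speaker_of, src_root, dst_root, speaker_id="1"):
    # speaker_of: hypothetical dict mapping a recording name (file stem) to its speaker ID,
    # e.g. built from the challenge metadata. Copies only that speaker's wav/tsv/bvh files
    # into a new dataset folder such as v1_18_1.
    for sub in ("wav", "tsv", "bvh"):
        out_dir = Path(dst_root) / sub
        out_dir.mkdir(parents=True, exist_ok=True)
        for f in sorted((Path(src_root) / sub).glob("*")):
            if speaker_of.get(f.stem) == speaker_id:
                shutil.copy(f, out_dir / f.name)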
Our environment is similar to that of Trimodal (Gesture Generation from Trimodal Context; see the references below).
Download the pre-trained WavLM Large model from here (a sketch of loading this checkpoint is shown after the inference steps below).
Download the pre-trained model from here.
Then cd Tri/scripts, modify the save path in synthesize.py, and run:
python synthesize.py --ckpt_path <"..your path/multimodal_context_checkpoint_080.bin"> --transcript_path "<..your path/GENEA/genea_challenge_2022/dataset/v1_18/val/tsv/val_2022_v1_000.tsv>" --wav_path "<..your path/GENEA/genea_challenge_2022/dataset/v1_18/val/wav/val_2022_v1_000.wav>"
You will get the converted gesture .bvh file.
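The WavLM Large checkpoint downloaded above is used for audio feature extraction; its path is configured in multimodal_context_net.py (see the training steps below). For reference, a minimal sketch of loading the checkpoint and extracting features, following the usage shown in the official WavLM repository (the checkpoint path is a placeholder):
import torch
from WavLM import WavLM, WavLMConfig  # from the official WavLM repository

# Load the downloaded checkpoint (path is a placeholder).
checkpoint = torch.load('...your path/wavlm_cache/WavLM-Large.pt')
cfg = WavLMConfig(checkpoint['cfg'])
model = WavLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

# Extract frame-level representations from a 16 kHz waveform (dummy input here).
wav_input_16khz = torch.randn(1, 16000)
if cfg.normalize:
    wav_input_16khz = torch.nn.functional.layer_norm(wav_input_16khz, wav_input_16khz.shape)
with torch.no_grad():
    rep = model.extract_features(wav_input_16khz)[0]  # shape (batch, frames, feature_dim)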
Our data from the GENEA Challenge 2022 contains folders of wav, tsv and bvh files; the original data comes from Talking With Hands 16.2M. You can refer to the challenge paper and download the data from here.
Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, Carla Viegas, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Henter. 2022. The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). ACM.
Then cd My/scripts, modify the path in twh_dataset_to_lmdb.py, and run:
python twh_dataset_to_lmdb.py <..your path/GENEA/genea_challenge_2022/dataset/v1_18_1/>
Then cd Tri/scripts, modify the path in train.py, modify <path = "...your path/wavlm_cache/WavLM-Large.pt"> in multimodal_context_net.py, and run:
python train.py --config=<..your path/Tri/config/multimodal_context.yml>
For inference, run
python synthesize.py --ckpt_path <"..your path/your saved model.bin"> --transcript_path "...your path/dataset/v1_18/val/tsv/val_2022_v1_000.tsv" --wav_path "...your path/dataset/v1_18/val/wav/val_2022_v1_000.wav"
Our code is adapted from here.
You may refer to ./visualizations/genea_numerical_evaluations_1.
An autoencoder model we trained on the GENEA data can be downloaded from here and used to calculate the FGD (Fréchet Gesture Distance).
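For reference, a minimal sketch of the Fréchet distance computation itself, assuming latent features for ground-truth and generated gestures have already been extracted with this autoencoder (the feature extraction is not shown):
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats, gen_feats):
    # real_feats, gen_feats: (N, D) arrays of autoencoder latent features.
    # FGD is the Frechet distance between Gaussians fitted to the two feature sets:
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma1 = np.cov(real_feats, rowvar=False)
    sigma2 = np.cov(gen_feats, rowvar=False)
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard small imaginary parts caused by numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))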
For visualization of output, you can use ./visualizations/simple_skeleton_visualization/:
Sheet1_generated_with_audio.mp4
Or use ./visualizations/genea_visualizer/celery-queue/blender_render.py, which is based on here:
Sheet1_generated_0000-0220_with_audio.mp4
"... your path\Blender Foundation\Blender 2.93\blender.exe" -b --python blender_render.py -- -i "... your path\multimodal_context_40_generated.bvh" -a "... your path\audio.wav" -v -o "... your path\video" -m "upper_body" --duration 40 -r cw
This work is supported by the Shenzhen Science and Technology Innovation Committee (WDZC20200818121348001), the National Natural Science Foundation of China (62076144) and the Shenzhen Key Laboratory of Next Generation Interactive Media Innovative Technology (ZDSYS20210623092001004).
Our work is mainly inspired by:
(1) Gesture Generation from Trimodal Context
Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2020. Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Trans. Graph. 39, 6, Article 222 (December 2020), 16 pages. https://doi.org/10.1145/3414685.3417838
(2) MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20). Association for Computing Machinery, New York, NY, USA, 1122–1131. https://doi.org/10.1145/3394171.3413678
If you find our work useful in your research, please consider citing:
@inproceedings{yang2022genea,
author={Sicheng Yang and Zhiyong Wu and Minglei Li and Mengchen Zhao and Jiuxin Lin and Liyang Chen and Weihong Bao},
title={The ReprGesture entry to the GENEA Challenge 2022},
booktitle = {Proceedings of the ACM International Conference on Multimodal Interaction},
publisher = {ACM},
series = {ICMI '22},
year={2022}
}