This repository contains the PyTorch implementation of the "T3M: Text Guided 3D Human Motion Synthesis from Speech" project. The goal of this project is to synthesize realistic 3D human motion based on both speech and text inputs.
To get started with this project, you will need to set up a Python environment using Miniconda3. Follow the steps below to create the required environment:
- Python 3.10
- Miniconda3
- Install Miniconda3 if you haven't done so already.
- Create a new conda environment named `t3m` with Python 3.10 and install the dependencies:

  ```bash
  conda create -n t3m python=3.10
  conda activate t3m
  pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
  pip install -r requirements.txt
  ```
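A quick, optional way to confirm that the CUDA-enabled PyTorch build installed correctly:

```python
# Optional sanity check: verify the CUDA build of PyTorch is active.
import torch

print(torch.__version__)           # expected to end with +cu118
print(torch.cuda.is_available())   # should print True on a CUDA-capable machine
```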
To use this project, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/Gloria2tt/T3M.git
  cd T3M
  ```
- Download the dataset and pre-trained weights. Compared to the original paper, we provide an enhanced approach that uses a more advanced video-text alignment model, InternVid, to extract video embeddings from the SHOW dataset:
  - Download the SHOW dataset: download the TalkSHOW dataset from this link and unzip the folder.
  - In addition to audio and pose data, the original videos are also required for training. Download them following the instructions in the SHOW repository.
  - Extract the audio-aligned segments from the videos based on the file names, and use the video encoder from InternVid to extract the video embeddings. We recommend performing this step on an A100 or H100 GPU (a minimal sketch of this step is shown after this list).
  - Following the instructions in the TalkSHOW repository, download the pre-trained face model and VQ-VAE model from this link, as our paper modifies only the body and hand generation parts.
  - Note: the SHOW repository no longer contains the original videos, so we have established a new repository to facilitate downloading the preprocessed dataset, which you can find here.
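The exact preprocessing script is not included here; the following is only a minimal sketch of the segment-extraction step. It assumes a hypothetical directory layout and file-naming scheme (`data/raw_videos`, `speaker1_video001.mp4`, and so on), cuts clips with `ffmpeg`, and leaves a placeholder (`extract_embedding`) where the InternVid video encoder should be called following that repository's instructions:

```python
# Sketch only: cut audio-aligned clips with ffmpeg, then embed them with InternVid.
# The paths, file names, and the extract_embedding placeholder below are
# illustrative assumptions, not the exact T3M preprocessing pipeline.
import subprocess
from pathlib import Path

import numpy as np

RAW_VIDEO_DIR = Path("data/raw_videos")      # original SHOW videos
SEGMENT_DIR = Path("data/video_segments")    # audio-aligned clips
EMBED_DIR = Path("data/video_embeddings")    # output .npy embeddings


def cut_segment(video_path: Path, start: float, end: float, out_path: Path) -> None:
    """Cut the [start, end] interval (in seconds) out of video_path with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video_path),
         "-ss", str(start), "-to", str(end),
         "-c", "copy", str(out_path)],
        check=True,
    )


def extract_embedding(encoder, clip_path: Path) -> np.ndarray:
    """Placeholder: run the InternVid video encoder on one clip.

    Follow the InternVid repository for the actual model loading and
    preprocessing code; this function only marks where that call goes.
    """
    raise NotImplementedError("plug in the InternVid video encoder here")


if __name__ == "__main__":
    SEGMENT_DIR.mkdir(parents=True, exist_ok=True)
    EMBED_DIR.mkdir(parents=True, exist_ok=True)

    # Example: one hypothetical segment whose boundaries come from its file name.
    video = RAW_VIDEO_DIR / "speaker1_video001.mp4"
    clip = SEGMENT_DIR / "speaker1_video001_seg000.mp4"
    cut_segment(video, start=0.0, end=4.0, out_path=clip)

    encoder = None  # load the InternVid encoder as described in its repository
    # np.save(EMBED_DIR / (clip.stem + ".npy"), extract_embedding(encoder, clip))
```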
To train the model, you need to modify the `body_pixel.json` configuration file to match your environment, as illustrated below:
- If this is your first time running the code, set the `dataset_load_mode` option from `pickle` to `json`.
- Adjust the `vq_path` option to match the location of your folder.
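For illustration, the two settings mentioned above might look like the excerpt below after editing; the real `body_pixel.json` contains many more options, and the `vq_path` value is just a placeholder for your own checkpoint location:

```json
{
  "dataset_load_mode": "json",
  "vq_path": "/path/to/your/pretrained_vqvae_folder"
}
```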
Finally, use the following command to start training:

```bash
sh train_body_pixel.sh
```
To visualize the results after training, ensure you have ffmpeg installed:

```bash
sudo apt-get install ffmpeg
```
Run the visualization script:

```bash
bash visualise.sh
```
Alternatively, you can visualize a specific audio file:

```bash
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file your/voice/file
```

Make sure you have changed the model path correctly before running inference.
If you find our work interesting, please consider citing:
```bibtex
@inproceedings{peng2024t3m,
  title     = {T3M: Text Guided 3D Human Motion Synthesis from Speech},
  author    = {Peng, Wenshuo and Zhang, Kaipeng and Zhang, Sai Qian},
  booktitle = {Findings of the Association for Computational Linguistics: NAACL 2024},
  pages     = {1168--1177},
  year      = {2024}
}
```
Our code is built upon TalkSHOW and SHOW. We specifically thank Hongwei Yi for sharing their codebase.
If you have any questions, feel free to email me directly at [email protected].