GPT-Talker

Introduction

This is an implementation of the following paper. 《Generative Expressive Conversational Speech Synthesis》 (Accepted by MM'2024)

Rui Liu *, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li.

Demo Page

Speech Demo

Dependencies

For details about the operating environment dependency. Please refer to GPT-SoVITS'requirements.txt
Please conda install ffmpeg
Tested environment: Ubuntu=22.04.2, python=3.9.18, torch=2.0.1+cu118

NCSSD

The large-scale conversational speech synthesis dataset we constructed, including those collected over the Internet as well as those recorded by sound recorders, consists of approximately 236 hours and over 776 speakers.

Please refer to NCSSD'repo

Prepare Datasets

Execute the five steps in the ./prepare_datastes directory to build the training data for GPT-Talker.

Train

Conversational VITS

python train_s2.py

The corresponding configuration file is in ./configs/s2.json
Conversational GPT

python train_s1.py

The corresponding configuration file is in ./configs/s1longer.yaml

Fine-tuning

Fine-tunable base models in the ./pretrained_models, from GPT-SoVITS (Single Speech).

Citations

@inproceedings{10.1145/3664647.3681697,
  author = {Liu, Rui and Hu, Yifan and Ren, Yi and Yin, Xiang and Li, Haizhou},
  title = {Generative Expressive Conversational Speech Synthesis},
  year = {2024},
  isbn = {9798400706868},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3664647.3681697},
  doi = {10.1145/3664647.3681697},
  abstract = {Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting. Existing CSS methods employ effective multi-modal context modeling techniques to achieve empathy understanding and expression. However, they often need to design complex network architectures and meticulously optimize the modules within them. In addition, due to the limitations of small-scale datasets containing scripted recording styles, they often fail to simulate real natural conversational styles. To address the above issues, we propose a novel generative expressive CSS system, termed GPT-Talker.We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context. Leveraging the power of GPT, we predict the token sequence, that includes both semantic and style knowledge, of response for the agent. After that, the expressive conversational speech is synthesized by the conversation-enriched VITS to deliver feedback to the user.Furthermore, we propose a large-scale Natural CSS Dataset called NCSSD, that includes both naturally recorded conversational speech in improvised styles and dialogues extracted from TV shows. It encompasses both Chinese and English languages, with a total duration of 236 hours. We conducted comprehensive experiments on the reliability of the NCSSD and the effectiveness of our GPT-Talker. Both subjective and objective evaluations demonstrate that our model outperforms other state-of-the-art CSS systems significantly in terms of naturalness and expressiveness. The Code, Dataset, and Pre-trained Model are available at: https://github.com/AI-S2-Lab/GPT-Talker.},
  booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
  pages = {4187–4196},
  numpages = {10},
  keywords = {conversational speech synthesis (css), expressiveness, gpt, user-agent conversation},
  location = {Melbourne VIC, Australia},
  series = {MM '24}
}

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
AR		AR
__pycache__		__pycache__
configs		configs
datasets		datasets
feature_extractor		feature_extractor
i18n		i18n
module		module
prepare_datasets		prepare_datasets
pretrained_models		pretrained_models
text		text
tools		tools
README.md		README.md
inference.py		inference.py
process_ckpt.py		process_ckpt.py
s1_train.py		s1_train.py
s2_train.py		s2_train.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GPT-Talker

Introduction

Demo Page

Dependencies

NCSSD

Prepare Datasets

Train

Fine-tuning

Citations

About

Releases

Packages

Languages

walker-hyf/GPT-Talker

Folders and files

Latest commit

History

Repository files navigation

GPT-Talker

Introduction

Demo Page

Dependencies

NCSSD

Prepare Datasets

Train

Fine-tuning

Citations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages