This is an implementation of the following paper.
FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis (accepted at ISCSLP 2024).
Yifan Hu, Rui Liu*, Guanglai Gao, Haizhou Li.
You can download the dataset from DailyTalk.
This project uses conda to manage all the dependencies; you should install Anaconda first if you have not done so.
```
# Clone the repo
git clone https://github.com/walker-hyf/FCTalker.git
cd $PROJECT_ROOT_DIR
```
Install dependencies:
```
conda env create -f ./environment.yaml
```
Activate the installed environment:
```
conda activate FCTalker
```
Run
```
python3 prepare_align.py --dataset DailyTalk
```
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided here.
You have to unzip the files in `preprocessed_data/DailyTalk/TextGrid/`. Alternatively, you can run the aligner yourself. Please note that our pretrained models are not trained with supervised duration modeling (they are trained with `learn_alignment: True`).
After that, run the preprocessing script:
```
python3 preprocess.py --dataset DailyTalk
```
Train your model with
```
python3 train.py --dataset DailyTalk
```
Useful options:
- Currently only single GPU training is supported.
Only batch inference is supported, since generating a turn may require the contextual history of the conversation. Try
```
python3 synthesize.py --source preprocessed_data/DailyTalk/val_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk
```
to synthesize all utterances in `preprocessed_data/DailyTalk/val_*.txt`.
The Fine-Grained Encoder in this source code directly uses the pre-trained TOD-BERT model. You can easily load the pretrained model with the Hugging Face Transformers library via the `AutoModel` function.
```python
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained TOD-BERT tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained("TODBERT/TOD-BERT-JNT-V1")
tod_bert = AutoModel.from_pretrained("TODBERT/TOD-BERT-JNT-V1")
```
The source code for the Fine-Grained Encoder can be viewed in `modules.py`.
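TOD-BERT expects each dialogue turn to be prefixed with a speaker token, `[SYS]` for the system and `[USR]` for the user (per the TOD-BERT repo). The helper below is a hypothetical sketch of building such an input string from `(speaker, text)` pairs before tokenization; it is not taken from this repo.

```python
def format_dialog(turns):
    """Join (speaker, text) pairs into a TOD-BERT-style input string."""
    parts = ["[CLS]"]
    for speaker, text in turns:
        # "[SYS]" marks system turns, "[USR]" marks user turns (TOD-BERT convention);
        # the (speaker, text) tuple format is an assumption of this sketch.
        token = "[SYS]" if speaker == "system" else "[USR]"
        parts.append(f"{token} {text}")
    return " ".join(parts)
```

For example, `format_dialog([("system", "Hello."), ("user", "Hi.")])` yields `"[CLS] [SYS] Hello. [USR] Hi."`, which can then be passed to the tokenizer loaded above.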
```
@misc{https://doi.org/10.48550/arxiv.2210.15360,
  doi = {10.48550/ARXIV.2210.15360},
  url = {https://arxiv.org/abs/2210.15360},
  author = {Hu, Yifan and Liu, Rui and Gao, Guanglai and Li, Haizhou},
  keywords = {Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
  title = {FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
E-mail:[email protected]