This repository contains the official code for Scaling Rich Style-Prompted Text-to-Speech Datasets. We release ParaSpeechCaps (Paralinguistic Speech Captions), a large-scale dataset that annotates speech utterances with rich style captions, at ajd12342/paraspeechcaps. We also release Parler-TTS models finetuned on our dataset at ajd12342/parler-tts-mini-v1-paraspeechcaps and ajd12342/parler-tts-mini-v1-paraspeechcaps-only-base.
Try out our models in our interactive demo, listen to examples at our demo website, and read our paper.
LICENSE: This code repository is licensed under the MIT License - see the LICENSE file for details. The dataset and models are licensed under the CC-BY-NC-SA 4.0 license.
ParaSpeechCaps is a large-scale dataset that annotates speech utterances with rich style captions. It supports 59 style tags covering pitch, rhythm, emotion, and more, spanning speaker-level intrinsic style tags (properties of a speaker's voice, such as its characteristic pitch) and utterance-level situational style tags (properties of a particular utterance, such as its emotion). It consists of a human-annotated subset, ParaSpeechCaps-Base, and a large automatically-annotated subset, ParaSpeechCaps-Scaled. Our novel pipeline, which combines off-the-shelf text and speech embedders, classifiers, and an audio language model, is the first to automatically scale rich style tag annotation across such a wide variety of tags.
We finetune Parler-TTS on ParaSpeechCaps to create TTS models that generate speech while controlling for rich styles (pitch, rhythm, clarity, emotion, etc.) via a textual style prompt (e.g., 'A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.').
The ParaSpeechCaps dataset is available on the Hugging Face Hub at ajd12342/paraspeechcaps. Please refer to the dataset folder for more details on how to use it.
This repository has been tested with Conda and Python 3.11. Other Python versions and package managers (venv, uv, etc.) should likely work as well.
conda create -n paraspeechcaps python=3.11
conda activate paraspeechcaps
pip install datasets
from datasets import load_dataset
# Load the entire dataset
dataset = load_dataset("ajd12342/paraspeechcaps")
# Load specific splits of the dataset
train_scaled = load_dataset("ajd12342/paraspeechcaps", split="train_scaled")
train_base = load_dataset("ajd12342/paraspeechcaps", split="train_base")
dev = load_dataset("ajd12342/paraspeechcaps", split="dev")
holdout = load_dataset("ajd12342/paraspeechcaps", split="holdout")
# View a single example
example = train_base[0]
print(example)
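Each example pairs a style caption with the utterance's metadata. As a quick way to explore the data, you can inspect the column names and filter on the caption text. The snippet below is a minimal sketch that assumes a text_description column holding the style captions; check train_base.column_names for the actual schema of the release you downloaded.

# Inspect the schema first; the column name used below is an assumption
print(train_base.column_names)

# Keep only examples whose style caption mentions whispering
# (assumes a "text_description" column containing the caption text)
whispered = train_base.filter(lambda ex: "whisper" in str(ex["text_description"]).lower())
print(len(whispered))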
- Release code for our human annotation pipeline
- Release code for our automatic annotation pipeline
The ParaSpeechCaps models are available on the Hugging Face Hub at ajd12342/parler-tts-mini-v1-paraspeechcaps (trained on the full dataset) and ajd12342/parler-tts-mini-v1-paraspeechcaps-only-base (trained on the human-annotated subset only). Please refer to the model folder for more details.
This repository has been tested with Conda and Python 3.11. Other Python versions and package managers (venv, uv, etc.) should likely work as well.
conda create -n paraspeechcaps python=3.11
conda activate paraspeechcaps
git clone https://github.com/ajd12342/paraspeechcaps.git
cd paraspeechcaps/model/parler-tts
pip install -e .[train]
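Note: depending on your shell, the extras specifier may need quoting; in zsh, for example, use pip install -e ".[train]" so the square brackets are not expanded as a glob pattern.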
NOTE: We recommend you follow the installation instructions above because our fork of Parler-TTS adds support for inference-time classifier-free guidance (which consistently improves performance) and new training scripts. However, if you only wish to perform model inference and don't want to use classifier-free guidance, our models are fully compatible with the original Parler-TTS repository as well.
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
guidance_scale = 1.5  # classifier-free guidance scale (supported by our Parler-TTS fork)

# Load the model and the tokenizers for the style description and the transcription;
# the transcription tokenizer left-pads so prompts align correctly during generation
model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
description_tokenizer = AutoTokenizer.from_pretrained(model_name)
transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# Style description and transcription to synthesize (normalized to single-line strings)
input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

# Tokenize both inputs and move them to the model's device
input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

# Generate the waveform and save it as a WAV file at the model's sampling rate
generation = model.generate(input_ids=input_description_tokenized.input_ids, prompt_input_ids=input_transcription_tokenized.input_ids, guidance_scale=guidance_scale)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)
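Because generation is stateless, the loaded model can be reused across several style prompts. Continuing from the snippet above, this is a minimal sketch that renders the same transcription under different descriptions; the example descriptions are illustrative and not drawn from the dataset.

# Render one transcription under several illustrative style prompts
descriptions = [
    "A female speaker talks quickly with a happy, animated tone.",
    "In a noisy environment, a male voice speaks slowly in a whispered tone.",
]
prompt_ids = transcription_tokenizer("Was that your landlord?", return_tensors="pt").to(model.device)
for i, description in enumerate(descriptions):
    desc_ids = description_tokenizer(description, return_tensors="pt").to(model.device)
    generation = model.generate(input_ids=desc_ids.input_ids, prompt_input_ids=prompt_ids.input_ids, guidance_scale=guidance_scale)
    sf.write(f"output_{i}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)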
Please refer to the model folder for more inference scripts (including a CLI version, a notebook version, and a gradio demo version).
- Training and evaluation code
- Annotation UIs for evaluation metrics
If you use this repository, the dataset, or the models, please cite our work as follows:
@misc{diwan2025scalingrichstylepromptedtexttospeech,
  title={Scaling Rich Style-Prompted Text-to-Speech Datasets},
  author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
  year={2025},
  eprint={2503.04713},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2503.04713},
}
We thank the authors of Parler-TTS for their excellent work on the Parler-TTS model.