[ICLR'25] SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
This repository is the official implementation of "SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation"
- Paper (ICLR'25 version): OpenReview
- Paper (NeurIPS'24 Audio Imagination Workshop version): arXiv
- Demo page of SoundCTM UNet 16kHz (NeurIPS'24 Audio Imagination Workshop): Audio Samples
- Checkpoints of SoundCTM UNet 16kHz (NeurIPS'24 Audio Imagination Workshop): Hugging Face
- GitHub repository of SoundCTM-DiT (ICLR'25)
- Checkpoints of SoundCTM-DiT (ICLR'25)
Contact:
- Koichi SAITO: [email protected]
- [2025/03/30] SoundCTM-DiT is uploaded to SoundCTM-DiT (ICLR'25).
- [2024/12/04] We're planning to open-source the codebase and checkpoints of the DiT backbone with the full-band text-to-sound setting, as well as the downstream tasks.
- [2025/02/10] Our paper, updated on OpenReview from the previous version, has been accepted at ICLR'25!!
- Download the teacher model's checkpoint and the AudioLDM-s-full checkpoint (for the VAE + vocoder part) and put them in soundctm/ckpt.
- SoundCTM checkpoint on AudioCaps (ema=0.999, 30K training iterations)

For inference, both the AudioLDM-s-full checkpoint (for the VAE decoder + vocoder) and the SoundCTM checkpoint are used.
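A minimal sketch of the expected setup, assuming the files are fetched with the Hugging Face CLI. The repository id below is a placeholder; use the checkpoint links listed above, and note that the actual filenames may differ.

```bash
# Create the checkpoint directory and download the files into it.
# <HF_REPO_ID> is a placeholder -- substitute the Hugging Face repository
# linked in the checkpoint section above.
mkdir -p soundctm/ckpt
huggingface-cli download <HF_REPO_ID> --local-dir soundctm/ckpt
```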
Install Docker on your server and build the Docker container:

docker build -t soundctm .

Then run the scripts inside the container.
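For example, a typical way to start the container (the GPU flag and mount path here are assumptions about a common setup, not taken from the repository; adjust them to your environment):

```bash
# Start an interactive shell in the container with GPU access and the
# repository mounted at /workspace.
docker run -it --gpus all -v "$(pwd)":/workspace -w /workspace soundctm bash
```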
Please see ctm_train.sh and ctm_train.py, and modify the folder paths depending on your environment.
Then run bash ctm_train.sh.
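As a rough sketch of what to adjust (the variable names below are hypothetical; ctm_train.sh defines its own, so edit the corresponding lines there):

```bash
# Hypothetical variable names for illustration only -- edit the actual ones
# defined in ctm_train.sh to match your environment.
TRAIN_CSV=data/train.csv           # caption/audio list for AudioCaps
CKPT_DIR=soundctm/ckpt             # teacher + AudioLDM-s-full checkpoints
OUTPUT_DIR=outputs/soundctm_run1   # where logs and checkpoints are written

bash ctm_train.sh
```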
Please see ctm_inference.sh and ctm_inference.py, and modify the folder paths depending on your environment.
Then run bash ctm_inference.sh.
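A rough sketch of the paths involved (the variable names are hypothetical; ctm_inference.sh defines the actual ones). Note that both checkpoints are needed at inference time:

```bash
# Hypothetical variable names for illustration only -- edit the actual ones
# defined in ctm_inference.sh.
SOUNDCTM_CKPT=soundctm/ckpt/...     # distilled SoundCTM checkpoint
AUDIOLDM_CKPT=soundctm/ckpt/...     # AudioLDM-s-full (VAE decoder + vocoder)
OUTPUT_DIR=outputs/generated_audio  # where generated audio is written

bash ctm_inference.sh
```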
Please see numerical_evaluation.sh and numerical_evaluation.py, and modify the folder paths depending on your environment.
Then run bash numerical_evaluation.sh.
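Again, a rough illustration (hypothetical variable names and paths; check numerical_evaluation.sh for the real ones):

```bash
# Hypothetical variable names for illustration only -- edit the actual ones
# defined in numerical_evaluation.sh.
GENERATED_DIR=outputs/generated_audio   # audio produced by ctm_inference.sh
REFERENCE_DIR=data/audiocaps/test       # ground-truth AudioCaps test audio

bash numerical_evaluation.sh
```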
Follow the instructions given in the AudioCaps repository for downloading the data.
Data locations need to be specified in ctm_train.sh.
You can also see some examples in data/train.csv.
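Before training, it is worth a quick sanity check that your CSV follows the same layout as the bundled example and that the audio paths in it resolve on your machine:

```bash
# Inspect the first few rows of the bundled example CSV to see the expected
# columns, then make sure your own CSV matches that layout.
head -n 3 data/train.csv
```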
The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:
$ wandb login
Alternatively, you can pass an API key via the WANDB_API_KEY environment variable.
(You can obtain the API key from https://wandb.ai/authorize after logging in to your account.)
$ WANDB_API_KEY="12345x6789y..."
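For example, the key can be supplied inline when launching training (the key shown is a placeholder; use your own from the URL above):

```bash
# Set the key just for this run (placeholder value shown).
WANDB_API_KEY="12345x6789y..." bash ctm_train.sh
```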
@inproceedings{
saito2025soundctm,
title={Sound{CTM}: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation},
author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=KrK6zXbjfO}
}
Part of the code is borrowed from the following repositories. We would like to thank their authors for their contributions.