
[ICLR'25] SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

This repository is the official implementation of "SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation".

Contact:

Info

  • [2025/03/30] SoundCTM-DiT (ICLR'25) is now available; see SoundCTM-DiT.
  • [2024/12/04] We are planning to open-source the codebase and checkpoints of the DiT backbone with the full-band text-to-sound setting, as well as the downstream tasks.
  • [2024/02/10] Our paper, updated on OpenReview from the previous version, has been accepted at ICLR'25!

Checkpoints (The current checkpoint is based on the previous version of the paper.)

For inference, both the AudioLDM-s-full checkpoint (for the VAE decoder and vocoder) and the SoundCTM checkpoint are used.

Prerequisites

Install Docker on your own server and build the Docker container:

docker build -t soundctm .

Then run the scripts inside the container.
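
For example, a minimal way to start an interactive shell in the container with GPU access might look like this (illustrative only; the mount point and flags are assumptions, and GPU passthrough assumes the NVIDIA Container Toolkit is installed on the host):

# Illustrative: interactive container with GPU access and the repo mounted at /workspace.
docker run -it --rm --gpus all -v "$(pwd)":/workspace -w /workspace soundctm bash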

Training

Please see ctm_train.sh and ctm_train.py, and modify the folder paths depending on your environment.

Then run bash ctm_train.sh
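
As an illustrative sketch only, the edit-then-run flow might look like the following; the paths are hypothetical placeholders, and the actual variable names are those defined inside ctm_train.sh:

# Hypothetical placeholders; the real variable names live inside ctm_train.sh.
#   training data  -> e.g. /path/to/data/train.csv
#   checkpoint dir -> e.g. /path/to/checkpoints
bash ctm_train.sh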

Inference

Please see ctm_inference.sh and ctm_inference.py, and modify the folder paths depending on your environment.

Then run bash ctm_inference.sh
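
Similarly, a hypothetical inference run after editing the paths (the locations below are placeholders, not the script's actual variable names):

# Hypothetical placeholders; check ctm_inference.sh for the actual variables.
#   SoundCTM checkpoint   -> e.g. /path/to/soundctm_ckpt
#   AudioLDM-s-full ckpt  -> for the VAE decoder + vocoder
#   output directory      -> e.g. /path/to/generated_audio
bash ctm_inference.sh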

Numerical evaluation

Please see numerical_evaluation.sh and numerical_evaluation.py, and modify the folder paths depending on your environment.

Then run bash numerical_evaluation.sh
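
As a sketch, assuming the script scores generated audio against reference audio (in the style of audioldm_eval, referenced below); the directory names are placeholders:

# Hypothetical placeholders; check numerical_evaluation.sh for the actual variables.
#   generated audio dir -> e.g. /path/to/generated_audio
#   reference audio dir -> e.g. /path/to/audiocaps_test
bash numerical_evaluation.sh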

Dataset

Follow the instructions given in the AudioCaps repository to download the data. The data locations then need to be specified in ctm_train.sh. You can also see some examples in data/train.csv.
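
For illustration only, a row might pair a caption with an audio file path; the column names below are assumptions, so defer to the actual data/train.csv in this repository:

# Hypothetical layout; the real header in data/train.csv may differ.
caption,audio_path
"A dog barks while birds chirp in the background",/path/to/audiocaps/train/example.wav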

WandB for logging

The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:

$ wandb login

Alternatively, you can pass an API key via the WANDB_API_KEY environment variable. (You can obtain the API key from https://wandb.ai/authorize after logging in to your account.)

$ export WANDB_API_KEY="12345x6789y..."
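
Since the scripts run inside the Docker container, one illustrative way to make the key available there is to forward it with docker run's -e flag:

# Forward WANDB_API_KEY from the host environment into the container (illustrative).
docker run -it --gpus all -e WANDB_API_KEY soundctm bash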

Citation

@inproceedings{
  saito2025soundctm,
  title={Sound{CTM}: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation},
  author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=KrK6zXbjfO}
}

Reference

Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contributions.

https://github.com/sony/ctm

https://github.com/declare-lab/tango

https://github.com/haoheliu/AudioLDM

https://github.com/haoheliu/audioldm_eval
