Sampling-frequency-independent (SFI) convolutional layer and its application to audio source separation
This repository is an official implementation of our paper entitled "Sampling-frequency-independent convolutional layer and its application to audio source separation".
The left and right panels of the figure above show the proposed SFI convolutional layers using the time- and frequency-domain filter design methods, respectively.

Audio source separation is often used as preprocessing for various tasks, and one of its ultimate goals is to construct a single versatile preprocessor that can handle every variety of audio signal. One of the most important properties of a discrete-time audio signal is its sampling frequency. Since the sampling frequency is usually task-specific, a versatile preprocessor must handle all the sampling frequencies required by the possible downstream tasks. However, conventional models based on deep neural networks (DNNs) are not designed to handle a variety of sampling frequencies, so they may not work properly for unseen sampling frequencies. In this paper, we propose sampling-frequency-independent (SFI) convolutional layers capable of handling various sampling frequencies. The core idea of the proposed layers comes from our finding that a convolutional layer can be viewed as a collection of digital filters and hence inherently depends on the sampling frequency. To overcome this dependency, we propose an SFI structure that features analog filters and generates the weights of a convolutional layer from those analog filters. By utilizing time- and frequency-domain analog-to-digital filter conversion techniques, we can adapt the convolutional layer to various sampling frequencies. As an example application, we construct an SFI version of a conventional source separation network. Through music source separation experiments, we show that the proposed layers enable separation networks to work consistently well for unseen sampling frequencies in terms of objective and perceptual separation quality. We also demonstrate that the proposed method outperforms a conventional method based on signal resampling when the sampling frequencies of the input signals are significantly lower than the trained sampling frequency.
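To make the weight-generation idea concrete, here is a minimal, illustrative sketch of the time-domain design: convolution weights are obtained by sampling an analog filter's impulse response on the time grid of the target sampling frequency. The first-order low-pass filter, the cutoff values, and the function names below are assumptions for illustration only, not the learned analog filters used in the paper.

```python
import numpy as np

def analog_impulse_response(t, fc):
    """Impulse response of a first-order analog low-pass filter with
    cutoff fc [Hz]. A stand-in for the paper's learned analog filters."""
    return 2.0 * np.pi * fc * np.exp(-2.0 * np.pi * fc * t)

def sfi_conv_weights(cutoffs, fs, kernel_size):
    """Generate 1-D convolution weights for sampling frequency fs by
    sampling each analog impulse response at t = n / fs
    (time-domain filter design)."""
    t = np.arange(kernel_size) / fs  # sample instants at the target fs
    return np.stack([analog_impulse_response(t, fc) / fs for fc in cutoffs])

# The same analog parameters yield matched filters at any sampling frequency:
w_32k = sfi_conv_weights([200.0, 1000.0, 4000.0], fs=32000, kernel_size=64)
w_8k  = sfi_conv_weights([200.0, 1000.0, 4000.0], fs=8000,  kernel_size=16)
```

Note that 64 taps at 32 kHz and 16 taps at 8 kHz both cover the same 2 ms analog time span, which is why the layer behaves consistently across sampling frequencies.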
git clone [email protected]:TomohikoNakamura/sfi_convtasnet.git --recursive
You can set up the environment with conda or Docker.
- Install miniconda.
- Create an environment for this repository.
conda env create -f environment-cuda.yml
- Go into the created environment.
conda activate sfi
- Rewrite `volumes` in `docker-compose.yaml` in accordance with your environment:
  volumes:
    - /path/to/this/directory:/opt/src
- Build and start the Docker container using docker-compose.
docker-compose build
docker-compose up -d
- Go into the created container
docker-compose exec sfi_convtasnet bash
- Execute `separate_audiofile.py`.
  - Use the proposed SFI mechanism:
python separate_audiofile.py --model_dir /path/to/trained/model/dir --input_files /path/to/audio/file --sample_rate 8000 --output_dir /path/to/output/dir
  - Use signal resampling (see the baseline sketch below):
python separate_audiofile.py --model_dir /path/to/trained/model/dir --input_files /path/to/audio/file --sample_rate 8000 --output_dir /path/to/output/dir --use_signal_resampling
- Trained models are available in `pretrained`.
- Remark: The trained models are not exactly the same as those used in the paper, but they show performance similar to that reported in the paper. We retrained the models while refactoring this codebase.
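For context, `--use_signal_resampling` corresponds to the conventional baseline compared in the paper: resample the input to the trained sampling frequency, separate, then resample the estimates back. Below is a minimal sketch of that pipeline, assuming a hypothetical `model.separate` API (not this repository's actual interface) and using `librosa` for resampling.

```python
import librosa

def separate_with_resampling(model, x, fs_in, fs_trained=32000):
    """Resample to the trained rate, separate, and resample back.
    `model.separate` is a hypothetical API used only for illustration."""
    x_rs = librosa.resample(x, orig_sr=fs_in, target_sr=fs_trained)
    sources = model.separate(x_rs)
    return [librosa.resample(s, orig_sr=fs_trained, target_sr=fs_in)
            for s in sources]
```

The SFI layers avoid the two resampling passes entirely by regenerating the convolution weights for the input's sampling frequency.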
- Download the MUSDB18-HQ dataset.
- Create training data files from wav files
- Use `utility/data_generator.py`:
python utility/data_generator.py --musdb_path /path/to/musdb18-hq/dataset --outdir data --is_wav # If --setup_file is not set, use default training/validation data split.
- Training and validation data are created as `data/train_32` and `data/validation_32` (32 kHz-sampled data).
- cf. usage:
usage: data_generator.py [-h] --musdb_path MUSDB_PATH [--setup_file SETUP_FILE] --outdir OUTDIR [--n_threads N_THREADS] [--is_wav]

optional arguments:
  -h, --help            show this help message and exit
  --musdb_path MUSDB_PATH
                        Path to the MUSDB18 dataset.
  --setup_file SETUP_FILE
  --outdir OUTDIR
  --n_threads N_THREADS
  --is_wav
- Train the model (4 GPUs recommended).
  - Train the model with options (we use the hydra library; see the sketch after the command below):
python train.py model=MODEL_NAME [any other options]
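Since training uses hydra, options are passed as `key=value` overrides on the command line. The following is a rough sketch of what a hydra entry point looks like; the config path, config name, and fields are illustrative assumptions, not this repository's actual schema.

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig):
    # `model=MODEL_NAME` on the command line selects the model config group;
    # any other field can be overridden the same way, e.g. `optimizer.lr=1e-3`.
    print(cfg.model)

if __name__ == "__main__":
    main()
```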
- Prepare the test data and the trained model.
- Run `evaluate_specified_sampling_rate.py`.
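Separation quality at each sampling frequency is typically measured with BSS-eval-style metrics. As a reference, here is a minimal scale-invariant SDR (SI-SDR) implementation; the repository's evaluation script may compute SDR via `museval` instead, so treat this as an assumption-laden sketch rather than the actual evaluation code.

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant SDR in dB between an estimated and a reference source."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal scaling of the reference
    target = alpha * ref
    noise = est - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```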
@article{KSaito2022IEEEACMTASLP,
  author={Saito, Koichi and Nakamura, Tomohiko and Yatabe, Kohei and Saruwatari, Hiroshi},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={Sampling-frequency-independent convolutional layer and its application to audio source separation},
  year={2022},
  month=sep,
  volume={30},
  pages={2928--2943},
  doi={10.1109/TASLP.2022.3203907}
}
- This work was supported by JSPS KAKENHI under Grant JP20K19818 and JST ACT-X under Grant JPMJAX210G.
- Most of this code is borrowed from https://github.com/pfnet-research/meta-tasnet.