# SampleRNN

Code accompanying the paper *SampleRNN: An Unconditional End-to-End Neural Audio Generation Model* ([arXiv](https://arxiv.org/abs/1612.07837), [OpenReview](https://openreview.net/forum?id=SkxKPDv5xl)). Samples are available [here](https://soundcloud.com/samplernn/sets).

## Dependencies

Extensively tested with:

- cuDNN 5105
- Python 2.7.12
- NumPy 1.11.1
- Theano 0.8.2 (0.9 for the WaveNet re-implementation)
- Lasagne 0.2.dev1

## Datasets

The music dataset was created from all 32 of Beethoven's piano sonatas, which are publicly available on [archive.org](https://archive.org/). `datasets/music` contains scripts to preprocess and build this dataset. It is also available [here](https://drive.google.com/drive/folders/0B7riq_C8aslvbWJuMGhJRFBmSHM?resourcekey=0-fM79ZaHDzE4IPUMzDUK6uA&usp=sharing) for download. Extract the tar file and put all the numpy files in the `datasets/music` directory.

## Training

To train a model on an existing dataset with GPU acceleration, run the following commands from the root of the `sampleRNN_ICLR2017` folder. They correspond to the best set of hyper-parameters found.

Mission control center:

```
$ pwd
/u/mehris/sampleRNN_ICLR2017
```

### SampleRNN (2-tier)

```
$ python models/two_tier/two_tier.py -h
usage: two_tier.py [-h] [--exp EXP] --n_frames N_FRAMES --frame_size
                   FRAME_SIZE --weight_norm WEIGHT_NORM --emb_size EMB_SIZE
                   --skip_conn SKIP_CONN --dim DIM --n_rnn {1,2,3,4,5}
                   --rnn_type {LSTM,GRU} --learn_h0 LEARN_H0
                   --q_levels Q_LEVELS --q_type {linear,a-law,mu-law}
                   --which_set {ONOM,BLIZZ,MUSIC} --batch_size {64,128,256}
                   [--debug] [--resume]

two_tier.py
No default value! Indicate every argument.

optional arguments:
  -h, --help            show this help message and exit
  --exp EXP             Experiment name
  --n_frames N_FRAMES   How many "frames" to include in each Truncated BPTT
                        pass
  --frame_size FRAME_SIZE
                        How many samples per frame
  --weight_norm WEIGHT_NORM
                        Adds learnable weight normalization to all the linear
                        layers (except for the embedding layer)
  --emb_size EMB_SIZE   Size of embedding layer (0 to disable)
  --skip_conn SKIP_CONN
                        Add skip connections to RNN
  --dim DIM             Dimension of RNN and MLPs
  --n_rnn {1,2,3,4,5}   Number of layers in the stacked RNN
  --rnn_type {LSTM,GRU}
                        GRU or LSTM
  --learn_h0 LEARN_H0   Whether to learn the initial state of the RNN
  --q_levels Q_LEVELS   Number of bins for quantization of audio samples.
                        Should be 256 for mu-law.
  --q_type {linear,a-law,mu-law}
                        Quantization in linear scale, a-law companding, or
                        mu-law companding. With mu-/a-law, the quantization
                        level should be set to 256.
  --which_set {ONOM,BLIZZ,MUSIC}
                        ONOM, BLIZZ, or MUSIC
  --batch_size {64,128,256}
                        Size of mini-batch
  --debug               Debug mode
  --resume              Resume the same model from the last checkpoint. Order
                        of params is important. [for now]
```

To run:

```
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/two_tier/two_tier.py --exp BEST_2TIER --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 3 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 128 --weight_norm True --learn_h0 True --which_set MUSIC
```

### SampleRNN (3-tier)

```
$ python models/three_tier/three_tier.py -h
usage: three_tier.py [-h] [--exp EXP] --seq_len SEQ_LEN --big_frame_size
                     BIG_FRAME_SIZE --frame_size FRAME_SIZE
                     --weight_norm WEIGHT_NORM --emb_size EMB_SIZE
                     --skip_conn SKIP_CONN --dim DIM --n_rnn {1,2,3,4,5}
                     --rnn_type {LSTM,GRU} --learn_h0 LEARN_H0
                     --q_levels Q_LEVELS --q_type {linear,a-law,mu-law}
                     --which_set {ONOM,BLIZZ,MUSIC} --batch_size {64,128,256}
                     [--debug] [--resume]

three_tier.py
No default value! Indicate every argument.

optional arguments:
  -h, --help            show this help message and exit
  --exp EXP             Experiment name
  --seq_len SEQ_LEN     How many samples to include in each Truncated BPTT
                        pass
  --big_frame_size BIG_FRAME_SIZE
                        How many samples per big frame in tier 3
  --frame_size FRAME_SIZE
                        How many samples per frame in tier 2
  --weight_norm WEIGHT_NORM
                        Adds learnable weight normalization to all the linear
                        layers (except for the embedding layer)
  --emb_size EMB_SIZE   Size of embedding layer (> 0)
  --skip_conn SKIP_CONN
                        Add skip connections to RNN
  --dim DIM             Dimension of RNN and MLPs
  --n_rnn {1,2,3,4,5}   Number of layers in the stacked RNN
  --rnn_type {LSTM,GRU}
                        GRU or LSTM
  --learn_h0 LEARN_H0   Whether to learn the initial state of the RNN
  --q_levels Q_LEVELS   Number of bins for quantization of audio samples.
                        Should be 256 for mu-law.
  --q_type {linear,a-law,mu-law}
                        Quantization in linear scale, a-law companding, or
                        mu-law companding. With mu-/a-law, the quantization
                        level should be set to 256.
  --which_set {ONOM,BLIZZ,MUSIC}
                        ONOM, BLIZZ, or MUSIC
  --batch_size {64,128,256}
                        Size of mini-batch
  --debug               Debug mode
  --resume              Resume the same model from the last checkpoint. Order
                        of params is important. [for now]
```

To run:

```
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/three_tier/three_tier.py --exp BEST_3TIER --seq_len 512 --big_frame_size 8 --frame_size 2 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 1 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 128 --weight_norm True --learn_h0 True --which_set MUSIC
```
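As a sanity check on how the flag values in the two commands above fit together: in the 2-tier command, each truncated-BPTT pass covers `--n_frames 64` frames of `--frame_size 16` samples each, i.e. 1024 raw samples; in the 3-tier command, the `--seq_len 512` samples per pass are grouped into big frames of `--big_frame_size 8` samples, each of which subdivides into tier-2 frames of `--frame_size 2`. A minimal NumPy sketch of this arithmetic (variable names are ours, not the repository's):

```
# Illustrative arithmetic only; not code from this repository.
import numpy as np

# 2-tier: samples covered by one truncated-BPTT pass.
n_frames, frame_size = 64, 16
samples_per_pass = n_frames * frame_size            # 1024

# 3-tier: tier sizes must nest evenly for the hierarchy to line up.
seq_len, big_frame_size, tier2_frame_size = 512, 8, 2
assert seq_len % big_frame_size == 0
assert big_frame_size % tier2_frame_size == 0

# Reshaping a batch of quantized samples into non-overlapping frames:
batch = np.zeros((128, seq_len), dtype='int32')     # batch_size x seq_len
frames = batch.reshape((128, seq_len // tier2_frame_size, tier2_frame_size))
print(samples_per_pass)                             # 1024
print(frames.shape)                                 # (128, 256, 2)
```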
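The `--q_type` and `--q_levels` flags control how raw audio amplitudes are discretized into the bins the model predicts over. Below is a minimal sketch of linear quantization and standard mu-law companding with 256 levels; it is an illustration under our own assumptions, not the repository's implementation:

```
# A rough sketch of two --q_type options; not the repository's own code.
import numpy as np

def linear_quantize(audio, q_levels=256):
    # Map float audio in [-1, 1] to integer bins {0, ..., q_levels - 1}.
    audio = (audio + 1.0) / 2.0                     # rescale to [0, 1]
    bins = np.rint(audio * (q_levels - 1))
    return np.clip(bins, 0, q_levels - 1).astype('int32')

def mu_law_quantize(audio, q_levels=256):
    # Standard mu-law companding (mu = q_levels - 1), then linear quantization.
    mu = q_levels - 1.0
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return linear_quantize(companded, q_levels)

x = np.sin(np.linspace(0.0, 100.0, 16000))          # a toy 1-second signal
print(mu_law_quantize(x).min(), mu_law_quantize(x).max())   # bins span 0..255
```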
## Reference

If you use this code, please cite the paper:

```
@article{mehri2016samplernn,
  Author  = {Soroush Mehri and Kundan Kumar and Ishaan Gulrajani and Rithesh Kumar and Shubham Jain and Jose Sotelo and Aaron Courville and Yoshua Bengio},
  Title   = {SampleRNN: An Unconditional End-to-End Neural Audio Generation Model},
  Year    = {2016},
  Journal = {arXiv preprint arXiv:1612.07837},
}
```

## Torch implementation

Thanks to [Richard Assar](https://github.com/richardassar), a Torch implementation is now available: [https://github.com/richardassar/SampleRNN_torch](https://github.com/richardassar/SampleRNN_torch)

## Miscellaneous

- Talk by Yoshua Bengio at CBMM, MIT: [Deep Generative Models for Speech and Images](https://www.youtube.com/watch?v=vEAq_sBf1CA)
- Follow-up project: [Char2Wav: End-To-End Speech Synthesis](https://github.com/sotelo/parrot)

If you need anything, or have an interesting related project or results, please don't hesitate to contact us.