Synthesis results of vocoder #1
Comments
Impressive results! Do you have any samples trained on music, or guidance for someone seeking to create those samples?
@SPGB welcome! Although we haven't tested this model on music data, we may be able to give you some guidance. What is the purpose of your model?
@twidddj Thank you for the response. I'm hoping to use it to generate musical instrument sounds (for example, a drum loop or bass sounds). So far my results have mostly been static. I know I won't be able to replicate the reference samples such as the piano (https://storage.googleapis.com/deepmind-media/pixie/making-music/sample_1.wav), but maybe with the right parameters and enough time an approximation is possible?
You are probably interested in a neural synthesizer. A WaveNet vocoder can help you generate the sounds when the right encoded features are given as the local condition (like what they did). For example, you could use pitch as a local condition and timbre as a global condition for the vocoder.
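To make the conditioning idea concrete, here is a minimal NumPy sketch of how local and global conditions typically enter a WaveNet gated activation unit, following the conditioning equations from the WaveNet paper. All names, shapes, and dimensions are illustrative assumptions, not this repo's actual code:

```python
import numpy as np

def gated_unit(x, local_cond, global_cond, W_f, W_g, V_f, V_g, U_f, U_g):
    """One WaveNet gated activation with conditioning (sketch).
    x:           output of a dilated conv, shape (channels, time)
    local_cond:  per-frame features (e.g. pitch) upsampled to (lc_dim, time)
    global_cond: one vector per utterance (e.g. a timbre embedding), (gc_dim,)
    """
    # Filter and gate each get their own projection of both conditions;
    # the global condition is broadcast across all time steps.
    f = W_f @ x + V_f @ local_cond + (U_f @ global_cond)[:, None]
    g = W_g @ x + V_g @ local_cond + (U_g @ global_cond)[:, None]
    return np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))  # tanh(f) * sigmoid(g)

channels, lc_dim, gc_dim, T = 64, 80, 16, 100
rng = np.random.default_rng(0)
weights = {name: 0.01 * rng.standard_normal((channels, dim))
           for name, dim in [("W_f", channels), ("W_g", channels),
                             ("V_f", lc_dim), ("V_g", lc_dim),
                             ("U_f", gc_dim), ("U_g", gc_dim)]}
out = gated_unit(rng.standard_normal((channels, T)),  # dilated conv output
                 rng.standard_normal((lc_dim, T)),    # pitch frames (local)
                 rng.standard_normal(gc_dim),         # timbre vector (global)
                 **weights)
print(out.shape)  # (64, 100)
```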
Thanks for sharing the link @twidddj. I've been using ibab's WaveNet implementation with moderate success, using a wide receptive field and minimal layers. Is it possible to turn off local conditioning altogether and just create unconditional sounds? It would be really interesting if there were a way to condition it on MIDI, but I wouldn't know where to begin for an addition like that.
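One possible starting point for MIDI conditioning, sketched under assumptions (this repo has no MIDI support, and the frame rate below would have to match the vocoder's hop size): convert the MIDI file to a frame-level piano roll with pretty_midi and feed that as the local condition.

```python
import pretty_midi

def midi_to_local_condition(midi_path, sample_rate=22050, hop_length=256):
    """Turn a MIDI file into frame-level features usable as a local condition.
    The (sample_rate / hop_length) frame rate is an assumption that has to
    agree with whatever frame rate the vocoder's conditioning network expects.
    """
    frames_per_second = sample_rate / hop_length
    pm = pretty_midi.PrettyMIDI(midi_path)
    # Piano roll: a (128 pitches, num_frames) matrix of note velocities.
    roll = pm.get_piano_roll(fs=frames_per_second)
    return (roll > 0).astype("float32")  # binarize to note-on indicators

# cond = midi_to_local_condition("loop.mid")  # shape: (128, num_frames)
```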
Single speaker
We stopped training at 680K steps.
You can find some results at https://twidddj.github.io/docs/vocoder.
We tested the vocoder on two groups of samples: 1) samples from the dataset, and 2) samples generated by Tacotron.
Group 1 overlaps with the training data because of my mistake (so sorry, I did not set aside separate data for testing).
However, I believe the results show the performance to some extent. See the first section on the page.
From the other section, you can gauge the performance of the vocoder:
it can generate audio nearly as good as the target using only the target's mel-spectrogram.
Moreover, some parts of the results have better quality than the target (I hope you think so too). Note that Tacotron was trained on audio at a 24 kHz sample rate, while our vocoder was trained at 22 kHz, so the vocoder has never seen frequencies above 11 kHz. Therefore, if you match the sample rates, your results should be better than the ones we reported.
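For anyone reproducing this, here is a minimal sketch of matching sample rates before feature extraction, assuming librosa; the 22050 Hz target and the mel parameters (n_fft, hop_length, n_mels) are placeholders that should match the vocoder's training configuration, not values confirmed in this issue:

```python
import librosa
import numpy as np

def mel_for_vocoder(wav_path, target_sr=22050, n_fft=1024,
                    hop_length=256, n_mels=80):
    """Resample audio to the vocoder's rate, then extract a log-mel spectrogram."""
    y, sr = librosa.load(wav_path, sr=None)  # keep the file's native rate
    if sr != target_sr:                      # e.g. 24 kHz Tacotron output
        y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    mel = librosa.feature.melspectrogram(y=y, sr=target_sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-5)                # log-compress; avoid log(0)

# mel = mel_for_vocoder("tacotron_sample.wav")  # shape: (80, num_frames)
```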
By the way, we believe the pre-trained model can be used as a teacher model for Parallel WaveNet.
Parallel Wavenet - Single speaker
Not yet tested.
Multi speaker
Not yet tested.