
AutoVC on a large scale data? #26

Open
iyah4888 opened this issue Sep 1, 2019 · 2 comments


@iyah4888

iyah4888 commented Sep 1, 2019

Hi @auspicious3000, thanks for sharing your research code.
I've spent a lot of time getting the training code to work (mostly due to input hyperparameter issues that others here are also struggling with).
I'm currently working on the VoxCeleb2 dataset (nearly 6,000 speakers, about 1M utterances).
However, I cannot get it to train with MSE loss; with L1 loss, I can get the following auto-encoding reconstruction.
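For reference, a minimal sketch of the two reconstruction criteria on mel-spectrogram-shaped arrays (`recon_loss` is a hypothetical helper for illustration, not from this repo):

```python
import numpy as np

def recon_loss(pred, target, kind="l1"):
    """Compare L1 and MSE reconstruction criteria on (frames, mel_bins) arrays."""
    diff = pred - target
    if kind == "l1":
        # L1 grows linearly with the error, so a few outlier bins dominate less
        return float(np.abs(diff).mean())
    # MSE penalizes large errors quadratically, which can destabilize early training
    return float((diff ** 2).mean())
```

One common explanation for L1 training more stably is exactly this outlier sensitivity: a handful of high-energy mel bins with large errors produce much larger gradients under MSE.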

[Original] (image attachment)
[Voice converted with another speaker embedding] (image attachment)

The problem is that while the network learns auto-encoding, it does not generalize to voice conversion at test time; it only reconstructs the input.
The pair of examples above is a voice conversion example, yet the fundamental frequency contours of the two mel-spectrograms look very similar.

Could you share your experience or any comments? I'd appreciate it.

@auspicious3000
Owner

auspicious3000 commented Sep 3, 2019

For a different dataset you need to retune the bottleneck. Also, feel free to try different encoder and decoder architectures; the paper proposes a framework rather than specific architectures. VoxCeleb2 is not very clean: if the channel effects and background noise differ across utterances, you need to disentangle them by conditioning on that information as well; otherwise the model will not achieve disentangled representations for conversion. I suggest you start with a clean dataset such as VCTK.
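To illustrate what "retuning the bottleneck" touches, here is a simplified sketch of the temporal downsampling applied to the content codes (illustrative only; AutoVC itself downsamples forward/backward RNN outputs at interleaved steps, but the temporal idea is the same):

```python
import numpy as np

def downsample(codes, freq):
    """Keep every `freq`-th frame of (T, dim) content codes."""
    return codes[::freq]

def upsample(codes_ds, freq, T):
    """Repeat each kept frame `freq` times and trim back to length T."""
    return np.repeat(codes_ds, freq, axis=0)[:T]

codes = np.random.randn(128, 64)       # 128 frames of 64-dim content codes
ds = downsample(codes, 32)             # -> (4, 64): the information bottleneck
us = upsample(ds, 32, codes.shape[0])  # -> (128, 64) fed to the decoder
```

A larger `freq` (or a smaller code dimension) squeezes more speaker information out of the codes, which is what enables conversion; too large, and linguistic content is lost too. That is the trade-off being retuned per dataset.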

@light1726

Thanks for sharing.
In my experience, the temporal resolution of the bottleneck feature (determined by the mel-spectrogram hop length and the downsampling frequency) seems to be important for the encoder to disentangle.
When I extracted mel-spectrograms with a hop length of 250, a downsampling frequency of 32 gave better conversion performance than a downsampling frequency of 16.
Currently, I extract mel-spectrograms with a hop length of 200 and increase the downsampling frequency to 40, but the conversion performance is still worse than with hop 250 and frequency 32.
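As a sanity check on these settings (assuming a 16 kHz sample rate, which may differ from your setup), the number of waveform samples covered by one bottleneck code is simply hop length × downsampling frequency:

```python
sr = 16000  # assumed sample rate; adjust for your preprocessing
for hop, freq in [(250, 32), (250, 16), (200, 40)]:
    span = hop * freq  # waveform samples per bottleneck code
    print(f"hop={hop} freq={freq}: {span} samples = {span / sr:.3f} s per code")
```

Note that hop 250 × freq 32 and hop 200 × freq 40 both cover 8000 samples (0.5 s at 16 kHz), while hop 250 × freq 16 covers only 4000 (0.25 s). So the gap between your two 0.5 s settings cannot be explained by the per-code temporal span alone.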
