
How to generate mel spectrogram #4

Open · nkcdy opened this issue Jun 20, 2019 · 33 comments


nkcdy commented Jun 20, 2019

With the same WaveNet model and the same utterance (p225_001.wav), I found that the quality of the waveform generated from the mel spectrogram in the provided metadata.pkl is much better than the one generated from my own mel spectrogram. Are there any tricks for generating a proper mel spectrogram?

@auspicious3000 (Owner)

num_mels: 80
fmin: 90
fmax: 7600
fft_size: 1024
hop_size: 256
min_level_db: -100
ref_level_db: 16
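
For reference, here is a minimal sketch of extracting a mel spectrogram with these parameters, in the style of r9y9's librosa-based wavenet_vocoder. The sample rate and the function name are assumptions, not taken from the repo:

```python
import librosa
import numpy as np

num_mels, fmin, fmax = 80, 90, 7600
fft_size, hop_size = 1024, 256
min_level_db, ref_level_db = -100, 16
sample_rate = 16000  # assumption: the paper uses 16 kHz audio

def melspectrogram(wav):
    # Linear-scale STFT magnitude
    D = np.abs(librosa.stft(wav, n_fft=fft_size, hop_length=hop_size))
    # Project onto an 80-band mel filterbank limited to [fmin, fmax]
    mel_basis = librosa.filters.mel(sr=sample_rate, n_fft=fft_size,
                                    n_mels=num_mels, fmin=fmin, fmax=fmax)
    S = np.dot(mel_basis, D)
    # Amplitude to dB, shifted by the reference level
    S = 20 * np.log10(np.maximum(1e-5, S)) - ref_level_db
    # Normalize to [0, 1] using min_level_db (see the clipping discussion below)
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)
```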


nkcdy commented Jun 21, 2019

> num_mels: 80
> fmin: 90
> fmax: 7600
> fft_size: 1024
> hop_size: 256
> min_level_db: -100
> ref_level_db: 16

Thanks a lot. The quality improves when I generate the mel spectrogram with the above hyperparameters, even if I use the default parameters to generate the waveform.


nkcdy commented Jun 21, 2019

Another question is about the speaker embeddings. The speaker embedding in metadata.pkl is a single vector with 256 dimensions, but I get a matrix of size N×256 when I use the GE2E method to generate the speaker embeddings. What is the relationship between the vector and the matrix?

@auspicious3000 (Owner)

The embedding in metadata.pkl should be a vector of length 256.
The N you got might be the number of speakers.


nkcdy commented Jun 22, 2019

Yes, the embedding in metadata.pkl is a vector of length 256. But I get several d-vectors of length 256 even when I use a single wave file (p225_001.wav). I did some normalization according to the GE2E paper (Section 3.2): "the final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average". The result looks quite different from the vector in metadata.pkl: all numbers in my vector are positive, while the vector in metadata.pkl has both positive and negative values. Should I just average all the d-vectors without normalization?

@auspicious3000 (Owner)

You can average all the d-vectors without normalization.
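
A minimal sketch of that, assuming `dvecs` is the N×256 array of window-wise d-vectors produced by the GE2E speaker encoder for one utterance:

```python
import numpy as np

def utterance_embedding(dvecs):
    # Average the window-wise d-vectors directly, skipping the per-window
    # L2 normalization described in the GE2E paper (Section 3.2).
    return dvecs.mean(axis=0)  # -> a single 256-dim speaker embedding
```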


nkcdy commented Jun 22, 2019

It didn't work... :(

I noticed that the sampling rate of the TIMIT corpus used in https://github.com/HarryVolek/PyTorch_Speaker_Verification is 16 kHz, while the sampling rate of the VCTK corpus is 48 kHz.

Should I retrain the d-vector network at the 48 kHz sampling rate?

@auspicious3000 (Owner)

The details are described in the paper.


nkcdy commented Jun 28, 2019

> The details are described in the paper.

I still cannot reproduce your results as shown in the demo; what I got was babble. The sampling rate of all the wave files has been changed to 16 kHz as described in your paper.

The network I used to generate the speaker embeddings was Janghyun1230's version (https://github.com/Janghyun1230/Speaker_Verification).

I noticed that the method used to generate the mel spectrogram in WaveNet is different from the one used in speaker verification, so I modified the source code of the speaker verification to match the mel spectrogram of WaveNet and retrained the speaker embedding network. But it still doesn't work for AutoVC conversion.

I guess the reason lies in the method used to generate the speaker embeddings.

Can you give me some advice on that?

@auspicious3000 (Owner)

You are right. In this case, you have to retrain the model using your speaker embeddings.

@lhppom

lhppom commented Jul 3, 2019

> num_mels: 80
> fmin: 90
> fmax: 7600
> fft_size: 1024
> hop_size: 256
> min_level_db: -100
> ref_level_db: 16
>
> Thanks a lot. The quality improves when I generate the mel spectrogram with the above hyperparameters, even if I use the default parameters to generate the waveform.

Do you clip the mel spectrogram to a specific range, such as [-1, 1] or some other range? Thanks!

@auspicious3000 (Owner)

Clip to [0,1]

@liveroomand

> Clip to [0,1]

How is the mel spectrogram clipped to [0, 1]? What algorithm or method did you use?

@xw1324832579

@auspicious3000 Can you please release your code for generating the speaker embeddings? Like @liveroomand, I can't reproduce your embedding results for p225, p228, p256, and p270, and retraining the model costs a lot of time. Or please release all the parameters you set when training the speaker embeddings. Thank you.

@auspicious3000 (Owner)

@xw1324832579 You can use a one-hot embedding if you are not doing zero-shot conversion. Retraining takes less than 12 hours on a single GPU.
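
A minimal sketch of a one-hot speaker embedding, assuming a fixed list of training speakers (the speaker list here is illustrative; with one-hot embeddings the embedding dimension becomes the number of speakers, so the conversion model must be retrained with that dimension):

```python
import numpy as np

speakers = ["p225", "p228", "p256", "p270"]  # assumed training speakers

def one_hot_embedding(speaker):
    # Only valid for speakers seen during training, i.e. not zero-shot
    emb = np.zeros(len(speakers), dtype=np.float32)
    emb[speakers.index(speaker)] = 1.0
    return emb
```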

@liveroomand

@auspicious3000 Are the features (80-band mel spectrograms) used for the speaker embedding and for content extraction (the encoder input) the same?

@auspicious3000 (Owner)

They don't have to be the same.

@liveroomand

Is this how to generate the speaker mel spectrogram? E.g.:

num_mels: 40
fmin: 90
fmax: 7600
window_length: 0.025 s
hop_length: 0.01 s
no clipping to [0, 1]
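
For reference, a minimal librosa-based sketch of extracting features with these parameters; the sample rate, the log scaling, and the function name are assumptions, not confirmed by the thread:

```python
import librosa
import numpy as np

sample_rate = 16000                 # assumption
n_fft = int(0.025 * sample_rate)    # 25 ms window -> 400 samples
hop = int(0.010 * sample_rate)      # 10 ms hop    -> 160 samples

def speaker_mel(wav):
    S = librosa.feature.melspectrogram(y=wav, sr=sample_rate, n_fft=n_fft,
                                       hop_length=hop, n_mels=40,
                                       fmin=90, fmax=7600)
    # Log scale; no [0, 1] clipping, unlike the vocoder features
    return np.log(np.maximum(S, 1e-10))
```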

@auspicious3000 (Owner)

@liveroomand Looks fine. You can refer to r9y9's wavenet vocoder for more details on spectrogram normalization and clipping.
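
In that repo's style, the normalization and clipping step looks roughly like this (a sketch; the parameter name follows r9y9's wavenet_vocoder, and `denormalize` is the inverse applied before synthesis):

```python
import numpy as np

min_level_db = -100

def normalize(S):
    # Map a dB-scale mel spectrogram into [0, 1] and clip
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)

def denormalize(S):
    # Inverse mapping, applied before feeding the vocoder
    return (np.clip(S, 0, 1) * -min_level_db) + min_level_db
```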

@liveroomand

Do you mean that the mel spectrogram used to pre-train the speaker encoder also needs to be clipped to [0, 1]?


auspicious3000 commented Aug 13, 2019

@liveroomand Yes, in our case. But you can design your own speaker encoder or just use a one-hot embedding.

@smalissa

Hi all, can you help me please? I have my own dataset. How do I process this data, and how can I build my models to get my own WAV audio?
Thanks.


miaoYuanyuan commented Sep 23, 2019

> num_mels: 80
> fmin: 90
> fmax: 7600
> fft_size: 1024
> hop_size: 256
> min_level_db: -100
> ref_level_db: 16

Are these parameters suitable for other datasets? When I switch to my own dataset, it doesn't produce good quality like VCTK. Is the reason that you trained the WaveNet vocoder on VCTK specifically? Could you give some advice on other datasets? Thanks.

@auspicious3000 (Owner)

@miaoYuanyuan For other datasets, you need to tune the parameters of the conversion model instead of the feature parameters.


smalissa commented Sep 23, 2019

@miaoYuanyuan
Please can you tell me what you did to get your results? Can you guide me through what you did?
Thank you.

@miaoYuanyuan

> @miaoYuanyuan For other datasets, you need to tune the parameters of the conversion model instead of the feature parameters.

Thanks. Do you mean the WaveNet vocoder or the AutoVC conversion model?

@auspicious3000 (Owner)

@miaoYuanyuan If you change the feature parameters, you will need to retrain the WaveNet vocoder as well.

@miaoYuanyuan

Thank you! I got it.


miaoYuanyuan commented Sep 25, 2019

> @miaoYuanyuan
> Please can you tell me what you did to get your results? Can you guide me through what you did?
> Thank you.

From WAVs to mel spectrograms:
Refer to how preprocess.py in the wavenet_vocoder folder processes audio to get the mel spectrogram you want. I haven't done the voice conversion yet, so I can't give you advice on that.

@smalissa

@miaoYuanyuan
Thanks for your reply, but could you explain what you have done so far and what the steps are? I am confused by the details.

@smalissa

@miaoYuanyuan
Can you tell me what the aim of the preprocess.py file is? Can you guide me to the starting point, i.e., where I should start?
Thanks.

@miaoYuanyuan

> @miaoYuanyuan
> Can you tell me what the aim of the preprocess.py file is? Can you guide me to the starting point, i.e., where I should start?
> Thanks.

This is the code. I hope it can help you:
https://github.com/miaoYuanyuan/gen_melSpec_from_wav

@KnurpsBram

Thanks @miaoYuanyuan for making the preprocessing steps clear! I wanted to experiment with AutoVC and the WaveNet vocoder separately, and found this thread really useful. In the end I put my experiments in a notebook and made a git repo of it. It could be useful for those of you who are in the shoes of me-a-week-ago.

https://github.com/KnurpsBram/AutoVC_WavenetVocoder_GriffinLim_experiments
