How to generate mel spectrogram #4
With the same wavenet model and the same utterance (p225_001.wav), I found that the quality of the waveform generated from the mel spectrogram provided in metadata.pkl is much better than the one I generated myself. Are there any tricks for generating a proper mel spectrogram?
Comments
num_mels: 80
Thanks a lot. The quality is improved with the above hyperparameters when I generate the mel spectrogram, even if I use the default parameters to generate the waveform.
Another question is about the speaker embeddings. The speaker embedding in metadata.pkl is a scalar with 256 dimensions, but I got a matrix of size N*256 when I used the GE2E method to generate the speaker embeddings. What is the relationship between the scalar and the matrix?
The embedding in metadata.pkl should be a vector of length 256.
Yes, the embedding in metadata.pkl is a vector of length 256. But I got several d-vectors of length 256 even when I used a single wave file (p225_001.wav). I did some normalization according to the GE2E paper (section 3.2): "the final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average". The result looks quite different from the vector in metadata.pkl. All the numbers in my vector were positive, while the vector in metadata.pkl has both positive and negative values. Should I just average all the d-vectors without normalization?
You can average all the d-vectors without normalization.
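For reference, a minimal sketch of both options in numpy (the array shape follows the N*256 matrix mentioned above; the function name is just for illustration):

```python
import numpy as np

def utterance_embedding(window_dvecs, l2_normalize=False):
    """Collapse window-wise d-vectors (N x 256) into one utterance-level embedding.

    window_dvecs: array of shape (N, 256), one d-vector per sliding window.
    l2_normalize: if True, follow GE2E section 3.2 (L2-normalize each window
    first, then average); if False, just average the raw d-vectors as
    suggested above.
    """
    d = np.asarray(window_dvecs, dtype=np.float32)
    if l2_normalize:
        d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return d.mean(axis=0)
```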
It didn't work... :( I noticed that the sampling rate of the TIMIT corpus used in https://github.com/HarryVolek/PyTorch_Speaker_Verification is 16 kHz, while the sampling rate of the VCTK corpus is 48 kHz. Should I re-train the d-vector network at the 48 kHz sampling rate?
The details are described in the paper.
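For anyone following along, a minimal sketch of downsampling a 48 kHz VCTK utterance to 16 kHz with librosa before feature extraction (the file paths are examples, not paths from the repo):

```python
import librosa
import soundfile as sf

# Downsample a 48 kHz VCTK utterance to 16 kHz before feature extraction.
# Adjust the input path to your local VCTK layout.
wav, _ = librosa.load("VCTK-Corpus/wav48/p225/p225_001.wav", sr=16000)
sf.write("p225_001_16k.wav", wav, 16000)
```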
I still cannot reproduce your results as shown in the demo; what I got was babble. The sampling rate of all the wave files has been changed to 16 kHz as described in your paper. The network I used to generate the speaker embeddings was Janghyun1230's version (https://github.com/Janghyun1230/Speaker_Verification). I noticed that the method used to generate the mel spectrogram in wavenet is different from the one in speaker verification, so I modified the source code of the speaker verification to match the mel spectrogram of wavenet and retrained the speaker embedding network. But it still doesn't work for AutoVC conversion. I guess the reason lies in the method used to generate the speaker embeddings. Can you give me some advice on that?
You are right. In this case, you have to retrain the model using your speaker embeddings.
Do you clip the mel spectrogram to a specific range, such as [-1, 1] or something else? Thanks!
Clip to [0,1] |
How is the mel spectrogram clipped to [0,1]? What algorithm or method did you use?
@auspicious3000 Can you please release your code for generating speaker embeddings? I have the same question as @liveroomand: I can't reproduce your embedding results for p225, p228, p256 and p270, and retraining the model costs a lot of time. Alternatively, please release all the parameters you set when training the speaker embeddings. Thank you.
@xw1324832579 You can use one-hot embeddings if you are not doing zero-shot conversion. Retraining takes less than 12 hours on a single GPU.
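For reference, a minimal sketch of what a one-hot speaker embedding could look like for a closed set of training speakers (the speaker list and dimensionality here are assumptions, not values from the repo):

```python
import numpy as np

# Closed set of training speakers; the order defines the one-hot index.
speakers = ["p225", "p228", "p256", "p270"]

def one_hot_embedding(speaker_id):
    """Return a one-hot vector identifying a seen speaker.

    Unlike GE2E d-vectors, this cannot generalize to unseen speakers,
    which is why it only works when you are not doing zero-shot conversion.
    """
    emb = np.zeros(len(speakers), dtype=np.float32)
    emb[speakers.index(speaker_id)] = 1.0
    return emb

print(one_hot_embedding("p228"))  # [0. 1. 0. 0.]
```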
@auspicious3000 Are the features (80-bin mel spectrograms) used for the speaker embedding and for content extraction (the encoder input) the same?
They don't have to be the same. |
How to generate speaker mel spectrogram |
@liveroomand Looks fine. You can refer to r9y9's wavenet vocoder for more details on spectrogram normalization and clipping.
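For reference, a minimal sketch of the kind of dB conversion, normalization, and clipping used in r9y9's wavenet_vocoder preprocessing (the dB floor and reference level below are assumed values; check them against the hparams your vocoder was actually trained with):

```python
import numpy as np

# Assumed values in the style of r9y9/wavenet_vocoder's hparams; verify
# against the hparams used to train your vocoder.
MIN_LEVEL_DB = -100
REF_LEVEL_DB = 20

def amp_to_db(x):
    return 20.0 * np.log10(np.maximum(1e-5, x))

def normalize_mel(mel_amplitude):
    """Map an amplitude mel spectrogram to [0, 1].

    Converts to dB, subtracts a reference level, rescales by the dB floor,
    and finally clips, which is where the [0, 1] range comes from.
    """
    mel_db = amp_to_db(mel_amplitude) - REF_LEVEL_DB
    mel_norm = (mel_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB
    return np.clip(mel_norm, 0.0, 1.0)
```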
Do you mean that the mel spectrogram used to pre-train the speaker encoder also needs to be clipped to [0, 1]?
@liveroomand Yes, in our case. But you can design your own speaker encoder or just use one-hot embeddings.
Hi all,
Are these params suitable for other datasets? When I switch to my own dataset, the quality is not as good as on VCTK. Is the reason that the wavenet vocoder was trained specifically on VCTK? Could you give some advice on other datasets? Thanks.
@miaoYuanyuan For other datasets, you need to tune the parameters of the conversion model rather than the feature parameters.
Thanks. Do you mean the wavenet vocoder or the AutoVC conversion model?
@miaoYuanyuan If you change the feature parameters, you will need to retrain the wavenet vocoder as well.
Thank you! I got it. |
from wavs to mel spectrogram: |
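Below is a minimal sketch of one way to do this with librosa, using the values discussed in this thread (80 mel bins, 16 kHz audio) plus assumed STFT settings, normalized and clipped to [0, 1] as above:

```python
import librosa
import numpy as np

def wav_to_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=80,
               min_level_db=-100, ref_level_db=20):
    """Load a wav and return a mel spectrogram normalized to [0, 1].

    sr and n_mels follow this thread; the STFT sizes and dB levels are
    assumptions and should match whatever the vocoder was trained with.
    """
    wav, _ = librosa.load(path, sr=sr)
    spec = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = np.dot(mel_basis, spec)
    mel_db = 20.0 * np.log10(np.maximum(1e-5, mel)) - ref_level_db
    mel_norm = np.clip((mel_db - min_level_db) / -min_level_db, 0.0, 1.0)
    return mel_norm.T  # shape: (frames, n_mels)

mel = wav_to_mel("p225_001_16k.wav")  # example path, 16 kHz input
```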
Thanks @miaoYuanyuan for making the preprocessing steps clear! I wanted to experiment with AutoVC and the wavenet vocoder separately, and found this thread really useful. In the end I put my experiments in a notebook and made a git repo of it. It could be useful for those of you who are in the shoes of me-a-week-ago. https://github.com/KnurpsBram/AutoVC_WavenetVocoder_GriffinLim_experiments |