Powered by core technology from three giants of artificial intelligence:

- OpenAI's Whisper, trained on 680,000 hours of multilingual audio
- NVIDIA's BigVGAN, anti-aliased waveform generation for speech
- Microsoft's adapters, highly efficient fine-tuning
An open-source, plug-in singing-voice-library model based on the LoRA principle.

To train the model from scratch on a large amount of data, use the branch lora-svc-for-pretrain.
lora-svc-baker.mp4
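To make the LoRA principle concrete, here is a minimal, illustrative PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. It is not this repo's actual implementation, only the idea the fine-tuning module builds on; all names in it are mine:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank residual (B @ A)."""
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(256, 256, rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2048: only the tiny LoRA matrices are trained
```

Because only the low-rank matrices are trained and saved, the fine-tuning module stays tiny compared with the full generator (compare the 0.94M maxgan_lora.pth with the 54.3M maxgan_g.pth below).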
The following is the customization process based on the pre-trained model.
1. Data preparation: segment the audio into clips shorter than 30 s (about 10 s each is recommended; the cuts need not fall at the ends of sentences), convert the sample rate to 16000 Hz, and put the audio files in ./data_svc/waves. I think you can handle this~~~
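A minimal sketch of this preparation step, assuming the librosa and soundfile packages and a raw_wavs/ source folder (both assumptions of mine, not part of the repo):

```python
import os
import librosa
import soundfile as sf

SRC_DIR, DST_DIR = "raw_wavs", "data_svc/waves"
SR, CHUNK_S = 16000, 10  # target sample rate and ~10 s chunks

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    if not name.endswith(".wav"):
        continue
    audio, _ = librosa.load(os.path.join(SRC_DIR, name), sr=SR, mono=True)  # resample to 16 kHz
    step = SR * CHUNK_S
    for i in range(0, len(audio), step):
        chunk = audio[i:i + step]
        if len(chunk) < SR:  # skip fragments shorter than 1 s
            continue
        out = f"{os.path.splitext(name)[0]}_{i // step:04d}.wav"
        sf.write(os.path.join(DST_DIR, out), chunk, SR)
```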
2. Download the timbre encoder Speaker-Encoder by @mueller91, unzip the file, and put best_model.pth.tar into the directory speaker_pretrain/. Then extract the timbre of each audio file:

python svc_preprocess_speaker.py ./data_svc/waves ./data_svc/speaker
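For orientation, here is a sketch of what this step produces, using the resemblyzer package as a stand-in speaker encoder; the repo uses @mueller91's Speaker-Encoder instead, so the embedding dimension and quality will differ:

```python
from pathlib import Path
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # stand-in, NOT the repo's encoder

encoder = VoiceEncoder()
out_dir = Path("data_svc/speaker")
out_dir.mkdir(parents=True, exist_ok=True)
for wav_path in sorted(Path("data_svc/waves").glob("*.wav")):
    wav = preprocess_wav(wav_path)            # load, resample, trim silence
    embed = encoder.embed_utterance(wav)      # one fixed-size timbre vector per file
    np.save(out_dir / f"{wav_path.stem}.spk.npy", embed)
```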
3. Download the multilingual medium Whisper model; make sure the download is medium.pt, and put it in the folder whisper_pretrain/. Then extract the content code of each audio file:

sudo apt update && sudo apt install ffmpeg
python svc_preprocess_ppg.py -w ./data_svc/waves -p ./data_svc/whisper
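For orientation, content features can be pulled from Whisper's encoder roughly like this with the official openai-whisper package; svc_preprocess_ppg.py in the repo is the authoritative version, and whether it uses exactly these encoder frames is my assumption:

```python
import numpy as np
import torch
import whisper

model = whisper.load_model("medium")                       # the medium.pt weights
audio = whisper.load_audio("data_svc/waves/000001.wav")    # 16 kHz mono float32
audio = whisper.pad_or_trim(audio)                         # Whisper works on 30 s windows
mel = whisper.log_mel_spectrogram(audio).to(model.device)
with torch.no_grad():
    ppg = model.encoder(mel.unsqueeze(0))                  # (1, frames, dim) content code
np.save("data_svc/whisper/000001.ppg.npy", ppg.squeeze(0).cpu().numpy())
```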
4. Extract the pitch and generate the training file filelist/train.txt; at the same time, the first 5 entries of train.txt are split off into filelist/eval.txt:

python svc_preprocess_f0.py
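As an illustration of the pitch side of this step, F0 can be extracted with librosa's pYIN; the repo's svc_preprocess_f0.py may use a different estimator, so treat this as a sketch only:

```python
import numpy as np
import librosa

audio, sr = librosa.load("data_svc/waves/000001.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
f0 = np.nan_to_num(f0)  # pYIN returns NaN for unvoiced frames; store them as 0 Hz
np.save("data_svc/pitch/000001.pit.npy", f0)
```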
5. Take the average of all audio timbres as the timbre of the target speaker, and run the vocal-range analysis:

python svc_preprocess_speaker_lora.py ./data_svc/

This generates two files: lora_speaker.npy and lora_pitch_statics.npy.
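The averaging idea behind this step can be sketched as follows; the output file names come from the repo, but the exact contents of lora_pitch_statics.npy (mean and standard deviation of voiced F0 here) are an assumption of mine:

```python
import glob
import numpy as np

# Average all per-file timbre vectors into one target-speaker timbre.
spk = np.stack([np.load(p) for p in glob.glob("data_svc/speaker/*.spk.npy")])
np.save("data_svc/lora_speaker.npy", spk.mean(axis=0))

# Vocal-range statistics over all voiced frames (assumed layout: [mean, std]).
f0 = np.concatenate([np.load(p) for p in glob.glob("data_svc/pitch/*.pit.npy")])
voiced = f0[f0 > 0]
np.save("data_svc/lora_pitch_statics.npy", np.array([voiced.mean(), voiced.std()]))
```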
6. Download the pre-trained model maxgan_pretrain_5L.pth from the release page and put it in the model_pretrain folder; the pre-trained model contains both the generator and the discriminator. Then start training:

python svc_trainer.py -c config/maxgan.yaml -n lora

To resume training:

python svc_trainer.py -c config/maxgan.yaml -n lora -p chkpt/lora/***.pth
Your file directory should look like this~~~
data_svc/
├── lora_speaker.npy
├── lora_pitch_statics.npy
├── pitch
│   ├── 000001.pit.npy
│   ├── 000002.pit.npy
│   └── 000003.pit.npy
├── speakers
│   ├── 000001.spk.npy
│   ├── 000002.spk.npy
│   └── 000003.spk.npy
├── waves
│   ├── 000001.wav
│   ├── 000002.wav
│   └── 000003.wav
└── whisper
    ├── 000001.ppg.npy
    ├── 000002.ppg.npy
    └── 000003.ppg.npy
How to use

- Use for extremely low-resource scenarios, to prevent overfitting
- Use for plug-in sound-library development
- Do not use for anything else
maolei.mp4
Export the generator; the discriminator is only used during training:

python svc_inference_export.py --config config/maxgan.yaml --checkpoint_path chkpt/lora/lora_0090.pt

The exported model maxgan_g.pth (54.3 MB) is written to the current folder; maxgan_lora.pth (0.94 MB) is the fine-tuning module.
python svc_inference.py --config config/maxgan.yaml --model maxgan_g.pth --spk ./data_svc/lora_speaker.npy --wave test.wav

The generated file svc_out.wav is in the current directory; svc_out_pitch.wav is generated at the same time to visually display the pitch-extraction result.
What? The resulting sound is not quite like the target!

1. Statistics of the speaker's vocal range: step 5 of training generates lora_pitch_statics.npy.

2. Inference with the vocal-range offset, by specifying the statics parameter:

python svc_inference.py --config config/maxgan.yaml --model maxgan_g.pth --spk ./data_svc/lora_speaker.npy --statics ./data_svc/lora_pitch_statics.npy --wave test.wav
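Why the statics file helps: if the source singer's average F0 differs from the target speaker's, shifting the source pitch by that gap (measured in semitones) moves the converted voice into the target's natural range. A minimal sketch of the offset computation, again assuming the statics file stores [mean, std]:

```python
import numpy as np

source_f0_mean = 220.0  # hypothetical: average voiced F0 of test.wav, in Hz
target = np.load("data_svc/lora_pitch_statics.npy")  # assumed layout: [mean, std]
shift = 12 * np.log2(target[0] / source_f0_mean)     # semitone offset between ranges
print(f"shift source pitch by {shift:+.1f} semitones")
```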
Expand the frequency band of the output with either

python svc_bandex_gpu.py -w svc_out.wav

or

python svc_bandex.py -w svc_out.wav

Either command generates svc_out_48k.wav in the current directory.
Download the pre-trained vocoder-based enhancer from the DiffSinger Community Vocoder Project and extract it to the folder nsf_hifigan_pretrain/.

NOTE: you should download the zip file with nsf_hifigan in the name, not nsf_hifigan_finetune.

Copy the svc_out_48k.wav generated by frequency expansion to path\to\input\wavs, then run

python svc_val_nsf_hifigan.py

The enhanced files are generated in path\to\output\wavs.
- Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
- AdaSpeech: Adaptive Text to Speech for Custom Voice
- https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
- https://github.com/mindslab-ai/univnet [paper]
- https://github.com/openai/whisper/ [paper]
- https://github.com/NVIDIA/BigVGAN [paper]
- https://github.com/brentspell/hifi-gan-bwe
- https://github.com/openvpi/DiffSinger
- https://github.com/chenwj1989/pafx
If you adopt the code or ideas of this project, please credit it in your project; this is the basic criterion for keeping the open-source spirit alive.