Training new language #2
Hi @thewh1teagle, I'd like to help out. I've successfully trained an OptiSpeech model for Indonesian. It's quite simple to add support for a new language: you mainly have to add a new text processor (i.e. add the language code in the phonemization step) and then create a new training recipe that uses that text processor. Here's an example, my Indonesian recipe: bookbot-hive@b5b1198. For my training, I used 43 minutes of data (~800 utterances) and trained for 2M steps (same as the default config). Hope this helps! |
@w11wo How long did it take you to train for 2M steps? That sounds like a lot. Did you use a Colab A100 GPU? Are you satisfied with the results? Is it as good as the English one here?
If my audio files are from a single speaker and I change the sample rate to match the config, should I change anything else in your configuration? |
Hi @thewh1teagle.
It took me ~3 days. I used an RTX 3090 GPU (my company's local machine).
Yes, definitely. You can also modify the phonemizer backend if needed. I've successfully integrated gruut as my English phonemizer, for instance.
I've been using LightSpeech for on-device TTS, and I find OptiSpeech to be much better than LS given its size. It's probably not as good as the English sample, likely because I used a small dataset, but so far I have no issues with it.
My config is also 1 speaker and I've changed the sample rate to 44.1kHz. Feel free to adapt to your setup. If you need to change the rate, you can modify the feature extractor to your desired config. Good luck! |
@w11wo did a good job of explaining what's needed. The model is built from the ground up to support languages other than English. If you want to train with a custom text front-end, do the following:
|
Thank you all. I'll start soon and check it. I have 6 hours of high quality audio along with their diacritized transcriptions. My primary concern is the accuracy of espeak-ng. If espeak-ng turns out to be inaccurate, would it be advisable to train on diacritized characters instead? How challenging would it be to fix or improve espeak-ng? |
@thewh1teagle If you have an alternative phonemizer you can easily adopt it. Otherwise you can go with raw characters instead, but you'll need to handle numbers and abbreviations independently. |
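For the raw-character route mentioned above, here is a minimal sketch of a character-level front-end. All names in it (`normalize`, `build_vocab`, `encode`, `find_digits`) are hypothetical and not part of OptiSpeech; it only illustrates the idea of mapping diacritized text to IDs and flagging digits that would still need to be spelled out by hand.

```python
import re
import unicodedata

# Hypothetical helper names for illustration; none of this is OptiSpeech API.
PAD, UNK = "<pad>", "<unk>"


def normalize(text: str) -> str:
    """Apply NFC so equivalent letter + diacritic sequences share one form."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def build_vocab(transcripts: list[str]) -> dict[str, int]:
    """Assign an integer ID to every character seen in the training text."""
    chars = sorted({ch for line in transcripts for ch in normalize(line)})
    return {sym: i for i, sym in enumerate([PAD, UNK] + chars)}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map text to IDs; characters unseen in training fall back to UNK."""
    return [vocab.get(ch, vocab[UNK]) for ch in normalize(text)]


def find_digits(text: str) -> list[str]:
    """Report digit runs, which have to be expanded to words separately."""
    return re.findall(r"\d+", text)
```

Running `find_digits` over the transcripts before training gives a quick list of lines that still contain numbers.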
FYI, I do have a demo for a trained model for OptiSpeech. |
I have added the missing phonemes of Hebrew to my espeak-ng fork, but I noticed that the repository relies on the Piper phonemizer. How can I provide a metadata file with the phonemes directly? Additionally, there are many dependencies that don't seem directly related to the training process. I believe the repository could be improved by separating it into multiple packages, similar to Rust crates. For example, one package could focus on preprocessing, another on inference, and so on. This would make it easier to identify the important parts to focus on for training. I would be happy to contribute to this restructuring once I fully understand the repository. However, I'm unsure if starting training without a deeper understanding is a good idea. I have 3 hours of 22.05 kHz mono audio (2-10 second clips) with corresponding dotted transcriptions and a ready-to-use fork of espeak-ng. I hope that's all I need to get started. |
Thanks! It sounds really good. |
|
Thanks. It turns out that I can easily use my custom espeak-ng from within the piper phonemizer package. I started training, but it failed with an error:
self.on_run_start()
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 328, in on_run_start
call._call_lightning_module_hook(trainer, "on_train_start")
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/root/home/optispeech/optispeech/model/base_lightning_module.py", line 73, in on_train_start
if self._opti_reset_optim_and_lr:
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'OptiSpeech' object has no attribute '_opti_reset_optim_and_lr'
[2024-09-07 19:37:35,831][optispeech.utils.generic][INFO] - Output dir: /root/home/optispeech/data/saspeech/logs/train/saspeech_he/runs/2024-09-07_19-37-06
Error executing job with overrides: ['experiment=saspeech-he', 'model.train_args.evaluate_utmos=false', 'data.batch_size=32', 'data.num_workers=8', 'data.train_filelist_path=data/saspeech/train.txt', 'data.valid_filelist_path=data/saspeech/val.txt', 'callbacks.model_checkpoint.every_n_epochs=5', 'paths.log_dir=data/saspeech/logs']
Traceback (most recent call last):
File "/root/home/optispeech/optispeech/train.py", line 114, in main
metric_dict, _ = train(cfg)
File "/root/home/optispeech/optispeech/utils/generic.py", line 88, in wrap
raise ex
File "/root/home/optispeech/optispeech/utils/generic.py", line 78, in wrap
metric_dict, object_dict = task_func(cfg=cfg)
File "/root/home/optispeech/optispeech/train.py", line 81, in train
trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
call._call_and_handle_interrupt(
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
results = self._run_stage()
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
self.fit_loop.run()
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
self.on_run_start()
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 328, in on_run_start
call._call_lightning_module_hook(trainer, "on_train_start")
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/root/home/optispeech/optispeech/model/base_lightning_module.py", line 73, in on_train_start
if self._opti_reset_optim_and_lr:
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'OptiSpeech' object has no attribute '_opti_reset_optim_and_lr'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Training: | | 0/? [00:01<?, ?it/s]
This is everything I changed to start training on my dataset:
Also, do I need to preprocess again if I already have the statistics, or is preprocessing used only for the statistics? |
|
Oh, I didn't notice the json files there. It would be useful if the json included the phonemes, not just their IDs, to make it easy to check that the phonemes are correct.
Do the configs I used look correct? This is basically the data I have:
$ ls saspeech/train/wav/ | wc -l
2688
$ ls saspeech/val/wav/ | wc -l
298
$ ffmpeg -i saspeech/train/wav/gold_001_line_128.wav
44100 Hz, mono, s16, 705 kb/s
$ head -n 1 saspeech/train/metadata.csv
gold_001_line_124|בְּבִנְיָינִים מִסְפָּר 36 וְ-38 בִּרְחוֹב סַרְלִין בְּחוֹלוֹן,
$ head -n 1 saspeech/val/metadata.csv
gold_000_line_000|שָׁלוֹם, צְלִיל אַבְרָהָם.
Also, it turns out the sample rate of my data is
Another CUDA error, with data.batch_size=5 data.num_workers=2:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.44 GiB. GPU 0 has a total capacity of 23.60 GiB of which 1.66 GiB is free. Process 2892091 has 21.93 GiB memory in use. Of the allocated memory 9.52 GiB is allocated by PyTorch, and 12.07 GiB is reserved by PyTorch but unallocated.
nvidia-smi
Sat Sep 7 20:44:59 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:04:00.0 Off | N/A |
| 44% 48C P8 25W / 350W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Update: I'm able to start training with a batch size of 2 and 1 worker; otherwise it runs out of memory, although I have 24GB of GPU RAM with rtx4090ti. 5 minutes for a single epoch:
Epoch 0: 27%|█████████████████▋ | 367/1344 [00:58<02:35, 6.28it/s
Does the model training in this repository have advantages over training with piper? |
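Since the `metadata.csv` samples above use an `id|text` layout with one wav per ID, a quick sanity check before preprocessing can catch broken lines and missing audio. This is only a sketch under that assumption; `read_metadata` is a hypothetical helper, not something from the repository, and it assumes each wav is named `<id>.wav`.

```python
from pathlib import Path


def read_metadata(metadata_path: Path, wav_dir: Path) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (entries, problems) for an `id|text` metadata file."""
    entries: list[tuple[str, str]] = []
    problems: list[str] = []
    lines = metadata_path.read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, 1):
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first pipe only, so the text itself may contain "|".
        utt_id, sep, text = line.partition("|")
        if not sep or not text.strip():
            problems.append(f"line {line_no}: expected 'id|text'")
            continue
        # Assumption: audio for an entry lives at <wav_dir>/<id>.wav.
        if not (wav_dir / f"{utt_id}.wav").is_file():
            problems.append(f"line {line_no}: missing {utt_id}.wav")
            continue
        entries.append((utt_id, text.strip()))
    return entries, problems
```

Calling it as `read_metadata(Path("saspeech/train/metadata.csv"), Path("saspeech/train/wav"))` returns the usable entries plus a list of human-readable problems.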
@thewh1teagle the difference between this and piper is that this is based on JETS, not VITS. JETS is considered by many to be better than VITS in quality. |
I even tried on an A100 with 40GB of VRAM, and I got an out-of-memory error with this:
python -m optispeech.train \
experiment="saspeech-he" \
model.train_args.evaluate_utmos=false \
data.batch_size=16 \
data.num_workers=8 \
data.train_filelist_path="data/saspeech/train.txt" \
data.valid_filelist_path="data/saspeech/val.txt" \
callbacks.model_checkpoint.every_n_epochs=5 \
paths.log_dir="data/saspeech/logs"
I don't think I can get anything more powerful than that. What can I do?
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.64 GiB. GPU 0 has a total capacity of 39.50 GiB of which 12.78 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.59 GiB is allocated by PyTorch, and 4.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I also tried setting this environment variable
Maybe my dataset is too large? (4 hours of audio) |
@thewh1teagle |
I checked, but I don't have any wav file longer than 20 seconds:
from pathlib import Path
import librosa
files = [i for i in Path('wavs').glob('*')]
max = 0
for file in files:
duration = librosa.get_duration(filename=str(file))
if duration > max:
max = duration
print(max)
My maximum duration is
Are there other reasons that could cause that? Example of a long utterance:
{'text': 'רַק אִם הַיְּלָדִים שֶׁהוֹלְכִים לִלְמוֹד בּוֹ לֹא הוֹלְכִים לִלְמוֹד בִּגְלַל הַשִּׁיוּחַ הַמַּעֲמָדִי, אוֹ הַכַּלְכָּלִי, אוֹ הֵעַדְתִּי שֶׁלָּהֶם. הוּא יָכוֹל לִהְיוֹת דָּבָר טוֹב אִם הַיְּלָדִים שֶׁהוֹלְכִים לִלְמוֹד בּוֹ, הוֹלְכִים לִלְמוֹד בּוֹ מֵהַבְּחִירָה שֶׁלָּהֶם,', 'phonemes': 'rˈak ʔˈim hajlˈadim ʃehˈolχim lˈilmod bˈo lˈoʔ hˈolχim lˈilmod bˈiɡlal hˌaʃiəjˈuχa hˌamaʔamˈadij, ʔˈo hˌakalkˈalij, ʔˈo heʔˈadti ʃelˈahem.'}
Is that too long? As you can see, it's dotted text, so it should be much longer than the equivalent in English. |
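A variant of the duration check above that lists the longest clips by name, so outliers can be inspected or dropped. This is a sketch using only the standard-library `wave` module, which assumes plain PCM wav files (the s16 mono files described earlier in the thread would qualify); `wav_duration` and `longest_clips` are hypothetical helper names.

```python
import wave
from pathlib import Path


def wav_duration(path: Path) -> float:
    """Duration in seconds, computed from the frame count and sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def longest_clips(wav_dir: Path, top_n: int = 5) -> list[tuple[float, str]]:
    """Return the top_n (duration, file name) pairs, longest first."""
    durations = [(wav_duration(p), p.name) for p in wav_dir.glob("*.wav")]
    return sorted(durations, reverse=True)[:top_n]
```

Printing `longest_clips(Path("wavs"))` shows which files dominate the batch length, since the longest item in a batch typically determines how much padding every other item gets.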
@thewh1teagle, I'm training on 1 x RTX 4090 with 24GB of VRAM. I reduced the batch size to 8, which still trained fine. Perhaps you could try reducing the batch size to 8? |
I think a batch size of 8 worked for me too, but then the training seems to be very slow. It takes a few minutes for a single epoch. Is that expected?
Epoch 0: 27%|█████████████████▋ | 367/1344 [00:58<02:35, 6.28it/s
If that's expected, when can I try the model, even just to see if it makes some sound? (I don't want to train for many days before seeing that things are going in the right direction.) |
Yes, it is expected, my training took 11 mins/epoch
You can have a look at the TensorBoard's Audio tab every now and then to see how it sounds on the eval set. I suggest waiting until at least 100k-300k steps to hear how it sounds, and this shouldn't take too long. You can modify this value in the config to control how frequently to save the checkpoint:
trainer:
  check_val_every_n_epoch: YOUR VALUE HERE |
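To turn a step target like the 100k mentioned above into a number of epochs, the arithmetic is simple. The sketch below uses the figures from this thread (2688 training utterances, which at batch size 2 gives the 1344 iterations per epoch shown in the progress bar). How many logged steps one batch produces is an assumption exposed as `steps_per_batch`: later in the thread, 2M steps is described as 1M generator plus 1M discriminator steps, which suggests the counter may advance twice per batch.

```python
import math


def steps_per_epoch(num_utterances: int, batch_size: int) -> int:
    """Batches per epoch, counting the last partial batch."""
    return math.ceil(num_utterances / batch_size)


def epochs_to_reach(target_steps: int, num_utterances: int, batch_size: int,
                    steps_per_batch: int = 1) -> int:
    """Epochs needed until the step counter reaches target_steps.

    steps_per_batch is an assumption: use 2 if generator and discriminator
    updates are counted as separate steps.
    """
    per_epoch = steps_per_epoch(num_utterances, batch_size) * steps_per_batch
    return math.ceil(target_steps / per_epoch)
```

With these numbers, `epochs_to_reach(100_000, 2688, 8)` gives 298 epochs, or 149 with `steps_per_batch=2`, which helps in picking a sensible `check_val_every_n_epoch`.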
Thanks! |
I'm on 60k steps. It doesn't sound good, but I can understand it :)
python3 -m optispeech.infer checkpoint_epoch\=192_step\=128696.ckpt hello output
return self.method(cls, *args, **kwargs)
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1582, in load_from_checkpoint
loaded = _load_from_checkpoint(
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/core/saving.py", line 63, in _load_from_checkpoint
checkpoint = pl_load(checkpoint_path, map_location=map_location)
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/fabric/utilities/cloud_io.py", line 60, in _load
return torch.load(
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1097, in load
return _load(
File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1526, in _load
del torch._utils._thread_local_state.map_location
AttributeError: map_location
By the way, are you open to accepting PRs in this repository? |
@thewh1teagle |
Is there a workaround? I can't continue the training or infer it |
@thewh1teagle I did a very ugly hack whereby I removed the offending line :D |
Any way, I'll resolve these versioning issues next weekend. |
Nice! So the latest code should work fine with that, and I can continue from the checkpoint I have?
It turns out I'm already on the latest main, but it still failed:
/root/home/optispeech/optispeech/train.py(81)train()
-> ckpt_model = model_cls.load_from_checkpoint(ckpt_path, map_location="cpu")
(Pdb) n
AttributeError: map_location
Update: |
Thanks so much for the help! After resolving the issue with loading the model by changing PyTorch to version 2.3.1, I was finally able to export it to onnx. The model is currently at 50k training steps, and I tested it on several Hebrew sentences. I'm already impressed! It generates complete sentences in less than 0.1 seconds on a CPU with onnx runtime, and even at this stage, every word is understandable with quite good pronunciation. The only aspect that seems to need improvement is the overall audio quality, which I expect will get better as the training progresses. Do you think that once I reach 9M steps, I'll be able to fine-tune the model with a new voice to change the speaker? |
@thewh1teagle I suspect the alignment module. At least in Matcha-TTS MAS has a lot of overhead. |
This is literally all I added: thewh1teagle@aee021a
And the command I use:
python -m optispeech.train \
experiment="saspeech-he" \
model.train_args.evaluate_utmos=false \
data.batch_size=16 \
data.num_workers=8 \
data.train_filelist_path="data/saspeech/train.txt" \
data.valid_filelist_path="data/saspeech/val.txt" \
callbacks.model_checkpoint.every_n_epochs=5 \
paths.log_dir="data/saspeech/logs" \
callbacks.model_checkpoint.save_last=True \
ckpt_path="last.ckpt"
4 hours of audio at 44.1 kHz. Every step is sloowww... 1 second.
@w11wo said that it took him 3 days for 1M steps. In that case it should be much faster than 1 step per second. If I just remove the |
Hi, |
Hey :) |
@thewh1teagle can you confirm that the speed up continues to steps beyond 1000? |
You're right. The first 1000 steps are fast, around 10 it/s, and then it slows down to 1.42 it/s (steps per second).
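To put the two throughput figures above into wall-clock terms, here is a small worked calculation. `hours_to_reach` is a hypothetical helper, and it assumes the rate stays constant for the rest of training, which the slowdown after the first 1000 steps shows is not true early on.

```python
def hours_to_reach(target_steps: int, current_step: int, steps_per_second: float) -> float:
    """Hours of training left at a constant rate of steps_per_second."""
    if steps_per_second <= 0:
        raise ValueError("steps_per_second must be positive")
    remaining = max(target_steps - current_step, 0)
    return remaining / steps_per_second / 3600
```

At 1.42 it/s, 1M steps from scratch comes to roughly 196 hours (a bit over 8 days), whereas the initial 10 it/s would have meant about 28 hours.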
@thewh1teagle |
Discriminators operate in the time domain dealing with the raw waveform, whereas the generator operates on a compact representation. |
In that case, is that expected on an RTX 3090?
That's how it sounds on
I don't know if you can notice, because it's Hebrew and may sound weird anyway, but there's some white noise / vibration (the dataset is clean). output.mp4 |
@thewh1teagle |
Seems like the default one? I use this
defaults:
- default
- _self_
sample_rate: 44100
n_feats: 80
n_fft: 2048
hop_length: 512
win_length: 2048
f_min: 20
f_max: 11025
|
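A quick way to sanity-check a feature-extractor config like the one above is to verify a few relationships between the values. This is a generic sketch, not OptiSpeech code, and `check_feature_config` and `frames_per_second` are hypothetical names. It checks that the window fits in the FFT size, that the hop does not skip samples, and that the mel range stays below the Nyquist frequency.

```python
def check_feature_config(sample_rate: int, n_fft: int, hop_length: int,
                         win_length: int, f_min: float, f_max: float) -> list[str]:
    """Return a list of problems found in the STFT/mel settings."""
    problems = []
    nyquist = sample_rate / 2
    if win_length > n_fft:
        problems.append("win_length must not exceed n_fft")
    if hop_length > win_length:
        problems.append("hop_length larger than win_length skips samples")
    if not 0 <= f_min < f_max:
        problems.append("need 0 <= f_min < f_max")
    if f_max > nyquist:
        problems.append(f"f_max exceeds Nyquist ({nyquist:g} Hz)")
    return problems


def frames_per_second(sample_rate: int, hop_length: int) -> float:
    """Mel frames produced per second of audio."""
    return sample_rate / hop_length
```

The 44.1 kHz config above passes these checks and yields about 86 frames per second. Its `f_max` of 11025 Hz is half the Nyquist frequency of 22050 Hz, so the mel features do not describe content above that.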
@thewh1teagle would you mind telling me the exact value of the |
Oh, I updated the branch from your latest code while resuming training. I hope that I started with the same one. Should I recreate the statistics in any case? Currently my default.yaml:
_target_: optispeech.dataset.feature_extractors.CommonFeatureExtractor
sample_rate: 24000
n_feats: 80
n_fft: 2048
hop_length: 300
win_length: 1200
f_min: 80
f_max: 8000
center: false
pitch_extractor:
  _target_: optispeech.dataset.feature_extractors.pitch_extractors.EnsemblePitchExtractor
  _partial_: true
  batch_size: 2048
  interpolate: true
preemphasis_filter_coef: null # apply preemphasis filter
lowpass_freq: null
highpass_freq: null
gain_db: null
trim_silence: false
trim_silence_args:
  silence_threshold: 0.2
  silence_samples_per_chunk: 480
  silence_keep_chunks_before: 2
  silence_keep_chunks_after: 2
|
@thewh1teagle Unfortunately, there is no way other than comparing file dates with commit date. |
Thanks. I'll check and hope I've used it already. |
@thewh1teagle |
Hi @thewh1teagle. Just to be clear, for my training to reach 2M steps (which is 1M step for generator, 1M step for discriminator) in ~3 days, I didn't use the default model (Transformer backbone). I used the ConvNext backbone. Hope this helps. |
@thewh1teagle the numbers you gave are consistent with my experience. Yeah, same numbers. |
@thewh1teagle regarding memory usage... I faced the same OOM error on one dataset. Whatever I do, it seems like I cannot make it train with a batch size of 16. The strange thing is that it only happens with some datasets, not all. |
Interesting! What if you change the PyTorch version like I did?
rye add torchaudio==2.3.1 |
@thewh1teagle did it help you? by how much? |
I could train with a batch size of 16 instead of 8 without OOM. I'm not 100% sure that 16 worked thanks to this version downgrade, but maybe. By the way, since the batch size didn't change the speed much, I decided to rent a cheaper GPU, an RTX 3060 Ti with 8GB of RAM, and I changed the batch size to 4. Same speed as with the RTX 4090 with 24GB of VRAM. |
I created this class to improve pronunciation handling, and it sounds much better now by adding pauses after
As for SSML parsing, did you find a solution to support it?
import re
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    next_pause: float


class SegmentExtractor:
    def __init__(self, default_pause: float = 0.02, question_pause: float = 0.05, period_pause: float = 0.05, new_line_pause: float = 0.3):
        self.default_pause = default_pause
        self.question_pause = question_pause
        self.period_pause = period_pause
        self.new_line_pause = new_line_pause

    def extract_segments(self, text: str) -> list[Segment]:
        segments: list[Segment] = []
        # The capture group keeps each delimiter, so items alternate: sentence, punctuation, ...
        sentences = re.split(r'([.?!:\n])', text)
        for i in range(0, len(sentences) - 1, 2):
            sentence = sentences[i].strip()
            punctuation = sentences[i + 1]
            if sentence:  # Ensure the sentence is not empty
                if punctuation == '.':
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.period_pause))
                elif punctuation == '?':
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.question_pause))
                elif punctuation == '\n':
                    # End the line with a period instead of keeping the raw newline
                    segments.append(Segment(text=f"{sentence}.", next_pause=self.new_line_pause))
                else:
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.default_pause))
        # Text after the last delimiter has no punctuation of its own
        last_sentence = sentences[-1].strip()
        if last_sentence:
            segments.append(Segment(text=f"{last_sentence}.", next_pause=self.default_pause))
        return segments |
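One way the `next_pause` values from the class above could be applied is to synthesize each segment separately and stitch the audio together with silence in between. The sketch below shows only the stitching part on plain Python lists of samples. `join_with_pauses` is a hypothetical helper, and treating the pause as seconds is an assumption about how the class is meant to be used.

```python
def join_with_pauses(chunks: list[list[float]], pauses: list[float],
                     sample_rate: int) -> list[float]:
    """Concatenate audio chunks, adding silence after each one.

    pauses[i] is assumed to be in seconds and is applied after chunks[i].
    """
    if len(chunks) != len(pauses):
        raise ValueError("need exactly one pause per chunk")
    out: list[float] = []
    for samples, pause in zip(chunks, pauses):
        out.extend(samples)
        # Silence is represented as zero-valued samples.
        out.extend([0.0] * round(pause * sample_rate))
    return out
```

With real synthesis output, `chunks` would hold the waveform of each segment and `pauses` would come from `[s.next_pause for s in segments]`.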
@thewh1teagle which class are you referring to? |
Click on the details button at the end of the comment :) |
I just created this repository with the new TTS model. There are some artifacts, but even at this stage it's the best TTS in Hebrew in terms of pronunciation and speed! https://github.com/thewh1teagle/israwave @mush42 |
Hey!
Thanks for sharing this. The repository looks well organized.
I would like to train a new language: Hebrew.
Can you add information to the steps about training on a new language?
I have 6 hours of high-quality wav files along with their corresponding diacritized transcriptions.
Does that sound like enough data for training? Do you know how many steps I'll need?
It looks like you trained it on English. Have you tried another language, such as Arabic?
Thanks