Training new language #2

Closed
thewh1teagle opened this issue Sep 1, 2024 · 67 comments
@thewh1teagle

Hey!
Thanks for sharing this; the repository looks well organized.

I would like to train a new language: Hebrew.
Could you add information to the steps about training on a new language?

I have 6 hours of high-quality WAV files along with their corresponding diacritized transcriptions.
Does that sound like enough data for training? Do you know how many steps I'll need?
It looks like you trained it on English; have you tried another language, such as Arabic?

Thanks

@w11wo commented Sep 2, 2024

Hi @thewh1teagle, I'd like to help out.

I've successfully trained an OptiSpeech model for Indonesian (id), as shown here #1 (comment).

It's quite simple to add support for a new language. You mainly just have to add a new text processor (i.e. add the language code during the phonemization step), and then designate a new training recipe that uses that text processor. Here's an example of my Indonesian recipe bookbot-hive@b5b1198.

For my training, I used 43 mins of data (~800 utterances) and trained it for 2M steps (same as default config).

Hope this helps!

@thewh1teagle (Author)

@w11wo
Thanks for your help!

How long did it take you to train for 2M steps? That sounds like a lot. Did you use a Colab A100 GPU?
I noticed that espeak-ng is used both in preprocessing and inference, so I need to make sure that espeak-ng's Hebrew phonemization is good. Hebrew is a complex language with many unusual rules...

Are you satisfied with the results? Is it as good as the English one here?

Here's an example of my Indonesian recipe bookbot-hive@b5b1198.

If my audio files are from a single speaker and I change the sample rate to match the config, should I change anything else in your configuration?

@w11wo commented Sep 2, 2024

Hi @thewh1teagle.

How long did it take you to train for 2M steps? That sounds like a lot. Did you use a Colab A100 GPU?

It took me ~3 days. I used an RTX 3090 GPU (my company's local machine).

I noticed that espeak-ng is used both in preprocessing and inference, so I need to make sure that espeak-ng's Hebrew phonemization is good. Hebrew is a complex language with many unusual rules...

Yes, definitely. You can also modify the phonemizer backend if needed. I've successfully integrated gruut as my English phonemizer, for instance.

Are you satisfied with the results? Is it as good as the English one here?

I've been using LightSpeech for on-device TTS, and I find OptiSpeech to be much better than LS given its size. It's probably not as good as the English sample, likely because I used a small dataset, but so far I have no issues with it.

If my audio files are from a single speaker and I change the sample rate to match the config, should I change anything else in your configuration?

My config is also single-speaker, and I changed the sample rate to 44.1kHz. Feel free to adapt it to your setup. If you need to change the rate, you can modify the feature extractor to your desired config.

Good luck!

@mush42 (Owner) commented Sep 3, 2024

@thewh1teagle

@w11wo did a good job in explaining what's needed.

The model is built from the ground up to support other languages besides English.

If you want to train with a custom text front-end, do the following:

  1. Create a custom class in optispeech.text.tokenizers that follows the interface of optispeech.text.tokenizers.BaseTokenizer. Most likely you need to override the __call__ method, which accepts the text and the language (which can be None for unilingual front-ends) and should return integer IDs for the given text (see the sketch after this list).
  2. In your dataset config, add the following:
...
text_processor:
  tokenizer_name: <the name you gave to your custom tokenizer>
  • Regarding dataset size, I think 6 hours is quite enough.
  • Regarding training time, I think it depends on which backbone you use. ConvNeXt is the slowest to train, but it gives a nice balance between inference speed and quality, while LightSpeech is fast to train and infer but its quality is lower than ConvNeXt's.
  • It took me between 7 and 10 days to train the ConvNeXt-based model for 500K steps, but stopping at 300K steps gives good quality.
  • Note that the default total training steps are 2M: 1M for the generator and 1M for the discriminator. So in the end, the model is only trained for 1M steps.
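
To make step 1 concrete, here is a minimal sketch of such a tokenizer. It assumes only the interface described above; the base-class details, the name attribute, and the symbol inventory are illustrative, not taken from the repository.

from optispeech.text.tokenizers import BaseTokenizer

# Illustrative symbol subset only; a real front-end would cover the full
# phoneme inventory of the target language.
SYMBOLS = ["_", " ", "a", "e", "i", "o", "u", "ʔ", "χ", "ʃ", "r", "k", "l", "m"]
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

class HebrewTokenizer(BaseTokenizer):
    name = "hebrew"  # referenced as tokenizer_name in the dataset config (assumed mechanism)

    def __call__(self, text, language=None):
        # `language` may be None for a unilingual front-end.
        # Replace this character pass-through with a real phonemization step.
        phonemes = list(text)
        return [SYMBOL_TO_ID[p] for p in phonemes if p in SYMBOL_TO_ID]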

mush42 closed this as completed Sep 3, 2024
@thewh1teagle (Author)

Thank you all. I'll start soon and check it.

I have 6 hours of high quality audio along with their diacritized transcriptions. My primary concern is the accuracy of espeak-ng. If espeak-ng turns out to be inaccurate, would it be advisable to train on diacritized characters instead? How challenging would it be to fix or improve espeak-ng?

@mush42 (Owner) commented Sep 5, 2024

@thewh1teagle If you have an alternative phonemizer you can easily adopt it. Otherwise you can go with raw characters instead, but you'll need to handle numbers and abbreviations independently.
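
If you go the raw-character route, the numbers part could be handled with something like the sketch below, assuming num2words supports your language code (swap in your own expansion rules if it doesn't):

import re

from num2words import num2words

def expand_numbers(text: str, lang: str = "he") -> str:
    # Spell out each digit run before character-level tokenization,
    # so the model never sees raw digits.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)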

@mush42 (Owner) commented Sep 5, 2024

FYI, I have a demo of a trained OptiSpeech model.

Find it here

@thewh1teagle (Author) commented Sep 6, 2024

If you have an alternative phonemizer you can easily adopt it.

I have added the missing phonemes of Hebrew to my espeak-ng fork, but I noticed that the repository relies on the Piper phonemizer. How can I provide a metadata file with the phonemes directly?

Additionally, there are many dependencies that don't seem directly related to the training process. I believe the repository could be improved by separating it into multiple packages, similar to Rust crates. For example, one package could focus on preprocessing, another on inference, and so on. This would make it easier to identify the important parts to focus on for training.

I would be happy to contribute to this restructuring once I fully understand the repository. However, I'm unsure if starting training without a deeper understanding is a good idea. I have 3 hours of 22.05 kHz mono audio (2-10 second clips) with corresponding dotted transcriptions and a ready-to-use fork of espeak-ng. I hope that's all I need to get started.

@thewh1teagle (Author)

FYI, I have a demo of a trained OptiSpeech model.

Thanks! It sounds really good.
It will be interesting to hear how it performs in Hebrew... There is no open Hebrew model at this level yet, and the language is relatively complex.

@mush42 (Owner) commented Sep 7, 2024

@thewh1teagle

  • Regarding your eSpeak-ng fork, you can implement a custom class that calls it via a subprocess, or call the lib via ctypes (a subprocess sketch follows the commands below). If you want the punctuation handling of piper-phonemize, you can patch and build it yourself; it is only one Docker command, see build instructions here.
  • Keeping it simple for now: I created an inference-only package; for training/development, use the main package.
  • Regarding training, it is only three commands, as long as all custom components are ready.
    This is what I do:
  1. Preprocess the dataset:
python -m optispeech.tools.preprocess_dataset \
    <my_dataset_config> \
    <my_dataset_directory> \
    <processed_dataset_directory>
  2. Generate dataset statistics and update the config with them:
python -m optispeech.tools.generate_data_statistics <my_dataset_config>
  3. Start training:
python -m optispeech.train \
    experiment="<my_experiment_config>" \
    model.train_args.evaluate_utmos=false \
    data.batch_size=32 \
    data.num_workers=8 \
    data.train_filelist_path="<processed_dataset_directory>/train.txt" \
    data.valid_filelist_path="<processed_dataset_directory>/val.txt" \
    callbacks.model_checkpoint.every_n_epochs=5  \
    paths.log_dir="<logs_directory>"
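
As a sketch of the subprocess route mentioned above (the -q, --ipa, and -v flags are standard espeak-ng CLI options; point espeak_bin at the patched fork's binary):

import subprocess

def phonemize_with_espeak(text: str, voice: str = "he",
                          espeak_bin: str = "espeak-ng") -> str:
    """Return an IPA phoneme string for `text` via the espeak-ng CLI.

    -q suppresses audio output, --ipa prints IPA phonemes, and -v selects
    the voice/language.
    """
    result = subprocess.run(
        [espeak_bin, "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()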

@thewh1teagle (Author)

Thanks. It turns out that I can easily use my custom espeak-ng from within the Piper phonemizer package.

I started training, but it failed with an error.
Also, it looks like train.txt and val.txt don't contain the transcriptions/phonemes, only the paths to the WAV files. Is that expected? Where does it take the transcriptions from?

[2024-09-07 19:37:35,831][optispeech.utils.generic][INFO] - Output dir: /root/home/optispeech/data/saspeech/logs/train/saspeech_he/runs/2024-09-07_19-37-06
Error executing job with overrides: ['experiment=saspeech-he', 'model.train_args.evaluate_utmos=false', 'data.batch_size=32', 'data.num_workers=8', 'data.train_filelist_path=data/saspeech/train.txt', 'data.valid_filelist_path=data/saspeech/val.txt', 'callbacks.model_checkpoint.every_n_epochs=5', 'paths.log_dir=data/saspeech/logs']
Traceback (most recent call last):
  File "/root/home/optispeech/optispeech/train.py", line 114, in main
    metric_dict, _ = train(cfg)
  File "/root/home/optispeech/optispeech/utils/generic.py", line 88, in wrap
    raise ex
  File "/root/home/optispeech/optispeech/utils/generic.py", line 78, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/root/home/optispeech/optispeech/train.py", line 81, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
    self.on_run_start()
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 328, in on_run_start
    call._call_lightning_module_hook(trainer, "on_train_start")
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/root/home/optispeech/optispeech/model/base_lightning_module.py", line 73, in on_train_start
    if self._opti_reset_optim_and_lr:
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'OptiSpeech' object has no attribute '_opti_reset_optim_and_lr'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Training: |          | 0/? [00:01<?, ?it/s]

This is everything I changed to start training on my dataset:

thewh1teagle@2882a53

Also, do I need to preprocess again now that I have the statistics, or was that step used only to compute the statistics?

@mush42 (Owner) commented Sep 7, 2024

@thewh1teagle

  • The processed data is stored in JSON and NPZ files, for text and arrays respectively.
  • You don't need to preprocess again; the stats are used during training.

@thewh1teagle (Author) commented Sep 7, 2024

  • The processed data is stored in JSON and NPZ files, for text and arrays respectively.

Oh, I hadn't noticed the JSON files there. It would be useful if the JSON included the phonemes, not just their IDs, to make it easy to check that the phonemes are correct.
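
In the meantime, a quick sanity check could map the stored IDs back to symbols; everything here (file path, key name, symbol table) is hypothetical and needs adjusting to the actual JSON schema:

import json

# Hypothetical: invert your tokenizer's symbol-to-ID table.
SYMBOLS = ["_", " ", "a", "e", "i", "o", "u"]
ID_TO_SYMBOL = dict(enumerate(SYMBOLS))

with open("processed/train/gold_001_line_124.json") as f:  # hypothetical path
    item = json.load(f)
print("".join(ID_TO_SYMBOL.get(i, "?") for i in item["phoneme_ids"]))  # key name assumed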

Do the configs I used look correct?

This is basically the data I have:

$ ls saspeech/train/wav/ | wc -l
2688

$ ls saspeech/val/wav/ | wc -l
298

$ ffmpeg -i saspeech/train/wav/gold_001_line_128.wav
44100 Hz, mono, s16, 705 kb/s

$ head -n 1 saspeech/train/metadata.csv 
gold_001_line_124|בְּבִנְיָינִים מִסְפָּר 36 וְ-38 בִּרְחוֹב סַרְלִין בְּחוֹלוֹן,

$ head -n 1 saspeech/val/metadata.csv 
gold_000_line_000|שָׁלוֹם, צְלִיל אַבְרָהָם.

Also, it turns out the sample rate of my data is 44100 Hz, but I can't find a matching config in feature_extractor.
Should I just copy 48khz.yaml to 44khz.yaml and change only sample_rate: 44100?

Another CUDA error:

data.batch_size=5     data.num_workers=2
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.44 GiB. GPU 0 has a total capacity of 23.60 GiB of which 1.66 GiB is free. Process 2892091 has 21.93 GiB memory in use. Of the allocated memory 9.52 GiB is allocated by PyTorch, and 12.07 GiB is reserved by PyTorch but unallocated.
$ nvidia-smi
Sat Sep  7 20:44:59 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:04:00.0 Off |                  N/A |
| 44%   48C    P8             25W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Update:

I'm able to start training with a batch size of 2 and 1 worker; otherwise it runs out of memory, even though I have a 24GB GPU (RTX 4090 Ti). It's about 5 minutes per epoch.

Epoch 0:  27%|█████████████████▋                                               | 367/1344 [00:58<02:35,  6.28it/s

Does training the model in this repository have advantages over training Piper?
I'm a beginner, so maybe I'd find that an easier starting point.

@mush42 (Owner) commented Sep 7, 2024

@thewh1teagle the difference between this and Piper is that this is based on JETS, not VITS. JETS is considered by many to be better than VITS in quality.
Also, you can try gradient accumulation to train with higher effective batch sizes; see the example below.
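
For example, something like this should raise the effective batch size without raising peak memory much. It assumes Lightning's accumulate_grad_batches option is reachable through the trainer config group, which is a guess about the Hydra layout here:

# Effective batch size = 8 x 4 = 32
python -m optispeech.train \
    experiment="saspeech-he" \
    data.batch_size=8 \
    trainer.accumulate_grad_batches=4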

@thewh1teagle (Author) commented Sep 8, 2024

@mush42

I tried even on an A100 with 40GB of VRAM, and I got an out-of-memory error with this:

python -m optispeech.train \
    experiment="saspeech-he" \
    model.train_args.evaluate_utmos=false \
    data.batch_size=16 \
    data.num_workers=8 \
    data.train_filelist_path="data/saspeech/train.txt" \
    data.valid_filelist_path="data/saspeech/val.txt" \
    callbacks.model_checkpoint.every_n_epochs=5  \
    paths.log_dir="data/saspeech/logs"

I don't think I can get anything more powerful than that. What can I do?

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.64 GiB. GPU 0 has a total capacity of 39.50 GiB of which 12.78 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.59 GiB is allocated by PyTorch, and 4.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I also tried setting the environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but it didn't help.

Maybe my dataset is too large? (4 hours of audio)

@mush42 (Owner) commented Sep 8, 2024

@thewh1teagle
Please check that your data does not include very long utterances. I faced OOM errors with LJSpeech because of that.
Filter out utterances whose audio is longer than 20 seconds (a filtering sketch follows below).
I have successfully trained models of all the different architectures with a T4 (16 GB) and a Quadro RTX 6000 (48 GB).
A batch size of 16 always works without giving OOM errors.
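
A minimal filtering sketch, assuming an LJSpeech-style metadata.csv ("id|text") next to a wav/ folder as shown earlier in this thread:

from pathlib import Path

import librosa

root = Path("saspeech/train")
kept = []
for line in (root / "metadata.csv").read_text(encoding="utf-8").splitlines():
    utt_id = line.split("|", 1)[0]
    wav = root / "wav" / f"{utt_id}.wav"
    # Use filename= instead of path= on older librosa versions.
    if librosa.get_duration(path=str(wav)) <= 20.0:
        kept.append(line)
(root / "metadata_filtered.csv").write_text("\n".join(kept), encoding="utf-8")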

@thewh1teagle (Author) commented Sep 8, 2024

Please check that your data does not include very long utterances. I faced OOM errors with LJSpeech because of that.
Filter out utterances whose audio is longer than 20 seconds.

I checked, but I don't have any WAV file longer than 20 seconds.

from pathlib import Path
import librosa

files = list(Path("wavs").glob("*.wav"))

# Track the longest clip; avoid shadowing the built-in max().
max_duration = 0.0
for file in files:
    duration = librosa.get_duration(path=str(file))  # `filename=` on older librosa
    max_duration = max(max_duration, duration)
print(max_duration)

My maximum duration is 14.0 seconds.

Is there another reason that could cause this?
Or do you mean the phoneme transcription can be too long, so I should also check the JSON files?

Example of a long utterance:

{'text': 'רַק אִם הַיְּלָדִים שֶׁהוֹלְכִים לִלְמוֹד בּוֹ לֹא הוֹלְכִים לִלְמוֹד בִּגְלַל הַשִּׁיוּחַ הַמַּעֲמָדִי, אוֹ הַכַּלְכָּלִי, אוֹ הֵעַדְתִּי שֶׁלָּהֶם. הוּא יָכוֹל לִהְיוֹת דָּבָר טוֹב אִם הַיְּלָדִים שֶׁהוֹלְכִים לִלְמוֹד בּוֹ, הוֹלְכִים לִלְמוֹד בּוֹ מֵהַבְּחִירָה שֶׁלָּהֶם,', 'phonemes': 'rˈak ʔˈim hajlˈadim ʃehˈolχim lˈilmod bˈo lˈoʔ hˈolχim lˈilmod bˈiɡlal hˌaʃiəjˈuχa hˌamaʔamˈadij, ʔˈo hˌakalkˈalij, ʔˈo heʔˈadti ʃelˈahem.'}

Is that too long?

As you can see, it's dotted (diacritized) text, so it's much longer than the equivalent English would be.

@w11wo commented Sep 8, 2024

@thewh1teagle, I'm training on 1 x RTX 4090 with 24GB of VRAM. I reduced the batch size to 8, which still trained fine. Perhaps you could try reducing the batch size to 8?

@thewh1teagle (Author) commented Sep 8, 2024

I'm training on 1 x RTX 4090 with 24GB of VRAM. I reduced the batch size to 8, which still trained fine. Perhaps you could try reducing the batch size to 8?

I think a batch size of 8 worked for me too, but then the training seems very slow; it takes a few minutes per epoch. Is that expected?

Epoch 0:  27%|█████████████████▋                                               | 367/1344 [00:58<02:35,  6.28it/s

If that's expected, at what point can I try the model, even just to hear whether it makes some sound? (I don't want to train for many days before seeing that something is going in the right direction.)
Also, I didn't find where to control how frequently the checkpoint is saved.

@w11wo commented Sep 9, 2024

@thewh1teagle

Yes, it is expected; my training took 11 mins/epoch:

Epoch 249: 100%|█| 3927/3927 [11:17<00:00,  5.80it/s

You can have a look at the TensorBoard's Audio tab every now and then to see how it sounds on the eval set. I suggest waiting until at least 100k-300k steps to hear how it sounds -- and this shouldn't take too long.

You can modify this value in the config to control how frequently to save the checkpoint:

trainer:
  check_val_every_n_epoch: YOUR VALUE HERE

@thewh1teagle (Author)

@w11wo

Thanks!
I'll keep it training tomorrow and then check the result at 100k steps in TensorBoard.

@thewh1teagle (Author) commented Sep 9, 2024

I'm on 60k steps. It doesn't sound good, but I can understand it :)
I forgot to configure checkpoint saving; I think it will only save at the 9M step.
But I noticed that there are checkpoint files in the logs folder; can I use them for inference / continuing training? @mush42
I tried to run inference, but I got this error:

python3 -m optispeech.infer checkpoint_epoch\=192_step\=128696.ckpt hello output
    return self.method(cls, *args, **kwargs)
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1582, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/core/saving.py", line 63, in _load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=map_location)
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/fabric/utilities/cloud_io.py", line 60, in _load
    return torch.load(
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1097, in load
    return _load(
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1526, in _load
    del torch._utils._thread_local_state.map_location
AttributeError: map_location

By the way, are you open to accepting PRs in this repository?

@mush42 (Owner) commented Sep 9, 2024

@thewh1teagle
I can confirm this issue. I faced it in the past. Let me see what I can do.
I'm open to accepting PRs, for sure.

@thewh1teagle (Author) commented Sep 10, 2024

I can confirm this issue. I faced it in the past. Let me see what I can do.

Is there a workaround? I can't continue training or run inference.
Maybe it's because of the CUDA version; which CUDA version do you use? (Mine is 12.5.)

@mush42 (Owner) commented Sep 10, 2024

@thewh1teagle I did a very ugly hack whereby I removed the offending line :D

@mush42 (Owner) commented Sep 10, 2024

Anyway, I'll resolve these versioning issues next weekend.

@thewh1teagle (Author) commented Sep 10, 2024

I did a very ugly hack whereby I removed the offending line :D

Nice! So the latest code should work fine with that, and I can continue from the checkpoint I have?
I can fix the versioning issue in the meantime.

It turns out I'm already on the latest main, but it still failed:

 /root/home/optispeech/optispeech/train.py(81)train()
-> ckpt_model = model_cls.load_from_checkpoint(ckpt_path, map_location="cpu")
(Pdb) n
AttributeError: map_location

Update:
Changing PyTorch with rye add torch==2.3.1 fixed the issue.
But when continuing training, it shows that it started from epoch 0 when it should be 1.

@thewh1teagle (Author) commented Sep 11, 2024

Thanks so much for the help! After resolving the model-loading issue by changing PyTorch to version 2.3.1, I was finally able to export it to ONNX.

The model is currently at 50k training steps, and I tested it on several Hebrew sentences. I'm already impressed! It generates complete sentences in less than 0.1 seconds on a CPU with ONNX Runtime, and even at this stage, every word is understandable with quite good pronunciation. The only aspect that seems to need improvement is the overall audio quality, which I expect will get better as training progresses.

Do you think that once I reach 9M steps, I'll be able to fine-tune the model with a new voice to change the speaker?

@thewh1teagle (Author)

Maybe the profiler can tell why it trains so slowly:

[profiler screenshot]

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle I suspect the alignment module. At least in Matcha-TTS, MAS has a lot of overhead.

@thewh1teagle (Author) commented Sep 12, 2024

I suspect the alignment module. At least in Matcha-TTS, MAS has a lot of overhead.

This is literally all I added: thewh1teagle@aee021a

And the command I use:

python -m optispeech.train \
    experiment="saspeech-he" \
    model.train_args.evaluate_utmos=false \
    data.batch_size=16 \
    data.num_workers=8 \
    data.train_filelist_path="data/saspeech/train.txt" \
    data.valid_filelist_path="data/saspeech/val.txt" \
    callbacks.model_checkpoint.every_n_epochs=5  \
    paths.log_dir="data/saspeech/logs" \
    callbacks.model_checkpoint.save_last=True \
    ckpt_path="last.ckpt"

4 hours of audio at 44.1kHz.

Every step is slooow... about 1 second.

@w11wo said it took him 3 days for 1M steps. In that case, it should be much faster than 1 step per second.

If I just remove the ckpt_path="last.ckpt", it trains 10x faster.

@rmcpantoja

Hi,
I believe training speed also depends a lot on how fast your GPU is.
Cheers.

@thewh1teagle (Author)

Hi,
I believe training speed also depends a lot on how fast your GPU is.

Hey :)
Of course. I mean that I compared on the same GPU. I tried an RTX 3090 and an RTX 4090 Ti, with 24GB / 16GB of VRAM.

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle can you confirm that the speed-up continues beyond step 1000?
When it starts from scratch, it trains fast for the first 1000 steps; then the discriminator kicks in, which slows training down.

@thewh1teagle (Author)

@thewh1teagle can you confirm that the speed-up continues beyond step 1000? When it starts from scratch, it trains fast for the first 1000 steps; then the discriminator kicks in, which slows training down.

You're right. The first 1000 steps are fast, around 10 it/s, and then it slows down to 1.42 it/s,
on an RTX 3090 with 24GB of VRAM and a batch size of 8 (I also tried 16).
That's too slow; it will take 60 hours to reach 300k steps.
@w11wo reported that he trained 2M steps in 3 days.

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle
Training speed doesn't only depend on VRAM imo. It also depends on the overall performance of your GPU.
You can remove some of the losses/discriminators, but quality will be degraded.

@mush42 (Owner) commented Sep 12, 2024

Discriminators operate in the time domain dealing with the raw waveform, whereas the generator operates on a compact representation.

@thewh1teagle (Author)

That's too slow; it will take 60 hours to reach 300k steps.

@mush42

In that case, is that expected on an RTX 3090?
I prefer quality... but I didn't think I would need to train for more than 3 days. I don't have a local machine, so I use cloud machines.

@thewh1teagle (Author) commented Sep 12, 2024

That's how it sounds at checkpoint_epoch=488_step=262762.ckpt. Is that step count effectively counted as half (given the generator/discriminator split)?

I don't know if you can tell, because it's Hebrew and may sound strange anyway, but there's some white noise / vibration (the dataset is clean).

[attached audio: output.mp4]

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle
I notice some artifacts.
What pitch extractor are you using?

@thewh1teagle (Author)

What pitch extractor are you using?

Seems like the default one?

I use this 44.1kHz config:

feature_extractor/44.10khz

defaults:
  - default
  - _self_

sample_rate: 44100
n_feats: 80
n_fft: 2048
hop_length: 512
win_length: 2048
f_min: 20
f_max: 11025

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle would you mind telling me the exact value of the pitch_extractor key? I switched the default last week, and I don't know if you're using the up-to-date one.

@thewh1teagle (Author) commented Sep 12, 2024

would you mind telling me the exact value of the pitch_extractor key? I switched the default last week, and I don't know if you're using the up-to-date one.

Oh, I updated the branch from your latest code while resuming training. I hope I started with the same one.

This is currently my default.yaml:

Should I recreate the statistics just in case?

_target_: optispeech.dataset.feature_extractors.CommonFeatureExtractor
sample_rate: 24000
n_feats: 80
n_fft: 2048
hop_length: 300
win_length: 1200
f_min: 80
f_max: 8000
center: false
pitch_extractor:
  _target_: optispeech.dataset.feature_extractors.pitch_extractors.EnsemblePitchExtractor
  _partial_: true
  batch_size: 2048
  interpolate: true
preemphasis_filter_coef: null # apply preemphasis filter
lowpass_freq: null
highpass_freq: null
gain_db: null
trim_silence: false
trim_silence_args:
  silence_threshold: 0.2
  silence_samples_per_chunk: 480
  silence_keep_chunks_before: 2
  silence_keep_chunks_after: 2

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle
Make sure that when you preprocessed your dataset you were using the EnsemblePitchExtractor, not the JDC one. The JDC extractor is very bad and leads to very robotic output. I'm getting impressive results with the Ensemble extractor.

Unfortunately, there is no way to tell other than comparing file dates with the commit date (a quick sketch follows below).
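
For example, a quick look at the modification times of the processed files (assuming the features are stored as .npz under the processed dataset directory, as mentioned earlier in this thread):

from datetime import datetime
from pathlib import Path

# Compare these timestamps against the date of the commit that switched
# the default pitch extractor.
for f in sorted(Path("processed_dataset").rglob("*.npz"))[:5]:
    print(f, datetime.fromtimestamp(f.stat().st_mtime))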

@thewh1teagle (Author) commented Sep 12, 2024

Unfortunately, there is no way to tell other than comparing file dates with the commit date.

Thanks. I'll check, and I hope I've used it already.
If not, do I have to start training from scratch, or just preprocess again, recreate the statistics, and resume training?

@mush42 (Owner) commented Sep 13, 2024

@thewh1teagle
Preprocess and resume from last checkpoint.

@w11wo commented Sep 13, 2024

Hi @thewh1teagle. Just to be clear, for my training to reach 2M steps (1M steps for the generator, 1M for the discriminator) in ~3 days, I didn't use the default model (Transformer backbone). I used the ConvNeXt backbone.

Hope this helps.

@thewh1teagle (Author)

Thank you @w11wo !
I'll try to compare it with ConvNeXt, since I use the default one.

If you're still training, or @mush42, I'd really appreciate it if you could look at the output and share your speed here (iterations per second, the it/s from the progress bar).

@mush42 (Owner) commented Sep 13, 2024

@thewh1teagle the numbers you gave are consistent with my experience. Yeah, same numbers.

@mush42 (Owner) commented Sep 13, 2024

@thewh1teagle regarding memory usage... I faced the same OOM error on one dataset. Whatever I do, it seems I cannot make it train with a batch size of 16. The strange thing is that it only happens with some datasets, not all.

@thewh1teagle (Author)

I faced the same OOM error on one dataset. Whatever I do, it seems I cannot make it train with a batch size of 16. The strange thing is that it only happens with some datasets, not all.

Interesting! What if you change the PyTorch version like I did?

rye add torchaudio==2.3.1

@mush42 (Owner) commented Sep 13, 2024

@thewh1teagle did it help you? By how much?

@thewh1teagle (Author)

did it help you? By how much?

I could train with a batch size of 16 instead of 8 without OOM. I'm not 100% sure the version downgrade is what made 16 work, but maybe.

By the way, since the batch size didn't change the speed much, I decided to rent a cheaper GPU, an RTX 3060 Ti with 8GB of VRAM, and I changed the batch size to 4. Same speed as with the RTX 4090 with 24GB of VRAM, etc...
I only hope that a batch size of 4 is fine to use.

@thewh1teagle (Author)

@mush42

I created this class to improve pause handling, and it sounds much better now, adding pauses after ? / ! / \n etc. What do you think? Can Piper maybe already handle this somehow?

As for SSML parsing, did you find a solution for supporting it?

import re
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    next_pause: float

class SegmentExtractor:
    """Split text into segments and assign a pause length after each one,
    based on the punctuation that ended the segment."""

    def __init__(self, default_pause: float = 0.02, question_pause: float = 0.05, period_pause: float = 0.05, new_line_pause: float = 0.3):
        self.default_pause = default_pause
        self.question_pause = question_pause
        self.period_pause = period_pause
        self.new_line_pause = new_line_pause

    def extract_segments(self, text: str):
        segments: list[Segment] = []
        # The capturing group keeps the punctuation, so sentences and their
        # terminators alternate in the resulting list.
        sentences = re.split(r'([.?!:\n])', text)
        for i in range(0, len(sentences) - 1, 2):
            sentence = sentences[i].strip()
            punctuation = sentences[i + 1]
            if sentence:  # Ensure the sentence is not empty
                if punctuation in '.!':
                    # Treat '!' like '.' so exclamations also get a sentence pause.
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.period_pause))
                elif punctuation == '?':
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.question_pause))
                elif punctuation == '\n':
                    # End the spoken segment with a period rather than the raw newline.
                    segments.append(Segment(text=f"{sentence}.", next_pause=self.new_line_pause))
                else:
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.default_pause))
        # Trailing text with no terminating punctuation.
        last_sentence = sentences[-1].strip()
        if last_sentence:
            segments.append(Segment(text=f"{last_sentence}.", next_pause=self.default_pause))
        return segments
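
A quick usage example of the class above (English sample text just for illustration):

extractor = SegmentExtractor()
for seg in extractor.extract_segments("Hello! How are you?\nToday is sunny"):
    print(repr(seg.text), seg.next_pause)
# 'Hello!' 0.05
# 'How are you?' 0.05
# 'Today is sunny.' 0.02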

@mush42 (Owner) commented Sep 14, 2024

@thewh1teagle which class are you referring to?

@thewh1teagle (Author)

Click on the details button at the end of the comment :)

@thewh1teagle (Author)

I just created this repository with the new TTS model. There are some artifacts, but even at this stage it's the best Hebrew TTS in terms of pronunciation and speed!

https://github.com/thewh1teagle/israwave

@mush42
Take note of the project structure; it appears to simplify both the training and inference processes. We might be able to enhance OptiSpeech by adopting a similar approach, such as using separate notebooks for each stage, separating the Gradio app into its own project and folder, etc.
