Training new language #2

Closed
thewh1teagle opened this issue Sep 1, 2024 · 67 comments
@thewh1teagle

Hey!
Thanks for sharing this; the repository looks well organized.

I would like to train a new language: Hebrew.
Could you add information to the steps about training on a new language?

I have 6 hours of high-quality WAV files along with their corresponding diacritized transcriptions.
Does that sound like enough data for training? Do you know how many steps I'll need?
It looks like you trained it on English; have you tried another language, such as Arabic?

Thanks

@w11wo commented Sep 2, 2024

Hi @thewh1teagle, I'd like to help out.

I've successfully trained an OptiSpeech model for Indonesian (id), as shown here #1 (comment).

It's quite simple to add support for a new language. You mainly just have to add a new text processor (i.e. add the language code during the phonemization step), and then designate a new training recipe that uses that text processor. Here's an example of my Indonesian recipe bookbot-hive@b5b1198.

For my training, I used 43 mins of data (~800 utterances) and trained it for 2M steps (same as default config).

Hope this helps!

@thewh1teagle (Author)

@w11wo
Thanks for your help!

How long did it take you to train for 2M steps? That sounds like a lot. Did you use a Colab A100 GPU?
I noticed that espeak-ng is used both in preprocessing and inference, so I need to make sure that espeak-ng's Hebrew phonemization is good. Hebrew is a complex language with many unusual rules...

Are you satisfied with the results? Is it as good as the English one here?

Here's an example of my Indonesian recipe bookbot-hive@b5b1198.

If my audio files are from a single speaker and I change the sample rate to match the config, should I change anything else in your configuration?

@w11wo commented Sep 2, 2024

Hi @thewh1teagle.

How long did it take you to train for 2M steps? That sounds like a lot. Did you use a Colab A100 GPU?

It took me ~3 days. I used an RTX 3090 GPU (my company's local machine).

I noticed that espeak-ng is used both in preprocessing and inference, so I need to make sure that espeak-ng's Hebrew phonemization is good. Hebrew is a complex language with many unusual rules...

Yes, definitely. You can also modify the phonemizer backend if needed. I've successfully integrated gruut as my English phonemizer, for instance.

Are you satisfied with the results? Is it as good as the English one here?

I've been using LightSpeech for on-device TTS, and I find OptiSpeech to be much better than LS given its size. It's probably not as good as the English sample, likely because I used a small dataset, but so far I have no issues with it.

If my audio files are from a single speaker and I change the sample rate to match the config, should I change anything else in your configuration?

My config is also single-speaker, and I changed the sample rate to 44.1kHz. Feel free to adapt it to your setup. If you need to change the rate, you can modify the feature extractor to your desired config.

Good luck!

@mush42 (Owner) commented Sep 3, 2024

@thewh1teagle

@w11wo did a good job in explaining what's needed.

The model is built from the ground up to support other languages besides English.

If you want to train with a custom text front-end, do the following:

  1. Create a custom class in optispeech.text.tokenizers that follows the interface of optispeech.text.tokenizers.BaseTokenizer. Most likely you need to override the __call__ method, which accepts the text and the language (which can be None for unilingual front-ends) and should return integer IDs for the given text (see the sketch after this list).
  2. In your dataset config, add the following:
...
text_processor:
  tokenizer_name: <the name you gave to your custom tokenizer>
  • Regarding dataset size, I think 6 hours is quite enough.
  • Regarding training time, I think it depends on which backbone you use. ConvNeXt is the slowest to train, but it gives a nice balance between inference speed and quality, while LightSpeech is fast to train and infer but its quality is lower than ConvNeXt's.
  • It took me between 7 and 10 days to train the ConvNeXt-based model for 500K steps, but stopping at 300K steps gives good quality.
  • Note that the default total training steps are 2M: 1M for the generator and 1M for the discriminator. So in the end, the model is only trained for 1M steps.
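
To make step 1 concrete, here is a minimal sketch of such a tokenizer. It assumes only the interface described above; the base-class details, the name attribute, and the symbol inventory are illustrative, not taken from the repository.

from optispeech.text.tokenizers import BaseTokenizer

# Illustrative symbol subset only; a real front-end would cover the full
# phoneme inventory of the target language.
SYMBOLS = ["_", " ", "a", "e", "i", "o", "u", "ʔ", "χ", "ʃ", "r", "k", "l", "m"]
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

class HebrewTokenizer(BaseTokenizer):
    name = "hebrew"  # referenced as tokenizer_name in the dataset config (assumed mechanism)

    def __call__(self, text, language=None):
        # `language` may be None for a unilingual front-end.
        # Replace this character pass-through with a real phonemization step.
        phonemes = list(text)
        return [SYMBOL_TO_ID[p] for p in phonemes if p in SYMBOL_TO_ID]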

mush42 closed this as completed Sep 3, 2024
@thewh1teagle (Author)

Thank you all. I'll start soon and check it.

I have 6 hours of high quality audio along with their diacritized transcriptions. My primary concern is the accuracy of espeak-ng. If espeak-ng turns out to be inaccurate, would it be advisable to train on diacritized characters instead? How challenging would it be to fix or improve espeak-ng?

@mush42 (Owner) commented Sep 5, 2024

@thewh1teagle If you have an alternative phonemizer you can easily adopt it. Otherwise you can go with raw characters instead, but you'll need to handle numbers and abbreviations independently.
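
If you go the raw-character route, the numbers part could be handled with something like the sketch below, assuming num2words supports your language code (swap in your own expansion rules if it doesn't):

import re

from num2words import num2words

def expand_numbers(text: str, lang: str = "he") -> str:
    # Spell out each digit run before character-level tokenization,
    # so the model never sees raw digits.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)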

@mush42 (Owner) commented Sep 5, 2024

FYI, I have a demo of a trained OptiSpeech model.

Find it here

@thewh1teagle (Author) commented Sep 6, 2024

If you have an alternative phonemizer you can easily adopt it.

I have added the missing phonemes of Hebrew to my espeak-ng fork, but I noticed that the repository relies on the Piper phonemizer. How can I provide a metadata file with the phonemes directly?

Additionally, there are many dependencies that don't seem directly related to the training process. I believe the repository could be improved by separating it into multiple packages, similar to Rust crates. For example, one package could focus on preprocessing, another on inference, and so on. This would make it easier to identify the important parts to focus on for training.

I would be happy to contribute to this restructuring once I fully understand the repository. However, I'm unsure if starting training without a deeper understanding is a good idea. I have 3 hours of 22.05 kHz mono audio (2-10 second clips) with corresponding dotted transcriptions and a ready-to-use fork of espeak-ng. I hope that's all I need to get started.

@thewh1teagle (Author)

FYI, I have a demo of a trained OptiSpeech model.

Thanks! It sounds really good.
It will be interesting to hear how it performs in Hebrew... There is no open Hebrew model at this level yet, and the language is relatively complex.

@mush42 (Owner) commented Sep 7, 2024

@thewh1teagle

  • Regarding your eSpeak-ng fork, you can implement a custom class that calls it via a subprocess, or call the lib via ctypes (a subprocess sketch follows the commands below). If you want the punctuation handling of piper-phonemize, you can patch and build it yourself; it is only one Docker command, see build instructions here.
  • Keeping it simple for now: I created an inference-only package; for training/development, use the main package.
  • Regarding training, it is only three commands, as long as all custom components are ready.
    This is what I do:
  1. Preprocess the dataset:
python -m optispeech.tools.preprocess_dataset \
    <my_dataset_config> \
    <my_dataset_directory> \
    <processed_dataset_directory>
  2. Generate dataset statistics and update the config with them:
python -m optispeech.tools.generate_data_statistics <my_dataset_config>
  3. Start training:
python -m optispeech.train \
    experiment="<my_experiment_config>" \
    model.train_args.evaluate_utmos=false \
    data.batch_size=32 \
    data.num_workers=8 \
    data.train_filelist_path="<processed_dataset_directory>/train.txt" \
    data.valid_filelist_path="<processed_dataset_directory>/val.txt" \
    callbacks.model_checkpoint.every_n_epochs=5  \
    paths.log_dir="<logs_directory>"
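
As a sketch of the subprocess route mentioned above (the -q, --ipa, and -v flags are standard espeak-ng CLI options; point espeak_bin at the patched fork's binary):

import subprocess

def phonemize_with_espeak(text: str, voice: str = "he",
                          espeak_bin: str = "espeak-ng") -> str:
    """Return an IPA phoneme string for `text` via the espeak-ng CLI.

    -q suppresses audio output, --ipa prints IPA phonemes, and -v selects
    the voice/language.
    """
    result = subprocess.run(
        [espeak_bin, "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()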

@thewh1teagle (Author)

Thanks. It turns out that I can easily use my custom espeak-ng from within the Piper phonemizer package.

I started training, but it failed with an error.
Also, it looks like train.txt and val.txt don't contain the transcriptions/phonemes, only the paths to the WAV files. Is that expected? Where does it take the transcriptions from?

[2024-09-07 19:37:35,831][optispeech.utils.generic][INFO] - Output dir: /root/home/optispeech/data/saspeech/logs/train/saspeech_he/runs/2024-09-07_19-37-06
Error executing job with overrides: ['experiment=saspeech-he', 'model.train_args.evaluate_utmos=false', 'data.batch_size=32', 'data.num_workers=8', 'data.train_filelist_path=data/saspeech/train.txt', 'data.valid_filelist_path=data/saspeech/val.txt', 'callbacks.model_checkpoint.every_n_epochs=5', 'paths.log_dir=data/saspeech/logs']
Traceback (most recent call last):
  File "/root/home/optispeech/optispeech/train.py", line 114, in main
    metric_dict, _ = train(cfg)
  File "/root/home/optispeech/optispeech/utils/generic.py", line 88, in wrap
    raise ex
  File "/root/home/optispeech/optispeech/utils/generic.py", line 78, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/root/home/optispeech/optispeech/train.py", line 81, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
    self.on_run_start()
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 328, in on_run_start
    call._call_lightning_module_hook(trainer, "on_train_start")
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/root/home/optispeech/optispeech/model/base_lightning_module.py", line 73, in on_train_start
    if self._opti_reset_optim_and_lr:
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'OptiSpeech' object has no attribute '_opti_reset_optim_and_lr'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Training: |          | 0/? [00:01<?, ?it/s]

This is everything I changed to start training on my dataset:

thewh1teagle@2882a53

Also, do I need to preprocess again now that I have the statistics, or was that step used only to compute the statistics?

@mush42 (Owner) commented Sep 7, 2024

@thewh1teagle

  • The processed data is stored in JSON and NPZ files, for text and arrays respectively.
  • You don't need to preprocess again; the stats are used during training.

@thewh1teagle (Author) commented Sep 7, 2024

  • The processed data is stored in JSON and NPZ files, for text and arrays respectively.

Oh, I hadn't noticed the JSON files there. It would be useful if the JSON included the phonemes, not just their IDs, to make it easy to check that the phonemes are correct.
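
In the meantime, a quick sanity check could map the stored IDs back to symbols; everything here (file path, key name, symbol table) is hypothetical and needs adjusting to the actual JSON schema:

import json

# Hypothetical: invert your tokenizer's symbol-to-ID table.
SYMBOLS = ["_", " ", "a", "e", "i", "o", "u"]
ID_TO_SYMBOL = dict(enumerate(SYMBOLS))

with open("processed/train/gold_001_line_124.json") as f:  # hypothetical path
    item = json.load(f)
print("".join(ID_TO_SYMBOL.get(i, "?") for i in item["phoneme_ids"]))  # key name assumed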

Do the configs I used look correct?

This is basically the data I have:

$ ls saspeech/train/wav/ | wc -l
2688

$ ls saspeech/val/wav/ | wc -l
298

$ ffmpeg -i saspeech/train/wav/gold_001_line_128.wav
44100 Hz, mono, s16, 705 kb/s

$ head -n 1 saspeech/train/metadata.csv 
gold_001_line_124|בְּבִנְיָינִים מִסְפָּר 36 וְ-38 בִּרְחוֹב סַרְלִין בְּחוֹלוֹן,

$ head -n 1 saspeech/val/metadata.csv 
gold_000_line_000|שָׁלוֹם, צְלִיל אַבְרָהָם.

Also, it turns out the sample rate of my data is 44100 Hz, but I can't find a matching config in feature_extractor.
Should I just copy 48khz.yaml to 44khz.yaml and change only sample_rate: 44100?

Another CUDA error:

data.batch_size=5     data.num_workers=2
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.44 GiB. GPU 0 has a total capacity of 23.60 GiB of which 1.66 GiB is free. Process 2892091 has 21.93 GiB memory in use. Of the allocated memory 9.52 GiB is allocated by PyTorch, and 12.07 GiB is reserved by PyTorch but unallocated.
$ nvidia-smi
Sat Sep  7 20:44:59 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:04:00.0 Off |                  N/A |
| 44%   48C    P8             25W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Update:

I'm able to start training with a batch size of 2 and 1 worker; otherwise it runs out of memory, even though I have a 24GB GPU (RTX 4090 Ti). It's about 5 minutes per epoch.

Epoch 0:  27%|█████████████████▋                                               | 367/1344 [00:58<02:35,  6.28it/s

Does training the model in this repository have advantages over training Piper?
I'm a beginner, so maybe I'd find that an easier starting point.

@mush42 (Owner) commented Sep 7, 2024

@thewh1teagle the difference between this and Piper is that this is based on JETS, not VITS. JETS is considered by many to be better than VITS in quality.
Also, you can try gradient accumulation to train with higher effective batch sizes; see the example below.
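
For example, something like this should raise the effective batch size without raising peak memory much. It assumes Lightning's accumulate_grad_batches option is reachable through the trainer config group, which is a guess about the Hydra layout here:

# Effective batch size = 8 x 4 = 32
python -m optispeech.train \
    experiment="saspeech-he" \
    data.batch_size=8 \
    trainer.accumulate_grad_batches=4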

@thewh1teagle (Author) commented Sep 8, 2024

@mush42

I tried even on an A100 with 40GB of VRAM, and I got an out-of-memory error with this:

python -m optispeech.train \
    experiment="saspeech-he" \
    model.train_args.evaluate_utmos=false \
    data.batch_size=16 \
    data.num_workers=8 \
    data.train_filelist_path="data/saspeech/train.txt" \
    data.valid_filelist_path="data/saspeech/val.txt" \
    callbacks.model_checkpoint.every_n_epochs=5  \
    paths.log_dir="data/saspeech/logs"

I don't think I can get anything more powerful than that. What can I do?

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.64 GiB. GPU 0 has a total capacity of 39.50 GiB of which 12.78 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.59 GiB is allocated by PyTorch, and 4.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I also tried setting the environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but it didn't help.

Maybe my dataset is too large? (4 hours of audio)

@mush42 (Owner) commented Sep 8, 2024

@thewh1teagle
Please check that your data does not include very long utterances. I faced OOM errors with LJSpeech because of that.
Filter out utterances whose audio is longer than 20 seconds (a filtering sketch follows below).
I have successfully trained models of all the different architectures with a T4 (16 GB) and a Quadro RTX 6000 (48 GB).
A batch size of 16 always works without giving OOM errors.
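
A minimal filtering sketch, assuming an LJSpeech-style metadata.csv ("id|text") next to a wav/ folder as shown earlier in this thread:

from pathlib import Path

import librosa

root = Path("saspeech/train")
kept = []
for line in (root / "metadata.csv").read_text(encoding="utf-8").splitlines():
    utt_id = line.split("|", 1)[0]
    wav = root / "wav" / f"{utt_id}.wav"
    # Use filename= instead of path= on older librosa versions.
    if librosa.get_duration(path=str(wav)) <= 20.0:
        kept.append(line)
(root / "metadata_filtered.csv").write_text("\n".join(kept), encoding="utf-8")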

@thewh1teagle (Author) commented Sep 8, 2024

Please check that your data does not include very long utterances. I faced OOM errors with LJSpeech because of that.
Filter out utterances whose audio is longer than 20 seconds.

I checked, but I don't have any WAV file longer than 20 seconds.

from pathlib import Path
import librosa

files = list(Path("wavs").glob("*.wav"))

# Track the longest clip; avoid shadowing the built-in max().
max_duration = 0.0
for file in files:
    duration = librosa.get_duration(path=str(file))  # `filename=` on older librosa
    max_duration = max(max_duration, duration)
print(max_duration)

My maximum duration is 14.0 seconds.

Is there another reason that could cause this?
Or do you mean the phoneme transcription can be too long, so I should also check the JSON files?

Example of a long utterance:

{'text': 'רַק אִם הַיְּלָדִים שֶׁהוֹלְכִים לִלְמוֹד בּוֹ לֹא הוֹלְכִים לִלְמוֹד בִּגְלַל הַשִּׁיוּחַ הַמַּעֲמָדִי, אוֹ הַכַּלְכָּלִי, אוֹ הֵעַדְתִּי שֶׁלָּהֶם. הוּא יָכוֹל לִהְיוֹת דָּבָר טוֹב אִם הַיְּלָדִים שֶׁהוֹלְכִים לִלְמוֹד בּוֹ, הוֹלְכִים לִלְמוֹד בּוֹ מֵהַבְּחִירָה שֶׁלָּהֶם,', 'phonemes': 'rˈak ʔˈim hajlˈadim ʃehˈolχim lˈilmod bˈo lˈoʔ hˈolχim lˈilmod bˈiɡlal hˌaʃiəjˈuχa hˌamaʔamˈadij, ʔˈo hˌakalkˈalij, ʔˈo heʔˈadti ʃelˈahem.'}

Is that too long?

As you can see, it's dotted (diacritized) text, so it's much longer than the equivalent English would be.

@w11wo commented Sep 8, 2024

@thewh1teagle, I'm training on 1 x RTX 4090 with 24GB of VRAM. I reduced the batch size to 8, which still trained fine. Perhaps you could try reducing the batch size to 8?

@thewh1teagle (Author) commented Sep 8, 2024

I'm training on 1 x RTX 4090 with 24GB of VRAM. I reduced the batch size to 8, which still trained fine. Perhaps you could try reducing the batch size to 8?

I think a batch size of 8 worked for me too, but then the training seems very slow; it takes a few minutes per epoch. Is that expected?

Epoch 0:  27%|█████████████████▋                                               | 367/1344 [00:58<02:35,  6.28it/s

If that's expected, at what point can I try the model, even just to hear whether it makes some sound? (I don't want to train for many days before seeing that something is going in the right direction.)
Also, I didn't find where to control how frequently the checkpoint is saved.

@w11wo commented Sep 9, 2024

@thewh1teagle

Yes, it is expected; my training took 11 mins/epoch:

Epoch 249: 100%|█| 3927/3927 [11:17<00:00,  5.80it/s

You can have a look at the TensorBoard's Audio tab every now and then to see how it sounds on the eval set. I suggest waiting until at least 100k-300k steps to hear how it sounds -- and this shouldn't take too long.

You can modify this value in the config to control how frequently to save the checkpoint:

trainer:
  check_val_every_n_epoch: YOUR VALUE HERE

@thewh1teagle (Author)

@w11wo

Thanks!
I'll keep it training tomorrow and then check the result at 100k steps in TensorBoard.

@thewh1teagle (Author) commented Sep 9, 2024

I'm on 60k steps. It doesn't sound good, but I can understand it :)
I forgot to configure checkpoint saving; I think it will only save at the 9M step.
But I noticed that there are checkpoint files in the logs folder; can I use them for inference / continuing training? @mush42
I tried to run inference, but I got this error:

python3 -m optispeech.infer checkpoint_epoch\=192_step\=128696.ckpt hello output
    return self.method(cls, *args, **kwargs)
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1582, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/pytorch/core/saving.py", line 63, in _load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=map_location)
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/lightning/fabric/utilities/cloud_io.py", line 60, in _load
    return torch.load(
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1097, in load
    return _load(
  File "/root/home/optispeech/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1526, in _load
    del torch._utils._thread_local_state.map_location
AttributeError: map_location

By the way, are you open to accepting PRs in this repository?

@mush42 (Owner) commented Sep 9, 2024

@thewh1teagle
I can confirm this issue. I faced it in the past. Let me see what I can do.
I'm open to accepting PRs, for sure.

@thewh1teagle (Author) commented Sep 10, 2024

I can confirm this issue. I faced it in the past. Let me see what I can do.

Is there a workaround? I can't continue training or run inference.
Maybe it's because of the CUDA version; which CUDA version do you use? (Mine is 12.5.)

@mush42 (Owner) commented Sep 10, 2024

@thewh1teagle I did a very ugly hack whereby I removed the offending line :D

@mush42 (Owner) commented Sep 10, 2024

Anyway, I'll resolve these versioning issues next weekend.

@thewh1teagle (Author) commented Sep 10, 2024

I did a very ugly hack whereby I removed the offending line :D

Nice! So the latest code should work fine with that, and I can continue from the checkpoint I have?
I can fix the versioning issue in the meantime.

It turns out I'm already on the latest main, but it still failed:

 /root/home/optispeech/optispeech/train.py(81)train()
-> ckpt_model = model_cls.load_from_checkpoint(ckpt_path, map_location="cpu")
(Pdb) n
AttributeError: map_location

Update:
Changing PyTorch with rye add torch==2.3.1 fixed the issue.
But when continuing training, it shows that it started from epoch 0 when it should be 1.

@thewh1teagle (Author) commented Sep 11, 2024

Thanks so much for the help! After resolving the model-loading issue by changing PyTorch to version 2.3.1, I was finally able to export it to ONNX.

The model is currently at 50k training steps, and I tested it on several Hebrew sentences. I'm already impressed! It generates complete sentences in less than 0.1 seconds on a CPU with ONNX Runtime, and even at this stage, every word is understandable with quite good pronunciation. The only aspect that seems to need improvement is the overall audio quality, which I expect will get better as training progresses.

Do you think that once I reach 9M steps, I'll be able to fine-tune the model with a new voice to change the speaker?

@thewh1teagle (Author)

Maybe the profiler can tell why it trains so slowly:

[profiler screenshot]

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle I suspect the alignment module. At least in Matcha-TTS, MAS has a lot of overhead.

@thewh1teagle (Author) commented Sep 12, 2024

I suspect the alignment module. At least in Matcha-TTS, MAS has a lot of overhead.

This is literally all I added: thewh1teagle@aee021a

And the command I use:

python -m optispeech.train \
    experiment="saspeech-he" \
    model.train_args.evaluate_utmos=false \
    data.batch_size=16 \
    data.num_workers=8 \
    data.train_filelist_path="data/saspeech/train.txt" \
    data.valid_filelist_path="data/saspeech/val.txt" \
    callbacks.model_checkpoint.every_n_epochs=5  \
    paths.log_dir="data/saspeech/logs" \
    callbacks.model_checkpoint.save_last=True \
    ckpt_path="last.ckpt"

4 hours of audio at 44.1kHz.

Every step is slooow... about 1 second.

@w11wo said it took him 3 days for 1M steps. In that case, it should be much faster than 1 step per second.

If I just remove the ckpt_path="last.ckpt", it trains 10x faster.

@rmcpantoja

Hi,
I believe training speed also depends a lot on how fast your GPU is.
Cheers.

@thewh1teagle (Author)

Hi,
I believe training speed also depends a lot on how fast your GPU is.

Hey :)
Of course. I mean that I compared on the same GPU. I tried an RTX 3090 and an RTX 4090 Ti, with 24GB / 16GB of VRAM.

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle can you confirm that the speed-up continues beyond step 1000?
When it starts from scratch, it trains fast for the first 1000 steps; then the discriminator kicks in, which slows training down.

@thewh1teagle (Author)

@thewh1teagle can you confirm that the speed-up continues beyond step 1000? When it starts from scratch, it trains fast for the first 1000 steps; then the discriminator kicks in, which slows training down.

You're right. The first 1000 steps are fast, around 10 it/s, and then it slows down to 1.42 it/s,
on an RTX 3090 with 24GB of VRAM and a batch size of 8 (I also tried 16).
That's too slow; it will take 60 hours to reach 300k steps.
@w11wo reported that he trained 2M steps in 3 days.

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle
Training speed doesn't only depend on VRAM imo. It also depends on the overall performance of your GPU.
You can remove some of the losses/discriminators, but quality will be degraded.

@mush42 (Owner) commented Sep 12, 2024

Discriminators operate in the time domain dealing with the raw waveform, whereas the generator operates on a compact representation.

@thewh1teagle (Author)

That's too slow; it will take 60 hours to reach 300k steps.

@mush42

In that case, is that expected on an RTX 3090?
I prefer quality... but I didn't think I would need to train for more than 3 days. I don't have a local machine, so I use cloud machines.

@thewh1teagle (Author) commented Sep 12, 2024

That's how it sounds at checkpoint_epoch=488_step=262762.ckpt. Is that step count effectively counted as half (given the generator/discriminator split)?

I don't know if you can tell, because it's Hebrew and may sound strange anyway, but there's some white noise / vibration (the dataset is clean).

[attached audio: output.mp4]

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle
I notice some artifacts.
What pitch extractor are you using?

@thewh1teagle (Author)

What pitch extractor are you using?

Seems like the default one?

I use this 44.1kHz config:

feature_extractor/44.10khz

defaults:
  - default
  - _self_

sample_rate: 44100
n_feats: 80
n_fft: 2048
hop_length: 512
win_length: 2048
f_min: 20
f_max: 11025

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle would you mind telling me the exact value of the pitch_extractor key? I switched the default last week, and I don't know if you're using the up-to-date one.

@thewh1teagle (Author) commented Sep 12, 2024

would you mind telling me the exact value of the pitch_extractor key? I switched the default last week, and I don't know if you're using the up-to-date one.

Oh, I updated the branch from your latest code while resuming training. I hope I started with the same one.

This is currently my default.yaml:

Should I recreate the statistics just in case?

_target_: optispeech.dataset.feature_extractors.CommonFeatureExtractor
sample_rate: 24000
n_feats: 80
n_fft: 2048
hop_length: 300
win_length: 1200
f_min: 80
f_max: 8000
center: false
pitch_extractor:
  _target_: optispeech.dataset.feature_extractors.pitch_extractors.EnsemblePitchExtractor
  _partial_: true
  batch_size: 2048
  interpolate: true
preemphasis_filter_coef: null # apply preemphasis filter
lowpass_freq: null
highpass_freq: null
gain_db: null
trim_silence: false
trim_silence_args:
  silence_threshold: 0.2
  silence_samples_per_chunk: 480
  silence_keep_chunks_before: 2
  silence_keep_chunks_after: 2

@mush42 (Owner) commented Sep 12, 2024

@thewh1teagle
Make sure that when you preprocessed your dataset you were using the EnsemblePitchExtractor, not the JDC one. The JDC extractor is very bad and leads to very robotic output. I'm getting impressive results with the Ensemble extractor.

Unfortunately, there is no way to tell other than comparing file dates with the commit date (a quick sketch follows below).
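
For example, a quick look at the modification times of the processed files (assuming the features are stored as .npz under the processed dataset directory, as mentioned earlier in this thread):

from datetime import datetime
from pathlib import Path

# Compare these timestamps against the date of the commit that switched
# the default pitch extractor.
for f in sorted(Path("processed_dataset").rglob("*.npz"))[:5]:
    print(f, datetime.fromtimestamp(f.stat().st_mtime))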

@thewh1teagle (Author) commented Sep 12, 2024

Unfortunately, there is no way to tell other than comparing file dates with the commit date.

Thanks. I'll check, and I hope I've used it already.
If not, do I have to start training from scratch, or just preprocess again, recreate the statistics, and resume training?

@mush42 (Owner) commented Sep 13, 2024

@thewh1teagle
Preprocess and resume from last checkpoint.

@w11wo commented Sep 13, 2024

Hi @thewh1teagle. Just to be clear, for my training to reach 2M steps (1M steps for the generator, 1M for the discriminator) in ~3 days, I didn't use the default model (Transformer backbone). I used the ConvNeXt backbone.

Hope this helps.

@thewh1teagle (Author)

Thank you @w11wo !
I'll try to compare it with ConvNeXt, since I use the default one.

If you're still training, or @mush42, I'd really appreciate it if you could look at the output and share your speed here (iterations per second, the it/s from the progress bar).

@mush42 (Owner) commented Sep 13, 2024

@thewh1teagle the numbers you gave are consistent with my experience. Yeah, same numbers.

@mush42 (Owner) commented Sep 13, 2024

@thewh1teagle regarding memory usage... I faced the same OOM error on one dataset. Whatever I do, it seems I cannot make it train with a batch size of 16. The strange thing is that it only happens with some datasets, not all.

@thewh1teagle (Author)

I faced the same OOM error on one dataset. Whatever I do, it seems I cannot make it train with a batch size of 16. The strange thing is that it only happens with some datasets, not all.

Interesting! What if you change the PyTorch version like I did?

rye add torchaudio==2.3.1

@mush42 (Owner) commented Sep 13, 2024

@thewh1teagle did it help you? By how much?

@thewh1teagle (Author)

did it help you? By how much?

I could train with a batch size of 16 instead of 8 without OOM. I'm not 100% sure the version downgrade is what made 16 work, but maybe.

By the way, since the batch size didn't change the speed much, I decided to rent a cheaper GPU, an RTX 3060 Ti with 8GB of VRAM, and I changed the batch size to 4. Same speed as with the RTX 4090 with 24GB of VRAM, etc...
I only hope that a batch size of 4 is fine to use.

@thewh1teagle (Author)

@mush42

I created this class to improve pause handling, and it sounds much better now, adding pauses after ? / ! / \n etc. What do you think? Can Piper maybe already handle this somehow?

As for SSML parsing, did you find a solution for supporting it?

import re
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    next_pause: float

class SegmentExtractor:
    """Split text into segments and assign a pause length after each one,
    based on the punctuation that ended the segment."""

    def __init__(self, default_pause: float = 0.02, question_pause: float = 0.05, period_pause: float = 0.05, new_line_pause: float = 0.3):
        self.default_pause = default_pause
        self.question_pause = question_pause
        self.period_pause = period_pause
        self.new_line_pause = new_line_pause

    def extract_segments(self, text: str):
        segments: list[Segment] = []
        # The capturing group keeps the punctuation, so sentences and their
        # terminators alternate in the resulting list.
        sentences = re.split(r'([.?!:\n])', text)
        for i in range(0, len(sentences) - 1, 2):
            sentence = sentences[i].strip()
            punctuation = sentences[i + 1]
            if sentence:  # Ensure the sentence is not empty
                if punctuation in '.!':
                    # Treat '!' like '.' so exclamations also get a sentence pause.
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.period_pause))
                elif punctuation == '?':
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.question_pause))
                elif punctuation == '\n':
                    # End the spoken segment with a period rather than the raw newline.
                    segments.append(Segment(text=f"{sentence}.", next_pause=self.new_line_pause))
                else:
                    segments.append(Segment(text=f"{sentence}{punctuation}", next_pause=self.default_pause))
        # Trailing text with no terminating punctuation.
        last_sentence = sentences[-1].strip()
        if last_sentence:
            segments.append(Segment(text=f"{last_sentence}.", next_pause=self.default_pause))
        return segments
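
A quick usage example of the class above (English sample text just for illustration):

extractor = SegmentExtractor()
for seg in extractor.extract_segments("Hello! How are you?\nToday is sunny"):
    print(repr(seg.text), seg.next_pause)
# 'Hello!' 0.05
# 'How are you?' 0.05
# 'Today is sunny.' 0.02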

@mush42 (Owner) commented Sep 14, 2024

@thewh1teagle which class are you referring to?

@thewh1teagle (Author)

Click on the details button at the end of the comment :)

@thewh1teagle (Author)

I just created this repository with the new TTS model. There are some artifacts, but even at this stage it's the best Hebrew TTS in terms of pronunciation and speed!

https://github.com/thewh1teagle/israwave

@mush42
Take note of the project structure; it appears to simplify both the training and inference processes. We might be able to enhance OptiSpeech by adopting a similar approach, such as using separate notebooks for each stage, separating the Gradio app into its own project and folder, etc.
