Arabic TTS models (FastPitch, MixerTTS) from the tts-arabic-pytorch repo, provided in the ONNX format.
Audio samples can be found here.
Install with:
```bash
pip install git+https://github.com/nipponjo/tts_arabic.git
```
Examples:
```python
# %%
from tts_arabic import tts

# %%
text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(text, speaker=2, pace=0.9, play=True)

# %% Buckwalter transliteration
text = ">als~alAmu Ealaykum yA Sadiyqiy."
wave = tts(text, speaker=0, play=True)

# %% Unvocalized input
text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
wave = tts(text_unvoc, play=True, vowelizer='shakkelha')
```
Pretrained models:
Model | Model ID | Type | #params | Paper | Output |
---|---|---|---|---|---|
FastPitch | fastpitch | Text->Mel | 46.3M | arxiv | Mel (80 bins) |
MixerTTS | mixer128 | Text->Mel | 2.9M | arxiv | Mel (80 bins) |
MixerTTS | mixer80 | Text->Mel | 1.5M | arxiv | Mel (80 bins) |
HiFi-GAN | hifigan | Vocoder | 13.9M | arxiv | Wave (22.05kHz) |
Vocos | vocos | Vocoder | 13.4M | arxiv | Wave (22.05kHz) |
Vocos | vocos44 | Vocoder | 14.0M | arxiv | Wave (44.1kHz) |
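To compare the models by ear or by size, one option is to enumerate every Text->Mel and vocoder pairing. The sketch below assumes that any Text->Mel model can be combined with any vocoder (plausible, since all of them share the same 80-bin mel format, but not stated explicitly). The helper names `combinations` and `synthesize_all` are made up for this example; only the `tts` arguments `model_id`, `vocoder_id`, `play` and `save_to` come from this README.

```python
from itertools import product

# Parameter counts in millions, copied from the table above.
TEXT2MEL = {"fastpitch": 46.3, "mixer128": 2.9, "mixer80": 1.5}
VOCODERS = {"hifigan": 13.9, "vocos": 13.4, "vocos44": 14.0}


def combinations():
    """Yield (model_id, vocoder_id, total parameters in millions)."""
    for (m, m_params), (v, v_params) in product(TEXT2MEL.items(), VOCODERS.items()):
        yield m, v, round(m_params + v_params, 1)


def synthesize_all(text, out_dir="."):
    """Render `text` once per pairing; requires tts_arabic to be installed."""
    from tts_arabic import tts  # imported lazily so the listing below runs without it

    for model_id, vocoder_id, _ in combinations():
        tts(text, model_id=model_id, vocoder_id=vocoder_id,
            play=False, save_to=f"{out_dir}/{model_id}_{vocoder_id}.wav")


for model_id, vocoder_id, total in combinations():
    print(f"{model_id:>9} + {vocoder_id:<7} ~ {total:.1f}M parameters")
```

Calling `synthesize_all(text)` would write one WAV file per pairing into `out_dir`.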
The sequence of transformations is as follows:
Text → Phonemizer → Phonemes → Tokenizer → Token Ids → Text->Mel model → Mel spectrogram → Vocoder model → Wave
The Text->Mel models map token ids to mel frames. All models use the 80-bin configuration proposed by HiFi-GAN. This mel spectrogram contains frequencies up to 8 kHz. The vocoder models map the mel spectrogram to a waveform. The vocoders with vocoder_id `hifigan` and `vocos` artificially extend the bandwidth to 11025 Hz, and `vocos44` to 22050 Hz. Samples for comparing the models can be found here.
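The bandwidth figures follow from the Nyquist limit: a signal sampled at rate `fs` can represent frequencies up to `fs / 2`. The short sketch below works through that arithmetic using only the numbers given in this README; the function names are chosen for illustration.

```python
MEL_FMAX_HZ = 8000  # upper frequency covered by the 80-bin mel spectrogram

# Output sampling rates from the "Pretrained models" table.
VOCODER_SAMPLE_RATE_HZ = {"hifigan": 22050, "vocos": 22050, "vocos44": 44100}


def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


def extended_band_hz(vocoder_id):
    """Band (low, high) the vocoder has to generate without mel information."""
    return MEL_FMAX_HZ, nyquist_hz(VOCODER_SAMPLE_RATE_HZ[vocoder_id])


for vid in VOCODER_SAMPLE_RATE_HZ:
    low, high = extended_band_hz(vid)
    print(f"{vid}: mel input ends at {low} Hz, output bandwidth reaches {high:.0f} Hz")
```

Everything between the two values is content the vocoder has to synthesize on its own, which is why the extension is described as artificial.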
TTS options:
```python
from tts_arabic import tts

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(
    text,                     # input text
    speaker=1,                # speaker id; choose between 0, 1, 2, 3
    pace=1,                   # speaking pace
    denoise=0.005,            # vocoder denoiser strength
    volume=0.9,               # max amplitude (between 0 and 1)
    play=True,                # play audio?
    pitch_mul=1,              # pitch multiplier
    pitch_add=0,              # pitch offset
    vowelizer=None,           # vowelizer model ID (see "Vowelizer models")
    model_id='fastpitch',     # model ID of the Text->Mel model
    vocoder_id='hifigan',     # model ID of the vocoder model
    cuda=None,                # optional; CUDA device index
    save_to='./test.wav',     # optional; path for saving the audio as a WAV file
    bits_per_sample=32,       # used when save_to is specified (8, 16 or 32 bits)
)
```
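For readers unfamiliar with the `volume` and `bits_per_sample` options, the following sketch shows what such settings conventionally mean: scaling a signal so its peak equals a target amplitude, and quantizing it to 16-bit PCM when writing a WAV file. It uses only the Python standard library and is an assumption about typical behaviour, not the package's actual implementation; `peak_normalize` and `write_wav16` are hypothetical names.

```python
import io
import math
import struct
import wave


def peak_normalize(samples, volume=0.9):
    """Scale samples so that the largest absolute value equals `volume`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    return [s * volume / peak for s in samples]


def write_wav16(samples, target, sample_rate=22050):
    """Write float samples in [-1, 1] as 16-bit mono PCM to a path or file object."""
    ints = (int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples)
    pcm = struct.pack(f"<{len(samples)}h", *ints)
    with wave.open(target, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16 bits
        f.setframerate(sample_rate)
        f.writeframes(pcm)


# One second of a quiet 440 Hz tone, normalized to a peak of 0.9.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
normalized = peak_normalize(tone, volume=0.9)
buffer = io.BytesIO()
write_wav16(normalized, buffer)
print(f"peak = {max(abs(s) for s in normalized):.3f}, bytes written = {len(buffer.getvalue())}")
```

In practice, passing `volume`, `save_to` and `bits_per_sample` to `tts` makes this kind of post-processing unnecessary.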
Vowelizer models:
Model | Model ID | Paper | Repo | Architecture |
---|---|---|---|---|
CATT | catt_eo | arxiv | github | Transformer Encoder |
Shakkelha | shakkelha | arxiv | github | Bi-LSTM |
Shakkala | shakkala | - | github | Bi-LSTM |
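A vowelizer is only needed when the input lacks diacritics. One way to decide automatically is to count how many Arabic letters carry a diacritic mark. The sketch below is a simple heuristic under stated assumptions: the helper `pick_vowelizer` and the 0.3 threshold are invented for this example and are not part of `tts_arabic`.

```python
# Arabic diacritics (harakat): fathatan (U+064B) through sukun (U+0652).
HARAKAT = {chr(code) for code in range(0x064B, 0x0653)}


def pick_vowelizer(text, default="shakkelha", min_ratio=0.3):
    """Return None if `text` looks vocalized, else a vowelizer model ID.

    `min_ratio` (diacritics per Arabic letter) is an arbitrary threshold
    chosen for this sketch.
    """
    letters = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    marks = sum(1 for ch in text if ch in HARAKAT)
    if letters == 0:
        return None  # no Arabic letters, e.g. Buckwalter (ASCII) input
    return None if marks / letters >= min_ratio else default


print(pick_vowelizer("القهوة مشروب يعد من بذور البن المحمصة"))
print(pick_vowelizer("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."))
```

The result can then be passed on, for example `tts(text, vowelizer=pick_vowelizer(text))`.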
References:
The vocoder `vocos44` was converted from patriotyk/vocos-mel-hifigan-compat-44100khz.
The vowelizer `catt_eo` was converted from https://github.com/abjadai/catt/releases/tag/v2 (best_eo_mlm_ns_epoch_193.pt; License: CC BY-NC 4.0).