Failure example ("letter") #20

Open
xenova opened this issue Feb 2, 2025 · 7 comments

xenova commented Feb 2, 2025

>>> from misaki import en
>>> g2p = en.G2P(trf=False, british=False, fallback=None)
>>> phonemes, tokens = g2p("the letter")
>>> phonemes
'ðə lˈɛɾəɹ'

but it should be

ðə ˈɫɛtɝ

hexgrad (Owner) commented Feb 2, 2025

the letter => ðə lˈɛɾəɹ is actually correct based on the custom v1.0 English phoneset documented here (which is just missing ɐ): https://github.com/hexgrad/misaki/blob/main/EN_PHONES.md

For these types of speech models, your phoneset can vary slightly from by-the-book IPA, as long as it's trained consistently. You can see how the various phonemes are used in this demo: https://huggingface.co/spaces/hexgrad/Kokoro-TTS
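
To make the divergence concrete, here is a rough sketch of the two symbol-level differences in this example; the "standard" values on the right are my own approximation, not an official misaki mapping:

# Sketch: where misaki's American English output for "letter" diverges from a
# more textbook IPA transcription. The right-hand values are approximations.
differences = {
    'ɾ':  't',   # misaki writes the American flap explicitly
    'əɹ': 'ɚ',   # misaki spells the r-colored schwa as schwa + ɹ
}
approx_standard = 'ðə lˈɛɾəɹ'
for old, new in differences.items():
    approx_standard = approx_standard.replace(old, new)
print(approx_standard)  # ðə lˈɛtɚ (stress placement still follows misaki)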

xenova (Author) commented Feb 3, 2025

I see! Thanks! Does this mean that the to_espeak function should be updated?

def to_espeak(ps):
    # Optionally, you can add a tie character in between the 2 replacement characters.
    ps = ps.replace('ʤ', 'dʒ').replace('ʧ', 'tʃ')
    ps = ps.replace('A', 'eɪ').replace('I', 'aɪ').replace('Y', 'ɔɪ')
    ps = ps.replace('O', 'oʊ').replace('Q', 'əʊ').replace('W', 'aʊ')
    return ps.replace('ᵊ', 'ə')

(or is there another way to convert to IPA?)
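
(For reference, none of to_espeak's replacement rules match this example, so the string comes back unchanged:)

print(to_espeak('ðə lˈɛɾəɹ'))  # ðə lˈɛɾəɹ -- no rule rewrites 'ɾ' or 'əɹ'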

hexgrad (Owner) commented Feb 3, 2025

The to_espeak function converts back to espeak-ng phonemes, aka more standard IPA.

If going from standard IPA phonemes to the custom vocab Kokoro v1.0 understands, that would be more like the mapping logic in the EspeakFallback class:

misaki/misaki/espeak.py (lines 22 to 67 at 2432307):

# EspeakFallback is used as a last resort for English
class EspeakFallback:
    E2M = sorted({
        'ʔˌn\u0329':'tn', 'ʔn\u0329':'tn', 'ʔn':'tn', 'ʔ':'t',
        'a^ɪ':'I', 'a^ʊ':'W',
        'd^ʒ':'ʤ',
        'e^ɪ':'A', 'e':'A',
        't^ʃ':'ʧ',
        'ɔ^ɪ':'Y',
        'ə^l':'ᵊl',
        'ʲo':'jo', 'ʲə':'jə', 'ʲ':'',
        'ɚ':'əɹ',
        'r':'ɹ',
        'x':'k', 'ç':'k',
        'ɐ':'ə',
        'ɬ':'l',
        '\u0303':'',
    }.items(), key=lambda kv: -len(kv[0]))

    def __init__(self, british):
        self.british = british
        self.backend = phonemizer.backend.EspeakBackend(
            language=f"en-{'gb' if british else 'us'}",
            preserve_punctuation=True, with_stress=True, tie='^'
        )

    def __call__(self, token):
        ps = self.backend.phonemize([token.text])
        if not ps:
            return None, None
        ps = ps[0].strip()
        for old, new in type(self).E2M:
            ps = ps.replace(old, new)
        ps = re.sub(r'(\S)\u0329', r'ᵊ\1', ps).replace(chr(809), '')
        if self.british:
            ps = ps.replace('e^ə', 'ɛː')
            ps = ps.replace('iə', 'ɪə')
            ps = ps.replace('ə^ʊ', 'Q')
        else:
            ps = ps.replace('o^ʊ', 'O')
            ps = ps.replace('ɜːɹ', 'ɜɹ')
            ps = ps.replace('ɜː', 'ɜɹ')
            ps = ps.replace('ɪə', 'iə')
            ps = ps.replace('ː', '')
            ps = ps.replace('o', 'ɔ')  # for espeak < 1.52
        return ps.replace('^', ''), 2
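
For context, a rough usage sketch (not from the repo): EspeakFallback only reads token.text, so a minimal stand-in object is enough to try it, assuming phonemizer and espeak-ng are installed.

from types import SimpleNamespace

fallback = EspeakFallback(british=False)
# SimpleNamespace is just a stand-in for misaki's real token objects
ps, rating = fallback(SimpleNamespace(text='letter'))
print(ps)  # espeak-ng phonemes remapped into the Kokoro v1.0 phoneset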

xenova (Author) commented Feb 3, 2025

> The to_espeak function converts back to espeak-ng phonemes, aka more standard IPA.

Yes, exactly :) So, to convert from misaki to IPA (needed for my use-case), how should the "the letter" case be handled?

hexgrad (Owner) commented Feb 3, 2025

Oh, if going from misaki to IPA, I think that to_espeak function should be mostly accurate, but it may not be complete. For example, as you pointed out, misaki writes əɹ where other IPA systems might use ɝ or, in the case of espeak, ɚ.
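
A rough sketch of what a more complete misaki-to-IPA mapping might look like; the extra əɹ rule and the choice of ɚ (rather than ɝ) are assumptions based on this thread, not something misaki ships:

def to_ipa_sketch(ps):
    # Same replacements as to_espeak, plus the r-colored schwa case above.
    ps = ps.replace('ʤ', 'dʒ').replace('ʧ', 'tʃ')
    ps = ps.replace('A', 'eɪ').replace('I', 'aɪ').replace('Y', 'ɔɪ')
    ps = ps.replace('O', 'oʊ').replace('Q', 'əʊ').replace('W', 'aʊ')
    ps = ps.replace('əɹ', 'ɚ')  # assumption: collapse ə + ɹ to ɚ
    return ps.replace('ᵊ', 'ə')

print(to_ipa_sketch('ðə lˈɛɾəɹ'))  # ðə lˈɛɾɚ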

May I ask what use case requires converting misaki to IPA? (That function was originally intended for linguists/researchers to understand the mapping back to standard phonemes.) If you're running Kokoro v1.0 inference, just using misaki is the way to go. If you're on v0.19, you can use espeak-ng directly.

xenova (Author) commented Feb 3, 2025

I'm currently setting up an evaluation framework/benchmark to:

  1. compare different LLMs on the G2P task. I found that many LLMs fail on very basic examples like homographs, even when the context makes the intended pronunciation clear.
  2. generate synthetic data by using multiple LLMs and arriving at a consensus. As part of this, I'll also be using non-LLM approaches to help with voting!

TLDR: I need a standard format all models understand, so I chose IPA.

hexgrad (Owner) commented Feb 3, 2025

Ah, makes sense. I have definitely spent a fair amount of time thinking about G2P, both neural and non-neural. For English I'm fairly bearish on neural G2P, unless it is (1) implicitly done as part of large end-to-end TTS or (2) used as a last resort fallback model. From what I've seen, neural English G2P simply does not put up good numbers on the speed vs accuracy tradeoff curve.

Feel free to use misaki to produce this data or use the .json data files directly, but you should keep in mind these may not losslessly bridge the gap back to standard IPA. For example, misaki[en] will only use the vowel extender ː in British English and not American English, but many other G2P systems will include the vowel extender for both. It still could be helpful for consensus voting, though.
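
As a sketch of that caveat, one simple option is to normalize away the known differences before voting; these rules are my own simplifications, not part of misaki:

def normalize_for_voting(ipa):
    # Crude normalization so outputs from different G2P systems can be
    # compared more fairly for consensus voting.
    ipa = ipa.replace('ː', '')                       # drop vowel length marks
    ipa = ipa.replace('ɚ', 'əɹ').replace('ɝ', 'ɜɹ')  # unify r-colored vowels
    return ipa

print(normalize_for_voting('ɡɹˈiːn'))  # ɡɹˈin -- length mark stripped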
