Index out of bounds in _create_audio() #115

maxpatiiuk · 2025-02-23T03:54:00Z

What happened?

Follow-up to #95.

I get "IndexError: index 510 is out of bounds for axis 0 with size 510" for any input text that contains a long-ish sentence.

Steps to reproduce

Python script:

# examples/play.py
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
# Putting a dot in the middle of this text fixes the issue:
text="It may be that this communication will be considered as a madman's freak but at any rate it must be admitted that in its clearness and frankness it left nothing to be desired The serious part of it was that the Federal Government had undertaken to treat a sale by auction as a valid concession of these undiscovered territories Opinions on the matter were many Some readers saw in it only one of those prodigious outbursts of American humbug which would exceed the limits of puffism if the depths of human credulity were not unfathomable"
kokoro.create(text, voice="af_heart", lang="en-us")

Run the script:

LOG_LEVEL=DEBUG python examples/play.py

Thank you for the work on kokoro-onnx!

What OS are you seeing the problem on?

MacOS

Package version

0.4.2

Relevant log output

DEBUG    [__init__.py:34] koko-onnx version 0.4.2 on macOS-15.3.1-arm64-arm-64bit Darwin Kernel Version 24.3.0: Thu Jan  2 20:24:16 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6000
DEBUG    [__init__.py:53] Providers: ['CPUExecutionProvider']
DEBUG    [__init__.py:169] Creating audio for 2 batches for 556 phonemes
DEBUG    [__init__.py:76] Phonemes: 
DEBUG    [__init__.py:100] Created audio in length of 0.47s for 0 phonemes in 0.16s (RTF: 0.33
DEBUG    [__init__.py:76] Phonemes: ɪt mˈeɪ biː ðæt ðɪs kəmjˌuːnɪkˈeɪʃən wɪl biː kənsˈɪdɚd æz ɐ mˈædmənz fɹˈiːk bˌʌt æɾ ˌɛni ɹˈeɪt ɪt mˈʌst biː ɐdmˈɪɾᵻd ðæt ɪn ɪts klˈɪɹnəs ænd fɹˈæŋknəs ɪt lˈɛft nˈʌθɪŋ təbi dɪzˈaɪɚd ðə sˈɪɹiəs pˈɑːɹt ʌv ɪt wʌz ðætðə fˈɛdɚɹəl ɡˈʌvɚnmənt hæd ˌʌndɚtˈeɪkən tə tɹˈiːt ɐ sˈeɪl baɪ ˈɔːkʃən æz ɐ vˈælɪd kənsˈɛʃən ʌv ðiːz ʌndɪskˈʌvɚd tˈɛɹɪtˌɔːɹiz əpˈɪniənz ɔnðə mˈæɾɚ wɜː mˈɛni sˌʌm ɹˈiːdɚz sˈɔː ɪn ɪɾ ˈoʊnli wˈʌn ʌv ðoʊz pɹədˈɪdʒəs ˈaʊtbɜːsts ʌv ɐmˈɛɹɪkən hˈʌmbʌɡ wˌɪtʃ wʊd ɛksˈiːd ðə lˈɪmɪts ʌv pˈʌfɪzəm ɪf ðə dˈɛpθs ʌv hjˈuːmən kɹɛdʒˈuːlᵻɾi wɜː nˌɑːt ʌnfˈæðəməbəl
# (I modified the warning in the source code to include more details)
WARNING  [__init__.py:78] Phonemes (556) are too long, truncating to 510 phonemes (ɪt mˈeɪ biː ðæt ðɪs kəmjˌuːnɪkˈeɪʃən wɪl biː kənsˈɪdɚd æz ɐ mˈædmənz fɹˈiːk bˌʌt æɾ ˌɛni ɹˈeɪt ɪt mˈʌst biː ɐdmˈɪɾᵻd ðæt ɪn ɪts klˈɪɹnəs ænd fɹˈæŋknəs ɪt lˈɛft nˈʌθɪŋ təbi dɪzˈaɪɚd ðə sˈɪɹiəs pˈɑːɹt ʌv ɪt wʌz ðætðə fˈɛdɚɹəl ɡˈʌvɚnmənt hæd ˌʌndɚtˈeɪkən tə tɹˈiːt ɐ sˈeɪl baɪ ˈɔːkʃən æz ɐ vˈælɪd kənsˈɛʃən ʌv ðiːz ʌndɪskˈʌvɚd tˈɛɹɪtˌɔːɹiz əpˈɪniənz ɔnðə mˈæɾɚ wɜː mˈɛni sˌʌm ɹˈiːdɚz sˈɔː ɪn ɪɾ ˈoʊnli wˈʌn ʌv ðoʊz pɹədˈɪdʒəs ˈaʊtbɜːsts ʌv ɐmˈɛɹɪkən hˈʌmbʌɡ wˌɪtʃ wʊd ɛksˈiːd ðə lˈɪmɪts ʌv pˈʌfɪzəm ɪf ðə dˈɛpθs ʌv hjˈuːmən kɹɛdʒˈuːlᵻɾi wɜː nˌɑːt ʌnfˈæðəməbəl)
Traceback (most recent call last):
  File "/Users/maxpatiiuk/site/python/tts-nn/kokoro-onnx/examples/play.py", line 6, in <module>
    kokoro.create(text, voice="af_heart", lang="en-us")
  File "/Users/maxpatiiuk/site/python/tts-nn/venv/lib/python3.12/site-packages/kokoro_onnx/__init__.py", line 173, in create
    audio_part, _ = self._create_audio(phonemes, voice, speed)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/maxpatiiuk/site/python/tts-nn/venv/lib/python3.12/site-packages/kokoro_onnx/__init__.py", line 88, in _create_audio
    voice = voice[len(tokens)]
            ~~~~~^^^^^^^^^^^^^
IndexError: index 510 is out of bounds for axis 0 with size 510
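
Reading the traceback and the warning together, the failure looks like an off-by-one: the phonemes are truncated to 510, and then the token count (510) is used as an index into an array whose axis 0 has size 510, so only 0 through 509 are valid. The sketch below reproduces that pattern in isolation. It is an illustration, not the library's code: a plain list stands in for the voice array, and lookup_style is a hypothetical helper.

```python
MAX_PHONEME_LENGTH = 510  # limit reported in the warning above

# Stand-in for the voice array: 510 entries, valid indices 0..509.
voice = [f"style_{i}" for i in range(MAX_PHONEME_LENGTH)]


def lookup_style(tokens):
    # Mirrors the pattern in the traceback: index by the number of tokens.
    return voice[len(tokens)]


# One token under the limit still resolves to a valid index.
ok = lookup_style(list(range(MAX_PHONEME_LENGTH - 1)))

# Exactly MAX_PHONEME_LENGTH tokens indexes one past the end.
try:
    lookup_style(list(range(MAX_PHONEME_LENGTH)))
    failed = False
except IndexError:
    failed = True
```

Under this reading, any batch that reaches the full 510 phonemes fails, which would explain why truncation alone does not prevent the error.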


maxpatiiuk · 2025-02-23T04:28:43Z

My workaround was to override _split_phonemes with a more robust implementation:

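import re

from kokoro_onnx import Kokoro
from kokoro_onnx.config import MAX_PHONEME_LENGTH  # import path assumed; adjust if the constant lives elsewhere

class FixedKokoro(Kokoro):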
    # Workaround for https://github.com/thewh1teagle/kokoro-onnx/issues/115
    def _split_phonemes(self, phonemes: str) -> list[str]:
        batched_phonemes = []
        while len(phonemes) > MAX_PHONEME_LENGTH:
            # Find best split point within limit
            split_idx = MAX_PHONEME_LENGTH
            
            # Try to find the last period before MAX_PHONEME_LENGTH
            period_idx = phonemes.rfind('.', 0, MAX_PHONEME_LENGTH)
            if period_idx != -1:
                split_idx = period_idx + 1  # Include period
            
            else:
                # Try other punctuation
                match = re.search(r'[!?;,]', phonemes[:MAX_PHONEME_LENGTH][::-1])  # Search backwards
                if match:
                    split_idx = MAX_PHONEME_LENGTH - match.start()
                
                else:
                    # Try last space
                    space_idx = phonemes.rfind(' ', 0, MAX_PHONEME_LENGTH)
                    if space_idx != -1:
                        split_idx = space_idx
            
            # If no good split point is found, force split at MAX_PHONEME_LENGTH
            chunk = phonemes[:split_idx].strip()
            batched_phonemes.append(chunk)
            
            # Move to the next part
            phonemes = phonemes[split_idx:].strip()
        
        # Add remaining phonemes
        if phonemes:
            batched_phonemes.append(phonemes)
        
        return batched_phonemes

Happy to open a PR

maxpatiiuk · 2025-02-23T05:16:11Z

However, I still get an exception when the input is exactly MAX_PHONEME_LENGTH long. Here is an updated _split_phonemes that reduces the maximum length by 1:

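import re

from kokoro_onnx import Kokoro
from kokoro_onnx.config import MAX_PHONEME_LENGTH  # import path assumed; adjust if the constant lives elsewhere

class FixedKokoro(Kokoro):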
    # Workaround for https://github.com/thewh1teagle/kokoro-onnx/issues/115
    def _split_phonemes(self, phonemes: str) -> list[str]:
        max_length = MAX_PHONEME_LENGTH - 1
        batched_phonemes = []
        while len(phonemes) > max_length:
            # Find best split point within limit
            split_idx = max_length
            
            # Try to find the last period before max_length
            period_idx = phonemes.rfind('.', 0, max_length)
            if period_idx != -1:
                split_idx = period_idx + 1  # Include period
            
            else:
                # Try other punctuation
                match = re.search(r'[!?;,]', phonemes[:max_length][::-1])  # Search backwards
                if match:
                    split_idx = max_length - match.start()
                
                else:
                    # Try last space
                    space_idx = phonemes.rfind(' ', 0, max_length)
                    if space_idx != -1:
                        split_idx = space_idx
            
            # If no good split point is found, force split at max_length
            chunk = phonemes[:split_idx].strip()
            batched_phonemes.append(chunk)
            
            # Move to the next part
            phonemes = phonemes[split_idx:].strip()
        
        # Add remaining phonemes
        if phonemes:
            batched_phonemes.append(phonemes)
        
        return batched_phonemes
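
To sanity-check the splitting strategy without loading the model, the same idea can be written as a standalone function. This is a sketch, not the library's implementation: split_phonemes is a hypothetical free function, MAX_PHONEME_LENGTH is redefined locally as 510 to match the log above, and the punctuation search uses re.finditer on the unreversed window instead of searching a reversed string.

```python
import re

MAX_PHONEME_LENGTH = 510  # assumed to match the limit shown in the log


def split_phonemes(phonemes: str, max_length: int = MAX_PHONEME_LENGTH - 1) -> list[str]:
    """Split at the last period, else other punctuation, else a space, else a hard cut."""
    chunks = []
    phonemes = phonemes.strip()
    while len(phonemes) > max_length:
        window = phonemes[:max_length]
        split_idx = max_length  # fallback: hard cut at the limit

        period_idx = window.rfind(".")
        if period_idx != -1:
            split_idx = period_idx + 1  # keep the period in the chunk
        else:
            matches = list(re.finditer(r"[!?;,]", window))
            if matches:
                split_idx = matches[-1].end()  # keep the punctuation mark
            else:
                space_idx = window.rfind(" ")
                if space_idx > 0:
                    split_idx = space_idx

        chunk = phonemes[:split_idx].strip()
        if chunk:  # skip empty chunks
            chunks.append(chunk)
        phonemes = phonemes[split_idx:].strip()

    if phonemes:
        chunks.append(phonemes)
    return chunks


# Two 300-character runs separated by a space split at the space.
parts = split_phonemes("a" * 300 + " " + "b" * 300)

# An input of exactly MAX_PHONEME_LENGTH characters is split, not passed through whole.
edge = split_phonemes("x" * MAX_PHONEME_LENGTH)
```

Every chunk comes out at most MAX_PHONEME_LENGTH - 1 characters long, which is the property the workaround relies on.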

freddyaboulton · 2025-03-04T23:34:48Z

Yes, I got the same error!

maxpatiiuk added the bug label on Feb 23, 2025

freddyaboulton mentioned this issue Mar 6, 2025

Fix kokoro batch issue freddyaboulton/fastrtc#128

Merged
