TTS Results Occasionally Truncated #1992

Open
KangSquad opened this issue Jan 22, 2025 · 0 comments
We have observed that the TTS results are sometimes truncated. The key characteristics of this issue are as follows:

  1. Incomplete Output: Only part of target_text is spoken.
  2. Longer Processing Time: This seems to occur because the following stop condition in t2s_model.py is never met, causing the loop to run all 1500 iterations:
if torch.argmax(logits, dim=-1)[0] == self.EOS or samples[0, 0] == self.EOS:
    stop = True
if stop:
    if y.shape[1] == 0:
        y = torch.concat([y, torch.zeros_like(samples)], dim=1)
        print("bad zero prediction")
    print(f"T2S Decoding EOS [{prefix_len} -> {y.shape[1]}]")
    break
  3. Poor Audio Quality: The generated audio has degraded quality and is less similar to the target voice.
  4. Fixed Output Length: The output .wav file is always 1 minute long, with some of the text read and the rest padded with silence.
  5. Anomalous Log Values: When inspecting the logs in the get_tts_wav function of inference_webui.py, the pred_semantic shape and idx value differ from normal operation:
pred_semantic shape: torch.Size([1, 1, 1498])
value: 1498
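The symptoms above are consistent with an autoregressive decoding loop that never predicts EOS and therefore runs to its step cap. A minimal stand-alone sketch (plain-Python stand-in for the torch loop; the EOS id and the predictor callbacks are hypothetical, only MAX_STEPS = 1500 mirrors the issue):

```python
EOS = 1024          # hypothetical EOS token id
MAX_STEPS = 1500    # decoding cap, as in t2s_model.py

def decode(next_token):
    """Simulate the T2S loop: stop on EOS, otherwise run to the cap."""
    y = []
    for _ in range(MAX_STEPS):
        tok = next_token(len(y))
        if tok == EOS:   # the stop condition that is sometimes never met
            break
        y.append(tok)
    return y

# Normal run: EOS appears after 200 steps, so output length tracks the text.
normal = decode(lambda step: EOS if step == 200 else 0)

# Abnormal run: EOS is never predicted, so the loop exhausts all 1500 steps,
# giving the fixed-length pred_semantic (~1498) and the long processing time.
abnormal = decode(lambda step: 0)
```

This would explain why the truncated runs always take longer and always produce the same output length.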

My guess is that inference_webui.py calls infer_panel, and inside infer_panel_naive neither torch.argmax(logits, dim=-1)[0] nor samples[0, 0] ever equals self.EOS.
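One way to test this guess is to log the probability mass assigned to self.EOS at each decoding step; if it stays near zero throughout, the stop condition can never fire. A minimal stand-alone sketch (plain-Python softmax over one logits row; the vocabulary size and EOS id are hypothetical):

```python
import math

EOS = 1024  # hypothetical EOS token id (last entry of the logits row)

def eos_probability(logits):
    """Softmax probability assigned to EOS in a single step's logits row."""
    m = max(logits)                             # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    return exps[EOS] / sum(exps)

# Example: a row where EOS is strongly suppressed relative to other tokens,
# so argmax will never select it and the decode loop cannot stop early.
logits = [0.0] * 1025
logits[EOS] = -10.0
p = eos_probability(logits)
```

If such logging confirmed a persistently suppressed EOS logit, the fix would likely lie in the conditioning inputs (prompt, bert features) rather than in the stop condition itself.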

The experiment was conducted under identical conditions for the parameters of infer_panel, including all_phoneme_ids, all_phoneme_len, None if ref_free else prompt, bert, top_k, top_p, temperature, and early_stop_num, as well as the same target_text, reference_text, and reference_audio.

Related Code (inference_webui.py)
[screenshot]

Normal Output
[screenshot]

Abnormal Output
[screenshot]

Thank you for taking the time to look into this issue. I truly appreciate your efforts in maintaining and improving this project, and I am happy to provide additional details or conduct further testing if needed. 🙂
