
Word boundary event not working for online voices #16

Closed

PaulBlenkhorn opened this issue Aug 8, 2024 · 15 comments

@PaulBlenkhorn

The word boundary event works well for the offline (Narrator) voices but isn't working properly for the online Edge voices. I'm pretty sure the online voices do send word boundary information as I have this working in an Edge extension that uses the voices directly.

@gexgd0419
Owner

This engine supports word boundary events for Edge voices. Viseme events are also supported, so that you can see the animated mic in the ttsapplication.

Could you tell me which TTS client application and Edge voice you are using, what text you want it to speak, and what isn't working properly, so that I could try to reproduce the problem?

@PaulBlenkhorn
Author

Many thanks for your very quick response.
I am using my own programs, written in C# on Windows 11. I think the easiest way to see the problem is the attached video, which shows the word boundary events failing to synchronise for the Edge voice "Microsoft Clara online" and then working for the voice "Microsoft Jenny". This happens in all of my programs that use SAPI; they are all WinForms applications but have rather different architectures. The simple example shown sends the boundary information to a WebView control, but other programs do not use the WebView control.

My code captures the word boundary event with:

...
// Subscribe to SpeakProgress, which is raised for each word boundary
GlobalR.synthesizer.SpeakProgress += new EventHandler<SpeakProgressEventArgs>(synth_SpeakProgress);
...

void synth_SpeakProgress(object sender, SpeakProgressEventArgs e)
{
    // Character position (within the prompt text) of the word being spoken
    int cPosition = e.CharacterPosition;
    string s = ">" + cPosition.ToString();
    Console.WriteLine("Position: " + s);
    // Forward the position to the WebView so the page can highlight the word
    this.webView21.CoreWebView2.PostWebMessageAsString(s);
}
Screen.Recording.2024-08-08.140916.online-video-cutter.com.mp4

@gexgd0419
Owner

Confirmed that this happens when using System.Speech.Synthesis.SpeechSynthesizer in C#. But it's still weird that TtsApplication, which is written in C++ and uses the COM API directly, doesn't seem to have this issue.

Also I found a more serious problem: calling SelectVoice to change the voice to a NaturalVoiceSAPIAdapter voice would often throw an ArgumentException saying "Cannot set voice. No matching voice is installed or the voice was disabled." Does this often happen on your system?

@PaulBlenkhorn
Author

PaulBlenkhorn commented Aug 8, 2024

I'm glad you can reproduce the problem. It's strange that it is only with the online voices.

Yes, I also see the more serious problem (with both offline and online voices). I have found that on my development machine the software always needs 7-10 seconds after first starting before the voice selection works (other SAPI voices do not need this delay). I have worked around the problem by repeatedly trying to initialise the voice until it succeeds, at which point I break out of my loop.
Once the voice has been set the error does not seem to repeat. Here is my C# code that does this:
for (int i = 0; i < 100; i++)
{
    try
    {
        synthesizer.SelectVoice(s);
        // btnLoad.Visible = false;
        Console.WriteLine(s);
        break;
    }
    catch
    {
        // if (i == 0)
        //     ShowLoading();
        Console.WriteLine("Failed " + i);
        await Task.Delay(TimeSpan.FromMilliseconds(500));
    }
}

gexgd0419 added a commit that referenced this issue Aug 10, 2024
…r when selecting voice in C#. (Found in #16)

Previously, we cached the whole CVoiceToken and its CVoiceKey COM objects. Now we cache only the data strings necessary for constructing the tokens and keys (in a DataKeyData tree struct), and construct the corresponding COM objects only when necessary.

We use one shared_ptr for each CVoiceToken. All of a token's CVoiceKeys share the data block with their root CVoiceToken. The shared data block is only released when neither the CVoiceToken nor any of its subkeys exist.

When the cache is still valid, we create brand-new COM CVoiceTokens for the client, but they will share the same data blocks already in the cache.
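
As a hypothetical analogy of that caching scheme, sketched in C# rather than the actual C++ (only the name DataKeyData and the token/key idea come from the commit above; everything else is illustrative): only plain data is cached, and token objects are constructed on demand, all sharing the same cached data tree.

// Hypothetical C# analogy of the caching scheme described above (the real
// implementation uses C++ COM objects with shared_ptr-managed data blocks).
using System.Collections.Generic;

class DataKeyData                        // cached: plain data only, no COM objects
{
    public Dictionary<string, string> Values { get; } = new();
    public Dictionary<string, DataKeyData> SubKeys { get; } = new();
}

class VoiceToken                         // constructed per request, wraps shared data
{
    private readonly DataKeyData _data;  // shared with the cache and with any subkeys
    public VoiceToken(DataKeyData data) => _data = data;

    public string GetValue(string name) => _data.Values[name];
    public VoiceToken OpenSubKey(string name) => new VoiceToken(_data.SubKeys[name]);
}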
@PaulBlenkhorn
Author

PaulBlenkhorn commented Aug 10, 2024

I can confirm that the synthesizer.SelectVoice() function is now working well on some of my programs but the delay is still needed on some of the others. I will investigate more and try to identify the issue and give you more information.

(I presume that this fix does not address the Word boundary event issue for online voices which is still not working for me in any of my programs.)

@PaulBlenkhorn
Author

PaulBlenkhorn commented Aug 11, 2024

I have written a small program to show the current issues and a video to show how it works.

In the video:

  1. I start the program and then fairly quickly click on Microsoft Jenny; you can see that the program needs several calls to synthesizer.SelectVoice() before the speech initialises. The words are shown as they are spoken.
  2. I click on Microsoft Willem Online and the voice is initialised without issue. However, the word position is incorrect and you can see a number of errors displayed in the output window of the program.

I hope this is helpful.

Screen.Recording.2024-08-11.071337.online-video-cutter.com.mp4

@PaulBlenkhorn
Author

... and the code
App.zip

@gexgd0419
Owner

"I can confirm that the synthesizer.SelectVoice() function is now working well on some of my programs but the delay is still needed on some of the others."

So what version did you use for testing? Did you clone my repo and compile it? Because I haven't released a new version yet.

"I have written a small program to show the current issues and a video to show how it works."

According to the video, the program output some websocketpp logs (the [frame_payload] Payload bytes: lines in the Debug output). I have made it not output any websocketpp logs by default in commit 44d8620, so this seemed to be an older version.

I tried your program. On my system and with my newest version, Microsoft Jenny voice could be used with no delay.

@PaulBlenkhorn
Author

PaulBlenkhorn commented Aug 11, 2024

I did clone your repo and rebuilt it, but forgot to register NaturalVoiceSAPIAdapter.dll - mea culpa. Now I have done that Jenny loads with no delay. Any idea why the online voices are not sending the word boundary event correctly?

(As you are making a tool to make the "Narrator" Natural voices accessible to SAPI, you may be amused to know that I wrote the original Narrator for Microsoft over 20 years ago.)

@gexgd0419
Owner

Seems that the C# System.Speech module uses its own mechanism to access the COM SAPI voices. Usually clients create instances of SpVoice objects, and let the SAPI framework handle the interactions with the TTS engine. But System.Speech doesn't use SpVoice. Instead, it has a whole set of different COM interop classes, and uses these to interact with TTS engines directly. Although System.Speech tries to replicate the SAPI framework's behavior, the differences in their implementations cause some problems.

For example, this TTS engine sends event information with the correct timestamps during speaking. The SAPI framework respects the timestamps, and will deliver the events to the client at the correct time. System.Speech, however, seems to just ignore the timestamps, and delivers the events to the client at the time the events are generated.

Local Narrator voices use the Azure Speech SDK as the backend, and it will do the synchronization for us. Edge voices use my own implementation (as they are not supported by the SDK), and my engine will parse the information received from the server. The server sends all event information with their timestamps first, followed by the actual audio data. My engine will immediately pass the received events to the SAPI framework, so all the events will come before the audio, which I guess may be the reason why the word boundary events (and maybe all events) are out of sync when System.Speech is acting as the SAPI framework.

If my guess is correct, synchronizing the events in the engine myself would fix this issue.

For reference, here's part of the implementation of the engine site object in System.Speech, which is passed into SAPI TTS engines so that the engine can pass the synthesized audio and events back to the SAPI framework and then to the client app.

internal class EngineSite : ITtsEngineSite, ITtsEventSink
{
	// ...
	public void AddEvents([MarshalAs(UnmanagedType.LPArray, SizeParamIndex = 1)] SpeechEventInfo[] events, int ulCount)
	{
		try
		{
			for (int i = 0; i < events.Length; i++)
			{
				SpeechEventInfo sapiEvent = events[i];
				int num = 1 << (int)sapiEvent.EventId;
				if (sapiEvent.EventId == 2 && _eventMapper != null)
				{
					_eventMapper.FlushEvent();
				}
				if ((num & _eventInterest) != 0)
				{
					TTSEvent evt = CreateTtsEvent(sapiEvent);
					if (_eventMapper == null)
					{
						AddEvent(evt);
					}
					else
					{
						_eventMapper.AddEvent(evt);
					}
				}
			}
		}
		catch (Exception exception)
		{
			_exception = exception;
			_actions |= SPVESACTIONS.SPVES_ABORT;
		}
	}
	// ...
	private TTSEvent CreateTtsEvent(SpeechEventInfo sapiEvent)
	{
		switch ((TtsEventId)sapiEvent.EventId)
		{
		case TtsEventId.Phoneme:
			return TTSEvent.CreatePhonemeEvent(((char)((uint)(int)sapiEvent.Param2 & 0xFFFFu)).ToString() ?? "", ((char)((uint)sapiEvent.Param1 & 0xFFFFu)).ToString() ?? "", TimeSpan.FromMilliseconds(sapiEvent.Param1 >> 16), (SynthesizerEmphasis)((int)sapiEvent.Param2 >>> 16), _prompt, _audio.Duration);
		case TtsEventId.Bookmark:
		{
			string bookmark = Marshal.PtrToStringUni(sapiEvent.Param2);
			return new TTSEvent((TtsEventId)sapiEvent.EventId, _prompt, null, null, _audio.Duration, _audio.Position, bookmark, (uint)sapiEvent.Param1, sapiEvent.Param2);
		}
		default:
			return new TTSEvent((TtsEventId)sapiEvent.EventId, _prompt, null, null, _audio.Duration, _audio.Position, null, (uint)sapiEvent.Param1, sapiEvent.Param2);
		}
	}
}

AddEvents implements ISpEventSink::AddEvents, which is what TTS engines should call to tell SAPI about their events. But this implementation just assumes that _audio.Duration is the time position of the event, which is calculated based on the written byte count:

	internal override TimeSpan Duration
	{
		get
		{
			if (_nAvgBytesPerSec == 0)
			{
				return new TimeSpan(0L);
			}
			return new TimeSpan((long)_bytesWritten * 10000000L / _nAvgBytesPerSec);
		}
	}
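
Putting the two together: since the Edge engine passed all events to the framework before writing any audio, _bytesWritten would presumably still be zero when each event arrived, so every event would be stamped at time zero and raised immediately. A minimal sketch of that byte-based clock (the 32,000 bytes-per-second figure is an assumed 16 kHz, 16-bit mono format, not taken from the code above):

// Minimal sketch (not from the repo) of the byte-based event clock quoted above.
// Assumes a 16 kHz, 16-bit mono format, i.e. 32,000 average bytes per second.
using System;

class ByteClockDemo
{
    const int AvgBytesPerSec = 32000;

    static TimeSpan Duration(long bytesWritten) =>
        new TimeSpan(bytesWritten * 10000000L / AvgBytesPerSec);

    static void Main()
    {
        // No audio written yet: every event delivered at this point is stamped "now".
        Console.WriteLine(Duration(0));        // 00:00:00
        // After 48,000 bytes of audio have been written: 1.5 seconds.
        Console.WriteLine(Duration(48000));    // 00:00:01.5000000
    }
}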

@PaulBlenkhorn
Author

PaulBlenkhorn commented Aug 11, 2024

"If my guess is correct, synchronizing the events in the engine myself would fix this issue."
Yes, I think that should fix it. The only way I've accessed the Edge voices is through a Chrome/Web extension, where the timing of the word boundary event is fine. Here's some of my (JavaScript) code, which is in a background script, but I don't think it is that relevant for you. (I was actually sending the information to a C# host using Native Messaging, but I have abandoned that now as your solution will be much better.)
chrome.tts.speak(response.speak, {
    voiceName: voice,
    pitch: params.voicePitch,
    rate: params.voiceRate,
    volume: params.voiceVolume,
    requiredEventTypes: ['end', 'word'],
    onEvent: function (event) {
        if (event.type === 'end') {
            console.log("Speech ended.");
            port.postMessage({ text: "End" });
            // port.postMessage({ index: "-1", length: "-1" });
        }
        if (event.type === 'word') {
            console.log("W: " + event.charIndex.toString(), "L:" + event.length.toString());
            port.postMessage({ index: event.charIndex.toString(), length: event.length.toString() });
        }
    }
});

"I guess may be the reason why the word boundary events (and maybe all events) are out of sync" - that would seem very plausible.

@PaulBlenkhorn
Author

PaulBlenkhorn commented Aug 11, 2024

FYI: My free web extension can be found here: https://microsoftedge.microsoft.com/addons/detail/readableweb/pfagdimehoadoklbcbheaahkeamhohbp

It is free but not open source. However, if you email me privately at [email protected] I will be happy to share any of the code with you.

gexgd0419 added a commit that referenced this issue Aug 12, 2024
Changed SpeechRestAPI so that it can store all received events in a sorted queue, then deliver the events at the correct audio time.

Usually the server sends event data before audio data, so before sending each audio block, we check if there's any event in the duration of this audio block. If there is an event, first the audio part before the event is sent, then the event is sent, and finally the rest of the audio is sent. This makes sure that the client receives each event at the exact written wave byte count.

Because occasionally audio data can go before its event data, we introduce a short delay (10ms) when there's currently no event for an audio block. If after 10ms there's still no event for the audio block, we allow the process to continue.

Sometimes ISpTtsEngineSite::Write can fail without outputting any audio. We should check its return value to see if it succeeded or not.
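
For illustration only, here is a rough C# sketch of that interleaving idea. Everything here is hypothetical (the real engine is C++, and the 10 ms fallback delay is left out); it just shows splitting an audio block around the events that fall inside it.

// Rough sketch of the interleaving described in the commit above; names and types
// are illustrative, not taken from the repository. Events are kept in a queue
// sorted by audio byte offset; before writing each audio block, any event that
// falls inside the block is delivered at its exact position, splitting the block.
using System;
using System.Collections.Generic;

static class EventAudioInterleaver
{
    public static void WriteBlock(
        byte[] block, long blockStartByte,
        SortedList<long, string> pendingEvents,          // byte offset -> event payload
        Action<ArraySegment<byte>> writeAudio,
        Action<string> sendEvent)
    {
        int written = 0;
        long blockEndByte = blockStartByte + block.Length;

        // Deliver every event whose byte offset lies inside this block.
        while (pendingEvents.Count > 0 && pendingEvents.Keys[0] < blockEndByte)
        {
            int upTo = (int)Math.Max(pendingEvents.Keys[0] - blockStartByte, written);
            if (upTo > written)                           // audio before the event
                writeAudio(new ArraySegment<byte>(block, written, upTo - written));
            written = upTo;

            sendEvent(pendingEvents.Values[0]);           // event at its exact byte position
            pendingEvents.RemoveAt(0);
        }

        // Write whatever audio remains after the last event in this block.
        if (written < block.Length)
            writeAudio(new ArraySegment<byte>(block, written, block.Length - written));
    }
}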
@PaulBlenkhorn
Author

That's great. I've downloaded and compiled your new source and that seems to work. Will do some more testing but think you are there :)

@PaulBlenkhorn
Author

I think this is now fixed.

@gexgd0419
Owner

A new version v0.2 has been released!
