Word boundary event not working for online voices #16

The word boundary event works well for the offline (Narrator) voices but isn't working properly for the online Edge voices. I'm pretty sure the online voices do send word boundary information, as I have this working in an Edge extension that uses the voices directly.

Comments
This engine supports word boundary events for Edge voices. Viseme events are also supported, so you can see the animated mic in the TtsApplication sample. Could you tell me which TTS client application and Edge voice you are using, what text you want it to speak, and what isn't working properly, so that I can try to reproduce the problem?
Many thanks for your very quick response. My code captures the word boundary event with:
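A minimal sketch of the standard System.Speech wiring for this event (illustrative only, not the poster's actual snippet):

```csharp
using System;
using System.Speech.Synthesis;

class WordBoundaryDemo
{
    static void Main()
    {
        using var synth = new SpeechSynthesizer();
        // SpeakProgress is raised at each word boundary (SAPI's SPEI_WORD_BOUNDARY).
        synth.SpeakProgress += (s, e) =>
            Console.WriteLine($"word at char {e.CharacterPosition}: \"{e.Text}\"");
        synth.Speak("Hello from a natural voice.");
    }
}
```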
Screen.Recording.2024-08-08.140916.online-video-cutter.com.mp4
Confirmed that this happens when using the C# SpeechSynthesizer. Also, I found a more serious problem: calling SelectVoice() can fail.
I'm glad you can reproduce the problem. It's strange that it is only with the online voices. Yes, I also have your serious problem (with offline and online voices). I have found that the software always needs 7-10 seconds on my development machine after first starting before the voice selection works (other SAPI voices do not need this delay). I have worked around the problem by repeatedly trying to initialise the voice until it succeeds, at which point I break out of my loop.
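A sketch of that retry workaround, assuming the C# SpeechSynthesizer (the delay and retry cap here are illustrative):

```csharp
using System;
using System.Speech.Synthesis;
using System.Threading;

class VoiceInit
{
    // Keep trying to select the voice until the engine is ready.
    static SpeechSynthesizer InitVoiceWithRetry(string voiceName)
    {
        var synth = new SpeechSynthesizer();
        for (int attempt = 0; attempt < 100; attempt++)
        {
            try
            {
                synth.SelectVoice(voiceName); // throws until the voice is available
                return synth;                 // success: break out of the loop
            }
            catch (ArgumentException)
            {
                Thread.Sleep(250);            // brief pause, then try again
            }
        }
        throw new TimeoutException($"Voice '{voiceName}' never became available.");
    }
}
```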
…r when selecting voice in C#. (Found in #16)

Previously, we cached the whole CVoiceToken and its CVoiceKey COM objects. Now we cache only the data strings necessary for constructing the tokens and keys (in a DataKeyData tree structure), and construct the corresponding COM objects only when necessary. We use one shared_ptr for each CVoiceToken. All of a token's CVoiceKeys share the data block with their root CVoiceToken; the shared data block is released only when neither the CVoiceToken nor any of its subkeys exist. When the cache is still valid, we create brand-new COM CVoiceTokens for the client, but they share the data blocks already in the cache.
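The fix itself is C++ (shared_ptr plus COM); purely to illustrate the caching idea, here is a rough C# sketch in which every type and member beyond the names mentioned above is invented:

```csharp
using System.Collections.Generic;

// Plain cached data: strings only, no COM objects kept alive.
class DataKeyData
{
    public Dictionary<string, string> Values { get; } = new();
    public Dictionary<string, DataKeyData> SubKeys { get; } = new();
}

// A token constructed on demand from the cached data. Every token and
// subkey for one voice references the same DataKeyData block, so the
// block lives exactly as long as any token or subkey does.
class VoiceToken
{
    private readonly DataKeyData _data;
    public VoiceToken(DataKeyData data) => _data = data;

    public VoiceToken OpenSubKey(string name) => new VoiceToken(_data.SubKeys[name]);
    public string GetValue(string name) => _data.Values[name];
}

class VoiceTokenCache
{
    private readonly Dictionary<string, DataKeyData> _cache = new();

    // A brand-new token object on each call, sharing the cached data block.
    public VoiceToken GetToken(string voiceId) => new VoiceToken(_cache[voiceId]);
}
```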
I can confirm that the synthesizer.SelectVoice() function is now working well in some of my programs, but the delay is still needed in some of the others. I will investigate more, try to identify the issue, and give you more information. (I presume that this fix does not address the word boundary event issue for online voices, which is still not working for me in any of my programs.)
I have written a small program to show the current issues and a video to show how it works. I hope this is helpful.
Screen.Recording.2024-08-11.071337.online-video-cutter.com.mp4
... and the code
So what version did you use for testing? Did you clone my repo and compile it? Because I haven't released a new version yet.
According to the video, the program output some websocketpp logs. I tried your program; on my system and with my newest version, …
I did clone your repo and rebuilt it, but forgot to register NaturalVoiceSAPIAdapter.dll - mea culpa. Now I have done that, Jenny loads with no delay. Any idea why the online voices are not sending the word boundary event correctly? (As you are making a tool to make the "Narrator" natural voices accessible to SAPI, you may be amused to know that I wrote the original Narrator for Microsoft over 20 years ago.)
Seems that the C# SpeechSynthesizer handles SAPI events in its own way.

For example, this TTS engine sends event information with the correct timestamps during speaking. The SAPI framework respects the timestamps and will deliver the events to the client at the correct time.

Local Narrator voices use the Azure Speech SDK as the backend, and it does the synchronization for us. Edge voices use my own implementation (as they are not supported by the SDK), and my engine parses the information received from the server. The server sends all event information with their timestamps first, followed by the actual audio data. My engine immediately passes the received events to the SAPI framework, so all the events come before the audio, which I guess may be the reason why the word boundary events (and maybe all events) are out of sync when using the C# SpeechSynthesizer.

If my guess is correct, synchronizing the events in the engine myself would fix this issue.

For reference, here's part of the implementation of the engine site object in System.Speech:

```csharp
internal class EngineSite : ITtsEngineSite, ITtsEventSink
{
// ...
public void AddEvents([MarshalAs(UnmanagedType.LPArray, SizeParamIndex = 1)] SpeechEventInfo[] events, int ulCount)
{
try
{
for (int i = 0; i < events.Length; i++)
{
SpeechEventInfo sapiEvent = events[i];
int num = 1 << (int)sapiEvent.EventId;
if (sapiEvent.EventId == 2 && _eventMapper != null)
{
_eventMapper.FlushEvent();
}
if ((num & _eventInterest) != 0)
{
TTSEvent evt = CreateTtsEvent(sapiEvent);
if (_eventMapper == null)
{
AddEvent(evt);
}
else
{
_eventMapper.AddEvent(evt);
}
}
}
}
catch (Exception exception)
{
_exception = exception;
_actions |= SPVESACTIONS.SPVES_ABORT;
}
}
// ...
private TTSEvent CreateTtsEvent(SpeechEventInfo sapiEvent)
{
switch ((TtsEventId)sapiEvent.EventId)
{
case TtsEventId.Phoneme:
return TTSEvent.CreatePhonemeEvent(((char)((uint)(int)sapiEvent.Param2 & 0xFFFFu)).ToString() ?? "", ((char)((uint)sapiEvent.Param1 & 0xFFFFu)).ToString() ?? "", TimeSpan.FromMilliseconds(sapiEvent.Param1 >> 16), (SynthesizerEmphasis)((int)sapiEvent.Param2 >>> 16), _prompt, _audio.Duration);
case TtsEventId.Bookmark:
{
string bookmark = Marshal.PtrToStringUni(sapiEvent.Param2);
return new TTSEvent((TtsEventId)sapiEvent.EventId, _prompt, null, null, _audio.Duration, _audio.Position, bookmark, (uint)sapiEvent.Param1, sapiEvent.Param2);
}
default:
return new TTSEvent((TtsEventId)sapiEvent.EventId, _prompt, null, null, _audio.Duration, _audio.Position, null, (uint)sapiEvent.Param1, sapiEvent.Param2);
}
}
}
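
// From the audio output object (a separate class): converts the bytes
// written so far into the stream time used for event positions.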
internal override TimeSpan Duration
{
get
{
if (_nAvgBytesPerSec == 0)
{
return new TimeSpan(0L);
}
return new TimeSpan((long)_bytesWritten * 10000000L / _nAvgBytesPerSec);
}
}
```
"If my guess is correct, synchronizing the events in the engine myself would fix this issue." "I guess may be the reason why the word boundary events (and maybe all events) are out of sync" - that would seem very plausible. |
FYI: My free web extension can be found here: https://microsoftedge.microsoft.com/addons/detail/readableweb/pfagdimehoadoklbcbheaahkeamhohbp It is free but not open source. However, if you email me privately at [email protected], I will be happy to share any of the code with you.
Changed SpeechRestAPI so that it can store all received events in a sorted queue, then deliver the events at the correct audio time.

Usually the server sends event data before audio data, so before sending each audio block, we check if there's any event within the duration of this audio block. If there is an event, first the audio before the event is sent, then the event is sent, and finally the rest of the audio is sent. This makes sure that the client receives the event at the exact written wave byte count.

Because occasionally audio data can arrive before its event data, we introduce a short delay (10 ms) when there's currently no event for an audio block. If after 10 ms there's still no event for the audio block, we allow the process to continue.

Sometimes ISpTtsEngineSite::Write can fail without outputting any audio. We should check its return value to see whether it succeeded.
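The actual change lives in the C++ SpeechRestAPI class; purely to illustrate the interleaving logic described above, here is a rough C# sketch in which every name is invented (the 10 ms wait for late events is noted in a comment but not implemented):

```csharp
using System;
using System.Collections.Generic;

class EventSyncSketch
{
    // Events queued in order of their audio byte offset.
    private readonly PriorityQueue<string, long> _events = new();
    private long _bytesWritten; // total audio bytes already delivered

    public void OnEventReceived(string evt, long byteOffset)
        => _events.Enqueue(evt, byteOffset);

    // Called for each audio block received from the server. In the real code,
    // if no event is queued for this block yet, we would first wait up to
    // 10 ms for one to arrive before continuing.
    public void OnAudioReceived(byte[] block, Action<byte[]> writeAudio, Action<string> sendEvent)
    {
        int pos = 0; // position inside this block
        while (_events.TryPeek(out var evt, out var offset)
               && offset <= _bytesWritten + block.Length)
        {
            // Send the audio that precedes the event, so the client sees the
            // event at the exact written wave byte count. Events whose offset
            // has already passed are sent immediately.
            int split = (int)(offset - _bytesWritten);
            if (split > pos)
            {
                writeAudio(block[pos..split]);
                pos = split;
            }
            sendEvent(evt);
            _events.Dequeue();
        }
        if (pos < block.Length)
            writeAudio(block[pos..]); // the rest of the block
        _bytesWritten += block.Length;
    }
}
```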
That's great. I've downloaded and compiled your new source and that seems to work. Will do some more testing but think you are there :)
I think this is now fixed. |
A new version v0.2 has been released! |