-
Hi, I am building an Android app that uses the ONNX VAD model to detect speech in the AudioRecord stream. I got the ONNX model integrated, but the probabilities are way off: I have to lower the threshold to 0.15 for the detection to be more accurate. I realized that the recording amplitude was a bit too low, so I multiply the samples by a gain, but that doesn't seem to solve the issue. Here is the integration point; please let me know if I have done anything wrong. I can also provide a sample recording. voice-0e0d0a5c-4d51-4209-a51d-c0e32b39ede3.pcm.zip The zip file contains a raw PCM mono recording, 16000 Hz sample rate, big-endian float format. This audio file is already multiplied by a |
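For anyone who wants to inspect a recording in the format described above, here is a minimal Python sketch for decoding it. It assumes the "float" samples are 32-bit (the post does not state the width), and the helper names `decode_be_float_pcm`, `peak_amplitude`, and `duration_seconds` are made up for illustration.

```python
import struct
from typing import List, Sequence

SAMPLE_RATE = 16000  # Hz, as stated in the post
BYTES_PER_SAMPLE = 4  # assumption: 32-bit floats; the post only says "float"


def decode_be_float_pcm(raw: bytes) -> List[float]:
    """Decode raw mono PCM stored as big-endian floats into a list of samples."""
    # Ignore a trailing partial sample, if the byte count is not a multiple of 4.
    usable = len(raw) - len(raw) % BYTES_PER_SAMPLE
    count = usable // BYTES_PER_SAMPLE
    # ">" selects big-endian byte order, "f" is a 32-bit float.
    return list(struct.unpack(f">{count}f", raw[:usable]))


def peak_amplitude(samples: Sequence[float]) -> float:
    """Largest absolute sample value; useful for checking whether a gain is needed."""
    return max((abs(s) for s in samples), default=0.0)


def duration_seconds(samples: Sequence[float], sample_rate: int = SAMPLE_RATE) -> float:
    """Length of the recording in seconds."""
    return len(samples) / sample_rate
```

Reading the unzipped file with `open(path, "rb").read()` and passing the bytes to `decode_be_float_pcm` should give samples that can be checked for peak level before and after applying a gain.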
Replies: 2 comments 4 replies
-
Hi, I cannot really comment on the language used for Android, but it looks like your code is not stateful, i.e. the VAD should keep its state during the session. With ONNX the best illustration is here: Lines 63 to 68 in e7c4539 It looks like you always re-create these tensors. In practical terms, this means that on each chunk the VAD "thinks" that it is looking at a new audio. Also note that there is quite a bit of logic in What I also recommend as a debugging tool is to run your audio using the provided utils with |
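To illustrate the statefulness point, here is a toy Python sketch. It does not call the real model: `toy_step` is a made-up stand-in for one inference call that takes a chunk plus a recurrent state and returns a probability plus the updated state, and `StatefulVad` is a hypothetical wrapper name. The only thing it is meant to show is the difference between carrying the state across chunks and re-creating it on every call.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]  # stand-in for the recurrent state tensors
StepFn = Callable[[Sequence[float], State], Tuple[float, State]]

INITIAL_STATE: State = (0.0, 0.0)


def toy_step(chunk: Sequence[float], state: State) -> Tuple[float, State]:
    """Hypothetical stand-in for one model call (NOT the real VAD)."""
    h, c = state
    energy = sum(x * x for x in chunk) / len(chunk)
    c = 0.9 * c + 0.1 * energy  # slowly accumulating memory of past chunks
    h = min(1.0, 10.0 * c)      # pseudo "speech probability"
    return h, (h, c)


class StatefulVad:
    """Keeps the recurrent state between calls for the whole audio session."""

    def __init__(self, step: StepFn = toy_step) -> None:
        self._step = step
        self._state = INITIAL_STATE

    def reset_states(self) -> None:
        """Call this only when a new, unrelated audio stream starts."""
        self._state = INITIAL_STATE

    def __call__(self, chunk: Sequence[float]) -> float:
        # Feed the previous state in and store the updated state for the next chunk.
        prob, self._state = self._step(chunk, self._state)
        return prob


def stateless_probs(chunks: Sequence[Sequence[float]], step: StepFn = toy_step) -> List[float]:
    """The problematic pattern: a fresh state is created for every chunk."""
    return [step(chunk, INITIAL_STATE)[0] for chunk in chunks]
```

With the stateless pattern every chunk is scored as if it were the first chunk of a new recording, so evidence never accumulates. In an Android integration the equivalent fix would be to keep the state tensors as fields that outlive a single inference call and to reset them only between sessions.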
-
For anyone who is trying to integrate the ONNX model into Android, this is a working version I put together in a ViewModel that takes a live stream of PCM audio as input and returns the speech buffers. It is a bit rough since I don't have much Kotlin experience, but it should show how the ONNX model is used in Kotlin/Java/Android. I will spend some time setting up an Android library just for Silero VAD integration and incorporate the |
Hi,
I cannot really comment on the language used for Android, but it looks like your code is not stateful, i.e. the VAD should keep its state during the session. With ONNX the best illustration is here:
silero-vad/utils_vad.py
Lines 63 to 68 in e7c4539
Looks like you always re-create these tensors:
https://github.com/lhr0909/live-subtitles-rokid-ar/blob/f19ecc197d3bee6484fa7145f73607bb90d77869/app/src/main/java/chat/senses/lives…