Request for further fine-tuning on Romanian language #336
-
🚀 Feature
Further fine-tuning on Romanian datasets / Latin languages, plus more transparency about the architecture of the model and comparisons with other VAD detectors (as of 2023). I would greatly appreciate it if you took these aspects into consideration in future releases.

Motivation
I have integrated the VAD component into a diarization system. It is a crucial component for extracting good speaker representations, free of noise and silence. Up until now, I have used the vad_multilingual_marblenet model (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/vad_multilingual_marblenet) and it was good enough. However, I am using a CPU-only environment and it is pretty slow. It is also not fine-tuned on any Romanian dataset. Silero runs 10-15 times faster than MarbleNet, but its FA+MISS / DER are much higher. For my own custom dataset in Romanian:
Note: I have tuned the threshold for Silero on the dataset, while for NeMo I have not done any tuning. Is there any reason for the discrepancy in performance, even though MarbleNet was not trained on Romanian and Silero was?

Pitch
Thanks to its speed, Silero is a really viable option, especially for commercial purposes. Could you consider fine-tuning, in future releases, on more audio in Romanian and other Latin languages? Could you provide some information about the quantity of audio per language, for example Romanian, Spanish, Italian, French, etc.?

Alternatives
The capability to fine-tune the model would be amazing.
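For readers comparing VADs the same way, here is a minimal sketch of how the FA and MISS figures mentioned above could be computed at frame level. It is not the benchmark code used in this post. The function names, the 10 ms frame step, and the choice to normalize by reference speech time (as DER does) are all assumptions made for illustration.

```python
def to_frames(segments, duration, step=0.01):
    """Rasterize (start, end) speech segments in seconds into boolean frames."""
    n = int(round(duration / step))
    frames = [False] * n
    for start, end in segments:
        lo = max(0, int(round(start / step)))
        hi = min(n, int(round(end / step)))
        for i in range(lo, hi):
            frames[i] = True
    return frames


def fa_miss(reference, hypothesis, duration, step=0.01):
    """Return (false_alarm_rate, miss_rate), both relative to reference speech time.

    Assumption: rates are normalized by total reference speech, as in DER.
    """
    ref = to_frames(reference, duration, step)
    hyp = to_frames(hypothesis, duration, step)
    ref_speech = sum(ref)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    # False alarm: VAD says speech where the reference says non-speech.
    fa = sum(1 for r, h in zip(ref, hyp) if h and not r)
    # Miss: reference says speech where the VAD says non-speech.
    miss = sum(1 for r, h in zip(ref, hyp) if r and not h)
    return fa / ref_speech, miss / ref_speech


# Hypothetical example: 2 s of reference speech, hypothesis shifted 0.5 s early.
fa_rate, miss_rate = fa_miss([(1.0, 3.0)], [(0.5, 2.5)], duration=4.0)
```

Tuning the detection threshold then amounts to running the VAD at several threshold values and keeping the one with the lowest `fa_rate + miss_rate` on held-out audio.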
Replies: 2 comments 3 replies
-
Hi, Let's separate these numerous questions into buckets:
Since you are developing a commercial application, you are welcome to DM us on Telegram (preferably) or by email.
We provide our VAD with "batteries included", so this is basically out of scope for us.
These metrics were updated at the end of 2022 - https://github.com/snakers4/silero-vad/wiki/Quality-Metrics. Naturally we tested only streaming performance:
We have not tested and / or optimized our VAD to be used for diarization.
I am not familiar with your dataset and/or diarization metrics, but my guess is that longer chunks may be beneficial for diarization. We cannot really tell without looking at your dataset and benchmark code. If you would like us to help you tune the params for optimal performance on your domain (or just check that you are using our VAD correctly), we can discuss it commercially; please DM us on Telegram (preferably) or by email.
Most likely we will not be focusing on these languages for reasons that are out of scope for a technical discussion.
This is planned this year, but the decision to invest time in this depends on reasons out of our control.
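On the guess above that longer chunks may help diarization: one way to experiment with this is to post-process the VAD output by bridging short pauses and dropping very short segments. The sketch below is a generic illustration and is not part of Silero VAD. The function name and the `max_gap` and `min_duration` defaults are hypothetical and would need tuning on your own data.

```python
def merge_segments(segments, max_gap=0.3, min_duration=0.25):
    """Merge (start, end) segments in seconds separated by short pauses.

    max_gap: pauses up to this length are bridged (hypothetical default).
    min_duration: merged segments shorter than this are dropped
    (hypothetical default).
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Pause is short enough: extend the previous segment.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_duration]


# Hypothetical VAD output: a 0.2 s pause is bridged, a 0.1 s blip is dropped.
chunks = merge_segments([(0.0, 1.0), (1.2, 2.0), (3.0, 3.1), (5.0, 6.0)])
```

Whether longer chunks actually lower DER depends on the dataset, so any such setting should be checked against the same benchmark used for the comparison.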
-
As a first step, we released the dataset - https://github.com/snakers4/silero-vad/tree/master/datasets