A PyTorch implementation of "Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech". This paper has been submitted to ICASSP 2024.
Arxiv: https://arxiv.org/pdf/2309.08408.pdf
This project targets real-world speech scenarios where conversations are sparsely overlapped.

There are three stages to train ActiveExtract:
- Pretrain an ASD module using TalkSet.
You can train it yourself following https://github.com/TaoRuijie/TalkNet-ASD, or load the pretrained model (Checkpoint/TalkNet_TalkSet.model).
- Pretrain ActiveExtract on the highly overlapped speech dataset VoxCeleb2-2Mix.
The ASD module is kept frozen during this stage.
- Fine-tune ActiveExtract on the sparsely overlapped speech dataset IEMOCAP-2Mix.
The ASD module is kept frozen during this stage as well.
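In stages 2 and 3 above, the ASD module is frozen while the extraction network is trained. A minimal PyTorch sketch of that pattern is shown below; the class and attribute names (`TinyASD`, `TinyExtractor`, `asd`, `head`) are illustrative placeholders, not the actual classes in this repo.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the pretrained ASD module.
class TinyASD(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 8)

    def forward(self, x):
        return self.net(x)

# Illustrative stand-in for the extraction network that wraps the ASD module.
class TinyExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.asd = TinyASD()
        self.head = nn.Linear(8, 1)

    def forward(self, x):
        return self.head(self.asd(x))

model = TinyExtractor()
# In the real setup you would first load the pretrained ASD weights, e.g.:
# model.asd.load_state_dict(torch.load("Checkpoint/TalkNet_TalkSet.model"))

# Freeze the ASD module so only the extractor's own parameters are updated.
for p in model.asd.parameters():
    p.requires_grad = False

# Pass only trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Freezing via `requires_grad = False` keeps the ASD forward pass active (its speaker-activity cues still flow to the extractor) while excluding its weights from gradient updates.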
You can find trained models in the 'Checkpoint' folder.
You can find audio samples from this link: https://activeextract.github.io/
Contact Email: [email protected]