In this work, we present recursive fusion of joint cross-attention across the audio and visual modalities for person verification.
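The following is a loose, illustrative sketch of the idea in PyTorch: each modality is recursively attended against a joint audio-visual representation. The layer shapes, residual connections, and iteration count here are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class RecursiveJointCrossAttention(nn.Module):
    """Loose sketch of recursive joint cross-attentional fusion.

    Dimensions, residuals, and the recursion depth are illustrative
    assumptions; see the paper for the exact formulation.
    """

    def __init__(self, d=512):
        super().__init__()
        self.w_ja = nn.Linear(d, d, bias=False)  # joint projection for audio
        self.w_jv = nn.Linear(d, d, bias=False)  # joint projection for visual
        self.w_a = nn.Linear(d, d, bias=False)   # audio output projection
        self.w_v = nn.Linear(d, d, bias=False)   # visual output projection

    def forward(self, xa, xv, num_iters=2):
        # xa, xv: (batch, T, d) audio / visual feature sequences
        d = xa.size(-1)
        for _ in range(num_iters):                # recursive refinement
            j = torch.cat([xa, xv], dim=1)        # joint repr: (batch, 2T, d)
            # correlate each modality with the joint representation
            ca = torch.tanh(xa @ self.w_ja(j).transpose(1, 2) / d ** 0.5)
            cv = torch.tanh(xv @ self.w_jv(j).transpose(1, 2) / d ** 0.5)
            # attend over the joint representation; residuals keep the inputs
            xa = xa + self.w_a(torch.softmax(ca, dim=-1) @ j)
            xv = xv + self.w_v(torch.softmax(cv, dim=-1) @ j)
        return xa, xv

# Toy usage: fuse 10-step audio and visual sequences of 512-d features.
fusion = RecursiveJointCrossAttention(d=512)
xa, xv = torch.randn(2, 10, 512), torch.randn(2, 10, 512)
fa, fv = fusion(xa, xv)
print(fa.shape, fv.shape)  # torch.Size([2, 10, 512]) for both
```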
If you find this work useful in your research, please consider citing our work 📝 and giving a star 🌟:
```bibtex
@article{praveen2024audio,
  title={Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention},
  author={Praveen, R Gnana and Alam, Jahangir},
  journal={arXiv preprint arXiv:2403.04654},
  year={2024}
}
```
There are three major blocks in this repository for reproducing the results of our paper. The code uses mixed-precision training (torch.cuda.amp). The dependencies and packages required to reproduce the environment can be found in the environment.yml file.
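For reference, the torch.cuda.amp training pattern looks like the sketch below; the model, optimizer, and data here are stand-ins for illustration, not the repository's actual training loop.

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

# Stand-in model and data; the real training loop is launched via run_train.sh.
model = nn.Linear(512, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()  # scales the loss so fp16 gradients do not underflow

features = torch.randn(8, 512, device="cuda")       # dummy fused embeddings
labels = torch.randint(0, 2, (8,), device="cuda")   # dummy labels

optimizer.zero_grad()
with autocast():                   # run the forward pass in mixed precision
    loss = criterion(model(features), labels)
scaler.scale(loss).backward()      # backpropagate the scaled loss
scaler.step(optimizer)             # unscale gradients, then optimizer.step()
scaler.update()                    # adapt the loss scale for the next step
```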
Create an environment using the environment.yml file:

```bash
conda env create -f environment.yml
```
The pre-trained models of the audio and visual backbones can be obtained here.
The fusion models trained with our approach can be found here.
The text files can be found here:
- train_list: training list
- val_trials: validation trials list
- val_list: validation list
- test_trials: Vox1-O trials list
- test_list: Vox1-O list
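For illustration, VoxCeleb-style trial files commonly put one pair per line as `label enroll_path test_path`; the hypothetical parser below assumes that layout, which may differ from the exact format of the files linked above.

```python
# Hypothetical parser for a VoxCeleb-style trials file. It assumes the common
# "label enroll_path test_path" per-line layout, which may differ from the
# exact format of the lists linked above.
def load_trials(path):
    trials = []
    with open(path) as f:
        for line in f:
            label, enroll, test = line.split()
            trials.append((int(label), enroll, test))
    return trials

pairs = load_trials("test_trials.txt")  # hypothetical filename
print(len(pairs), pairs[0])
```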
Please download the following:
- The images of the VoxCeleb1 dataset can be downloaded here.
- The downloaded images are not properly aligned, so we align them using InsightFace; a minimal alignment sketch follows this list. The preprocessing scripts are provided in the preprocessing folder.
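A minimal sketch of InsightFace-based alignment is given below; the model name, crop size, and file paths are assumptions, and the scripts in the preprocessing folder remain the authoritative pipeline.

```python
# Minimal sketch of aligning VoxCeleb1 frames with InsightFace. The model
# name, crop size, and paths are assumptions for illustration.
import cv2
from insightface.app import FaceAnalysis
from insightface.utils import face_align

app = FaceAnalysis(name="buffalo_l")          # detector + landmark model
app.prepare(ctx_id=0, det_size=(640, 640))    # ctx_id=0 selects the first GPU

img = cv2.imread("frame.jpg")                 # hypothetical input frame (BGR)
faces = app.get(img)                          # detect faces and 5-pt landmarks
if faces:
    aligned = face_align.norm_crop(img, landmark=faces[0].kps, image_size=112)
    cv2.imwrite("frame_aligned.jpg", aligned)  # save the aligned face crop
```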
Training and evaluation are launched through Slurm:

```bash
sbatch run_train.sh   # training
sbatch run_eval.sh    # evaluation
```
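For context, person verification is commonly scored with cosine similarity over trial pairs and reported as equal error rate (EER). The sketch below is a generic implementation of that metric, not the repository's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(e1, e2):
    # cosine similarity between two speaker/face embeddings
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def compute_eer(labels, scores):
    # EER is the point on the ROC curve where FPR equals FNR (miss rate)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Toy usage with scores for four trial pairs (perfectly separable -> EER 0)
print(compute_eer([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.4]))
```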
Our code is based on AVCleanse.