Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks

slinusc/speaker_identification_evaluation

Abstract

This study evaluates three speech encoder models—Wav2Vec 2.0, XLS-R, and Whisper—on speaker identification tasks. By fine-tuning the models and analyzing their layer-wise representations with SVCCA, k-means clustering, and t-SNE visualizations, we find that Wav2Vec 2.0 and XLS-R capture speaker-specific features effectively in their early layers, with fine-tuning improving both stability and performance, while Whisper performs better in its deeper layers. We also determine the optimal number of transformer layers to retain from each model when fine-tuning for speaker identification.
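The layer-similarity analysis above relies on SVCCA (Singular Vector Canonical Correlation Analysis) to compare representations across transformer layers. As a rough illustration of the idea—not the repository's actual implementation—the following self-contained NumPy sketch computes a simplified SVCCA score between two (samples × features) representation matrices: each matrix is SVD-truncated to the directions explaining most of its variance, and the mean canonical correlation between the resulting subspaces is returned. The function name, the `keep` variance threshold, and all shapes are illustrative assumptions.

```python
import numpy as np

def svcca_similarity(X, Y, keep=0.99):
    """Simplified SVCCA score between two representations of the
    same inputs, each shaped (n_samples, n_features).

    NOTE: an illustrative sketch, not the repository's code.
    """
    def svd_reduce(A, keep):
        A = A - A.mean(axis=0)                      # center features
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        # keep enough singular directions to explain `keep` of variance
        frac = np.cumsum(s**2) / np.sum(s**2)
        k = int(np.searchsorted(frac, keep)) + 1
        return U[:, :k] * s[:k]

    Xr, Yr = svd_reduce(X, keep), svd_reduce(Y, keep)
    # Canonical correlations between the two reduced subspaces are the
    # singular values of Qx^T Qy, with Qx, Qy orthonormal bases.
    Qx, _ = np.linalg.qr(Xr)
    Qy, _ = np.linalg.qr(Yr)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(rho.mean())
```

In the layer-wise setting described in the abstract, `X` and `Y` would be hidden states of two different layers (or the same layer before and after fine-tuning) collected over a common batch of utterances; a score near 1 indicates the layers encode nearly the same subspace.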
