Skip to content

JaesungHuh/VoxSRC2021

 
 

Repository files navigation

VoxSRC-21

In this repository we provide the validation toolkit for the VoxCeleb Speaker Recognition Challenge 2021. The challenge consists of two main tasks, speaker verification and speaker diarisation.

This repository contains submodules. Please use the following command to clone the repository:

git clone https://github.com/JaesungHuh/VoxSRC2021.git --recursive

Dependencies

pip install -r requirements.txt

Speaker Verification

Within speaker verification, we have 3 tracks (open, closed, and self-supervised), however the validation and test data is the same for all three tracks. The validation and test data for the challenge both consist of pairs of audio segments, and the task is to determine whether they are from the same speaker or from different speakers. Teams are invited to create a system that takes the test data and produces a list of floating-point scores, with a single score for each pair of segments.

Validation Data

The validation data consists of trial pairs of speech from the identities in the VoxCeleb1 dataset. Each trial pair consists of two single-speaker audio segments and can be found in data/verif/trials.txt.

File Format

Your ouput file for scoring should be a single space-delimited text files containing pair of audio segments per line, each line containing three fields:

  • Score -- score; must be a floating point number between the closed interval [0, 1], where 1 means the pair of segments correspond to the same speaker and 0 means the pair of segments correspond to different speakers.
  • FILE1 -- name of audio segment; should correspond to trials provided in data/verif/trials.txt
  • FILE2 -- name of audio segment; should correspond to trials provided in data/verif/trials.txt

For example:

0.503 id10912/H5qe-mHhOyQ/00007.wav id10912/nDpaVYtKQUo/00004.wav
0.431 id10560/p_V0oeCcc0w/00011.wav id10560/_SIZKabFLAM/00001.wav
0.351 id10792/La6IDPsWJHE/00007.wav id11044/BVzEyYdVLXg/00001.wav

Also see data/verif/scores.txt for an example file.

The leaderboard for the challenge will compute two metrics, the Detection Cost and the Equal Error Rate (EER). Further details about the metrics are provided below. The file compute_min_dcf.py computes the detection cost. With the random scores provided you should get a Detection Cost of 0.2919.

python compute_min_dcf.py --p-target 0.05 --c-miss 1 --c-fa 1 data/verif/scores.txt data/verif/trials.txt

The file compute_EER.py computes the EER. With an example file provided you should get an EER of 5.177%

python compute_EER.py --ground_truth data/verif/trials.txt --prediction data/verif/scores.txt

While the leaderboard will display both metrics, the winners of the challenge will be determined using the Detection Cost ALONE, which is the primary metric.

Metrics

Equal Error Rate

This is the rate used to determine the threshold value for a system when its false acceptance rate (FAR) and false rejection rate (FRR) are equal. FAR = FP/(FP+TN) and FRR = FN/(TP+FN)

where

FN is the number of false negatives

FP is the number of false positives

TN is the number of true negatives

TP is the number of true positives

Minimum Detection Cost

Compared to equal error-rate, which assigns equal weight to false negatives and false positives, this error-rate is usually used to assess performance in settings where achieving a low false positive rate is more important than achieving a low false negative rate. We follow the procedure outlined in Sec 3.1 of the NIST 2018 Speaker Recognition Evaluation Plan for AfV trials, and use the following parameters for the cost function:

  1. CMiss (cost of a missed detection) = 1
  2. CFalseAlarm (cost of a spurious detection) = 1
  3. PTarget (a priori probability of the specified target speaker) = 0.05

This is the PRIMARY metric for the challenge.

Speaker Diarisation

For speaker diarisation, we only have a single track. The goal is to break up multispeaker segments into sections of "who spoke when". In our case, each multispeaker audio file is independant (i.e. we will treat the sets of speakers in each file as disjoint), and the audio files will be of variable length. Our scoring code is obtained from the excellent DSCORE repo.

Validation Data

The validation data can be obtained following the instructions here. Please note that you could use both Voxconverse dev & test set as your validation set.

File Format

Your output file for scoring (as well as the ground truth labels for the validation set which we provide) must be a Rich Transcription Time Marked (RTTM) file.

Rich Transcription Time Marked (RTTM) files are space-delimited text files containing one turn per line, each line containing ten fields:

  • Type -- segment type; should always by SPEAKER
  • File ID -- file name; basename of the recording minus extension (e.g., abcde)
  • Channel ID -- channel (1-indexed) that turn is on; should always be 1
  • Turn Onset -- onset of turn in seconds from beginning of recording
  • Turn Duration -- duration of turn in seconds
  • Orthography Field -- should always by <NA>
  • Speaker Type -- should always be <NA>
  • Speaker Name -- name of speaker of turn; should be unique within scope of each file
  • Confidence Score -- system confidence (probability) that information is correct; should always be <NA>
  • Signal Lookahead Time -- should always be <NA>

For instance:

SPEAKER abcde 1   0.240   0.300 <NA> <NA> 3 <NA> <NA>
SPEAKER abcde 1   0.600   1.320 <NA> <NA> 3 <NA> <NA>
SPEAKER abcde 1   1.950   0.630 <NA> <NA> 3 <NA> <NA>

If you would like to confirm that your output RTTM file is valid, use the included validate_rttm.py script. We provide an example in data/diar/baseline_dev.rttm for Voxconverse dev set.

 python validate_rttm.py data/diar/baseline_dev.rttm

The file compute_diarisation_metrics.py computes both DER and JER. If you want to calculate the metrics with our baseline rttm file:

python compute_diarisation_metrics.py -r voxconverse/dev/*.rttm -s data/diar/baseline_dev.rttm

Metrics

Diarization Error Rate (DER)

The leaderboard will be ranked using the diarization error rate (DER), which is the sum of

  • speaker error -- percentage of scored time for which the wrong speaker id is assigned within a speech region
  • false alarm speech -- percentage of scored time for which a nonspeech region is incorrectly marked as containing speech
  • missed speech -- percentage of scored time for which a speech region is incorrectly marked as not containing speech

We use a collar of 0.25 seconds and include overlapping speech in the scoring. For more details, consult section 6.1 of the NIST RT-09 evaluation plan.

Jaccard error rate (JER)

We also report Jaccard error rate (JER), a metric introduced for DIHARD II that is based on the Jaccard index. The Jaccard index is a similarity measure typically used to evaluate the output of image segmentation systems and is defined as the ratio between the intersection and union of two segmentations.

Further Details

Please refer to the Challenge Webpage for more information about the challenge.

Acknowledgements

The code from computing the DetectionCost is largely obtained from the excellent KALDI toolkit, and for computing the DER is from the excellent DSCORE repo. Please read their licenses carefully before redistributing.

About

Development Toolkit for the VoxCeleb Speaker Recognition Challenge 2021

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Perl 52.3%
  • Python 47.7%