📢 [09/02/23] We released an official test dataset used in our paper. For the fair comparison, you may use this dataset.
"LibriPhrase" is an open-source dataset tailored for tasks like user-defined keyword spotting (UDKWS), open-vocabulary keyword spotting (OV-KWS), and spoken-term detection (STD). It is constructed based on the LibriSpeech corpus, and we adhere to the license terms of LibriSpeech.
In this recipe, we have taken advantage of exceptional tools and libraries, listed as follows:
- Source : LibriSpeech (ASR corpus) [1]
- Alignment : Montreal Forced Aligner [2][3]
- String distance metric : Levenshtein distance [4]
- Unit : Phoneme-level (extract phoneme information from G2P [5])
Anchor text | Easy negative text | Hard negative text |
---|---|---|
friend | guard comfort superior |
frind rend trend |
the river | every morning town with not occurred |
the giver the liver the rigor |
i mean to | and be made be a banner no less than |
i seen to i mean you we mean to |
at the right time | began the kissing and rubbing two bits of conseil and land spent |
at the same time at the one time knew the right time |
./libriphrase.py
./utils.py
./run.sh
./requirements.txt
./data/
├── librispeech_clean_dev_all_utt.csv
├── ...
├── librispeech_other_train_500h_all_utt.csv
├── ...
├── libriphrase_diffspk_all_1word.csv
├── ...
└── libriphrase_diffspk_all_4word.csv
/LibriSpeech_ASR_corpus/
├── dev-clean/
├── ...
└── train-other-500/
├── BOOKS.TXT
├── CHAPTERS.TXT
├── ...
└── train-other-500/
├── 1006/
├── ...
└── 985/
├── 126224/
├── ...
└── 126228/
├── 985-126228-0000.wav
...
/LibriPhrase_diffspk_all/
├── dev-clean/
├── ...
└── train-other-500/
└── train-other-500/
├── 1006/
├── ...
└── 985/
├── 126224/
├── ...
└── 126228/
├── 985-126228-0004_1word_0.wav
...
This work is performed in this environment.
- Python 3.6.9
- Linux Ubuntu 18.04
Before you start, make sure to prepare the LibriSpeech ASR corpus.
Once the download is complete, proceed to clone this repository and install the package dependencies using the following steps:
git clone https://github.com/gusrud1103/LibriPhrase.git
cd LibriPhrase
pip install -r requirements.txt
Afterwards, download alignment CSV files from Google Link and place the files into data
folder.
mkdir data
cd data # locate CSV files in this folder
This recipe is designed to extract short phrases (consisting of 1~4 words) from the LibriSpeech dataset and organize them into three categories: anchor, positive, and negative.
By controlling the mode
argument in the script, you can customize the dataset's difficulty (easy/hard) and speaker type (same/different).
The script will create a hierarchical folder structure based on both the LibriSpeech dataset and manual arguments (Please refer the data pipeline).
You can use this recipe to generate test dataset. (We will also release the recipe for the training set soon.)
To ensure a fair comparison with our paper, we offer the official test dataset.
We generated the test data from “train-other-500” of LibriSpeech.
./run.sh
or
python3 libriphrase.py --libripath 'your path(librispeech wav files)' --newpath 'new path(libriphrase wav files)' --wordalign './data/librispeech_other_train_500h_all_utt.csv' --output './data/testset_librispeech_other_train_500h_short_phrase.csv' --numpair 3 --maxspk 1611 --maxword 4 --mode 'diffspk_all'
--libripath
: Folder for LibriSpeech ASR corpus (wav files)--newpath
: Folder to save generated LibriPhrase wav files--wordalign
: Word alignment information for LibriSpeech ASR corpus (Download csv files todata
folder)--output
: Output filename that containing the information about generated LibriPhrase in csv format--numpair
: The number of samples in each case--maxspk
: The number of speakers (for reducing computation)--maxword
: The maximum number of words to construct short phrase (Select integer from 1 to 4)--mode
: The mode for comparison type (samespk/diffspk denote the consistency of the speaker between anchor and comparison and easy/hard/all denote negative type [samespk_easy
,diffspk_easy
,diffspk_hard
,diffspk_all
]
anchor
: The file path of the anchor wav fileanchor_spk
: The speaker of the anchor wav fileanchor_text
: The text of the anchor wav fileanchor_dur
: The duration of the anchor wav filecomparison
: The file path of the comparison wav filecomparison_spk
: The speaker of the comparison wav filecomparison_text
: The text of the comparison wav filecomparison_dur
: The duration of the comparison wav filetype
: The category of the comparison (it depends onmode
, so ifmode
issamespk_easy
, thensamespk_positive
,samespk_easyneg
are showed in the type column.)
mode type category samespk_easy
samespk_positive
,samespk_easyneg
diffspk_easy
diffspk_positive
,diffspk_easyneg
diffspk_hard
diffspk_positive
,diffspk_hardneg
diffspk_all
diffspk_positive
,diffspk_easyneg
,diffspk_hardneg
target
: If a test sample (comparison) matches with the given anchor (text), the value is1
, otherwise0
.class
: The number of words in the phrase
[1] Vassil Panayotov, Guoguo Chen, Daniel Povey, and San-jeev Khudanpur, “Librispeech: an asr corpus based onpublic domain audio books,” in ICASSP, 2015.
[2] Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio, "Speech Model Pre-training for End-to-End Spoken Language Understanding", INTERSPEECH 2019.
[3] Michael McAuliffe, Michaela Socolof, Sarah Mihuc,Michael Wagner, and Morgan Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi.,” in INTERSPEECH, 2017.
[4] Vladimir I Levenshtein et al., “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady. Soviet Union, 1966, vol. 10, pp. 707–710.
[5] Jongseok Park, Kyubyong Kim, “g2pe,”https://github.com/Kyubyong/g2p, 2019.
Distributed under the MIT License. See LICENSE
for more information.
If you use this code, please cite:
@inproceedings{Shin22LibriPhrase,
author={Hyeon-Kyeong Shin and Hyewon Han and Doyeon Kim and Soo-Whan Chung and Hong-Goo Kang},
title={Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting},
year=2022,
booktitle={INTERSPEECH},
pages={1871--1875},
doi={10.21437/Interspeech.2022-580}
}