The Wearable SELD dataset is a dataset for developing sound event localization and detection (SELD) systems with wearable devices. It contains recordings collected using wearable devices such as an earphone, a neck speaker, a headphone, and glasses. The Wearable SELD dataset consists of three types of datasets, listed below.
- Earphone type dataset: It contains recordings collected by 12 microphones placed around the ears, mimicking an earphone, as in the image below.
- Mounting type dataset: It contains recordings collected by 12 microphones placed around the head with accessories mimicking glasses, a headphone, and a neck speaker, as in the image below.
- FOA format dataset: It contains 4-channel recordings collected by an ambisonic microphone, which allows comparison between conventional methods using the FOA format and methods using the above datasets.
Each dataset type has three versions of the sub-dataset: Anechoic, Reverberation, and Reverberation + noise. The FOA format dataset has only the Anechoic version. In total, the Wearable SELD dataset contains 7 sub-datasets. The correspondence is shown in the table below.
|  | Earphone type | Mounting type | FOA format |
| --- | --- | --- | --- |
| Anechoic | ○ | ○ | ○ |
| Reverberation | ○ | ○ | × |
| Reverberation + noise | ○ | ○ | × |
You can download the dataset here. The Earphone/Mounting type datasets are split into 8-9 zip files by 7-zip; the FOA format dataset is not split. Download the split zip files corresponding to the dataset of interest and unzip them with your favorite compression tool. As a sample, we also provide a smaller version of the dataset, named the Small wearable SELD dataset. The size of each dataset is shown in the table below.
|  | Zipped size (GB) | Unzipped size (GB) |
| --- | --- | --- |
| Earphone type | 62.3 | 96.5 |
| Mounting type | 64.1 | 96.5 |
| FOA format | 5.4 | 10.7 |
| Smaller version | 1.1 | 1.7 |
Each sub-dataset has wav_dev and metadata_dev for the training set, and wav_eval and metadata_eval for the evaluation set. The wav_* directories contain wav-format recordings, and the metadata_* directories contain csv-format reference labels. wav_dev consists of 400 wav files and metadata_dev of 400 csv files; wav_eval consists of 100 wav files and metadata_eval of 100 csv files. The directory structure of each sub-dataset is as follows.
```
Earphone_type_dataset/
|--- Earphone_type_dataset_anechoic/
     |--- wav_dev/
          |--- split0_ov1_1.wav
          |--- split1_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_ov1_1.wav
          |--- split0_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_ov1_1.csv
          |--- split1_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_ov1_1.csv
          |--- split0_ov2_1.csv
          |--- ...
|--- Earphone_type_dataset_reverberation/
     |--- wav_dev/
          |--- split0_roomB_ov1_1.wav
          |--- split1_roomF_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_roomB_ov1_1.wav
          |--- split0_roomF_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_roomB_ov1_1.csv
          |--- split1_roomF_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_roomB_ov1_1.csv
          |--- split0_roomF_ov2_1.csv
          |--- ...
|--- Earphone_type_dataset_reverberation_noise/
     |--- wav_dev/
          |--- split0_roomB_ov1_1.wav
          |--- split1_roomF_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_roomB_ov1_1.wav
          |--- split0_roomF_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_roomB_ov1_1.csv
          |--- split1_roomF_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_roomB_ov1_1.csv
          |--- split0_roomF_ov2_1.csv
          |--- ...

Mounting_type_dataset/
|--- Mounting_type_dataset_anechoic/
     |--- wav_dev/
          |--- split0_ov1_1.wav
          |--- split1_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_ov1_1.wav
          |--- split0_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_ov1_1.csv
          |--- split1_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_ov1_1.csv
          |--- split0_ov2_1.csv
          |--- ...
|--- Mounting_type_dataset_reverberation/
     |--- wav_dev/
          |--- split0_roomB_ov1_1.wav
          |--- split1_roomF_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_roomB_ov1_1.wav
          |--- split0_roomF_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_roomB_ov1_1.csv
          |--- split1_roomF_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_roomB_ov1_1.csv
          |--- split0_roomF_ov2_1.csv
          |--- ...
|--- Mounting_type_dataset_reverberation_noise/
     |--- wav_dev/
          |--- split0_roomB_ov1_1.wav
          |--- split1_roomF_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_roomB_ov1_1.wav
          |--- split0_roomF_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_roomB_ov1_1.csv
          |--- split1_roomF_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_roomB_ov1_1.csv
          |--- split0_roomF_ov2_1.csv
          |--- ...

FOA_format_dataset/
|--- FOA_format_dataset_anechoic/
     |--- wav_dev/
          |--- split0_ov1_1.wav
          |--- split1_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_ov1_1.wav
          |--- split0_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_ov1_1.csv
          |--- split1_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_ov1_1.csv
          |--- split0_ov2_1.csv
          |--- ...
```
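As an illustration of this layout, the following is a minimal sketch that pairs each recording in wav_dev with the label file of the same name in metadata_dev. The dataset root path is an assumption; adjust it to wherever the files were unzipped.

```python
import os

# hypothetical path; adjust to where the dataset was unzipped
dataset_dir = './Earphone_type_dataset/Earphone_type_dataset_anechoic'
wav_dir = os.path.join(dataset_dir, 'wav_dev')
meta_dir = os.path.join(dataset_dir, 'metadata_dev')

# a recording and its reference label file share the same base name
pairs = [
    (os.path.join(wav_dir, name),
     os.path.join(meta_dir, name.replace('.wav', '.csv')))
    for name in sorted(os.listdir(wav_dir))
    if name.endswith('.wav')
]
print(len(pairs))  # 400 for the development set
```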
The name of each file describes its recording condition. A recording and its corresponding label file have the same name: for example, if the recording is split0_ov1_1.wav, the corresponding label file is split0_ov1_1.csv. The naming convention differs between the Anechoic version and the Reverberation/Reverberation + noise versions.
Anechoic version:

`split[split number]_ov[number of overlapping sound events]_[recording number]`

- `split number` can be used for cross-validation. Each split contains 100 recordings; since the development set contains 400 recordings and the evaluation set contains 100, `split number` is 0 to 3.
- `number of overlapping sound events` refers to the maximum number of temporally overlapping sound events in each recording. Since this maximum is 2 in the dataset, it is 1 or 2.
- `recording number` is a serial number within each condition.
Reverberation / Reverberation + noise versions:

`split[split number]_room[room name]_ov[number of overlapping sound events]_[recording number]`

- `room name` refers to the reverberation condition. At roomB, the reverberation time (T60 at 500 Hz) is 0.41 sec; at roomF, it is 0.12 sec.
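For convenience, here is a minimal sketch that parses these file names with a regular expression. It assumes the room name is a single letter (as in roomB/roomF) and that it appears only in the Reverberation/Reverberation + noise versions, as described above.

```python
import re

# hypothetical parser for the naming conventions above
name_pattern = re.compile(
    r'split(?P<split>\d+)'
    r'(?:_room(?P<room>[A-Z]))?'   # present only in the reverberant versions
    r'_ov(?P<ov>\d+)'
    r'_(?P<rec>\d+)'
)

print(name_pattern.match('split0_ov1_1').groupdict())
# {'split': '0', 'room': None, 'ov': '1', 'rec': '1'}
print(name_pattern.match('split2_roomB_ov2_10').groupdict())
# {'split': '2', 'room': 'B', 'ov': '2', 'rec': '10'}
```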
Recordings in this dataset are 60 sec long at a sampling rate of 48000 Hz. The recordings were synthesized by convolving collected impulse responses (IRs) with collected sound events, and the spatialized sound events were placed at randomly generated times as in the image below. The maximum number of overlapping sound events is 2. Impulse responses were recorded at 36 azimuth angles and 3 elevation angles for each sub-dataset: the azimuth angle was discretized in 10-degree steps starting from 0 degrees, and the elevation angle was set to -20, 0, and 20 degrees. For the sound events, 20 samples of each of the 12 sound event classes were recorded in an anechoic room: organ, piano, toy train, toy gun shot, metallophone, bicycle bell, security buzzer, shaker, handclap, woodblock, shaking bell, and hit drum. Images of the recording environments and devices are shown in details.pdf.
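The snippet below is a rough, self-contained sketch of this kind of convolution-based synthesis, using random arrays as stand-ins for a recorded sound event and a measured impulse response; it is not the authors' actual pipeline, only an illustration of the idea.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000                 # sampling rate used in the dataset
length = 60 * fs           # 60-second recording

rng = np.random.default_rng(0)
mixture = np.zeros(length)

# stand-ins for a dry sound event and a measured impulse response
# (the real dataset uses multichannel IRs, one signal per microphone)
event = rng.standard_normal(fs)
ir = rng.standard_normal(int(0.3 * fs))

# spatialize the event by convolution, then place it at a random onset
spatialized = fftconvolve(event, ir)
onset = rng.integers(0, length - len(spatialized))
mixture[onset:onset + len(spatialized)] += spatialized
```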
An example of the reference labels is shown in the table below. `start_time`/`end_time` refer to the onset/offset of the sound event, `ele`/`azi` stand for elevation and azimuth, and `ov` refers to whether the sound event is overlapping or not.
| sound_event | start_time | end_time | ele | azi | ov |
| --- | --- | --- | --- | --- | --- |
| piano | 1.6524 | 2.6240 | 0 | 350 | 1 |
| shaker | 1.9943 | 3.5475 | -20 | 180 | 2 |
| handclap | 3.6039 | 4.0972 | 20 | 0 | 1 |
| : | : | : | : | : | : |
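To give an idea of how these labels can be loaded, here is a minimal sketch that reads one label file with the standard csv module. The column order is assumed to match the table above, and whether the files contain a header row is not verified here.

```python
import csv

events = []
with open('./split0_ov1_1.csv') as f:
    for row in csv.reader(f):
        sound_event, start_time, end_time, ele, azi, ov = row
        events.append({
            'sound_event': sound_event,
            'start_time': float(start_time),
            'end_time': float(end_time),
            'ele': int(ele),
            'azi': int(azi),
            'ov': int(ov),
        })
```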
Recordings in the Earphone/Mounting type datasets contain 12-channel signals, and you can select an arbitrary subset of the 12 channels to use. For example, with PyTorch, if you want to use channels 1, 5, 6, and 7, you can extract them as follows.
```python
import torchaudio

# channels 1, 5, 6, 7 correspond to 0-based indices 0, 4, 5, 6
ch_list = [0, 4, 5, 6]
tmp_wav, fs = torchaudio.load('./split0_roomF_ov2_1.wav')
# tmp_wav.shape => (12, 2880000)
wav = tmp_wav[ch_list, :]
# wav.shape => (4, 2880000)
```
All files and folders in the Earphone type dataset, Mounting type dataset, FOA format dataset, Small wearable SELD dataset, and this repository are under this license.
- Kento Nagatomo (Email: [email protected])
- Masahiro Yasuda (Email: [email protected])
- Kohei Yatabe
- Shoichiro Saito
- Yasuhiro Oikawa
If you'd like to cite this work, you may use the following.
Kento Nagatomo, Masahiro Yasuda, Kohei Yatabe, Shoichiro Saito, Yasuhiro Oikawa, “Wearable SELD dataset: Dataset for sound event localization and detection using wearable devices around head,” in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2022.
Paper: arXiv