The Wearable SELD dataset is a dataset for developing sound event localization and detection (SELD) systems with wearable devices. It contains recordings collected using wearable devices such as an earphone, a neck speaker, a headphone, and glasses. The Wearable SELD dataset consists of three types of datasets, listed below.
- Earphone type dataset: It contains recordings collected by 12 microphones placed around the ears, mimicking an earphone, as in the image below.
- Mounting type dataset: It contains recordings collected by 12 microphones placed around the head with accessories mimicking glasses, a headphone, and a neck speaker, as in the image below.
- FOA format dataset: It contains 4-channel recordings collected by an ambisonic microphone, which allows comparison between conventional methods using the FOA format and methods using the above datasets.
Each dataset type has three versions of the sub-dataset: Anechoic, Reverberation, and Reverberation + noise. The FOA format dataset has only the Anechoic version. In total, the Wearable SELD dataset contains 7 sub-datasets. The correspondence is shown in the table below.
|  | Earphone type | Mounting type | FOA format |
| --- | --- | --- | --- |
| Anechoic | ○ | ○ | ○ |
| Reverberation | ○ | ○ | × |
| Reverberation + noise | ○ | ○ | × |
You can download the dataset here. The Earphone/Mounting type datasets are split into 8-9 zip files by 7-zip; the FOA format dataset is not split. Download the split zip files corresponding to the dataset of interest and unzip them with your favorite compression tool. As a sample, we also provide a smaller version of the dataset, named the Small wearable SELD dataset. The size of each dataset is shown in the table below.
|  | Zipped size (GB) | Unzipped size (GB) |
| --- | --- | --- |
| Earphone type | 62.3 | 96.5 |
| Mounting type | 64.1 | 96.5 |
| FOA format | 5.4 | 10.7 |
| Smaller version | 1.1 | 1.7 |
Each sub-dataset has wav_dev and metadata_dev for the training set, and wav_eval and metadata_eval for the evaluation set. The wav_* directories contain wav-format recordings, and the metadata_* directories contain csv-format reference labels. wav_dev consists of 400 wav files and metadata_dev of 400 csv files; wav_eval consists of 100 wav files and metadata_eval of 100 csv files. The directory structure of each sub-dataset is as follows.
```
Earphone_type_dataset/
|--- Earphone_type_dataset_anechoic/
     |--- wav_dev/
          |--- split0_ov1_1.wav
          |--- split1_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_ov1_1.wav
          |--- split0_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_ov1_1.csv
          |--- split1_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_ov1_1.csv
          |--- split0_ov2_1.csv
          |--- ...
|--- Earphone_type_dataset_reverberation/
     |--- wav_dev/
          |--- split0_roomB_ov1_1.wav
          |--- split1_roomF_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_roomB_ov1_1.wav
          |--- split0_roomF_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_roomB_ov1_1.csv
          |--- split1_roomF_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_roomB_ov1_1.csv
          |--- split0_roomF_ov2_1.csv
          |--- ...
|--- Earphone_type_dataset_reverberation_noise/
     |--- wav_dev/
          |--- split0_roomB_ov1_1.wav
          |--- split1_roomF_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_roomB_ov1_1.wav
          |--- split0_roomF_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_roomB_ov1_1.csv
          |--- split1_roomF_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_roomB_ov1_1.csv
          |--- split0_roomF_ov2_1.csv
          |--- ...

Mounting_type_dataset/
|--- Mounting_type_dataset_anechoic/
     |--- wav_dev/
          |--- split0_ov1_1.wav
          |--- split1_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_ov1_1.wav
          |--- split0_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_ov1_1.csv
          |--- split1_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_ov1_1.csv
          |--- split0_ov2_1.csv
          |--- ...
|--- Mounting_type_dataset_reverberation/
     |--- wav_dev/
          |--- split0_roomB_ov1_1.wav
          |--- split1_roomF_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_roomB_ov1_1.wav
          |--- split0_roomF_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_roomB_ov1_1.csv
          |--- split1_roomF_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_roomB_ov1_1.csv
          |--- split0_roomF_ov2_1.csv
          |--- ...
|--- Mounting_type_dataset_reverberation_noise/
     |--- wav_dev/
          |--- split0_roomB_ov1_1.wav
          |--- split1_roomF_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_roomB_ov1_1.wav
          |--- split0_roomF_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_roomB_ov1_1.csv
          |--- split1_roomF_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_roomB_ov1_1.csv
          |--- split0_roomF_ov2_1.csv
          |--- ...

FOA_format_dataset/
|--- FOA_format_dataset_anechoic/
     |--- wav_dev/
          |--- split0_ov1_1.wav
          |--- split1_ov2_1.wav
          |--- ...
     |--- wav_eval/
          |--- split0_ov1_1.wav
          |--- split0_ov2_1.wav
          |--- ...
     |--- metadata_dev/
          |--- split0_ov1_1.csv
          |--- split1_ov2_1.csv
          |--- ...
     |--- metadata_eval/
          |--- split0_ov1_1.csv
          |--- split0_ov2_1.csv
          |--- ...
```
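As an illustration of this layout, the following is a minimal sketch that pairs each recording in wav_dev with the label file of the same name in metadata_dev. The dataset root path is an assumption; adjust it to wherever the files were unzipped.

```python
import os

# hypothetical path; adjust to where the dataset was unzipped
dataset_dir = './Earphone_type_dataset/Earphone_type_dataset_anechoic'
wav_dir = os.path.join(dataset_dir, 'wav_dev')
meta_dir = os.path.join(dataset_dir, 'metadata_dev')

# a recording and its reference label file share the same base name
pairs = [
    (os.path.join(wav_dir, name),
     os.path.join(meta_dir, name.replace('.wav', '.csv')))
    for name in sorted(os.listdir(wav_dir))
    if name.endswith('.wav')
]
print(len(pairs))  # 400 for the development set
```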
The name of each file describes its recording condition. A recording and its corresponding label file have the same name: for example, if the recording is split0_ov1_1.wav, the corresponding label file is split0_ov1_1.csv. The naming convention differs between the Anechoic version and the Reverberation/Reverberation + noise versions.
Anechoic version:

`split[split number]_ov[number of overlapping sound events]_[recording number]`

- `split number` can be used for cross-validation. Each split contains 100 recordings; since the development set contains 400 recordings and the evaluation set contains 100, `split number` is 0 to 3.
- `number of overlapping sound events` refers to the maximum number of temporally overlapping sound events in each recording. Since this maximum is 2 in the dataset, it is 1 or 2.
- `recording number` is a serial number within each condition.
Reverberation / Reverberation + noise versions:

`split[split number]_room[room name]_ov[number of overlapping sound events]_[recording number]`

- `room name` refers to the reverberation condition. At roomB, the reverberation time (T60 at 500 Hz) is 0.41 sec; at roomF, it is 0.12 sec.
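For convenience, here is a minimal sketch that parses these file names with a regular expression. It assumes the room name is a single letter (as in roomB/roomF) and that it appears only in the Reverberation/Reverberation + noise versions, as described above.

```python
import re

# hypothetical parser for the naming conventions above
name_pattern = re.compile(
    r'split(?P<split>\d+)'
    r'(?:_room(?P<room>[A-Z]))?'   # present only in the reverberant versions
    r'_ov(?P<ov>\d+)'
    r'_(?P<rec>\d+)'
)

print(name_pattern.match('split0_ov1_1').groupdict())
# {'split': '0', 'room': None, 'ov': '1', 'rec': '1'}
print(name_pattern.match('split2_roomB_ov2_10').groupdict())
# {'split': '2', 'room': 'B', 'ov': '2', 'rec': '10'}
```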
Recordings in this dataset are 60 sec long at a sampling rate of 48000 Hz. The recordings were synthesized by convolving collected impulse responses (IRs) with collected sound events, and the spatialized sound events were placed at randomly generated times as in the image below. The maximum number of overlapping sound events is 2. Impulse responses were recorded at 36 azimuth angles and 3 elevation angles for each sub-dataset: the azimuth angle was discretized in 10-degree steps starting from 0 degrees, and the elevation angle was set to -20, 0, and 20 degrees. For the sound events, 20 samples of each of the 12 sound event classes were recorded in an anechoic room: organ, piano, toy train, toy gun shot, metallophone, bicycle bell, security buzzer, shaker, handclap, woodblock, shaking bell, and hit drum. Images of the recording environments and devices are shown in details.pdf.
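The snippet below is a rough, self-contained sketch of this kind of convolution-based synthesis, using random arrays as stand-ins for a recorded sound event and a measured impulse response; it is not the authors' actual pipeline, only an illustration of the idea.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000                 # sampling rate used in the dataset
length = 60 * fs           # 60-second recording

rng = np.random.default_rng(0)
mixture = np.zeros(length)

# stand-ins for a dry sound event and a measured impulse response
# (the real dataset uses multichannel IRs, one signal per microphone)
event = rng.standard_normal(fs)
ir = rng.standard_normal(int(0.3 * fs))

# spatialize the event by convolution, then place it at a random onset
spatialized = fftconvolve(event, ir)
onset = rng.integers(0, length - len(spatialized))
mixture[onset:onset + len(spatialized)] += spatialized
```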
An example of the reference labels is shown in the table below. `start_time`/`end_time` refer to the onset/offset of the sound event, `ele`/`azi` stand for elevation and azimuth, and `ov` refers to whether the sound event is overlapping or not.
| sound_event | start_time | end_time | ele | azi | ov |
| --- | --- | --- | --- | --- | --- |
| piano | 1.6524 | 2.6240 | 0 | 350 | 1 |
| shaker | 1.9943 | 3.5475 | -20 | 180 | 2 |
| handclap | 3.6039 | 4.0972 | 20 | 0 | 1 |
| : | : | : | : | : | : |
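To give an idea of how these labels can be loaded, here is a minimal sketch that reads one label file with the standard csv module. The column order is assumed to match the table above, and whether the files contain a header row is not verified here.

```python
import csv

events = []
with open('./split0_ov1_1.csv') as f:
    for row in csv.reader(f):
        sound_event, start_time, end_time, ele, azi, ov = row
        events.append({
            'sound_event': sound_event,
            'start_time': float(start_time),
            'end_time': float(end_time),
            'ele': int(ele),
            'azi': int(azi),
            'ov': int(ov),
        })
```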
Recordings in the Earphone/Mounting type datasets contain 12-channel signals, and you can select an arbitrary subset of the 12 channels to use. For example, with PyTorch, if you want to use channels 1, 5, 6, and 7, you can extract them as follows.
```python
import torchaudio

# channels 1, 5, 6, 7 correspond to 0-based indices 0, 4, 5, 6
ch_list = [0, 4, 5, 6]
tmp_wav, fs = torchaudio.load('./split0_roomF_ov2_1.wav')
# tmp_wav.shape => (12, 2880000)
wav = tmp_wav[ch_list, :]
# wav.shape => (4, 2880000)
```
All files and folders in the Earphone type dataset, Mounting type dataset, FOA format dataset, Small wearable SELD dataset, and this repository are under this license.
- Kento Nagatomo (Email: [email protected])
- Masahiro Yasuda (Email: [email protected])
- Kohei Yatabe
- Shoichiro Saito
- Yasuhiro Oikawa
If you'd like to cite this work, you may use the following.
Kento Nagatomo, Masahiro Yasuda, Kohei Yatabe, Shoichiro Saito, Yasuhiro Oikawa, “Wearable SELD dataset: Dataset for sound event localization and detection using wearable devices around head,” in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2022.
Paper: arXiv