MM-Office is a multi-view and multi-modal dataset recorded in an office environment. It captures events that occur during everyday office work, e.g., 'enter' the room, 'sit down' on a chair, and 'take out' something from a shelf. These events are recorded simultaneously using eight omnidirectional microphones and four cameras. The audio and video recordings are divided into clips of about 30 to 90 seconds each, giving 880 clips per sensor. The labels available for training are weak multi-labels that indicate which events each clip contains; only the test data is annotated with strong labels containing the onset/offset time of each event.
You can download the dataset here.
The dataset has the following folder structure:
```
MM_Office_Dataset
├── audio
│   ├── test
│   └── train
├── video
│   ├── test
│   └── train
└── label
    ├── testlabel
    └── trainlabel
        ├── classinfo.csv
        ├── eventinfo.csv
        └── recinfo.csv
```
Audio and video were recorded synchronously using four cameras (GoPro HERO8) and eight omnidirectional microphones (HOSIDEN KUB4225) installed in the office, as shown in the room setup figure below. The audio was recorded at 48 kHz/16 bit. The video was recorded at 1920×1440/30 fps and then resized to 480
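As a quick sanity check, individual clips can be loaded with `torchaudio` and `torchvision` (the versions listed in the environment section below). This is a minimal sketch with placeholder file names that follow the naming convention described next; substitute paths from your copy of the dataset.

```python
import torchaudio
import torchvision.io

# Placeholder file names (hypothetical); replace with actual paths from your copy.
audio_path = "MM_Office_Dataset/audio/train/split0_id0_s1_recid000_0.wav"
video_path = "MM_Office_Dataset/video/train/split0_id0_s1_recid000_0.mp4"

# Audio: waveform has shape (channels, samples); sample_rate should be 48000.
waveform, sample_rate = torchaudio.load(audio_path)
print(waveform.shape, sample_rate)

# Video: frames has shape (T, H, W, C); info reports the frame rate (about 30 fps).
frames, _, info = torchvision.io.read_video(video_path, pts_unit="sec")
print(frames.shape, info["video_fps"])
```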
The naming convention for these recordings is as follows.
```
split[split index]_id[sensor index]_s[scene index]_recid[recording id]_[division].[wav or mp4]
```
The MM-Office dataset is divided into 10 splits for convenience, and the split index (0 to 9) identifies the split. The sensor index is the number of the camera or microphone and corresponds to the room setup figure above (but starts at 0). The scene index identifies the scenario pattern of actions performed by the actors; refer to eventinfo.csv to see which actions and events each scene contains. The recording id is the serial number of the recording. After recording, each recording was divided in half so that each half forms a single clip, so every recording id appears in two file names. The division distinguishes the two halves: 0 for the first half and 1 for the second half.
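For reference, a file name like this can be split back into its indices with a small regular expression. This is a sketch; the helper name is ours, not part of the dataset tooling.

```python
import re

# Pattern mirroring the naming convention described above.
FILENAME_PATTERN = re.compile(
    r"split(?P<split>\d+)_id(?P<sensor>\d+)_s(?P<scene>\d+)"
    r"_recid(?P<recid>\d+)_(?P<division>[01])\.(?P<ext>wav|mp4)$"
)

def parse_clip_name(filename):
    """Return the indices encoded in a clip file name (hypothetical helper)."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    fields = {key: int(value) for key, value in match.groupdict().items() if key != "ext"}
    fields["ext"] = match.group("ext")
    return fields

# Example: parse_clip_name("split0_id3_s5_recid12_1.mp4")
# -> {'split': 0, 'sensor': 3, 'scene': 5, 'recid': 12, 'division': 1, 'ext': 'mp4'}
```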
index | eventclass | starttime | endtime |
---|---|---|---|
0 | 8 | 6 | 14 |
1 | 11 | 20 | 35 |
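Rows of this form (an event class with its onset and offset) can be expanded into a frame-wise activity matrix, e.g., for evaluating strong predictions. A minimal sketch, assuming the times are in seconds and event classes are numbered 1 to 12; adjust if the actual annotation uses different conventions.

```python
import numpy as np
import pandas as pd

def strong_labels_to_frames(csv_path, clip_length_sec, num_classes=12, frames_per_sec=1):
    """Expand (eventclass, starttime, endtime) rows into a frame-wise multi-hot matrix.

    Assumptions (verify against the actual annotation): times are in seconds and
    event classes are 1-indexed.
    """
    df = pd.read_csv(csv_path)
    num_frames = int(clip_length_sec * frames_per_sec)
    target = np.zeros((num_frames, num_classes), dtype=np.float32)
    for _, row in df.iterrows():
        cls = int(row["eventclass"]) - 1                  # 1-indexed class -> column index
        onset = int(row["starttime"] * frames_per_sec)
        offset = int(row["endtime"] * frames_per_sec)
        target[onset:offset, cls] = 1.0
    return target
```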
recid | sceneid | patternid |
---|---|---|
0 | 1 | 1 |
... | ... | ... |
679 | 11 | 1 |
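Since a table in this format maps every recording id to its scene and pattern, clip metadata can be looked up with a simple pandas filter. The path below is illustrative; point it at recinfo.csv in your copy of the dataset.

```python
import pandas as pd

# Illustrative path; adjust to where recinfo.csv lives in your copy.
recinfo = pd.read_csv("MM_Office_Dataset/label/trainlabel/recinfo.csv")

# Scene and pattern of recording id 679 (the last row in the example above).
row = recinfo.loc[recinfo["recid"] == 679].iloc[0]
print(int(row["sceneid"]), int(row["patternid"]))
```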
sceneid | patternid | division | class1 | class2 | class3 | class4 | ... | class12 |
---|---|---|---|---|---|---|---|---|
5 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
3 | 6 | 1 | 0 | 1 | 0 | 0 | ... | 1 |
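Clip-level multi-labels in this format can be collected into a multi-hot target array for weakly supervised training. A minimal sketch, assuming the per-class columns are named class1 through class12 as in the table above and that empty cells mean the event is absent:

```python
import numpy as np
import pandas as pd

NUM_CLASSES = 12

def load_weak_labels(csv_path):
    """Read a multi-label table like the one above into an (N, 12) multi-hot array."""
    df = pd.read_csv(csv_path)
    class_cols = [f"class{i}" for i in range(1, NUM_CLASSES + 1)]
    # Empty cells are read as NaN; treat them as "event not present" (assumption).
    return df[class_cols].fillna(0).to_numpy(dtype=np.float32)
```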
classinfo.csv contains the event name (e.g., 'stand up', 'phone') of each event class that appears in eventinfo.csv, along with a description of what kind of event it is.
The program has been verified to work as expected in the following execution environment:
- Ubuntu 22.04.4 LTS
- NVIDIA GPU V100 32GB (x4)
- NVIDIA Driver Version == 470.239.06
- CUDA Version == 10.1
- pytorch == 1.7.1
- torchaudio == 0.7.2
- torchvision == 0.8.2
- numpy == 1.18.1
- pandas == 1.0.0
- glob2 == 0.7
- tqdm == 4.42.0
- Prepare the above environment
- Download the MM-Office Dataset and put it at `mm-office/`
- Run `. training.sh`
See this license file.
- Masahiro Yasuda (Email: [email protected])
- Yasunori Ohishi
- Shoichiro Saito
- Noboru Harada
If you'd like to cite this work, you may use the following.
Masahiro Yasuda, Yasunori Ohishi, Shoichiro Saito, and Noboru Harada, “Multi-view and Multi-modal Event Detection Utilizing Transformer-based Multi-sensor Fusion,” in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2022.
Paper: arXiv