We provide the multilingual captions for the HowTo100M dataset in the following languages:
Language | Code | Link |
---|---|---|
English | en | link |
German | de | link |
French | fr | link |
Czech | cs | link |
Swahili | sw | link |
Russian | ru | link |
Vietnamese | vi | link |
Spanish | es | link |
Chinese | zh | link |
The how2_[lang].json file contains the captions for the HowTo100M videos. It can be read into a Python dictionary with video_id as the key. Each value of the dictionary is another dictionary with the keys ['text', 'start', 'end']. The value of 'text' is a list of all the captions from the given video_id, and 'start' and 'end' are arrays of the corresponding start and end timestamps of the captions (in seconds).
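For reference, here is a minimal sketch of loading and inspecting one of the caption files. The filename how2_en.json follows the how2_[lang].json pattern described above, and the snippet assumes the timestamps are numeric seconds; adjust the path and language code for your setup.

```python
import json

# Load the caption file: a dict mapping video_id -> {'text', 'start', 'end'}.
with open("how2_en.json", "r", encoding="utf-8") as f:
    captions = json.load(f)

# Pick an arbitrary video and print its captions with timestamps.
video_id = next(iter(captions))
entry = captions[video_id]
for text, start, end in zip(entry["text"], entry["start"], entry["end"]):
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
```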
Please refer here for the list of HowTo100M videos and their metadata.
The translated VTT files in 9 languages for evaluation are available here.
If you use this dataset, please cite:

@inproceedings{huang2021multilingual,
  title = {Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models},
  author = {Huang, Po-Yao and Patrick, Mandela and Hu, Junjie and Neubig, Graham and Metze, Florian and Hauptmann, Alexander G},
  booktitle = {Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages = {2443--2459},
  year = {2021},
  url = {https://arxiv.org/abs/2103.08849},
}
Please feel free to contact Bernie Huang ([email protected] or [email protected]) if you have any questions. Thanks for your interest!