This is the code for the paper "Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos". We appreciate the contribution of 2D-TAN.
- python 3
- pytorch 1.6.0
- torchvision 0.7.0
- torchtext 0.7.0
- easydict
- terminaltables
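
A minimal way to set up these dependencies, assuming a working Python 3 environment with pip (package names and pinned versions are taken from the list above; easydict and terminaltables are unversioned, so any recent release should work):

```bash
# Install the pinned PyTorch stack plus the two utility packages.
# Note: the PyPI package for pytorch is named "torch".
pip install torch==1.6.0 torchvision==0.7.0 torchtext==0.7.0 easydict terminaltables
```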
Please download the visual features from the Box drive and save them to the data/ folder.
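
A quick sanity check that the features landed in the right place (the exact file names depend on what the Box drive provides, so this only lists the folder contents):

```bash
# Feature files for both datasets should appear here after the download.
ls -lh data/
```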
Use the following commands for training:

```bash
# For ActivityNet Captions
python moment_localization/train.py --cfg experiments/activitynet/MSAT-32.yaml --verbose

# For TACoS
python moment_localization/train.py --cfg experiments/tacos/MSAT-128.yaml --verbose
```
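
train.py takes its configuration via --cfg; if you need to pin training to a particular GPU, the standard CUDA_VISIBLE_DEVICES environment variable works, since it is handled by CUDA itself rather than by the script. A sketch, assuming GPU 0 is the one you want:

```bash
# Restrict the process to GPU 0 via CUDA's standard environment variable;
# this is not a train.py flag, just ordinary CUDA behavior.
CUDA_VISIBLE_DEVICES=0 python moment_localization/train.py \
    --cfg experiments/activitynet/MSAT-32.yaml --verbose
```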
Our trained models are provided on Baidu Yun (access code: rc2m). Please download them to the checkpoints folder.
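
A quick way to verify that a downloaded checkpoint is readable before running evaluation (checkpoints/MSAT-32.pth is a hypothetical file name; substitute the actual name from the Baidu Yun download):

```bash
# Load the checkpoint on CPU just to confirm the file is intact;
# "checkpoints/MSAT-32.pth" is a placeholder for the real file name.
python -c "import torch; ckpt = torch.load('checkpoints/MSAT-32.pth', map_location='cpu'); print(type(ckpt))"
```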
Then, run the following commands for evaluation:

```bash
# For ActivityNet Captions
python moment_localization/test.py --cfg experiments/activitynet/MSAT-32.yaml --verbose --split test

# For TACoS
python moment_localization/test.py --cfg experiments/tacos/MSAT-128.yaml --verbose --split test
```
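
To reproduce both sets of numbers in one go, the two commands can also be run back to back (same configs and split as above):

```bash
# Evaluate both models sequentially with the configs used above.
for cfg in experiments/activitynet/MSAT-32.yaml experiments/tacos/MSAT-128.yaml; do
    python moment_localization/test.py --cfg "$cfg" --verbose --split test
done
```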
If any part of our paper or code is helpful to your work, please cite:
```bibtex
@inproceedings{zhang2021multi,
  title={Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos},
  author={Zhang, Mingxing and Yang, Yang and Chen, Xinghan and Ji, Yanli and Xu, Xing and Li, Jingjing and Shen, Heng Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={12669--12678},
  year={2021}
}
```