By Dongze Hao, Qunbo Wang and Jing Liu
This is the official implementation of the paper. In this paper, we propose a Semantic-Visual Graph Reasoning framework (SVG) for VisDial. Specifically, we first construct a semantic graph to capture the semantic relationships between different entities in the current question and the dialog history. Secondly, we construct a semantics-aware visual graph to capture high-level visual semantics, including key objects of the image and their visual relationships. Extensive experimental results on VisDial v0.9 and v1.0 show that our method achieves superior performance compared to state-of-the-art models across most evaluation metrics.
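The model code lives in the repo itself; as a rough orientation, graph reasoning of the kind described above is built on attention-based message passing between node features. The sketch below is a generic, minimal illustration only (the class name, shapes, and residual update are our assumptions, not the paper's actual layers):

```python
# Generic sketch of one attention-based message-passing step over a graph
# of entity/object features. Illustrative only -- NOT the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionStep(nn.Module):
    """Each node attends to its graph neighbors and aggregates their values."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (batch, num_nodes, dim); adj: (batch, num_nodes, num_nodes)
        scores = torch.matmul(self.query(nodes), self.key(nodes).transpose(-1, -2))
        scores = scores / nodes.size(-1) ** 0.5
        # Keep attention on actual edges; adj must include self-loops so
        # every row has at least one finite score before the softmax.
        scores = scores.masked_fill(adj == 0, float('-inf'))
        return nodes + torch.matmul(F.softmax(scores, dim=-1), self.value(nodes))

# 36 object proposals per image, matching the features described below.
layer = GraphAttentionStep(512)
nodes = torch.randn(2, 36, 512)
adj = torch.eye(36).expand(2, -1, -1)  # minimal graph: self-loops only
print(layer(nodes, adj).shape)  # torch.Size([2, 36, 512])
```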
conda create -n svg python=3.8
conda activate svg
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
pip install tqdm pyyaml nltk setproctitle
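After installing, a quick sanity check that the pinned versions and the GPU are visible (plain PyTorch calls, nothing project-specific):

```python
import torch, torchvision

print(torch.__version__)          # expected: 1.7.0
print(torchvision.__version__)    # expected: 0.8.0
print(torch.cuda.is_available())  # True if the cudatoolkit 10.2 build sees a GPU
```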
- Download the data
  - Download the VisDial v0.9 and v1.0 dialog json files from here and keep them under the `$PROJECT_ROOT/data/v0.9` and `$PROJECT_ROOT/data/v1.0` directories, respectively.
  - batra-mlp-lab provides the word counts for the VisDial v1.0 train split, `visdial_1.0_word_counts_train.json`. They are used to build the vocabulary. Keep the file under the `$PROJECT_ROOT/data/v1.0` directory.
  - batra-mlp-lab provides Faster-RCNN image features pre-trained on Visual Genome. Keep them under the `$PROJECT_ROOT/data/visdial_1.0_img` directory and set the argument `img_feature_type` to `faster_rcnn_x101` in the `config/hparams.py` file.
    - `features_faster_rcnn_x101_train.h5`: Bottom-up features of 36 proposals from images of the `train` split.
    - `features_faster_rcnn_x101_val.h5`: Bottom-up features of 36 proposals from images of the `val` split.
    - `features_faster_rcnn_x101_test.h5`: Bottom-up features of 36 proposals from images of the `test` split.
  - gicheonkang provides pre-extracted Faster-RCNN image features that include bounding box information. Set the argument `img_feature_type` to `dan_faster_rcnn_x101` in the `config/hparams.py` file. (A quick way to inspect these files is sketched after this list.)
    - `train_btmup_f.hdf5`: Bottom-up features of 10 to 100 proposals from images of the `train` split (32 GB).
    - `train_imgid2idx.pkl`: `image_id` to bbox index file for the `train` split.
    - `val_btmup_f.hdf5`: Bottom-up features of 10 to 100 proposals from images of the `val` split (0.5 GB).
    - `val_imgid2idx.pkl`: `image_id` to bbox index file for the `val` split.
    - `test_btmup_f.hdf5`: Bottom-up features of 10 to 100 proposals from images of the `test` split (2 GB).
    - `test_imgid2idx.pkl`: `image_id` to bbox index file for the `test` split.
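All of the feature files above are ordinary HDF5 and pickle files, so they can be inspected before training. A minimal sketch (requires `h5py`; the pickle layout and the dataset names inside the HDF5 files are assumptions, so list the keys first and adapt):

```python
# Inspect the pre-extracted image features. The HDF5 dataset names and
# the pickle layout are assumptions -- print them and verify.
import pickle
import h5py  # pip install h5py

with open('data/visdial_1.0_img/train_imgid2idx.pkl', 'rb') as f:
    imgid2idx = pickle.load(f)  # assumed: image_id -> bbox/feature index

with h5py.File('data/visdial_1.0_img/train_btmup_f.hdf5', 'r') as h5:
    for name, dset in h5.items():
        print(name, getattr(dset, 'shape', None))  # datasets and their shapes
```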
- Preprocess the data
  - Download the GloVe pretrained word vectors from here, and keep `glove.6B.300d.txt` under the `$PROJECT_ROOT/data/word_embeddings/glove` directory. Then run `python data/preprocess/init_glove.py` (a sketch of what this step typically does follows this list).
  - Preprocess textual inputs:
    python data/data_utils.py
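For orientation, the GloVe initialization step above typically builds a vocabulary-aligned embedding matrix like the sketch below. The frequency threshold, the JSON layout, and the output filename are assumptions; the repo's `data/preprocess/init_glove.py` is authoritative:

```python
# Hypothetical sketch of GloVe embedding-matrix initialization; see the
# repo's data/preprocess/init_glove.py for the real logic.
import json
import numpy as np

with open('data/v1.0/visdial_1.0_word_counts_train.json') as f:
    word_counts = json.load(f)  # assumed layout: {word: count}
# Assumed vocabulary rule: special tokens plus frequent training words.
vocab = ['<pad>', '<unk>'] + sorted(w for w, c in word_counts.items() if c >= 5)

glove = {}
with open('data/word_embeddings/glove/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        token, *vec = line.rstrip().split(' ')
        glove[token] = np.asarray(vec, dtype=np.float32)

matrix = np.zeros((len(vocab), 300), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in glove:
        matrix[i] = glove[word]  # words missing from GloVe stay zero
np.save('data/word_embeddings/glove/embeddings.npy', matrix)  # assumed name
```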
- Train the model
python main.py --model svg --version 1.0
- Evaluate the model
python main.py --model svg --evaluate /path/to/checkpoint.pth --eval_split val --version 1.0
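VisDial evaluation ranks 100 candidate answers per question, so most reported metrics are functions of the ground-truth answer's rank. A self-contained reference implementation of the rank-based metrics, independent of this codebase:

```python
# Rank-based VisDial metrics; `ranks` holds the 1-based rank of the
# ground-truth answer for each dialog round.
import numpy as np

def rank_metrics(ranks):
    ranks = np.asarray(ranks, dtype=np.float64)
    return {
        'R@1': float(np.mean(ranks <= 1)),
        'R@5': float(np.mean(ranks <= 5)),
        'R@10': float(np.mean(ranks <= 10)),
        'MRR': float(np.mean(1.0 / ranks)),
        'Mean': float(np.mean(ranks)),
    }

print(rank_metrics([1, 3, 12, 2]))
# {'R@1': 0.25, 'R@5': 0.75, 'R@10': 0.75, 'MRR': ~0.479, 'Mean': 4.5}
```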
This code is implemented as a fork of batra-mlp-lab/visdial-challenge-starter-pytorch and yuleiniu/rva.