This repository contains a PyTorch implementation of the work: Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

best-of-acrv/bottom-up-attention

This repository contains a PyTorch implementation of the paper Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering:

Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and visual question answering." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

This repository is designed to provide out-of-the-box evaluation and training of bottom-up attention models for both the visual question answering and captioning tasks, with as little overhead as possible. The code was adapted from the repositories here and here.

Setup

First, update Conda to the latest stable version:

$ conda update conda
$ conda update --all

Then, from the root directory of this repository, create the Conda environment needed to run the code:

$ conda config --set channel_priority strict
$ conda env create -f requirements.yml

This should set up the Conda environment with all prerequisites for running this code. Activate the environment using the following command:

$ conda activate pytorch-bua
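To confirm that the activation step took effect, a small check along the following lines may help. This is a minimal sketch, not part of the repository: the helper names `active_conda_env` and `is_expected_env` are hypothetical, and it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` exports.

```python
import os

# Environment name taken from the `conda activate` command above.
EXPECTED_ENV = "pytorch-bua"


def active_conda_env(environ=None):
    """Return the name of the active Conda environment, or None if none is active."""
    env = os.environ if environ is None else environ
    # `conda activate` exports CONDA_DEFAULT_ENV with the environment's name.
    return env.get("CONDA_DEFAULT_ENV") or None


def is_expected_env(environ=None, expected=EXPECTED_ENV):
    """Return True if the active Conda environment matches `expected`."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    name = active_conda_env()
    if is_expected_env():
        print(f"Conda environment '{name}' is active.")
    else:
        print(f"Expected '{EXPECTED_ENV}', found {name!r}. Run: conda activate {EXPECTED_ENV}")
```

Passing a plain dictionary instead of the real environment makes the helpers easy to exercise without changing the shell state.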

Install NLG-EVAL

For evaluating captioning models, install NLG-EVAL:

$ pip install git+https://github.com/Maluuba/nlg-eval.git@master
$ nlg-eval --setup

Data Setup

For the datasets required for this project, please refer to the Best-Of-ACRV repository. Use this repository to download and prepare the COCO and GloVe datasets required for this project. The data directories should appear in the following structure:

root_dir
|--- deploy.py
|--- eval.py
|--- train.py
acrv-datasets
|--- datasets
|------- coco
|------- glove
|------- trainval36

trainval36 contains the set of 36 features per image provided by the original authors (refer to this repository). After all data has been downloaded, process it into the correct format for the captioning and visual question answering tasks using process.sh, located in the root directory of this repository:

$ sh process.sh
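If processing fails, the first thing to check is usually the directory layout described above. The following is a minimal sketch of such a check, not a script shipped with this repository: `missing_dataset_dirs` is a hypothetical helper, and the path to the acrv-datasets directory is passed in explicitly because its location relative to the repository root is an assumption.

```python
from pathlib import Path

# Sub-directories expected under acrv-datasets/datasets, per the layout above.
EXPECTED_SUBDIRS = ("coco", "glove", "trainval36")


def missing_dataset_dirs(acrv_datasets_dir):
    """Return the names of expected dataset directories that do not exist."""
    datasets_root = Path(acrv_datasets_dir) / "datasets"
    return [name for name in EXPECTED_SUBDIRS if not (datasets_root / name).is_dir()]


if __name__ == "__main__":
    # Assumed location; adjust to wherever acrv-datasets was cloned.
    missing = missing_dataset_dirs("acrv-datasets")
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All expected dataset directories are present.")
```

The check only looks for the directories themselves; it does not verify that the downloads inside them are complete.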

Evaluation

To evaluate with one of the pretrained models, run eval.py. Pretrained models are provided for both the captioning and visual question answering (VQA) tasks.

You can specify the desired task (captioning or VQA). For example, to perform captioning with the provided pretrained model, run the following command from the root directory:

$ python eval.py --task=captioning

Pretrained models will be automatically downloaded and stored in the pretrained/models directory. Alternatively, to load your own pretrained model, specify a load directory, for example:

$ python eval.py --task=captioning --load_directory=runs/mymodel

This loads a pretrained captioning model from the directory runs/mymodel.
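When evaluating several saved runs, the commands above can also be assembled programmatically. The sketch below is an illustration only: `build_eval_command` and `run_eval` are hypothetical helpers, and the only flags used are `--task` and `--load_directory` as shown above. The task string is passed through unchanged, so it must match what eval.py accepts.

```python
import subprocess
import sys


def build_eval_command(task, load_directory=None, python=sys.executable):
    """Build the argument list for an eval.py invocation."""
    command = [python, "eval.py", f"--task={task}"]
    if load_directory is not None:
        # Omit the flag entirely to fall back to the provided pretrained model.
        command.append(f"--load_directory={load_directory}")
    return command


def run_eval(task, load_directory=None):
    """Run eval.py from the current directory; raises CalledProcessError on failure."""
    return subprocess.run(build_eval_command(task, load_directory), check=True)
```

For example, `run_eval("captioning", "runs/mymodel")`, called from the root directory of this repository, would correspond to the second command above.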

Training

To train your own model, run train.py.

Use --task to choose between captioning for image captioning and vqa for visual question answering. For example, to train on the image captioning task, run the following command from the root directory of this repository:

$ python train.py --task=captioning
