Disclaimer: I am not working on this anymore. I will be happy to answer questions and review & merge PRs though.
This is a Keras & TensorFlow implementation of a captioning model. In particular, it uses the attention models described in this paper, depicted below:
where V are the K local features from the last convolutional layer of a ConvNet (e.g. ResNet-50), and x_t is the input at time t (composed of the embedding of the previous word and the average image feature). h_t is the hidden state of the LSTM at time t, which is used to compute the attention weights applied to V to obtain the context vector c_t. c_t and h_t are combined to predict the current word y_t. In (b), an additional gate is incorporated into the LSTM to produce the extra output s_t, which is combined with V to compute the attention weights; s_t serves as an alternative feature to attend to instead of the image features in V.
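As a rough illustration only (not the repository's code), the following NumPy sketch shows how attention weights over the K local features and the resulting context vector c_t are typically computed from the hidden state h_t; the projection matrices and the dimensions are made-up placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Made-up dimensions: K local features of size D, hidden state of size H.
K, D, H = 49, 2048, 512

V = np.random.randn(K, D)    # local image features from the ConvNet
h_t = np.random.randn(H)     # LSTM hidden state at time t

# Placeholder projection parameters (learned in a real model).
W_v = np.random.randn(D, H)
W_h = np.random.randn(H, H)
w_a = np.random.randn(H)

# Additive attention: one score per region, then a softmax to get the weights.
scores = np.tanh(V.dot(W_v) + h_t.dot(W_h)).dot(w_a)  # shape (K,)
alpha = softmax(scores)                                # weights sum to 1

# Context vector c_t: attention-weighted average of the local features.
c_t = alpha.dot(V)                                     # shape (D,)
```

Roughly, in variant (b) the sentinel s_t enters the same softmax as an extra candidate, letting the model attend to it instead of the image regions.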
- Clone this repository:
```bash
# Make sure to clone with --recursive
git clone --recursive https://github.com/amaiasalvador/sat_keras.git
```
- Install Python 2.7.
- Install TensorFlow 0.12.
- Install the remaining Python dependencies:
```bash
pip install -r requirements.txt
```
- (Optional) Install this Keras PR with support for layer-wise learning rate multipliers:
```bash
git clone https://github.com/amaiasalvador/keras.git
cd keras
git checkout lr_mult
python setup.py install
```
This option is disabled by default, so you can use "regular" Keras 1.2.2 if you don't want to set a different learning rate for the base model.
- Set TensorFlow as the Keras backend in `~/.keras/keras.json`:
```json
{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}
```
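To confirm that the backend is picked up correctly, you can run a quick check (this just queries the active Keras backend):
```python
import keras.backend as K
print(K.backend())  # should print "tensorflow"
```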
- Download the MS COCO Caption Challenge 2015 dataset. Note that the test images are not required for this code to work.
- After extraction, the dataset folder must have the following structure:
```
$coco/                                      # dataset dir
$coco/annotations/                          # annotations directory
$coco/annotations/captions_train2014.json   # caption anns for training set
$coco/annotations/captions_val2014.json     # ...
$coco/images/                               # image dir
$coco/images/train2014                      # train image dir
$coco/images/val2014                        # ...
```
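As a quick, optional sanity check (not part of the repository), a snippet like this verifies the expected layout before preprocessing; `coco_root` is a placeholder for your dataset directory:
```python
import os

coco_root = "/path/to/coco"  # placeholder: set to your dataset dir

expected = [
    "annotations/captions_train2014.json",
    "annotations/captions_val2014.json",
    "images/train2014",
    "images/val2014",
]

for rel in expected:
    # Report whether each required file/directory is present.
    status = "ok" if os.path.exists(os.path.join(coco_root, rel)) else "MISSING"
    print("%-45s %s" % (rel, status))
```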
- Navigate to `imcap/utils` and run:
```bash
python prepro_coco.py --output_json path_to_json --output_h5 path_to_h5 --images_root path_to_coco_images
```
This will create the vocabulary and the HDF5 file with the preprocessed data.
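To sanity-check the preprocessing output, you can list what was stored in the generated HDF5 file (the actual dataset names depend on `prepro_coco.py` and are not assumed here):
```python
import h5py

# Open the file produced by prepro_coco.py (the path is a placeholder).
with h5py.File("path_to_h5", "r") as f:
    def show(name, obj):
        # Print each stored dataset with its shape and dtype.
        if isinstance(obj, h5py.Dataset):
            print("%s %s %s" % (name, obj.shape, obj.dtype))
    f.visititems(show)
```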
- [Coming soon] Download the pretrained model here.
Unless stated otherwise, run all commands from `./imcap`:
Run `sample_captions.ipynb` to test the trained network on some images and visualize attention maps.
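For reference, here is a minimal, standalone sketch (not the notebook's code) of how a low-resolution attention map can be overlaid on an image; the 7x7 grid, the random weights, and the file name are placeholders:
```python
import numpy as np
import matplotlib.pyplot as plt

img = plt.imread("example.jpg")   # placeholder image path
h, w = img.shape[0], img.shape[1]

# Placeholder 7x7 attention weights; a real map comes from the model.
alpha = np.random.rand(7, 7)
alpha /= alpha.sum()

# Nearest-neighbour upsampling of the attention map to the image size.
alpha_up = np.kron(alpha, np.ones((h // 7 + 1, w // 7 + 1)))[:h, :w]

plt.imshow(img)
plt.imshow(alpha_up, cmap="jet", alpha=0.5)  # semi-transparent overlay
plt.axis("off")
plt.show()
```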
Run `python train.py` to train a model. Run `python args.py --help` for a list of the available arguments.
- Run `python test.py` to forward all validation images through a trained network and create a JSON file with the results (a sketch of the results format is shown after this list). Use the `--cnntrain` flag if evaluating a model with a fine-tuned ConvNet.
- Navigate to `./imcap/coco_caption/`.
- From there, run the following to get METEOR, BLEU, ROUGE_L & CIDEr scores for the previous JSON file with generated captions:
```bash
python eval_caps.py -results_file results.json -ann_file gt_file.json
```
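The COCO caption evaluation code consumes results in the standard COCO format: a JSON list with one entry per image, each holding the image id and the generated caption. A minimal example of writing such a file (the ids and captions below are made up):
```python
import json

# Made-up entries illustrating the standard COCO captioning results format.
results = [
    {"image_id": 391895, "caption": "a man riding a motorcycle on a dirt road"},
    {"image_id": 522418, "caption": "a woman cutting a cake with a knife"},
]

with open("results.json", "w") as f:
    json.dump(results, f)
```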
For the sake of comparison, the data processing script follows the ones in NeuralTalk2 and AdaptiveAttention.
- Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
- Lu et al. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. CVPR 2017 (original code here).
- Caption evaluation code from this repository.
For questions and suggestions either use the issues section or send an e-mail to [email protected].