EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
EVE is accepted by AAAI 2024.
Release preprint on arXiv.
Release pre-trained weights and fine-tuning code.
Pre-training code is coming soon.
Base Model with 4M Images and 10M Image-Text Pairs.
Large Model with 4M Images and 10M Image-Text Pairs.
Large Model with 16M Images and 21M Image-Text Pairs.
Base Model with 4M Images and 10M Image-Text Pairs.
Large Model with 4M Images and 10M Image-Text Pairs.
Large Model with 16M Images and 21M Image-Text Pairs.
We use a BPE tokenizer pre-trained on BookCorpus and Wikipedia.
- Install python3 environment
pip3 install -r requirements.txt
- We use PyTorch v1.12.1 for the implementation and fairscale for gradient checkpointing
- Download raw images from corresponding websites
- Use the scripts in data_utils to preprocess each dataset (adjust the data paths and the image-reading code to your setup). This is similar to the data preparation in ViLT
Download the fine-tuning JSON files, which are the same as those used by XVLM
We use <output_dir> and <output_hdfs_dir> to save checkpoints; they can be the same path.
<aug> must be one of [None, color, rand].
<vlmo_config> selects the model configuration:
- config_vlmoB_base64k.json for the base model
- config_vlmoL_base64k.json for the large model
<bs> is the total batch size (batch size per GPU * number of GPUs).
<lr> is the learning rate.
<lr_mult> is the multiplier applied to the learning rate of all parameters except the backbone.
<k_test> is the number of top-k candidates selected for reranking in the retrieval tasks.
Use --evaluate in the scripts to run evaluation only.
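The relationship between the total batch size, the GPU count, and the two learning rates can be illustrated with a small sketch. This is not code from the repository: the function names, the requirement that the batch size divide evenly, and the `backbone.` name prefix used to identify backbone parameters are all assumptions made for illustration.

```python
def per_gpu_batch_size(total_bs: int, num_gpus: int) -> int:
    """<bs> is the total batch size, so each GPU sees total_bs / num_gpus samples."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    # Assumption: the total batch size is split evenly across GPUs.
    if total_bs % num_gpus != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_bs // num_gpus


def learning_rates(param_names, lr, lr_mult, backbone_prefix="backbone."):
    """Map each parameter name to its learning rate.

    Backbone parameters keep lr; all other parameters use lr * lr_mult.
    The prefix is hypothetical and only marks which names count as backbone.
    """
    return {
        name: lr if name.startswith(backbone_prefix) else lr * lr_mult
        for name in param_names
    }


if __name__ == "__main__":
    print(per_gpu_batch_size(256, 8))
    rates = learning_rates(
        ["backbone.blocks.0.weight", "head.weight"], lr=3e-5, lr_mult=10
    )
    for name, rate in rates.items():
        print(name, rate)
```

With bs=256 on 8 GPUs, each GPU would process 32 samples per step under these assumptions.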
python3 run.py --task=itr_coco --dist=all --checkpoint=<checkpoint_dir> --output_dir=<output_dir> --output_hdfs=<output_hdfs_dir> --augmentation=<aug> --bs=<bs> --lr=<lr> --k_test=<k_test> --lr_mult=<lr_mult>
We set augmentation=color, bs=256, k_test=128, lr_mult=10 for COCO Retrieval.
We set lr=3e-5 for the base model and lr=5e-5 for the large model.
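As a convenience, the COCO Retrieval command with the settings listed above could be assembled programmatically. The sketch below only builds and prints the command; it does not run it. The helper name and the example paths are hypothetical, while the flags and values are the ones given in this README.

```python
import shlex


def build_coco_retrieval_command(checkpoint, output_dir, output_hdfs, lr):
    """Assemble the run.py command for COCO Retrieval as an argument list."""
    args = {
        "task": "itr_coco",
        "dist": "all",
        "checkpoint": checkpoint,
        "output_dir": output_dir,
        "output_hdfs": output_hdfs,
        "augmentation": "color",
        "bs": 256,
        "lr": lr,  # passed as a string so it appears exactly as written
        "k_test": 128,
        "lr_mult": 10,
    }
    return ["python3", "run.py"] + [f"--{key}={value}" for key, value in args.items()]


# Learning rates stated above for the two model sizes.
BASE_LR = "3e-5"
LARGE_LR = "5e-5"

# Hypothetical paths; replace them with your own checkpoint and output directories.
cmd = build_coco_retrieval_command("ckpt/eve_base", "out/coco", "out/coco", BASE_LR)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if you prefer to launch fine-tuning from Python.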
python3 run.py --task=itr_flickr --dist=all --checkpoint=<checkpoint_dir> --output_dir=<output_dir> --output_hdfs=<output_hdfs_dir> --augmentation=<aug> --bs=<bs> --lr=<lr> --k_test=<k_test> --lr_mult=<lr_mult> --vlmo_config=<vlmo_config>
augmentation=color, bs=128, lr=1e-5, k_test=128, lr_mult=5 for Flickr Retrieval.
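The role of k_test in the retrieval tasks can be illustrated with a toy two-stage ranking. This sketch assumes a setup in which a cheap first-stage score picks the top k_test candidates and a more expensive score reorders only those; the function and variable names are hypothetical and the scores are made-up numbers, not outputs of EVE.

```python
import heapq


def rerank_top_k(coarse_scores, rerank_score, k_test):
    """Return candidate indices, best first, after reranking the top k_test.

    coarse_scores: first-stage score per candidate (higher is better).
    rerank_score: callable mapping a candidate index to its second-stage score.
    """
    # Stage 1: keep only the k_test candidates with the highest coarse score.
    top_k = heapq.nlargest(
        k_test, range(len(coarse_scores)), key=coarse_scores.__getitem__
    )
    # Stage 2: reorder the survivors by the second-stage score.
    return sorted(top_k, key=rerank_score, reverse=True)


coarse = [0.10, 0.80, 0.75, 0.30]
fine = {0: 0.00, 1: 0.60, 2: 0.90, 3: 0.95}
ranked = rerank_top_k(coarse, fine.__getitem__, k_test=2)
print(ranked)  # [2, 1]
```

Candidate 3 has the highest second-stage score but is never reranked because it falls outside the top 2 in the first stage. A larger k_test lowers the chance of such misses at the cost of more second-stage scoring.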
python3 run.py --task=vqa --dist=all --config=configs/VQA_beit_480_vg.yaml --checkpoint=<checkpoint_dir> --output_dir=<output_dir> --output_hdfs=<output_hdfs_dir> --vlmo_config=<vlmo_config> --bs=<bs> --lr_mult=<lr_mult> --lr=<lr>
bs=128, lr_mult=10, lr=3e-5 for VQA.
python3 run.py --task=NLVR_vit_lrtest_5e-5= --dist=all --config=configs/VQA_beit_480_vg.yaml --checkpoint=<checkpoint_dir> --output_dir=<output_dir> --output_hdfs=<output_hdfs_dir> --beit --augmentation=<aug> --vlmo_config=<vlmo_config> --bs=<bs> --lr_mult=<lr_mult> --lr=<lr> --biattn
augmentation=rand, bs=128, lr=3.5e-5, lr_mult=15 for NLVR.
Use --biattn to enable the bi-attention module.
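For quick reference, the recommended fine-tuning settings listed above can be collected in one place. The dictionary keys and the helper name are illustrative labels chosen for this sketch, not values accepted by --task; the numbers are copied from this README.

```python
RECOMMENDED = {
    "coco_retrieval": {
        "augmentation": "color", "bs": 256, "k_test": 128, "lr_mult": 10,
        "lr": {"base": 3e-5, "large": 5e-5},  # depends on model size
    },
    "flickr_retrieval": {
        "augmentation": "color", "bs": 128, "k_test": 128, "lr_mult": 5, "lr": 1e-5,
    },
    "vqa": {"bs": 128, "lr_mult": 10, "lr": 3e-5},
    "nlvr": {"augmentation": "rand", "bs": 128, "lr_mult": 15, "lr": 3.5e-5},
}


def settings_for(task, model_size="base"):
    """Return a flat copy of the settings for a task, resolving a per-size lr."""
    settings = dict(RECOMMENDED[task])
    lr = settings["lr"]
    if isinstance(lr, dict):
        settings["lr"] = lr[model_size]
    return settings


if __name__ == "__main__":
    for task in RECOMMENDED:
        print(task, settings_for(task, "large"))
```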
If you find this repository useful, please consider giving it a ⭐ or citing:
author = {Junyi Chen and Longteng Guo and Jia Sun and Shuai Shao and Zehuan Yuan and Liang Lin and Dongyu Zhang},
title = {EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE},
journal = {arXiv preprint arXiv:2308.11971},
year = {2023},