Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions that we use in the paper. Put the downloaded data under the folder `playground/data`:
```
playground/
└── data
    └── pretrain
        ├── blip_laion_cc_sbu_558k.json
        ├── blip_laion_cc_sbu_558k_meta.json
        └── images
```
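Once the JSON and images are in place, a quick consistency check can catch a truncated or incomplete download. A minimal sketch, assuming the standard LLaVA annotation format where each record carries an `image` field holding a path relative to the `images/` folder:

```python
import json
from pathlib import Path

def missing_pretrain_images(root):
    """Return the `image` entries from blip_laion_cc_sbu_558k.json whose
    files are absent under <root>/images (the field name is assumed from
    the usual LLaVA annotation format)."""
    root = Path(root)
    records = json.loads((root / "blip_laion_cc_sbu_558k.json").read_text())
    images = root / "images"
    return [r["image"] for r in records if not (images / r["image"]).exists()]
```

Calling `missing_pretrain_images("playground/data/pretrain")` should return an empty list when the download is complete.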
Please download the annotation of the final mixture of our instruction tuning data, `llava_v1_5_mix665k.json`, and download the images from the constituting datasets:
- COCO: train2017, val2014
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
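The note about saving all OCR-VQA files as `.jpg` can also be handled after the fact with a small rename pass. A sketch, assuming the files were downloaded into `ocr_vqa/images` with their original extensions; this only normalizes the suffix, which is sufficient because PIL-based loaders detect the real image format from the file contents:

```python
from pathlib import Path

def normalize_ocrvqa_extensions(image_dir):
    """Rename every file directly under image_dir to use a .jpg suffix.
    Only the suffix changes; the bytes are left untouched. Returns the
    number of files renamed."""
    renamed = 0
    for f in Path(image_dir).iterdir():
        if f.is_file() and f.suffix.lower() != ".jpg":
            f.rename(f.with_suffix(".jpg"))
            renamed += 1
    return renamed
```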
After downloading all of them, organize the data as follows in `./playground/data`:
```
playground/
└── data
    ├── llava_v1_5_mix665k.json
    ├── coco
    │   ├── val2014
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```
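Before launching fine-tuning, the layout above can be verified with a short check. A sketch with the expected paths copied from the tree, relative to `playground/data`:

```python
from pathlib import Path

# Expected layout from the directory tree above, relative to playground/data.
EXPECTED = [
    "llava_v1_5_mix665k.json",
    "coco/val2014",
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

def missing_paths(data_root):
    """Return the expected files/folders that are absent under data_root."""
    root = Path(data_root)
    return [p for p in EXPECTED if not (root / p).exists()]
```

`missing_paths("./playground/data")` returning an empty list means every constituent dataset landed in the right place.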