[Feature] Support LLaVA #196

Merged
merged 218 commits into from Dec 26, 2023

Changes from all commits (218 commits)
6de3469
v1
LZHgrla Nov 1, 2023
70971d9
add load_image
LZHgrla Nov 1, 2023
9405944
update cfg image url
LZHgrla Nov 1, 2023
c946bd5
del fig
LZHgrla Nov 1, 2023
5b76a56
Merge branch 'main' into lzh/llava
LZHgrla Nov 1, 2023
551bb74
update
LZHgrla Nov 1, 2023
70fa7d9
temp
LZHgrla Nov 1, 2023
5013d3d
update convert
LZHgrla Nov 2, 2023
39f2fb3
update chat_mm
LZHgrla Nov 2, 2023
a65a1ae
add exclude_frozen_parameters for deepspeed
LZHgrla Nov 2, 2023
5dd244a
update chat
LZHgrla Nov 2, 2023
669f282
update xtuner help msg
LZHgrla Nov 2, 2023
b0f9ad0
fix bugs
LZHgrla Nov 2, 2023
dea64fb
revert bf16 deepspeed
LZHgrla Nov 2, 2023
c31ab61
Merge branch 'InternLM:main' into lzh/llava
LZHgrla Nov 2, 2023
1f4a97b
fix bugs
LZHgrla Nov 2, 2023
6ceeaa8
add visual_select_layer for chat
LZHgrla Nov 2, 2023
7502793
improve pth_to_hf
LZHgrla Nov 2, 2023
6f31402
Merge branch 'main' into lzh/llava
LZHgrla Nov 3, 2023
9268dbc
rename projecter_pth to pretrained_pth
LZHgrla Nov 6, 2023
5282b3c
temp
LZHgrla Nov 6, 2023
3e4b425
update requirements
LZHgrla Nov 7, 2023
fe30549
add cfgs
LZHgrla Nov 7, 2023
413111f
update
LZHgrla Nov 7, 2023
fac6cf8
fix pre-commit
LZHgrla Nov 7, 2023
da3f268
optim chat
LZHgrla Nov 7, 2023
b25913b
optim chat
LZHgrla Nov 7, 2023
0a9e480
Delete xtuner/model/unused.py
LZHgrla Nov 7, 2023
c0c4b8b
move dispatch to a deeper folder
LZHgrla Nov 8, 2023
74159ac
add projector
LZHgrla Nov 8, 2023
b92f075
update
LZHgrla Nov 8, 2023
04fca91
del model/projector
LZHgrla Nov 8, 2023
fba83bc
fix bugs
LZHgrla Nov 8, 2023
93f6616
add docs
LZHgrla Nov 8, 2023
d13f71e
update
LZHgrla Nov 8, 2023
e62132f
update
LZHgrla Nov 8, 2023
37b017e
update
LZHgrla Nov 8, 2023
8d67ed1
update
LZHgrla Nov 8, 2023
6b99303
Merge branch 'main' into lzh/llava
LZHgrla Nov 8, 2023
00c5d0c
enhance resume for map_fn
LZHgrla Nov 8, 2023
5496952
update import
LZHgrla Nov 8, 2023
a07a833
add llava_internlm_chat_7b_clip_vit_large_p14
LZHgrla Nov 9, 2023
21e649b
update dispatch
LZHgrla Nov 9, 2023
e0e0275
update dispatch
LZHgrla Nov 9, 2023
1a12477
add link
LZHgrla Nov 9, 2023
fbaf22f
update max_length
LZHgrla Nov 10, 2023
f31366f
update max_length
LZHgrla Nov 10, 2023
9786d3b
update hyp
LZHgrla Nov 10, 2023
db15bfa
align
LZHgrla Nov 11, 2023
1af5c4a
Merge branch 'main' into lzh/llava
LZHgrla Nov 13, 2023
c426c42
move yi flash attn
LZHgrla Nov 13, 2023
001bf8d
fix pre-commit
LZHgrla Nov 13, 2023
7b7b690
update deepspeed requirements
LZHgrla Nov 14, 2023
374f997
add mmbench script
LZHgrla Nov 14, 2023
bb9dcc3
install openpyxl
LZHgrla Nov 14, 2023
f4114ed
add entry_point for mmbench
LZHgrla Nov 14, 2023
041e96f
save args
LZHgrla Nov 14, 2023
c5c8437
update mmbench
LZHgrla Nov 14, 2023
0509cdd
Merge branch 'main' into lzh/llava
LZHgrla Nov 15, 2023
c3e3cf2
update max_length
LZHgrla Nov 15, 2023
20b5d74
add llama2 qlora
LZHgrla Nov 15, 2023
092f7e6
update mmbench
LZHgrla Nov 15, 2023
6ffffef
fix mmbench bugs
LZHgrla Nov 15, 2023
efc178b
Merge branch 'main' into lzh/llava
LZHgrla Nov 16, 2023
d30ba29
use osp instead of os.path
LZHgrla Nov 16, 2023
a3f2435
refactor pth_to_hf
LZHgrla Nov 16, 2023
aafe80c
update chat and mmbench to support --llava
LZHgrla Nov 16, 2023
8836b0b
align to chat
LZHgrla Nov 16, 2023
8393559
update entry_point
LZHgrla Nov 16, 2023
e0274f6
add vicuna template
LZHgrla Nov 16, 2023
927e207
add vicuna_7b_v15
LZHgrla Nov 16, 2023
33d6c76
Merge branch 'main' into lzh/llava
LZHgrla Nov 17, 2023
903e074
fix pre-commit
LZHgrla Nov 17, 2023
b66419e
add vicuna_7b_v1.5 qlora
LZHgrla Nov 20, 2023
6edf769
Merge branch 'main' into lzh/llava
LZHgrla Nov 20, 2023
5fe1e54
skip_special_tokens for decode text
LZHgrla Nov 21, 2023
376abb6
remove do_sample
LZHgrla Nov 22, 2023
82d8df6
Merge branch 'main' into lzh/llava
LZHgrla Nov 22, 2023
4dc6379
add warmup
LZHgrla Nov 22, 2023
67ffd70
fix pre-commit
LZHgrla Nov 22, 2023
76ad9e9
Update dataset_prepare.md
LZHgrla Nov 22, 2023
aeded33
Update dataset_prepare.md
LZHgrla Nov 22, 2023
3c44f94
Add KEEP_STSTEM for template
LZHgrla Nov 22, 2023
b537e9b
remove
LZHgrla Nov 22, 2023
c36c1ff
fix vicuna template
LZHgrla Nov 22, 2023
fb3f7da
clean cfgs
LZHgrla Nov 22, 2023
b0b4f1d
add cfgs
LZHgrla Nov 22, 2023
6c2dec5
Merge branch 'main' into lzh/llava
LZHgrla Nov 22, 2023
1543df7
fix pre-commit
LZHgrla Nov 22, 2023
db434f7
add --language for mmbench
LZHgrla Nov 23, 2023
9f3e44e
Merge branch 'main' into lzh/llava
LZHgrla Nov 23, 2023
eb2ad0d
fix bugs
LZHgrla Nov 23, 2023
349e37c
fix pretrain bug
LZHgrla Nov 23, 2023
bbbc62b
support visual_encoder lora
LZHgrla Nov 23, 2023
0357a93
fix bugs
LZHgrla Nov 23, 2023
f2295d2
add paramwise_cfg
LZHgrla Nov 23, 2023
72a986e
remove print_peft_model_trainable_parameters
LZHgrla Nov 23, 2023
e0583cb
Merge branch 'main' into lzh/llava
LZHgrla Nov 24, 2023
4416c56
fix bugs
LZHgrla Nov 24, 2023
0894e60
add paramwise_cfg for DeepSpeedOptimWrapper
LZHgrla Nov 24, 2023
ff4f15e
fix engine deepspeed paramwise_cfg bug
LZHgrla Nov 24, 2023
aa1dbf1
fix encode_fn bug
LZHgrla Nov 25, 2023
a046e0e
fix
LZHgrla Nov 25, 2023
9080be3
fix pad_image_to_square bugs
LZHgrla Nov 26, 2023
12c212a
Add space for system to avoid mismatch of 'USER' token
LZHgrla Nov 26, 2023
19bde6f
revert to adding bos_token at each conv
LZHgrla Nov 29, 2023
7c01831
revert for paramwise_cfg
LZHgrla Nov 29, 2023
ba9de6d
better cfgs?
LZHgrla Nov 29, 2023
baa1727
fix import bug
LZHgrla Nov 29, 2023
c5e61a5
fix import bug
LZHgrla Nov 29, 2023
fece023
pretrain align
LZHgrla Nov 30, 2023
273b24d
update prepare_inputs_labels_for_multimodal
LZHgrla Nov 30, 2023
b37dd8f
1792
LZHgrla Nov 30, 2023
e307624
support length_grouped_samplers
LZHgrla Dec 1, 2023
e25280b
1792
LZHgrla Dec 1, 2023
0a15676
remove KEEP_SYSTEM
LZHgrla Dec 1, 2023
e3b936a
remove system in cfg
LZHgrla Dec 1, 2023
580136b
update 336 cfg
LZHgrla Dec 1, 2023
683385e
Merge branch 'main' into lzh/llava
LZHgrla Dec 1, 2023
053eb84
add torch_dtype for mmbench and chat
LZHgrla Dec 2, 2023
f362a9f
group 50
LZHgrla Dec 2, 2023
12d7a1e
quant for pretrain
LZHgrla Dec 2, 2023
c4fd8db
update cfgs
LZHgrla Dec 4, 2023
245af61
refactor cfgs
LZHgrla Dec 4, 2023
4087168
add length for concat dataset
LZHgrla Dec 4, 2023
013a930
update requirements
LZHgrla Dec 4, 2023
8721427
Merge branch 'lzh/llava' of github.com:LZHgrla/xtuner into lzh/llava
LZHgrla Dec 4, 2023
491be19
fix typo
LZHgrla Dec 4, 2023
bf5d2da
Merge branch 'main' into lzh/llava
LZHgrla Dec 4, 2023
0e21e51
add template for internlm pretrain
LZHgrla Dec 6, 2023
d429961
no zh
LZHgrla Dec 6, 2023
8ce84c3
remove 20b cfgs
LZHgrla Dec 6, 2023
41a8794
fix pre-commit
LZHgrla Dec 6, 2023
ac80d1a
revert invalid input
LZHgrla Dec 7, 2023
2e94a52
rename
LZHgrla Dec 7, 2023
c5b9e75
Update README.md
LZHgrla Dec 7, 2023
2110f19
Update README_zh-CN.md
LZHgrla Dec 7, 2023
036fd72
fix pre-commit
LZHgrla Dec 7, 2023
a8eecbf
remove llava_zh from docs
LZHgrla Dec 8, 2023
0cc9bf8
qlora 512
LZHgrla Dec 9, 2023
bcaffd4
rename llava map_fn
LZHgrla Dec 10, 2023
0f0250c
update cfgs
LZHgrla Dec 10, 2023
a050926
update model urls
LZHgrla Dec 11, 2023
951b15a
add docs link
LZHgrla Dec 11, 2023
8dc0746
add llava docs
LZHgrla Dec 11, 2023
2e5c77a
Merge branch 'main' into lzh/llava
LZHgrla Dec 11, 2023
3aef652
update docs
LZHgrla Dec 11, 2023
24996d6
update urls
LZHgrla Dec 11, 2023
8787baa
Merge branch 'main' into lzh/llava
LZHgrla Dec 11, 2023
3eac2df
add citation
LZHgrla Dec 11, 2023
f65bf9a
fix README
LZHgrla Dec 14, 2023
aa6d525
move
LZHgrla Dec 14, 2023
ba8facd
update
LZHgrla Dec 14, 2023
b717f8e
vicuna pretrain with prompt
LZHgrla Dec 14, 2023
6dd1e63
rename
LZHgrla Dec 15, 2023
321b351
add results
LZHgrla Dec 15, 2023
c44f71b
fix pre-commit
LZHgrla Dec 15, 2023
cde826f
update
LZHgrla Dec 15, 2023
f2b6e3b
update
LZHgrla Dec 15, 2023
3b6c07a
update
LZHgrla Dec 15, 2023
b383667
update
LZHgrla Dec 15, 2023
b463465
update
LZHgrla Dec 15, 2023
261fc43
update
LZHgrla Dec 15, 2023
da3ed07
update
LZHgrla Dec 15, 2023
8045143
update
LZHgrla Dec 15, 2023
f7d14f8
update
LZHgrla Dec 15, 2023
ee9c026
update
LZHgrla Dec 15, 2023
adb4ba5
update
LZHgrla Dec 15, 2023
fa1ce76
update
LZHgrla Dec 15, 2023
95faa59
Update README.md
LZHgrla Dec 15, 2023
8054690
Update README_zh-CN.md
LZHgrla Dec 15, 2023
56e9507
Update README_zh.md
LZHgrla Dec 15, 2023
99c4e91
Update README_zh.md
LZHgrla Dec 15, 2023
479a5fd
Update README.md
LZHgrla Dec 15, 2023
47f4927
Update README_zh.md
LZHgrla Dec 15, 2023
367225b
Update README.md
LZHgrla Dec 15, 2023
c4007f7
Update README.md
LZHgrla Dec 15, 2023
3d4dee8
fix typo
LZHgrla Dec 15, 2023
eec012c
fix
LZHgrla Dec 15, 2023
b027cb8
Update README.md
LZHgrla Dec 15, 2023
6276d33
Update README_zh-CN.md
LZHgrla Dec 15, 2023
ad65fc8
rename
LZHgrla Dec 16, 2023
77d9809
auto cn_string
LZHgrla Dec 16, 2023
a318133
fix pre-commit
LZHgrla Dec 16, 2023
1dadc4b
rename
LZHgrla Dec 16, 2023
72ca5ee
remove language
LZHgrla Dec 16, 2023
197c292
add VLMEvalKit
LZHgrla Dec 16, 2023
11cbfdc
rename VLLM to VLM
LZHgrla Dec 21, 2023
63ed932
add the download links of MMBench
LZHgrla Dec 21, 2023
99a2b8e
update
LZHgrla Dec 21, 2023
8080a06
update readme
LZHgrla Dec 21, 2023
360b816
update
LZHgrla Dec 21, 2023
4ade82d
update
LZHgrla Dec 21, 2023
885c832
update merge
LZHgrla Dec 21, 2023
990d689
fix cfg bug
LZHgrla Dec 21, 2023
0e5d692
Update README.md
LZHgrla Dec 21, 2023
8225f9f
Update README_zh.md
LZHgrla Dec 21, 2023
648111d
update
LZHgrla Dec 21, 2023
6f06498
fix
LZHgrla Dec 21, 2023
76d1313
Merge branch 'main' into lzh/llava
LZHgrla Dec 21, 2023
b5124b1
update requirements
LZHgrla Dec 22, 2023
5973d6c
Merge branch 'main' into lzh/llava
LZHgrla Dec 22, 2023
311f9d0
Update runtime.txt
LZHgrla Dec 22, 2023
cbb7924
Update runtime.txt
LZHgrla Dec 22, 2023
d9a96af
Update runtime.txt
LZHgrla Dec 22, 2023
8332c2c
Update README.md
LZHgrla Dec 25, 2023
b9efc8a
Update README.md
LZHgrla Dec 25, 2023
7b29f81
Update README_zh.md
LZHgrla Dec 25, 2023
6c80ec7
fix pre-commit
LZHgrla Dec 25, 2023
034b4cb
fix
LZHgrla Dec 25, 2023
f7ec4da
update mmbench prompt
LZHgrla Dec 25, 2023
7231865
fix bugs
LZHgrla Dec 26, 2023
bf384de
fix bugs
LZHgrla Dec 26, 2023
80d7c11
update docs
LZHgrla Dec 26, 2023
7e68e1d
update
LZHgrla Dec 26, 2023
327a122
update
LZHgrla Dec 26, 2023
15c927a
Merge branch 'main' into lzh/llava
LZHgrla Dec 26, 2023
761a7ea
Update README.md
LZHgrla Dec 26, 2023
15 changes: 14 additions & 1 deletion README.md
@@ -23,7 +23,8 @@ English | [简体中文](README_zh-CN.md)

## 🎉 News

- **\[2023/12\]** 🔥 Support [Mixtral 8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model! To get started, please check out the [docs](xtuner/configs/mixtral/README.md)!
- **\[2023/12\]** 🔥 Support multi-modal VLM pretraining and fine-tuning with [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA) architecture! Click [here](xtuner/configs/llava/README.md) for details!
- **\[2023/12\]** 🔥 Support [Mixtral 8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model! Click [here](xtuner/configs/mixtral/README.md) for details!
- **\[2023/11\]** Support [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b) model!
- **\[2023/10\]** Support [MSAgent-Bench](https://modelscope.cn/datasets/damo/MSAgent-Bench) dataset, and the fine-tuned LLMs can be applied by [Lagent](https://github.com/InternLM/lagent)!
- **\[2023/10\]** Optimize the data processing to accommodate `system` context. More information can be found on [Docs](docs/en/user_guides/dataset_format.md)!
@@ -267,6 +268,18 @@ We appreciate all contributions to XTuner. Please refer to [CONTRIBUTING.md](.gi
- [Llama 2](https://github.com/facebookresearch/llama)
- [QLoRA](https://github.com/artidoro/qlora)
- [LMDeploy](https://github.com/InternLM/lmdeploy)
- [LLaVA](https://github.com/haotian-liu/LLaVA)

## 🖊️ Citation

```bibtex
@misc{2023xtuner,
title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
author={XTuner Contributors},
howpublished = {\url{https://github.com/InternLM/xtuner}},
year={2023}
}
```

## License

13 changes: 13 additions & 0 deletions README_zh-CN.md
@@ -23,6 +23,7 @@

## 🎉 News

- **\[2023/12\]** 🔥 Support multi-modal VLM pretraining and instruction fine-tuning with [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA)! See this [doc](xtuner/configs/llava/README_zh.md) to get started!
- **\[2023/12\]** 🔥 Support the [Mixtral 8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model! See this [doc](xtuner/configs/mixtral/README.md) to get started!
- **\[2023/11\]** Support the [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b) model!
- **\[2023/10\]** Support the [MSAgent-Bench](https://modelscope.cn/datasets/damo/MSAgent-Bench) dataset; the fine-tuned LLMs can be used with the [Lagent](https://github.com/InternLM/lagent) framework!
@@ -267,6 +268,18 @@ xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-
- [Llama 2](https://github.com/facebookresearch/llama)
- [QLoRA](https://github.com/artidoro/qlora)
- [LMDeploy](https://github.com/InternLM/lmdeploy)
- [LLaVA](https://github.com/haotian-liu/LLaVA)

## 🖊️ Citation

```bibtex
@misc{2023xtuner,
title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
author={XTuner Contributors},
howpublished = {\url{https://github.com/InternLM/xtuner}},
year={2023}
}
```

## License

76 changes: 76 additions & 0 deletions docs/en/user_guides/dataset_prepare.md
@@ -5,6 +5,7 @@
- [Arxiv Gentitle](#arxiv-gentitle)
- [MOSS-003-SFT](#moss-003-sft)
- [Chinese Lawyer](#chinese-lawyer)
- [LLaVA dataset](#llava-dataset)

## HuggingFace datasets

@@ -55,3 +56,78 @@ unzip moss-003-sft-with-tools-no-text2image.zip
The Chinese Lawyer dataset has two sub-datasets, which can be downloaded from https://github.com/LiuHC0428/LAW-GPT.

All lawyer configs assume the dataset paths are `./data/CrimeKgAssitant清洗后_52k.json` and `./data/训练数据_带法律依据_92k.json`. You can move and rename your data, or adjust the paths in these configs.

### LLaVA dataset

#### File structure

```
./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```

#### Pretrain

LLaVA-Pretrain

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
```

#### Finetune

1. Text data

1. LLaVA-Instruct-150K

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
```

2. Image data

1. COCO (coco): [train2017](http://images.cocodataset.org/zips/train2017.zip)

2. GQA (gqa): [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)

3. OCR-VQA (ocr_vqa): [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)

1. ⚠️ Rename OCR-VQA's images so that they all have the `.jpg` extension!

```shell
#!/bin/bash
# Directory that contains the downloaded OCR-VQA images
ocr_vqa_path="<your-directory-path>"

find "$ocr_vqa_path" -type f | while read -r file; do
    extension="${file##*.}"
    if [ "$extension" != "jpg" ]; then
        # Copy to a .jpg name (the original file is kept; use mv to rename in place)
        cp -- "$file" "${file%.*}.jpg"
    fi
done
```

4. TextVQA (textvqa): [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)

5. VisualGenome (VG): [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
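
For reference, the downloaded archives can then be unpacked into the layout shown in the file structure above. The following is a minimal sketch, assuming the archives sit in the current directory under their upstream names (note that GQA's archive and VG part1 are both named `images.zip` upstream, so keep them apart or rename one; the `ocr_vqa/images` directory is produced by its download script):

```shell
mkdir -p ./data/llava_data/llava_images/{coco,gqa,textvqa,vg}

# Assumed archive names; each is expected to unpack into the
# sub-directory that the configs look for.
unzip train2017.zip -d ./data/llava_data/llava_images/coco            # -> coco/train2017
unzip images.zip -d ./data/llava_data/llava_images/gqa                # GQA -> gqa/images
unzip train_val_images.zip -d ./data/llava_data/llava_images/textvqa  # -> textvqa/train_images
unzip images.zip -d ./data/llava_data/llava_images/vg                 # VG part1 -> vg/VG_100K
unzip images2.zip -d ./data/llava_data/llava_images/vg                # VG part2 -> vg/VG_100K_2
```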
76 changes: 76 additions & 0 deletions docs/zh_cn/user_guides/dataset_prepare.md
@@ -5,6 +5,7 @@
- [Arxiv Gentitle (title generation)](#arxiv-gentitle-生成题目)
- [MOSS-003-SFT](#moss-003-sft)
- [Chinese Lawyer](#chinese-lawyer)
- [LLaVA dataset](#llava-dataset)

## HuggingFace datasets

@@ -55,3 +56,78 @@ unzip moss-003-sft-with-tools-no-text2image.zip
The Chinese Lawyer dataset has two sub-datasets, which can be downloaded from https://github.com/LiuHC0428/LAW-GPT.

All Chinese Lawyer configs assume the dataset paths are `./data/CrimeKgAssitant清洗后_52k.json` and `./data/训练数据_带法律依据_92k.json`. You can move and rename the data, or reset the data paths in the configs.

### LLaVA dataset

#### File structure

```
./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```

#### Pretrain

LLaVA-Pretrain

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
```

#### Finetune

1. Text data

1. LLaVA-Instruct-150K

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
```

2. Image data

1. COCO (coco): [train2017](http://images.cocodataset.org/zips/train2017.zip)

2. GQA (gqa): [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)

3. OCR-VQA (ocr_vqa): [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)

1. ⚠️ Rename the downloaded OCR-VQA images so that they all have the `.jpg` extension!

```shell
#!/bin/bash
# Directory that contains the downloaded OCR-VQA images
ocr_vqa_path="<your-directory-path>"

find "$ocr_vqa_path" -type f | while read -r file; do
    extension="${file##*.}"
    if [ "$extension" != "jpg" ]; then
        # Copy to a .jpg name (the original file is kept; use mv to rename in place)
        cp -- "$file" "${file%.*}.jpg"
    fi
done
```

4. TextVQA (textvqa): [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)

5. VisualGenome (VG): [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
1 change: 1 addition & 0 deletions requirements/runtime.txt
@@ -8,6 +8,7 @@ lagent>=0.1.2
# Minimum 0.10.1 to support exclude_frozen_parameters for DeepSpeedStrategy,
# see https://github.com/open-mmlab/mmengine/pull/1415, https://github.com/open-mmlab/mmengine/pull/1424
mmengine>=0.10.1
openpyxl
# Minimum 0.4.0 to support QLoRA, see https://github.com/huggingface/peft/pull/476
peft>=0.4.0
scipy
@@ -100,7 +100,9 @@
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(type=ThroughputHook)
]

92 changes: 92 additions & 0 deletions xtuner/configs/llava/README.md
@@ -0,0 +1,92 @@
# LLaVA Full Pipeline

## Data Preparation

Please refer to the [docs](../../../docs/en/user_guides/dataset_prepare.md#llava-dataset).

## Training

The training of LLaVA consists of two steps: alignment module (i.e., MLP) pretraining and instruction-following fine-tuning.

Note: this guide takes 8-GPU training of LLaVA-InternLM as an example. If GPU resources or memory are insufficient, you can reduce the batch size appropriately to lower memory consumption. The pretrained projector is saved to, and by default re-loaded from, `./work_dirs/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/epoch_1.pth`.

1. Alignment module pretraining (saved by default in `./work_dirs/`)

```bash
NPROC_PER_NODE=8 xtuner train llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2
```

2. Instruction following fine-tuning (saved by default in `./work_dirs/`)

```bash
NPROC_PER_NODE=8 xtuner train llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
```

## Model Convert (and Merge)

After training, we obtain a set of weights (*i.e.*, `epoch_1.pth`) that are not in the standard HuggingFace format, so we first need to convert them.

```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune ./epoch_1.pth ./epoch_1_hf
```

At this point, we have obtained the relevant model (LLM or the corresponding LoRA).

Afterwards, if you want to merge the LoRA into the LLM or CLIP-ViT, use the following commands:

```bash
(LLM) xtuner convert merge $LLM $LLM_ADAPTER $SAVE_PATH
(CLIP) xtuner convert merge $CLIP $CLIP_ADAPTER $SAVE_PATH --is-clip
```
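
For example, with the converted output from the previous step (the adapter sub-directory names below are illustrative, not fixed outputs; substitute the actual paths produced by `pth_to_hf`):

```bash
# Merge the LLM LoRA into the base LLM (illustrative paths)
xtuner convert merge internlm/internlm-chat-7b ./epoch_1_hf/llm_adapter ./merged_llm
# Merge the visual-encoder LoRA into CLIP-ViT (illustrative paths)
xtuner convert merge openai/clip-vit-large-patch14-336 ./epoch_1_hf/visual_encoder_adapter ./merged_visual_encoder --is-clip
```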

## Chat

You can download the released LLaVA-InternLM-7B model from 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b) and 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b), and achieve image-text question answering with the following command!

```bash
xtuner chat internlm/internlm-chat-7b \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava xtuner/llava-internlm-7b \
--prompt-template internlm_chat \
--image $IMAGE_PATH
```

Here, `--llava` is the converted weight from the step above (in our example, `./epoch_1_hf`).
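
Similarly, to chat with your own fine-tuned model, point `--llava` at the converted directory and keep the base LLM and visual encoder consistent with your training config:

```bash
xtuner chat internlm/internlm-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./epoch_1_hf \
  --prompt-template internlm_chat \
  --image $IMAGE_PATH
```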

## Evaluation

XTuner's LLaVA models can be evaluated using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).

For convenience, XTuner also integrates the [MMBench](https://mmbench.opencompass.org.cn/home) evaluation.

Users can download the MMBench datasets with

```
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```

After that, the evaluations can be run with

```bash
xtuner mmbench internlm/internlm-chat-7b \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava xtuner/llava-internlm-7b \
--prompt-template internlm_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```

Here, `$DATA_PATH` refers to one of the datasets downloaded as mentioned above, such as `MMBench_DEV_EN.tsv`.
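
For instance, a dev-split run could look like this (the work directory name is arbitrary):

```bash
xtuner mmbench internlm/internlm-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm-7b \
  --prompt-template internlm_chat \
  --data-path ./MMBench_DEV_EN.tsv \
  --work-dir ./mmbench_dev_en
```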

After the evaluation completes, results for a development set are printed directly; for a test set, you need to submit `mmbench_result.xlsx` to the official MMBench evaluation service to obtain the final accuracy results!

| Model | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | MMVet | SEEDBench_IMG | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
| :------------------------- | :---------------: | :--------------: | :---------------: | :--------------: | :---------: | :--: | :---: | :-----------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B (XTuner) | 67.7 | 69.2 | 61.0 | 59.7 | 27.6 | 1702 | 66.4 | 32.3 | [Pretrain](./vicuna_7b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./vicuna_7b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-7b-xtuner-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-7b-xtuner-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-7b-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-7b-xtuner) |
| LLaVA-v1.5-13B (XTuner) | 68.9 | 69.5 | 64.7 | 63.1 | 32.2 | 1771 | 68.1 | 35.5 | [Pretrain](./vicuna_13b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./vicuna_13b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-13b-xtuner-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-13b-xtuner-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-13b-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-13b-xtuner) |
| LLaVA-InternLM-7B (XTuner) | 69.0 | 68.5 | 66.7 | 63.8 | 35.8 | 1671 | 65.8 | 33.8 | [Pretrain](./internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./internlm_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b) |