Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models

This is a repository for organizing articles related to Multimodal Large Language Models, Large Language Models, and Diffusion Models. Most papers are linked to my reading notes. Feel free to visit my personal homepage and contact me for collaboration and discussion.

About Me 🔆

I'm a third-year Ph.D. student at the State Key Laboratory of Pattern Recognition, the University of Chinese Academy of Sciences, advised by Prof. Tieniu Tan. I have also spent time at Microsoft, advised by Prof. Jingdong Wang, and at Alibaba DAMO Academy, working with Prof. Rong Jin.

🔥 Updated 2024-12-22

We have presented a comprehensive survey on the evaluation of large multi-modality models, jointly with Opencompass Team and LMMs-Lab 🔥🔥🔥

Table of Contents (ongoing)

Survey and Outlook

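  1. A 10,000-word summary of recent progress in evaluating multimodal large language models
  2. A 10,000-word summary of recent progress in multimodal large language models (Modality Bridging)
  3. A 10,000-word summary of recent progress in multimodal large language models (Video)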
  4. Aligning Large Language Models with Human

Multimodal Large Language Models

  1. (Meta, Stanford) Apollo: An Exploration of Video Understanding in Large Multimodal Models (what are the key factors in MLLM video understanding)
  2. (Shanghai AI Lab) Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (InternVL2.5 technical details: pushing open-source multimodal models a step further)
  3. (NVIDIA) NVLM: Open Frontier-Class Multimodal LLMs (an in-depth exploration of three different feature-fusion frameworks)
  4. (Allen Institute for AI) Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (the improvements focus on the data side, including several data-synthesis methods and the release of higher-quality multimodal data)
  5. (Mistral AI) Pixtral 12B (at 12B, approaches the level of Qwen2-VL 72B and Llama-3.2 90B)
  6. (Rhymes AI) Aria: An Open Multimodal Native Mixture-of-Experts Model (fine-grained Mixture-of-Experts (MoE) architecture)
  7. (Apple) MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (Apple: a practical recipe guide for training multimodal LLMs)
  8. (Hugging Face) Building and better understanding vision-language models: insights and future directions (Hugging Face: exploring the best technical route for multimodal LLMs)
  9. (Alibaba) Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (fine-grained dynamic-resolution strategy plus multimodal rotary position embedding)
  10. LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture (can process nearly a thousand images on a single A100 80GB GPU)
  11. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? (the hardest multimodal benchmark; Qwen2-VL ranks first but still falls short of a passing score)
  12. VITA: Towards Open-Source Interactive Omni Multimodal LLM (VITA: the first open-source omni multimodal LLM that supports natural human-computer interaction)
  13. Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (a multimodal LLM that processes high-resolution images efficiently)
  14. Matryoshka Multimodal Models (how to answer visual questions correctly while using the fewest visual tokens?)
  15. Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta: all modalities go back to token regression for flexible understanding and generation)
  16. Flamingo: a Visual Language Model for Few-Shot Learning (creates additional blocks at every LLM layer to process visual information)
  17. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Q-Former fuses vision-language information)
  18. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (Q-Former + instruction tuning)
  19. Visual Instruction Tuning (a projection layer aligns features; GPT-4 generates the instruction-tuning data)
  20. Improved Baselines with Visual Instruction Tuning (preliminary scaling of the LLaVA dataset and model size)
  21. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (4x resolution, larger dataset)
  22. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models (an end-to-end optimization scheme that connects the image encoder and the LLM through lightweight adapters)
  23. MIMIC-IT: Multi-Modal In-Context Instruction Tuning (MIMIC-IT contains input data with multiple images or videos and supports multimodal in-context information)
  24. LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding (collects results on 422K text-rich images from the LAION dataset using publicly available OCR tools)
  25. SVIT: Scaling up Visual Instruction Tuning (a dataset containing 4.2 million visual instruction-tuning data points)
  26. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (cross-attention aligns features; larger first-stage training data)
  27. NExT-GPT: Any-to-Any Multimodal LLM (an end-to-end, general-purpose any-to-any MM-LLM (Multimodal Large Language Model) system)
  28. InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (compressed sampling of visual information)
  29. CogVLM: Visual Expert for Pretrained Language Models (adds a visual expert to each LLM layer, with its own separate QKV and FFN parameters)
  30. OtterHD: A High-Resolution Multi-modality Model (designed specifically to interpret high-resolution visual inputs with fine-grained precision)
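  31. Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (Monkey proposes an effective way to raise input resolution, up to 896 x 1344 pixels)
  32. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (LLaMA-VID lets existing frameworks support hour-long videos and pushes their upper limit with an extra context token)
  33. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (addresses the performance degradation in multimodal sparse learning)
  34. LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (efficiently handles images of any aspect ratio and high resolution)
  35. Yi-VL (Yi-VL adopts the LLaVA architecture and goes through a comprehensive three-stage training process to align visual information well with the semantic space of the Yi LLM)
  36. Mini-Gemini (dual vision encoders: features from the low-resolution encoder serve as queries, and high-resolution features serve as keys and values for token mining)
  37. Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (uses a set of dynamic visual tokens to represent images and videos uniformly, so the model can use a limited number of visual tokens efficiently while capturing the spatial detail that images need and the comprehensive temporal relations that videos need)
  38. VILA: On Pre-training for Visual Language Models (interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal)
  39. ST-LLM: Large Language Models Are Effective Temporal Learners (ST-LLM proposes a dynamic masking strategy with a tailored training objective; for especially long videos, it adds a global-local input module to balance efficiency and effectiveness)
  40. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (improves video understanding with a video-specific encoder rather than an image encoder)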

Benchmark and Dataset

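  1. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? (the hardest multimodal benchmark; Qwen2-VL ranks first but still falls short of a passing score)
  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark (an advanced version of MMMU that puts more emphasis on how image perception affects the question)
  3. From Pixels to Prose: A Large Dataset of Dense Image Captions (16 million generated image-text pairs, with detailed and accurate descriptions produced by a cutting-edge vision-language model (Gemini 1.0 Pro Vision))
  4. ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (40k captions from GPT-4V; 4,814k generated by the authors' own trained model)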
  5. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents(141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens)
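  6. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning (at the data level, collects human feedback in the form of fine-grained segment-level corrections; at the method level, proposes Dense Direct Preference Optimization (DDPO))
  7. Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (at the data level, synthesizes abstract charts using code as the medium, and benchmarks the shortcomings of current multimodal models in understanding abstract images)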

Unify Multimodal Understanding and Generation

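  1. Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta FAIR: the "early fusion" approach lets the model reason across modalities and generate truly mixed-modal documents)
  2. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (NUS & ByteDance: text is modeled autoregressively as discrete tokens, while continuous image pixels are modeled with denoising diffusion)
  3. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Meta: uses next-token prediction for text and diffusion for images as the training objectives, achieving better modality integration and generation without added compute cost)
  4. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (Tsinghua & MIT: unifies video understanding and generation)
  5. MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (Meta: MoE is the best choice for mixed-modal understanding/generation)
  6. MIO: A Foundation Model on Multimodal Tokens (01.AI: unified understanding/generation across four modalities)
  7. Harmonizing Visual Text Comprehension and Generation (ECNU & ByteDance: combines a vision encoder, an LLM, and an image decoder for multimodal input and output)
  8. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (Tencent AI Lab: uses a pretrained visual tokenizer (such as ViT) to unify image understanding and generation tasks)
  9. NExT-GPT: Any-to-Any Multimodal LLM (NUS: uses pretrained encoders, diffusion decoders, and an LLM, combined with modality-alignment training and LoRA instruction tuning, to handle any-to-any modality tasks)
  10. Any-to-Any Generation via Composable Diffusion (Microsoft: composes diffusion models for various modalities to generate multiple modalities in parallel)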
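  11. X-VILA: Cross-Modality Alignment for Large Language Model (NVIDIA & HKUST: aligns single-modality encoders with the LLM's input and single-modality diffusion decoders with the LLM's output, enabling cross-modal understanding, reasoning, and generation)
  12. DreamLLM: Synergistic Multimodal Comprehension and Creation (XJU&IIISCT: addresses the synergy between multimodal comprehension and creation in MLLMs by sampling directly in the raw multimodal space to generate language and image posteriors)
  13. Jointly Training Large Autoregressive Multimodal Models (Meta AI: merges existing text and image generation models and introduces a specialized, data-efficient instruction-tuning strategy)
  14. VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (XJU & Tencent AI Lab: uses a new image tokenizer-detokenizer framework to convert raw images into sequences of continuous visual embeddings, and unifies image-text pre-training with the next-token-prediction (NTP) objective)
  15. Emu: Generative Pretraining in Multimodality (BAAI & THU: a Transformer-based multimodal foundation model with a unified autoregressive objective, trained by predicting the next element in a multimodal sequence, whether a text token or a visual embedding)
  16. Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (PKU & Kuaishou: decomposes video into keyframes and motion vectors; video, image, and text data are unified as 1D discrete tokens)
  17. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (CUHK: uses dual vision encoders to process high-resolution images; text is generated autoregressively and images are generated with a diffusion model)
  18. World Model on Million-Length Video And Language With Blockwise RingAttention (UC Berkeley: discretizes images/videos with VQGAN, unifies understanding and generation as an NTP task, and extends the context window to 1M tokens with RingAttention, progressive training, and related techniques)
  19. Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (AI2 & UIUC: tokenizes inputs and outputs of different modalities (images, text, audio, actions, etc.) into a shared semantic space, then processes them with a single encoder-decoder Transformer)
  20. AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (Fudan: uses discrete tokens to represent different modalities (such as images, music, speech, and text))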
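  21. Write and Paint: Generative Vision-Language Models are Unified Modal Learners (HKUST & ByteDance: the DaVinci model, which combines prefix language modeling and prefix image modeling)
  22. Gemini: A family of highly capable multimodal models (Google Gemini Team: tackles advanced reasoning and language understanding in tasks spanning image, audio, video, and text understanding)
  23. MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens (UCSC: introduces generative visual tokens (Generative Vokens))
  24. MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (Shanghai AI Lab: integrates an image encoder, a large language model (LLM), and an image decoder)
  25. OMCAT: Omni Context Aware Transformer (NVIDIA: cross-modal temporal understanding, using RoTE (Rotary Time Embeddings) to embed absolute and relative time information into audio and visual features)
  26. Baichuan-Omni Technical Report (Baichuan & Westlake University & Zhejiang University: an omni-modal model)
  27. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (DeepSeek-AI & HKU: decouples visual encoding for multimodal understanding and multimodal generation)
  28. Emu3: Next-Token Prediction is All You Need (BAAI: discretizes visual tokens and uses DPO for alignment)
  29. VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing (NUS & NTU: a hybrid instruction-passing method with discrete text and continuous signals; pixel-level spatiotemporal vision-language contrastive learning) (NeurIPS 2024)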

Alignment With Human Preference (MLLM)

  1. (Apple) Understanding Alignment in Multimodal LLMs: A Comprehensive Study (explores how different alignment methods affect MLLM performance by analyzing each factor independently)
  2. Aligning Large Multimodal Models with Factually Augmented RLHF
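  3. CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs (uses a pretrained CLIP model to rank the LVLM's self-generated captions and build the positive/negative pairs for DPO)
  4. ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models (adopts a dynamic generation method to create an open-set benchmark and introduces the Open-Set Dynamic Evaluation protocol (ODE), designed to evaluate object-existence hallucinations in MLLMs)
  5. Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization (treats hallucination elimination as a model preference, biasing the model toward hallucination-free outputs, and proposes a hallucination-aware multimodal DPO strategy, HA-DPO. It also introduces the Sentence-level Hallucination Ratio (SHR), which is not restricted to fixed categories or scopes and provides a broad, fine-grained, quantitative measure of multimodal hallucination)
  6. Detecting and Preventing Hallucinations in Large Vision Language Models (to enable automatic hallucination detection, the authors first build M-HalDetect, a diverse human-labeled dataset based on InstructBLIP's VQA responses, with fine-grained annotations at the sub-sentence level of detailed image descriptions. Multiple reward models at different granularities (sentence-level and sub-sentence-level) are trained on this dataset for hallucination detection, and InstructBLIP is also optimized directly with Fine-grained Direct Preference Optimization (FDPO))
  7. RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness (the same model generates multiple responses; each response is split into sentences, which are converted into questions so that an open-source model can rate their accuracy; the accuracy scores are summed to obtain preference data for iterative DPO)
  8. Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement (proposes Self-Improvement Modality Alignment (SIMA), which further improves the alignment between the visual and language modalities inside LVLMs through a self-improvement mechanism)
  9. MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models (stitches unrelated single-image data into sequence, grid, and picture-in-picture data; selects preference data by how much attention falls on the correct target; the filtered data is then used for DPO)
  10. CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs (to align visual information, introduces a hierarchical textual preference optimization module with response-level, segment-level, and token-level preference optimization, plus visual preference optimization)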
  11. 3D-CT-GPT++: Enhancing 3D Radiology Report Generation with Direct Preference Optimization and Large Vision-Language Models
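  12. MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine (first fine-tunes a math-specific vision encoder with contrastive learning, then aligns that encoder with the LLM, then performs instruction tuning with MAVIS-Instruct, and finally applies DPO using the annotated CoT rationales in MAVIS-Instruct)
  13. HomieBot: an Adaptive System for Embodied Mobile Manipulation in Open Environments (consists of 100 complex everyday tasks; 100 distinct episodes are drawn from the Replica Challenge to build scenes and design tasks, using only the Replica Challenge configuration files to construct the scenes. A robot is manually controlled to complete all tasks, and the execution is decomposed into sub-tasks, yielding 966 sub-tasks. GPT-4 regenerates the text description of the final task and the analysis of each sub-task three times, rewriting them with the same meaning but different wording, which gives 3,720 SFT samples. Replacing parts of the content yields 10,104 DPO samples)
  14. InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making (first runs SFT on llava-v1.6-mistral-7b with the open-source dataset LEVI-Project/sft-data, then lets the model interact with the environment, optimizing its CoT ability during those interactions while monitoring performance in real time during training)
  15. vVLM: Exploring Visual Reasoning in VLMs against Language Priors (builds chosen and rejected preference pairs by corrupting the image with perturbations while keeping the text (question and answer) unchanged. The image perturbations include semantic editing, Gaussian blur, and pixelation)
  16. AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization (obtains adversarial images through iterative optimization such as PGD (adversarial images are generated by adding small, nearly imperceptible perturbations to the original image), uses the descriptions generated from the original and adversarial images as preference data for DPO, and also introduces adversarial image optimization)
  17. Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization (first trains an audio aligner on large audio datasets to align the audio modality, then performs audio-visual SFT, then applies mrDPO-based RL, and finally rebirth fine-tuning)
  18. Aligning Visual Contrastive learning models via Preference Optimization (Step 1: Response generation. Step 2: Scoring. Step 3: Reward Preference. Iterative Improvement.)
  19. SQuBa: Speech Mamba Language Model with Querying-Attention for Efficient Summarization (a two-stage training process. In the alignment stage, only the projector is trained, on an ASR task. In the fine-tuning stage, both the LLM backbone and the projector are trained on the summarization task. After fine-tuning, offline self-generated DPO is applied.)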
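
Several entries above (for example CLIP-DPO and RLAIF-V) build DPO training pairs by scoring a model's own candidate responses and keeping the best and worst. The following is a minimal sketch of that general idea, not the procedure of any specific paper: `build_preference_pair`, `score_fn`, and `min_gap` are hypothetical names, and the toy scores stand in for a real scorer such as a CLIP similarity or an accuracy vote.

```python
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str], float],  # hypothetical scorer, e.g. image-text similarity
    min_gap: float = 0.0,              # hypothetical filter: drop pairs that are too close
) -> Optional[Dict[str, str]]:
    """Rank candidates by score and return a (chosen, rejected) pair for DPO."""
    if len(candidates) < 2:
        return None
    # Score every candidate once, then sort from best to worst.
    scored = sorted(((score_fn(c), c) for c in candidates), reverse=True)
    (best_score, chosen), (worst_score, rejected) = scored[0], scored[-1]
    # Skip prompts where the scorer cannot separate the candidates clearly.
    if best_score - worst_score <= min_gap:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy scores standing in for a real scorer (assumed values, for illustration only).
toy_scores = {"a cat on a sofa": 0.31, "a dog on a sofa": 0.12, "a cat": 0.25}
pair = build_preference_pair(
    "Describe the image.", list(toy_scores), toy_scores.__getitem__
)
```

The `min_gap` filter reflects the filtering step that several of these pipelines mention; how strict it should be depends on how noisy the scorer is.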

Alignment With Human Preference (LLM)

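  1. ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline (ChatGLM-Math: iterative Self-Critique alignment significantly improves math ability)
  2. Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization (multi-objective alignment of large language models)
  3. Direct Preference Optimization: Your Language Model is Secretly a Reward Model (direct preference optimization overcomes the instability of RLHF)
  4. KTO: Model Alignment as Prospect Theoretic Optimization (preference optimization that does not require paired data)
  5. Direct Preference Optimization with an Offset (DPO with an offset: requires the likelihood difference between the preferred and dispreferred responses to exceed an offset value)
  6. Contrastive preference learning: Learning from human feedback without reinforcement learning (the Contrastive Preference Learning (CPL) algorithm learns an optimal policy from preferences without learning a reward function, avoiding the need for RL)
  7. Statistical Rejection Sampling Improves Preference Optimization (uses rejection sampling to obtain preference data from the target optimal policy, giving a more accurate estimate of the optimal policy)
  8. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (PPO consistently outperforms DPO in all experiments; in particular, PPO achieves state-of-the-art results on the most challenging code competition tasks)
  9. Fine-tuning Aligned Language Models Compromises Safety (fine-tuning aligned language models compromises safety)
  10. ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline (reward model, Rejective Fine-tuning, then DPO to iteratively improve the model's math performance)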
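  11. SimPO: Simple Preference Optimization with a Reference-Free Reward (length regularization + removal of the reference model)
  12. Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective (why DPO's actual optimization process is sensitive to the initial conditions of the alignment capability of LLMs after SFT)
  13. Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level (shows that iterative DPO (iDPO), with careful design, can raise a 7B model's LC win rate to GPT-4 level)
  14. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (proposes an effective and economical pipeline for collecting pairwise preference data on math problems; introduces Step-DPO, which maximizes the probability that the next reasoning step is correct and minimizes the probability that it is wrong)
  15. A Novel Soft Alignment Approach for Language Models with Explicit Listwise Rewards (turns the generative modeling problem into a classification task by contrasting multiple data points under the guidance of an existing strong LLM. The SPO loss can be viewed as a k-class cross-entropy loss with soft labels provided by a stronger teacher LLM)
  16. Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning (the teacher model generates a dataset with Self-Instruct, then the local data influence of these data points on the student model is collected; the collected influence forms a preference dataset, which is used to update the teacher with DPO. The process can be iterated for multiple rounds to keep improving the teacher according to the student's updated preferences)
  17. Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts (the authors argue that answers generated for similar questions can also be used for preference learning, study this with a contrast matrix, and propose three applicable algorithms)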
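
Many of the notes above revolve around the DPO objective and two of its variants: DPO with an offset (the preferred response must beat the dispreferred one by a margin) and SimPO (length-normalized log-probabilities, no reference model). The following is a minimal per-example sketch of these three losses using only the standard library. The inputs are assumed to be summed sequence log-probabilities, and the default values of `beta`, `offset`, and `gamma` are illustrative assumptions, not the papers' recommended settings.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1, offset: float = 0.0) -> float:
    """DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. With offset > 0 this becomes the offset variant: the
    implicit reward margin has to exceed the offset before the loss gets small.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -log_sigmoid(margin - offset)


def simpo_loss(policy_chosen: float, policy_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO-style loss: length-normalized log-probabilities, no reference model."""
    margin = beta * (policy_chosen / len_chosen - policy_rejected / len_rejected)
    return -log_sigmoid(margin - gamma)


# A policy identical to the reference has zero margin, so the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# A policy that raised the chosen response's likelihood gets a lower loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

In practice these losses are averaged over a batch and computed on tensors so that gradients flow through the policy log-probabilities; the scalar version here only shows the shape of each objective.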


