- Resolving "Answer Machine Phenomenon" with System Prompts
+ Alleviating the "Answer Machine Phenomenon" via System Prompts
- Here, we explore and analyze a phenomenon we term the "answer machine phenomenon."
- We observe that a well-trained MLLM excels in visual question answering
- but lacks basic conversational abilities (see examples in Figure 5).
-
+ Here, we investigate a phenomenon we term the "answer machine phenomenon."
+ We observe that a well-trained MLLM may excel at VQA benchmarks, but lack basic conversational abilities and default to outputting short, curt responses (see examples in Figure 5).
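To make the remedy concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation or template tokens) of prepending a conversational system prompt when serializing VQA-style training samples into a chat format, so the model learns to respond conversationally instead of emitting bare answers:

```python
# Hypothetical sketch: prepend a conversational system prompt when
# formatting instruction-tuning samples. The template markers and the
# prompt wording below are illustrative assumptions, not Cambrian's.

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the question using the image, "
    "and respond conversationally rather than with a single word."
)

def format_sample(question: str, answer: str, use_system_prompt: bool = True) -> str:
    """Render one training sample in a simple chat template."""
    parts = []
    if use_system_prompt:
        parts.append(f"<|system|>\n{SYSTEM_PROMPT}")
    parts.append(f"<|user|>\n<image>\n{question}")
    parts.append(f"<|assistant|>\n{answer}")
    return "\n".join(parts)

sample = format_sample("What color is the bus?", "The bus in the image is red.")
print(sample)
```

Dropping `use_system_prompt` recovers the plain VQA format, which is what leaves the model behaving like an answer machine.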
@@ -767,17 +751,12 @@
State of the Art MLLM Performance
Finally, we leverage the insights from all of our previous studies to train a high-performance Cambrian model.
We train with three different sizes of LLM backbones: LLaMA-3-Instruct-8B, Vicuna-1.5-13B, and Hermes-2-Yi-34B.
- We have a vision combination of four models—SigLIP, CLIP, DINOv2, and OpenCLIP ConvNeXt
+ We use a combination of four vision models—SigLIP, CLIP, DINOv2, and OpenCLIP ConvNeXt
(see Combining Multiple Vision Encoders) with the Spatial Vision Aggregator.
We use 2.5M adapter data and Cambrian-7M instruction tuning data (see Data Curation).
- We evaluate our models on the categorized benchmarks.
- We show the results in Table 5. Cambrian-1 exceeds other open-source models such as LLaVA-NeXT and Mini-Gemini.
- Cambrian-1 also achieves comparable performance on a number of benchmarks with the best proprietary models such as GPT-4V, Gemini-Pro, and MM-1.
+ We evaluate our models on the categorized benchmarks, and tabulate the results in Table 5. Cambrian-1 exceeds other open-source models such as LLaVA-NeXT and Mini-Gemini, and achieves comparable performance on a number of benchmarks with the best proprietary models such as GPT-4V, Gemini-Pro, and MM-1.
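As a rough intuition for how multiple vision encoders can be fused into a fixed set of visual tokens, the following toy sketch cross-attends a learnable query grid over the concatenated encoder tokens. It is a simplification under stated assumptions: shapes, dimensions, and encoder names are illustrative, and it ignores the spatial locality that the actual Spatial Vision Aggregator enforces.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy feature maps from several vision encoders, each already projected
# to a shared dimension d (token counts here are made up).
d = 32
encoder_feats = [
    rng.standard_normal((576, d)),  # e.g. a CLIP-style encoder
    rng.standard_normal((729, d)),  # e.g. a SigLIP-style encoder
    rng.standard_normal((256, d)),  # e.g. a DINOv2-style encoder
]

# A learnable query grid (random here) cross-attends over the
# concatenated tokens of all encoders, yielding a fixed token budget.
num_queries = 144
queries = rng.standard_normal((num_queries, d))
keys = values = np.concatenate(encoder_feats, axis=0)

attn = softmax(queries @ keys.T / np.sqrt(d))   # (144, total_tokens)
aggregated = attn @ values                       # (144, d)
print(aggregated.shape)  # (144, 32)
```

The fixed query count decouples the LLM's visual token budget from the (differing) resolutions of the individual encoders.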