From 249e280074a632fccca3b6cb39137efb3fa7b5c9 Mon Sep 17 00:00:00 2001 From: Ellis Brown Date: Mon, 24 Jun 2024 23:44:45 -0700 Subject: [PATCH] more cleanup --- index.html | 41 ++++++++++------------------------------- 1 file changed, 10 insertions(+), 31 deletions(-) diff --git a/index.html b/index.html index b6faa34..5f281ba 100644 --- a/index.html +++ b/index.html @@ -563,7 +563,7 @@

Data Collection

Cambrian-10M To this end, we create a large pool of instruction tuning data, which we refer to as Cambrian-10M. This pool contains approximately 9784k data points, offering a diverse range of data for our work and future research. - We also visualize its composition in Figure 7. + We visualize its composition in Figure 7.

@@ -579,12 +579,8 @@

Data Curation

Data Balancing We follow previous work to set thresholds t for the number of data points from a single data source. - We choose t = 150k, 250k, 350k, and 450k in this section and observe an - elbow effect in Table 3. - We find that a threshold between 250k and 350k works the best for Cambrian-10M. + We choose t = 150k, 250k, 350k, and 450k and observe an elbow effect in Table 3, finding that a threshold between 250k and 350k works best for Cambrian-10M.
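The data-balancing step above can be sketched as a per-source cap. This is an illustrative sketch, not the paper's actual pipeline: the `pool` structure, source names, and the default threshold are assumptions; only the idea of capping each source at t points comes from the text.

```python
import random

def cap_per_source(pool, t=250_000, seed=0):
    """Cap the number of data points drawn from any single source at t.

    `pool` maps source name -> list of data points (hypothetical layout).
    Sources above the threshold are randomly subsampled; t would be swept
    over values such as 150k/250k/350k/450k as described in the text.
    """
    rng = random.Random(seed)  # fixed seed for reproducible subsampling
    balanced = {}
    for source, items in pool.items():
        if len(items) > t:
            balanced[source] = rng.sample(items, t)
        else:
            balanced[source] = list(items)
    return balanced
```

Sweeping t and plotting downstream accuracy against it is what surfaces the elbow between 250k and 350k.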

@@ -596,9 +592,6 @@

Data Curation


-
@@ -655,19 +648,13 @@

Data Curation

Data Ratio - Given the various capabilities of different types of visual instruction tuning data, it is essential to balance the ratio of these data types. We conduct pilot experiments with a fixed dataset size of 1350k, examining the impact of different data ratios on downstream performance. We visualize the results in Figure 10 and summarize our findings as follows: (i) Balancing General, OCR, and Language data is crucial. - (ii) Performance on knowledge-intensive tasks is influenced by multiple factors, often requiring a mix of OCR, chart, reasoning, and general perception. -
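The pilot experiments above fix the total dataset size and vary the per-category mix. A minimal sketch of such ratio-based sampling, assuming a `pool` keyed by category (e.g. "general", "ocr", "language"); the category names and example ratios are illustrative, not the ratios the paper settles on:

```python
def sample_by_ratio(pool, ratios, total=1_350_000):
    """Draw a fixed-size training mix according to per-category ratios.

    `pool` maps category -> list of examples; `ratios` maps category ->
    fraction, summing to 1. The fixed total of 1350k follows the pilot
    setup described in the text.
    """
    assert abs(sum(ratios.values()) - 1.0) < 1e-6, "ratios must sum to 1"
    mix = []
    for cat, frac in ratios.items():
        # take at most the category's budget, capped by availability
        n = min(int(total * frac), len(pool[cat]))
        mix.extend(pool[cat][:n])
    return mix
```

Re-running training over several `ratios` settings at the same `total` isolates the effect of the mix from the effect of scale.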

@@ -681,9 +668,8 @@

Data Curation

Cambrian-7M - We follow the identified data ratio and apply the data filtering technique while curating to that ratio. - In the end, we obtain Cambrian-7M. In Table 4, - we observe improvements by scaling up and curating better data, even with less quantity. + By applying data filtering to Cambrian-10M with our identified data ratio, we create a smaller but higher-quality dataset called Cambrian-7M. + Table 4 showcases the benefits of a well-balanced and carefully curated dataset. Despite having fewer samples, Cambrian-7M demonstrates improved performance.

@@ -735,12 +721,10 @@

Data Curation

-

Resolving "Answer Machine Phenomenon" with System Prompts

+

Alleviating the "Answer Machine Phenomenon" via System Prompts

- Here, we explore and analyze a phenomenon we term the "answer machine phenomenon." - We observe that a well-trained MLLM excels in visual question answering - but lacks basic conversational abilities (see examples in Figure 5). - + Here, we investigate a phenomenon we term the "answer machine phenomenon." + We observe that a well-trained MLLM may excel at VQA benchmarks but lack basic conversational abilities, defaulting to short, curt responses (see examples in Figure 5).

@@ -767,17 +751,12 @@

State of the Art MLLM Performance

Finally, we leverage the insights from all of our previous studies to train a high-performance Cambrian model. We train with three different sizes of LLM backbones: LLaMA-3-Instruct-8B, Vicuna-1.5-13B, and Hermes-2-Yi-34B. - We have a vision combination of four models—SigLIP, CLIP, DINOv2, and OpenCLIP ConvNeXt + We combine four vision models—SigLIP, CLIP, DINOv2, and OpenCLIP ConvNeXt (see Combining Multiple Vision Encoders)—via the Spatial Vision Aggregator. We use 2.5M adapter data and Cambrian-7M instruction tuning data (see Data Curation). - We evaluate our models on the categorized benchmarks. - We show the results in Table 5. Cambrian-1 exceeds other open-source models such as LLaVA-NeXT and Mini-Gemini. - Cambrian-1 also achieves comparable performance on a number of benchmarks with the best proprietary models such as GPT-4V, Gemini-Pro, and MM-1. + We evaluate our models on the categorized benchmarks and tabulate the results in Table 5. Cambrian-1 outperforms other open-source models such as LLaVA-NeXT and Mini-Gemini, and on a number of benchmarks achieves performance comparable to the best proprietary models such as GPT-4V, Gemini-Pro, and MM-1.

-
@@ -1205,7 +1184,7 @@

State of the Art MLLM Performance

Conclusion

- To conclude, Cambrian-1 introduces a family of state-of-the-art MLLM models that achieve top performance across diverse benchmarks + To conclude, Cambrian-1 is a family of state-of-the-art MLLMs that achieve top performance across diverse benchmarks and excel in visual-centric tasks. We provide model weights, open-source code, datasets, and detailed recipes for model training and evaluation. We hope our work will strengthen the open research community and accelerate research in both visual representation learning and multimodal systems.