InternLM xcomposer2 support (at least on par with GPT4V and CogVLM) - help needed #5232
xcomposer2 is like GPT4V/CogVLM - just about 5 times faster and smaller.
It is like llava-1.5 in terms of simplicity (plus a few sequential tensors to multiply) but GPT4V in quality.
Aside from the stunning visual performance, their use of LoRA looks interesting to me.
I wonder if a similar LoRA approach could be used on Mixtral - to cut the total Mixtral size down to little more than one expert while still giving full 8-expert inference. Instead of storing 8 full sets of expert weights, one foundation set could be used plus 8 sets of conditional LoRA adapters (sketched below).
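To make the idea concrete, here is a purely speculative PyTorch sketch (not code from this PR or from Mixtral): one shared "foundation" weight plus per-expert low-rank adapters. The class name, rank, and initialization are my own illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRAExpertFFN(nn.Module):
    """Hypothetical Mixtral-style layer: one shared foundation weight plus
    n_experts low-rank (A_e, B_e) adapters standing in for n_experts full
    expert weight sets."""

    def __init__(self, d_model=4096, d_ff=14336, n_experts=8, rank=64):
        super().__init__()
        self.base = nn.Linear(d_model, d_ff, bias=False)  # shared foundation
        # standard LoRA init: A small random, B zero, so output starts at the base
        self.lora_a = nn.Parameter(torch.randn(n_experts, rank, d_model) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(n_experts, d_ff, rank))

    def forward(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        # expert-conditional low-rank delta on top of the shared base weight
        delta = (x @ self.lora_a[expert_id].T) @ self.lora_b[expert_id].T
        return self.base(x) + delta
```

A real Mixtral router picks top-k experts per token; the sketch only shows the weight-sharing idea, not the routing.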
Difference of xcomposer2 vs llava: instead of a separate visual expert (as in CogVLM), xcomposer2 adds conditional partial-LoRA (PLoRA) tensors that are applied only to the image tokens.
This results in the features of a full visual expert but without the need to model one; it's still the 7B LLM that is used (see the sketch below).
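As I read modeling_internlm2.py, the mechanism boils down to something like the following minimal PyTorch sketch; the class name, the rank, and the omitted dropout/scaling are my simplifications, not the exact upstream code:

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Linear):
    """Sketch of xcomposer2's partial LoRA: a plain linear layer plus a
    low-rank delta that is only applied where im_mask marks image tokens."""

    def __init__(self, in_features, out_features, rank=256, bias=False):
        super().__init__(in_features, out_features, bias=bias)
        self.plora_a = nn.Linear(in_features, rank, bias=False)   # down-projection
        self.plora_b = nn.Linear(rank, out_features, bias=False)  # up-projection

    def forward(self, x: torch.Tensor, im_mask: torch.Tensor = None) -> torch.Tensor:
        res = super().forward(x)  # base weights run on all tokens
        if im_mask is not None and im_mask.any():
            # low-rank path runs on image tokens only
            res[im_mask] = res[im_mask] + self.plora_b(self.plora_a(x[im_mask]))
        return res
```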
The performance is quite stunning: it solved the license demo (OCR) with no flaws (superior to GPT4V in that test).
I also asked it about the 3 cats - flawless as well.
I've uploaded the converted xcomposer2-7B Q3_K LLM and a Q5_K mmproj/CLIP model here: https://huggingface.co/cmp-nct/xcomposer2_gguf_for_llama.cpp_development
These were converted with the changes in this PR, so they contain the conditional tensors as they are currently created.
I'm doubtful that the qkv partial-lora tensor is usable in that form; it likely needs a transform (a transpose/permute) before use.
Sadly, my PyTorch/Python skills are minimal.
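If it is just an orientation problem, the fix during GGUF conversion might be as small as the following sketch; the name filter and the assumption that only the fused wqkv PLoRA tensors are affected are guesses on my part:

```python
import torch

def export_plora_tensor(name: str, tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical fix-up while converting to GGUF. PyTorch nn.Linear
    stores weights as [out_features, in_features]; if the Plora_A/Plora_B
    tensors of the fused wqkv projection come out in the wrong orientation,
    transposing (and making the data contiguous) before writing may be all
    that is needed."""
    if "Plora" in name and "wqkv" in name:
        tensor = tensor.t().contiguous()  # assumption: orientation mismatch only
    return tensor
```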
Status:
Todo:
One of the devs who is more experienced with PyTorch→GGML conversions could get the PLoRA multiplications into the architecture much faster than I can; one formulation that maps onto plain graph ops is sketched below.
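For the inference side, the boolean indexing in the PyTorch model can be rewritten as a dense mask multiply, which only needs the matmul/mul/add ops that a GGML-style graph offers. This is my suggestion, not code from this PR:

```python
import torch

def plora_dense(x, w_base, w_a, w_b, im_mask):
    """Mask-multiply formulation of the PLoRA forward pass: compute the
    low-rank delta for every token, then zero it for non-image tokens.
    Equivalent to the res[im_mask] += ... version, without boolean indexing.
    Weights follow the nn.Linear [out_features, in_features] layout."""
    res = x @ w_base.T                   # base projection, all tokens
    delta = (x @ w_a.T) @ w_b.T          # low-rank delta, all tokens
    return res + delta * im_mask.unsqueeze(-1).to(x.dtype)
```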
Reference:
https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/build_mlp.py
https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/modeling_internlm2.py