InternLM xcomposer2 support (at least on par with GPT4V and CogVLM) - help needed #5232
xcomposer2 is like GPT4V/CogVLM - just about 5 times faster and smaller.
It is like llava-1.5 in terms of simplicity (plus a few sequential tensors to multiply) but GPT4V in quality.
Aside from the stunning visual performance, their use of LoRA looks interesting to me.
I wonder if a similar LoRA approach could be used on Mixtral - to cut the total Mixtral size down to little more than one expert while still giving full 8-expert inference. Instead of storing 8 full sets of expert weights, one foundation set could be used plus 8 sets of conditional LoRA adapters (sketched below).
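To make the idea concrete, here is a purely speculative PyTorch sketch (not code from this PR or from Mixtral): one shared "foundation" weight plus per-expert low-rank adapters. The class name, rank, and initialization are my own illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRAExpertFFN(nn.Module):
    """Hypothetical Mixtral-style layer: one shared foundation weight plus
    n_experts low-rank (A_e, B_e) adapters standing in for n_experts full
    expert weight sets."""

    def __init__(self, d_model=4096, d_ff=14336, n_experts=8, rank=64):
        super().__init__()
        self.base = nn.Linear(d_model, d_ff, bias=False)  # shared foundation
        # standard LoRA init: A small random, B zero, so output starts at the base
        self.lora_a = nn.Parameter(torch.randn(n_experts, rank, d_model) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(n_experts, d_ff, rank))

    def forward(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        # expert-conditional low-rank delta on top of the shared base weight
        delta = (x @ self.lora_a[expert_id].T) @ self.lora_b[expert_id].T
        return self.base(x) + delta
```

A real Mixtral router picks top-k experts per token; the sketch only shows the weight-sharing idea, not the routing.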
Difference of xcomposer2 vs llava: instead of a separate visual expert (as in CogVLM), xcomposer2 adds conditional partial-LoRA (PLoRA) tensors that are applied only to the image tokens.
This results in the features of a full visual expert but without the need to model one; it's still the 7B LLM that is used (see the sketch below).
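As I read modeling_internlm2.py, the mechanism boils down to something like the following minimal PyTorch sketch; the class name, the rank, and the omitted dropout/scaling are my simplifications, not the exact upstream code:

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Linear):
    """Sketch of xcomposer2's partial LoRA: a plain linear layer plus a
    low-rank delta that is only applied where im_mask marks image tokens."""

    def __init__(self, in_features, out_features, rank=256, bias=False):
        super().__init__(in_features, out_features, bias=bias)
        self.plora_a = nn.Linear(in_features, rank, bias=False)   # down-projection
        self.plora_b = nn.Linear(rank, out_features, bias=False)  # up-projection

    def forward(self, x: torch.Tensor, im_mask: torch.Tensor = None) -> torch.Tensor:
        res = super().forward(x)  # base weights run on all tokens
        if im_mask is not None and im_mask.any():
            # low-rank path runs on image tokens only
            res[im_mask] = res[im_mask] + self.plora_b(self.plora_a(x[im_mask]))
        return res
```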
The performance is quite stunning: it solved the license demo (OCR) with no flaws (superior to GPT4V in that test).
I also asked it about the 3 cats - flawless as well.
I've uploaded the converted xcomposer2-7B Q3_K LLM and a Q5_K mmproj/CLIP model here: https://huggingface.co/cmp-nct/xcomposer2_gguf_for_llama.cpp_development
These were converted with the changes in this PR, so they contain the conditional tensors as they are currently created.
I'm doubtful that the qkv partial-lora tensor is usable in that form; it likely needs a transform (a transpose/permute) before use.
Sadly, my PyTorch/Python skills are minimal.
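If it is just an orientation problem, the fix during GGUF conversion might be as small as the following sketch; the name filter and the assumption that only the fused wqkv PLoRA tensors are affected are guesses on my part:

```python
import torch

def export_plora_tensor(name: str, tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical fix-up while converting to GGUF. PyTorch nn.Linear
    stores weights as [out_features, in_features]; if the Plora_A/Plora_B
    tensors of the fused wqkv projection come out in the wrong orientation,
    transposing (and making the data contiguous) before writing may be all
    that is needed."""
    if "Plora" in name and "wqkv" in name:
        tensor = tensor.t().contiguous()  # assumption: orientation mismatch only
    return tensor
```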
Status:
Todo:
One of the devs who is more experienced with PyTorch→GGML conversions could get the PLoRA multiplications into the architecture much faster than I can; one formulation that maps onto plain graph ops is sketched below.
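For the inference side, the boolean indexing in the PyTorch model can be rewritten as a dense mask multiply, which only needs the matmul/mul/add ops that a GGML-style graph offers. This is my suggestion, not code from this PR:

```python
import torch

def plora_dense(x, w_base, w_a, w_b, im_mask):
    """Mask-multiply formulation of the PLoRA forward pass: compute the
    low-rank delta for every token, then zero it for non-image tokens.
    Equivalent to the res[im_mask] += ... version, without boolean indexing.
    Weights follow the nn.Linear [out_features, in_features] layout."""
    res = x @ w_base.T                   # base projection, all tokens
    delta = (x @ w_a.T) @ w_b.T          # low-rank delta, all tokens
    return res + delta * im_mask.unsqueeze(-1).to(x.dtype)
```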
Reference:
https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/build_mlp.py
https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/modeling_internlm2.py