InternLM xcomposer2 support (at least on eye level of GPT4V and CogVLM) - help needed #5232

Closed
wants to merge 3 commits

Conversation

cmp-nct
Contributor

@cmp-nct cmp-nct commented Jan 31, 2024

xcomposer2 is comparable to gpt4v/cogVLM - just around 5 times faster and smaller.
It is like llava-1.5 in terms of simplicity (plus a few sequential tensors to multiply) but GPT4V-level in quality.

Aside from the stunning visual performance, their use of lora looks interesting to me.
I wonder if a similar lora approach could be used on Mixtral - to cut the total Mixtral size down to little more than one expert while still giving full 8-expert inference. Instead of storing 8 precompiled expert weights, one foundation could be used plus 8 sets of conditional lora.
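A toy sketch of that idea, assuming one shared base weight plus a per-expert low-rank delta - purely hypothetical and not how Mixtral currently works:

```python
import torch
import torch.nn as nn

class SharedBaseExpertLinear(nn.Module):
    # toy sketch: one shared foundation weight plus a small low-rank delta per
    # expert, instead of 8 full expert weight matrices
    def __init__(self, dim_in, dim_out, n_experts=8, rank=64):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out, bias=False)
        self.lora_a = nn.Parameter(torch.zeros(n_experts, rank, dim_in))
        self.lora_b = nn.Parameter(torch.zeros(n_experts, dim_out, rank))

    def forward(self, x, expert_id):
        # dense shared path plus the selected expert's low-rank correction
        delta = self.lora_b[expert_id] @ self.lora_a[expert_id]  # (dim_out, dim_in)
        return self.base(x) + x @ delta.t()
```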

Differences of xcomposer2 vs llava:

  1. the mm_proj weights are renamed to vision_proj (still an mlp2x_gelu projection)
  2. the same clip model is used, but at resolution 490x490
  3. the visual embeddings are handled by a dynamic lora (wo, w123, wqkv) which appears to be applied conditionally as a first step (see the sketch below).
    This gives the effect of a full visual expert without having to store one separately - it is still the 7B LLM that is used.
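For reference, a minimal sketch of what that conditional partial-lora does conceptually; the names (lora_a, lora_b, im_mask) and the rank are assumptions loosely following the upstream modeling_internlm2.py, not an exact copy of it:

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Linear):
    # sketch: the dense linear runs on every token, the lora branch only on
    # positions that im_mask flags as image tokens
    def __init__(self, in_features, out_features, lora_r=256, bias=False):
        super().__init__(in_features, out_features, bias=bias)
        self.lora_a = nn.Linear(in_features, lora_r, bias=False)   # rank is an assumption
        self.lora_b = nn.Linear(lora_r, out_features, bias=False)

    def forward(self, x, im_mask=None):
        out = super().forward(x)  # dense path, applied to all tokens
        if im_mask is not None and im_mask.any():
            # add the low-rank correction only where the mask marks visual tokens
            out[im_mask] = out[im_mask] + self.lora_b(self.lora_a(x[im_mask]))
        return out
```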

The performance is quite stunning. It solved the license demo (OCR) with no flaws (superior to GPT4V in that test).
I also asked it about the 3 cats - flawless as well.

I've uploaded the converted xcomposer-7B Q3K LLM and a Q5K mmproj/clip model here: https://huggingface.co/cmp-nct/xcomposer2_gguf_for_llama.cpp_development
These were converted with the changes here, so they contain the conditional tensors as they are created at the moment.
I'm doubtful whether the qkv partial-lora tensor is usable in that form; it likely needs a transform before it can be applied (see the permute sketch below).
My pytorch/python skills are minimal, sadly.
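For context, the transform in question might resemble the wq/wk permute that the existing convert scripts apply so the weights match ggml's rope layout - roughly like this; whether and how it applies to the lora half attached to wqkv is exactly the open question:

```python
import numpy as np

def permute_rope(weights: np.ndarray, n_head: int) -> np.ndarray:
    # roughly the wq/wk permute from llama.cpp's convert script: rearranges the
    # per-head rows so ggml's rope implementation matches the pytorch layout
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))
```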

Status:

  • loading of all lora tensors in the correct shape [done - the plora tensors from pytorch are all taken unchanged and integrated into the model]
  • make clip conversion possible [done - the projection was tiny and is just named differently]
  • make llm inference possible [works for text]

Todo:

  • dynamic shapes for the lora tensors [] (right now the plora tensors are hardcoded in the llama loader)
  • add image token mask [] (see the mask sketch after this list)
  • add conditional "lora" on the tensors during inference (wo, w123 and wqkv)
    One of the devs with more experience in pytorch->ggml conversions could get the plora multiplications into the architecture much faster than I can.
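To illustrate the image-token-mask item: conceptually something like the following is needed, written here in python with hypothetical names rather than actual llama.cpp code:

```python
import torch

def build_im_mask(n_prefix: int, n_image: int, n_suffix: int) -> torch.Tensor:
    # hypothetical helper: flag the positions occupied by the projected visual
    # embeddings so the conditional lora only fires on those tokens
    mask = torch.zeros(n_prefix + n_image + n_suffix, dtype=torch.bool)
    mask[n_prefix:n_prefix + n_image] = True
    return mask
```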

Reference:
https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/build_mlp.py
https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/modeling_internlm2.py

@cmp-nct cmp-nct changed the title xcomposer2 support - help needed xcomposer2 support (at least on eye level of GPT4V and CogVLM) - help needed Jan 31, 2024
@cmp-nct cmp-nct changed the title xcomposer2 support (at least on eye level of GPT4V and CogVLM) - help needed InternLM xcomposer2 support (at least on eye level of GPT4V and CogVLM) - help needed Jan 31, 2024
@cmp-nct
Contributor Author

cmp-nct commented Jan 31, 2024

With the release of llava-1.6 we'll have to check which one is better; llava-1.6 is also pretty good.
In any case, their novel way of using lora-like tensors is something worth looking into.

@chigkim

chigkim commented Feb 1, 2024

Do you also need to modify llama.cpp to work with llava-1.6, or do you just need to quantize the model?

@cmp-nct
Contributor Author

cmp-nct commented Feb 1, 2024

Do you also need to modify llama.cpp to work with llava-1.6, or do you just need to quantize the model?

It has a very similar architecture to llava-1.5, but it uses refined preprocessing and comes with an additional tensor and an integrated vit. I've quantized it but have no time at the moment to get it integrated.

@cmp-nct cmp-nct closed this Feb 2, 2024