llama : support InternLM2 #5184

Merged
merged 1 commit into ggerganov:master on Feb 1, 2024

Conversation

SolenoidWGT (Contributor)

Dear llama.cpp developer @ggerganov:

Hello, this PR adds support for InternLM2. InternLM2 is a GQA-based transformer architecture that offers effective support for long contexts of up to 200,000 characters. More and more users want to explore the capabilities of InternLM2 through the familiar llama.cpp API. We sincerely hope that our PR can be accepted. Thank you.

For more information about InternLM2, please visit https://github.com/InternLM/InternLM.

Related issues: #3133, #3551, #4360, #5031

cc: @yhcc, @vansinhu, @sunpengsdu

llama.cpp Outdated
@@ -11220,7 +11409,8 @@ int32_t llama_tokenize(
int32_t n_max_tokens,
bool add_bos,
bool special) {
auto res = llama_tokenize_internal(model->vocab, std::string(text, text_len), add_bos, special);
bool add_space_prefix = false ? model->arch == LLM_ARCH_INTERNLM2 : true;
Owner

The condition is wrong:

Suggested change
bool add_space_prefix = false ? model->arch == LLM_ARCH_INTERNLM2 : true;
bool add_space_prefix = model->arch == LLM_ARCH_INTERNLM2 ? true : false;

Is this change necessary? Seems strange to add prefix just for a specific model architecture. Can this be fixed outside of llama.cpp?

Collaborator

Or better,

bool add_space_prefix = model->arch == LLM_ARCH_INTERNLM2;

SolenoidWGT (Contributor, Author) commented Jan 30, 2024

Thanks for your suggestion! To decide the value of add_dummy_prefix, I went through the SentencePiece code and found that the tokenizer's add_dummy_prefix setting is stored in tokenizer.model itself, and we can read it through the Python interface:

from sentencepiece import sentencepiece_model_pb2 as model

# TOKENIZER_PATH is the path to the model's tokenizer.model file
m = model.ModelProto()
m.ParseFromString(open(TOKENIZER_PATH, "rb").read())
print(m.normalizer_spec.add_dummy_prefix)

So, when we do the checkpoint conversion in convert-hf-to-gguf.py, we could add a new key-value field via gguf_writer, for example something like

gguf_writer.add_add_dummy_prefix(add_dummy_prefix: bool)

If you're good with that, I'll make the change so that the decision on add_dummy_prefix is not tied to any specific model architecture. @ggerganov @cebtenzzre
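
For illustration, here is a minimal, hypothetical sketch of what the conversion-side change could look like, assuming the generic GGUFWriter.add_bool helper from gguf-py; the exact key name and any dedicated helper method in the final patch may differ:

# Hypothetical sketch: read add_dummy_prefix from tokenizer.model during
# conversion and store it as a GGUF key-value pair, instead of hard-coding
# the behaviour per model architecture in llama.cpp.
from sentencepiece import sentencepiece_model_pb2 as spm

def write_add_space_prefix(gguf_writer, tokenizer_path: str) -> None:
    proto = spm.ModelProto()
    with open(tokenizer_path, "rb") as f:
        proto.ParseFromString(f.read())
    # "tokenizer.ggml.add_space_prefix" is an assumed key name here
    gguf_writer.add_bool("tokenizer.ggml.add_space_prefix",
                         proto.normalizer_spec.add_dummy_prefix)

At tokenization time, llama.cpp could then read this KV pair and decide whether to add the dummy space prefix, independent of the model architecture.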

Owner

@SolenoidWGT Yes, I think this is a good solution. Please make the change

SolenoidWGT (Contributor, Author)

Done.

cmp-nct (Contributor) commented Jan 30, 2024

Do you intend to support the Plora_A and Plora_B tensors?

I've experimented with them a bit, but my PyTorch experience is limited, so porting them into ggml is a pain.
The qkv LoRA is taken without the reshape; I wasn't sure how to do that.

constants.py

class MODEL_TENSOR(IntEnum):
    ATTN_QKV_LORA_A = auto()
    ATTN_QKV_LORA_B = auto()
    ATTN_OUT_LORA_A = auto()
    ATTN_OUT_LORA_B = auto()
    FFN_UP_LORA_A   = auto()
    FFN_UP_LORA_B   = auto()
    FFN_GATE_LORA_A = auto()
    FFN_GATE_LORA_B = auto()
    FFN_DOWN_LORA_A = auto()
    FFN_DOWN_LORA_B = auto()

constants.py

TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
..
    MODEL_TENSOR.ATTN_QKV_LORA_A : "blk.{bid}.attn_qkv_lora_a",
    MODEL_TENSOR.ATTN_QKV_LORA_B : "blk.{bid}.attn_qkv_lora_b",
    MODEL_TENSOR.ATTN_OUT_LORA_A : "blk.{bid}.attn_out_lora_a",
    MODEL_TENSOR.ATTN_OUT_LORA_B : "blk.{bid}.attn_out_lora_b",
    MODEL_TENSOR.FFN_UP_LORA_A   : "blk.{bid}.ffn_up_lora_a",
    MODEL_TENSOR.FFN_UP_LORA_B   : "blk.{bid}.ffn_up_lora_b",
    MODEL_TENSOR.FFN_GATE_LORA_A : "blk.{bid}.ffn_gate_lora_a",
    MODEL_TENSOR.FFN_GATE_LORA_B : "blk.{bid}.ffn_gate_lora_b",
    MODEL_TENSOR.FFN_DOWN_LORA_A : "blk.{bid}.ffn_down_lora_a",
    MODEL_TENSOR.FFN_DOWN_LORA_B : "blk.{bid}.ffn_down_lora_b",

constants.py

MODEL_ARCH.INTERNLM2: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.OUTPUT,
        MODEL_TENSOR.ATTN_NORM,
        MODEL_TENSOR.ATTN_Q,
        MODEL_TENSOR.ATTN_K,
        MODEL_TENSOR.ATTN_V,
        MODEL_TENSOR.ATTN_OUT,
        MODEL_TENSOR.ATTN_ROT_EMBD,
        MODEL_TENSOR.FFN_NORM,
        MODEL_TENSOR.FFN_GATE,
        MODEL_TENSOR.FFN_DOWN,
        MODEL_TENSOR.FFN_UP,

        MODEL_TENSOR.ATTN_QKV_LORA_A,
        MODEL_TENSOR.ATTN_QKV_LORA_B,
        MODEL_TENSOR.ATTN_OUT_LORA_A,
        MODEL_TENSOR.ATTN_OUT_LORA_B,
        MODEL_TENSOR.FFN_UP_LORA_A,
        MODEL_TENSOR.FFN_UP_LORA_B,
        MODEL_TENSOR.FFN_GATE_LORA_A,
        MODEL_TENSOR.FFN_GATE_LORA_B,
        MODEL_TENSOR.FFN_DOWN_LORA_A,
        MODEL_TENSOR.FFN_DOWN_LORA_B,
    ],

tensor_mapping.py:

        MODEL_TENSOR.ATTN_QKV_LORA_A: (
            "model.layers.{bid}.attention.wqkv.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.ATTN_QKV_LORA_B: (
            "model.layers.{bid}.attention.wqkv.Plora_B",                 # internlm2
        ),
        MODEL_TENSOR.ATTN_OUT_LORA_A: (
            "model.layers.{bid}.attention.wo.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.ATTN_OUT_LORA_B: (
            "model.layers.{bid}.attention.wo.Plora_B",                 # internlm2
        ),
        MODEL_TENSOR.FFN_UP_LORA_A: (
            "model.layers.{bid}.feed_forward.w3.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.FFN_UP_LORA_B: (
            "model.layers.{bid}.feed_forward.w3.Plora_B",                 # internlm2
        ),
        MODEL_TENSOR.FFN_GATE_LORA_A: (
            "model.layers.{bid}.feed_forward.w1.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.FFN_GATE_LORA_B: (
            "model.layers.{bid}.feed_forward.w1.Plora_B",                 # internlm2
        ),
        MODEL_TENSOR.FFN_DOWN_LORA_A: (
            "model.layers.{bid}.feed_forward.w2.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.FFN_DOWN_LORA_B: (
            "model.layers.{bid}.feed_forward.w2.Plora_B",                 # internlm2
        ),

in convert-hf-to-gguf.py (skipping the qkv reshape for PLoRA tensors):

    qkv_pattern = r"model\.layers\.(\d+)\.attention\.wqkv"
        for name, data_torch in model_kv.items():
            # we don't need these
            if name.endswith(".rotary_emb.inv_freq"):
                continue
            plora_tensor = True if "Plora" in name else False
            if re.match(qkv_pattern, name) and not plora_tensor:

That makes your LoRA models convertible, and quantization works (though I suspect the LoRA tensors should get a higher quantization quality).

What would still be missing is the conditional support to actually apply them during inference?
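
To make the open question concrete, here is a rough sketch, in PyTorch-style Python rather than ggml, of what applying the partial LoRA would mean for a single linear layer: the low-rank update is added only at positions flagged as image tokens. The names plora_a, plora_b and im_mask are assumptions loosely following the public InternLM-XComposer2 code, not part of this PR:

import torch

def plora_linear(x, weight, plora_a, plora_b, im_mask):
    # x:       (batch, seq, in_features) input activations
    # weight:  (out_features, in_features) base projection weight
    # plora_a: (rank, in_features), plora_b: (out_features, rank)
    # im_mask: (batch, seq) boolean mask marking image tokens
    base = x @ weight.T
    update = (x @ plora_a.T) @ plora_b.T
    # the low-rank update only contributes where im_mask is set
    return base + update * im_mask.unsqueeze(-1).to(x.dtype)

In ggml terms this would presumably amount to two extra matrix multiplications per affected projection, gated by a per-token mask.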

SolenoidWGT (Contributor, Author)

I will try to support LoRA for InternLM2, but I am not very familiar with the LoRA implementation in llama.cpp, and I need some time to read the code.

cmp-nct (Contributor) commented Jan 31, 2024

I've put this PR together with xcomposer2 support here: #5232
It does not handle the partial LoRA yet; my PyTorch knowledge is minimal and I struggle to follow the flow and the shapes.

Just FYI, if you find time to lend a hand to get the PLoRA working, we would have full support for xcomposer2, possibly the best visual multimodal model to date.

SolenoidWGT force-pushed the feat/add_internlm2 branch 2 times, most recently from e0375e1 to d25bc84 on January 31, 2024
llama.cpp Outdated (resolved review thread)
SolenoidWGT force-pushed the feat/add_internlm2 branch 3 times, most recently from f7fb7be to 47824a5 on February 1, 2024
llama.cpp Outdated
Comment on lines 670 to 674
            { LLM_TENSOR_FFN_NORM,      "blk.%d.ffn_norm" },
            { LLM_TENSOR_FFN_DOWN,      "blk.%d.ffn_down" },
        },
Owner

Some of Orion's tensors have been deleted - LLM_TENSOR_FFN_GATE and LLM_TENSOR_FFN_UP. Please restore:

llama.cpp, lines 654 to 670 at 1cfb537:

        LLM_ARCH_ORION,
        {
            { LLM_TENSOR_TOKEN_EMBD,    "token_embd" },
            { LLM_TENSOR_OUTPUT_NORM,   "output_norm" },
            { LLM_TENSOR_OUTPUT,        "output" },
            { LLM_TENSOR_ROPE_FREQS,    "rope_freqs" },
            { LLM_TENSOR_ATTN_NORM,     "blk.%d.attn_norm" },
            { LLM_TENSOR_ATTN_Q,        "blk.%d.attn_q" },
            { LLM_TENSOR_ATTN_K,        "blk.%d.attn_k" },
            { LLM_TENSOR_ATTN_V,        "blk.%d.attn_v" },
            { LLM_TENSOR_ATTN_OUT,      "blk.%d.attn_output" },
            { LLM_TENSOR_ATTN_ROT_EMBD, "blk.%d.attn_rot_embd" },
            { LLM_TENSOR_FFN_NORM,      "blk.%d.ffn_norm" },
            { LLM_TENSOR_FFN_GATE,      "blk.%d.ffn_gate" },
            { LLM_TENSOR_FFN_DOWN,      "blk.%d.ffn_down" },
            { LLM_TENSOR_FFN_UP,        "blk.%d.ffn_up" },
        },

SolenoidWGT (Contributor, Author)

Sorry, I was careless when rebasing; it has been fixed.

  * support InternLM2 inference
  * add add_space_prefix KV pair
ggerganov merged commit ce32060 into ggerganov:master on Feb 1, 2024
50 of 53 checks passed
sweetcard

python convert-hf-to-gguf.py ./internlm2-chat-20b

It doesn't work. Please check the following screenshot:

[screenshot of the conversion error]

arch-btw (Contributor) commented Feb 1, 2024

I'm having the same issue as @sweetcard, but with internlm2-chat-7b.

Edit: also with internlm/internlm2-chat-1_8b-sft:

[screenshot of the same error]

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* support InternLM2 inference
  * add add_space_prefix KV pair
SolenoidWGT (Contributor, Author)

@arch-btw @sweetcard, I spent some time debugging to find the problem and fixed it in PR #5305.

sweetcard

Thank you for your excellent work 👍

arch-btw (Contributor) commented Feb 4, 2024

Thank you @SolenoidWGT!

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* support InternLM2 inference
  * add add_space_prefix KV pair