llama : support InternLM2 #5184

Merged
merged 1 commit into ggerganov:master on Feb 1, 2024

Conversation

SolenoidWGT (Contributor)

Dear llama.cpp developer @ggerganov:

Hello, this PR adds support for InternLM2. InternLM2 is a GQA-based transformer architecture that offers effective support for long contexts of up to 200,000 characters. More and more users want to explore the capabilities of InternLM2 through the familiar llama.cpp API. We sincerely hope that our PR can be accepted. Thank you.

For more information about InternLM2, please visit https://github.com/InternLM/InternLM.

Related issues: #3133, #3551, #4360, #5031

cc: @yhcc, @vansinhu, @sunpengsdu

llama.cpp Outdated
@@ -11220,7 +11409,8 @@ int32_t llama_tokenize(
int32_t n_max_tokens,
bool add_bos,
bool special) {
auto res = llama_tokenize_internal(model->vocab, std::string(text, text_len), add_bos, special);
bool add_space_prefix = false ? model->arch == LLM_ARCH_INTERNLM2 : true;
Owner

The condition is wrong:

Suggested change
bool add_space_prefix = false ? model->arch == LLM_ARCH_INTERNLM2 : true;
bool add_space_prefix = model->arch == LLM_ARCH_INTERNLM2 ? true : false;

Is this change necessary? Seems strange to add prefix just for a specific model architecture. Can this be fixed outside of llama.cpp?

Collaborator

Or better,

bool add_space_prefix = model->arch == LLM_ARCH_INTERNLM2;

SolenoidWGT (Contributor, Author) commented Jan 30, 2024

Thanks for your suggestion! To decide the value of add_dummy_prefix, I went through the SentencePiece code and found that the tokenizer's add_dummy_prefix setting is stored in tokenizer.model itself, and we can read it through the Python interface:

from sentencepiece import sentencepiece_model_pb2 as model

# TOKENIZER_PATH is the path to the model's tokenizer.model file
m = model.ModelProto()
m.ParseFromString(open(TOKENIZER_PATH, "rb").read())
print(m.normalizer_spec.add_dummy_prefix)

So, when we do the checkpoint conversion in convert-hf-to-gguf.py, we could add a new key-value field via gguf_writer, for example something like

gguf_writer.add_add_dummy_prefix(add_dummy_prefix: bool)

If you're good with that, I'll make the change so that the decision on add_dummy_prefix is not tied to any specific model architecture. @ggerganov @cebtenzzre
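
For illustration, here is a minimal, hypothetical sketch of what the conversion-side change could look like, assuming the generic GGUFWriter.add_bool helper from gguf-py; the exact key name and any dedicated helper method in the final patch may differ:

# Hypothetical sketch: read add_dummy_prefix from tokenizer.model during
# conversion and store it as a GGUF key-value pair, instead of hard-coding
# the behaviour per model architecture in llama.cpp.
from sentencepiece import sentencepiece_model_pb2 as spm

def write_add_space_prefix(gguf_writer, tokenizer_path: str) -> None:
    proto = spm.ModelProto()
    with open(tokenizer_path, "rb") as f:
        proto.ParseFromString(f.read())
    # "tokenizer.ggml.add_space_prefix" is an assumed key name here
    gguf_writer.add_bool("tokenizer.ggml.add_space_prefix",
                         proto.normalizer_spec.add_dummy_prefix)

At tokenization time, llama.cpp could then read this KV pair and decide whether to add the dummy space prefix, independent of the model architecture.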

Owner

@SolenoidWGT Yes, I think this is a good solution. Please make the change

SolenoidWGT (Contributor, Author)

Done.

cmp-nct (Contributor) commented Jan 30, 2024

Do you intend to support the Plora_A and Plora_B tensors?

I've experimented with them a bit, but my PyTorch experience is limited, so porting them into ggml is a pain.
The qkv LoRA is taken without the reshape; I wasn't sure how to do that.

constants.py

class MODEL_TENSOR(IntEnum):
    ATTN_QKV_LORA_A = auto()
    ATTN_QKV_LORA_B = auto()
    ATTN_OUT_LORA_A = auto()
    ATTN_OUT_LORA_B = auto()
    FFN_UP_LORA_A   = auto()
    FFN_UP_LORA_B   = auto()
    FFN_GATE_LORA_A = auto()
    FFN_GATE_LORA_B = auto()
    FFN_DOWN_LORA_A = auto()
    FFN_DOWN_LORA_B = auto()

constants.py

TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
..
    MODEL_TENSOR.ATTN_QKV_LORA_A : "blk.{bid}.attn_qkv_lora_a",
    MODEL_TENSOR.ATTN_QKV_LORA_B : "blk.{bid}.attn_qkv_lora_b",
    MODEL_TENSOR.ATTN_OUT_LORA_A : "blk.{bid}.attn_out_lora_a",
    MODEL_TENSOR.ATTN_OUT_LORA_B : "blk.{bid}.attn_out_lora_b",
    MODEL_TENSOR.FFN_UP_LORA_A   : "blk.{bid}.ffn_up_lora_a",
    MODEL_TENSOR.FFN_UP_LORA_B   : "blk.{bid}.ffn_up_lora_b",
    MODEL_TENSOR.FFN_GATE_LORA_A : "blk.{bid}.ffn_gate_lora_a",
    MODEL_TENSOR.FFN_GATE_LORA_B : "blk.{bid}.ffn_gate_lora_b",
    MODEL_TENSOR.FFN_DOWN_LORA_A : "blk.{bid}.ffn_down_lora_a",
    MODEL_TENSOR.FFN_DOWN_LORA_B : "blk.{bid}.ffn_down_lora_b",

constants.py

MODEL_ARCH.INTERNLM2: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.OUTPUT,
        MODEL_TENSOR.ATTN_NORM,
        MODEL_TENSOR.ATTN_Q,
        MODEL_TENSOR.ATTN_K,
        MODEL_TENSOR.ATTN_V,
        MODEL_TENSOR.ATTN_OUT,
        MODEL_TENSOR.ATTN_ROT_EMBD,
        MODEL_TENSOR.FFN_NORM,
        MODEL_TENSOR.FFN_GATE,
        MODEL_TENSOR.FFN_DOWN,
        MODEL_TENSOR.FFN_UP,

        MODEL_TENSOR.ATTN_QKV_LORA_A,
        MODEL_TENSOR.ATTN_QKV_LORA_B,
        MODEL_TENSOR.ATTN_OUT_LORA_A,
        MODEL_TENSOR.ATTN_OUT_LORA_B,
        MODEL_TENSOR.FFN_UP_LORA_A,
        MODEL_TENSOR.FFN_UP_LORA_B,
        MODEL_TENSOR.FFN_GATE_LORA_A,
        MODEL_TENSOR.FFN_GATE_LORA_B,
        MODEL_TENSOR.FFN_DOWN_LORA_A,
        MODEL_TENSOR.FFN_DOWN_LORA_B,
    ],

tensor_mapping.py:

        MODEL_TENSOR.ATTN_QKV_LORA_A: (
            "model.layers.{bid}.attention.wqkv.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.ATTN_QKV_LORA_B: (
            "model.layers.{bid}.attention.wqkv.Plora_B",                 # internlm2
        ),
        MODEL_TENSOR.ATTN_OUT_LORA_A: (
            "model.layers.{bid}.attention.wo.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.ATTN_OUT_LORA_B: (
            "model.layers.{bid}.attention.wo.Plora_B",                 # internlm2
        ),
        MODEL_TENSOR.FFN_UP_LORA_A: (
            "model.layers.{bid}.feed_forward.w3.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.FFN_UP_LORA_B: (
            "model.layers.{bid}.feed_forward.w3.Plora_B",                 # internlm2
        ),
        MODEL_TENSOR.FFN_GATE_LORA_A: (
            "model.layers.{bid}.feed_forward.w1.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.FFN_GATE_LORA_B: (
            "model.layers.{bid}.feed_forward.w1.Plora_B",                 # internlm2
        ),
        MODEL_TENSOR.FFN_DOWN_LORA_A: (
            "model.layers.{bid}.feed_forward.w2.Plora_A",                 # internlm2
        ),
        MODEL_TENSOR.FFN_DOWN_LORA_B: (
            "model.layers.{bid}.feed_forward.w2.Plora_B",                 # internlm2
        ),

in convert-hf-to-gguf.py (skipping the qkv reshape for PLoRA tensors):

    qkv_pattern = r"model\.layers\.(\d+)\.attention\.wqkv"
        for name, data_torch in model_kv.items():
            # we don't need these
            if name.endswith(".rotary_emb.inv_freq"):
                continue
            plora_tensor = True if "Plora" in name else False
            if re.match(qkv_pattern, name) and not plora_tensor:

That makes your LoRA models convertible, and quantization works (though I suspect the LoRA tensors should get a higher quantization quality).

What would still be missing is the conditional support to actually apply them during inference?
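
To make the open question concrete, here is a rough sketch, in PyTorch-style Python rather than ggml, of what applying the partial LoRA would mean for a single linear layer: the low-rank update is added only at positions flagged as image tokens. The names plora_a, plora_b and im_mask are assumptions loosely following the public InternLM-XComposer2 code, not part of this PR:

import torch

def plora_linear(x, weight, plora_a, plora_b, im_mask):
    # x:       (batch, seq, in_features) input activations
    # weight:  (out_features, in_features) base projection weight
    # plora_a: (rank, in_features), plora_b: (out_features, rank)
    # im_mask: (batch, seq) boolean mask marking image tokens
    base = x @ weight.T
    update = (x @ plora_a.T) @ plora_b.T
    # the low-rank update only contributes where im_mask is set
    return base + update * im_mask.unsqueeze(-1).to(x.dtype)

In ggml terms this would presumably amount to two extra matrix multiplications per affected projection, gated by a per-token mask.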

SolenoidWGT (Contributor, Author)

I will try to support LoRA for InternLM2, but I am not very familiar with the LoRA implementation in llama.cpp, and I need some time to read the code.

cmp-nct (Contributor) commented Jan 31, 2024

I've put this PR together with xcomposer2 support here: #5232
It does not handle the partial LoRA yet; my PyTorch knowledge is minimal and I struggle to follow the flow and the shapes.

Just FYI, if you find time to lend a hand to get the PLoRA working, we would have full support for xcomposer2, possibly the best visual multimodal model to date.

SolenoidWGT force-pushed the feat/add_internlm2 branch 2 times, most recently from e0375e1 to d25bc84 on January 31, 2024
llama.cpp Outdated (resolved review thread)
SolenoidWGT force-pushed the feat/add_internlm2 branch 3 times, most recently from f7fb7be to 47824a5 on February 1, 2024
llama.cpp Outdated
Comment on lines 670 to 674
            { LLM_TENSOR_FFN_NORM,      "blk.%d.ffn_norm" },
            { LLM_TENSOR_FFN_DOWN,      "blk.%d.ffn_down" },
        },
Owner

Some of Orion's tensors have been deleted - LLM_TENSOR_FFN_GATE and LLM_TENSOR_FFN_UP. Please restore:

llama.cpp, lines 654 to 670 at 1cfb537:

        LLM_ARCH_ORION,
        {
            { LLM_TENSOR_TOKEN_EMBD,    "token_embd" },
            { LLM_TENSOR_OUTPUT_NORM,   "output_norm" },
            { LLM_TENSOR_OUTPUT,        "output" },
            { LLM_TENSOR_ROPE_FREQS,    "rope_freqs" },
            { LLM_TENSOR_ATTN_NORM,     "blk.%d.attn_norm" },
            { LLM_TENSOR_ATTN_Q,        "blk.%d.attn_q" },
            { LLM_TENSOR_ATTN_K,        "blk.%d.attn_k" },
            { LLM_TENSOR_ATTN_V,        "blk.%d.attn_v" },
            { LLM_TENSOR_ATTN_OUT,      "blk.%d.attn_output" },
            { LLM_TENSOR_ATTN_ROT_EMBD, "blk.%d.attn_rot_embd" },
            { LLM_TENSOR_FFN_NORM,      "blk.%d.ffn_norm" },
            { LLM_TENSOR_FFN_GATE,      "blk.%d.ffn_gate" },
            { LLM_TENSOR_FFN_DOWN,      "blk.%d.ffn_down" },
            { LLM_TENSOR_FFN_UP,        "blk.%d.ffn_up" },
        },

SolenoidWGT (Contributor, Author)

Sorry, I was careless when rebasing; it has been fixed.

  * support InternLM2 inference
  * add add_space_prefix KV pair
ggerganov merged commit ce32060 into ggerganov:master on Feb 1, 2024
50 of 53 checks passed
sweetcard

python convert-hf-to-gguf.py ./internlm2-chat-20b

It doesn't work. Please check the following screenshot:

[screenshot of the conversion error]

arch-btw (Contributor) commented Feb 1, 2024

I'm having the same issue as @sweetcard, but with internlm2-chat-7b.

Edit: also with internlm/internlm2-chat-1_8b-sft:

[screenshot of the same error]

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* support InternLM2 inference
  * add add_space_prefix KV pair
SolenoidWGT (Contributor, Author)

@arch-btw @sweetcard, I spent some time debugging to find the problem and fixed it in PR #5305.

sweetcard

Thank you for your excellent work 👍

arch-btw (Contributor) commented Feb 4, 2024

Thank you @SolenoidWGT!

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* support InternLM2 inference
  * add add_space_prefix KV pair