Add Self-Extend support? #1242

theaerotoad · 2024-03-01T23:34:00Z

I've been really enjoying using both llama-cpp-python and the original llama.cpp. These are amazing developments, especially for folks without massively powerful GPUs.

There's a really nice feature that was implemented in llama.cpp in January to allow self-extend (à la LongLM's approach). It works well in llama.cpp's main.cpp as well as server.cpp, and plenty of folks have noted that self-extend is especially useful with Mistral/Mixtral, Gemma, and Phi-2.

It appears someone else asked about this earlier here. Right now, I have to move in and out of Python whenever I want to run summarization on a 'just-slightly-too-long' article with self-extend. Would you consider implementing self-extend as an option in llama-cpp-python?

sweetcard · 2024-03-06T15:03:38Z

Any progress?

sweetcard · 2024-03-08T11:54:17Z

I find that grp-attn-w and grp-attn-n are not included in llama.h.

Maybe help from the llama.cpp side would be ideal.
Any other ideas?

ggerganov/llama.cpp#4815 (comment)

sweetcard · 2024-03-08T11:55:48Z

#1090

This is a PR for this feature, but it cannot work because grp-attn-w and grp-attn-n are not included in llama.h.

theaerotoad · 2024-03-08T18:34:46Z

Right, it looks like both main.cpp and server.cpp implement self-extend in their own code rather than through anything exposed in llama.h. I think the simplest implementation of it appears in passkey.cpp.

Something like:

    ...
    // fill the KV cache
    for (int i = 0; i < n_ctx; i += n_batch) {
        if (i > 0 && n_grp > 1) {
            // if SelfExtend is enabled, we compress the position from the last batch by a factor of n_grp
            const int ib = i/n_batch - 1;
            const int bd = n_batch_grp*(n_grp - 1);

            llama_kv_cache_seq_add (ctx, 0, n_past - n_batch,         n_past,         ib*bd);
            llama_kv_cache_seq_div (ctx, 0, n_past - n_batch + ib*bd, n_past + ib*bd, n_grp);
            llama_kv_cache_update  (ctx);
            ...
        }
        ...
    }
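
To make the position arithmetic in that excerpt easier to follow, here is a rough, standalone sketch that replays the same add/divide steps on a plain vector of positions instead of a real KV cache. Everything in it is an assumption for illustration: `ToyCache` and `self_extend_positions` are hypothetical names, the toy `seq_add`/`seq_div` only imitate what I understand `llama_kv_cache_seq_add` and `llama_kv_cache_seq_div` to do (shift or integer-divide the positions that fall in `[p0, p1)`), `n_batch_grp` is assumed to be `n_batch / n_grp`, and `n_past` is assumed to be reset to the largest position plus one after each compression, a step the excerpt is cut off before. It does not call llama.cpp at all.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the KV cache: one position per cached token,
// single sequence only. Not part of llama.cpp.
struct ToyCache {
    std::vector<int> pos;

    // Imitates the assumed behaviour of llama_kv_cache_seq_add:
    // shift every position in [p0, p1) by delta.
    void seq_add(int p0, int p1, int delta) {
        for (int & p : pos) {
            if (p >= p0 && p < p1) p += delta;
        }
    }

    // Imitates the assumed behaviour of llama_kv_cache_seq_div:
    // integer-divide every position in [p0, p1) by d.
    void seq_div(int p0, int p1, int d) {
        for (int & p : pos) {
            if (p >= p0 && p < p1) p /= d;
        }
    }

    int pos_max() const {
        return pos.empty() ? -1 : *std::max_element(pos.begin(), pos.end());
    }
};

// Replays the loop from the excerpt and returns the final position of each token.
// Assumes n_ctx is a multiple of n_batch and n_batch is a multiple of n_grp.
std::vector<int> self_extend_positions(int n_ctx, int n_batch, int n_grp) {
    assert(n_grp >= 1 && n_batch % n_grp == 0 && n_ctx % n_batch == 0);

    ToyCache cache;
    const int n_batch_grp = n_batch / n_grp; // assumption, not shown in the excerpt
    int n_past = 0;

    for (int i = 0; i < n_ctx; i += n_batch) {
        if (i > 0 && n_grp > 1) {
            // Compress the positions of the previous batch by a factor of n_grp.
            const int ib = i / n_batch - 1;
            const int bd = n_batch_grp * (n_grp - 1);

            // Shift the previous batch up first so that, after dividing,
            // it lands right behind the batches compressed earlier.
            cache.seq_add(n_past - n_batch, n_past, ib * bd);
            cache.seq_div(n_past - n_batch + ib * bd, n_past + ib * bd, n_grp);

            // Assumption: continue right after the largest compressed position.
            n_past = cache.pos_max() + 1;
        }

        // "Decode" the next batch: new tokens get consecutive positions.
        for (int j = 0; j < n_batch; ++j) {
            cache.pos.push_back(n_past + j);
        }
        n_past += n_batch;
    }
    return cache.pos;
}
```

Under those assumptions, `self_extend_positions(12, 4, 2)` gives positions `0 0 1 1 2 2 3 3 4 5 6 7`: the two older batches are squeezed into half as many positions, the newest batch keeps consecutive positions, and 12 tokens fit into 8 positions. With `n_grp = 1` the positions are left untouched.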

I've spent some time looking through the llama-cpp-python routines, but couldn't find the equivalent place that handles what happens when you exceed the current cache.

It looks like ggerganov may be tackling this in the issue @sweetcard linked above. Maybe that's the faster route.

sweetcard · 2024-03-12T14:53:26Z

any update here? 😄

iamsaurabhgupt · 2024-09-07T16:44:00Z

any update pls?

abetlen added the enhancement New feature or request label Mar 2, 2024

sweetcard mentioned this issue Mar 8, 2024

main : add Self-Extend support ggerganov/llama.cpp#4815

Merged
