Add Self-Extend support? #1242
Comments
Any progress?
I find that `grp-attn-w` and `grp-attn-n` are not included in llama.h. Perhaps help from the llama.cpp side would be ideal.
Right--it looks like both `grp-attn-w` and `grp-attn-n` would need to be exposed first. Something like:
```cpp
// fill the KV cache
for (int i = 0; i < n_ctx; i += n_batch) {
    if (i > 0 && n_grp > 1) {
        // if SelfExtend is enabled, we compress the position from the last batch by a factor of n_grp
        const int ib = i/n_batch - 1;
        const int bd = n_batch_grp*(n_grp - 1);

        llama_kv_cache_seq_add(ctx, 0, n_past - n_batch,         n_past,         ib*bd);
        llama_kv_cache_seq_div(ctx, 0, n_past - n_batch + ib*bd, n_past + ib*bd, n_grp);
        llama_kv_cache_update (ctx);
    }
    // ... decode the next batch ...
}
```

I've spent some time looking through the llama-cpp-python routines, but couldn't find the equivalent place, i.e. what happens when you exceed the current cache. It looks like ggerganov may be tackling this in the issue @sweetcard linked above. Maybe that's the faster route.
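In the meantime, if llama-cpp-python's low-level `llama_cpp` bindings mirror those llama.h calls (`llama_kv_cache_seq_add`, `llama_kv_cache_seq_div`, `llama_kv_cache_update`), the same position compression could in principle be driven from Python. Below is only a rough sketch under that assumption, with `ctx` being a raw `llama_context` pointer and the other names taken from the C++ snippet above; the helper itself is hypothetical, not an existing feature:

```python
# Rough sketch only: assumes llama_cpp's low-level bindings expose the same
# KV-cache calls as llama.h and that `ctx` is a raw llama_context pointer.
import llama_cpp


def compress_last_batch(ctx, i, n_past, n_batch, n_grp):
    """Hypothetical helper mirroring the C++ snippet above: compress the
    positions of the last decoded batch by a factor of n_grp."""
    n_batch_grp = n_batch // n_grp   # as in llama.cpp's passkey example (assumption)
    ib = i // n_batch - 1            # index of the batch that was just decoded
    bd = n_batch_grp * (n_grp - 1)   # position shift contributed per compressed batch

    # shift the last batch's cells, then divide their positions by n_grp
    llama_cpp.llama_kv_cache_seq_add(ctx, 0, n_past - n_batch, n_past, ib * bd)
    llama_cpp.llama_kv_cache_seq_div(ctx, 0, n_past - n_batch + ib * bd,
                                     n_past + ib * bd, n_grp)
    llama_cpp.llama_kv_cache_update(ctx)
```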
Any update here? 😄
Any update, please?
I've been really enjoying using both llama-cpp-python and the original llama.cpp. These are amazing developments, especially for folks without massively powerful GPUs.

There's a really nice feature that was implemented in llama.cpp in January to allow self-extend (a la LongLM's approach). It works well in llama.cpp's main.cpp as well as server.cpp, and plenty of folks have noted self-extend is especially useful with Mistral/Mixtral, Gemma, and Phi-2. It appears someone else might have been asking about this earlier here.

Right now I'm having to move in and out of Python when I want to run summarization on a 'just-slightly-too-long' article with self-extend. Would you consider implementing self-extend as an option in llama-cpp-python?
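For illustration only, here is a hypothetical sketch of what such an option might look like on the Python side. The `grp_attn_n` and `grp_attn_w` keyword arguments do not exist in llama-cpp-python today; they are named here purely to mirror llama.cpp's `--grp-attn-n` / `--grp-attn-w` flags:

```python
# Hypothetical API only: grp_attn_n / grp_attn_w are NOT current Llama()
# parameters; they mirror llama.cpp's --grp-attn-n / --grp-attn-w flags.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # any GGUF model path
    n_ctx=8192,        # target (extended) context length
    grp_attn_n=4,      # hypothetical: self-extend group size
    grp_attn_w=1024,   # hypothetical: self-extend window width
)

out = llm("Summarize the following article:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```

Internally this would presumably map onto the same KV-cache shift/divide calls shown in the comment above.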