
Support Self extend for server (main is already supported) #4886

Closed · duykhanhbk opened this issue Jan 12, 2024 · 22 comments

@duykhanhbk

Self-extend is now supported for main: #4815
Paper: https://arxiv.org/pdf/2401.01325.pdf
It would be great if it were also supported for the server; any guidance or support is welcome!

@x4080 commented Jan 12, 2024

yes please

@ggerganov (Owner)

I'll look into implementing it. In the meantime, do you observe positive results when using main? The llama.cpp implementation tries to follow the one from the paper, but it is not exactly the same, as it applies changes (shifts) to the KV cache instead of recomputing the RoPE. It should be similar (if not the same), though I still have some doubts. My quick tests indicate that it seems to work, but I don't want to draw any conclusions yet.
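For reference, a minimal self-contained sketch of the position mapping that the grouped attention is meant to achieve, based on the description above and the paper. The numbers (training context 2048, `--grp-attn-n 8`, `--grp-attn-w 1024`, 8192 cached tokens) are illustrative assumptions; this is not the actual shifting code from main:

```c
// Illustration of the grouped-position idea behind Self-Extend: the most
// recent ga_w tokens keep consecutive positions, while older tokens have
// their positions compressed by the group factor ga_n, so the largest
// effective position stays within the original training context.
#include <stdio.h>

// Effective position of a cached token at absolute position `pos`, given the
// current sequence length `n_past`, group factor `ga_n` and window `ga_w`.
static int effective_pos(int pos, int n_past, int ga_n, int ga_w) {
    const int recent = n_past - ga_w; // first position of the "recent" window
    if (pos >= recent) {
        // recent tokens: contiguous positions right after the compressed block
        return recent/ga_n + (pos - recent);
    }
    // older tokens: positions compressed by the group factor
    return pos/ga_n;
}

int main(void) {
    const int n_past = 8192; // tokens currently in the KV cache (-c 8192)
    const int ga_n   = 8;    // --grp-attn-n
    const int ga_w   = 1024; // --grp-attn-w

    // the newest token must still map inside the training context (e.g. 2048)
    printf("max effective position: %d\n",
           effective_pos(n_past - 1, n_past, ga_n, ga_w)); // prints 1919
    return 0;
}
```

The KV-cache shifts in main apply this kind of compression incrementally as generation proceeds, rather than remapping the whole cache at once.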

@x4080 commented Jan 12, 2024

@ggerganov I'm happy to report that the long-context problem with dolphin-phi is now solved, and it's even better with group attention. I also tried another model and it makes a difference, so I think it works.

What I don't understand is the relationship between these settings:

-c 4096 --grp-attn-n 4 --grp-attn-w 1024

and how to calculate the desired context. Could you shed some light on this? 😄

@duykhanhbk (Author)

> I'll look into implementing it. In the meantime, do you observe positive results when using main? The llama.cpp implementation tries to follow the one from the paper, but it is not exactly the same, as it applies changes (shifts) to the KV cache instead of recomputing the RoPE. It should be similar (if not the same), though I still have some doubts. My quick tests indicate that it seems to work, but I don't want to draw any conclusions yet.

@ggerganov @x4080 Yes, I tested with SeaLLM 7B chat (Llama 2 architecture) and extended the context to 16k and 26k; the results are quite good and look promising. I will test with Mistral, Phi, etc. to see how they behave. I have the same question as @x4080: how do you calculate the desired context? Could you shed some light on this, @ggerganov? It's a bit different from https://github.com/datamllab/LongLM

@ggerganov (Owner)

First, you set -c to the context that you want to achieve - let's say -c 8192.

Next, given that the original training context of the model is T (let's assume T = 2048), you want to set G >= 8192 / T, so in this case: --grp-attn-n 4 or --grp-attn-n 8.

The --grp-attn-w corresponds to W from the paper. I think the authors generally used 512, but I think you can go up to T/2 - so in this case --grp-attn-w 1024.

Additionally, W has to be a multiple of G.
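Putting those numbers together (a sketch, assuming a model with an original training context of 2048 tokens; model.gguf is a placeholder path), the invocation of main would look something like:

./main -m model.gguf -c 8192 --grp-attn-n 4 --grp-attn-w 1024

Here -c is the target context (8192), --grp-attn-n is G (8192 / 2048 = 4), and --grp-attn-w is W (T/2 = 1024, which is a multiple of G).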

@x4080 commented Jan 13, 2024

@ggerganov Thanks for detailed answer

@duykhanhbk (Author)

Is there any update on implementing self-extend for the server, @ggerganov?

@ggerganov (Owner)

If someone wants to give it a try - go ahead. When I get to this, I will assign myself to the issue - for now, other priorities.

@duykhanhbk (Author)

Here is a pull request for this issue: #4963
Please check it, @ggerganov!

@ggerganov (Owner)

It's just adding the command-line arguments - there is no actual implementation.

@duykhanhbk (Author)

Yes, I see that too. Waiting for the actual implementation!

@Josh-XT commented Jan 19, 2024

Sorry, I started to do it in the server example, following what was done for main, but then my kid made a big mess, I didn't get to finish, and I still haven't had a chance to get back to it. Hopefully someone will have time to do it before I do.

@Maximilian-Winter (Contributor)

@ggerganov Is it OK if I take this on? I will start working on it now.

@Maximilian-Winter (Contributor)

I looked at your main implementation and it looks doable for me. Is there anything I need to look out for?

@ggerganov (Owner)

I suppose it would be a good idea to put this code behind a llama.h function to avoid these raw computations being copy-pasted. I haven't thought deeply about the API, but maybe you can give it a try.
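Purely as an illustration of that suggestion, here is a hypothetical shape such a helper could take; the name and signature below are invented for this sketch and do not exist in llama.h, only the surrounding types are real:

```c
#include "llama.h"

// Hypothetical helper - not part of llama.h. The idea is to wrap the raw
// Self-Extend KV-cache shift/divide computations from main so that main and
// server could share a single implementation.
//
// ctx    : context whose KV cache is updated
// seq_id : sequence to operate on
// ga_n   : group attention factor (--grp-attn-n)
// ga_w   : group attention window (--grp-attn-w)
// n_past : in/out - number of tokens in the sequence, updated to reflect
//          the compressed positions after the call
void llama_self_extend_kv_cache(
        struct llama_context * ctx,
               llama_seq_id    seq_id,
               int32_t         ga_n,
               int32_t         ga_w,
               llama_pos     * n_past);
```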

@Maximilian-Winter (Contributor)

@ggerganov Well, actually I just copy-pasted it. But I will refactor it once I know I'm doing this the right way!

@Maximilian-Winter (Contributor)

I finished porting self-extend to the server. It is in this pull request: #5104

@duykhanhbk (Author)

Hi @Maximilian-Winter, first of all, thanks for your work. However, I have found a problem with the KV cache: #5104 (comment)

@Green-Sky (Collaborator)

Closing as completed.

(@duykhanhbk I found the same issue, but self-extend is not the cause.)

@x4080 commented Feb 27, 2024

Hi, I found that when using the server with --grp-attn-n, inference sometimes stops prematurely. I tested the server with and without it: without --grp-attn-n it works flawlessly. I then tried the non-server path (command-line inference with group attention) and that works fine too, so maybe there's a problem with the implementation in the server?

@phymbert (Collaborator)

@x4080 It would really help if you added a scenario covering group-attention self-extension to the server test framework.

@x4080 commented Feb 27, 2024

@phymbert Thanks for replying. I'm currently using my own fine-tuned model with private data, and what I do with it is translate into another language. I know it's difficult to fix things without reproducible evidence; maybe I can find another example with a public model - I'll share it.
