
Support Self extend for server (main is already supported) #4886

Closed · duykhanhbk opened this issue Jan 12, 2024 · 22 comments

@duykhanhbk

Self-extend is now supported for main: #4815
Paper: https://arxiv.org/pdf/2401.01325.pdf
It would be great if it were also supported for the server; any guidance or support is welcome!

@x4080 commented Jan 12, 2024

yes please

@ggerganov (Owner)

I'll look into implementing it. In the meantime, do you observe positive results when using main? The llama.cpp implementation tries to follow the one from the paper, but it is not exactly the same, as it applies changes (shifts) to the KV cache instead of recomputing the RoPE. It should be similar (if not the same), though I still have some doubts. My quick tests indicate that it seems to work, but I don't want to draw any conclusions yet.
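For reference, a minimal self-contained sketch of the position mapping that the grouped attention is meant to achieve, based on the description above and the paper. The numbers (training context 2048, `--grp-attn-n 8`, `--grp-attn-w 1024`, 8192 cached tokens) are illustrative assumptions; this is not the actual shifting code from main:

```c
// Illustration of the grouped-position idea behind Self-Extend: the most
// recent ga_w tokens keep consecutive positions, while older tokens have
// their positions compressed by the group factor ga_n, so the largest
// effective position stays within the original training context.
#include <stdio.h>

// Effective position of a cached token at absolute position `pos`, given the
// current sequence length `n_past`, group factor `ga_n` and window `ga_w`.
static int effective_pos(int pos, int n_past, int ga_n, int ga_w) {
    const int recent = n_past - ga_w; // first position of the "recent" window
    if (pos >= recent) {
        // recent tokens: contiguous positions right after the compressed block
        return recent/ga_n + (pos - recent);
    }
    // older tokens: positions compressed by the group factor
    return pos/ga_n;
}

int main(void) {
    const int n_past = 8192; // tokens currently in the KV cache (-c 8192)
    const int ga_n   = 8;    // --grp-attn-n
    const int ga_w   = 1024; // --grp-attn-w

    // the newest token must still map inside the training context (e.g. 2048)
    printf("max effective position: %d\n",
           effective_pos(n_past - 1, n_past, ga_n, ga_w)); // prints 1919
    return 0;
}
```

The KV-cache shifts in main apply this kind of compression incrementally as generation proceeds, rather than remapping the whole cache at once.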

@x4080 commented Jan 12, 2024

@ggerganov I'm happy to report that the long-context problem with dolphin-phi is now solved, and it's even better with group attention. I also tried another model and it makes a difference, so I think it works.

What I don't understand is the relationship between these settings:

-c 4096 --grp-attn-n 4 --grp-attn-w 1024

and how to calculate the desired context. Could you shed some light on this? 😄

@duykhanhbk (Author)

> I'll look into implementing it. In the meantime, do you observe positive results when using main? The llama.cpp implementation tries to follow the one from the paper, but it is not exactly the same, as it applies changes (shifts) to the KV cache instead of recomputing the RoPE. It should be similar (if not the same), though I still have some doubts. My quick tests indicate that it seems to work, but I don't want to draw any conclusions yet.

@ggerganov @x4080 Yes, I tested with SeaLLM 7B chat (Llama 2 architecture) and extended the context to 16k and 26k; the results are quite good and look promising. I will test with Mistral, Phi, etc. to see how they behave. I have the same question as @x4080: how do you calculate the desired context? Could you shed some light on this, @ggerganov? It's a bit different from https://github.com/datamllab/LongLM

@ggerganov (Owner)

First, you set -c to the context that you want to achieve - let's say -c 8192.

Next, given that the original training context of the model is T (let's assume T = 2048), you want to set G >= 8192 / T, so in this case: --grp-attn-n 4 or --grp-attn-n 8.

The --grp-attn-w corresponds to W from the paper. I think the authors generally used 512, but I think you can go up to T/2 - so in this case --grp-attn-w 1024.

Additionally, W has to be a multiple of G.
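Putting those numbers together (a sketch, assuming a model with an original training context of 2048 tokens; model.gguf is a placeholder path), the invocation of main would look something like:

./main -m model.gguf -c 8192 --grp-attn-n 4 --grp-attn-w 1024

Here -c is the target context (8192), --grp-attn-n is G (8192 / 2048 = 4), and --grp-attn-w is W (T/2 = 1024, which is a multiple of G).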

@x4080 commented Jan 13, 2024

@ggerganov Thanks for detailed answer

@duykhanhbk (Author)

Is there any update on implementing self-extend for the server, @ggerganov?

@ggerganov (Owner)

If someone wants to give it a try - go ahead. When I get to this, I will assign myself to the issue - for now, other priorities.

@duykhanhbk (Author)

Here is a pull request for this issue: #4963
Please check it, @ggerganov!

@ggerganov (Owner)

It's just adding the command-line arguments - there is no actual implementation.

@duykhanhbk (Author)

Yes, I see that too. Waiting for the actual implementation!

@Josh-XT commented Jan 19, 2024

Sorry, I started to do it in the server example, following what was done for main, but then my kid made a big mess, I didn't get to finish, and I still haven't had a chance to get back to it. Hopefully someone will have time to do it before I do.

@Maximilian-Winter (Contributor)

@ggerganov Is it OK if I take this on? I will start working on it now.

@Maximilian-Winter (Contributor)

I looked at your main implementation and it looks doable for me. Is there anything I need to look out for?

@ggerganov (Owner)

I suppose it would be a good idea to put this code behind a llama.h function to avoid these raw computations being copy-pasted. I haven't thought deeply about the API, but maybe you can give it a try.
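Purely as an illustration of that suggestion, here is a hypothetical shape such a helper could take; the name and signature below are invented for this sketch and do not exist in llama.h, only the surrounding types are real:

```c
#include "llama.h"

// Hypothetical helper - not part of llama.h. The idea is to wrap the raw
// Self-Extend KV-cache shift/divide computations from main so that main and
// server could share a single implementation.
//
// ctx    : context whose KV cache is updated
// seq_id : sequence to operate on
// ga_n   : group attention factor (--grp-attn-n)
// ga_w   : group attention window (--grp-attn-w)
// n_past : in/out - number of tokens in the sequence, updated to reflect
//          the compressed positions after the call
void llama_self_extend_kv_cache(
        struct llama_context * ctx,
               llama_seq_id    seq_id,
               int32_t         ga_n,
               int32_t         ga_w,
               llama_pos     * n_past);
```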

@Maximilian-Winter (Contributor)

@ggerganov Well, actually I just copy-pasted it. But I will refactor it once I know I'm doing this the right way!

@Maximilian-Winter (Contributor)

I finished porting self-extend to the server. It is in this pull request: #5104

@duykhanhbk (Author)

Hi @Maximilian-Winter, first of all, thanks for your work. However, I have found a problem with the KV cache: #5104 (comment)

@Green-Sky (Collaborator)

Closing as completed.

(@duykhanhbk I found the same issue, but self-extend is not the cause.)

@x4080 commented Feb 27, 2024

Hi, I found that when using the server with --grp-attn-n, inference sometimes stops prematurely. I tested the server with and without it: without --grp-attn-n it works flawlessly. I then tried the non-server path (command-line inference with group attention) and that works fine too, so maybe there's a problem with the implementation in the server?

@phymbert (Collaborator)

@x4080 It would really help if you added a scenario covering group-attention self-extension to the server test framework.

@x4080 commented Feb 27, 2024

@phymbert Thanks for replying. I'm currently using my own fine-tuned model with private data, and what I do with it is translate into another language. I know it's difficult to fix things without reproducible evidence; maybe I can find another example with a public model - I'll share it.
