Add the policy to run llama model from the official repo #4313
Conversation
This PR is for the official llama repo, but the files and classes are named llama2. Users will be confused by the file and class names.
Hi @mpjlu, thanks for the comment.
Btw, the models that the HF and official Llama repos use are a bit different! At least, I know that they use different rotary embeddings.
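For context, here is a minimal sketch of the two rotary-embedding conventions (illustrative code, not part of this PR): the official llama repo rotates interleaved channel pairs via complex multiplication, while HF transformers rotates the two halves of the head dimension, so the two layouts only match after permuting channels.

```python
import torch

def rope_meta(x, freqs_cis):
    # Official llama repo style: adjacent channel pairs form complex numbers
    # that are rotated by the complex phases in freqs_cis.
    x_ = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_ * freqs_cis).flatten(-2)

def rope_hf(x, cos, sin):
    # HF transformers style: rotate the first and second halves of the
    # head dimension against each other ("rotate_half").
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return x * cos + torch.cat((-x2, x1), dim=-1) * sin

# Toy demo with head_dim = 4 and a single position.
head_dim = 4
x = torch.randn(1, head_dim)
theta = torch.tensor([0.5, 0.1])               # one rotation angle per channel pair
freqs_cis = torch.polar(torch.ones(2), theta)  # complex e^{i*theta}, official layout
cos = torch.cat([theta.cos(), theta.cos()])    # HF repeats frequencies across halves
sin = torch.cat([theta.sin(), theta.sin()])
print(rope_meta(x, freqs_cis))  # rotates channel pairs (0,1) and (2,3)
print(rope_hf(x, cos, sin))     # rotates channel pairs (0,2) and (1,3)
```

Because the channel pairing differs, checkpoints or kernels written for one layout cannot be applied to the other without permuting the Q/K weights accordingly.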
I added some tests checking the performance and accuracy of this PR using a fork of the llama code-base.
…nto add-llama2-support
Does this PR support the Llama-2-70B model? Is the "Llama-70B model" Llama 1 or Llama 2?
It supports Llama-2-70B. Of course, it is a Llama 2 model.
The attention in Llama-2-70B is GQA (a KV-shared architecture), but this PR supports Llama 2 without KV sharing, so the KV-cache memory is the same as for MHA, right?
Right, that will be added next.
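For anyone following along, a back-of-envelope sketch (illustrative code, not DeepSpeed internals) of what that means for memory. Llama-2-70B's published config uses 80 layers, head_dim 128, 64 query heads, and 8 shared KV heads:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for the K and V tensors; fp16 -> 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

seq, bsz = 4096, 1
mha = kv_cache_bytes(80, 64, 128, seq, bsz)  # K/V kept per query head (this PR, for now)
gqa = kv_cache_bytes(80, 8, 128, seq, bsz)   # K/V shared across the 8 GQA groups

print(f"MHA-style cache: {mha / 2**30:.2f} GiB")  # 10.00 GiB
print(f"GQA cache:       {gqa / 2**30:.2f} GiB")  # 1.25 GiB
```

So at 4K context the non-shared path holds roughly 8x the KV-cache memory that GQA needs, which is why the follow-up matters for 70B serving.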
…nto add-llama2-support
* origin/master:
  Allow multiple inference engines in single script (deepspeedai#4384)
  adds triton flash attention2 kernel (deepspeedai#4337)
  Fix llama meta tensor loading in AutoTP and kernel injected inference (deepspeedai#3608)
  Fix min torch version (deepspeedai#4375)
  Fix multinode runner to properly append to PDSH_SSH_ARGS_APPEND (deepspeedai#4373)
  add the missing method (deepspeedai#4363)
  Openfold fix (deepspeedai#4368)
  deepspeed4science japanese blog (deepspeedai#4369)
  deepspeed4science chinese blog (deepspeedai#4366)
  Enable workflow dispatch on Torch 1.10 CI tests (deepspeedai#4361)
  Update conda env to have max pydantic version (deepspeedai#4362)
  add deepspeed4science blog link (deepspeedai#4364)
  added check to avoid undefined behavior when the input_id length is greater than max_tokens (deepspeedai#4349)
  Add the policy to run llama model from the official repo (deepspeedai#4313)
  fix deepspeed4science links (deepspeedai#4358)
  DeepSpeed4Science (deepspeedai#4357)
  Support InternLM (deepspeedai#4137)
  Pass base_dir to model files can be loaded for auto-tp/meta-tensor. (deepspeedai#4348)
@RezaYazdaniAminabadi How is the progress on adding GQA to support the LLaMA2-70B model? We would like to know if any help is needed to expedite it.
This PR adds support for Llama 2 using the official implementation from the llama repo.
This now works for all Llama variants except the ones that require KV sharing.
Support for the KV-shared architecture will be added next.
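For reference, a rough sketch of how a model built from the official repo might be run through DeepSpeed inference once this lands. `Llama.build` and `text_completion` are the official repo's entry points; the checkpoint paths are placeholders, and the exact wrapping may differ from what the PR's policy does internally:

```python
import torch
import deepspeed
from llama import Llama  # official repo: https://github.com/meta-llama/llama

# Build the model exactly as the official repo does (placeholder paths).
generator = Llama.build(
    ckpt_dir="llama-2-7b/",
    tokenizer_path="tokenizer.model",
    max_seq_len=512,
    max_batch_size=4,
)

# Inject DeepSpeed's fused inference kernels into the official model.
generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=1,                       # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

out = generator.text_completion(["DeepSpeed is"], max_gen_len=64)
print(out[0]["generation"])
```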