
Add the policy to run llama model from the official repo #4313

Merged 19 commits into master on Sep 19, 2023

Conversation

@RezaYazdaniAminabadi (Contributor) commented Sep 12, 2023

This PR adds support for Llama 2 using the official implementation from the llama repo.

This now works for all of the llama variants except the ones that require KV sharing.
Support for the KV-shared architecture will be added next.
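For context, a minimal sketch of how a model built from the official llama repo might be wired into DeepSpeed inference with kernel injection. The checkpoint paths, build arguments, and the exact injection call here are illustrative assumptions, not this PR's actual scripts (those live in the linked fork):

```python
# Hypothetical sketch only: paths and arguments are placeholders, and the
# exact integration used by this PR may differ.
import torch
import deepspeed
from llama import Llama  # the official meta-llama/llama repo

# Build the generator the same way the official repo's example does.
generator = Llama.build(
    ckpt_dir="llama-2-70b/",          # placeholder checkpoint directory
    tokenizer_path="tokenizer.model", # placeholder tokenizer path
    max_seq_len=1024,
    max_batch_size=4,
)

# Swap in DeepSpeed's fused transformer-inference kernels.
generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=8,                        # tensor-parallel degree (e.g. 8 GPUs)
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

out = generator.text_completion(["DeepSpeed is"], max_gen_len=128)
print(out[0]["generation"])
```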

@mpjlu (Contributor) commented Sep 13, 2023

This PR is for the official llama repo, but it is named llama2; users will be confused by the file and class names.
Since the 7B and 13B model architectures of Llama and Llama 2 are the same, DeepSpeed can already run the Llama 2 7B and 13B HF models; only the 70B Llama 2 model still needs support.
It would be better to enable the 70B Llama 2 model in the current llama code.

@RezaYazdaniAminabadi (Contributor, Author)

This PR is for the official llama repo, but it is named llama2; users will be confused by the file and class names. Since the 7B and 13B model architectures of Llama and Llama 2 are the same, DeepSpeed can already run the Llama 2 7B and 13B HF models; only the 70B Llama 2 model still needs support. It would be better to enable the 70B Llama 2 model in the current llama code.

Hi @mpjlu,

Thanks for the comment.
Agreed on this; this policy is just a secondary one for me to try with the official repo, since getting access to the llama models on HF takes a long time. I will name them better!
Also, I am working on adding the 70B support.
Best,
Reza

@RezaYazdaniAminabadi RezaYazdaniAminabadi changed the title Add the llama2 support from the official llama repo Add the policy to run llama model from the official repo Sep 13, 2023
@RezaYazdaniAminabadi (Contributor, Author)

By the way, the models that HF and the llama repo use are a bit different! At least, I know that they use different rotary embeddings.
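To illustrate what "different rotary embeddings" means in practice, here is a small sketch (not code from this PR) contrasting the two common pairings: HF-style rotate_half over two contiguous halves versus the interleaved adjacent-pair rotation used by the original implementation. The function names are illustrative:

```python
# Minimal sketch (not this PR's code) of the two rotary pairings. The
# cos/sin tables must be laid out to match whichever pairing is used.
import torch

def rotate_half(x):
    # HF-style: pair element i with element i + d/2 (two contiguous halves).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_every_two(x):
    # Original-implementation style: rotate adjacent pairs (0,1), (2,3), ...
    x_even = x[..., ::2]
    x_odd = x[..., 1::2]
    return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

def apply_rotary(x, cos, sin, interleaved):
    rotated = rotate_every_two(x) if interleaved else rotate_half(x)
    return x * cos + rotated * sin
```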

@RezaYazdaniAminabadi (Contributor, Author) commented Sep 13, 2023

I added some tests to check the performance and accuracy of this PR using a fork of the llama code-base.
There are some scripts that you can use to run different model configurations, here.
I am seeing about a 2.8x performance speedup (using 8 A100 GPUs) when using ds-inference for the Llama-70B model with the same example used in the repo (I will add more tests to check the performance more extensively):

[2023-09-13 13:23:09,286] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 8192, 'intermediate_size': 28672, 'heads': 64, 'num_hidden_layers': -1, 'dtype': torch.float16, 'pre_layer_norm': True, 'norm_type': <NormType.RMSNorm: 3>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 128, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': False, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GATED_SILU: 4>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': 8}
Loading extension module transformer_inference...
------------------------------------------------------
Free memory : 35.154175 (GigaBytes)  
Total memory: 79.169678 (GigaBytes)  
Requested memory: 3.875000 (GigaBytes) 
Setting maximum total tokens (input + output) to 1024 
WorkSpace: 0x7f9798000000 
------------------------------------------------------
baseline: generation time 10.940 sec for generating x tokens.
ds-inference: generation time 3.938 sec for generating x tokens.
speeup: 2.778x
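For anyone reproducing a comparison like the one above, a rough timing harness in the spirit of the numbers reported; `generator` and `prompts` are assumed to come from a setup like the one sketched earlier in the thread, not from the fork's benchmark script:

```python
# Rough sketch of a generation-timing comparison; the generation callable
# and prompts are placeholders.
import time
import torch

def time_generation(generate_fn, prompts, warmup=1, iters=3):
    """Average wall-clock seconds per call of a generation callable."""
    for _ in range(warmup):
        generate_fn(prompts)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        generate_fn(prompts)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# baseline = time_generation(lambda p: generator.text_completion(p, max_gen_len=256), prompts)
# ...inject the DeepSpeed kernels, then measure again...
# ds_time = time_generation(lambda p: generator.text_completion(p, max_gen_len=256), prompts)
# print(f"speedup: {baseline / ds_time:.3f}x")
```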

@mpjlu (Contributor) commented Sep 18, 2023

I am seeing about a 2.8x performance speedup (using 8 A100 GPUs) when using ds-inference for the Llama-70B model ...

Does this PR support the Llama-2-70B model? Is the "Llama-70B model" Llama 1 or Llama 2?

@RezaYazdaniAminabadi (Contributor, Author)

Does this PR support the Llama-2-70B model? Is the "Llama-70B model" Llama 1 or Llama 2?

It supports Llama-2-70B. Of course, it is a Llama-2 model.

@mpjlu (Contributor) commented Sep 18, 2023

It supports Llama-2-70B. Of course, it is a Llama-2 model.

The attention in Llama-2-70B is GQA (a KV-shared architecture), but this PR supports Llama 2 without KV sharing, so the KV-cache memory is the same as for MHA, right?

@RezaYazdaniAminabadi (Contributor, Author)

The attention in Llama-2-70B is GQA (a KV-shared architecture), but this PR supports Llama 2 without KV sharing, so the KV-cache memory is the same as for MHA, right?

Right, that will be added next.
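To make the memory implication concrete, a back-of-the-envelope sketch using the numbers printed in the config above (hidden_size 8192, 64 heads, num_kv 8, 1024 max tokens) plus Llama-2-70B's 80 layers; this is illustrative arithmetic, not DeepSpeed's allocator:

```python
# Back-of-the-envelope sketch: with MHA-style caching (no GQA), all 64 heads
# are cached; with GQA only the 8 KV heads are. Illustrative only.
hidden_size = 8192
num_heads = 64
num_kv_heads = 8                       # 'num_kv': 8 in the config above
head_dim = hidden_size // num_heads    # 128
layers = 80                            # Llama-2-70B
max_tokens = 1024                      # max_out_tokens in the config above
bytes_per_elem = 2                     # fp16

def kv_cache_bytes(kv_heads):
    # factor of 2 for keys and values
    return 2 * layers * max_tokens * kv_heads * head_dim * bytes_per_elem

print(f"MHA-style cache: {kv_cache_bytes(num_heads) / 2**30:.2f} GiB per sequence")
print(f"GQA cache:       {kv_cache_bytes(num_kv_heads) / 2**30:.2f} GiB per sequence")
```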

@RezaYazdaniAminabadi RezaYazdaniAminabadi added this pull request to the merge queue Sep 19, 2023
Merged via the queue into master with commit 468882f Sep 19, 2023
CurryRice233 pushed a commit to CurryRice233/DeepSpeed that referenced this pull request Sep 28, 2023
* origin/master:
  Allow multiple inference engines in single script (deepspeedai#4384)
  adds triton flash attention2 kernel (deepspeedai#4337)
  Fix llama meta tensor loading in AutoTP and kernel injected inference (deepspeedai#3608)
  Fix min torch version (deepspeedai#4375)
  Fix multinode runner to properly append to PDSH_SSH_ARGS_APPEND (deepspeedai#4373)
  add the missing method (deepspeedai#4363)
  Openfold fix (deepspeedai#4368)
  deepspeed4science japanese blog (deepspeedai#4369)
  deepspeed4science chinese blog (deepspeedai#4366)
  Enable workflow dispatch on Torch 1.10 CI tests (deepspeedai#4361)
  Update conda env to have max pydantic version (deepspeedai#4362)
  add deepspeed4science blog link (deepspeedai#4364)
  added check to avoid undefined behavior when the input_id length is greater than max_tokens (deepspeedai#4349)
  Add the policy to run llama model from the official repo (deepspeedai#4313)
  fix deepspeed4science links (deepspeedai#4358)
  DeepSpeed4Science (deepspeedai#4357)
  Support InternLM (deepspeedai#4137)
  Pass base_dir to model files can be loaded for auto-tp/meta-tensor. (deepspeedai#4348)
@ghost commented Oct 27, 2023

@RezaYazdaniAminabadi How is the progress on adding GQA support for the LLaMA2-70B model? We would like to know if any help is needed to expedite it.
