Adding support for Context Parallelism using DeepSpeed's DistributedAttention #1501
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from c494ea6 to 3cfc93f
Force-pushed from 3cfc93f to 30d808e
This cannot be merged before SynapseAI v1.19 is released, right?
optimum/habana/parallel_state.py (Outdated)
Why not put this file in optimum/habana.distributed?
We can put it there.
Can I do the restructuring later in a separate commit?
@@ -234,6 +234,7 @@ def to_test(
     "codellama/CodeLlama-13b-Instruct-hf",
     "MIT/ast-finetuned-speech-commands-v2",
     "meta-llama/LlamaGuard-7b",
+    "huggyllama/llama-7b",
Can we perform the test with a more recent version of Llama? This one is Llama v1.
We can add the test for any version of Llama.
We added v1 because, for Llama 2 and 3, we need Hugging Face token authorization to access the model; without it, the test will fail.
Is there a way to pass the token, or what do you suggest?
For instance, I use this script for the DeepSpeed CI: https://github.com/huggingface/optimum-habana/blob/main/tests/ci/slow_tests_deepspeed.sh
It takes a token as an argument to log in, but I'm not sure you have to do it that way.
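For illustration, a minimal sketch of how a test could log in programmatically before touching a gated Llama checkpoint; the HF_TOKEN variable name is an assumption, not something this PR or the CI scripts define:

```python
import os

from huggingface_hub import login

# Read the token from the environment (HF_TOKEN is an assumed variable name)
# and log in before the test tries to download a gated model.
token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)
```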
We added this test to check that the context parallelism (CP) feature works.
For Llama 2 and 3.1, we can add one more test with a long sequence length.
Can I add that later along with the restructuring?
Yes, sure, but I would like to do that before the release.
@regisss I think we should add a token to https://github.com/huggingface/optimum-habana/blob/main/tests/ci/slow_tests_8x.sh as well, similar to https://github.com/huggingface/optimum-habana/blob/main/tests/ci/slow_tests_deepspeed.sh.
I will update the test to Llama 3.1. Can you add the token to slow_tests_8x.sh?
I will add it along with the restructuring in a separate commit.
If you agree with that, can you get this merged?
Okay, let's do that 👍
@bhargaveede I just pushed a commit to add HF login to the 8x slow tests: 9f9b41e
Can you please also run all of the Llama CI tests to make sure this doesn't affect the current numbers?
The code quality check failed, please run the code style check.
@regisss The style check is failing in a different file (not part of this PR).
Yeah, that was happening because a PR was merged yesterday without passing the style check. I rebased your branch, so everything should be fine now.
Adding support for Context Parallelism using DeepSpeed's DistributedAttention
This PR adds support for enabling Context Parallelism for Llama models using DeepSpeed.
The feature is enabled with the context_parallel_size flag.
It lets us train/evaluate with longer context lengths by parallelizing the inputs across the context parallel group.
For attention, it uses DeepSpeed's DistributedAttention, which gathers the sequence for all heads and distributes the heads across the context parallel group, so that attention for each head sees the entire context while the heads are split within the group.
Once attention is done, the outputs are scattered back along the sequence dimension across the group.
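For context, a rough sketch of how a local attention callable can be wrapped with DeepSpeed's DistributedAttention over a context-parallel process group. This is not the exact code of this PR; the helper name and the use of scaled_dot_product_attention as the local attention are assumptions:

```python
import torch.nn.functional as F
from deepspeed.sequence.layer import DistributedAttention


def build_context_parallel_attention(cp_group):
    """Wrap a plain attention callable so that, inside the all-to-all,
    heads are scattered and the sequence is gathered across cp_group."""

    def local_attention(query, key, value, attention_mask=None):
        # Stand-in for the model's usual attention (e.g. FusedSDPA on Gaudi).
        return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)

    # Llama Q/K/V states are laid out as [B, N, S, H]: heads live on dim 1
    # (scatter_idx) and the sequence on dim 2 (gather_idx).
    return DistributedAttention(local_attention, cp_group, scatter_idx=1, gather_idx=2)
```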
Verified Llama 3.1 8B and Llama 3.1 70B fine-tuning with a 32K sequence length on 8 ranks using this feature.
Llama 3.1 8B command:
HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 python3 ./optimum-habana-fork/examples/gaudi_spawn.py \
    --world_size 8 --use_deepspeed ./optimum-habana-fork/examples/language-modeling/run_lora_clm.py \
    --dataset_name tatsu-lab/alpaca \
    --bf16 True \
    --output_dir /tmp/lora_out \
    --max_seq_len 32768 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --save_strategy no \
    --learning_rate 0.0004 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "constant" \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --throughput_warmup_steps 3 \
    --lora_rank 8 \
    --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
    --attn_softmax_bf16 True \
    --validation_split_percentage 4 \
    --flash_attention_causal_mask True \
    --evaluation_strategy epoch \
    --pipelining_fwd_bwd \
    --use_lazy_mode \
    --use_flash_attention True \
    --deepspeed ./optimum-habana-fork/examples/language-modeling/llama3_ds_zero1_config.json \
    --num_train_epochs 3 \
    --eval_delay 3 \
    --do_eval \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --gradient_accumulation_steps 4 \
    --flash_attention_recompute True \
    --context_parallel_size 4 \
    --model_name_or_path meta-llama/Llama-3.1-8B
Test added to verify the context parallelism:
https://github.com/huggingface/optimum-habana/pull/1501/files#diff-0741a50beca4b08d354933485499f735f9b5493841e8f3af0e89b16ae1e04af4R978
Note:
The DistributedAttention indices (https://github.com/huggingface/optimum-habana/pull/1501/files#diff-30aeee6868dd1de34878aca0583f57bb5b0dd9a2a8511a80e9a6b2645f39ce6bR490) are initialized as scatter_idx=1 and gather_idx=2 because, for Llama, the query states have shape [B, N, S, H] (batch size, num heads, sequence length, head dim).
We want to gather along the sequence length (dim 2 of [B, N, S, H]) and scatter the heads (dim 1 of [B, N, S, H]).
Other models that want to integrate DistributedAttention have to adjust the indices based on their tensor shapes.
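As a small illustration of that index choice (the dimensions below are toy values, not taken from the PR):

```python
import torch

# Toy dimensions: batch, heads, full sequence length, head dim, CP group size.
B, N, S, H, cp_size = 2, 8, 4096, 128, 4

# Each rank starts with all heads but only a 1/cp_size shard of the sequence.
local_q = torch.empty(B, N, S // cp_size, H)
print("before all-to-all:", tuple(local_q.shape))     # (2, 8, 1024, 128)

# The all-to-all with scatter_idx=1 (heads) and gather_idx=2 (sequence)
# hands each rank the full sequence for a subset of heads.
print("after all-to-all: ", (B, N // cp_size, S, H))  # (2, 2, 4096, 128)

# After local attention, the reverse all-to-all scatters the outputs back
# along the sequence dimension, restoring the [B, N, S // cp_size, H] shard.
```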