
Adding support for Context Parallelism using DeepSpeed's DistributedAttention #1501

Merged: 4 commits into huggingface:main on Dec 3, 2024

Conversation

@bhargaveede (Collaborator) commented Nov 20, 2024:

Adding support for Context Parallelism using DeepSpeed's DistributedAttention

This PR adds support for enabling Context Parallelism for Llama models using DeepSpeed.
The feature is enabled with the context_parallel_size flag.

It allows us to train/evaluate with longer context lengths by parallelizing the inputs across the context-parallel group.
For attention, it uses DeepSpeed's DistributedAttention, which gathers the full sequence for every head and distributes the heads across the context-parallel group, so that attention for each head sees the entire context while the heads themselves are split within the group.
Once attention is done, the outputs are scattered back along the sequence dimension across the group.
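Below is a minimal sketch (not the code from this PR) of how a local attention callable might be wrapped with DeepSpeed's DistributedAttention for this layout. The `local_attention` helper, `build_distributed_attention`, and `context_parallel_ranks` are illustrative assumptions; the real model code would call the Habana fused/flash attention kernel rather than plain SDPA.

```python
# Sketch only: wrap a local attention callable with DeepSpeed's
# DistributedAttention (DeepSpeed-Ulysses style) so each rank attends over the
# full sequence for a subset of the heads. Assumes torch.distributed has
# already been initialized and that a DeepSpeed release providing
# deepspeed.sequence.layer.DistributedAttention is installed.
import torch
import torch.distributed as dist
from deepspeed.sequence.layer import DistributedAttention


def local_attention(query, key, value, *args, **kwargs):
    # Plain scaled-dot-product attention on the local shard; the PR's actual
    # code path would use the Gaudi fused/flash attention kernel here.
    return torch.nn.functional.scaled_dot_product_attention(query, key, value)


def build_distributed_attention(context_parallel_ranks):
    # Q/K/V are laid out as [B, N, S, H] (batch, heads, sequence, head_dim), so:
    #   scatter_idx=1 -> split the head dimension across the CP group
    #   gather_idx=2  -> all-gather the sequence dimension before attention
    cp_group = dist.new_group(ranks=context_parallel_ranks)
    return DistributedAttention(local_attention, cp_group, scatter_idx=1, gather_idx=2)


# Usage (each rank holds only seq_len / context_parallel_size of the sequence):
#   dist_attn = build_distributed_attention([0, 1, 2, 3])
#   attn_output = dist_attn(query_states, key_states, value_states)
```

The wrapper performs an all-to-all before the local attention call (gathering the sequence, scattering the heads) and the inverse all-to-all afterwards, which keeps per-rank memory bounded while still giving every head the full context.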

Verified Llama 3.1-8B and Llama 3.1-70B finetuning with a 32K sequence length on 8 ranks using this feature.

  • Could fit 32K-sequence-length Llama 3.1-8B LoRA finetuning on 4 ranks with selective recompute instead of gradient_checkpointing.
  • Could fit 32K-sequence-length Llama 3.1-70B LoRA finetuning on 4 ranks and 8 ranks with gradient_checkpointing.

Llama 3.1-8B command:

```bash
HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 \
python3 ./optimum-habana-fork/examples/gaudi_spawn.py --world_size 8 --use_deepspeed \
  ./optimum-habana-fork/examples/language-modeling/run_lora_clm.py \
  --dataset_name tatsu-lab/alpaca --bf16 True --output_dir /tmp/lora_out \
  --max_seq_len 32768 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
  --save_strategy no --learning_rate 0.0004 --warmup_ratio 0.03 --lr_scheduler_type "constant" \
  --logging_steps 1 --dataset_concatenation --do_train --use_habana --throughput_warmup_steps 3 \
  --lora_rank 8 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
  --attn_softmax_bf16 True --validation_split_percentage 4 --flash_attention_causal_mask True \
  --evaluation_strategy epoch --pipelining_fwd_bwd --use_lazy_mode --use_flash_attention True \
  --deepspeed ./optimum-habana-fork/examples/language-modeling/llama3_ds_zero1_config.json \
  --num_train_epochs 3 --eval_delay 3 --do_eval --lora_alpha 16 --lora_dropout 0.05 \
  --gradient_accumulation_steps 4 --flash_attention_recompute True --context_parallel_size 4 \
  --model_name_or_path meta-llama/Llama-3.1-8B
```

Test added to verify the context parallelism:
https://github.com/huggingface/optimum-habana/pull/1501/files#diff-0741a50beca4b08d354933485499f735f9b5493841e8f3af0e89b16ae1e04af4R978

Note:

DistributedAttention indices (https://github.com/huggingface/optimum-habana/pull/1501/files#diff-30aeee6868dd1de34878aca0583f57bb5b0dd9a2a8511a80e9a6b2645f39ce6bR490) are initialized as scatter_idx=1 and gather_idx=2 because, for Llama, the query states have shape [B, N, S, H] (batch size, num heads, sequence length, head dim):
we want to gather on the sequence length (dim 2 of [B, N, S, H]) and scatter the heads (dim 1 of [B, N, S, H]).
Other models that want to integrate DistributedAttention have to adjust these indices based on their tensor shapes.
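As a purely illustrative shape walk-through (the sizes are made up; only the index semantics follow the description above), assuming a context-parallel size of 4:

```python
# Hypothetical shapes for Llama query states [B, N, S, H] with cp_size = 4.
cp_size = 4
B, N, S, H = 1, 32, 32768, 128              # full, unsharded dimensions

local_input = (B, N, S // cp_size, H)       # each rank: 1/4 of the sequence, all heads
# all-to-all with gather_idx=2 (sequence) and scatter_idx=1 (heads):
after_all_to_all = (B, N // cp_size, S, H)  # full sequence, 1/4 of the heads
# local attention runs on `after_all_to_all`; the inverse all-to-all then
# scatters the output back along the sequence dimension:
local_output = (B, N, S // cp_size, H)

print(local_input, after_all_to_all, local_output)
```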



@bhargaveede bhargaveede requested a review from vivekgoe November 20, 2024 05:34
@bhargaveede bhargaveede marked this pull request as ready for review November 21, 2024 07:56
@bhargaveede bhargaveede requested a review from a user November 21, 2024 07:56
@bhargaveede bhargaveede requested a review from regisss as a code owner November 21, 2024 07:56
@bhargaveede (Author) commented:

@regisss @libinta
Please review the changes, which add support for Context Parallelism.
This is needed for the 1.19 release.

@bhargaveede added and then removed the run-test (Run CI for PRs from external contributors) label on Nov 21, 2024.
@regisss (Collaborator) left a comment:

This cannot be merged before SynapseAI v1.19 is released, right?

Resolved review threads on:
  • optimum/habana/accelerate/accelerator.py
  • optimum/habana/accelerate/data_loader.py
  • optimum/habana/accelerate/state.py
  • optimum/habana/distributed/contextparallel.py
  • optimum/habana/parallel_state.py
  • optimum/habana/transformers/models/llama/modeling_llama.py
Collaborator (reviewer):
Why not put this file in optimum/habana/distributed?

@bhargaveede (Author):
We can put it.
Can I do the restructuring later in a separate commit?

Review thread on optimum/habana/transformers/training_args.py (resolved).
@@ -234,6 +234,7 @@ def to_test(
     "codellama/CodeLlama-13b-Instruct-hf",
     "MIT/ast-finetuned-speech-commands-v2",
     "meta-llama/LlamaGuard-7b",
+    "huggyllama/llama-7b",
Collaborator (reviewer):

Can we perform the test with a more recent version of Llama? This one is Llama v1.

@bhargaveede (Author):

We can add the test with any version of Llama.
We added v1 because, for Llama 2 and 3, we need Hugging Face token authorization to access the model; without it, the test will fail.
Is there a way to pass the token, or what do you suggest?

Collaborator (reviewer):

For instance, I use this script for the DeepSpeed CI: https://github.com/huggingface/optimum-habana/blob/main/tests/ci/slow_tests_deepspeed.sh
It takes a token as an argument to log in, but I'm not sure you would do it like that here.

@bhargaveede (Author):

We added this test to check that the CP feature works.
For Llama 2 and 3.1, we can add one more test with a long sequence length.
Can I add that later along with the restructuring?

Collaborator (reviewer):

Yes, sure, but I would like to do that before the release.

@bhargaveede (Author) commented Dec 3, 2024:

@regisss I think we should add a token to https://github.com/huggingface/optimum-habana/blob/main/tests/ci/slow_tests_8x.sh as well, similar to https://github.com/huggingface/optimum-habana/blob/main/tests/ci/slow_tests_deepspeed.sh.

I will update the test to Llama 3.1. Can you add the token to slow_tests_8x.sh?
I will add my change along with the restructuring in a separate commit.

If you agree with that, can you get this merged?

Collaborator (reviewer):

Okay, let's do that 👍

Collaborator (reviewer):

@bhargaveede I just pushed a commit to add HF login to the 8x slow tests: 9f9b41e

@yeonsily (Collaborator) left a comment:

Can you please also run all of the Llama CI tests to make sure this doesn't affect the current numbers?

Review thread on optimum/habana/accelerate/state.py (resolved).
github-actions bot commented Dec 2, 2024:

The code quality check failed, please run make style.

@bhargaveede (Author):

@regisss the style check is failing in a different file (not part of this PR).
Can you check?

@regisss (Collaborator) commented Dec 3, 2024:

> @regisss the style check is failing in a different file (not part of this PR). Can you check?

Yeah that was happening because a PR was merged yesterday without passing the style check. I rebased your branch so everything should be fine now.

@regisss regisss merged commit 0fbc457 into huggingface:main Dec 3, 2024
4 checks passed
regisss added a commit that referenced this pull request Dec 3, 2024
Liangyx2 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Jan 20, 2025
Labels: run-test (Run CI for PRs from external contributors), synapse_1.19_dependency