rewrite about flash attn support
mchoi8739 authored Jun 23, 2023
1 parent 89d13dd commit 1053c23
Showing 1 changed file with 1 addition and 1 deletion.
@@ -28,7 +28,7 @@
"\n",
"Sharded data parallelism is a distributed training technique that splits the model parameters, gradients, and optimizer states across GPUs in a data parallel group. It is purpose-built for extreme-scale models and leverages Amazon in-house [MiCS](https://arxiv.org/pdf/2205.00119.pdf) technology which achieves a near-linear scaling efficiency. For large models that cannot fit into a single GPU, we also recommend to use the sharded data parallelism technique with [Activation Checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html) and [Activation Offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html) in SMP first, before leveraging other techniques such as tensor parallelism or pipeline parallelism.\n",
"\n",
"The SMP library also supports [FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html), which is only applicable for distributed transformer models (transformer models wrapped by `smp.DistributedModel()`) for model-parallel training. \n",
"The SMP library also supports [FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html) for both distributed and non-distributed transformer models. The FLAN-T5 model is a non-distributed transformer model, and this notebook and the accompanied scripts show how to set up FlashAttention. \n",
"\n",
"These two features are also compatible with [Tensor Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html). \n",
"\n",
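
The changed line concerns which models FlashAttention applies to: the removed text limited it to distributed transformer models, i.e. models wrapped with `smp.DistributedModel()`, while the new text extends it to non-distributed models such as FLAN-T5. Below is a minimal training-script sketch of the wrapping step referenced by the removed line, assuming the SMP PyTorch API and a Hugging Face FLAN-T5 checkpoint; the FlashAttention setup for the non-distributed case is left to the notebook's accompanying scripts and is not reproduced here.

```python
# Minimal training-script sketch, assuming smdistributed.modelparallel.torch (smp)
# and a Hugging Face checkpoint; the model size and checkpointing target are assumptions.
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForSeq2SeqLM

smp.init()  # initialize SMP with the configuration passed through the estimator

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
model = smp.DistributedModel(model)  # wrap for SMP model-parallel training

# Activation checkpointing is set on modules of the wrapped model, for example:
# smp.set_activation_checkpointing(model.get_module().encoder)  # see the SMP docs for options
```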
