rewrite about flash attn support
mchoi8739 authored Jun 23, 2023
1 parent 89d13dd commit 1053c23
Showing 1 changed file with 1 addition and 1 deletion.
@@ -28,7 +28,7 @@
"\n",
"Sharded data parallelism is a distributed training technique that splits the model parameters, gradients, and optimizer states across GPUs in a data parallel group. It is purpose-built for extreme-scale models and leverages Amazon in-house [MiCS](https://arxiv.org/pdf/2205.00119.pdf) technology which achieves a near-linear scaling efficiency. For large models that cannot fit into a single GPU, we also recommend to use the sharded data parallelism technique with [Activation Checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html) and [Activation Offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html) in SMP first, before leveraging other techniques such as tensor parallelism or pipeline parallelism.\n",
"\n",
"The SMP library also supports [FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html), which is only applicable for distributed transformer models (transformer models wrapped by `smp.DistributedModel()`) for model-parallel training. \n",
"The SMP library also supports [FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html) for both distributed and non-distributed transformer models. The FLAN-T5 model is a non-distributed transformer model, and this notebook and the accompanied scripts show how to set up FlashAttention. \n",
"\n",
"These two features are also compatible with [Tensor Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html). \n",
"\n",
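
The changed line concerns which models FlashAttention applies to: the removed text limited it to distributed transformer models, i.e. models wrapped with `smp.DistributedModel()`, while the new text extends it to non-distributed models such as FLAN-T5. Below is a minimal training-script sketch of the wrapping step referenced by the removed line, assuming the SMP PyTorch API and a Hugging Face FLAN-T5 checkpoint; the FlashAttention setup for the non-distributed case is left to the notebook's accompanying scripts and is not reproduced here.

```python
# Minimal training-script sketch, assuming smdistributed.modelparallel.torch (smp)
# and a Hugging Face checkpoint; the model size and checkpointing target are assumptions.
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForSeq2SeqLM

smp.init()  # initialize SMP with the configuration passed through the estimator

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
model = smp.DistributedModel(model)  # wrap for SMP model-parallel training

# Activation checkpointing is set on modules of the wrapped model, for example:
# smp.set_activation_checkpointing(model.get_module().encoder)  # see the SMP docs for options
```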
