From 91e9756102c664415a239818d32802b9d813367e Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Fri, 23 Jun 2023 11:01:23 -0700
Subject: [PATCH 1/4] documentation

---
 .../smp-train-t5-sharded-data-parallel.ipynb  | 147 +++++++++++-------
 1 file changed, 87 insertions(+), 60 deletions(-)

diff --git a/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb b/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
index 6f0364008c..010d26a36e 100644
--- a/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
+++ b/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
@@ -20,6 +20,42 @@
     "---"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this notebook, you'll learn how to train the Hugging Face Transformers [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model with the [Sharded Data Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html) technique and FlashAttention supported by [SageMaker's Model Parallelism library (SMP)](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html), using PyTorch 1.13 and the [GLUE/SST2 dataset](https://huggingface.co/datasets/glue/viewer/sst2/train) on SageMaker. \n",
+    "\n",
+    "Sharded data parallelism is a distributed training technique that splits the model parameters, gradients, and optimizer states across GPUs in a data parallel group. It is purpose-built for extreme-scale models and leverages Amazon's in-house [MiCS](https://arxiv.org/pdf/2205.00119.pdf) technology, which achieves near-linear scaling efficiency. For large models that cannot fit into a single GPU, we also recommend using the sharded data parallelism technique with [Activation Checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html) and [Activation Offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html) in SMP first, before leveraging other techniques such as tensor parallelism or pipeline parallelism.\n",
+    "\n",
+    "The SMP library also supports [FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html), which is only applicable for distributed transformer models (transformer models wrapped by `smp.DistributedModel()`) for model-parallel training. \n",
+    "\n",
+    "These two features are also compatible with [Tensor Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html). \n",
+    "\n",
+    "This notebook is accompanied by the following files:\n",
+    "\n",
+    "- `train.py`: The entry point script that'll be passed to the SageMaker PyTorch estimator later in this notebook when launching the training job. This script runs end-to-end training of the FLAN-T5 model with SMP, applies the settings for sharded data parallelism, and includes the code to save, load, and fine-tune the model. You can follow the comments throughout the script to learn where the SMP APIs and code modifications are implemented.\n",
+    "- `data_pipeline.py`: This has data pipeline functions to prepare the training dataset.\n",
+    "- `learining_rate.py`: This has functions for the learning rate schedule.\n",
+    "- `requirements.txt`: This installs the dependencies, including Hugging Face Transformers.\n",
+    "- `memory_tracker.py`: This has functions to track memory usage.\n",
+    "- `model_config.py`: This has functions to get model configuration information.\n",
+    "- `sdp_utils.py`: This has util functions for sharded data parallelism.\n",
+    "- `t5_flash_attn.py`: This has util functions for the implementation of FlashAttention. The SMP library supports FlashAttention, and an additional configuration tip is available at [Support for FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html) in the *SageMaker Developer Guide*.\n",
+    "\n",
+    "### Additional Resources\n",
+    "- To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).\n",
+    "\n",
+    "- To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).\n",
+    "\n",
+    "- To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).\n",
+    "\n",
+    "- To learn more about sharded data parallelism, see [Sharded Data Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html) or the blog [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws).\n",
+    "\n",
+    "### Prerequisites\n",
+    "You must create an S3 bucket to store the input data for training. This bucket must be located in the same AWS Region where you choose to launch your training job. To learn how to create an S3 bucket, see [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) in the *Amazon S3 documentation*.\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
@@ -44,9 +80,7 @@
 {
  "cell_type": "code",
  "execution_count": null,
-  "metadata": {
-   "scrolled": false
-  },
+  "metadata": {},
  "outputs": [],
  "source": [
   "%%time\n",
@@ -83,15 +117,24 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Download and prepare glue-sst2 data\n",
-    "Here you will download, prepare the glue-sst2 dataset and then copy the files to S3."
+    "## Download and prepare GLUE/SST2 data\n",
+    "Here you will download and prepare the GLUE/SST2 dataset, and then copy the files to S3."
   ]
  },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-    "### 0. Import libraries and specify parameters"
+    "### Install the Hugging Face Transformers and Datasets libraries"
  ]
 },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "! pip install -q datasets transformers==4.21.0"
+  ]
+ },
 {
@@ -131,6 +174,14 @@
    "logger = logging.getLogger(__name__)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Load data\n",
+    "This section loads the [GLUE/SST2](https://huggingface.co/datasets/glue/viewer/sst2/train) dataset and splits it into training and validation datasets."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -146,14 +197,6 @@
    "}"
   ]
  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### 1. Load data\n",
-    "This section loads the dataset and splits it to training and validation datasets."
-   ]
-  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -192,9 +235,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "### 2. Load tokenizer\n",
-    "Nearly every NLP task begins with a tokenizer. A tokenizer converts your input into a format that can be processed by the model.\n",
-    "The following cell loads a tokenizer with [AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/v4.19.4/en/autoclass_tutorial#autotokenizer)"
+    "### Load tokenizer\n",
+    "Nearly every NLP task begins with a tokenizer. A tokenizer converts your text data into a format (tokens) that can be processed by the NLP model.\n",
+    "The following cell loads a tokenizer for GPT-2 using [AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/v4.19.4/en/autoclass_tutorial#autotokenizer)."
   ]
  },
 {
@@ -214,7 +257,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "### 3. Preprocess data"
+    "### Preprocess data\n",
+    "\n",
+    "The following two cells set up a function to run the tokenizer and group texts into chunks smaller than the block size."
   ]
  },
 {
@@ -298,6 +343,13 @@
    ")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Set additional hyperparameters and S3 paths so that the training and validation datasets are mapped properly, depending on the phase (training or validation) of the training job in each epoch."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -388,14 +440,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Specify Amazon S3 Bucket Paths"
+    "## Specify Amazon S3 bucket paths"
   ]
  },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-    "Here you need to specify the paths for training data to be used by your job. The bucket used must be in the same region as where training will run. In the cells above you downloaded the glue-sst2 training and validation split datasets and uploaded the json files in an S3 bucket in your account. This example will train on those json files.\n",
+    "Here you need to specify the paths for the training data to be used by your job. The bucket must be in the same AWS Region where training will run. In the cells above, you downloaded the GLUE/SST2 training and validation split datasets and uploaded the JSON files to an S3 bucket in your account. This example will train on those JSON files.\n",
     "\n",
     "After you successfully run this example training job with sharded data parallelism, you can modify the S3 bucket to where your own dataset is stored."
   ]
  },
 {
@@ -428,7 +480,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The below bucket will store output artifacts of the training job. You can modify this as needed."
+    "The following S3 bucket will store the output artifacts of the training job. You can modify this as needed."
   ]
  },
 {
@@ -444,9 +496,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Define Data Channels for SageMaker Training Using Amazon S3\n",
+    "## Define data channels for SageMaker Training using Amazon S3\n",
     "\n",
-    "In this step, you define SageMaker training data channels using the above buckets. "
+    "In this step, define SageMaker training data channels that point to the S3 buckets. "
   ]
  },
 {
@@ -480,7 +532,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## (Optional) Set Up and Use Amazon FSx for Data Channels and Checkpoints\n",
+    "## (Optional) Set up and use Amazon FSx for data channels and checkpoints\n",
     "\n",
     "While the previous option of using Amazon S3 is easier to set up, using Amazon FSx can be beneficial for performance when dealing with large input sizes and large model sizes. If you are using models above 13B, checkpointing should be done using FSx. \n",
     "\n",
@@ -530,14 +582,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Set Up Hyperparameters, Metric Definitions, and MPI Options\n",
+    "## Set hyperparameters, metric definitions, and MPI options\n",
     "The following `hyperparameters` dictionary passes arguments to the training script (`train.py`) and sets the model parallel configuration when creating the training job.\n",
     "\n",
     "You can also add custom `mpi` flags. By default, we have `--mca btl_vader_single_copy_mechanism none` to remove unnecessary logs.\n",
     "\n",
     "Next, we add base metric definitions to enable the metric upload in SageMaker. You can add any further metric definitions.\n",
     "\n",
-    "Note that we added the `sharded_data_parallel_degree` parameter to the `hyperparameter` dictionary. This will be parsed and used when we configure a SageMaker PyTorch estimator to activate sharded data parallelism."
+    "Note that we add the `sharded_data_parallel_degree` parameter to the `hyperparameters` dictionary. This will be parsed and used when we configure a SageMaker PyTorch estimator to activate sharded data parallelism."
   ]
  },
 {
@@ -605,7 +657,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Set the model configuration. Specify one from `google/flan-t5-xxl`, `google/flan-t5-xl` and `google/flan-t5-large`."
+    "Set the model configuration by choose one from `google/flan-t5-xxl`, `google/flan-t5-xl` and `google/flan-t5-large`."
   ]
  },
 {
@@ -652,7 +704,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Specify Essential Parameters for a SageMaker Training Job\n",
+    "## Specify essential parameters for a SageMaker Training job\n",
     "\n",
     "Next, you use the [`SageMaker Estimator class`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to define a SageMaker Training Job, passing values through the following parameters for training job name, the number of EC2 instances, the instance type, and the size of the volume attached to the instances. \n",
     "\n",
     "* `volume_size`\n",
     "* `base_job_name`\n",
     "\n",
-    "### Update the Type and Number of EC2 Instance to Use\n",
+    "### Update the type and the number of EC2 instances to use\n",
     "\n",
     "The instance type and the number of instances you specify in the `instance_type` and `instance_count` parameters, respectively, determine the total number of GPUs (world size).\n",
     "\n",
@@ -694,31 +746,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "### Attach an EBS Volume to the Training Instance\n",
-    "The volume size you specify in `volume_size` must be larger than your input data size. In this example, the volume size is set to 500GB."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "volume_size = 500"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Note:** For NVMe-type memory attached instances, you don't need to specify `volume_size`. The `volume_size` parameter attaches EBS volumes to instance types that don't have instance storage. For more information, see [Tips and Considerations for Setting Up Storage Paths](https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html#model-train-storage-tips-considerations)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Specify a Base Job Name"
+    "### Specify a base job name"
   ]
  },
 {
@@ -762,7 +790,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "### Create a SageMaker PyTorch Estimator\n",
+    "### Create a SageMaker PyTorch estimator\n",
     "\n",
     "The following cell constructs a PyTorch estimator using the parameters defined above. To see how the SageMaker APIs and functions are applied to the script, see the `train.py` file."
   ]
  },
 {
@@ -784,7 +812,6 @@
     "    source_dir=os.getcwd(),\n",
     "    role=role,\n",
     "    instance_type=instance_type,\n",
-    "    volume_size=volume_size,\n",
     "    instance_count=instance_count,\n",
     "    sagemaker_session=sagemaker_session,\n",
     "    distribution={\n",
@@ -829,7 +856,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Finally, run the `estimator.fit` method to launch the SageMaker training job of the T5 model with sharded data parallelism."
+    "Finally, run the `estimator.fit` method to launch the SageMaker training job of the FLAN-T5 model with sharded data parallelism."
   ]
  },
 {
@@ -920,9 +947,9 @@
  "hide_input": false,
  "instance_type": "ml.t3.medium",
  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "conda_pytorch_p310",
   "language": "python",
-   "name": "python3"
+   "name": "conda_pytorch_p310"
  },
  "language_info": {
   "codemirror_mode": {
@@ -934,7 +961,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.9"
+   "version": "3.10.10"
  }
 },
 "nbformat": 4,

From 5225b7a31e5af810d453b0d7f108fc105f26f223 Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Fri, 23 Jun 2023 11:08:53 -0700
Subject: [PATCH 2/4] fix typo

---
 .../flan-t5/smp-train-t5-sharded-data-parallel.ipynb | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb b/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
index 010d26a36e..a5cf239f85 100644
--- a/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
+++ b/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
@@ -43,7 +43,7 @@
     "- `sdp_utils.py`: This has util functions for sharded data parallelism.\n",
     "- `t5_flash_attn.py`: This has util functions for the implementation of FlashAttention. The SMP library supports FlashAttention, and an additional configuration tip is available at [Support for FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html) in the *SageMaker Developer Guide*.\n",
     "\n",
-    "### Additional Resources\n",
+    "### Additional resources\n",
     "- To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).\n",
     "\n",
     "- To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).\n",
     "\n",
@@ -60,7 +60,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Amazon SageMaker Initialization\n",
+    "## Amazon SageMaker initialization\n",
     "\n",
     "Run the following cell to import SageMaker modules and retrieve information about your current SageMaker work environment, such as your AWS account ID, the AWS Region, and the ARN of your Amazon SageMaker execution role. Upgrade the SageMaker SDK to the latest version. \n",
     "\n",
@@ -657,7 +657,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Set the model configuration by choose one from `google/flan-t5-xxl`, `google/flan-t5-xl` and `google/flan-t5-large`."
+    "Set the model configuration by choosing one from `google/flan-t5-xxl`, `google/flan-t5-xl` and `google/flan-t5-large`."
   ]
  },
 {

From 89d13dd28553e6f336fae6613b304501e0b08c6a Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Fri, 23 Jun 2023 11:11:29 -0700
Subject: [PATCH 3/4] fix model name for loading tokenizer

---
 .../flan-t5/smp-train-t5-sharded-data-parallel.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb b/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
index a5cf239f85..c0929e7c6c 100644
--- a/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
+++ b/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
@@ -237,7 +237,7 @@
   "source": [
     "### Load tokenizer\n",
     "Nearly every NLP task begins with a tokenizer. A tokenizer converts your text data into a format (tokens) that can be processed by the NLP model.\n",
-    "The following cell loads a tokenizer for GPT-2 using [AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/v4.19.4/en/autoclass_tutorial#autotokenizer)."
+    "The following cell loads a tokenizer for FLAN-T5 using [AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/v4.19.4/en/autoclass_tutorial#autotokenizer)."
   ]
  },
 {

From 1053c234c47cb037c2068ab43572bc64b66978ae Mon Sep 17 00:00:00 2001
From: Miyoung
Date: Fri, 23 Jun 2023 11:30:54 -0700
Subject: [PATCH 4/4] rewrite about flasn attn support

---
 .../flan-t5/smp-train-t5-sharded-data-parallel.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb b/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
index c0929e7c6c..8d56608078 100644
--- a/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
+++ b/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb
@@ -28,7 +28,7 @@
     "\n",
     "Sharded data parallelism is a distributed training technique that splits the model parameters, gradients, and optimizer states across GPUs in a data parallel group. It is purpose-built for extreme-scale models and leverages Amazon's in-house [MiCS](https://arxiv.org/pdf/2205.00119.pdf) technology, which achieves near-linear scaling efficiency. For large models that cannot fit into a single GPU, we also recommend using the sharded data parallelism technique with [Activation Checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html) and [Activation Offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html) in SMP first, before leveraging other techniques such as tensor parallelism or pipeline parallelism.\n",
     "\n",
-    "The SMP library also supports [FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html), which is only applicable for distributed transformer models (transformer models wrapped by `smp.DistributedModel()`) for model-parallel training. \n",
+    "The SMP library also supports [FlashAttention](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-attention-head-size-for-flash-attention.html) for both distributed and non-distributed transformer models. The FLAN-T5 model is a non-distributed transformer model, and this notebook and the accompanying scripts show how to set up FlashAttention. \n",
     "\n",
     "These two features are also compatible with [Tensor Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html). \n",
     "\n",
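For reference, since the estimator configuration that activates sharded data parallelism is only partially visible in the hunks above, the following is a minimal, illustrative sketch of how the `sharded_data_parallel_degree` described in this patch series is typically passed to a SageMaker PyTorch estimator. It is not the notebook's exact code: the instance type and count, parallelism degree, framework and Python versions, job name, and channel names are assumptions for illustration only.

```python
# Illustrative sketch only -- values below (instance type/count, degree, versions,
# channel names) are assumptions, not the notebook's actual settings.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = sagemaker.get_execution_role()

hyperparameters = {
    "model_name": "google/flan-t5-xl",   # one of the FLAN-T5 variants listed above (assumed)
    "sharded_data_parallel_degree": 8,   # parsed by train.py, as described in the notebook
}

# SMP (model parallelism) options: the sharding degree is mirrored from the hyperparameter.
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": hyperparameters["sharded_data_parallel_degree"],
    },
}

# MPI options, including the flag the notebook mentions for removing unnecessary logs.
mpi_options = {
    "enabled": True,
    "processes_per_host": 8,  # one process per GPU on the assumed instance type
    "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none",
}

estimator = PyTorch(
    entry_point="train.py",
    source_dir=".",                    # directory containing train.py and the helper scripts
    role=role,
    instance_type="ml.p4d.24xlarge",   # assumed; 8 GPUs per instance
    instance_count=2,                  # world size of 16 GPUs with the assumed instance type
    framework_version="1.13.1",        # assumed PyTorch version compatible with SMP
    py_version="py39",
    hyperparameters=hyperparameters,
    distribution={"smdistributed": {"modelparallel": smp_options}, "mpi": mpi_options},
    base_job_name="smp-flan-t5-sharded-data-parallel",  # assumed job name
    sagemaker_session=session,
)

# Launch the job with the data channels defined earlier in the notebook (names assumed).
# estimator.fit({"train": s3_train_bucket, "test": s3_test_bucket})
```

In this sketch the degree lives in `hyperparameters` (as the notebook's markdown describes) and is mirrored into the `smdistributed` distribution parameters, which is the part of the estimator configuration that activates sharding for the training job.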