Skip to content

Latest commit

 

History

History
185 lines (123 loc) · 7.55 KB

tensorflow-horovod-imagenet.md

File metadata and controls

185 lines (123 loc) · 7.55 KB

Distributed Training using TensorFlow and Horovod on Amazon EKS with ImageNet Data

This document explains how to perform distributed training on Amazon EKS using TensorFlow and Horovod with ImageNet dataset. The following steps can be ued for any data set though.

Prerequisite

  1. Create EKS cluster using GPU.

  2. Install Kubeflow.

  3. Create an EFS and mount it to each worker nodes as explained in efs-on-eks-worker-nodes.md.

ImageNet Data

If you work for Amazon, then reach out to the authors of this document to have access to the data. Otherwise, follow the instructions below.

  1. Download ImageNet dataset to EFS in the data directory. This would typically be in the directory /home/ec2-user/efs/data. Use Download Original Images (for non-commercial research/educational use only) option.

  2. TensorFlow consumes the ImageNet data in a specific format. You can preprocess them by downloading and modifying the script:

    curl -O https://raw.githubusercontent.com/aws-samples/deep-learning-models/master/utils/tensorflow/preprocess_imagenet.sh
    chmod +x preprocess_imagenet.sh
    

    The following values need to be changed:

    1. [your imagenet account]
    2. [your imagenet access key]
    3. [PATH TO TFRECORD TRAINING DATASET]
    4. [PATH TO RESIZED TFRECORD TRAINING DATASET]
    5. [PATH TO TFRECORD VALIDATION DATASET]
    6. [PATH TO RESIZED TFRECORD VALIDATION DATASET]

    Execute the script:

    ./preprocess_imagenet.sh
    

Steps

  1. Create namespace:

    NAMESPACE=kubeflow-dist-train; kubectl create namespace ${NAMESPACE}
    
  2. Create ksonnet app:

    APP_NAME=kubeflow-tf-hvd; ks init ${APP_NAME}; cd ${APP_NAME}
    
  3. Set as default namespace:

    ks env set default --namespace ${NAMESPACE}
    
  4. Create the Persistent Volume (PV) based on EFS. You need to update the name of EFS server in the Kubernetes manifest file. Storage capacity based on dataset size and other requirements can be updated as well.

    kubectl create -f ../training/distributed_training/dist_pv.yaml
    
  5. Create the Persistent Volume Claim (PVC) based on EFS. The storage capacity based on PV's capacity may adjusted in the manifest. The storage capacity of PVC should be the at most storage capacity of PV.

    kubectl create -f ../training/distributed_training/dist_pvc.yaml
    
  6. Make sure that PV has been claimed by PVC under same namespace. You can verify that by running the command:

    kubectl get pv -n ${NAMESPACE}
    NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                              STORAGECLASS   REASON   AGE
    nfs-data   85Gi       RWX            Retain           Bound    kubeflow-dist-train/nfs-external   nfs-external            58s
    
  7. Create secret for ssh access between nodes:

    SECRET=openmpi-secret; mkdir -p .tmp; yes | ssh-keygen -N "" -f .tmp/id_rsa
    kubectl delete secret ${SECRET} -n ${NAMESPACE} || true
    kubectl create secret generic ${SECRET} -n ${NAMESPACE} --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub --from-file=authorized_keys=.tmp/id_rsa.pub
    
  8. Install Kubeflow openmpi component:

    VERSION=master
    ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
    ks pkg install kubeflow/openmpi@${VERSION}
    
  9. Build a Docker image for Horovod using Dockerfile from training/distributed_training/Dockerfile and the command docker image build -t arungupta/horovod .. Alternatively, you can use the image that already exists on Docker Hub:

    IMAGE=rgaut/horovod:latest
    
  10. Define the number of workers (number of machines) and number of GPU available per machine:

    WORKERS=2; GPU=4
    
  11. Formulate the MPI command based on official document from Horovod:

    EXEC="mpiexec -np 8 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --model vgg16 --batch_size 64  --data_name=imagenet  --data_dir=/mnt/data --variable_update horovod --horovod_device gpu --use_fp16'"
    

    If more than one GPU is used, then you may have to replace NCCL_SOCKET_IFNAME=eth0 with NCCL_SOCKET_IFNAME=^docker0.

    MPI command needs some explanations. -np represents a total number process which will be equal to WORKERS * GPU.

  12. Generate the config:

    COMPONENT=openmpi
    ks generate openmpi ${COMPONENT} --image ${IMAGE} --secret ${SECRET} --workers ${WORKERS} --gpu ${GPU} --exec "${EXEC}"
    
  13. Set the parameter for the created volume:

    ks param set ${COMPONENT} volumes '[{ "name": "efs-pvc", "persistentVolumeClaim": { "claimName": "nfs-external" }}]'
    
  14. Set the parameter for volumeMounts:

    ks param set ${COMPONENT} volumeMounts '[{ "name": "efs-pvc", "mountPath": "/mnt"}]'
    

    This command will make the data available under /mnt/data directory for each pod.

  15. Deploy the config to your cluster:

    ks apply default
    
  16. Check the pod status:

    kubectl get pod -n ${NAMESPACE} -o wide
    
  17. Save the log:

    mkdir -p results
    kubectl logs -n ${NAMESPACE} -f ${COMPONENT}-master > results/benchmark_1.out
    

    Here is a sample output.

  18. To iterate quickly. Remove pods, recreate openmpi component, restart from generate openmpi command

    ks delete default
    ks component rm openmpi 
    
  19. [Optional] EXEC command for 4 workers 8 GPU

    WORKERS=4
    GPU=8
    EXEC="mpiexec -np 32 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --model vgg16 --batch_size 64  --data_name=imagenet  --data_dir=/mnt/data --variable_update horovod --horovod_device gpu --use_fp16'"
    

    Then follow the steps from 10-13.

  20. [Optional] To store checkpoints for resnet50 model on efs mnt directory.

    EXEC="mpiexec -np 32 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --num_epochs=25 --batch_size 64 --data_name=imagenet  --data_dir=/mnt/data --train_dir=/mnt/checkpoint  --variable_update horovod --horovod_device gpu --weight_decay=1e-4 --use_fp16'" 
    

    To store the checkpoint which is necessary to perform inference or model evaluation. Please specify --train_dir. One can use either num_epochs or num_batches.