In this project, we use the Ray cluster launcher to launch clusters. With an existing config file, you can create a cluster with:
ray up cluster.yaml
ssh to the head node of the cluster with
ray attach cluster.yaml
and terminate the cluster with
ray down cluster.yaml
Here we will create an AMI with Ray, PyTorch, NCCL and EFS. To get started, install Ray on your laptop. Before actually creating a cluster, run pip install boto3 and aws configure on your laptop to set up AWS credentials.
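Before running ray up, it can help to confirm that the tools this guide relies on are actually installed. The following is a minimal sketch, not part of Ray or the AWS CLI; the function name missing_tools is a hypothetical helper introduced here for illustration.

```shell
#!/usr/bin/env bash
# Preflight sketch: report which required command-line tools are missing.
# `missing_tools` is a hypothetical helper, not part of Ray or AWS.

missing_tools() {
  # Print the name of every argument that is not found on PATH.
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Example: list whichever of the tools used in this guide are not installed.
# missing_tools ray aws ssh
```

An empty result means every listed tool was found on PATH. It does not check that the AWS credentials themselves are valid.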
Start a node with initial-cluster.yaml:
ray up initial-cluster.yaml
Then, ssh into the head node with:
ray attach initial-cluster.yaml
After you successfully ssh into the head node, there are several things you need to do:
- Install PyTorch with:
pip install torch torchvision
- Install NVIDIA Apex for FP16 Mixed Precision training. You will need to change the CUDA_HOME environment variable:
# Change CUDA runtime to 10.2
sudo ln -sfn /usr/local/cuda-10.2 /usr/local/cuda
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
- Install NCCL. Specifically, you can upload the NCCL installation package to the head node with:
ray rsync-up initial-cluster.yaml \
  /path/to/nccl_2.7.8-1+cuda10.1_x86_64.txz \
  /home/ubuntu/
- Create an EFS. It is used as shared NFS storage for all nodes in the cluster. Add the security group ID of the node you just started (it can be found on the AWS Management Console) to the EFS so that your node can access it. After that, install efs-utils, which is needed to mount the EFS on the node:
git clone https://github.com/aws/efs-utils
cd efs-utils
./build-deb.sh
sudo apt-get -y install ./build/amazon-efs-utils*deb
You can then try to mount the EFS on the node with:
mkdir -p ~/efs
sudo mount -t efs {Your EFS file system ID}:/ ~/efs
sudo chmod 777 ~/efs
If the mount takes forever, check that the security groups are configured correctly.
- Create a placement group on the AWS Management Console. Choose the Cluster placement strategy. This helps ensure high network bandwidth between the nodes in the cluster.
- Create an SSH key on the head node, so that in the future we can ssh directly between nodes in the cluster:
ssh-keygen
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Set up your vimrc, git (username & email), tmux config, etc. Make yourself comfortable developing in this environment.
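Since the AMI is a snapshot of whatever state the head node is in, it may be worth sanity-checking the steps above before creating it. The following is a rough sketch under stated assumptions: all three function names are hypothetical helpers, and the mount check assumes a Linux-style mounts table where the mount point is the second field.

```shell
#!/usr/bin/env bash
# Sanity-check sketch for the head-node setup steps.
# All function names here are hypothetical helpers, not part of any tool.

# Succeed if symlink $1 resolves to the same directory as $2.
link_points_to() {
  [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Succeed if directory $1 is listed as a mount point in the mounts table $2
# (defaults to /proc/mounts, where the mount point is the second field).
is_mounted() {
  local dir mounts
  dir="$(readlink -f "$1")"
  mounts="${2:-/proc/mounts}"
  awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}

# Append public key file $1 to authorized_keys file $2 unless the key
# is already present, so repeated runs do not add duplicate lines.
authorize_key_once() {
  local pub auth
  pub="$1"; auth="$2"
  touch "$auth"
  if ! grep -qxF -f "$pub" "$auth"; then
    cat "$pub" >> "$auth"
  fi
}

# Example calls on the head node:
# link_points_to /usr/local/cuda /usr/local/cuda-10.2 && echo "CUDA 10.2 active"
# is_mounted ~/efs && echo "EFS mounted"
# authorize_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Each function only reports success or failure through its exit status, so the calls can be chained with && in a single pre-AMI check.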
After that, go to the AWS Management Console, create an AMI from the current head node, and then shut the node down from your laptop with:
ray down initial-cluster.yaml
Finally, make a copy of the cluster-template.yaml and fill in all the fields surrounded by curly brackets {} (e.g. AMI ID, EFS ID). Also consider modifying the cluster_name field and the number of nodes. Start the cluster with ray up, clone the repo under ~/efs, and you can start developing. If you install any new packages, you can create new AMIs based on the image we just created.
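Filling in the curly-bracket fields by hand is easy to get wrong, so a small script can do the substitution and flag anything left unfilled. This is only a sketch: the helper names and the placeholder names in the example (AMI_ID, EFS_ID) are hypothetical, so substitute the names that actually appear in your copy of cluster-template.yaml.

```shell
#!/usr/bin/env bash
# Template-filling sketch. The helper names and the placeholder names used in
# the example (AMI_ID, EFS_ID) are hypothetical; use the names that actually
# appear in your copy of cluster-template.yaml.

# fill_template TEMPLATE OUTPUT NAME=VALUE...
# Copy TEMPLATE to OUTPUT, replacing each {NAME} with VALUE.
fill_template() {
  local template output pair name value
  template="$1"; output="$2"; shift 2
  cp "$template" "$output"
  for pair in "$@"; do
    name="${pair%%=*}"
    value="${pair#*=}"
    # '|' is the sed delimiter, so values must not contain '|', '&' or '\'.
    sed "s|{${name}}|${value}|g" "$output" > "$output.tmp"
    mv "$output.tmp" "$output"
  done
}

# Print any lines that still contain a {placeholder}.
# Succeeds only if the file exists and no such line remains.
no_unfilled_fields() {
  [ -f "$1" ] || return 2
  ! grep -n '{[^}]*}' "$1"
}

# Example:
# fill_template cluster-template.yaml my-cluster.yaml AMI_ID=ami-0abc EFS_ID=fs-0def
# no_unfilled_fields my-cluster.yaml && ray up my-cluster.yaml
```

The leftover check treats any {...} as unfilled, so a config that legitimately uses YAML flow mappings would be reported as well.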