Cluster Management

In this project, we use the Ray cluster launcher to launch clusters. With an existing config file, you can create a cluster with:

ray up cluster.yaml

SSH into the head node of the cluster with:

ray attach cluster.yaml

and terminate the cluster with:

ray down cluster.yaml

Create a cluster config file for development

Here we will create an AMI with Ray, PyTorch, NCCL, and EFS support. To get started, install Ray on your laptop. Before creating a cluster, run pip install boto3 and aws configure on your laptop to set up AWS credentials.

Start a node with initial-cluster.yaml:

ray up initial-cluster.yaml

Then, ssh into the head node with:

ray attach initial-cluster.yaml

After you successfully ssh into the head node, there are several things you need to do:
  0. Install PyTorch with:
    pip install torch torchvision
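  1. Install NVIDIA Apex for FP16 mixed-precision training. The build needs to find the CUDA 10.2 toolkit; the commands below arrange this by repointing the /usr/local/cuda symlink instead of setting the CUDA_HOME environment variable directly: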
    # Change CUDA runtime to 10.2
    sudo ln -sfn /usr/local/cuda-10.2 /usr/local/cuda
    git clone https://github.com/NVIDIA/apex
    cd apex
    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  2. Install NCCL. To get the NCCL installation package onto the head node, upload it from your laptop with:
    ray rsync-up initial-cluster.yaml \
    /path/to/nccl_2.7.8-1+cuda10.1_x86_64.txz \
    /home/ubuntu/
  3. Create an EFS. It serves as a shared NFS file system for all nodes in the cluster. Add the security group ID of the node you just started (it can be found on the AWS Management Console) to the EFS so that your node can access it. After that, install efs-utils to mount the EFS on the node:
    git clone https://github.com/aws/efs-utils
    cd efs-utils
    ./build-deb.sh
    sudo apt-get -y install ./build/amazon-efs-utils*deb
    Then try mounting the EFS on the node:
    mkdir -p ~/efs
    sudo mount -t efs {Your EFS file system ID}:/ ~/efs
    sudo chmod 777 ~/efs
    If the mount hangs, check that the security groups are configured correctly.
  4. Create a placement group on the AWS Management Console. Choose the Cluster placement strategy. This helps keep the interconnect bandwidth between nodes in the cluster high.
  5. Create an SSH key on the head node so that nodes in the cluster can later ssh to each other directly:
    ssh-keygen
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  6. Set up your vimrc, git (username & email), tmux_config, etc. Make yourself comfortable developing in this environment.
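
Before turning the node into an AMI, it can help to confirm that the steps above took effect. The script below is a minimal sketch, not part of this repo: the function name check_head_node and the report keys are made up for illustration, and the default paths simply mirror the ~/efs and ~/.ssh/id_rsa.pub locations used above. It reports the NCCL backend that PyTorch itself was built with, which is not necessarily the package uploaded in step 2.

```python
import importlib.util
import os


def check_head_node(efs_dir="~/efs", ssh_pub_key="~/.ssh/id_rsa.pub"):
    """Return a dict summarizing which setup steps appear to be done.

    Hypothetical helper for illustration; the keys are not an official format.
    """
    report = {
        "torch_version": None,
        "torch_cuda_build": None,
        "cuda_available": False,
        "nccl_backend": False,
    }

    # Steps 0 and 2: PyTorch, the CUDA version it was built against,
    # and whether its distributed package was built with NCCL support.
    try:
        import torch
        import torch.distributed as dist

        report["torch_version"] = torch.__version__
        report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
        report["nccl_backend"] = bool(dist.is_available() and dist.is_nccl_available())
    except ImportError:
        pass  # PyTorch is not installed; keep the defaults above.

    # Step 1: Apex is importable only if its build and install succeeded.
    report["apex_installed"] = importlib.util.find_spec("apex") is not None

    # Step 3: the EFS directory should be a mount point, not a plain folder.
    report["efs_mounted"] = os.path.ismount(os.path.expanduser(efs_dir))

    # Step 5: the public key generated by ssh-keygen should exist.
    report["ssh_key_present"] = os.path.isfile(os.path.expanduser(ssh_pub_key))
    return report


if __name__ == "__main__":
    for name, value in check_head_node().items():
        print(f"{name}: {value}")
```

Run it on the head node with the Python interpreter you installed PyTorch into. Any False or None entry points at the step worth revisiting before you create the AMI.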

After that, go to the AWS Management Console and create an AMI from the current head node. Then, from your laptop, shut the node down with:

ray down initial-cluster.yaml

Finally, make a copy of cluster-template.yaml and fill in all the fields surrounded by curly brackets {} (e.g. AMI ID, EFS ID). Also consider modifying the cluster_name field and the number of nodes. Start the cluster with ray up, clone the repo under ~/efs, and you can start developing. If you install new packages later, you can create new AMIs based on the image you just created.
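
Filling in the curly-bracket fields by hand is easy to get wrong, so one option is to script it. The following is a rough sketch under assumptions: the helper fill_template and the field names "AMI ID" and "EFS ID" are hypothetical, so check the actual placeholder names in cluster-template.yaml before using it. It treats any non-empty {...} on a single line as a field, so a YAML flow mapping such as {a: b} would be reported as a missing field.

```python
import re

# A field is any non-empty text between curly brackets on one line.
PLACEHOLDER = re.compile(r"\{([^{}\n]+)\}")


def fill_template(text, values):
    """Replace every {field} in text with values[field].

    Raises KeyError listing the fields that have no value, so a
    half-filled config is never written out silently.
    """
    found = {m.group(1) for m in PLACEHOLDER.finditer(text)}
    missing = sorted(found - set(values))
    if missing:
        raise KeyError("no value provided for: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), text)


# Inline stand-in for the template; the real file has many more fields.
sample = "ImageId: {AMI ID}\nmount: {EFS ID}:/ ~/efs\n"
filled = fill_template(
    sample,
    {"AMI ID": "ami-0123456789abcdef0", "EFS ID": "fs-01234567"},  # made-up IDs
)
print(filled)
```

To apply it to the real template, read cluster-template.yaml into a string, pass it through fill_template, and write the result to a new file that you then give to ray up.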