Kubernetes-Jetson-GPU-Clusters

This repository is a guide to setting up a GPU cluster of Nvidia Jetson devices with Kubernetes. A "Single Master Multi Slaves" topology is implemented and tested using an Nvidia Jetson Xavier (master) and an Nvidia Jetson TX2 (slave) with JetPack 4.2.2.

Nvidia Jetson Xavier

The Nvidia Jetson AGX Xavier Developer Kit is the latest addition to the Jetson platform. It is an AI computer for autonomous machines that delivers the performance of a GPU workstation in an embedded module under 30W. It is well suited to robots, drones and other autonomous machines.

Nvidia Jetson TX2

The Nvidia Jetson TX2 is a deep learning platform that provides exceptional speed and power efficiency in an embedded AI computing device. This supercomputer-on-a-module brings true AI computing to the edge. A wide range of standard hardware interfaces is supported to fit a variety of products and form factors.

GPU Cluster Setup

A. Master Node

The actual configuration steps I performed are written in Installation_Master; some of them are not essential. You may refer to it if you have any difficulties. Below I briefly explain the essential steps for setting up the master node.

Install JetPack 4.2.2 with SDK Manager. It is not necessary to install TensorFlow.
Perform a system update and upgrade

$ sudo apt-get update && sudo apt-get upgrade

Install resources monitoring

$ sudo -H pip install -U jetson-stats
$ sudo jtop

Set to the highest power mode: MODE 30W ALL

$ sudo nvpmodel -m 3

Disable swap, since it may cause issues with Kubernetes. Please note that swap is re-activated every time the system starts, so remember to disable it after each boot.

$ sudo swapoff -a
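Because swap comes back after every reboot, it can help to confirm its state before starting Kubernetes. The sketch below is a hypothetical helper (the name swap_active is not part of this repository) that inspects /proc/swaps, where the first line is a header and every further line is an active swap device.

```shell
# Hypothetical helper: succeed if the given /proc/swaps-style file
# lists at least one active swap device (any line after the header).
swap_active() {
    swaps_file="${1:-/proc/swaps}"
    [ -r "$swaps_file" ] || return 1
    [ "$(tail -n +2 "$swaps_file" | grep -c .)" -gt 0 ]
}

if swap_active; then
    echo "swap is active; run: sudo swapoff -a"
else
    echo "swap is off"
fi
```

On JetPack images the swap is usually zram-based and set up by a service at boot. Assuming your image uses the nvzramconfig service for this, disabling that service is one way to keep swap off across reboots.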

Edit /etc/docker/daemon.json

$ sudo gedit /etc/docker/daemon.json

You can simply replace all the content of daemon.json with Kubernetes-Jetson-GPU-Clusters/Maintainance/daemon.json.
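For orientation, the sketch below writes a daemon.json that makes the NVIDIA container runtime the Docker default, which is the usual reason for editing this file on a Jetson. It is an assumed typical configuration, not a copy of the file in Maintainance, and write_daemon_json is a hypothetical helper name.

```shell
# Hypothetical helper: write a daemon.json that makes the NVIDIA runtime
# the default, so containers started by Kubernetes can see the GPU.
# Assumed typical configuration; the repository's own file may differ.
write_daemon_json() {
    cat > "$1" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Write to a scratch file first and review it before copying it into /etc/docker.
write_daemon_json "${TMPDIR:-/tmp}/daemon.json.example"
```

After copying the reviewed file to /etc/docker/daemon.json, Docker has to be restarted (or the board rebooted) for the change to take effect.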

Refresh system

$ sudo apt-get update
$ sudo apt-get dist-upgrade

Add current user to docker group

$ sudo groupadd docker
$ sudo usermod -aG docker ianvidia
$ newgrp docker
$ sudo reboot

Please change ianvidia to your own account.

Test Docker GPU support

$ sudo docker run -it jitteam/devicequery ./deviceQuery

The first time you execute this command, Docker will report that the image is not available locally. Please be patient while the system pulls the image from Docker Hub. If the setup is correct, you should get a PASS in this test.
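The PASS mentioned above is the final "Result = PASS" line printed by deviceQuery. If you want to check it from a script, a minimal sketch could look like this (devicequery_passed is a hypothetical helper name):

```shell
# Hypothetical helper: read deviceQuery output on stdin and succeed
# only if it contains the final "Result = PASS" line.
devicequery_passed() {
    grep -q 'Result = PASS'
}

# Usage on a Jetson (not run here; -it is dropped because the output is piped):
#   sudo docker run --rm jitteam/devicequery ./deviceQuery | devicequery_passed && echo "GPU OK"
```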

Install curl

$ sudo apt install curl

Install Kubernetes (k8s)

$ sudo apt-get update && sudo apt-get install -y apt-transport-https gnupg2
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
$ echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
$ sudo apt-get update
$ sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni

Configure Master Node

$ sudo kubeadm init --pod-network-cidr=10.244.10.0/16 --kubernetes-version 1.18.2

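This command should only be executed on the master node. Never run it on a slave!

Please save the join command (token and CA certificate hash) printed at the bottom of the output. The token expires after 24 hours by default; if it is lost or expired, a new join command can be printed on the master with kubeadm token create --print-join-command.

If you have any problem after initializing the cluster, you can reset it with the following.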

$ sudo kubeadm reset

Read the output of kubeadm init carefully and perform the following

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

Apply flannel to the cluster

$ sudo sysctl net.bridge.bridge-nf-call-iptables=1
$ curl -O https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
$ sudo kubectl apply -f kube-flannel.yml

List out all the nodes

$ sudo kubectl get nodes

The master should be READY. Please ignore the "workers" in the picture: since you have not configured any slave or joined it to the cluster yet, they cannot be seen at this step.
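If the cluster grows, reading the STATUS column by eye gets tedious. As a sketch, the hypothetical helper below filters the node list and prints only the nodes that are not READY; the node names in any example are made up.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Usage on the master (not run here):
#   sudo kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reports Ready. Note that a cordoned node (STATUS "Ready,SchedulingDisabled") is also listed by this filter.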

If you encounter an error message like "The connection to the server localhost:8080 was refused - did you specify the right host or port?", execute the following to solve this issue.

$ sudo cp /etc/kubernetes/admin.conf $HOME/
$ sudo chown $(id -u):$(id -g) $HOME/admin.conf
$ export KUBECONFIG=$HOME/admin.conf
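The localhost:8080 error generally means kubectl did not find a kubeconfig file. The sketch below is a hypothetical helper that reports which of the locations used in this guide is actually readable for the current user; the search order is an assumption for illustration, not kubectl's own lookup logic.

```shell
# Hypothetical helper: print the first readable kubeconfig candidate,
# checking $KUBECONFIG, then ~/.kube/config, then ~/admin.conf.
find_kubeconfig() {
    for candidate in "${KUBECONFIG:-}" "$HOME/.kube/config" "$HOME/admin.conf"; do
        if [ -n "$candidate" ] && [ -r "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

find_kubeconfig || echo "no readable kubeconfig found for $HOME"
```

Keep in mind that export only affects the current shell, so the KUBECONFIG line has to be repeated in every new terminal or added to your shell profile.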

Assuming some slaves have joined the cluster and are READY, change the slave role label from <none> to worker

$ sudo kubectl label node jetson-tx2-004 node-role.kubernetes.io/worker=worker

jetson-tx2-004 is the device name of the slave.

Check whether the cluster can support GPU

$ sudo kubectl apply -f gpu-clusters-test.yml
$ kubectl logs devicequery

The gpu-cluster-test.yml is inside YMAL-Config/pod. You should get another PASS in this test.
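To give an idea of what such a test pod can look like, the sketch below writes a minimal manifest that runs deviceQuery once in a pod named devicequery, matching the kubectl logs command above. This is an assumed example, not the content of the repository file, and it relies on the NVIDIA runtime being the Docker default on the node.

```shell
# Hypothetical helper: write a minimal Pod manifest that runs deviceQuery once.
# Assumed sketch only; the repository's own test manifest may differ.
write_devicequery_pod() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: devicequery
spec:
  restartPolicy: Never
  containers:
    - name: devicequery
      image: jitteam/devicequery
      command: ["./deviceQuery"]
EOF
}

# Write to a scratch file for review; apply it with kubectl only on the master.
write_devicequery_pod "${TMPDIR:-/tmp}/devicequery-pod.example.yml"
```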

Apply customized image

$ sudo kubectl apply -f deeplearning-gpu-cluster.ymal
$ sudo kubectl get pod

The correct output should be the following.

If you want more detail about this pod, run

$ sudo kubectl describe pod deeplearning

Attach to the customized container only for the deeplearning-gpu-cluster

$ ./access_jetson_tensorflow.sh

B. Slave Node

The actual configuration steps I performed are written in Installation_Slave; some of them are not essential. You may refer to it if you have any difficulties. Below I briefly explain the essential steps for setting up the slave nodes.

Install JetPack 4.2.2 with SDK Manager. It is not necessary to install TensorFlow.
Perform a system update and upgrade

$ sudo apt-get update && sudo apt-get upgrade

Install resources monitoring

$ sudo -H pip install -U jetson-stats
$ sudo jtop

Set to the highest power mode: MODE MAXN

$ sudo nvpmodel -m 0

Disable swap, since it may cause issues with Kubernetes. Please note that swap is re-activated every time the system starts, so remember to disable it after each boot.

$ sudo swapoff -a

Edit /etc/docker/daemon.json

$ sudo gedit /etc/docker/daemon.json

You can simply replace all the content of daemon.json with Kubernetes-Jetson-GPU-Clusters/Maintainance/daemon.json.

Refresh system

$ sudo apt-get update
$ sudo apt-get dist-upgrade

Add current user to docker group

$ sudo groupadd docker
$ sudo usermod -aG docker nvidiatx2-004
$ newgrp docker
$ sudo reboot

Please change nvidiatx2-004 to your own account.

Test Docker GPU support

$ sudo docker run -it jitteam/devicequery ./deviceQuery

The first time you execute this command, Docker will report that the image is not available locally. Please be patient while the system pulls the image from Docker Hub. If the setup is correct, you should get a PASS in this test.

Install curl

$ sudo apt install curl

Install Kubernetes (k8s)

$ sudo apt-get update && sudo apt-get install -y apt-transport-https gnupg2
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
$ echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
$ sudo apt-get update
$ sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni

Join cluster created by master

$ ./slave_join_master.sh

The tokens inside the script depend on the cluster that was created, so you have to replace them with your own. Afterwards, the master should be able to see the node.
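A copy-paste slip in the token is an easy mistake to make when editing the script. kubeadm bootstrap tokens have the form of 6 characters, a dot, and 16 characters, all lowercase letters or digits, so a quick format check can be sketched as follows. The helper name and the token shown are made up for illustration.

```shell
# Hypothetical helper: succeed if the argument looks like a kubeadm bootstrap
# token, i.e. 6 characters, a dot, then 16 characters, all from [a-z0-9].
valid_join_token() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# Example with a made-up token (replace with the one printed by kubeadm init).
if valid_join_token "abcdef.0123456789abcdef"; then
    echo "token format looks valid"
fi
```

This only checks the shape of the token, not whether the master still accepts it.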

Activate the Kubernetes service (every time the node powers on)

$ ./start_nodes.sh

This script can be found in Maintainance.

C. Common Issue

ERROR "no such files -> /run/flannel/subnet.env"
Solution: copy the subnet.env file from the master to every slave that does not have it.

Error "cni0" already has an IP address different from 10.244.1.1/24
Solution: run the following on the slave

$ sudo ifconfig  cni0 down
$ sudo brctl delbr cni0
$ sudo ip link delete flannel.1
$ ./Maintainance/reset_node.sh

Then go to the master, remove the node, and let the slave join the cluster again

$ ./Maintainance/slave_join_master.sh
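Both flannel errors above revolve around /run/flannel/subnet.env, a small file of KEY=VALUE lines that includes FLANNEL_SUBNET. When debugging, it can help to print the subnet flannel expects and compare it with the address on cni0. A minimal sketch, with a hypothetical helper name:

```shell
# Hypothetical helper: print the FLANNEL_SUBNET value from a subnet.env-style file.
flannel_subnet() {
    sed -n 's/^FLANNEL_SUBNET=//p' "${1:-/run/flannel/subnet.env}"
}

# Usage on a node (not run here):
#   flannel_subnet            # subnet flannel assigned to this node
#   ip -4 addr show cni0      # address currently held by the bridge
```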

DNS issue: you can ping an IP address but not a domain name, and apt-get update fails inside the container.

$ echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf > /dev/null

Replace 8.8.8.8 with your own DNS server; in my case it is 158.132.14.1.
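Before and after overwriting /etc/resolv.conf, it is worth checking which DNS servers are actually configured. The sketch below uses a hypothetical helper that lists the nameserver entries of a resolv.conf-style file.

```shell
# Hypothetical helper: print every nameserver address in a resolv.conf-style file.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

if [ -r /etc/resolv.conf ]; then
    list_nameservers
fi
```

Be aware that on many systems /etc/resolv.conf is regenerated by the network service, so a manual change may not survive a reboot.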

D. Kubernetes DashBoard

Please perform the following operations on the master and refer to Installation_Dashboard. Here is the reference URL: https://kubernetes.io/zh/docs/tasks/access-application-cluster/web-ui-dashboard/

Apply the official YAML

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml

Start kubectl proxy. Do not close this terminal after execution

$ sudo kubectl proxy

Open the following URL in a browser on the master

http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/

You will be able to see a login page.

To get the token, it is necessary to add an admin account that manages the entire cluster.

$ sudo kubectl create -f admin-role.ymal

The admin-role.ymal can be found in YMAL-Config/services/

Get admin-token secret name

$ sudo kubectl -n kube-system get secret|grep admin-token

Get token value

$ sudo kubectl -n kube-system describe secret admin-token-sm4pn

Replace sm4pn with the suffix of your own secret name.

Copy the token and login.
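Since the suffix of the secret name differs on every cluster, the two token steps above can be chained so that nothing has to be copied by hand. A minimal sketch, assuming the secret name starts with admin-token as shown above (admin_secret_name is a hypothetical helper):

```shell
# Hypothetical helper: read `kubectl -n kube-system get secret` output on stdin
# and print the name of the first secret that starts with "admin-token".
admin_secret_name() {
    awk '$1 ~ /^admin-token/ { print $1; exit }'
}

# Usage on the master (not run here):
#   name=$(sudo kubectl -n kube-system get secret | admin_secret_name)
#   sudo kubectl -n kube-system describe secret "$name"
```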

E. Customized Image vincent51689453/jetson-deeplearning

https://hub.docker.com/repository/docker/vincent51689453/jetson-deeplearning

The steps are all described in Installation_Custom_Image. Please watch out for the DNS issue described in Part C: Common Issue.
