Before deployment or maintenance, users should have the cluster configuration files ready.
You can find example configuration files in pai/cluster-configuration/.
Note: Please do not rename the configuration files, and put all 4 files in the same directory.
- Set up cluster-configuration.yaml
- Set up k8s-role-definition.yaml
- Set up kubernetes-configuration.yaml
- Set up services-configuration.yaml
- Kubernetes High Availability Configuration
An example cluster-configuration.yaml is available here. In the following we explain the fields in the yaml file one by one.
default-machine-properties:
  # A Linux host account with sudo permission
  username: username
  password: password
  sshport: port
Set the default values of username, password, and sshport in default-machine-properties. PAI uses these defaults to access the cluster machines. Users can override the default access information for each machine in machine-list.
machine-sku:
  NC24R:
    mem: 224
    gpu:
      type: teslak80
      count: 4
    cpu:
      vcore: 24
    # Note: Currently the only supported OS version is Ubuntu 16.04. Please do not change it here.
    os: ubuntu16.04
In this field, you can define several SKUs with different names. Each machine in machine-list should refer to one of them.
- mem: memory
- gpu: If there is no GPU on this SKU, you can remove this field.
- os: Currently only Ubuntu is supported, and PAI has only been tested on version 16.04 LTS.
machine-list:
  - hostname: hostname (echo `hostname`)
    hostip: IP
    machine-type: D8SV3
    etcdid: etcdid1
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: master
    dashboard: "true"
    zkid: "1"
    pai-master: "true"
  - hostname: hostname
    hostip: IP
    machine-type: D8SV3
    etcdid: etcdid2
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: master
    node-exporter: "true"
  - hostname: hostname
    hostip: IP
    machine-type: NC24R
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: worker
    pai-worker: "true"
- hostname: Required. You can get the hostname by running the command echo `hostname` on the host.
- hostip: Required. The IP address of the corresponding host.
- machine-type: Required. The SKU name defined in machine-sku.
- etcdid: Required for k8s masters. etcd is part of the Kubernetes master. If you assign k8s-role: master to a node, you should set this field. The value is used when starting and fixing k8s.
- sshport, username, password: Optional. Set these if this machine's account or port differs from the default properties; otherwise you can remove them.
- k8s-role: Required. You can set this value to master, worker, or proxy. If you want to configure more than one k8s master, please refer to Kubernetes High Availability Configuration.
- dashboard: Select one node to set this field, and set the value to "true".
- pai-master: Optional. hadoop-name-node, hadoop-resource-manager, frameworklauncher, restserver, webportal, grafana, prometheus and node-exporter will be deployed on a pai-master node.
- zkid: Unique zookeeper id required by pai-master node(s). You can set this field from 1 to n.
- pai-worker: Optional. hadoop-data-node, hadoop-node-manager, and node-exporter will be deployed on a pai-worker node.
- node-exporter: Optional. You can assign this label to nodes to enable hardware and service monitoring.
Note: To deploy PAI in a single box, users should set pai-master and pai-worker labels for the same machine in machine-list section, or just follow the quick deployment approach described in this section.
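The machine-list rules above can be checked mechanically before deployment. The following is a minimal sketch, not part of paictl: check_machine_list is a hypothetical helper that assumes machine-sku and machine-list have already been loaded from cluster-configuration.yaml into a Python dict and list. The host name and IP address are placeholders, and only the rules stated in this section are covered. The sample data describes a single-box setup, with pai-master and pai-worker set on the same machine.

```python
# Hypothetical pre-deployment check; not part of paictl.
VALID_ROLES = {"master", "worker", "proxy"}
REQUIRED_FIELDS = ("hostname", "hostip", "machine-type", "k8s-role")


def check_machine_list(machine_sku, machine_list):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for index, machine in enumerate(machine_list):
        name = machine.get("hostname", f"machine #{index}")
        for field in REQUIRED_FIELDS:
            if field not in machine:
                problems.append(f"{name}: missing required field '{field}'")
        sku = machine.get("machine-type")
        if sku is not None and sku not in machine_sku:
            problems.append(f"{name}: machine-type '{sku}' is not defined in machine-sku")
        role = machine.get("k8s-role")
        if role is not None and role not in VALID_ROLES:
            problems.append(f"{name}: k8s-role '{role}' is not master, worker, or proxy")
        if role == "master" and "etcdid" not in machine:
            problems.append(f"{name}: a k8s master must set etcdid")
        if machine.get("pai-master") == "true" and "zkid" not in machine:
            problems.append(f"{name}: a pai-master node must set zkid")
    dashboards = sum(1 for m in machine_list if m.get("dashboard") == "true")
    if dashboards != 1:
        problems.append(f"expected exactly one dashboard node, found {dashboards}")
    return problems


# Single-box example: one machine carries both the pai-master and pai-worker labels.
machine_sku = {"NC24R": {"mem": 224}}
single_box = [{
    "hostname": "pai-box", "hostip": "10.0.0.4", "machine-type": "NC24R",
    "etcdid": "etcdid1", "k8s-role": "master", "dashboard": "true",
    "zkid": "1", "pai-master": "true", "pai-worker": "true",
}]
print(check_machine_list(machine_sku, single_box))
```

An empty result only means that none of the listed rules was violated; it does not guarantee that the deployment will succeed.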
An example k8s-role-definition.yaml file is available here. The file is used to bootstrap a k8s cluster. It includes a list of k8s components and specifies which components should be included in each k8s role (master, worker, and proxy). By default, users do not need to change this file.
An example kubernetes-configuration.yaml file is available here. The yaml file includes the following fields.
kubernetes:
  cluster-dns: IP
  load-balance-ip: IP
  service-cluster-ip-range: 10.254.0.0/16
  storage-backend: etcd3
  docker-registry: docker.io/openpai
  hyperkube-version: v1.9.4
  etcd-version: 3.2.17
  apiserver-version: v1.9.4
  kube-scheduler-version: v1.9.4
  kube-controller-manager-version: v1.9.4
  # http://gcr.io/google_containers/kubernetes-dashboard-amd64
  dashboard-version: v1.8.3
- cluster-dns: Find the nameserver address in /etc/resolv.conf.
- load-balance-ip: If the cluster has only one k8s master, set this field to the IP address of your k8s master. If there is more than one k8s master, please refer to Kubernetes High Availability Configuration.
- service-cluster-ip-range: Please specify an IP range that does not overlap with the host network in the cluster. E.g., use the 169.254.0.0/16 link-local IPv4 address range according to RFC 3927, which usually will not overlap with your cluster IPs.
- storage-backend: ETCD major version. If you are not familiar with etcd, please do not change it.
- docker-registry: The Docker registry used in the k8s deployment. To use the official k8s Docker images, set this field to gcr.io/google_containers; the deployment process will then pull the Kubernetes components' images from gcr.io/google_containers/hyperkube. You can also set the Docker registry to openpai.docker.io (or docker.io/pai), which is maintained by PAI.
- hyperkube-version: The version of hyperkube. If the registry is gcr, you can find the version tag here.
- etcd-version: The version of etcd. If you are not familiar with etcd, please do not change it. If the registry is gcr, you can find the version tag here.
- apiserver-version: The version of apiserver. If the registry is gcr, you can find the version tag here.
- kube-scheduler-version: The version of kube-scheduler. If the registry is gcr, you can find the version tag here.
- kube-controller-manager-version: The version of kube-controller-manager. If the registry is gcr, you can find the version tag here.
- dashboard-version: The version of kubernetes-dashboard. If the registry is gcr, you can find the version tag here.
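The requirement that service-cluster-ip-range must not overlap with the host network can be verified with Python's standard ipaddress module. This is a minimal sketch under stated assumptions: ranges_overlap and hosts_inside_range are hypothetical helper names, and the host network 10.0.0.0/8 and the host IPs are made-up sample values, not values taken from PAI.

```python
import ipaddress


def ranges_overlap(service_cluster_ip_range, host_network):
    """True if the service IP range shares any address with the host network."""
    service = ipaddress.ip_network(service_cluster_ip_range)
    return service.overlaps(ipaddress.ip_network(host_network))


def hosts_inside_range(service_cluster_ip_range, host_ips):
    """Return the host IPs that fall inside the service IP range."""
    service = ipaddress.ip_network(service_cluster_ip_range)
    return [ip for ip in host_ips if ipaddress.ip_address(ip) in service]


# Sample values only: a host network of 10.0.0.0/8 contains 10.254.0.0/16,
# but does not touch the link-local range 169.254.0.0/16.
print(ranges_overlap("10.254.0.0/16", "10.0.0.0/8"))
print(ranges_overlap("169.254.0.0/16", "10.0.0.0/8"))
print(hosts_inside_range("10.254.0.0/16", ["10.0.0.4", "10.254.3.7"]))
```

If either check reports an overlap, choose a different service-cluster-ip-range before deploying.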
An example services-configuration.yaml file is available here. The following explains the details of the yaml file.
cluster:
  clusterid: pai-example
  nvidia-drivers-version: 384.111
  docker-verison: 17.06.2
  data-path: "/datastorage"
  docker-registry-info:
    docker-namespace: your_registry_namespace
    docker-registry-domain: your_registry_domain
    # If the docker registry doesn't require authentication, please leave docker-username and docker-password empty
    docker-username: your_registry_username
    docker-password: your_registry_password
    docker-tag: your_image_tag
    # The name of the secret that will be created in Kubernetes for your cluster
    # Must be lower case, e.g., regsecret.
    secret-name: your_secret_name
- clusterid: The ID of the cluster.
- nvidia-drivers-version: Choose a proper NVIDIA driver version for your cluster here.
- docker-verison: The Docker client used by hadoop NM (node manager) to launch Docker containers (e.g., of a deep learning job) in the host environment. Choose a version here.
- data-path: The absolute path on the hosts in your cluster used to store data such as hdfs, zookeeper and yarn. Note: please make sure there is enough space in this path.
- docker-registry-info:
  - docker-namespace: Your registry's namespace. If you choose DockerHub as your Docker registry, fill this field with your username.
  - docker-registry-domain: E.g., gcr.io. If the registry is public, fill docker-registry-domain with the word "public".
  - docker-username: The account of the Docker registry.
  - docker-password: The password of the account.
  - docker-tag: The image tag of the service. You can set a specific version here, or just set latest.
  - secret-name: Must be lower case, e.g., regsecret. The name of the secret that will be created in Kubernetes for your cluster.
Note that we provide a read-only public Docker registry on DockerHub for official releases. To use this registry, the docker-registry-info section should be configured as follows, leaving docker-username and docker-password commented out:
docker-registry-info:
  docker-namespace: openpai
  docker-registry-domain: docker.io
  #docker-username: <n/a>
  #docker-password: <n/a>
  docker-tag: latest # or a specific version, e.g. 0.5.0
  secret-name: <anything>
Users can browse to https://hub.docker.com/r/openpai to see all the repositories in this public docker registry.
hadoop:
  # custom-hadoop-binary-path specifies the path where PAI stores the custom built hadoop-ai
  # Notice: the name should be hadoop-{hadoop-version}.tar.gz
  custom-hadoop-binary-path: /pathHadoop/hadoop-2.9.0.tar.gz
  hadoop-version: 2.9.0
  virtualClusters:
    default:
      description: default queue for all users.
      capacity: 40
    vc1:
      description: VC for Alice's team.
      capacity: 20
    vc2:
      description: VC for Bob's team.
      capacity: 20
    vc3:
      description: VC for Charlie's team.
      capacity: 20
- custom-hadoop-binary-path: Please set a path here for paictl to build hadoop-ai.
- hadoop-version: Please set this to 2.9.0.
- virtualClusters: Hadoop queue settings. Each VC is assigned (capacity / total_capacity * 100%) of the resources. paictl creates the 'default' VC with 0 capacity if it has not been specified. paictl splits resources evenly across the VCs if the total capacity is 0. The capacity of a VC is set to 0 if it is a negative number.
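The capacity rules for virtualClusters can be worked through with a short calculation. The sketch below is not paictl's code: vc_shares is a hypothetical function that applies the four stated rules to a dict shaped like the virtualClusters section. It assumes that negative capacities are clamped to 0 before the total is computed, which the text does not spell out.

```python
def vc_shares(virtual_clusters):
    """Map each VC name to its percentage of the cluster resources."""
    # Rule: a negative capacity is treated as 0 (assumed to happen before summing).
    capacities = {
        name: max(vc.get("capacity", 0), 0)
        for name, vc in virtual_clusters.items()
    }
    # Rule: the 'default' VC is created with 0 capacity if it is not specified.
    capacities.setdefault("default", 0)
    total = sum(capacities.values())
    # Rule: resources are split evenly if the total capacity is 0.
    if total == 0:
        return {name: 100.0 / len(capacities) for name in capacities}
    # Rule: each VC gets capacity / total_capacity * 100%.
    return {name: capacity * 100.0 / total for name, capacity in capacities.items()}


# The example configuration from this section.
example = {
    "default": {"capacity": 40},
    "vc1": {"capacity": 20},
    "vc2": {"capacity": 20},
    "vc3": {"capacity": 20},
}
print(vc_shares(example))
# → {'default': 40.0, 'vc1': 20.0, 'vc2': 20.0, 'vc3': 20.0}
```

Because the shares are relative, the capacities do not have to add up to 100: a configuration with capacities 4, 2, 2, 2 yields the same shares as the example.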
frameworklauncher:
  frameworklauncher-port: 9086
- frameworklauncher-port: Launcher's port. You can use the default value.
restserver:
  server-port: 9186
  jwt-secret: your_jwt_secret
  default-pai-admin-username: your_default_pai_admin_username
  default-pai-admin-password: your_default_pai_admin_password
- server-port: Port for the REST API server. You can use the default value.
- jwt-secret: Secret for signing authentication tokens, e.g., "Hello PAI!"
- default-pai-admin-username: Database admin username, which is also the admin username of PAI.
- default-pai-admin-password: Database admin password.
webportal:
  server-port: 9286
- server-port: Port for webportal. You can use the default value.
grafana:
  grafana-port: 3000
- grafana-port: Port for grafana. You can use the default value.
prometheus:
  prometheus-port: 9091
  node-exporter-port: 9100
- prometheus-port: Port for prometheus. You can use the default value.
- node-exporter-port: Port for node exporter. You can use the default value.
pylon:
  # port of pylon
  port: 80
- port: Port of pylon. You can use the default value.
Single master mode does not provide high availability.
- Set only one node's k8s-role to master.
- Set the load-balance-ip field to your master's IP address.
There are 3 roles in k8s-role-definition. The master role starts a k8s-master component on the specified machine, and the proxy role starts a proxy component on the specified machine. In cluster-configuration.yaml,
- one or more nodes are labeled with k8s-role: master
- one node should be labeled with k8s-role: proxy
- set the field load-balance-ip (in kubernetes-configuration.yaml) to your proxy node's IP address
Note: the proxy node itself is not in HA mode. How to configure the proxy node in HA mode is out of the scope of PAI deployment.
If your cluster has a reliable load-balance server (e.g. in a cloud environment such as Azure), you can set up a load balancer and point the load-balance-ip field in kubernetes-configuration.yaml to it.
- Set the field `load-balance-ip` to the IP address of your load balancer.
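The setups described above differ mainly in where load-balance-ip should point. The sketch below summarizes that decision. It is an illustration only: expected_load_balance_ip is a hypothetical helper, the machine entries are assumed to be dicts loaded from machine-list, and the IP addresses are placeholders. When an external load balancer is used, its address cannot be derived from machine-list, so the function returns None in that case.

```python
def expected_load_balance_ip(machine_list):
    """Return the IP that load-balance-ip should point to, or None if it cannot be derived."""
    masters = [m for m in machine_list if m.get("k8s-role") == "master"]
    proxies = [m for m in machine_list if m.get("k8s-role") == "proxy"]
    if len(masters) == 1 and not proxies:
        # Single-master mode: use the master's own IP address.
        return masters[0]["hostip"]
    if masters and len(proxies) == 1:
        # Proxy-based HA: use the proxy node's IP address.
        return proxies[0]["hostip"]
    # Several masters without a proxy node: an external load balancer is
    # expected, and its address has to be filled in by hand.
    return None


single_master = [
    {"hostip": "10.0.0.4", "k8s-role": "master"},
    {"hostip": "10.0.0.5", "k8s-role": "worker"},
]
proxy_ha = [
    {"hostip": "10.0.0.4", "k8s-role": "master"},
    {"hostip": "10.0.0.6", "k8s-role": "master"},
    {"hostip": "10.0.0.7", "k8s-role": "proxy"},
]
print(expected_load_balance_ip(single_master))  # → 10.0.0.4
print(expected_load_balance_ip(proxy_ha))       # → 10.0.0.7
```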