# TensorSets

TensorSets are a third-party resource to manage TensorFlow training clusters running in Kubernetes.

What's new

This is the initial release of the tensorsets repo.

Known issues

This is a proof of concept (POC) and is not intended for production use. Expect errors if you run it there.

Walkthrough

First, we define our ThirdPartyResource. This declares a new Kubernetes object type called TensorSet.

kubectl create -f kubernetes/tensorset-tpr-v0.yaml

Next, we deploy our TensorSet controller. The controller is a small app that performs actions based on TensorSet objects.

kubectl create -f kubernetes/tensorset-controller-v0.yaml

Now we create our first TensorSet:

kubectl create -f kubernetes/cluster1-ts-v0.yaml

The TensorSet controller will create your training cluster, and after a short while its pods will appear in your current namespace.
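If you would rather script the wait than watch the pod list by hand, the following is a minimal sketch. It assumes the cluster pods carry the same `ts-cluster-name=cluster1` label used later in this walkthrough, and it treats a `Running` phase as a rough stand-in for readiness. The helper name `all_pods_running` and the `KUBECTL` variable are hypothetical and not part of this repo.

```shell
# Hypothetical helper, not part of the tensorsets repo.
# KUBECTL can be overridden to point at a different kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Succeeds only when at least one pod matches the selector
# and every matching pod reports the phase "Running".
all_pods_running() {
  selector="$1"
  phases=$("$KUBECTL" get pods --selector="$selector" \
    --output=jsonpath='{.items[*].status.phase}')
  # No pods yet: the controller has not created them.
  [ -n "$phases" ] || return 1
  for phase in $phases; do
    [ "$phase" = "Running" ] || return 1
  done
}

# Usage (polls every 5 seconds):
#   until all_pods_running ts-cluster-name=cluster1; do sleep 5; done
```

A pod can be `Running` before its containers pass their readiness checks, so treat this as an approximation and confirm with `kubectl get pods` if the job fails to start.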

Once they are all ready, start a training job:

kubectl create -f kubernetes/cluster1-job-v0.yaml

To see the progress of your job:

pods=$(kubectl get pods --selector=ts-cluster-name=cluster1 --output=jsonpath='{.items..metadata.name}')
kubectl logs -f $pods
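If the selector matches more than one pod, `kubectl logs` will not accept the whole list as a single pod name. One way around that is to loop over the names, as in this minimal sketch. The function name `tail_cluster_logs`, the `KUBECTL` variable, and the choice of 20 lines are illustrative assumptions rather than anything defined by this repo.

```shell
# Hypothetical helper, not part of the tensorsets repo.
# KUBECTL can be overridden to point at a different kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Print the last 20 log lines of every pod labelled with the given cluster name.
tail_cluster_logs() {
  cluster="$1"
  pods=$("$KUBECTL" get pods --selector="ts-cluster-name=${cluster}" \
    --output=jsonpath='{.items..metadata.name}')
  # Pod names contain no whitespace, so word splitting is safe here.
  for pod in $pods; do
    echo "== ${pod} =="
    "$KUBECTL" logs --tail=20 "$pod"
  done
}

# Usage:
#   tail_cluster_logs cluster1
```

This prints a snapshot instead of following the logs, since `-f` would block on the first pod and never reach the others.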

Once done with your training cluster, delete it:

kubectl delete tensorset cluster1

And your cluster will be gone!

Roadmap