See the following guides in the Kubeflow documentation:
- Concepts in Katib, hyperparameter tuning, and neural architecture search.
- Getting started with Katib.
- Detailed guide to configuring and running a Katib experiment.
After installing Katib v1alpha3, you can run kubectl apply -f katib/examples/v1alpha3/random-example.yaml
to try the first Katib example.
You can then retrieve the new Experiment
as shown below. Katib concepts are introduced using this example.
# kubectl get experiment random-example -n kubeflow -o yaml
apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  ...
  name: random-example
  namespace: kubeflow
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - accuracy
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
    name: --lr
    parameterType: double
  - feasibleSpace:
      max: "5"
      min: "2"
    name: --num-layers
    parameterType: int
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: docker.io/kubeflowkatib/mxnet-mnist
                command:
                - "python3"
                - "/opt/mxnet-mnist/mnist.py"
                - "--batch-size=64"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
status:
  ...
When you want to tune hyperparameters for your machine learning model before
training it further, you only need to create an Experiment CR like the one above. To
learn which fields Experiment.spec includes, see
the detailed guide to configuring and running a Katib experiment.
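To make the parameters section of the example more concrete, here is a minimal Python sketch of how a random search could draw one set of hyperparameters from the two feasible spaces shown above. It is an illustration only, not Katib's implementation: the function name sample_assignment and the flattened dictionary layout are hypothetical, and uniform sampling is an assumption.

```python
import random

# Feasible spaces copied from the random-example Experiment.
# The flat dict layout is a simplification made for this sketch.
PARAMETERS = [
    {"name": "--lr", "parameterType": "double", "min": "0.01", "max": "0.03"},
    {"name": "--num-layers", "parameterType": "int", "min": "2", "max": "5"},
]


def sample_assignment(parameters, rng):
    """Draw one value per parameter, uniformly within its feasible space."""
    assignment = []
    for param in parameters:
        if param["parameterType"] == "double":
            value = rng.uniform(float(param["min"]), float(param["max"]))
        elif param["parameterType"] == "int":
            # randint includes both endpoints.
            value = rng.randint(int(param["min"]), int(param["max"]))
        else:
            raise ValueError("unsupported type: " + param["parameterType"])
        # The CRs in this document show assignment values as strings.
        assignment.append({"name": param["name"], "value": str(value)})
    return assignment


if __name__ == "__main__":
    print(sample_assignment(PARAMETERS, random.Random(0)))
```

Passing a seeded random.Random instance keeps the sketch reproducible; a real algorithm service would manage its own randomness.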
For each set of hyperparameters, Katib internally generates a Trial
CR that contains the hyperparameter key-value pairs, the job manifest string with the parameters instantiated, and some other fields, as shown below. The Trial
CR is used for internal logic control, so end users can ignore it.
# kubectl get trial -n kubeflow
NAME                      STATUS      AGE
random-example-fm2g6jpj   Succeeded   4h
random-example-hhzm57bn   Succeeded   4h
random-example-n8whlq8g   Succeeded   4h
# kubectl get trial random-example-fm2g6jpj -o yaml -n kubeflow
apiVersion: kubeflow.org/v1alpha3
kind: Trial
metadata:
  ...
  name: random-example-fm2g6jpj
  namespace: kubeflow
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha3
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-example
    uid: c7bbb111-de6b-11e9-a6cc-00163e01b303
spec:
  metricsCollector:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - accuracy
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameterAssignments:
  - name: --lr
    value: "0.027435456064371484"
  - name: --num-layers
    value: "4"
  - name: --optimizer
    value: sgd
  runSpec: |-
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: random-example-fm2g6jpj
      namespace: kubeflow
    spec:
      template:
        spec:
          containers:
          - name: random-example-fm2g6jpj
            image: docker.io/kubeflowkatib/mxnet-mnist
            command:
            - "python3"
            - "/opt/mxnet-mnist/mnist.py"
            - "--batch-size=64"
            - "--lr=0.027435456064371484"
            - "--num-layers=4"
            - "--optimizer=sgd"
          restartPolicy: Never
status:
  completionTime: 2019-09-24T01:38:39Z
  conditions:
  - lastTransitionTime: 2019-09-24T01:37:26Z
    lastUpdateTime: 2019-09-24T01:37:26Z
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-09-24T01:38:39Z
    lastUpdateTime: 2019-09-24T01:38:39Z
    message: Trial is running
    reason: TrialRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2019-09-24T01:38:39Z
    lastUpdateTime: 2019-09-24T01:38:39Z
    message: Trial has succeeded
    reason: TrialSucceeded
    status: "True"
    type: Succeeded
  observation:
    metrics:
    - name: Validation-accuracy
      value: 0.981489
  startTime: 2019-09-24T01:37:26Z
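The relationship between parameterAssignments and the command list in runSpec can be sketched in a few lines of Python. This only mirrors the effect of the range loop in the Experiment's rawTemplate (one "name=value" argument per hyperparameter); Katib itself renders a Go template, and the helper name render_command is hypothetical.

```python
# Fixed part of the container command, as it appears in the trial template.
BASE_COMMAND = ["python3", "/opt/mxnet-mnist/mnist.py", "--batch-size=64"]


def render_command(base_command, parameter_assignments):
    """Append one "name=value" argument per assignment, in order."""
    extra = [f'{p["name"]}={p["value"]}' for p in parameter_assignments]
    return list(base_command) + extra


if __name__ == "__main__":
    # Assignments copied from the Trial random-example-fm2g6jpj.
    assignments = [
        {"name": "--lr", "value": "0.027435456064371484"},
        {"name": "--num-layers", "value": "4"},
        {"name": "--optimizer", "value": "sgd"},
    ]
    for arg in render_command(BASE_COMMAND, assignments):
        print(arg)
```

With the assignments of random-example-fm2g6jpj, the resulting list matches the command entries in that Trial's runSpec.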
Katib internally creates a Suggestion CR for each Experiment CR. The Suggestion
CR records the name of the hyperparameter algorithm in the algorithmName
field and the number of hyperparameter sets that Katib asks the algorithm to generate in the requests
field. The CR also tracks every set of hyperparameters generated so far in status.suggestions.
Like the Trial CR, the Suggestion
CR is used for internal logic control, so end users can ignore it.
# kubectl get suggestion random-example -n kubeflow -o yaml
apiVersion: kubeflow.org/v1alpha3
kind: Suggestion
metadata:
  ...
  name: random-example
  namespace: kubeflow
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha3
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-example
    uid: c7bbb111-de6b-11e9-a6cc-00163e01b303
spec:
  algorithmName: random
  requests: 3
status:
  ...
  suggestions:
  - name: random-example-fm2g6jpj
    parameterAssignments:
    - name: --lr
      value: "0.027435456064371484"
    - name: --num-layers
      value: "4"
    - name: --optimizer
      value: sgd
  - name: random-example-n8whlq8g
    parameterAssignments:
    - name: --lr
      value: "0.013743390382347042"
    - name: --num-layers
      value: "3"
    - name: --optimizer
      value: sgd
  - name: random-example-hhzm57bn
    parameterAssignments:
    - name: --lr
      value: "0.012495283371215943"
    - name: --num-layers
      value: "2"
    - name: --optimizer
      value: sgd
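The bookkeeping between the requests field and status.suggestions can be illustrated with a short Python sketch. It assumes the number of sets still to generate is the difference between the two, and that new sets are appended to the existing list; the function names and the generate callback are hypothetical and do not come from Katib's code.

```python
def missing_suggestion_count(requests, suggestions):
    """Number of hyperparameter sets still to generate (never negative)."""
    return max(requests - len(suggestions), 0)


def fill_suggestions(requests, suggestions, generate):
    """Return a new list topped up with generated sets until it has `requests` entries."""
    updated = list(suggestions)
    for _ in range(missing_suggestion_count(requests, suggestions)):
        updated.append(generate())
    return updated


if __name__ == "__main__":
    existing = ["random-example-fm2g6jpj", "random-example-n8whlq8g"]
    # With requests=3 and two existing entries, one more set is generated.
    print(fill_suggestions(3, existing, lambda: "new-set"))
```

In the Suggestion shown above, requests is 3 and status.suggestions already holds three entries, so under this assumption nothing more would be generated until requests is raised.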
When a user creates an Experiment CR, the Katib controllers (the experiment controller, trial controller, and suggestion controller) work together to tune the hyperparameters of the user's machine learning model.
1. An Experiment CR is submitted to the Kubernetes API server. The Katib experiment mutating webhook is called to set default values for the Experiment CR, and the validating webhook is called to validate it.
2. The experiment controller creates a Suggestion CR.
3. The suggestion controller creates the algorithm deployment and service based on the new Suggestion CR.
4. When the suggestion controller verifies that the algorithm service is ready, it calls the service to generate spec.requests - len(status.suggestions) sets of hyperparameters and appends them to status.suggestions.
5. The experiment controller finds that the Suggestion CR has been updated and generates a Trial for each new set of hyperparameters.
6. The trial controller generates a job based on the runSpec manifest with the new set of hyperparameters.
7. The related job controller (Kubernetes batch Job, Kubeflow PyTorchJob, or Kubeflow TFJob) generates Pods.
8. The Katib Pod mutating webhook is called to inject the metrics collector sidecar container into the candidate Pod.
9. While the ML model container runs, the metrics collector container in the same Pod tries to collect metrics from it and persists them to the Katib DB backend.
10. When the ML model Job ends, the trial controller updates the status of the corresponding Trial CR.
11. When a Trial CR finishes, the experiment controller increases the requests field of the corresponding Suggestion CR if needed, and the workflow returns to step 4. If the Trial CRs meet one of the end conditions (exceeding maxTrialCount, maxFailedTrialCount, or goal), the experiment controller marks the experiment as finished.
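The end-condition check in the last step can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the experiment controller's logic: it handles only a maximize objective, treats each limit as reached with a ">=" comparison, and assumes a "Failed" status value alongside the "Succeeded" value shown earlier. The function name end_condition and the trial dictionaries are hypothetical.

```python
def end_condition(trials, max_trial_count, max_failed_trial_count, goal):
    """Return the name of the first end condition that is met, or None.

    Each trial is a dict with a "status" key and, once it has succeeded,
    an "objective" value. Only a maximize objective is handled here.
    """
    succeeded = [t for t in trials if t["status"] == "Succeeded"]
    failed = [t for t in trials if t["status"] == "Failed"]  # assumed status name

    # Assumption: the goal is met as soon as one trial reaches it.
    if any(t["objective"] >= goal for t in succeeded):
        return "goal"
    # Assumption: ">=" comparisons; the exact thresholds may differ in Katib.
    if len(failed) >= max_failed_trial_count:
        return "maxFailedTrialCount"
    if len(succeeded) + len(failed) >= max_trial_count:
        return "maxTrialCount"
    return None


if __name__ == "__main__":
    # One succeeded trial with the objective value observed in this document.
    trials = [{"status": "Succeeded", "objective": 0.981489}]
    print(end_condition(trials, max_trial_count=12,
                        max_failed_trial_count=3, goal=0.99))
```

With the limits from random-example (maxTrialCount 12, maxFailedTrialCount 3, goal 0.99) and a single succeeded trial at 0.981489, no condition is met under these assumptions, so the loop would continue.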