
# Azure Databricks operator (for Kubernetes)

This project is experimental. Expect the API to change. It is not recommended for production environments.

## Introduction

Kubernetes offers the facility of extending its API through the concept of 'Operators' (Introducing Operators: Putting Operational Knowledge into Software). This repository contains the resources and code to deploy an Azure Databricks operator for Kubernetes.

The operator is a Kubernetes controller that watches Custom Resource Definitions (CRDs) that define a Databricks job.


The Databricks operator is useful in situations where Kubernetes hosted applications wish to launch and use Databricks data engineering and machine learning tasks.

The project was built using:

1. Kubebuilder
2. The Golang SDK for the Azure Databricks REST API 2.0


## Prerequisites and assumptions

1. You have the `kubectl` command-line tool (kubectl CLI) installed.

2. You have access to a Kubernetes cluster. It can be a locally hosted cluster such as Minikube, Kind, or Docker Desktop with RBAC enabled; to configure a local cluster on your machine, make sure a kubeconfig file is configured. If you opt for Azure Kubernetes Service (AKS), you can fetch credentials with:

   ```shell
   az aks get-credentials --resource-group $RG_NAME --name $Cluster_NAME
   ```

Basic commands to check your cluster:

```shell
kubectl config get-contexts
kubectl cluster-info
kubectl version
kubectl get pods -n kube-system
```

## How to use the operator

Documentation is a work in progress.

### Quick start

1. Download the latest release.zip:

   ```shell
   wget https://github.com/microsoft/azure-databricks-operator/releases/latest/download/release.zip
   unzip release.zip
   ```

2. Create the `azure-databricks-operator-system` namespace:

   ```shell
   kubectl create namespace azure-databricks-operator-system
   ```

3. Generate a Databricks token, and create a Kubernetes secret with values for `DATABRICKS_HOST` and `DATABRICKS_TOKEN`:

   ```shell
   kubectl --namespace azure-databricks-operator-system create secret generic dbrickssettings \
     --from-literal=DatabricksHost="https://xxxx.azuredatabricks.net" \
     --from-literal=DatabricksToken="xxxxx"
   ```

4. Apply the manifests for the CRD and operator in `release/config`:

   ```shell
   kubectl apply -f release/config
   ```

5. Create a test secret; you can pass the values of Kubernetes secrets into your notebook as Databricks secrets:

   ```shell
   kubectl create secret generic test-secret --from-literal=my_secret_key="my_secret_value"
   ```
6. In Databricks, create a new Python notebook called `test-notebook` in the root of your workspace. Put the following in the first cell of the notebook:

   ```python
   run_name = dbutils.widgets.get("run_name")
   secret_scope = run_name + "_scope"

   secret_value = dbutils.secrets.get(scope=secret_scope, key="dbricks_secret_key")  # this will come from a Kubernetes secret
   print(secret_value)  # will be redacted

   value = dbutils.widgets.get("flag")
   print(value)  # 'true'
   ```
7. Define your notebook job and apply it (for example, save the manifest as `notebookjob.yaml` and run `kubectl apply -f notebookjob.yaml`):

   ```yaml
   apiVersion: databricks.microsoft.com/v1
   kind: NotebookJob
   metadata:
     annotations:
       databricks.microsoft.com/author: [email protected]
     name: sample1run1
   spec:
     notebookTask:
       notebookPath: "/test-notebook"
     timeoutSeconds: 500
     notebookSpec:
       "flag": "true"
     notebookSpecSecrets:
       - secretName: "test-secret"
         mapping:
           - "secretKey": "my_secret_key"
             "outputKey": "dbricks_secret_key"
     notebookAdditionalLibraries:
       - type: "maven"
         properties:
           coordinates: "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.9"
     clusterSpec:
       sparkVersion: "5.2.x-scala2.11"
       nodeTypeId: "Standard_DS12_v2"
       numWorkers: 1
   ```
8. Check the NotebookJob and the operator pod:

   ```shell
   # list all notebook jobs
   kubectl get notebookjob
   # describe a notebook job
   kubectl describe notebookjob sample1run1
   # get pods
   kubectl -n azure-databricks-operator-system get pods
   # describe the manager pod
   kubectl -n azure-databricks-operator-system describe pod azure-databricks-operator-controller-manager-xxxxx
   # get logs from the manager container
   kubectl -n azure-databricks-operator-system logs azure-databricks-operator-controller-manager-xxxxx -c manager
   ```

9. Check that the job ran with the expected output in the Databricks UI.

## Run the source code

1. Clone the repo; make sure your Go path points to `microsoft/azure-databricks-operator`.

2. Install the NotebookJob CRD into the Kubernetes cluster configured in `~/.kube/config`: run `kubectl apply -f databricks-operator/config/crds` or `make install -C databricks-operator`.

3. Set the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`.

   Windows command line:

   ```shell
   set DATABRICKS_TOKEN=xxxx
   set DATABRICKS_HOST=https://xxxx.azuredatabricks.net
   ```

   bash:

   ```shell
   export DATABRICKS_TOKEN=xxxx
   export DATABRICKS_HOST=https://xxxx.azuredatabricks.net
   ```

   Make sure your secret mapping is set correctly in `config/default/manager_image_patch.yaml`.

4. Install [Kustomize](https://github.com/kubernetes-sigs/kustomize) and deploy the controller to the Kubernetes cluster configured in `~/.kube/config`:

   ```shell
   kustomize build config/default | kubectl apply -f -
   ```

5. Change the NotebookJob name from `sample1run1` to your desired name, set the Databricks notebook path, and update the values in `databricks_v1_notebookjob.yaml` to reflect your Databricks environment, then apply it:

   ```shell
   kubectl apply -f databricks-operator/config/samples/databricks_v1_notebookjob.yaml
   ```
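The secret mapping in `config/default/manager_image_patch.yaml` typically injects the `dbrickssettings` secret into the manager container as environment variables. The exact contents of the file may differ between versions, so treat the following as an illustrative sketch only (the image name is a placeholder):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  template:
    spec:
      containers:
        - name: manager
          image: IMAGE_URL  # replaced at build/deploy time
          env:
            - name: DATABRICKS_HOST
              valueFrom:
                secretKeyRef:
                  name: dbrickssettings
                  key: DatabricksHost
            - name: DATABRICKS_TOKEN
              valueFrom:
                secretKeyRef:
                  name: dbrickssettings
                  key: DatabricksToken
```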

## How to extend the operator and build your own images

This repo was generated by Kubebuilder, version 2.0.0-alpha.4.

To extend the operator `databricks-operator`:

1. Run `go mod tidy` to download dependencies. It shows no progress indicator and can take a while to download all of the dependencies.

2. Update `api/v1/notebookjob_types.go`.

3. Regenerate the CRD manifests: `make manifests`.

4. Install the updated CRD: `make install`.

5. Generate code: `make generate`.

6. Update the controller in `controllers/notebookjob_controller.go`.

7. Update the tests and run `make test`.

8. Build: `make build`.

9. Deploy:

   ```shell
   make docker-build IMG={your-docker-image-name}
   make docker-push IMG={your-docker-image-name}
   make deploy
   ```

## Main Contributors

1. Jordan Knight (GitHub, LinkedIn)
2. Paul Bouwer (GitHub, LinkedIn)
3. Lace Lofranco (GitHub, LinkedIn)
4. Allan Targino (GitHub, LinkedIn)
5. Xinyun (Jacob) Zhou (GitHub, LinkedIn)
6. Jason Goodsell (GitHub, LinkedIn)
7. Craig Rodger (GitHub, LinkedIn)
8. Justin Chizer (GitHub, LinkedIn)
9. Priya Kumaran (GitHub, LinkedIn)
10. Azadeh Khojandi (GitHub, LinkedIn)

## Resources

### Kubernetes on WSL

On the Windows command line, run `kubectl config view` to find the values of `[windows-user-name]`, `[minikubeip]`, and `[port]`, then:

```shell
mkdir ~/.kube \
&& cp /mnt/c/Users/[windows-user-name]/.kube/config ~/.kube
```

If you are using Minikube, you need to apply the settings below:

```shell
kubectl config set-cluster minikube --server=https://<minikubeip>:<port> --certificate-authority=/mnt/c/Users/<windows-user-name>/.minikube/ca.crt
kubectl config set-credentials minikube --client-certificate=/mnt/c/Users/<windows-user-name>/.minikube/client.crt --client-key=/mnt/c/Users/<windows-user-name>/.minikube/client.key
kubectl config set-context minikube --cluster=minikube --user=minikube
```

More info:

  1. https://devkimchi.com/2018/06/05/running-kubernetes-on-wsl/
  2. https://www.jamessturtevant.com/posts/Running-Kubernetes-Minikube-on-Windows-10-with-WSL/

### Build pipelines

1. Create a pipeline and add a status badge to GitHub
2. Customize the status badge with shields.io

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.