Add a Spark operator to Kubeflow along with integration tests.

* Start adding converted spark operator elements
* Can generate an empty service account for Spark
* Create the service account for the spark-operator.
* Add a clusterRole for Spark
* Add cluster role bindings for Spark
* Add the deployment (TODO: clean up name/image)
* Can now launch the spark operator (the first sketch after this list shows roughly what gets created)
* Put in a reasonable default for the namespace (e.g. default, not null) and make the image used for spark-operator configurable
* Start working to add the job type
* We can now launch an operator and launch a job, but the service accounts don't quite line up. TODO(holden): refactor the service accounts for the job to only be created in the job, move sparkJob inside of all.json as well, then maybe have all / operator / job entry points in all.json?
* Add two hacked-up temporary test scripts for use during dev (TODO: refactor later into a proper workflow)
* Able to launch a job; fixed coreLimit and added job arguments. Remaining TODOs: handling of nulls, the svc account hack, and test cleanup.
* Start trying to re-organize the operator/job
* Fix handling of the optional jobArguments and mainClass, and now it _works_ :) (see the second sketch after this list)
* Autoformat the ksonnet.
* Reviewer feedback: switch the description of the Spark operator to something meaningful, use the sparkVersion param instead of the hard-coded v2.3.1-v1alpha1, and fix the hardcoded namespace.
* Clarify the jobName param; remove the "Fix this" note since it has been integrated into all.libsonnet as intended.
* CR feedback: fix a description typo and add an optional sparkVersion param to the spark operator
* Start trying to add spark tests to test_deploy.py
* At @kunmingg's suggestion, revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests. This reverts commit 912a763.
* Start trying to add Spark to the e2e workflow for testing
* Looks like the prow tests call into the python tests normally, so revert "At @kunmingg suggestion Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests." This reverts commit 6c4c81f.
* Autoformat jsonnet
* s/core/common/ and add /var/log/syslog to the README
* Race condition on first deployment
* Start adding a SparkPi job to the workflow test.
* Generate the spark operator during CI as well.
* Fix deploy kf indent
* Already covered by deploy.
* Install the spark operator
* Revert "Install spark operator". This reverts commit cc559dd.
* Test against the PR, not master.
* Fix string concat
* Take spark-deploy out of the workflows since it's covered in the kf presubmit anyway.
* Debug commit, revert later. No idea what's going on for real; hacks.
* OK, let's see where the symlink was coming from.
* Debug the deploy kubeflow call...
* Print in main too.
* Specify a name.
* name
* Get all.
* More debugging; also, why do we edit app.yaml directly?
* Don't gen the common import, for debugging
* Just do spark-operator as verbose.
* Fix spelling
* Hmm, the namespace looked weird; let's run pytorch in verbose too so I can compare
* Put verbose at the end
* Autoformat the json
* Add a deploymentScope and give more things a namespace
* Format.
* Gen pytorch and spark ops as verbose
* No idea what this is.
* Don't deploy the spark job in the releaser test
* No kfctl test either.
* Just use name
* We don't append any junk anymore
* Format json
* Don't do spark in deploy_kubeflow anymore
* Spark job deployment with workflows
* Apply the spark operator.
* Add a sleep hack
* Fix multi-line
* Add a working dir for the ks app
* Temp debug garbage
* Specify the working dir
* The working dir was not happy; just cd instead, because why not
* testDir, not appDir
* Change to tests.testDir
* Move the operator deployment
* Make sure we are in the ks_app
* Remove debugging and YOLO
* 90% less YOLO
* Add that comma
* Change deps
* Well, cd seems to work in the other command, so who knows?
* Use runpath + pushd instead of kfctl generate
* Just generate for now
* Do both
* Generate k8s
* Install the operator
* Break down setting up the spark operator into different steps (see the third sketch after this list)
* We are in default rather than ghke
* Use the run script to do the deploy
* Change the namespace to stepsNamespace and add a debug step, because idk
* Append the params to the generate cmd
* Remove params_str since we're passing a list and a namespace param
* s/extends/extend/
* Move the params to the right place
* Remove the debug cluster step
* Remove the local test since we now use the regular e2e argo-triggered tests.
* Respond to the CR feedback
* Fix parameterization of the spark executor config.
* Plumb the spark version through to the executor version label
* Remove an unnecessary whitespace change in an otherwise unmodified file.
* Re-run autoformat
* default doesn't seem to exist anymore
* Debug the env list because it changed
* Re-run autoformat again
* Specify the env, since env list shows the default env is the only env present.
* Remove the debug env list since the operator now works
* Autoformat and indent default
* Address CR feedback: remove deploymentScope and just use a clusterRole, link to the upstream base operator in the doc, and remove the downstream job test since it's triggered in both the minikube and kfctl tests and we don't want to test it in minikube right now
* Take the spark job out of the workflows; in the components test we just test that the operator applies, for now.
* Remove namespace as a param and just use the env.
* Fix the line ending on namespace from ; to ,
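Taken together, the operator commits above amount to four Kubernetes objects: a ServiceAccount, a ClusterRole, a ClusterRoleBinding, and a Deployment, with the namespace and operator image exposed as params. The first sketch below is a minimal jsonnet approximation of that shape, not the shipped component: the param names, image path, and RBAC rules are assumptions; the real prototype reads its params from the ksonnet app's params.libsonnet.

```jsonnet
// Minimal sketch (not the shipped component): hypothetical params standing in
// for the real prototype's params.libsonnet.
local params = {
  namespace: "default",             // reasonable default, not null
  sparkVersion: "v2.3.1-v1alpha1",  // exposed as a param instead of hard-coded
  image: "gcr.io/spark-operator/spark-operator:" + self.sparkVersion,
};

local serviceAccount = {
  apiVersion: "v1",
  kind: "ServiceAccount",
  metadata: { name: "spark-operator", namespace: params.namespace },
};

local clusterRole = {
  apiVersion: "rbac.authorization.k8s.io/v1",
  kind: "ClusterRole",
  metadata: { name: "spark-operator" },
  // Illustrative rules only; the upstream operator needs more than this.
  rules: [{
    apiGroups: [""],
    resources: ["pods", "services", "configmaps"],
    verbs: ["*"],
  }],
};

local clusterRoleBinding = {
  apiVersion: "rbac.authorization.k8s.io/v1",
  kind: "ClusterRoleBinding",
  metadata: { name: "spark-operator" },
  subjects: [{
    kind: "ServiceAccount",
    name: "spark-operator",
    namespace: params.namespace,
  }],
  roleRef: {
    apiGroup: "rbac.authorization.k8s.io",
    kind: "ClusterRole",
    name: "spark-operator",
  },
};

local deployment = {
  apiVersion: "apps/v1",
  kind: "Deployment",
  metadata: { name: "spark-operator", namespace: params.namespace },
  spec: {
    replicas: 1,
    selector: { matchLabels: { app: "spark-operator" } },
    template: {
      metadata: { labels: { app: "spark-operator" } },
      spec: {
        serviceAccountName: "spark-operator",
        containers: [{ name: "spark-operator", image: params.image }],
      },
    },
  },
};

// Emit everything as one List so a single apply creates all of it.
{
  apiVersion: "v1",
  kind: "List",
  items: [serviceAccount, clusterRole, clusterRoleBinding, deployment],
}
```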
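The "handling of nulls" fix is the usual ksonnet string-param pattern: an unset optional param arrives as the literal string "null", so the corresponding field is only merged into the SparkApplication spec when it was actually set. Second sketch, with the CRD fields cut down to a minimum and the exact spec shape treated as an assumption:

```jsonnet
// Sketch only: ksonnet passes params as strings, so an unset optional param
// shows up as the literal string "null".
local params = {
  mainClass: "org.apache.spark.examples.SparkPi",
  jobArguments: "null",  // unset in this example
};

{
  apiVersion: "sparkoperator.k8s.io/v1alpha1",
  kind: "SparkApplication",
  metadata: { name: "spark-pi" },
  spec: {
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples.jar",
  }
  // Only splice in mainClass when the caller actually set it.
  + (if params.mainClass == "null" then {} else { mainClass: params.mainClass })
  // jobArguments is a comma-separated string; split it into a list when set.
  + (if params.jobArguments == "null" then {}
     else { arguments: std.split(params.jobArguments, ",") }),
}
```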
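Finally, the workflow commits reduce to two steps run from inside the test's ks_app: generate the component with the params appended as a list, then apply it against an explicitly named env. The third sketch shows one plausible shape for those steps; the paths, step names, and command layout are hypothetical, not the actual kubeflow/testing workflow helpers.

```jsonnet
// Hedged sketch of the e2e steps; every name and path here is hypothetical.
local testDir = "/mnt/test-data-volume/presubmit";  // assumed test mount
local ksApp = testDir + "/ks_app";
local namespace = "kubeflow";

// Params are appended to the generate command as a list, not one big string.
local generateParams = ["--namespace=" + namespace];

local steps = [
  {
    name: "spark-operator-generate",
    // cd into the ks app; a plain cd proved more reliable than a workingDir.
    command: ["bash", "-c",
              "cd " + ksApp + " && ks generate spark-operator spark-operator "
              + std.join(" ", generateParams)],
  },
  {
    name: "spark-operator-apply",
    // Name the env explicitly (env list showed only "default"); -v for verbose.
    command: ["bash", "-c",
              "cd " + ksApp + " && ks apply default -c spark-operator -v"],
  },
];

{ steps: steps }
```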