Job never comes up and fails with driver pod not found #795
Which version of the operator are you running? This kind of error typically happens because the controller reconciliation logic tries to get the driver pod and fails to find it in the cache (the local client informer cache that is asynchronously synced against the API server). We have made changes to fall back to getting the driver pod directly from the k8s API server if it is not found in the cache.
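For anyone reading along, here is a minimal sketch of the cache-with-fallback lookup described above, written against client-go (recent versions with context-aware calls); the function and variable names are illustrative and not the operator's actual code:

```go
package driverpod

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// getDriverPod checks the shared informer cache first and, on a cache miss,
// falls back to a direct GET against the API server, since the cache is synced
// asynchronously and may simply be lagging behind.
func getDriverPod(lister corelisters.PodLister, client kubernetes.Interface,
	namespace, name string) (*corev1.Pod, error) {
	pod, err := lister.Pods(namespace).Get(name)
	if err == nil {
		return pod, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// Not in the cache yet: ask the API server directly before concluding
	// that the pod really does not exist.
	return client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
}
```

The informer cache is eventually consistent, so a driver pod created moments ago can legitimately be missing from the cache while the API server already knows about it; the fallback avoids treating that lag as a missing pod.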
I'm running the 2.4.4 version.
What is the image tag of the version you are running?
Image: gcr.io/spark-operator/spark-operator:v1beta2-1.0.1-2.4.4
Please give the new release based on 2.4.5 a try and let me know if you still run into the same issue.
Still seeing the same issue.
|
I have exactly the same issue.
@sb2nov I saw a |
I did not change that. It was just one |
Did you enable Prometheus monitoring? Can you paste your |
I do have Prometheus monitoring turned on:
|
Removing everything below |
That explains it. Enabling Prometheus monitoring will modify the spec with some additional annotations, which causes the submitted run to be invalidated and the driver pod to get deleted. There's a bug here: we should not write the updated spec to the API server as we only use it to construct the submission command. Will fix that.
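To make the intended fix concrete, here is a minimal sketch under the assumption that the monitoring-related mutation can be applied to a deep copy used only to build the submission command; the helper names are placeholders, and the import path and field names reflect my understanding of the v1beta2 API in this repo:

```go
package submission

import (
	"github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/apis/sparkoperator.k8s.io/v1beta2"
)

// buildSubmission applies the Prometheus monitoring mutation to a deep copy
// that is only used to assemble the spark-submit invocation. The copy is never
// written back to the API server, so the stored spec (and the run submitted
// from it) is not invalidated.
func buildSubmission(app *v1beta2.SparkApplication) ([]string, error) {
	local := app.DeepCopy() // leave the stored/cached object untouched
	if local.Spec.Monitoring != nil {
		addMonitoringAnnotations(local) // local mutation only
	}
	return buildSubmissionArgs(local)
}

// addMonitoringAnnotations stands in for the step that adds the extra
// monitoring annotations mentioned above; placeholder, not the real function.
func addMonitoringAnnotations(app *v1beta2.SparkApplication) {}

// buildSubmissionArgs stands in for translating the (locally mutated) spec
// into spark-submit arguments; placeholder, not the real function.
func buildSubmissionArgs(app *v1beta2.SparkApplication) ([]string, error) {
	return nil, nil
}
```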
Thank you for the help.
Created #804 to fix the unwanted spec update. Will include it in the latest release image, and will make a new bug-fix release later.
Awesome. Thank you so much, really appreciate the help.
The spec update issue has been fixed and the image has been updated. Please uninstall and reinstall through Helm.
@liyinan926 I'm still seeing the issue. The operator is running image:
|
Are you still seeing the event |
Yes, still seeing the
|
Interestingly I don't see the updated fields in the spec triggered by the Prometheus monitoring configuration. Can you do a diff on |
Sorry, I was too quick to jump to a conclusion. We have actually updated to using the status subresource and |
The difference I see between the runs that work and the ones that don't seems to be the second submit:
|
Is there somehow a race here between the update being processed and the submit being called?
There seems to be some kind of race condition here. I don't know where the update is coming from in your case, but it looks like the error occurred when the update landed right after the app got submitted. The update invalidated the submitted run and caused the driver pod to be deleted. Meanwhile, the status update performed during submission triggered a requeue, another round of reconciliation, and a check for the driver pod.
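To make the suspected ordering concrete, here is a stripped-down sketch that replays the sequence with stand-in types and helpers; it illustrates the reasoning only and is not the operator's actual control flow:

```go
package reconcile

import "errors"

// app is a tiny stand-in for a SparkApplication, just enough to walk through
// the sequence; everything here is illustrative.
type app struct {
	specChanged bool // an update to the spec arrived after submission
	driverPod   bool // whether the driver pod currently exists
}

func submit(a *app)             { a.driverPod = true }   // submission creates the driver pod
func externalSpecUpdate(a *app) { a.specChanged = true } // the mystery update lands
func invalidateRun(a *app)      { a.driverPod = false }  // the stale run is torn down

// raceTimeline replays the ordering described above: the app is submitted and
// its status update enqueues another reconciliation; a spec update lands right
// afterwards and invalidates the run, deleting the driver pod; the requeued
// reconciliation then looks for a driver pod that is already gone.
func raceTimeline(a *app) error {
	submit(a)             // 1. submission succeeds, driver pod exists
	externalSpecUpdate(a) // 2. a spec update arrives immediately afterwards
	if a.specChanged {
		invalidateRun(a) // the submitted run no longer matches the spec
	}
	if !a.driverPod { // 3. the requeued reconciliation checks the driver pod
		return errors.New("driver pod not found") // the symptom in this issue
	}
	return nil
}
```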
Is there any further progress on this issue? We are starting to see it too.
@liyinan926 just wanted to check if there is anything I can help with?
One thing you can do is to do a |
Another thing I don't understand is the |
This issue seems related to the issue reported in #826, which was caused by applying default values in the wrong place. Specifically, the default values were applied in the |
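As an illustration of why defaulting in the wrong place bites, here is a minimal sketch contrasting in-place defaulting of a shared object with defaulting on a copy used only for submission. The exact location of the bad defaulting is elided above, so treating the shared informer-cache object as the victim is an assumption, and all names here are placeholders:

```go
package defaults

// App is a stand-in for a SparkApplication; Spec holds a field that defaulting
// may fill in. Placeholder types for illustration only.
type App struct {
	Spec Spec
}

type Spec struct {
	Mode string // e.g. left empty by the user, defaulted to "cluster"
}

// applyDefaultsInPlace mutates whatever object it is handed. If that object is
// the one shared with the informer cache, the cached copy silently diverges
// from what is stored in the API server, and a later reconciliation sees a
// "spec change" that invalidates the submitted run.
func applyDefaultsInPlace(app *App) {
	if app.Spec.Mode == "" {
		app.Spec.Mode = "cluster"
	}
}

// applyDefaultsToCopy is the safer variant: defaulting happens on a copy used
// only for building the submission, and the shared object stays untouched.
func applyDefaultsToCopy(app *App) *App {
	local := *app // a real SparkApplication would use DeepCopy()
	applyDefaultsInPlace(&local)
	return &local
}
```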
Awesome, thank you so much. Will you be cutting a new release by any chance?
I have updated and pushed image |
We were hitting the same issue. I tried |
Can you give |
@liyinan926 I already tried it and it fixed the issue for us.
Tried it as well; it seems to have fixed the underlying issue.
The fix in #832 has been released in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/releases/tag/v1beta2-1.1.1-2.4.5.
Sometimes the creation request succeeds but the job never comes up, and doing a describe on the SparkApplication shows the driver pod not found error.
We've seen this happen about 25% of the time. Any advice?