
Job never comes up and fails with driver pod not found #795

Closed
sb2nov opened this issue Feb 7, 2020 · 38 comments
Labels
kind/bug Something isn't working

Comments

@sb2nov

sb2nov commented Feb 7, 2020

Sometimes the creation request succeeds but the job never comes up, and running describe on the SparkApplication shows:

SparkApplicationFailed               6m54s  spark-operator  SparkApplication log-validation failed: Driver Pod not found

We've seen this happen about 25% of the time. Any advice?

@liyinan926
Collaborator

Which version of the operator are you running? This kind of error typically happens because the controller's reconciliation logic tries to get the driver pod and fails to find it in the cache (the local informer cache that is asynchronously synced with the API server). We have made changes to fall back to getting the driver pod directly from the Kubernetes API server if it is not found in the cache.
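
For illustration, here is a minimal Go sketch of that fallback using standard client-go types; the function name and its wiring are hypothetical, not the operator's actual code:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// getDriverPod (hypothetical) prefers the informer's lister cache and falls
// back to a direct API-server GET on a cache miss, so a not-yet-synced cache
// is not mistaken for a deleted driver pod.
func getDriverPod(lister corelisters.PodLister, client kubernetes.Interface,
	namespace, name string) (*corev1.Pod, error) {
	pod, err := lister.Pods(namespace).Get(name)
	if err == nil {
		return pod, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// Cache miss: the informer may simply lag behind the API server, so ask
	// the API server directly before concluding the pod does not exist.
	return client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
}
```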

@sb2nov
Author

sb2nov commented Feb 7, 2020

I'm running version 2.4.4.

@liyinan926
Collaborator

What is the image tag of the version you are running?

@sb2nov
Author

sb2nov commented Feb 8, 2020

Image: gcr.io/spark-operator/spark-operator:v1beta2-1.0.1-2.4.4
Image ID: docker-pullable://gcr.io/spark-operator/spark-operator@sha256:ce769e5c6a5d8fa78ceb1a0abaf961fb2424767f9535c97baac04a18169654bd

@liyinan926
Collaborator

Please give the new release based on 2.4.5 a try and let me know if you still run into the same issue.

@sb2nov
Author

sb2nov commented Feb 11, 2020

Still seeing the same issue.

  Type     Reason                               Age   From            Message
  ----     ------                               ----  ----            -------
  Normal   SparkApplicationAdded                58s   spark-operator  SparkApplication logs-batch was added, enqueuing it for submission
  Normal   SparkApplicationSubmitted            53s   spark-operator  SparkApplication logs-batch was submitted successfully
  Normal   SparkApplicationSpecUpdateProcessed  53s   spark-operator  Successfully processed spec update for SparkApplication logs-batch
  Warning  SparkApplicationFailed               52s   spark-operator  SparkApplication logs-batch failed: Driver Pod not found

@julio666

I have exactly the same issue.

@liyinan926
Collaborator

@sb2nov I saw a SparkApplicationSpecUpdateProcessed event on your SparkApplication resource. Did you make an update to the spec after submission?

@sb2nov
Author

sb2nov commented Feb 11, 2020

I did not change it; it was just a single kubectl apply. I'm using the operator installed via Helm.

@liyinan926
Collaborator

Did you enable Prometheus monitoring? Can you paste your SparkApplication spec here?

@sb2nov
Author

sb2nov commented Feb 11, 2020

I do have Prometheus monitoring turned on:

Name:         logs-batch
Namespace:    sourabh
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sparkoperator.k8s.io/v1beta2","kind":"SparkApplication","metadata":{"annotations":{},"name":"logs-batch","namespace...
API Version:  sparkoperator.k8s.io/v1beta2
Kind:         SparkApplication
Metadata:
  Creation Timestamp:  2020-02-11T22:33:00Z
  Generation:          1
  Resource Version:    46870186
  Self Link:           /apis/sparkoperator.k8s.io/v1beta2/namespaces/sourabh/sparkapplications/logs-batch
  UID:                 7576d9e5-4d1e-11ea-8be9-02f34443fbf2
Spec:
  Driver:
    Cores:  1
    Env Vars:
      INDEX LOG KAFKA:      log-events
      KAFKA BROKERS:        pubsub-0.pubsub:9092,pubsub-1.pubsub:9092,pubsub-2.pubsub:9092
      SCHEMA REGISTRY:      http://cp-schema-registry:8081
      ZOOKEEPER:            zookeeper:2181
    Memory:                       1024m
    Service Account:              spark
  Executor:
    Cores:  1
    Instances:                3
    Memory:                   2048m
  Hadoop Conf:
    Fs . S 3 A . Aws . Credentials . Provider:  com.amazonaws.auth.InstanceProfileCredentialsProvider
    Fs . S 3 A . Impl:                          org.apache.hadoop.fs.s3a.S3AFileSystem
    Fs . S 3 A . Path . Style . Access:         true
    Validate Output Specs:                      false
  Image:                                        scratchspace:sourabh-20200211143232-dev
  Image Pull Policy:                            Always
  Main Application File:                        local:///opt/spark/work-dir/logs_batch.py
  Mode:                                         cluster
  Monitoring:
    Expose Driver Metrics:    true
    Expose Executor Metrics:  true
    Prometheus:
      Jmx Exporter Jar:  /opt/spark/jars/jmx_prometheus_javaagent-0.11.0.jar
  Python Version:        3
  Restart Policy:
    On Submission Failure Retries:         5
    On Submission Failure Retry Interval:  20
    Type:                                  OnFailure
  Spark Conf:
    Spark . Kubernetes . Local . Dirs . Tmpfs:  true
    Spark . Kubernetes . Namespace:             sourabh
  Spark Version:                                2.4.4
  Type:                                         Python
Status:
  Application State:
    Error Message:  Driver Pod not found
    State:          FAILED
  Driver Info:
    Pod Name:                    logs-batch-driver
    Web UI Address:              10.100.18.81:4040
    Web UI Port:                 4040
    Web UI Service Name:         logs-batch-ui-svc
  Execution Attempts:            1
  Last Submission Attempt Time:  2020-02-11T22:33:05Z
  Spark Application Id:          spark-0e3107d4b23f4387b93802d14939fdf4
  Submission Attempts:           1
  Submission ID:                 16c9b013-ec77-493c-ab03-d492735a37d0
  Termination Time:              2020-02-11T22:33:05Z
Events:
  Type     Reason                               Age                From            Message
  ----     ------                               ----               ----            -------
  Normal   SparkApplicationAdded                43s                spark-operator  SparkApplication logs-batch was added, enqueuing it for submission
  Normal   SparkApplicationSubmitted            38s                spark-operator  SparkApplication logs-batch was submitted successfully
  Normal   SparkApplicationSpecUpdateProcessed  38s                spark-operator  Successfully processed spec update for SparkApplication logs-batch
  Warning  SparkApplicationFailed               37s (x2 over 38s)  spark-operator  SparkApplication logs-batch failed: Driver Pod not found

@sb2nov
Author

sb2nov commented Feb 11, 2020

Removing everything below Monitoring: makes it a lot more stable across the 10 runs I did.

@liyinan926
Collaborator

That explains it. Enabling Prometheus monitoring modifies the spec with some additional annotations, which invalidates the submitted run and causes the driver pod to get deleted. There's a bug here: we should not write the updated spec back to the API server, since we only use it to construct the submission command. Will fix that.
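
As a sketch of that fix, assuming the operator's v1beta2 API types, the mutation could be confined to a throwaway deep copy; buildSubmissionApp and configPrometheusMonitoring below are illustrative names, not the actual implementation:

```go
package controller

import (
	"github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/apis/sparkoperator.k8s.io/v1beta2"
)

// buildSubmissionApp (hypothetical) applies the Prometheus-monitoring
// mutations to a deep copy that is used only to build the spark-submit
// command, so the SparkApplication spec stored in the API server is
// never modified.
func buildSubmissionApp(app *v1beta2.SparkApplication) *v1beta2.SparkApplication {
	appCopy := app.DeepCopy()
	if appCopy.Spec.Monitoring != nil {
		// Assumed helper that adds the JMX-exporter javaagent conf and
		// metrics annotations to the copy's driver and executor specs.
		configPrometheusMonitoring(appCopy)
	}
	return appCopy // fed to spark-submit; never written back via Update()
}

// configPrometheusMonitoring is a stand-in for the mutation logic.
func configPrometheusMonitoring(app *v1beta2.SparkApplication) {}
```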

@sb2nov
Author

sb2nov commented Feb 12, 2020

Thank you for the help.

liyinan926 added a commit to liyinan926/spark-on-k8s-operator that referenced this issue Feb 12, 2020
@liyinan926
Collaborator

liyinan926 commented Feb 12, 2020

Created #804 to fix the unwanted spec update. Will include it in the latest release image, and will make a new bug fix release later.

@sb2nov
Author

sb2nov commented Feb 12, 2020

Awesome. Thank you so much, really appreciate the help.

@liyinan926
Collaborator

The spec update issue has been fixed and the image has been updated. Please uninstall and reinstall through Helm.

@sb2nov
Author

sb2nov commented Feb 12, 2020

@liyinan926 I'm still seeing the issue. The operator is running image:

  Image:         gcr.io/spark-operator/spark-operator:v1beta2-1.1.0-2.4.5
  Image ID:      docker-pullable://gcr.io/spark-operator/spark-operator@sha256:c59931eb2fcbcd51750fa70d253ed445dfa5f7b8d1e6594550310a02b18ce5f9

@liyinan926
Collaborator

Are you still seeing the SparkApplicationSpecUpdateProcessed event? Can you run kubectl get sparkapplication <name> -o=yaml after the error is reported?

@sb2nov
Author

sb2nov commented Feb 12, 2020

Yes, still seeing the SparkApplicationSpecUpdateProcessed event.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"sparkoperator.k8s.io/v1beta2","kind":"SparkApplication","metadata":{"annotations":{},***********}
  creationTimestamp: "2020-02-12T02:10:26Z"
  generation: 1
  name: logs-batch
  namespace: sourabh
  resourceVersion: "46990703"
  selfLink: /apis/sparkoperator.k8s.io/v1beta2/namespaces/sourabh/sparkapplications/logs-batch
  uid: d5c7954b-4d3c-11ea-bcb5-06d1592d959c
spec:
  driver:
    cores: 1
    envVars:
      SCHEMA_REGISTRY: http://cp-schema-registry:8081
      ZOOKEEPER: zookeeper:2181
    memory: 1024m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 3
    memory: 2048m
  hadoopConf:
    fs.s3a.aws.credentials.provider: com.amazonaws.auth.InstanceProfileCredentialsProvider
    fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    fs.s3a.path.style.access: "true"
    validateOutputSpecs: "false"
  image: scratchspace:sourabhbajaj-20200211181007-dev
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/spark/work-dir/logs_batch.py
  mode: cluster
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: /opt/spark/jars/jmx_prometheus_javaagent-0.11.0.jar
  pythonVersion: "3"
  restartPolicy:
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
    type: OnFailure
  sparkConf:
    spark.kubernetes.local.dirs.tmpfs: "true"
    spark.kubernetes.namespace: sourabh
  sparkVersion: 2.4.5
  type: Python
status:
  applicationState:
    errorMessage: Driver Pod not found
    state: FAILED
  driverInfo:
    podName: logs-batch-driver
    webUIAddress: 10.100.127.207:4040
    webUIPort: 4040
    webUIServiceName: logs-batch-ui-svc
  executionAttempts: 1
  lastSubmissionAttemptTime: "2020-02-12T02:10:29Z"
  sparkApplicationId: spark-a7e0ffc3eabc418a97c31d9d40b149d9
  submissionAttempts: 1
  submissionID: a3d11fdd-9403-40a9-a588-81f6e669e695
  terminationTime: "2020-02-12T02:10:30Z"

@liyinan926
Collaborator

liyinan926 commented Feb 12, 2020

Interestingly, I don't see the fields in the spec that the Prometheus monitoring configuration would have updated. Can you diff the spec between your original version and the version you get by running kubectl get?

liyinan926 added a commit that referenced this issue Feb 12, 2020
@liyinan926
Collaborator

liyinan926 commented Feb 12, 2020

Sorry, I jumped to a conclusion too quickly. We actually switched a while ago to using the status subresource and UpdateStatus to update only the status of a SparkApplication, so the updates made by the Prometheus monitoring config are not to blame. The describe output you pasted in #795 (comment) actually proves that, since there don't appear to be any updates related to Prometheus monitoring. I'm not sure where the update to the spec is coming from.

liyinan926 added a commit that referenced this issue Feb 12, 2020
@sb2nov
Author

sb2nov commented Feb 12, 2020

The difference I see between the runs that work and those that don't seems to be the second submit:

Events:
  Type     Reason                               Age                  From            Message
  ----     ------                               ----                 ----            -------
  Normal   SparkApplicationAdded                5m10s                spark-operator  SparkApplication logs-batch was added, enqueuing it for submission
  Normal   SparkApplicationSpecUpdateProcessed  5m7s                 spark-operator  Successfully processed spec update for SparkApplication logs-batch
  Warning  SparkApplicationPendingRerun         5m7s                 spark-operator  SparkApplication logs-batch is pending rerun
  Normal   SparkApplicationSubmitted            5m3s (x2 over 5m7s)  spark-operator  SparkApplication logs-batch was submitted successfully
  Normal   SparkDriverRunning                   3m55s                spark-operator  Driver logs-batch-driver is running

@sb2nov
Author

sb2nov commented Feb 12, 2020

Is there a race here between the update being processed and the submit being called somehow?

@liyinan926
Collaborator

There seems to be some kind of race condition here. I don't know where the update is coming from in your case, but it appears the error happened when the update landed right after the app was submitted. The update invalidated the submitted run and caused the driver pod to be deleted. Meanwhile, the status update during submission triggered a requeue, a reconciliation, and a check for the driver pod.

@pudong7163

Is there any further progress on this issue? We are starting to see it too.

@sb2nov
Author

sb2nov commented Feb 25, 2020

@liyinan926 just wanted to check if there is anything I can help with?

breetasinha1109 pushed a commit to nokia/spark-on-k8s-operator that referenced this issue Feb 27, 2020
@liyinan926
Collaborator

One thing you can do is run kubectl get sparkapplication <your app> -o=yaml, diff its output against the original spec of your app, and see what has been updated. I don't really see where the update came from in your case.

@liyinan926
Collaborator

Another thing I don't understand is that the generation of your resource remains 1 even though there appears to have been an update to the spec. The generation field is supposed to be incremented on every spec update when the status subresource is enabled.

@liyinan926
Collaborator

This issue seems related to the one reported in #826, which was caused by applying default values in the wrong place. Specifically, the default values were applied in the onAdd handler directly to the resource in the informer cache instead of to a copy, so the changes became visible to any subsequent reads of the resource from the cache. This appears to be the source of the change that led to the SparkApplicationSpecUpdateProcessed event. The issue has been fixed in #832.
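
For illustration, a minimal Go sketch of this class of bug; the handler and helper names are hypothetical, not the operator's actual code:

```go
package controller

import (
	"github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/apis/sparkoperator.k8s.io/v1beta2"
)

// Objects handed to informer event handlers are shared with the cache, so
// defaulting them in place makes the mutated spec visible to every later
// cache read, which then looks like an external spec update.
func onAdd(obj interface{}) {
	app := obj.(*v1beta2.SparkApplication)

	// Buggy: mutates the object stored in the informer cache in place.
	// setDefaults(app)

	// Fixed: apply defaults to a deep copy; the cached object stays pristine.
	appCopy := app.DeepCopy()
	setDefaults(appCopy)
	enqueue(appCopy)
}

// setDefaults and enqueue are stand-ins for the defaulting and workqueue logic.
func setDefaults(app *v1beta2.SparkApplication) {}
func enqueue(app *v1beta2.SparkApplication)     {}
```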

@sb2nov
Author

sb2nov commented Mar 14, 2020

Awesome, thank you so much. Will you be cutting a new release by any chance?

@liyinan926
Collaborator

I have updated and pushed image gcr.io/spark-operator/spark-operator:v2.4.5-v1beta2-latest to include the fix. Can you give it a try and let me know if the issue reported here is resolved?

@ekesken

ekesken commented Mar 19, 2020

We were hitting the same issue. I tried gcr.io/spark-operator/spark-operator:v2.4.5-v1beta2-latest and can confirm it solves the problem. When can we get an official release?

@liyinan926
Collaborator

Can you give gcr.io/spark-operator/spark-operator:v2.4.5-v1beta2-latest a try and see if it fixes the issue?

@ekesken

ekesken commented Mar 19, 2020

Can you give gcr.io/spark-operator/spark-operator:v2.4.5-v1beta2-latest a try and see if it fixes the issue?

@liyinan926 I already tried and it fixed the issue for us.

@sb2nov
Author

sb2nov commented Mar 19, 2020

Tried it as well, seems to have fixed the underlying issue.

@liyinan926
Collaborator

@ekesken @sb2nov thanks for confirming! Will make a new bug fix release soon.


breetasinha1109 pushed a commit to nokia/spark-on-k8s-operator that referenced this issue Mar 31, 2020
jbhalodia-slack pushed a commit to jbhalodia-slack/spark-operator that referenced this issue Oct 4, 2024