
Race Condition to Cause Driver Pod Not Found #826

Closed
pudong7163 opened this issue Mar 2, 2020 · 5 comments · Fixed by #832

Comments

@pudong7163

I am relatively new to client-go and k8s, but over the past few days I have spent a lot of time trying to understand what is going on here, regarding issue #795.

This section briefly explains the behavior that causes the Spark driver to restart

When I submit a SparkApplication, I do not set the restartPolicy (there are other defaulted spec fields that cause the Spark driver to restart in the same way). The Spark operator deletes the submitted driver pod because its local cache (from the informer) differs from the API server in terms of the SparkApplication spec: the local cache has the defaulted restartPolicy, while the API server's restartPolicy is still "". So normally you will see the following behavior for the Spark driver pod.

Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 0s
Pu-spark-driver 0/1 Terminating 0 1s
Pu-spark-driver 0/1 Terminating 0 1s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 2s
Pu-spark-driver 1/1 Running 0 3s

I think this behavior has also been observed by other people in the spark-operator channel.
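To make the defaulting mechanism concrete, here is a minimal, self-contained Go sketch (the struct and field names are illustrative, not the operator's actual v1beta2 types) of why defaulting the informer's cached object makes it diverge from the API server copy, so a spec comparison concludes the application was updated and the driver pod gets deleted and resubmitted:

```go
package main

import (
	"fmt"
	"reflect"
)

// Illustrative stand-in for a SparkApplication spec; the real operator
// uses the v1beta2 API types, not these.
type SparkAppSpec struct {
	RestartPolicy string // empty when the user does not set it
}

type SparkApp struct {
	Spec SparkAppSpec
}

// setDefaults mutates the object it is given. If that object is the
// informer's cached copy, the cache now disagrees with the API server.
func setDefaults(app *SparkApp) {
	if app.Spec.RestartPolicy == "" {
		app.Spec.RestartPolicy = "Never"
	}
}

func main() {
	apiServerCopy := &SparkApp{} // restartPolicy "" as the user submitted it
	cachedCopy := &SparkApp{}    // what the informer cache holds
	setDefaults(cachedCopy)      // defaulting applied directly to the cached object

	if !reflect.DeepEqual(apiServerCopy.Spec, cachedCopy.Spec) {
		// The spec comparison now reports a change that the user never made,
		// so the running driver is invalidated and resubmitted.
		fmt.Println("spec mismatch: application treated as updated, driver restarted")
	}
}
```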

This section explains the race condition (at least that is what I believe; correct me if I am wrong)

A failed SparkApplication has the following state transitions, parsed from the Spark operator log. This is reproducible, and the order is the same across all of my failed SparkApplications.

{ } -> {SUBMITTED }
{SUBMITTED } -> {SUBMITTED }
{INVALIDATING } -> {PENDING_RERUN }
{SUBMITTED } -> {FAILING Driver Pod not found}
{FAILING Driver Pod not found} -> {FAILED Driver Pod not found}
{FAILED Driver Pod not found} -> {FAILED Driver Pod not found}

{INVALIDATING } -> {PENDING_RERUN } is the state transition after the Spark operator deletes the driver pod. However, the next state transition is {SUBMITTED } -> {FAILING Driver Pod not found}, which does not make sense, because the previous state should be {PENDING_RERUN}, not {SUBMITTED}. This is where my theory of a race condition comes into the picture. If you check the code, you will find that the states shown in the log are the state read from the local cache (on the left) and the state submitted to the API server (on the right).

So for this state transition, {INVALIDATING} is the local cache state, and the Spark operator sends the {PENDING_RERUN} state to the API server directly over its REST API, not via the informer; it does not update the local cache:
{INVALIDATING } -> {PENDING_RERUN }

In the next state transition, the operator reads from the local cache, where the state is still {SUBMITTED }. For that state it checks whether the driver pod is alive; it cannot find the driver pod (it was just deleted), so it transitions the state to {FAILING Driver Pod not found}. The following transition, {FAILING Driver Pod not found} -> {FAILED Driver Pod not found}, then makes sense. This concludes my analysis.
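To illustrate the suspected race, here is a minimal, self-contained Go sketch (hypothetical names and a deliberately simplified model of the controller, not the operator's actual code): the status update goes straight to the API server, but the next reconcile still reads the stale SUBMITTED state from the cache, looks for the driver pod that was just deleted, and fails the application.

```go
package main

import "fmt"

// Simplified model of the controller's view of the world: apiServerState is
// what REST status updates write, cacheState is what the informer-backed
// cache still returns until the watch event arrives.
type cluster struct {
	apiServerState string
	cacheState     string
	driverPodAlive bool
}

// updateStatus writes only to the API server; the local cache is refreshed
// later, asynchronously, by the informer.
func (c *cluster) updateStatus(state string) { c.apiServerState = state }

func main() {
	// The cache last saw the application in SUBMITTED state.
	c := &cluster{cacheState: "SUBMITTED", driverPodAlive: true}

	// Reconcile 1 (the {INVALIDATING} -> {PENDING_RERUN} transition): the
	// operator deletes the driver pod and writes PENDING_RERUN to the API
	// server, but does not touch the cache.
	c.updateStatus("PENDING_RERUN")
	c.driverPodAlive = false

	// Reconcile 2 runs before the informer delivers that update, so it
	// still reads SUBMITTED from the cache and goes looking for the driver.
	switch c.cacheState {
	case "SUBMITTED":
		if !c.driverPodAlive {
			fmt.Println("{SUBMITTED} -> {FAILING Driver Pod not found}")
		}
	case "PENDING_RERUN":
		fmt.Println("would resubmit the application instead")
	}
}
```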

@liyinan926
Collaborator

liyinan926 commented Mar 8, 2020

The issue seems related to applying some default values, e.g., restartPolicy, to the cached resource directly. The current way of applying the default values is really not ideal. To fix it before switching to the native defaulting support for CRDs in Kubernetes, we should move the internal defaulting so that it applies to an in-memory copy of the resource made when submitting the application. With this change, the resource will no longer appear to have been updated because of the defaulting.
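A rough sketch of that direction, reusing the illustrative types from the earlier example (this is an assumption about the shape of the fix, not the actual change in #832): apply defaults to a copy used for the submission and leave the cached object untouched.

```go
package main

import "fmt"

// Illustrative types, as in the earlier sketch; not the operator's real API.
type SparkAppSpec struct{ RestartPolicy string }
type SparkApp struct{ Spec SparkAppSpec }

// defaultedCopy applies defaults to a copy of the cached object, so the
// informer cache keeps matching what is stored in the API server.
func defaultedCopy(cached *SparkApp) *SparkApp {
	c := *cached // a shallow copy is enough for this flat illustrative struct
	if c.Spec.RestartPolicy == "" {
		c.Spec.RestartPolicy = "Never"
	}
	return &c
}

func main() {
	cached := &SparkApp{} // restartPolicy left empty, as the user submitted it
	forSubmission := defaultedCopy(cached)

	fmt.Println("cached restartPolicy:    ", cached.Spec.RestartPolicy)        // still ""
	fmt.Println("submission restartPolicy:", forSubmission.Spec.RestartPolicy) // "Never"
}
```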

liyinan926 added a commit to liyinan926/spark-on-k8s-operator that referenced this issue Mar 8, 2020
@liyinan926
Collaborator

Created #832 to fix the defaulting issue.

@liyinan926
Collaborator

PR #832 should have fixed the issue caused by defaulting.

@pudong7163
Author

Thank you @liyinan926! I have already forwarded the release email to my colleagues. Because of this bug, we chose bare-metal spark-submit on k8s for now; we may switch back to the Spark operator some time soon. Thank you again. :)

jbhalodia-slack pushed a commit to jbhalodia-slack/spark-operator that referenced this issue Oct 4, 2024