
Race Condition to Cause Driver Pod Not Found #826

Closed
pudong7163 opened this issue Mar 2, 2020 · 5 comments · Fixed by #832

Comments

@pudong7163

I am relatively new to client-go and k8s, but over the past few days I have spent a lot of time trying to understand what is going on here, regarding issue #795.

This section briefly explains the behavior that causes the Spark driver to restart

When I submit a SparkApplication, I do not set the restartPolicy (there are other defaulted spec fields that cause the Spark driver to restart in the same way). The Spark operator deletes the submitted driver pod because its local cache (from the informer) differs from the API server in terms of the SparkApplication spec: the local cache has the defaulted restartPolicy, while the API server's restartPolicy is still "". So normally you will see the following behavior for the Spark driver pod.

Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 0s
Pu-spark-driver 0/1 Terminating 0 1s
Pu-spark-driver 0/1 Terminating 0 1s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 2s
Pu-spark-driver 1/1 Running 0 3s

I think this behavior has also been observed by other people in the spark-operator channel.
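To make the defaulting mechanism concrete, here is a minimal, self-contained Go sketch (the struct and field names are illustrative, not the operator's actual v1beta2 types) of why defaulting the informer's cached object makes it diverge from the API server copy, so a spec comparison concludes the application was updated and the driver pod gets deleted and resubmitted:

```go
package main

import (
	"fmt"
	"reflect"
)

// Illustrative stand-in for a SparkApplication spec; the real operator
// uses the v1beta2 API types, not these.
type SparkAppSpec struct {
	RestartPolicy string // empty when the user does not set it
}

type SparkApp struct {
	Spec SparkAppSpec
}

// setDefaults mutates the object it is given. If that object is the
// informer's cached copy, the cache now disagrees with the API server.
func setDefaults(app *SparkApp) {
	if app.Spec.RestartPolicy == "" {
		app.Spec.RestartPolicy = "Never"
	}
}

func main() {
	apiServerCopy := &SparkApp{} // restartPolicy "" as the user submitted it
	cachedCopy := &SparkApp{}    // what the informer cache holds
	setDefaults(cachedCopy)      // defaulting applied directly to the cached object

	if !reflect.DeepEqual(apiServerCopy.Spec, cachedCopy.Spec) {
		// The spec comparison now reports a change that the user never made,
		// so the running driver is invalidated and resubmitted.
		fmt.Println("spec mismatch: application treated as updated, driver restarted")
	}
}
```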

This section explains the race condition (at least that is what I believe; correct me if I am wrong)

A failed SparkApplication has the following state transitions, parsed from the Spark operator log. This is reproducible, and the order is the same across all of my failed SparkApplications.

{ } -> {SUBMITTED }
{SUBMITTED } -> {SUBMITTED }
{INVALIDATING } -> {PENDING_RERUN }
{SUBMITTED } -> {FAILING Driver Pod not found}
{FAILING Driver Pod not found} -> {FAILED Driver Pod not found}
{FAILED Driver Pod not found} -> {FAILED Driver Pod not found}

{INVALIDATING } -> {PENDING_RERUN } is the state transition after the Spark operator deletes the driver pod. However, the next state transition is {SUBMITTED } -> {FAILING Driver Pod not found}, which does not make sense, because the previous state should be {PENDING_RERUN}, not {SUBMITTED}. This is where my theory of a race condition comes into the picture. If you check the code, you will find that the states shown in the log are the state read from the local cache (on the left) and the state submitted to the API server (on the right).

So for this state transition, {INVALIDATING} is the local cache state, and the Spark operator sends the {PENDING_RERUN} state to the API server directly over its REST API, not via the informer; it does not update the local cache:
{INVALIDATING } -> {PENDING_RERUN }

In the next state transition, the operator reads from the local cache, where the state is still {SUBMITTED }. For that state it checks whether the driver pod is alive; it cannot find the driver pod (it was just deleted), so it transitions the state to {FAILING Driver Pod not found}. The following transition, {FAILING Driver Pod not found} -> {FAILED Driver Pod not found}, then makes sense. This concludes my analysis.
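To illustrate the suspected race, here is a minimal, self-contained Go sketch (hypothetical names and a deliberately simplified model of the controller, not the operator's actual code): the status update goes straight to the API server, but the next reconcile still reads the stale SUBMITTED state from the cache, looks for the driver pod that was just deleted, and fails the application.

```go
package main

import "fmt"

// Simplified model of the controller's view of the world: apiServerState is
// what REST status updates write, cacheState is what the informer-backed
// cache still returns until the watch event arrives.
type cluster struct {
	apiServerState string
	cacheState     string
	driverPodAlive bool
}

// updateStatus writes only to the API server; the local cache is refreshed
// later, asynchronously, by the informer.
func (c *cluster) updateStatus(state string) { c.apiServerState = state }

func main() {
	// The cache last saw the application in SUBMITTED state.
	c := &cluster{cacheState: "SUBMITTED", driverPodAlive: true}

	// Reconcile 1 (the {INVALIDATING} -> {PENDING_RERUN} transition): the
	// operator deletes the driver pod and writes PENDING_RERUN to the API
	// server, but does not touch the cache.
	c.updateStatus("PENDING_RERUN")
	c.driverPodAlive = false

	// Reconcile 2 runs before the informer delivers that update, so it
	// still reads SUBMITTED from the cache and goes looking for the driver.
	switch c.cacheState {
	case "SUBMITTED":
		if !c.driverPodAlive {
			fmt.Println("{SUBMITTED} -> {FAILING Driver Pod not found}")
		}
	case "PENDING_RERUN":
		fmt.Println("would resubmit the application instead")
	}
}
```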

@liyinan926
Collaborator

liyinan926 commented Mar 8, 2020

The issue seems related to applying some default values, e.g., restartPolicy, to the cached resource directly. The current way of applying the default values is really not ideal. To fix it before switching to the native defaulting support for CRDs in Kubernetes, we should move the internal defaulting so that it applies to an in-memory copy of the resource made when submitting the application. With this change, the resource will no longer appear to have been updated because of the defaulting.
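A rough sketch of that direction, reusing the illustrative types from the earlier example (this is an assumption about the shape of the fix, not the actual change in #832): apply defaults to a copy used for the submission and leave the cached object untouched.

```go
package main

import "fmt"

// Illustrative types, as in the earlier sketch; not the operator's real API.
type SparkAppSpec struct{ RestartPolicy string }
type SparkApp struct{ Spec SparkAppSpec }

// defaultedCopy applies defaults to a copy of the cached object, so the
// informer cache keeps matching what is stored in the API server.
func defaultedCopy(cached *SparkApp) *SparkApp {
	c := *cached // a shallow copy is enough for this flat illustrative struct
	if c.Spec.RestartPolicy == "" {
		c.Spec.RestartPolicy = "Never"
	}
	return &c
}

func main() {
	cached := &SparkApp{} // restartPolicy left empty, as the user submitted it
	forSubmission := defaultedCopy(cached)

	fmt.Println("cached restartPolicy:    ", cached.Spec.RestartPolicy)        // still ""
	fmt.Println("submission restartPolicy:", forSubmission.Spec.RestartPolicy) // "Never"
}
```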

liyinan926 added a commit to liyinan926/spark-on-k8s-operator that referenced this issue Mar 8, 2020
@liyinan926
Collaborator

Created #832 to fix the defaulting issue.

@liyinan926
Collaborator

PR #832 should have fixed the issue caused by defaulting.

@pudong7163
Author

Thank you @liyinan926! I have already forwarded the release email to my colleagues. Because of this bug, we chose bare-metal spark-submit on k8s for now; we may switch back to the Spark operator some time soon. Thank you again. :)

jbhalodia-slack pushed a commit to jbhalodia-slack/spark-operator that referenced this issue Oct 4, 2024