Race Condition to Cause Driver Pod Not Found #826
Comments
The issue seems related to applying some default values, e.g. the defaulted restartPolicy described below.
Created #832 to fix the defaulting issue.
PR #832 should have fixed the issue caused by defaulting.
The fix in #832 has been released in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/releases/tag/v1beta2-1.1.1-2.4.5.
Thank you @liyinan926! I already forwarded the release email to my colleagues. Because of this bug, we chose bare-metal spark-submit on k8s for now. We may switch back to the Spark operator some time soon. Thank you again. :)
I am relatively new to client-go and k8s, but in the past few days I spent a lot of time trying to understand what is going on there, regarding issue #795.
This section briefly explains the behavior that causes the Spark driver to restart.
When I submit a SparkApplication, I do not set the restartPolicy. (Other default specs can cause the Spark driver to restart in a similar way.) The Spark operator deletes the submitted Spark driver pod because its local cache (from the informer) differs from the API server in terms of the SparkApplication spec: the local cache holds a defaulted restartPolicy, while the API server's restartPolicy is still empty (""). So normally you will see the following behavior for the Spark driver pod; a sketch of the mismatch follows the pod events.
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 0s
Pu-spark-driver 0/1 Terminating 0 1s
Pu-spark-driver 0/1 Terminating 0 1s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 Pending 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 0s
Pu-spark-driver 0/1 ContainerCreating 0 2s
Pu-spark-driver 1/1 Running 0 3s
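To make the mismatch concrete, here is a minimal, hypothetical Go sketch (not the operator's actual code; the struct, field, and function names are made up for illustration) of how applying a default only to the in-memory copy makes the cached spec look different from what the API server stores:

```go
// Hypothetical sketch of the defaulting mismatch; not the operator's code.
package main

import "fmt"

// SparkApplicationSpec stands in for the real CRD spec; only the field
// relevant to this issue is shown.
type SparkApplicationSpec struct {
	RestartPolicy string
}

// applyDefaults mirrors the kind of in-memory defaulting described above.
func applyDefaults(spec *SparkApplicationSpec) {
	if spec.RestartPolicy == "" {
		spec.RestartPolicy = "Never" // example default; the exact value is not the point
	}
}

func main() {
	// What the API server stores: restartPolicy was never set.
	stored := SparkApplicationSpec{RestartPolicy: ""}

	// What the local cache ends up holding after defaulting the in-memory copy.
	cached := stored
	applyDefaults(&cached)

	// A naive "has the spec changed?" comparison now reports a difference,
	// which is what appears to trigger the driver pod deletion described above.
	if cached != stored {
		fmt.Println("spec mismatch: the operator would treat this as an update and restart the driver")
	}
}
```

A comparison like this is, as far as I can tell, why an unchanged SparkApplication gets treated as updated and its driver pod gets deleted and recreated.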
I think this behavior has also been observed by other people in the spark-operator channel.
This section explains the race condition (at least that's what I believe; correct me if I am wrong).
A failed SparkApplication has the following state transitions, parsed from the Spark operator log. This is reproducible, and the order is the same across all my failed SparkApplications.
{ } -> {SUBMITTED }
{SUBMITTED } -> {SUBMITTED }
{INVALIDATING } -> {PENDING_RERUN }
{SUBMITTED } -> {FAILING Driver Pod not found}
{FAILING Driver Pod not found} -> {FAILED Driver Pod not found}
{FAILED Driver Pod not found} -> {FAILED Driver Pod not found}
The {INVALIDATING } -> {PENDING_RERUN } transition happens after the Spark operator deletes the driver pod. However, the next transition is {SUBMITTED } -> {FAILING Driver Pod not found}, which does not make sense, because the previous state should be {PENDING_RERUN}, not {SUBMITTED}. This is where my theory of a race condition comes into the picture. If you check the code, you will find that the states shown in the log are the cached state and the state being submitted to the API server.
So, for this state transition, {INVALIDATING} is the local cache state, and the Spark operator sends the {PENDING_RERUN} state to the API server via the REST API, not via the informer. It does not update the local cache.
{INVALIDATING } -> {PENDING_RERUN }
In the next state transition, the operator reads from the local cache, where the state is still {SUBMITTED }. In that state it checks whether the driver pod is alive; it cannot find the driver pod, so it transitions the state to {FAILING Driver Pod not found}. The following transition from {FAILING Driver Pod not found} to {FAILED Driver Pod not found} then makes sense. This concludes my analysis; a simplified sketch of the stale-cache read follows.
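To illustrate the timing, here is a simplified, hypothetical Go sketch (again, not the operator's actual code; all names are made up) of the stale-cache read: the new state goes straight to the API server, while the next sync pass still reads the old SUBMITTED state from the cache, cannot find the driver pod, and marks the application as failing:

```go
// Hypothetical sketch of the stale-cache race; not the operator's code.
package main

import "fmt"

type AppState string

const (
	StateSubmitted    AppState = "SUBMITTED"
	StatePendingRerun AppState = "PENDING_RERUN"
	StateFailing      AppState = "FAILING: Driver Pod not found"
)

// apiServerState and cachedState stand in for the stored SparkApplication
// status and the informer's local copy of it.
var (
	apiServerState = StateSubmitted
	cachedState    = StateSubmitted
)

// updateStatus mimics writing the new state via the REST client only;
// the cached copy would be refreshed later, when the watch event arrives.
func updateStatus(s AppState) {
	apiServerState = s
	// cachedState is intentionally NOT updated here.
}

// syncOnce mimics one reconcile pass that trusts the cached state.
func syncOnce(driverPodExists bool) {
	if cachedState == StateSubmitted && !driverPodExists {
		// This is the {SUBMITTED} -> {FAILING Driver Pod not found} step
		// from the log above.
		fmt.Println("cache says SUBMITTED, driver pod missing -> FAILING")
		updateStatus(StateFailing)
	}
}

func main() {
	// After deleting the driver pod, the operator writes PENDING_RERUN to
	// the API server, but the local cache still says SUBMITTED.
	updateStatus(StatePendingRerun)

	// The next sync runs before the cache is refreshed, cannot find the
	// driver pod, and fails the application instead of re-running it.
	syncOnce(false)

	fmt.Println("API server state:", apiServerState)
	fmt.Println("cached state:    ", cachedState)
}
```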