Job never comes up and fails with driver pod not found #795
Which version of the operator are you running? This kind of error typically happens because the controller reconciliation logic tries to get the driver pod and fails to find it in the cache (the local client informer cache that is asynchronously synced against the API server). We have made changes to fall back to getting the driver pod directly from the k8s API server if it is not found in the cache.
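For anyone reading along, here is a minimal sketch of the cache-with-fallback lookup described above, written against client-go (recent versions with context-aware calls); the function and variable names are illustrative and not the operator's actual code:

```go
package driverpod

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// getDriverPod checks the shared informer cache first and, on a cache miss,
// falls back to a direct GET against the API server, since the cache is synced
// asynchronously and may simply be lagging behind.
func getDriverPod(lister corelisters.PodLister, client kubernetes.Interface,
	namespace, name string) (*corev1.Pod, error) {
	pod, err := lister.Pods(namespace).Get(name)
	if err == nil {
		return pod, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// Not in the cache yet: ask the API server directly before concluding
	// that the pod really does not exist.
	return client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
}
```

The informer cache is eventually consistent, so a driver pod created moments ago can legitimately be missing from the cache while the API server already knows about it; the fallback avoids treating that lag as a missing pod.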
I'm running the 2.4.4 version.
What is the image tag of the version you are running?
Image: gcr.io/spark-operator/spark-operator:v1beta2-1.0.1-2.4.4
Please give the new release based on 2.4.5 a try and let me know if you still run into the same issue.
Still seeing the same issue.
|
I have exactly the same issue.
@sb2nov I saw a |
I did not change that. It was just one |
Did you enable Prometheus monitoring? Can you paste your |
I do have Prometheus monitoring turned on:
|
Removing everything below |
That explains it. Enabling Prometheus monitoring will modify the spec with some additional annotations, which causes the submitted run to be invalidated and the driver pod to get deleted. There's a bug here: we should not write the updated spec to the API server as we only use it to construct the submission command. Will fix that.
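To make the intended fix concrete, here is a minimal sketch under the assumption that the monitoring-related mutation can be applied to a deep copy used only to build the submission command; the helper names are placeholders, and the import path and field names reflect my understanding of the v1beta2 API in this repo:

```go
package submission

import (
	"github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/apis/sparkoperator.k8s.io/v1beta2"
)

// buildSubmission applies the Prometheus monitoring mutation to a deep copy
// that is only used to assemble the spark-submit invocation. The copy is never
// written back to the API server, so the stored spec (and the run submitted
// from it) is not invalidated.
func buildSubmission(app *v1beta2.SparkApplication) ([]string, error) {
	local := app.DeepCopy() // leave the stored/cached object untouched
	if local.Spec.Monitoring != nil {
		addMonitoringAnnotations(local) // local mutation only
	}
	return buildSubmissionArgs(local)
}

// addMonitoringAnnotations stands in for the step that adds the extra
// monitoring annotations mentioned above; placeholder, not the real function.
func addMonitoringAnnotations(app *v1beta2.SparkApplication) {}

// buildSubmissionArgs stands in for translating the (locally mutated) spec
// into spark-submit arguments; placeholder, not the real function.
func buildSubmissionArgs(app *v1beta2.SparkApplication) ([]string, error) {
	return nil, nil
}
```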
Thank you for the help.
Created #804 to fix the unwanted spec update. Will include it in the latest release image, and will make a new bug-fix release later.
Awesome. Thank you so much, really appreciate the help.
The spec update issue has been fixed and the image has been updated. Please uninstall and reinstall through Helm.
@liyinan926 I'm still seeing the issue. The operator is running image:
|
Are you still seeing the event |
Yes, still seeing the
|
Interestingly I don't see the updated fields in the spec triggered by the Prometheus monitoring configuration. Can you do a diff on |
Sorry, I was too quick to jump to a conclusion. We have actually updated to using the status subresource and |
The difference I see between the runs that work and the ones that don't seems to be the second submit:
|
Is there somehow a race here between the update being processed and the submit being called?
There seems to be some kind of race condition here. I don't know where the update is coming from in your case, but it looks like the error occurred when the update landed right after the app got submitted. The update invalidated the submitted run and caused the driver pod to be deleted. Meanwhile, the status update performed during submission triggered a requeue, another round of reconciliation, and a check for the driver pod.
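To make the suspected ordering concrete, here is a stripped-down sketch that replays the sequence with stand-in types and helpers; it illustrates the reasoning only and is not the operator's actual control flow:

```go
package reconcile

import "errors"

// app is a tiny stand-in for a SparkApplication, just enough to walk through
// the sequence; everything here is illustrative.
type app struct {
	specChanged bool // an update to the spec arrived after submission
	driverPod   bool // whether the driver pod currently exists
}

func submit(a *app)             { a.driverPod = true }   // submission creates the driver pod
func externalSpecUpdate(a *app) { a.specChanged = true } // the mystery update lands
func invalidateRun(a *app)      { a.driverPod = false }  // the stale run is torn down

// raceTimeline replays the ordering described above: the app is submitted and
// its status update enqueues another reconciliation; a spec update lands right
// afterwards and invalidates the run, deleting the driver pod; the requeued
// reconciliation then looks for a driver pod that is already gone.
func raceTimeline(a *app) error {
	submit(a)             // 1. submission succeeds, driver pod exists
	externalSpecUpdate(a) // 2. a spec update arrives immediately afterwards
	if a.specChanged {
		invalidateRun(a) // the submitted run no longer matches the spec
	}
	if !a.driverPod { // 3. the requeued reconciliation checks the driver pod
		return errors.New("driver pod not found") // the symptom in this issue
	}
	return nil
}
```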
Is there any further progress on this issue? We are starting to see it too.
@liyinan926 just wanted to check if there is anything I can help with?
One thing you can do is to do a |
Another thing I don't understand is the |
This issue seems related to the issue reported in #826, which was caused by applying default values in the wrong place. Specifically, the default values were applied in the |
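As an illustration of why defaulting in the wrong place bites, here is a minimal sketch contrasting in-place defaulting of a shared object with defaulting on a copy used only for submission. The exact location of the bad defaulting is elided above, so treating the shared informer-cache object as the victim is an assumption, and all names here are placeholders:

```go
package defaults

// App is a stand-in for a SparkApplication; Spec holds a field that defaulting
// may fill in. Placeholder types for illustration only.
type App struct {
	Spec Spec
}

type Spec struct {
	Mode string // e.g. left empty by the user, defaulted to "cluster"
}

// applyDefaultsInPlace mutates whatever object it is handed. If that object is
// the one shared with the informer cache, the cached copy silently diverges
// from what is stored in the API server, and a later reconciliation sees a
// "spec change" that invalidates the submitted run.
func applyDefaultsInPlace(app *App) {
	if app.Spec.Mode == "" {
		app.Spec.Mode = "cluster"
	}
}

// applyDefaultsToCopy is the safer variant: defaulting happens on a copy used
// only for building the submission, and the shared object stays untouched.
func applyDefaultsToCopy(app *App) *App {
	local := *app // a real SparkApplication would use DeepCopy()
	applyDefaultsInPlace(&local)
	return &local
}
```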
Awesome, thank you so much. Will you be cutting a new release by any chance?
I have updated and pushed image |
We were hitting the same issue. I tried |
Can you give |
@liyinan926 I already tried it and it fixed the issue for us.
Tried it as well; it seems to have fixed the underlying issue.
The fix in #832 has been released in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/releases/tag/v1beta2-1.1.1-2.4.5.
Sometimes the creation request succeeds but the job never comes up, and doing a describe on the SparkApplication shows the driver pod not found error.
We've seen this happen about 25% of the time. Any advice?