Intermittent KeyError when handling a finished future in the task worker #11350
Comments
I would like to be assigned to this issue.
I am also interested.
Hi @laynestephens and @Nyu10, thanks for volunteering. I will assign this to @laynestephens since they were first. @Nyu10, in the meantime I have assigned you another issue you asked for.
Hi again! I am working with a partner, @adviti-mishra, on this issue, and I was wondering if they could be assigned as well.
Hi @laynestephens, yes, sure.
@laynestephens We will need @adviti-mishra to comment on this issue so that I can assign them.
Hello @MisRob, thank you so much! I'm with @laynestephens and would love to be assigned as well.
The KeyError comes from deleting a key called future from self.job_future_mapping and a key called job.job_id from self.future_job_mapping. The first fix that came to mind was checking whether the keys exist in the dictionaries before deleting them. However, we suspect the root cause is a race condition: this function deletes from the dictionaries in one thread while another function modifies them in another thread (see the sketch below).
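A minimal sketch of a lock-guarded cleanup along those lines. The mapping attribute names are taken from the comment above, but the lock and method names are hypothetical, not the project's actual code:

```python
import threading

class Worker:
    def __init__(self):
        self.job_future_mapping = {}   # future -> job
        self.future_job_mapping = {}   # job_id -> future
        self._mapping_lock = threading.Lock()

    def handle_finished_future(self, future):
        # Guard the paired deletions with a lock so no other thread can
        # mutate the mappings mid-cleanup, and use pop() with a default
        # so an already-removed entry cannot raise KeyError.
        with self._mapping_lock:
            job = self.job_future_mapping.pop(future, None)
            if job is not None:
                self.future_job_mapping.pop(job.job_id, None)
```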
@adviti-mishra It seems you've already discussed this on Slack. Let us know there if you need anything else. Thank you.
Fixed intermittent KeyError when handling a finished future in the task worker #11350
Fixed in #11591
Observed behavior
Occasionally, when a task is completed, deleting its future from the futures map raises an unhandled KeyError.
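A contrived reproduction of the pattern, using hypothetical names rather than the project's actual code: two threads race to delete the same mapping entry, and the loser hits the KeyError.

```python
import threading

future_job_mapping = {"job-1": "future-1"}

def cleanup(job_id):
    # An unguarded `del` raises KeyError when another thread has already
    # removed the entry.
    del future_job_mapping[job_id]

threads = [threading.Thread(target=cleanup, args=("job-1",)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one thread succeeds; the other prints a KeyError traceback.
```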
Errors and logs
Expected behavior
It would be good to understand why these KeyErrors are happening, but ultimately it is enough to either prevent them or handle them, since the deletion is unneeded once the entry is already gone (one way to handle them is sketched below).
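One way to realize the "handle them" option, a sketch with a hypothetical helper rather than the fix that actually landed:

```python
def discard_entry(mapping, key):
    """Delete key from mapping, tolerating a concurrent removal."""
    try:
        del mapping[key]
    except KeyError:
        # The entry was already removed elsewhere; since the deletion is
        # unneeded in that case, it is safe to ignore.
        pass

jobs = {"job-1": "future-1"}
discard_entry(jobs, "job-1")
discard_entry(jobs, "job-1")  # second call is a no-op instead of an error
```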
User-facing consequences
Erroneous error logs that suggest something has gone wrong when it hasn't
Steps to reproduce
Running the task runner tests in multiprocessing mode should be sufficient.
Context
Observed in the macOS tests on GitHub Actions (but also occasionally observed locally with threaded task runners).