Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent KeyError when handling a finished future in the task worker #11350

Closed
rtibbles opened this issue Oct 4, 2023 · 10 comments
Closed
Assignees
Labels
DEV: backend Python, databases, networking, filesystem... help wanted Open source contributors welcome P2 - normal Priority: Nice to have

Comments

@rtibbles
Copy link
Member

rtibbles commented Oct 4, 2023

Observed behavior

Occasionally, when a task is completed, deleting its future from the futures map results in a KeyError that is unhandled.

Errors and logs

ERROR    2023-10-04 15:52:05,784 exception calling callback for <Future at 0x116e39358 state=finished returned NoneType>
Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.6.15/x64/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/Users/runner/work/kolibri/kolibri/kolibri/core/tasks/worker.py", line 90, in handle_finished_future
    del self.future_job_mapping[job.job_id]
KeyError: '68a88bf36f0b4cd8942b4db6f895d6f5'

Expected behavior

Would be good to understand why these KeyErrors are happening, but ultimately making them either not happen, or be handled, because the deletion is unneeded is sufficient resolution here.

User-facing consequences

Erroneous error logs that suggest something has gone wrong when it hasn't

Steps to reproduce

Running the task runner tests in multiprocessing mode should be sufficient.

Context

Observed in the MacOS tests on Github Actions (but also occasionally observed locally with threaded task runners).

@rtibbles rtibbles added P2 - normal Priority: Nice to have DEV: backend Python, databases, networking, filesystem... help wanted Open source contributors welcome labels Oct 4, 2023
@rtibbles rtibbles added this to the 0.16 Future Patches triage milestone Oct 4, 2023
@laynestephens
Copy link

I would like to be assigned to this issue.

@Nyu10
Copy link

Nyu10 commented Nov 8, 2023

am also interested

@MisRob
Copy link
Member

MisRob commented Nov 10, 2023

Hi @laynestephens and @Nyu10, thanks for volunteering. I will assign this to @laynestephens since they were first. @Nyu10 I assigned you another issue you asked for meanwhile.

@laynestephens
Copy link

Hi again! I am working with a partner on this issue, @adviti-mishra , and I was wondering if they could be assigned simultaneously.

@MisRob
Copy link
Member

MisRob commented Nov 24, 2023

Hi @laynestephens, yes sure

@MisRob
Copy link
Member

MisRob commented Nov 24, 2023

@laynestephens We will need @adviti-mishra to comment on this issue so I can assign them

@adviti-mishra
Copy link
Contributor

Hello @MisRob Thank you so much! I'm with @laynestephens and would love to be assigned as well.

@adviti-mishra
Copy link
Contributor

The KeyError issue seems to be in deleting a variable called future from self.job_future_mapping and a variable called job.job_id from self.future_job_mapping. The first fix that came to mind was adding an if check to see if the keys exist in the dictionaries before deleting them. However, we suspect the root cause of this error is a race condition where this function is trying to delete from the dictionaries in one thread while another function is trying to modify the dictionaries in another thread.
If our hypotheses are correct, could we make a PR for the first fix and then look into fixing the race condition? This is our first time contributing to open source, so we aren't sure if ideating like this on the issue is a good idea or not : ) Thank you!

@MisRob
Copy link
Member

MisRob commented Dec 4, 2023

@adviti-mishra It seems you've already discussed this on Slack. Let us know there if you needed anything else. Thank you.

rtibbles added a commit that referenced this issue Jan 17, 2024
Fixed intermittent KeyError when handling a finished future in the task worker #11350
@rtibbles
Copy link
Member Author

Fixed in #11591

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DEV: backend Python, databases, networking, filesystem... help wanted Open source contributors welcome P2 - normal Priority: Nice to have
Projects
None yet
Development

No branches or pull requests

5 participants