Fix a race condition which would allow a rescheduled task to be reported missing even though it is not #5160
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There are two race conditions in here. I could only produce one of them in a test
If a key is transitioned to "flight", a
gather_dep
task for this key is scheduled. This race condition requires the task to be scheduled in batch with another set of tasks. If one task of this batch is released before thegather_dep
coroutine is first executed but after the transition inensure_communicate
, thegather_dep
coroutine will filter the released task and will only try to fetch results of the remote for the tasks still in state "flight".Once the response is available, all keys are tried to be parsed in the response, not only the actually fetched ones. I marked this situation with a FIXME a while ago but didn't anticipate the impact nor did I have an actual test backing this up.
Two things can now happen:
get_data_from_worker
the forgotten key is rescheduled, this results in us taking the "missing-data" path since the data for the key is not in the response, the task is not in state memory, the remote was not busy and the task has still dependents since it was rescheduled. In this case the worker issues a missing data signal to the scheduler which deletes the possibly only replica of that task on the remote which in turn might escalate pretty badly.I tried very hard in reproducing 2.) using multiple mocks and locks 🤯 but it is extremely difficult to produce due to the rather monolithic design of that particular section of code. I settled for for a test which covers only 1.) since 2.) is a special case of 1.) I intend to eliminate such race conditions by implementing a cancelled state as described in #4413 (comment) but this is way out of scope for this PR. The test will still be useful, though.