task pool: fix flow behaviour with incomplete outputs #4737
Conversation
* Closes cylc#4687.
* Tasks that have incomplete outputs are held in the n=0 pool.
* This means as new flows approach them they merge but are not re-run.
* This change resets such tasks to waiting and re-queues them to allow them to re-run in the same way a task with complete outputs would.
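For context, here is a minimal sketch (not taken from the PR or the issue; task and output names are hypothetical) of how a task can finish with incomplete outputs: `proc:ready` appears in the graph without a `?`, so it is a required output, and if `proc` finishes without ever sending the `ready` message it is retained in the n=0 pool as incomplete.

```
# flow.cylc -- hypothetical minimal example, assuming Cylc 8 syntax
[scheduling]
    [[graph]]
        # "proc:ready" has no "?", so it is a required output; proc is
        # incomplete if it finishes without producing it.
        R1 = "proc:ready => consume"
[runtime]
    [[proc]]
        script = """
            # If a bug prevents this message from being sent, proc
            # finishes with incomplete outputs and is held in n=0.
            cylc message -- "ready"
        """
        [[[outputs]]]
            ready = "ready"
    [[consume]]
        script = true
```

Before this change, a new flow reaching `proc` would merge with the incomplete instance and go no further; with the change, `proc` is reset to waiting and re-queued so the new flow re-runs it.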
On reflection I'm not 100% sure about this. Incomplete tasks are retained in the n=0 pool because they have, by definition, not been fixed and retriggered yet (otherwise they would not still be incomplete), so we should not expect them to run successfully if triggered automatically (in another flow or not). On that basis we should expect an incomplete task to block other flows as well as its own, and to wait for manual retriggering before the merged flow can continue. It won't break anything to re-run a task that is destined to fail to complete its expected outputs again, but it is still an unnecessary use of compute resource. Caveat:
I don't think the outputs a task has or has not generated should affect subsequent flows. We don't distinguish between waiting and finished tasks, so why distinguish between complete and incomplete ones? It seems a bit odd for the flow front to just stop because it met a finished task (which could be a succeeded task that didn't produce a required custom output). Holding incomplete tasks in the n=0 window is an implementation detail we added to increase the visibility of these tasks in the UIs and to assist with stall detection/reporting, neither of which was present in the original "pure" SoD implementation.
I think this is one of the main use cases for reflow, if not the main one. E.g. "My data got poisoned at some point upstream, resulting in a failure downstream; now I need to fix the problem in an upstream task and re-run the sub-graph." (I've seen this use case arise due to a faulty node in an archive system which caused data corruption, leaving a few workflows in this position.) Or: "I didn't configure my model to output the right fields, resulting in a failure downstream ...". Or: "I'm writing a workflow incrementally in a write-reload-run loop, trying to get downstream tasks succeeding."
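As a rough illustration of that use case (command syntax is assumed for a released Cylc 8; the workflow, cycle point, and task names are made up): after fixing the upstream task's configuration you would reload the workflow and retrigger that task in a new flow so the downstream sub-graph re-runs.

```bash
# Hypothetical recovery sequence, assuming Cylc 8 CLI syntax.
# 1. Edit the upstream task's configuration, then pick up the change:
cylc reload myflow
# 2. Re-run from the fixed task in a new flow; downstream tasks
#    (including any previously incomplete ones) re-run in that flow:
cylc trigger --flow=new myflow//1/proc
```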
Fair enough, good argument. The majority of incomplete tasks will be failed tasks that did not succeed for internal reasons, and so will fail again in any flow. But upstream causes are possible, so my 2nd caveat above justifies this even if it's not super likely. I'm not sure I've ever seen your use cases in the wild (it's usually about getting the next task running successfully by understanding and tweaking its own config) but I like them in principle. Approving...
Co-authored-by: Hilary James Oliver <[email protected]>
Manually tested with the example in the issue (on master to reproduce the problem, and on this branch): it fixes the issue. Read the code (I appreciate the code explanations) and ran the relevant tests locally with no problems.
Ran pytest manually locally to get around the bugbear failures; all good.
Requirements check-list

* I have read CONTRIBUTING.md and added my name as a Code Contributor.
* Dependency changes (if any) applied to both setup.cfg and conda-environment.yml.