
task pool: fix flow behaviour with incomplete outputs #4737

Merged
merged 3 commits into cylc:master on Mar 21, 2022

Conversation

oliver-sanders
Member

@oliver-sanders oliver-sanders commented Mar 9, 2022

  • Closes #4687 (reflow: tasks with incomplete output should rerun).
  • Tasks that have incomplete outputs are held in the n=0 pool.
  • This means as new flows approach them they merge but are not re-run.
  • This change resets such tasks to waiting and re-queues them to allow them to re-run in the same way a task with complete outputs would.

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation update required.

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Mar 9, 2022
@oliver-sanders oliver-sanders added this to the cylc-8.0rc2 milestone Mar 9, 2022
@oliver-sanders oliver-sanders requested a review from hjoliver March 9, 2022 16:32
@oliver-sanders oliver-sanders self-assigned this Mar 9, 2022
@hjoliver
Member

hjoliver commented Mar 17, 2022

On reflection I'm not 100% sure about this.

Incomplete tasks are retained in n=0 in the expectation of manual retriggering at some point, to complete their expected outputs and continue their own flow.

They have by definition not been fixed and retriggered yet (otherwise they would not still be incomplete) so we should not expect them to run successfully if triggered automatically (in another flow or not).

On that basis we should expect an incomplete task to block other flows as well as its own, and wait for manual retriggering to allow the merged flow to continue.

It won't break anything to re-run a task that is destined to fail again (i.e. fail to complete its expected outputs), but it's still an unnecessary use of compute resources.

Caveat:

  • If different flows are processing different data through the graph (i.e. doing flow-number-based processing, as opposed to merely rerunning exactly the same tasks after fixing a single problem somewhere), it's not necessarily true that a broken task will be broken in all flows ... but we won't be recommending that sort of use case until/if flows can be entirely independent (i.e. no merging).
  • I suppose there is a chance that an upstream change, for the new flow, will "fix" the incomplete task by changing its input data, but ... rather unlikely??

@oliver-sanders
Member Author

I don't think the outputs the task has or has not generated should affect subsequent flows. We don't distinguish between waiting and finished tasks so why distinguish between complete and incomplete ones?

It seems a bit odd for the flow front to just stop because it met a finished task (which could be a succeeded task that didn't produce a required custom output).

The way we hold incomplete tasks in the n=0 window is an implementation detail, added to increase the visibility of these tasks in the UIs and to assist with the stall detection/reporting that was not present in the original "pure" SoD implementation.

> I suppose there is a chance that an upstream change, for the new flow, will "fix" the incomplete task by changing its input data, but ... rather unlikely??

I think this is one of the main use cases for reflow, if not the main one?

E.G:

"My data got poisoned at some point upstream resulting in a failure downstream, now I need to fix the problem in an upstream task and re-run the sub-graph".

(I've seen this use case arise due to a faulty node in an archive system that caused data corruption, resulting in a few workflows ending up in this position.)

Or:

"I didn't configure my model to output the right fields resulting in a failure downstream ...".

Or:

"I'm writing a workflow incrementally in a write-reload-run loop trying to get downstream tasks succeeding".

@hjoliver
Member

Fair enough, good argument.

The majority of incomplete tasks will be failed tasks that did not succeed for internal reasons, and so will fail again in any flow. But upstream reasons are possible, so my 2nd caveat above justifies this even if it's not super likely.

I'm not sure I've ever seen your use cases in the wild (it's usually about getting the next task running successfully by understanding and tweaking its own config), but I like them in principle.

Approving...

oliver-sanders and others added 2 commits March 21, 2022 10:58
Co-authored-by: Hilary James Oliver <[email protected]>
@datamel datamel self-requested a review March 21, 2022 11:20
Contributor

@datamel datamel left a comment


Manually tested (on master to recreate and on this branch) with the example in the issue. Fixes the issue. Read the code (I appreciate the code explanations) and have run the relevant tests locally with no problems.

@oliver-sanders
Member Author

Manually ran pytest locally to get around the bugbear failures, all good.

@oliver-sanders oliver-sanders merged commit 545b701 into cylc:master Mar 21, 2022
@oliver-sanders oliver-sanders deleted the reflow-incomplete-outputs branch March 21, 2022 12:58