
task pool: fix flow behaviour with incomplete outputs #4737

Merged
merged 3 commits into cylc:master on Mar 21, 2022

Conversation

oliver-sanders
Member

@oliver-sanders oliver-sanders commented Mar 9, 2022

  • Closes #4687 (reflow: tasks with incomplete output should rerun).
  • Tasks that have incomplete outputs are held in the n=0 pool.
  • This means as new flows approach them they merge but are not re-run.
  • This change resets such tasks to waiting and re-queues them to allow them to re-run in the same way a task with complete outputs would.

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation update required.

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Mar 9, 2022
@oliver-sanders oliver-sanders added this to the cylc-8.0rc2 milestone Mar 9, 2022
@oliver-sanders oliver-sanders requested a review from hjoliver March 9, 2022 16:32
@oliver-sanders oliver-sanders self-assigned this Mar 9, 2022
@hjoliver
Member

hjoliver commented Mar 17, 2022

On reflection I'm not 100% sure about this.

Incomplete tasks are retained in n=0 in the expectation of manual retriggering at some point, to complete their expected outputs and continue their own flow.

They have by definition not been fixed and retriggered yet (otherwise they would not still be incomplete) so we should not expect them to run successfully if triggered automatically (in another flow or not).

On that basis we should expect an incomplete task to block other flows as well as its own, and wait for manual retriggering to allow the merged flow to continue.

It won't break anything to re-run a task that is destined to fail again (i.e. fail to complete its expected outputs), but it's still an unnecessary use of compute resources.

Caveat:

  • If different flows are processing different data through the graph (i.e. doing flow-number-based processing, as opposed to merely rerunning exactly the same tasks after fixing a single problem somewhere), it's not necessarily true that a broken task will be broken in all flows ... but we won't be recommending that sort of use case until/if flows can be entirely independent (i.e. no merging).
  • I suppose there is a chance that an upstream change, for the new flow, will "fix" the incomplete task by changing its input data, but ... rather unlikely??

@oliver-sanders
Member Author

I don't think the outputs the task has or has not generated should affect subsequent flows. We don't distinguish between waiting and finished tasks so why distinguish between complete and incomplete ones?

It seems a bit odd for the flow front to just stop because it met a finished task (which could be a succeeded task that didn't produce a required custom output).

The way we hold incomplete tasks in the n=0 window is an implementation detail, added to increase the visibility of these tasks in the UIs and to assist with the stall detection/reporting that was not present in the original "pure" SoD implementation.

> I suppose there is a chance that an upstream change, for the new flow, will "fix" the incomplete task by changing its input data, but ... rather unlikely??

I think this is one of the main use cases for reflow, if not the main one?

E.G:

"My data got poisoned at some point upstream resulting in a failure downstream, now I need to fix the problem in an upstream task and re-run the sub-graph".

(I've seen this use case arise due to a faulty node in an archive system that caused data corruption, resulting in a few workflows ending up in this position.)

Or:

"I didn't configure my model to output the right fields resulting in a failure downstream ...".

Or:

"I'm writing a workflow incrementally in a write-reload-run loop trying to get downstream tasks succeeding".

@hjoliver
Member

Fair enough, good argument.

The majority of incomplete tasks will be failed tasks that did not succeed for internal reasons, and so will fail again in any flow. But upstream reasons are possible, so my 2nd caveat above justifies this even if it's not super likely.

I'm not sure I've ever seen your use cases in the wild (it's usually about getting the next task running successfully by understanding and tweaking its own config), but I like them in principle.

Approving...

oliver-sanders and others added 2 commits March 21, 2022 10:58
Co-authored-by: Hilary James Oliver <[email protected]>
@datamel datamel self-requested a review March 21, 2022 11:20
Contributor

@datamel datamel left a comment


Manually tested (on master to recreate and on this branch) with the example in the issue. Fixes the issue. Read the code (I appreciate the code explanations) and have run the relevant tests locally with no problems.

@oliver-sanders
Member Author

Manually ran pytest locally to get around the bugbear failures, all good.

@oliver-sanders oliver-sanders merged commit 545b701 into cylc:master Mar 21, 2022
@oliver-sanders oliver-sanders deleted the reflow-incomplete-outputs branch March 21, 2022 12:58