Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

output_from from multi-step workflows #1206

Closed
gaow opened this issue Feb 3, 2019 · 20 comments
Closed

output_from from multi-step workflows #1206

gaow opened this issue Feb 3, 2019 · 20 comments

Comments

@gaow
Copy link
Member

gaow commented Feb 3, 2019

I have this example:

[vhat_1]
i = range(9)
input: for_each = 'i'
output: f'{_i}.vhat'
_output.touch()

[vhat_2]
input: group_by = 'all'
output: 'all.txt'
bash: expand=True
  cat {_input} > {_output}

[1]
input: output_from('vhat')
output: f'{_input:n}.out'
_output.touch()

It does not work,

$ sos run test.sos -s build
INFO: Running vhat_1: 
INFO: Step vhat_1 (index=3) is ignored with signature constructed
INFO: Step vhat_1 (index=2) is ignored with signature constructed
INFO: Step vhat_1 (index=4) is ignored with signature constructed
INFO: Step vhat_1 (index=0) is ignored with signature constructed
INFO: Step vhat_1 (index=1) is ignored with signature constructed
INFO: Step vhat_1 (index=7) is ignored with signature constructed
INFO: Step vhat_1 (index=8) is ignored with signature constructed
INFO: Step vhat_1 (index=6) is ignored with signature constructed
INFO: Step vhat_1 (index=5) is ignored with signature constructed
INFO: output:   0.vhat 1.vhat... (9 items in 9 groups)
INFO: Running vhat_2: 
INFO: Step vhat_2 (index=0) is ignored with signature constructed
INFO: output:   all.txt
INFO: Running 1: 
ERROR: [1]: Failed to process input statement output_from('vhat')
: Failed to obtain output of step vhat

When I change output_from('vhat') to output_from('vhat_2') it works when all.txt already exists. I would expect vhat to work though.

@BoPeng
Copy link
Contributor

BoPeng commented Feb 3, 2019

You are trying to get the output of a workflow, not a step, right? Right now output_from strictly gets from step so it expects vhat_2.

@gaow
Copy link
Member Author

gaow commented Feb 3, 2019

Yes. But it will not work because vhat_2 depends on vhat_1 which in my current implemention the dependency is forward style relying on numerical index.

@BoPeng
Copy link
Contributor

BoPeng commented Feb 3, 2019

So, without running anything, you mean

  1. output_from('vhat_2') does not work because it does not trace further back?
  2. Do you want output_from('vhat') to be the same as output_from('vhat_2') in this case? It is not reliable because vhat_2 does not have to be the last step of the numerically ordered workflow.

@BoPeng
Copy link
Contributor

BoPeng commented Feb 3, 2019

I see the problem.

When you run the workflow with output_from('vhat_2'),

  1. It executed vhat_2 because steps in a numerically ordered workflow can also be used individually.
  2. The step has unspecified input so it does not trace dependency further.
  3. Then the action becomes cat > all.txt, which waits for user input

@gaow
Copy link
Member Author

gaow commented Feb 3, 2019

Yes your 1-3 is what happens. Also curiously when i ask for output_from('vhat') it does execute both vhat_1 and vhat_2 first, then quit on error. Would it make sense to have output_from also work for workflows like this case?

@BoPeng
Copy link
Contributor

BoPeng commented Feb 3, 2019

output_from('vhat') implicitly adds sos_step('vhat') during the DAG buidling phase so the two steps were executed. Then output_from('vhat') does not exist because there is no such step.

@BoPeng
Copy link
Contributor

BoPeng commented Feb 3, 2019

So there are a few questions:

  1. should output_from('vhat_2') adds vhat_1 as dependency? In a forward style workflow when vhat_2 is placed before vhat_1 this seems to make sense but input from vhat_2 will make it independent of vhat_1. The tricky part here is that in this example the input statement for vhat_2 exists but is still unspecified.

  2. should output_from('vhat') get the output of the last step of vhat? This can be tricky because the "output" does not have to be the output from the step with the highest index.

@BoPeng
Copy link
Contributor

BoPeng commented Feb 3, 2019

The problem with 2 is that conceptually speaking there is no "output from a workflow", because a forward-style workflow can have multiple branches and not necessarily a last step. We can say output_from('vhat') is output_from('vhat_2') (the last step according to numeric index), and which would work for perhaps 99% of the cases but I am not sure if that is a good idea.

@gaow
Copy link
Member Author

gaow commented Feb 3, 2019

which would work for perhaps 99% of the cases but I am not sure if that is a good idea.

what's the cons with this approach? In the context of numerically ordered process oriented style, I'd think it is reasonable that output_from('vhat') refers to vhat_2 and SoS will figure out what vhat_2 depends on, whether it be explicit or implicit. Currently output_from('vhat') is not defined and we have to define it one way or another. The proposed approach seems the most reasonable.

@BoPeng
Copy link
Contributor

BoPeng commented Feb 3, 2019

I am thinking of

[vhat_1]
depends: sos_step('vhat_2')

[vhat_2]

[default]
input: output_from('vhat')

namely when vhat_2 is not the real last step of vhat.

@gaow
Copy link
Member Author

gaow commented Feb 3, 2019

I see. Then it is perhaps cleanest to somehow introduce the concept of output from workflows, if possible?

@BoPeng
Copy link
Contributor

BoPeng commented Feb 3, 2019

Some added syntax? Isn't that named_output?

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2019

Thinking of

[A_1]
[A_2]
[A_3]
[A_4]

[default]
depends: sos_step('A_3')

We assume that A is a forward-style workflow so conceptually A_3 depends on A_1 and A_2. So instead of executing A_3 only, it might make sense to execute A_1, A_2 and A_3.

If this can be done, then

[A_1]
[A_2]
[A_3]
[A_4]

[default]
depends: output_from('A_3')

will execute A_1, A_2, and A_3, and returns the output of step A_3.

@gaow
Copy link
Member Author

gaow commented Feb 4, 2019

I think syntax-wise we can stick to what we have, but output_from('A') means executing A_1 to A_4 and is shorthand for output_from('A_4'). This is what i meant by the concept of output from workflows. I agree output_from('A_3') should imply executing the first 3 steps. And maybe we can have some output_from option to let it only consider the exact step asked, without trying to resolve what's upstream?

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2019

In terms of A1 to A_3 we already have the syntax of A_1:3 (not sure the exact syntax) but sos_step does not accept it. It makes sense to execute A_1 and A_2 if A_3 is executed but in practice this can be hard to judge. For example, are we going to execute A1 and A2 in the following case?

[A_1]
[A_2]
[A_3]
input: None
output: 'a.txt'
[A_4]

[default]
input: 'a.txt'

It might make sense to run A_1 and A_2 but then many logics in SoS will be changed.

@gaow
Copy link
Member Author

gaow commented Feb 4, 2019

I think when input is explicitly specified in this case, then only A_3 should be executed because it overrides the default numerical indices. We only rely on numerical ordering under the "default" setting, eg, similar to what I had in the vhat case. I thought these differences are distinguishable in SoS already?

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2019

After more thought, I think it is too much trouble to call all previous steps in this case. What is lacking, in my opinion, is dependency of A_2 on A_1 in the case of

[A_1]
[A_2]
[default}
depends: sos_step('A_2')

That is to say, we say A_2 is dependent on A_1 in the forward sense because A_2 has unspecified input, but in the backward sense we say A_2 is not dependent on A_1.

I think it is good enough for us to make the dependency consistent, so whereas the aforementioned example would trigger A_1,

[A_1]
[A_2]
input: None

[default}
depends: sos_step('A_2')

will not because A_2 does not depend on A_1.

@BoPeng BoPeng closed this as completed Feb 4, 2019
@BoPeng BoPeng reopened this Feb 4, 2019
@gaow
Copy link
Member Author

gaow commented Feb 4, 2019

so whereas the aforementioned example would trigger A_1, ... will not because A_2 does not depend on A_1.

I agree with this. I'm just wondering if we can make output_from('A') by default refer to the largest index A_2 in this case, because the indices are likely to change as one works on the workflows. Assuming the largest index seems a reasonable default.

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2019

What has been done:

  1. When name_idx is added as an auxiliary step, and if this step has unspecified input, it will add all its previous step to the DAG.
  2. output_from(name) will get the output of the last step of name if name is a workflow name.

For 1, we have to add all previous steps because a step might have multiple dependency steps so we cannot simply trace back. For example, A_3 depends on A_1 even if A_2 has no input.

[A_1]

[A_2]
input: None

[A_3]

@BoPeng BoPeng closed this as completed Feb 4, 2019
@gaow
Copy link
Member Author

gaow commented Feb 4, 2019

Thanks @BoPeng this is reasonable!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants