-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not require for all task inputs to be filled by the calling workflow #347
Conversation
Looks good to me. I would/will thumb this as-is, if/when given the chance |
I too value not having to declare workflow inputs whose only purpose is to tediously feed straight into some required call input. There's an upside we should acknowledge, though, in making all the workflow's required inputs plainly visible in one code section. If that's lacking and workflow calls tasks from several imported documents, the user may practically need help from a static analysis program to figure out what inputs they actually need to write. So I didn't mind the compromise of declaring & passing through required inputs while being able to omit optional inputs, although it was a little weird. Open to new ideas... tagging @orodeh as I recall we discussed this at length in the past. |
It seems there is no controversy about this. Shall we get this into voting? Who can initiate a voting process? @geoffjentry ? |
Hi all, there is also an impact on workflow composition -- workflow calling another sub-workflow. If the 'inner' workflow has this kind of implicit required input, currently there is no syntax for the 'outer' workflow to supply them in the So use of this feature, while often convenient, (i) obfuscates a workflow's required inputs and (ii) interferes with composition of the workflow. I believe we are just codifying a past behavior of Cromwell than many of us enjoyed over the years, and which miniwdl also currently supports. However, these downsides are significant and perhaps worth a passing mention in the spec where the feature is discussed. |
This is a good point. But in practice we do pass along the required inputs to the calling workflow. However this has the significant downside of being a burden on the engine developers, having to distinguish between required and non-required inputs. I think that not passing all required inputs to the top workflow is a bad practice. Not only that it will make the workflow a lot harder to use for users. I think this is enough of a deterrent. I do not think that this needs to be enforced by the engines and workflow runners.
EDIT: I think this behavior is very useful for optional inputs to tasks, but it is indeed very bad for required inputs to tasks. The PR should indeed reflect that.
I agree. I will update the PR accordingly. |
@mlin I updated the PR. Does this properly address your concerns? |
LGTM, thanks! A possible long-term future idea from @kislyuk: some notation in the call inputs section saying, 'if an input is not provided here and the workflow has an input of the same name, pass it through' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea makes sense to me. Some of the wording is a bit confusing, in my opinion.
versions/development/SPEC.md
Outdated
* It is good practice for workflows to provide the *required* inputs for its tasks. If this is | ||
not done the workflow cannot be called as a subworkflow as the call `input:` section does | ||
not allow namespaced inputs. Also, if a workflow does not fill its tasks' required options then these should be supplied by the user, while not being listed in the workflow `input` section. This can cause confusion and make the workflow harder to use. | ||
Engines may optionally enforce the good behavior of supplying all tasks' required inputs by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be enforced. It isn't hard for an engine to distinguish optional from required inputs. If a required input is not provided, the task is most likely to fail. The nice property here is that it can be checked statically, before running a large workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I made this optional on purpose, so I would not burden the engine developers with this. I can imagine it is not as straightforward as it may sound. But I am curious what the engine developers actually think. @mlin, @cjllanwarne ?
versions/development/SPEC.md
Outdated
[the section on fully qualified names](#fully-qualified-names--namespaced-identifiers). | ||
* *Optional* inputs (or required inputs with defaults) for tasks are not required to be | ||
satisfied. | ||
* It is good practice for workflows to provide the *required* inputs for its tasks. If this is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It certainly makes sense to say that a workflow doesn't have to provide all the inputs, but only the required ones. This particular point is confusing though. The original text is sufficient, in my opinion:
If a workflow will only ever be submitted as a top-level workflow, it may optionally leave its tasks' inputs unsatisfied. This then forces the engine to additionally supply those inputs at run time. In this case, the inputs' names must be qualified in the inputs as workflow_name.task_name.input_name
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think that @mlin was referring to this exact original text you quote when he said:
Hi all, there is also an impact on workflow composition -- workflow calling another sub-workflow. If the 'inner' workflow has this kind of implicit required input, currently there is no syntax for the 'outer' workflow to supply them in the call inputs:, which does not admit namespaced keys.
So use of this feature, while often convenient, (i) obfuscates a workflow's required inputs and (ii) interferes with composition of the workflow.
I agree that the current text is not quite clear. But I think that the better option is to make it required for all required task inputs to be satisfied, as you suggest. The text will become more clear that way. But as stated, I would like to know how the engine developers feel about this.
A serious issue that we ran into at DNAnexus, which @mlin was hinting at, is the inheritance of optionals. Assume workflow We do not have a general solution for the problem, but dxWDL tries to limit this behavior so that compiled workflows can be presented in our UI. |
Interesting. In BioWDL workflows we use parameter_meta for this. We provide a description and a category for each parameter. Categories are: |
@orodeh Correct me if I'm wrong here, but this is an issue regardless of whether they are passed through in a namespaced manner (as per this PR) or they are defined in the top-level inputs sections and then passed from call to call explicitly, right? Either way, all of these inputs will be available and would thus get rendered in a UI. |
@DavyCats That's right. All of these inputs will need to be rendered. The point is that the user is not expecting there to be so many inputs. If the tasks have a small number of required inputs, but many optional inputs, then, the user is typically surprised to see all of the optionals become top-level inputs. |
@orodoh, agreed, and this is why the PR states that optional inputs don't have to be filled in by the top level workflow. That solves this issue, correct? I am not so sure I understand what the problem is here. Is there something that needs to be addressed in the PR for this particular issue? |
The issue is that the input section is very large, even if you don't have to specify everything in it. As an engine developer, this is a problem, for example, because the UI has to render the entire input section. I suggest that we explicitly limit this behavior to a workflow calling tasks; it would not work for a workflow calling sub-workflows. For example, let's say that:
Then, |
I slightly disagree with the direction this PR is heading in. I'm much rather have flexibility and simplicity than a rigid set of complex rules. If it were up to me, this PR would just change those two existing bullets to be:
Notes:
|
@cjllanwarne, this sounds good to me. If an engine can limit the bubble up behavior, then I am just fine with it. |
@cjllanwarne I agree, with all points except this one:
But if it is optional for the engines, that basically means workflow authors can not rely on optional inputs to bubble up. If it is not reliable, it can not be used. Which means every time a user (for example) wants to use one of the 100+ flags of STAR that is optional, this flag will have to be propagated to the top workflow. The whole point of this PR (from my point of view), is to make sure the situation where all optional task inputs need to be propagated to the top-level workflow is avoided at all cost. Because that will lead to humongous input sections. Maybe to solve @orodeh 's problem it can better be reworded like this:
====
Agreed, I will remove that part. Also everything with the word "optional" should be avoided in the SPEC, i think. As it is aimed at setting standards. (Yes, I am aware that I am guilty of this myself.) What do you think @cjllanwarne , @orodeh ? |
I can see why this is a valuable property, and we have users who really wanted it. I did try to support the full bubble up behavior, but found that it was hard an probably impractical. You can see the discussion here (https://github.com/dnanexus/dxWDL/blob/master/doc/MissingCallArguments.md). It totally makes sense for you to run your workflow on an engine that supports this property, such as Cromwell and MiniWDL. However, think this should be optional for an engine implementation. |
@orodeh in that ticket it sounds like all unsupplied inputs are tricky for you (even those currently allowed in the SPEC)? Is there anything particularly more tricky about bubbled-up inputs? @rhpvorderman on reflection, I agree with you that we should preferably say that engines have flexibility on these inputs being allowed to be "hidden by the UI" rather than "ignorable by the engine". Since workflows are often passed around as "workflow + sample inputs", having some engines respect the bubbled-up inputs and some not would seem to me to be a bad idea. However, having some UIs hide the optional bubbled up inputs from users as "very noisy" when generating new inputs sets seems fine. |
@cjllanwarne Since your list of bullet points was much more clear and much better worded I copied it mostly at verbatim. I made a change in the last bullet points to reflect the discussion. Engines must allow all bubbled-inputs for optional task to be filled by the end user. However engines may choose how many layers are made visible to the end user. Does this solve your problem @orodeh ? |
Here is a deeper explanation for how dxWDL works, and what the root problem is. This is a simplified description of the dnanexus system, but should convey the essence of it. IntroductionNatively, dnanexus supports applets and dx:workflows. An applet is equivalent to a WDL task with no expressions and a simple type system ( An example of a dx:workflow, written in WDL:
The simplicity of the the dx:workflows makes the infrastructure scalable (job scheduling, control, etc.). The challenge is the big gap between what WDL supports and what dnanexus provides. For example, any WDL expression requires running a cloud job to evaluate. By comparison, Cromwell can evaluate WDL expressions easily, at the cost of a scaling challenge for the Cromwell server(s). The task of the dxWDL compiler is to statically compile a WDL program into applets and dx-workflows. It attempts to:
Argument passingThe crux of the problem involves what happens when a workflow calls a sub-workflow. In dnanexus, this translates into a dx:workflow calling a dx:subworkflow. A dx:subworkflow is the same as a top level workflow, with the exception that all of its arguments must be declared up front, and must be provided at runtime (unless optional). Let's say that dx:workflow
This must be done for any parameter we want to pass to
Therefore, I settled on supporting only one level of argument passing, in the special case of a top-level workflow calling tasks. I hope this gives people a better picture. |
@orodeh, I see were you are coming from. This will be hard to implement in dxWDL and it will lead to numerous difficulties if it is implemented in dxWDL. What do you think of the following statement:
Also what do you think of the statement I put in the inital post?
Given these statements, is there still a problem for dxWDL? |
@rhpvorderman dxWDL may be a compiler, but the reason for its existence is to be able to run on the dnanexus engine. It should be able to run and accept any WDL program. Therefore, I prefer the wording by @cjllanwarne with the allowance for an engine (dnanexus or dxWDL or both) to limit the bubble up behavior. My deeper concern is that the idea of unlimited bubble-up behavior runs counter to normal programming practice and thinking. Yes, it will work nicely for your use case. However, WDL programs lose the important invariant that you can visually see all the inputs to a call to sub-workflow |
WDL is not a programming language. It solves a different problem and therefore requires a different approach. It calls command line tools. These tools have optional flags. I don't mind declaring all the inputs that I think are important. But researchers might have some special requirements for flags of certain tools. Instead of having to take into account every special use case, this bubble up behavior solves the problem while keeping workflows readable.
You can argue also the other way round, that requiring all inputs to be filled, leads to a humongous number of inputs that obscure the inputs that are really important.
I can understand that, and I am not opposed to that. However my deeper concern is that making it optional does accomplish nothing. It will make inputs jsons and workflows essentially non-portable across engines. Anyway. I don't mind being pragmatic about this. We both have entrenched opinions on this, and the likelihood of agreement seems to be near-zero. The primary motivation for this PR was that Cromwell breaks certain behavior that was allowed in the past. This was due to a word change in the spec. This PR attempts to correct that. If @cjllanwarne doesn't mind supporting this behavior in Cromwell, I can use the original wording and make this optional. Under protest, because I think writing: "Hey engine developer, do whatever you want!" has no place in a spec. |
I'm not entirely in agreement here. Various programming languages allow for this type of behaviour. For example, in R one can define a function as having the
The specification wouldn't be very specific anymore if it allowed implementations to do whatever they want. It might be okay in some cases to allow for some optional features but we have to careful here, I think. |
Abstracting away from our DNAnexus experience, I think there is one more point worth mentioning here. In case others will learn from it. We have customers with FDA approved pipelines that required This is a case where we strictly wanted to disallow the bubble up behavior. |
Perhaps in that case adding syntax to explicitly enable the bubble up behaviour might be a good idea. That way people will also be able to see it when looking at the WDL file. |
Very good idea @DavyCats . Now we need some intuitive syntax to do so. Proposal 1: comment-based. Inspired by python. workflow example {
inputs { # locked: false
}
}
# Or the comment can be at the workflow level
workflow example { # locked: false
inputs {
}
} Proposal 2: new lock keyword workflow example {
lock true
inputs {
}
} Proposal 3: I think proposal 3 is by far the best option. This means I have to rewrite this PR a bit. I propose that I write down the behavior for locked and unlocked mode. And then make it optional for engines which mode(s) to support. I know that for cromwell, it is only one line of code. So it should be easy to support both modes. I do not think this will be hard for engine developers to implement. What do you think about this @cjllanwarne , @mlin , @orodeh ? @DavyCats, @patmagee what do you think of this as WDL users? |
Proposal for wording this in the spec (iteration 1):
Thinking this over again, I think we can ensure portability between locked and unlocked modes by requiring required subworkflow inputs to be filled by the calling workflow. The proposal becomes then as follows (iteration 2):
I feel iteration 2 works really well:
|
Are you proposing that this lock syntax will be a required part of the workflow? If not, I think it might be preferable to have to disable the lock rather than enable it. That way there would be an explicit indication in the WDL file that more inputs are available. I would rather not make comments meaningful on an execution level. So I'd go with the second proposal. If this were on task level, this seems like something that would be in the runtime section. I'm wondering if it's worth it to allow for a runtime section to be added to workflows as well. Although I'm guessing there wouldn't be many meaningful runtime attributes for a workflow. |
I agree with that. The lock should be explicitly disabled. However I do think this is done best on the engine side, because the extra syntax looks a bit ugly in my opinion. I think the default mode should be "locked" mode, and "unlocked" mode can be optionally supported. |
The "use syntax to enable this" reminds me of another option that came up in the past - that an author might call out which calls are bubbling up, and the engine rewrites your WDL to insert the passthrough variables. Eg you could have something like: workflow foo {
inputs {
passthrough all from my_task
}
call my_task
}
task my_task {
inputs {
Int x
Int y
Int z
}
} in which case the workflow inputs would be:
In case of collision (eg if more than one task had the same unsupplied input name, we could say either that they both get the same value, or it's a parsing error, or they need to disambiguate with an extra prefix (see additional options below). Within that framework, a couple of other options might also be possible, eg things like:
... just one more option to throw into the mix... 😄 |
Interesting proposal @cjllanwarne !
I think this is potentially more confusing than simply using namespaced inputs. I am still in favour of not adding extra syntax, but solving this engine side, defaulting to the behavior were bubbling up is not possible. |
Since this discussion got quite long I feel that the original title and introductory post no longer matches with the PR, so I opened a new one #359 . |
This is a follow-up on the discussion in broadinstitute/cromwell#5317.
The problem
Requiring the workflow to fill all task inputs is a bad idea for the following reasons:
Most of these are optional. Still the spec requires them to be filled. That is a huge burden on
the developer.
all workflows that depend on this task.
The solution
Using properly namespaced inputs. This way the workflow can satisfy all the required inputs for a workflow and propagate them to top level. This will lead to wdl files with relatively small input sections. All optional inputs (like an obscure flag on STAR) can then be input by the user by using namespaced inputs.
Cromwell 47 and earlier versions already implement this. We use this feature heavily in BioWDL (example here).
48 will not support fully qualified inputs anymore because of the current development spec and because the WDL1.0 spec does not mention this. In my opinion that is a huge regression in usability that needs to be rectified in the spec.
EDIT: Please note that this spec change does not break things if you do want to include all task inputs in the top level workflow.