AoU GVS Cohort Extract WDL #7242
Conversation
LGTM -- a few questions, and I think we should change the name of the top-level script to something that makes it clear this is for the AoU use case (thus cutting off the need to make it generally/generically configurable, etc.).
@@ -0,0 +1,135 @@
version 1.0
Can we give this a more distinct name from GvsExtractCallset? I don't have any great suggestions with this little coffee in me... GvsEndToEndCohortExtract? @mmorgantaylor @ahaessly any ideas?
It looks to be a little AoU-specific; maybe we can use that for the name: "GvsAoU..."
Setting aside the comments in the WDL that reference AoU specifically... shouldn't there be alignment on having a general-purpose extraction workflow to run all of the extraction steps together, rather than as multiple separate workflows? This is my understanding of the main need that we have (but perhaps there are some other subtleties here). Is it possible to instead generalize anything else here that might be too specific (e.g. make certain tasks skippable/idempotent or generalize naming if needed)?
I agree that having a wrapper WDL for the prepare/extract step is a good thing to have. It's not at the top of our priority list (compared to making sure that it scales, doesn't cost a fortune, etc) but it's a "good thing" for sure.
The reason I suggested naming it AoU specifically was so that we wouldn't hold this up in order to make it generally more useful outside AoU. If y'all are up for that, that would be great!
There are a few things in here that seem to be AoU specific, for example:
- requiring a destination bucket to copy the results
- requiring an extraction uuid
- not using filter_set (though maybe this is just temporary)
The other thing that I think needs to be figured out is the lifecycle of the tables created by the Prepare step. There are temp tables with a TTL, which is probably OK, but then there's the main table used by Extract; that probably should be cleaned up. In addition, the creation of the extract table in Prepare is 90% of the cost of this pipeline. If that succeeds but then one of the Extract scatters fails, you need to rerun the whole pipeline. At small scale you might not care, but at larger scale you definitely don't want to do this. How to best use cromwell call caching, or disable that and do some form of short-circuiting in the task if the output already exists, or something else entirely, needs to be thought through.
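As a rough illustration of the short-circuit option (a sketch only; the shard variable and file naming here are hypothetical, not from this PR), the extract task's command block could probe for an already-written shard and exit early instead of re-querying BigQuery:

# hypothetical early exit at the top of the extract task's command section
if gsutil -q stat "${OUTPUT_GCS_DIR}/~{output_file_base_name}_${SHARD_INDEX}.vcf.gz"; then
  echo "Output for this shard already exists; skipping extract"
  exit 0
fi
# ...otherwise run the BigQuery extract as usual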
I'm also a little hesitant to name it something AoU specific because my intent is for this WDL to be generalized enough for your use cases as well such that we're both working on and using the same WDL. That way, the work we do to improve stuff like table lifecycle and cromwell call caching will automatically be picked up by us when we update the WDL snapshot.
Given that, all those things you mentioned are definitely changes we can make. Though I might try to put those changes in a follow-up PR so I can wrap this one up soon.
I renamed it to GvsExtractCohortFromSampleNames so it looks a little less like GvsExtractCallset and better captures our intent with the WDL.
I'm all for making this more generally useful if you are -- I'll go make some comments along these lines then. What I want to avoid is it being really only usable by AoU, yet not saying it's primarily for AoU.
>>>

output {
  String fq_cohort_extract_table = read_string("fq_cohort_extract_table.txt")
My WDL is rusty, but do you have to go through a file here? Can you just do something like
String fq_cohort_extract_table = "~{fq_destination_dataset}.~{destination_cohort_table_name}"
or something similar?
that works! thanks for the suggestion
Actually, I ended up backtracking on this because I added some more bash interpolation logic to generate the output string. Unless you know of a way to directly map a bash env variable -> cromwell output?
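For reference, the usual way to get a bash-computed value into a Cromwell output is the file round-trip this task already uses: write the value to a file in the command block and read_string() it back in the outputs. A minimal sketch (the FQ_TABLE logic is illustrative, not the exact bash in this PR):

command <<<
  # bash logic that builds the final table name (illustrative)
  FQ_TABLE="~{fq_destination_dataset}.~{destination_cohort_table_name}"
  echo "${FQ_TABLE}" > fq_cohort_extract_table.txt
>>>

output {
  String fq_cohort_extract_table = read_string("fq_cohort_extract_table.txt")
}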
@@ -36,6 +36,10 @@ workflow GvsPrepareCallset {
    docker = docker_final
  }

  output {
good improvement here
String output_file_base_name

String? fq_filter_set_table
In the AoU case, I'm thinking this would always be required
data_project = "", # unused if fq_filter_set_* args are given or filtering is off
query_project = query_project,
default_dataset = "", # unused if fq_filter_set_* args are given or filtering is off
filter_set_name = "", # unused if fq_filter_set_* args are given or filtering is off
Does AoU want unfiltered data? That would be very surprising.
I don't think so. I hadn't gotten around to adding the filter steps yet, which is the main reason these are missing. Do you know what the process for that will be? Do we just point the filter set arguments to tables that you guys generate, or is it something that is created on a per-extract basis?
@kcibul @ahaessly @mmorgantaylor do you guys know what the best way to upload multifile WDLs to Agora is? I can't import the current WDL as it is because the imports refer to a relative local file which Agora can't resolve. My workaround has been to rename the import path so that it points to a raw GitHub file. I've been doing this manually for testing, but it will not work with our automated GitHub -> Agora method creation tool (without some code changes to read the WDL and rename the imports).
# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

if [ -n "${OUTPUT_GCS_DIR}" ]; then
Since these files are also task outputs, if someone runs with output_gcs_dir they'll have two copies of the data: one in the cromwell delocalized results of this task and another in the output bucket. Do you have ideas about how to handle that?
By the cromwell delocalized results, do you mean the files in the "execution directory" (as Terra calls it)? In my mind, I had always thought of those files as "temporary" as in they're usually created within a TTL folder. I'm realizing now that this may be an incorrect assumption that I carried over from my time working with the long reads methods team.
I feel like that might be OK though? Users of the WDL that do not want the duplication can drop the output_gcs_dir argument. Users that want the output_gcs_dir argument likely only care about the VCF output files, so they can clean up the execution directory as needed.
String gvs_extraction_destination_dataset
String gvs_extraction_temp_tables_dataset
String extraction_uuid
String output_gcs_dir
should be optional
call GvsPrepareCallset.GvsPrepareCallset {
  input:
    destination_cohort_table_name = extraction_uuid,
What happens if the same dataset is used for gvs_extraction_destination_dataset and gvs_extraction_cohorts_dataset? In both cases the table name is just the UUID, so I think things will error out. If that's the case, maybe use the UUID as a suffix instead of the sole table name.
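A minimal sketch of the suffix idea (the "_extract" suffix is illustrative, not from this PR; the input names come from the surrounding diff):

call GvsPrepareCallset.GvsPrepareCallset {
  input:
    # suffix the UUID so the cohort table and the extract table names can't collide
    destination_cohort_table_name = extraction_uuid + "_extract"
    # (other inputs as in the current call)
}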
FROM
  \`~{gvs_dataset}.sample_info\`
WHERE
  sample_name IN " > create_cohort.sql
What's the limit for how many samples can be queried this way? In other parts of GVS we found that performance drops off after > ~1000 entries. I don't think performance is a concern here, but there are limitations on query size, etc. If there's a limit here, it would be good to know what it is.
FWIW... looks like the total query size limit is 1 MB so with a wild guess at 20 bytes per sample name that's 50k samples. Seems reasonable
Thanks for checking on that for me. Agree that it should work, but it was good to point out. I might be able to load this from a file, which probably has better performance and scalability.
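A rough sketch of the load-from-a-file idea (the temp table, file name, and select list below are placeholders, not from this PR): load the newline-delimited sample names into a one-column table with bq load, then query against it instead of inlining a long IN (...) list:

# load the sample names file into a small temp table (placeholder names)
bq load --source_format=CSV temp_dataset.cohort_sample_names sample_names.txt "sample_name:STRING"

# query against the loaded table rather than an inline IN (...) list
echo "SELECT sample_name
FROM \`~{gvs_dataset}.sample_info\`
WHERE sample_name IN (SELECT sample_name FROM temp_dataset.cohort_sample_names)" > create_cohort.sql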
@ericsong on the Agora question -- we've been using Dockstore since it pulls right from GitHub and these are public WDLs. You'll need to add this WDL to the .dockstore.yml file, and if you want to have your branch you'll add that as well (should be fairly obvious when you look at the file). I haven't tried imports, but I think they work well/better than Agora.
Thinking out loud... if the PrepareCallset could take a file of sample names as an input, would we no longer need this step of making the sample table? ExtractCohort is able to take that already; Eric added that a while back, right?
Thanks for the review @kcibul. I made some changes accordingly.

re: PrepareCallset file of sample names. That would be nice! It would make this workflow simpler, and it also simplifies the access requirements for PrepareCallset.

re: Dockstore. We actually ruled this out because Terra says that the definition of a method configuration can change automatically if it's updated in Dockstore. That can be useful, but it adds a security risk, since a compromised Dockstore could change the definition of the production AoU extraction WDL, which runs with highly elevated permissions. We already have a script that creates method configurations from GitHub, so I can probably add something a little hacky to resolve relative imports to the raw GitHub files they refer to.
This looks good. Are you going to try to incorporate Kris' changes?
Yep! I should have that wrapped up soon. Testing the updated WDL now.
a few changes for defaulting values
Boolean do_not_filter_override
String? fq_filter_set_info_table
String? fq_filter_set_site_table
String? fq_filter_set_tranches_table
Anything that is assigned a default value in the outer workflow should not be optional here.
Also, filter_set_name should not be optional
Boolean do_not_filter_override = false
String? filter_set_name
Filter set name should not be optional. This will be named something that indicates which release of the filtering model the extract should use.
should it be set even if filtering is not going to be used? for example, if do_not_filter_override is true, the argument will never be used
Oh, I didn't catch that. It's fine to be optional.
String output_file_base_name

Boolean do_not_filter_override = false
String? filter_set_name
should not be optional. or if optional, should default to a valid filter set name
String? filter_set_name
String fq_filter_set_info_table = "~{gvs_project}.~{gvs_dataset}.filter_set_info"
String fq_filter_set_site_table = "~{gvs_project}.~{gvs_dataset}.filter_set_sites"
String fq_filter_set_tranches_table = "~{gvs_project}.~{gvs_dataset}.filter_set_tranches"
I don't think you need to set these values or pass them to the workflow since they are defaulted in the workflow. If you want to allow the caller to set them, you should make them optional and then you would pass them into the workflow.
fq_petvet_dataset = "~{gvs_project}.~{gvs_dataset}",
fq_sample_mapping_table = "~{gvs_project}.~{gvs_dataset}.sample_info",
fq_temp_table_dataset = fq_gvs_extraction_temp_tables_dataset,
fq_destination_dataset = fq_gvs_extraction_destination_dataset
maybe make these parameters in the definition of the task optional, since they are assigned defaults and then you won't have to pass the hard coded values
So these do have defaults in GvsPrepareCallset, but the default doesn't work for me, which is why I overrode it here. Is that what you're suggesting?
ah, well then maybe add a comment that you are explicitly overriding the default values
destination_cohort_table_prefix = extraction_uuid,
sample_names_to_extract = cohort_sample_names,
data_project = query_project,
default_dataset = gvs_dataset, # unused if fq_* args are given
even though it's unused here, it still needs to be passed into the task
String gvs_dataset
String fq_gvs_extraction_cohorts_dataset
String fq_gvs_extraction_destination_dataset
String fq_gvs_extraction_temp_tables_dataset
You can make these 2 optional since they are defaulted in the called workflow. Also, I don't see fq_gvs_extraction_cohorts_dataset used anywhere; should it be removed?
Good catch, removed. Regarding the two optionals though, do you mean the arguments on lines 16 and 17? Those are passed directly into Prepare.
Yes. If you make fq_gvs_extraction_destination_dataset and fq_gvs_extraction_temp_tables_dataset optional (and still pass them into the prepare), then the caller does not have to supply the input. If an optional that has not been set is passed to a workflow/task, it will be treated as None (and will therefore get the default value). The only reason not to make them optional is if you know the defaults won't work for you.
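A minimal sketch of that pattern, using the names from this thread (other required inputs omitted for brevity; this relies on the unset-optional-becomes-None behavior described above):

workflow GvsExtractCohortFromSampleNames {
  input {
    # optional: if the caller leaves these unset, the defaults inside GvsPrepareCallset apply
    String? fq_gvs_extraction_destination_dataset
    String? fq_gvs_extraction_temp_tables_dataset
  }

  call GvsPrepareCallset.GvsPrepareCallset {
    input:
      # an unset optional is passed as None, so the callee's defaults take over
      fq_temp_table_dataset = fq_gvs_extraction_temp_tables_dataset,
      fq_destination_dataset = fq_gvs_extraction_destination_dataset
  }
}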
👍
set -e
if [ ~{has_service_account_file} = 'true' ]; then
  gcloud auth activate-service-account --key-file='~{service_account_json}'
fi

LASTMODIFIED=$(bq show --location=US --format=json ~{dataset_table} | python3 -c "import sys, json; print(json.load(sys.stdin)['lastModifiedTime']);")
# bq needs the project name to be separated from the dataset by a colon
DATASET_TABLE_COLON=$(echo ~{dataset_table} | sed 's/\./:/')
I have a branch where I fixed this a different way, by just taking the table name prefix and using dataset.tablename. I also had an issue where the show command was returning a bunch of text prepended to the result, which I fixed by creating a .bigqueryrc. I will have a PR for this soon and will tag you to review it.
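For context (a sketch of that workaround, not the exact code in the branch): bq prints one-time initialization text on its first invocation in a fresh environment, and creating an empty .bigqueryrc beforehand suppresses it so the JSON output stays clean:

# creating an empty config file stops bq from prepending first-run text to its output
touch ~/.bigqueryrc
LASTMODIFIED=$(bq show --location=US --format=json ~{dataset_table} | python3 -c "import sys, json; print(json.load(sys.stdin)['lastModifiedTime']);")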
Our production extraction WDL which calls
I had to make a few small changes to GvsPrepare/Extract, but I tried to keep them as minimal as possible.