Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AoU GVS Cohort Extract wdl #7242

Merged
merged 37 commits into from
Jun 11, 2021
Merged
Show file tree
Hide file tree
Changes from 24 commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
b72c5ed
push docker override into task
ericsong Apr 30, 2021
6621160
optional
ericsong Apr 30, 2021
84ce13a
use docker_final
ericsong Apr 30, 2021
ce8f26b
output prepare callset's table table
ericsong May 4, 2021
f528b12
add optional bucket to copy output vcfs into
ericsong May 4, 2021
87ab1db
add extract step to AoU wdl
ericsong May 4, 2021
ee321d6
add prepare output to workflow output
ericsong May 4, 2021
43aeea8
add empty output section
ericsong May 4, 2021
06d52b8
add output files
ericsong May 4, 2021
2496017
log
ericsong May 4, 2021
dd1d54f
fix var setting in bash
ericsong May 5, 2021
6051a6c
move docker override back into workflow top level
ericsong May 5, 2021
11ecec3
rename vars
ericsong May 5, 2021
100744d
rename
ericsong May 5, 2021
9eb49c5
metadata -> sample_info
ericsong May 5, 2021
3abbda9
add backticks to tablename
ericsong May 6, 2021
a7593b6
rename, testing some changes
ericsong May 19, 2021
fa4068d
pipe through filter params and refactor cohort sample table creation
ericsong May 21, 2021
48a93ea
make filter arguments optional
ericsong May 22, 2021
33f5104
mirror prepare/extract arguments
ericsong May 23, 2021
9ed26dc
Merge branch 'ah_var_store' of github.com:broadinstitute/gatk into so…
ericsong May 26, 2021
d28d4fb
update wdl to use new prepare/extract
ericsong May 26, 2021
f700fd4
remove prepare output
ericsong May 26, 2021
d3bfb28
remove output
ericsong May 26, 2021
46105cb
turns out prepare just needs a table prefix not fq
ericsong May 26, 2021
5683c30
pipe output
ericsong May 26, 2021
e69432a
merge
ericsong Jun 8, 2021
91ca1f1
use defaults
ericsong Jun 8, 2021
6ca3071
add comment
ericsong Jun 8, 2021
b4776a8
remove optional on default strings
ericsong Jun 8, 2021
e052cdc
add query project arg
ericsong Jun 8, 2021
13dd121
debug lines
ericsong Jun 8, 2021
5601e2f
colon swap
ericsong Jun 8, 2021
5ea46ac
add argument for no filtering
ericsong Jun 10, 2021
aba3686
merge
ericsong Jun 11, 2021
c5539f9
cleanup
ericsong Jun 11, 2021
fbaa396
restore localize
ericsong Jun 11, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 27 additions & 11 deletions scripts/variantstore/wdl/GvsExtractCallset.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@ workflow GvsExtractCallset {
input {
String data_project
String default_dataset
String filter_set_name

File wgs_intervals
Int scatter_count
Expand All @@ -16,10 +15,11 @@ workflow GvsExtractCallset {
String fq_cohort_extract_table_prefix
String query_project = data_project

String fq_filter_set_info_table = "~{data_project}.~{default_dataset}.filter_set_info"
String fq_filter_set_site_table = "~{data_project}.~{default_dataset}.filter_set_sites"
String fq_filter_set_tranches_table = "~{data_project}.~{default_dataset}.filter_set_tranches"
Boolean do_not_filter_override = false
String? filter_set_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter set name should not be optional. This will be named something that indicates which release of the filtering model the extract should use.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be set even if filtering is not going to be used? for example, if do_not_filter_override is true, the argument will never be used

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, i didn't catch that. It's fine to be optional.

String? fq_filter_set_info_table = "~{data_project}.~{default_dataset}.filter_set_info"
String? fq_filter_set_site_table = "~{data_project}.~{default_dataset}.filter_set_sites"
String? fq_filter_set_tranches_table = "~{data_project}.~{default_dataset}.filter_set_tranches"

# if these are unset, default sensitivity levels will be used
Float? snps_truth_sensitivity_filter_level_override
Expand All @@ -31,6 +31,7 @@ workflow GvsExtractCallset {
File? service_account_json

String output_file_base_name
String? output_gcs_dir
File? gatk_override
}

Expand Down Expand Up @@ -64,9 +65,15 @@ workflow GvsExtractCallset {
excluded_intervals = excluded_intervals,
emit_pls = emit_pls,
service_account_json = service_account_json,
output_file = "${output_file_base_name}_${i}.vcf.gz"
output_file = "${output_file_base_name}_${i}.vcf.gz",
output_gcs_dir = output_gcs_dir
}
}

output {
Array[File] output_vcfs = ExtractTask.output_vcf
Array[File] output_vcf_indexes = ExtractTask.output_vcf_index
}
}

################################################################################
Expand All @@ -90,14 +97,15 @@ task ExtractTask {
String fq_cohort_extract_table
String read_project_id
String output_file
String fq_filter_set_info_table
String fq_filter_set_site_table
String fq_filter_set_tranches_table
String filter_set_name
Float? snps_truth_sensitivity_filter_level
Float? indels_truth_sensitivity_filter_level
String? output_gcs_dir

Boolean do_not_filter_override
String? fq_filter_set_info_table
String? fq_filter_set_site_table
String? fq_filter_set_tranches_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything that is assigned a default value in the outer workflow should not be optional here.
Also, filter_set_name should not be optional

String? filter_set_name
Float? snps_truth_sensitivity_filter_level
Float? indels_truth_sensitivity_filter_level

File? excluded_intervals
Boolean? emit_pls
Expand Down Expand Up @@ -149,6 +157,14 @@ task ExtractTask {
--project-id ~{read_project_id} \
~{true='--emit-pls' false='' emit_pls} \
${FILTERING_ARGS}

# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')
ericsong marked this conversation as resolved.
Show resolved Hide resolved

if [ -n "${OUTPUT_GCS_DIR}" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these files are also task outputs, if someone runs with output_gcs_dir they'll have two copies of the data, one in the cromwell delocalized results of this task and another in the output bucket? Do you have ideas about how to handle that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the cromwell delocalized results, do you mean the files in the "execution directory" (as Terra calls it)? In my mind, I had always thought of those files as "temporary" as in they're usually created within a TTL folder. I'm realizing now that this may be an incorrect assumption that I carried over from my time working with the long reads methods team.

I feel like that might be OK though? Users of the WDL that do not want the duplication can drop the output_gcs_dir argument. Users that want the output_gcs_dir argument likely only care about the VCF output files so they can clean up the execution directory as needed.

gsutil cp ~{output_file} ${OUTPUT_GCS_DIR}/
gsutil cp ~{output_file}.tbi ${OUTPUT_GCS_DIR}/
fi
>>>

# ------------------------------------------------
Expand Down
87 changes: 87 additions & 0 deletions scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
version 1.0

import "GvsPrepareCallset.wdl" as GvsPrepareCallset
import "GvsExtractCallset.wdl" as GvsExtractCallset

# Workflow used by AoU to extract variants for a given cohort of sample_names

workflow GvsExtractCohortFromSampleNames {

input {
File cohort_sample_names
String query_project
String gvs_project
String gvs_dataset
String gvs_extraction_cohorts_dataset
String gvs_extraction_destination_dataset
String gvs_extraction_temp_tables_dataset
String extraction_uuid
String? output_gcs_dir

# Extract parameters
File wgs_intervals
Int scatter_count

File reference
File reference_index
File reference_dict

String output_file_base_name

Boolean do_not_filter_override = false
String? filter_set_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not be optional. or if optional, should default to a valid filter set name

String fq_filter_set_info_table = "~{gvs_project}.~{gvs_dataset}.filter_set_info"
String fq_filter_set_site_table = "~{gvs_project}.~{gvs_dataset}.filter_set_sites"
String fq_filter_set_tranches_table = "~{gvs_project}.~{gvs_dataset}.filter_set_tranches"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to set these values or pass them to the workflow since they are defaulted in the workflow. If you want to allow the caller to set them, you should make them optional and then you would pass them into the workflow.


# if these are unset, default sensitivity levels will be used
Float? snps_truth_sensitivity_filter_level_override
Float? indels_truth_sensitivity_filter_level_override

File? gatk_override
}

String fq_cohort_extract_table_prefix = "~{gvs_project}.~{gvs_dataset}.~{extraction_uuid}"

call GvsPrepareCallset.GvsPrepareCallset {
input:
destination_cohort_table_prefix = fq_cohort_extract_table_prefix,
sample_names_to_extract = cohort_sample_names,
data_project = query_project,
default_dataset = gvs_dataset, # unused if fq_* args are given
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though it's unused here, it still needs to be passed into the task

fq_petvet_dataset = "~{gvs_project}.~{gvs_dataset}",
fq_sample_mapping_table = "~{gvs_project}.~{gvs_dataset}.sample_info",
fq_temp_table_dataset = gvs_extraction_temp_tables_dataset,
fq_destination_dataset = gvs_extraction_destination_dataset
}


call GvsExtractCallset.GvsExtractCallset {
input:
data_project = gvs_project, # unused if fq_filter_set_* args are given or filtering is off
query_project = query_project,
default_dataset = gvs_dataset, # unused if fq_filter_set_* args are given or filtering is off

wgs_intervals = wgs_intervals,
scatter_count = scatter_count,

reference = reference,
reference_index = reference_index,
reference_dict = reference_dict,

fq_cohort_extract_table_prefix = fq_cohort_extract_table_prefix,

do_not_filter_override = do_not_filter_override,
filter_set_name = filter_set_name,
fq_filter_set_info_table = fq_filter_set_info_table,
fq_filter_set_site_table = fq_filter_set_site_table,
fq_filter_set_tranches_table = fq_filter_set_tranches_table,
snps_truth_sensitivity_filter_level_override = snps_truth_sensitivity_filter_level_override,
indels_truth_sensitivity_filter_level_override = indels_truth_sensitivity_filter_level_override,

output_file_base_name = output_file_base_name,
output_gcs_dir = output_gcs_dir,
gatk_override = gatk_override
}

}
4 changes: 4 additions & 0 deletions scripts/variantstore/wdl/GvsPrepareCallset.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,8 @@ workflow GvsPrepareCallset {
docker = docker_final
}

output {}

}

task PrepareCallsetTask {
Expand Down Expand Up @@ -82,6 +84,8 @@ task PrepareCallsetTask {
~{"--sa_key_path " + service_account_json}
>>>

output {}

runtime {
docker: docker
memory: "3 GB"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -193,7 +193,7 @@ def create_position_table(fq_temp_table_dataset, min_variant_samples):
def make_new_pet_union_all(fq_pet_vet_dataset, fq_temp_table_dataset, sample_ids):
def get_pet_subselect(fq_pet_table, samples, id):
sample_stanza = ','.join([str(s) for s in samples])
sql = f" q_{id} AS (SELECT p.location, p.sample_id, p.state from {fq_pet_table} p " \
sql = f" q_{id} AS (SELECT p.location, p.sample_id, p.state from `{fq_pet_table}` p " \
f" join `{fq_temp_table_dataset}.{VET_DISTINCT_POS_TABLE}` v on (p.location = v.location) WHERE p.sample_id IN ({sample_stanza})), "
return sql

Expand Down