Change drop_state to NONE for Ingest/Extract [VS-607] #8000

Merged · 7 commits · Aug 24, 2022

.dockstore.yml (7 changes: 5 additions & 2 deletions)

@@ -119,14 +119,15 @@ workflows:
branches:
- master
- ah_var_store
- vs_447_fixup_non_fq_invocations
- rsa_vs_607_drop_state
- name: GvsImportGenomes
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsImportGenomes.wdl
filters:
branches:
- master
- ah_var_store
- rsa_vs_607_drop_state
- name: GvsPrepareRangesCallset
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsPrepareRangesCallset.wdl
@@ -168,7 +169,7 @@ workflows:
branches:
- master
- ah_var_store
- kc_variant_search_extract_wdl
- rsa_vs_607_drop_state
- name: GvsWithdrawSamples
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsWithdrawSamples.wdl
@@ -183,6 +184,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_vs_607_drop_state
- name: GvsJointVariantCalling
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsJointVariantCalling.wdl
@@ -212,6 +214,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_vs_607_drop_state
- name: GvsIngestTieout
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsIngestTieout.wdl
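
Context for reviewers: each workflow's `filters.branches` list in `.dockstore.yml` names the git branches Dockstore registers workflow versions from, so adding `rsa_vs_607_drop_state` makes these WDLs runnable from the feature branch while the PR is open. A minimal sketch of the assumed matching rule, as illustrative Python rather than Dockstore's implementation (real filters may also support regexes):

```python
# Illustrative only: Dockstore-style branch gating, assuming exact-name matching.
def should_register(branch: str, filter_branches: list[str]) -> bool:
    """A workflow version is published only for branches named in its filter."""
    return branch in filter_branches

filters = ["master", "ah_var_store", "rsa_vs_607_drop_state"]
assert should_register("rsa_vs_607_drop_state", filters)                # added by this PR
assert not should_register("vs_447_fixup_non_fq_invocations", filters)  # removed by this PR
```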

scripts/variantstore/AOU_DELIVERABLES.md (2 changes: 1 addition & 1 deletion)

@@ -56,7 +56,7 @@
- It will need to be run twice, once with `control_samples` set to "true" (the default value is `false`). See the [naming conventions doc](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow) for guidance on what to use for `extract_table_prefix` (the cohort prefix), which you will need to keep track of for the `GvsExtractCallset` WDL.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
8. `GvsExtractCallset` workflow
- This workflow extracts the data in BigQuery and transforms it into a sharded, joint-called VCF incorporating the VQSR filter set data.
- This workflow extracts the data in BigQuery and transforms it into a sharded, joint-called VCF incorporating the VQSR filter set data. We will probably not run this on callsets of more than 100K samples.
- It also needs to be run twice, once with `control_samples` set to "true" and with the `filter_set_name` and `extract_table_prefix` from steps 5 and 6. Include a valid (and secure) `output_gcs_dir` parameter, which is where the VCF, interval list, manifest, and sample name list files will go.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
9. **TBD VDS Prepare WDL/notebook/??**
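
Since `GvsExtractCallset` does not use the Terra Data Entity Model, its inputs arrive as a JSON file via the `Run workflow with inputs defined by file paths` option. A hedged sketch of assembling such a file: the input names come from the doc above, the `GvsExtractCallset.` key prefix follows the usual Cromwell convention, and every value is a placeholder.

```python
# Sketch only: write a Cromwell/Terra-style inputs JSON for GvsExtractCallset.
# All values below are placeholders; adjust for your own project before use.
import json

inputs = {
    "GvsExtractCallset.control_samples": True,              # doc: run once with "true", once with the default "false"
    "GvsExtractCallset.filter_set_name": "my_filter_set",   # from the filter-set creation step
    "GvsExtractCallset.extract_table_prefix": "my_cohort",  # the cohort prefix you have been tracking
    "GvsExtractCallset.output_gcs_dir": "gs://my-secure-bucket/extract",  # VCFs, interval list, manifest, sample names
    "GvsExtractCallset.drop_state": "NONE",                 # this PR's new default, shown for explicitness
}

with open("GvsExtractCallset.inputs.json", "w") as f:
    json.dump(inputs, f, indent=2)
```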

scripts/variantstore/wdl/GvsExtractCallset.wdl (7 changes: 5 additions & 2 deletions)

@@ -20,6 +20,9 @@ workflow GvsExtractCallset {
Int? scatter_count
Boolean zero_pad_output_vcf_filenames = true

# set to "NONE" if all the reference data was loaded into GVS in GvsImportGenomes
String drop_state = "NONE"

File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
File interval_weights_bed = "gs://broad-public-datasets/gvs/weights/gvs_vet_weights_1kb.bed"
File gatk_override = "gs://gvs_quickstart_storage/jars/gatk-package-4.2.0.0-552-g0f9780a-SNAPSHOT-local.jar"
@@ -154,7 +157,7 @@ workflow GvsExtractCallset {
fq_filter_set_tranches_table = fq_filter_set_tranches_table,
filter_set_name = filter_set_name,
filter_set_name_verified = select_first([ValidateFilterSetName.done, "done"]),
drop_state = "FORTY",
drop_state = drop_state,
output_file = vcf_filename,
output_gcs_dir = output_gcs_dir,
max_last_modified_timestamp = GetBQTablesMaxLastModifiedTimestamp.max_last_modified_timestamp,
@@ -386,7 +389,7 @@ task SumBytes {

command <<<
set -e
echo "~{sep=" " file_sizes_bytes}" | tr " " "\n" | python -c "
echo "~{sep=" " file_sizes_bytes}" | tr " " "\n" | python3 -c "
import sys;
total_bytes = sum(float(i.strip()) for i in sys.stdin);
total_mb = total_bytes/10**6;
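
The `SumBytes` hunk above swaps `python` for `python3`; the inline script itself is cut off by the collapsed diff. A standalone equivalent of the visible portion (the final `print` is an assumption, since the diff truncates after `total_mb`):

```python
# Standalone sketch of SumBytes' inline script: sum newline-separated byte
# counts from stdin and report decimal megabytes (the task divides by 10**6).
import sys

total_bytes = sum(float(line.strip()) for line in sys.stdin if line.strip())
total_mb = total_bytes / 10**6
print(total_mb)  # assumed output step
```

Saved as `sum_bytes.py`, the task's pipeline can be mirrored with `echo "1000000 2500000" | tr " " "\n" | python3 sum_bytes.py`, which prints `3.5`.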

scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl (3 changes: 3 additions & 0 deletions)

@@ -28,6 +28,8 @@ workflow GvsExtractCohortFromSampleNames {
Int scatter_count

String? output_gcs_dir
# set to "NONE" if all the reference data was loaded into GVS in GvsImportGenomes
String drop_state = "NONE"

Int? extract_preemptible_override
Int? extract_maxretries_override
@@ -79,6 +81,7 @@ workflow GvsExtractCohortFromSampleNames {
output_file_base_name = output_file_base_name,
output_gcs_dir = output_gcs_dir,

drop_state = drop_state,
extract_preemptible_override = extract_preemptible_override,
extract_maxretries_override = extract_maxretries_override,
split_intervals_disk_size_override = split_intervals_disk_size_override,

scripts/variantstore/wdl/GvsImportGenomes.wdl (5 changes: 4 additions & 1 deletion)

@@ -15,6 +15,9 @@ workflow GvsImportGenomes {

Boolean skip_loading_vqsr_fields = false

# set to "NONE" to ingest all the reference data into GVS for VDS (instead of VCF) output
String drop_state = "NONE"

File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
Int? load_data_batch_size
Int? load_data_preemptible_override
@@ -94,7 +97,7 @@ workflow GvsImportGenomes {
dataset_name = dataset_name,
project_id = project_id,
skip_loading_vqsr_fields = skip_loading_vqsr_fields,
drop_state = "FORTY",
drop_state = drop_state,
drop_state_includes_greater_than = false,
input_vcf_indexes = read_lines(CreateFOFNs.vcf_batch_vcf_index_fofns[i]),
input_vcfs = read_lines(CreateFOFNs.vcf_batch_vcf_fofns[i]),
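
This is the heart of the PR: `LoadData` previously hard-coded `drop_state = "FORTY"` and now takes the workflow-level input, which defaults to `"NONE"` so that all reference data is ingested (needed for VDS rather than VCF output). A rough mental model of the assumed semantics, as illustrative Python rather than the GATK tool's actual implementation (GQ state names other than `"FORTY"` and `"NONE"` are guesses):

```python
# Illustrative only: an assumed model of the reference-block filtering
# behind drop_state; not the actual ingest implementation.
GQ_STATES = ["TEN", "TWENTY", "THIRTY", "FORTY", "FIFTY", "SIXTY"]  # assumed GQ bins

def keep_ref_block(block_state: str, drop_state: str, includes_greater_than: bool) -> bool:
    """Return True if a reference block should be written during ingest."""
    if drop_state == "NONE":
        return True  # the new default: every reference block is ingested
    if includes_greater_than:
        # drop the named GQ state and every state above it
        return GQ_STATES.index(block_state) < GQ_STATES.index(drop_state)
    # the workflow passes drop_state_includes_greater_than = false:
    # drop only blocks binned at exactly drop_state
    return block_state != drop_state

# Old hard-coded behavior dropped GQ40 reference blocks; the new default keeps them.
assert not keep_ref_block("FORTY", "FORTY", includes_greater_than=False)
assert keep_ref_block("FORTY", "NONE", includes_greater_than=False)
```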

scripts/variantstore/wdl/GvsQuickstartIntegration.wdl (2 changes: 2 additions & 0 deletions)

@@ -49,6 +49,7 @@ workflow GvsQuickstartIntegration {
]

Int? extract_scatter_count
String drop_state = "NONE"
}
String project_id = "gvs-internal"

@@ -73,6 +74,7 @@
# Force filtering off as it is not deterministic and the initial version of this integration test does not
# allow for inexact matching of actual and expected results.
extract_do_not_filter_override = true,
drop_state = drop_state
}

call AssertIdenticalOutputs {

scripts/variantstore/wdl/GvsUnified.wdl (8 changes: 6 additions & 2 deletions)

@@ -25,6 +25,8 @@ workflow GvsUnified {
Array[File] input_vcf_indexes
File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"

# set to "NONE" to ingest all the reference data into GVS for VDS (instead of VCF) output
String drop_state = "NONE"

# The larger the `load_data_batch_size`, the greater the probability of preemptions and non-retryable
# BigQuery errors, so if specifying this adjust preemptible and maxretries accordingly. Or just take the defaults,
@@ -93,7 +95,8 @@ workflow GvsUnified {
load_data_preemptible_override = load_data_preemptible_override,
load_data_maxretries_override = load_data_maxretries_override,
load_data_gatk_override = gatk_override,
load_data_batch_size = load_data_batch_size
load_data_batch_size = load_data_batch_size,
drop_state = drop_state
}

call CreateAltAllele.GvsPopulateAltAllele {
@@ -155,7 +158,8 @@ workflow GvsUnified {
output_gcs_dir = extract_output_gcs_dir,
split_intervals_disk_size_override = split_intervals_disk_size_override,
split_intervals_mem_override = split_intervals_mem_override,
do_not_filter_override = extract_do_not_filter_override
do_not_filter_override = extract_do_not_filter_override,
drop_state = drop_state
}

output {