VS-694 - Extract Callset for VQSR Lite #8182

Merged (18 commits, Feb 15, 2023)
Changes from 13 commits
6 changes: 2 additions & 4 deletions .dockstore.yml
@@ -95,8 +95,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_vqsr_lite_poc
- VS-693_VQSR_lite
- gg_VS-694_VQSR_Lite
- name: GvsPopulateAltAllele
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsPopulateAltAllele.wdl
@@ -120,6 +119,7 @@ workflows:
branches:
- master
- ah_var_store
- gg_VS-694_VQSR_Lite
- name: GvsImportGenomes
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsImportGenomes.wdl
@@ -144,15 +144,13 @@ workflows:
branches:
- master
- ah_var_store
- gg_VS-785_RegenerateTheVATTsv
- name: GvsCreateVATFilesFromBigQuery
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/variant_annotations_table/GvsCreateVATFilesFromBigQuery.wdl
filters:
branches:
- master
- ah_var_store
- gg_VS-785_RegenerateTheVATTsv
- name: GvsValidateVat
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl
4 changes: 2 additions & 2 deletions scripts/variantstore/wdl/GvsCallsetStatistics.wdl
@@ -15,8 +15,8 @@ workflow GvsCallsetStatistics {

call Utils.ValidateFilterSetName {
input:
data_project = project_id,
dataset_name = dataset_name,
query_project = project_id,
Contributor: are we going back to having the query proj being diff?

Collaborator (Author): Oh shoot!
fq_filter_set_info_table = "~{project_id}.~{dataset_name}.filter_set_info",
filter_set_name = filter_set_name
}
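The call above reflects the refactored `Utils.ValidateFilterSetName` interface, which now takes a `query_project` and a fully-qualified `fq_filter_set_info_table` instead of separate `data_project`/`dataset_name` inputs. A minimal sketch of the kind of existence check such a task could run, assuming it boils down to a count query against the fully-qualified table (the function name and table/filter-set values here are illustrative, not the task's actual implementation):

```python
def build_validation_query(fq_filter_set_info_table: str, filter_set_name: str) -> str:
    # Count rows in the fully-qualified filter_set_info table that match the
    # requested filter set; zero matches would mean the name is invalid.
    return (
        f"SELECT COUNT(*) AS n FROM `{fq_filter_set_info_table}` "
        f"WHERE filter_set_name = '{filter_set_name}'"
    )

# Illustrative values; real runs pass "~{project_id}.~{dataset_name}.filter_set_info".
query = build_validation_query("my-project.my_dataset.filter_set_info", "my_filter_set")
```

Passing one fully-qualified table string keeps the task agnostic about how the project and dataset are combined, which is what makes the VQSR Lite table swap in `GvsExtractCallset` possible.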

46 changes: 34 additions & 12 deletions scripts/variantstore/wdl/GvsCreateFilterSet.wdl
@@ -56,7 +56,7 @@ workflow GvsCreateFilterSet {
String fq_sample_table = "~{project_id}.~{dataset_name}.sample_info"
String fq_alt_allele_table = "~{project_id}.~{dataset_name}.alt_allele"
String fq_info_destination_table = "~{project_id}.~{dataset_name}.filter_set_info"
String fq_info_destination_table_vqsr_lite = "~{project_id}.~{dataset_name}.vqsr_lite_filter_set_info"
String fq_info_destination_table_vqsr_lite = "~{project_id}.~{dataset_name}.filter_set_info_vqsr_lite"
String fq_tranches_destination_table = "~{project_id}.~{dataset_name}.filter_set_tranches"
String fq_filter_sites_destination_table = "~{project_id}.~{dataset_name}.filter_set_sites"

@@ -158,14 +158,36 @@
preemptible_tries = 3,
}

call PopulateFilterSetInfo {
# These calls to SelectVariants are being added for two reasons:
# 1) The snps_variant_scored_vcf and indels_variant_scored_vcf output by JointVcfFiltering contain ALL variants,
# but currently ONLY annotate SNPs and INDELs respectively.
# 2) Those output VCFs also contain filtered sites, which we don't want to put into the filter_set_info_vqsr_lite table.
Contributor: what does "filtered sites" here mean?

Collaborator (Author): Sites in the VCF that have the filter field set to anything other than '.' or 'PASS'.
call Utils.SelectVariants as CreateFilteredScoredSNPsVCF {
input:
input_vcf = MergeSNPScoredVCFs.output_vcf,
input_vcf_index = MergeSNPScoredVCFs.output_vcf_index,
type_to_include = "SNP",
exclude_filtered = true,
output_basename = "${filter_set_name}.filtered.scored.snps"
}

call Utils.SelectVariants as CreateFilteredScoredINDELsVCF {
input:
input_vcf = MergeINDELScoredVCFs.output_vcf,
input_vcf_index = MergeINDELScoredVCFs.output_vcf_index,
type_to_include = "INDEL",
exclude_filtered = true,
output_basename = "${filter_set_name}.filtered.scored.indels"
}

call PopulateFilterSetInfo {
input:
gatk_override = gatk_override,
filter_set_name = filter_set_name,
snp_recal_file = MergeSNPScoredVCFs.output_vcf,
snp_recal_file_index = MergeSNPScoredVCFs.output_vcf_index,
indel_recal_file = MergeINDELScoredVCFs.output_vcf,
indel_recal_file_index = MergeINDELScoredVCFs.output_vcf_index,
snp_recal_file = CreateFilteredScoredSNPsVCF.output_vcf,
snp_recal_file_index = CreateFilteredScoredSNPsVCF.output_vcf_index,
indel_recal_file = CreateFilteredScoredINDELsVCF.output_vcf,
indel_recal_file_index = CreateFilteredScoredINDELsVCF.output_vcf_index,
fq_info_destination_table = fq_info_destination_table_vqsr_lite,
filter_schema = fq_info_destination_table_vqsr_lite_schema,
query_project = project_id,
@@ -402,7 +424,7 @@ task ExtractFilterTask {
>>>

runtime {
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2022_10_17_2a8c210ac35094997603259fa1cd784486b92e42"
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2023_02_07_38fb966b84d138680e9e46992ae60feb28127d41"
memory: "7 GB"
disks: "local-disk 10 HDD"
bootDiskSizeGb: 15
@@ -445,7 +467,7 @@ task PopulateFilterSetInfo {

export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

echo "Creating SNPs reacalibration file"
echo "Creating SNPs recalibration file"
gatk --java-options "-Xmx1g" \
CreateFilteringFiles \
--ref-version 38 \
@@ -455,7 +477,7 @@
-V ~{snp_recal_file} \
-O ~{filter_set_name}.snps.recal.tsv

echo "Creating INDELs reacalibration file"
echo "Creating INDELs recalibration file"
gatk --java-options "-Xmx1g" \
CreateFilteringFiles \
--ref-version 38 \
@@ -482,7 +504,7 @@
>>>

runtime {
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2022_10_17_2a8c210ac35094997603259fa1cd784486b92e42"
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2023_02_07_38fb966b84d138680e9e46992ae60feb28127d41"
memory: "3500 MB"
disks: "local-disk 250 HDD"
bootDiskSizeGb: 15
@@ -538,7 +560,7 @@ task PopulateFilterSetSites {
>>>

runtime {
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2022_10_17_2a8c210ac35094997603259fa1cd784486b92e42"
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2023_02_07_38fb966b84d138680e9e46992ae60feb28127d41"
memory: "3500 MB"
disks: "local-disk 200 HDD"
bootDiskSizeGb: 15
@@ -585,7 +607,7 @@ task PopulateFilterSetTranches {
>>>

runtime {
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2022_10_17_2a8c210ac35094997603259fa1cd784486b92e42"
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2023_02_07_38fb966b84d138680e9e46992ae60feb28127d41"
memory: "3500 MB"
disks: "local-disk 200 HDD"
bootDiskSizeGb: 15
6 changes: 3 additions & 3 deletions scripts/variantstore/wdl/GvsExtractAvroFilesForHail.wdl
@@ -14,9 +14,9 @@ workflow GvsExtractAvroFilesForHail {

call Utils.ValidateFilterSetName {
input:
data_project = project_id,
dataset_name = dataset_name,
filter_set_name = filter_set_name,
query_project = project_id,
fq_filter_set_info_table = "~{project_id}.~{dataset_name}.filter_set_info",
filter_set_name = filter_set_name
}

call OutputPath {
68 changes: 43 additions & 25 deletions scripts/variantstore/wdl/GvsExtractCallset.wdl
@@ -26,7 +26,9 @@ workflow GvsExtractCallset {
File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
Boolean use_interval_weights = true
File interval_weights_bed = "gs://broad-public-datasets/gvs/weights/gvs_vet_weights_1kb.bed"
File? gatk_override
Boolean use_classic_VQSR = true

File? gatk_override = "gs://broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/gg_VS-694_VQSR_Lite_20230208/gatk-package-4.2.0.0-655-gc55c463-SNAPSHOT-local.jar"

String output_file_base_name = filter_set_name

@@ -48,7 +50,11 @@
String fq_cohort_dataset = "~{cohort_project_id}.~{cohort_dataset_name}"

String full_extract_prefix = if (control_samples) then "~{extract_table_prefix}_controls" else extract_table_prefix
String fq_filter_set_info_table = "~{fq_gvs_dataset}.filter_set_info"

String filter_set_info_table_name = "filter_set_info"
String filter_set_info_vqsr_lite_table_name = "filter_set_info_vqsr_lite"
String fq_filter_set_info_table = if (use_classic_VQSR) then "~{fq_gvs_dataset}.~{filter_set_info_table_name}" else "~{fq_gvs_dataset}.~{filter_set_info_vqsr_lite_table_name}"

String fq_filter_set_site_table = "~{fq_gvs_dataset}.filter_set_sites"
String fq_filter_set_tranches_table = "~{fq_gvs_dataset}.filter_set_tranches"
String fq_sample_table = "~{fq_gvs_dataset}.sample_info"
@@ -115,20 +121,19 @@ workflow GvsExtractCallset {
}

call Utils.GetBQTableLastModifiedDatetime as FilterSetInfoTimestamp {
input:
query_project = project_id,
fq_table = "~{fq_gvs_dataset}.filter_set_info"
input:
query_project = project_id,
fq_table = "~{fq_filter_set_info_table}"
}

if ( !do_not_filter_override ) {
call Utils.ValidateFilterSetName {
input:
query_project = query_project,
filter_set_name = filter_set_name,
filter_set_info_timestamp = FilterSetInfoTimestamp.last_modified_timestamp,
data_project = project_id,
dataset_name = dataset_name
}
query_project = query_project,
fq_filter_set_info_table = "~{fq_filter_set_info_table}",
filter_set_name = filter_set_name,
filter_set_info_timestamp = FilterSetInfoTimestamp.last_modified_timestamp
}
}

call Utils.GetBQTablesMaxLastModifiedTimestamp {
@@ -146,8 +151,9 @@
call ExtractTask {
input:
go = select_first([ValidateFilterSetName.done, true]),
dataset_id = dataset_name,
dataset_name = dataset_name,
call_set_identifier = call_set_identifier,
use_classic_VQSR = use_classic_VQSR,
gatk_override = gatk_override,
reference = reference,
reference_index = reference_index,
@@ -162,7 +168,7 @@
do_not_filter_override = do_not_filter_override,
fq_filter_set_info_table = fq_filter_set_info_table,
fq_filter_set_site_table = fq_filter_set_site_table,
fq_filter_set_tranches_table = fq_filter_set_tranches_table,
fq_filter_set_tranches_table = if (use_classic_VQSR) then fq_filter_set_tranches_table else none,
filter_set_name = filter_set_name,
drop_state = drop_state,
output_file = vcf_filename,
@@ -219,9 +225,11 @@ task ExtractTask {
input {
Boolean go

String dataset_id
String dataset_name
String call_set_identifier

Boolean use_classic_VQSR

File reference
File reference_index
File reference_dict
@@ -247,7 +255,7 @@
Boolean do_not_filter_override
String fq_filter_set_info_table
String fq_filter_set_site_table
String fq_filter_set_tranches_table
String? fq_filter_set_tranches_table
String? filter_set_name
Boolean write_cost_to_db

@@ -265,28 +273,37 @@
# Not `volatile: true` since there shouldn't be a need to re-run this if there has already been a successful execution.
}

File monitoring_script = "gs://gvs_quickstart_storage/cromwell_monitoring_script.sh"

String intervals_name = basename(intervals)
String cost_observability_line = if (write_cost_to_db == true) then "--cost-observability-tablename ~{cost_observability_tablename}" else ""

String inferred_reference_state = if (drop_state == "NONE") then "ZERO" else drop_state

String gatk_tool = if (use_classic_VQSR == true) then 'ExtractCohort' else 'ExtractCohortLite'

command <<<
set -e
export GATK_LOCAL_JAR="~{default="/root/gatk.jar" gatk_override}"

df -h
bash ~{monitoring_script} > monitoring.log &

export GATK_LOCAL_JAR="~{default="/root/gatk.jar" gatk_override}"

if [ ~{do_not_filter_override} = 'true' ]; then
if [ ~{do_not_filter_override} = true ]; then
FILTERING_ARGS=''
elif [ ~{use_classic_VQSR} = true ]; then
FILTERING_ARGS='--filter-set-info-table ~{fq_filter_set_info_table}
--filter-set-site-table ~{fq_filter_set_site_table}
--tranches-table ~{fq_filter_set_tranches_table}
--filter-set-name ~{filter_set_name}'
else
FILTERING_ARGS='--filter-set-info-table ~{fq_filter_set_info_table}
--filter-set-site-table ~{fq_filter_set_site_table}
--tranches-table ~{fq_filter_set_tranches_table}
--filter-set-name ~{filter_set_name}'
--filter-set-site-table ~{fq_filter_set_site_table}
--filter-set-name ~{filter_set_name}'
fi

gatk --java-options "-Xmx9g" \
ExtractCohort \
~{gatk_tool} \
--vet-ranges-extract-fq-table ~{fq_ranges_cohort_vet_extract_table} \
--ref-ranges-extract-fq-table ~{fq_ranges_cohort_ref_extract_table} \
--ref-version 38 \
Expand All @@ -300,9 +317,9 @@ task ExtractTask {
~{true='--emit-pls' false='' emit_pls} \
~{true='--emit-ads' false='' emit_ads} \
${FILTERING_ARGS} \
--dataset-id ~{dataset_id} \
--dataset-id ~{dataset_name} \
--call-set-identifier ~{call_set_identifier} \
--wdl-step GvsCreateCallset \
--wdl-step GvsExtractCallset \
--wdl-call ExtractTask \
--shard-identifier ~{intervals_name} \
~{cost_observability_line}
@@ -331,7 +348,7 @@
echo ~{interval_index},${OUTPUT_FILE_DEST},${OUTPUT_FILE_BYTES},${OUTPUT_FILE_INDEX_DEST},${OUTPUT_FILE_INDEX_BYTES} >> manifest.txt
>>>
runtime {
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2022_10_17_2a8c210ac35094997603259fa1cd784486b92e42"
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2023_02_07_38fb966b84d138680e9e46992ae60feb28127d41"
memory: "12 GB"
disks: "local-disk 150 HDD"
bootDiskSizeGb: 15
@@ -347,6 +364,7 @@
File output_vcf_index = "~{output_file}.tbi"
Float output_vcf_index_bytes = read_float("vcf_index_bytes.txt")
String manifest = read_string("manifest.txt")
File monitoring_log = "monitoring.log"
}
}
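The branching in the ExtractTask command block above picks `ExtractCohort` or `ExtractCohortLite` and assembles FILTERING_ARGS three ways: no filtering, classic VQSR (with a tranches table), or VQSR Lite (without one). A small Python mirror of that argument-selection logic, under the assumption that these are the only flags that differ between the modes (helper name and sample values are illustrative):

```python
def filtering_args(do_not_filter, use_classic_vqsr, info_table, site_table,
                   tranches_table, filter_set_name):
    # No filtering at all when the override is set.
    if do_not_filter:
        return []
    args = ["--filter-set-info-table", info_table,
            "--filter-set-site-table", site_table]
    # Only classic VQSR has a tranches table; VQSR Lite omits it,
    # which is why the WDL makes fq_filter_set_tranches_table optional.
    if use_classic_vqsr:
        args += ["--tranches-table", tranches_table]
    args += ["--filter-set-name", filter_set_name]
    return args

lite_args = filtering_args(False, False, "proj.ds.filter_set_info_vqsr_lite",
                           "proj.ds.filter_set_sites", None, "my_filter_set")
```

Modeling the tranches table as optional in one place keeps the two VQSR modes from needing separate task definitions.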
