Clean up optional and inconsistently named inputs [VS-294] [VS-218] #7715

Merged
merged 36 commits into ah_var_store from rsa_optional_inputs on Mar 21, 2022
Commits
c1a4a70
GvsAssignIds
rsasch Mar 3, 2022
fd6631f
more GvsAssignIds
rsasch Mar 3, 2022
36afed1
GvsImportGenomes
rsasch Mar 3, 2022
900b03b
GvsAssignIds and GvsImportGenomes
rsasch Mar 3, 2022
f1244a6
bit more for GvsImportGenomes
rsasch Mar 3, 2022
8b6c26e
gatk_override for GvsImportGenomes
rsasch Mar 3, 2022
24cf6f5
neaten GvsImportGenomes
rsasch Mar 3, 2022
cac91e3
fix gatk for GvsImportGenomes
rsasch Mar 3, 2022
5975cf1
fix gatk for GvsImportGenomes
rsasch Mar 3, 2022
e00fac7
GvsCreateAltAllele
rsasch Mar 3, 2022
7663e4c
GvsCreateFilterSet
rsasch Mar 4, 2022
0863de0
starting on GvsCreateFilterSet
rsasch Mar 4, 2022
9f411e5
moreGvsCreateFilterSet
rsasch Mar 4, 2022
a037dd8
even more GvsCreateFilterSet
rsasch Mar 4, 2022
bf2df5c
fix sample_info table name
rsasch Mar 4, 2022
ec4134c
fix typo in reference path
rsasch Mar 7, 2022
4dc643a
GvsPrepareRangesCallset
rsasch Mar 7, 2022
e1ee0d9
wrong dockstore -- d'oh
rsasch Mar 7, 2022
95d80bd
modify readme
rsasch Mar 7, 2022
9681e5d
cleaned up extract, quickstart md for prepare and extract
rsasch Mar 7, 2022
e6a7106
remove default for extract_table_prefix
rsasch Mar 7, 2022
a535a5f
quickstart and typo fix
rsasch Mar 7, 2022
3d71b7d
meta wdls
rsasch Mar 7, 2022
7c4dcbd
more neatening
rsasch Mar 10, 2022
8349161
reodering
rsasch Mar 10, 2022
7603cbc
Merge branch 'ah_var_store' into rsa_optional_inputs
rsasch Mar 10, 2022
b3a0a09
Merge branch 'ah_var_store' into rsa_optional_inputs
rsasch Mar 14, 2022
1d1a1c0
unused inputs for GvsExtractCohortFromSampleNames
rsasch Mar 14, 2022
cec1f70
spacing for GvsExtractCallset
rsasch Mar 18, 2022
c0bcd99
some nits
rsasch Mar 21, 2022
4c013cc
take out preemptibles from CreateTables
rsasch Mar 21, 2022
ec3f04a
spacing and interval_weights_bed
rsasch Mar 21, 2022
752f8f1
removed unused read_project_id from CreateFilterSet
rsasch Mar 21, 2022
923d59f
documentation on scatter width
rsasch Mar 21, 2022
a3a5e8a
formatting
rsasch Mar 21, 2022
e902512
dockstore
rsasch Mar 21, 2022
15 changes: 8 additions & 7 deletions .dockstore.yml
@@ -65,7 +65,7 @@ workflows:
branches:
- master
- ah_var_store
- kc_quoting_bug
- rsa_optional_inputs
- name: GvsAoUReblockGvcf
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsAoUReblockGvcf.wdl
@@ -80,6 +80,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_optional_inputs
- name: GvsCreateFilterSet
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateFilterSet.wdl
@@ -89,7 +90,7 @@ workflows:
branches:
- master
- ah_var_store
- rc-vs-222-dataset-id
- rsa_optional_inputs
- name: GvsCreateAltAllele
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateAltAllele.wdl
@@ -99,7 +100,7 @@ workflows:
branches:
- master
- ah_var_store
- kc_quoting_bug
- rsa_optional_inputs
- name: GvsCreateTables
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateTables.wdl
@@ -118,7 +119,7 @@ workflows:
branches:
- master
- ah_var_store
- rc-split-intervals-odd
- rsa_optional_inputs
- name: GvsImportGenomes
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsImportGenomes.wdl
@@ -128,7 +129,7 @@ workflows:
branches:
- master
- ah_var_store
- kc_quoting_bug
- rsa_optional_inputs
- name: GvsPrepareCallset
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsPrepareCallset.wdl
@@ -138,7 +139,6 @@ workflows:
branches:
- master
- ah_var_store
- ah_flag_in_prepare
- name: GvsPrepareRangesCallset
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsPrepareRangesCallset.wdl
@@ -148,6 +148,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_optional_inputs
- name: GvsCreateVAT
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateVAT.wdl
@@ -173,7 +174,7 @@ workflows:
branches:
- master
- ah_var_store
- rc-split-intervals-odd
- rsa_optional_inputs
- name: MitochondriaPipeline
subclass: WDL
primaryDescriptorPath: /scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl
125 changes: 53 additions & 72 deletions scripts/variantstore/TERRA_QUICKSTART.md
@@ -1,125 +1,106 @@
# Quickstart - Joint Calling with the Broad Genomic Variant Store

**Note** The markdown source for this quickstart is maintained in the the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/TERRA_QUICKSTART.md). Submit any feedback, corrections or improvements in a pull request there. Do not edit this file directly.
**Note:** The markdown source for this quickstart is maintained in the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/TERRA_QUICKSTART.md). Submit any feedback, corrections, or improvements in a pull request there. Do not edit this file directly.

## Overview
Through this QuickStart you will learn how to use the Broad Genomic Variant Store to create a GATK VQSR Filtered joint callset VCF for whole genome samples.

The sequencing data in this quickstart came from the [AnVIL 1000G High Coverage workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019)
Through this QuickStart you will learn how to use the Broad Genomic Variant Store to create a GATK VQSR Filtered joint callset VCF for 10 whole genome samples. The sequencing data in this quickstart came from the [AnVIL 1000G High Coverage workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019).


## Prerequisites

This quickstart assumes that you are familiar with Terra workspaces, the data model, providing input parameters, and launching workflows.

1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on).
2. Grant the "BigQuery Editor" role on that **dataset** to your Terra PROXY group. Your proxy group name can be found on your Terra Profile page and look something like `PROXY_3298237498237948372@firecloud.org`
3. Grant the following roles on the Google **project** containing the dataset to your proxy group
1. You will need to have or create a BigQuery dataset (we'll call this `dataset_name` later on).
2. Grant the "BigQuery Data Editor" role on that **dataset** to your Terra PROXY group. Your proxy group name can be found on your Terra Profile page and looks something like `PROXY_12345678901234567890@firecloud.org`.
3. Grant the following roles on the Google **project** (we'll call this `project_id` later on) containing the dataset to your proxy group (a scripted sketch of these grants follows this list):
- BigQuery Data Editor
- BigQuery Job User
- BigQuery Read Session User
4. These tools expect re-blocked gVCF files as input, which are provided in this workspace.
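
If you prefer to script the dataset-level grant instead of clicking through the console, a minimal sketch with the `google-cloud-bigquery` client might look like the following. The project, dataset, and proxy-group values are placeholders, and the project-level roles would typically be granted separately (for example with `gcloud projects add-iam-policy-binding`).

```python
from google.cloud import bigquery

# Hypothetical placeholders: substitute your own project_id, dataset_name,
# and the proxy group shown on your Terra profile page.
client = bigquery.Client(project="my-gvs-project")
dataset = client.get_dataset("my-gvs-project.my_gvs_dataset")

# A dataset-level WRITER entry corresponds to the "BigQuery Data Editor"
# role granted through the console.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="WRITER",
        entity_type="groupByEmail",
        entity_id="PROXY_12345678901234567890@firecloud.org",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```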

## 1. Import Data

A sample set for the quickstart has already been created with 10 samples and paths to re-blocked gVCFs for each sample. Run the two import workflows against this sample set by selecting "sample_set" as the root entity type ("Step 1") and `gvs_demo_10` for the data ("Step 2"). If you are creating your own sample set, note that the sample table should have a column for the re-blocked gVCFs (`hg38_reblocked_gvcf` or `reblocked_gvcf_path`), and their index files need to be in the same location.

## 1.1 Assign Gvs IDs
## 1.1 Assign Gvs IDs and Create Loading Tables
To optimize the internal queries, each sample must have a unique and consecutive integer ID assigned. Run the `GvsAssignIds` workflow, which will create an appropriate ID for each sample in the sample set and update the BigQuery dataset with the sample name to ID mapping info.

This workflow should be run on a **sample set** as the root entity, for the quickstart that is the `gvs_demo_10` sample set.
This workflow should be run on a **sample set** as the root entity; for this workspace, that is the `gvs_demo_10` sample set.

These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| --------------------- | ----------- |
| project_id | The name of the google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| external_sample_names | datamodel (e.g `this.samples.sample_id`) |
| dataset_name | the name of the dataset you created above |
| external_sample_names | `this.samples.sample_id` (the sample identifier column from the `gvs_demo_10` sample set) |
| project_id | the name of the google project containing the dataset |
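
Once the workflow completes, you can sanity-check the name-to-ID mapping. A hedged sketch, assuming the mapping lands in a `sample_info` table with `sample_name` and `sample_id` columns (names inferred from parameter hints elsewhere in this guide, so verify them against your dataset):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")  # hypothetical project_id
sql = """
    SELECT sample_name, sample_id
    FROM `my-gvs-project.my_gvs_dataset.sample_info`
    ORDER BY sample_id
"""
for row in client.query(sql).result():
    print(row.sample_name, row.sample_id)
```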

## 1.2 Load data
Next, your re-blocked gVCF files will be copied into the `ref_ranges_*` and `vet_*` tables by running the `GvsImportGenomes` workflow.

Next, your re-blocked gVCF files should be imported into GVS by running the `GvsImportGenomes` workflow.

This workflow should be run on a **sample set** as the root entity, for the quickstart that is the `gvs_demo_10` sample set.
This workflow should be run on a **sample set** as the root entity; for this workspace, that is the `gvs_demo_10` sample set.

These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| ----------------- | ----------- |
| dataset_name | The name of the dataset you created above |
| project_id | The name of the google project containing the dataset |
| external_sample_names | from datamodel (e.g `this.samples.sample_id`) |
| input_vcf | reblocked gvcf for this sample; from datamodel (e.g. `this.samples.hg38_reblocked_gvcf`) |
| input_vcf_indexes | reblocked gvcf indexes for this sample; from datamodel (e.g. `this.samples.hg38_reblocked_gvcf_index`) |
| interval_list | Intervals to load (Use `gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list` for WGS) |

| Parameter | Description |
| --------------------- | ----------- |
| dataset_name | the name of the dataset you created above |
| external_sample_names | `this.samples.sample_id` (the sample identifier from the `gvs_demo_10` sample set) |
| input_vcf_indexes | `this.samples.hg38_reblocked_gvcf_index` (reblocked gvcf index file for each sample) |
| input_vcfs | `this.samples.hg38_reblocked_gvcf` (reblocked gvcf file for each sample) |
| project_id | the name of the google project containing the dataset |
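
To confirm the load, you can count rows in the `ref_ranges_*` and `vet_*` tables. A quick sketch against BigQuery's `__TABLES__` metadata (project and dataset names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")
sql = """
    SELECT table_id, row_count
    FROM `my-gvs-project.my_gvs_dataset.__TABLES__`
    WHERE table_id LIKE 'ref_ranges_%' OR table_id LIKE 'vet_%'
    ORDER BY table_id
"""
for row in client.query(sql).result():
    print(f"{row.table_id}: {row.row_count} rows")
```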

## 2. Create Alt Allele Table
This step loads data into the ALT_ALLELE table from the `vet_*` tables.

This workflow does not use the Terra data model to run, so be sure to select `Run workflow with inputs defined by file paths`.

This is done by running the `GvsCreateAltAllele` workflow with the following parameters:

| Parameter | Description |
| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`
| dataset_name | the name of the dataset you created above |
| project_id | the name of the google project containing the dataset |
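
As a quick sanity check that every sample contributed rows, you can count distinct sample IDs in the table (the `ALT_ALLELE` name is taken from this guide's description; the `sample_id` column is an assumption, so adjust names and casing if your dataset differs):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")  # hypothetical project_id
sql = "SELECT COUNT(DISTINCT sample_id) AS n_samples FROM `my-gvs-project.my_gvs_dataset.ALT_ALLELE`"
row = next(iter(client.query(sql).result()))
print(f"{row.n_samples} samples present in ALT_ALLELE")  # expect 10 for the demo set
```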

## 3. Create Filter Set
This step calculates features from the ALT_ALLELE table, trains the VQSR filtering model along with site-level QC filters, and loads them into BigQuery into a series of `filter_set_*` tables.

This step calculates features from the ALT_ALLELE table, and trains the VQSR filtering model along with site-level QC filters and loads them into BigQuery into a series of `filter_set_*` tables.
This workflow does not use the Terra data model to run, so be sure to select `Run workflow with inputs defined by file paths`.

This is done by running the `GvsCreateFilterSet` workflow with the following parameters:

| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| filter_set_name | A unique name to identify this filter set (e.g. `my_demo_filters` ); you will want to make note of this for use in step 4 |
| output_file_base_name | TODO: should be defaulted and optional |
| SNPsVariantRecalibratorClassic.max-gaussians | 4 |
| wgs_intervals | Intervals to load (Use `gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list` for WGS) |

**Note:** VQSR dies with the default/recommended configuration, so we set SNPsVariantRecalibratorClassic.max-gaussians to 4 here.

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`

## 5. Extract Cohort

This step extracts the data in BigQuery into a sharded joint called VCF

To prepare the dataset for extraction by putting it into a single vet and ref_ranges table, run the `GvsPrepareRangesCallset` workflow with the following parameters:
| Parameter | Description |
| --------------------------------- | ----------- |
| dataset_name | the name of the dataset you created above |
| filter_set_name | a unique name to identify this filter set (e.g. `my_demo_filters`); you will want to make note of this for use in step 5 |
| INDEL_VQSR_max_gaussians_override | you don't need to set this unless a previous run of the IndelsVariantRecalibrator task failed to converge; start with 3 and lower as needed |
| project_id | the name of the google project containing the dataset |
| SNP_VQSR_max_gaussians_override | you don't need to set this unless a previous run of the SNPsVariantRecalibratorClassic task failed to converge; start with 5 and lower as needed |
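
After the workflow finishes, you can confirm the filter set landed in BigQuery. A sketch assuming one of the `filter_set_*` tables is named `filter_set_sites` and is keyed by `filter_set_name` (the exact table names in the series are an assumption; list your dataset's tables to confirm):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")  # hypothetical project_id
sql = """
    SELECT COUNT(*) AS n_sites
    FROM `my-gvs-project.my_gvs_dataset.filter_set_sites`
    WHERE filter_set_name = 'my_demo_filters'
"""
row = next(iter(client.query(sql).result()))
print(f"{row.n_sites} filtered sites recorded for my_demo_filters")
```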

## 4. Prepare Callset
This step performs the heavy lifting in BigQuery to gather all the data required to create a jointly called VCF.

| Parameter | Description |
| ---------------- |---------------------------------------------------|
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| destination_cohort_table_prefix | The name of the preparation table that this step is creating (e.g. <destination_cohort_table_prefix>__REF_RANGES) |
| fq_sample_mapping_table | The fully qualified table name of the samples to extract (e.g. <project>.<dataset>.sample_info) |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`
This is done by running the `GvsPrepareRangesCallset` workflow with the following parameters:

| Parameter | Description |
|--------------------- | ----------- |
| dataset_name | the name of the dataset you created above |
| extract_table_prefix | A unique, descriptive name for the tables containing the callset (for simplicity, you can use the same name you used for `filter_set_name` in step 3); you will want to make note of this for use in the next step |
| project_id | the name of the google project containing the dataset |
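
To verify the preparation step, you can list the tables that share your `extract_table_prefix`. A sketch using `INFORMATION_SCHEMA` (the prefix, project, and dataset names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")
sql = """
    SELECT table_name
    FROM `my-gvs-project.my_gvs_dataset.INFORMATION_SCHEMA.TABLES`
    WHERE STARTS_WITH(table_name, 'my_demo_filters__')
"""
for row in client.query(sql).result():
    print(row.table_name)
```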

## 5. Extract Cohort
Now the data is ready to be extracted!

This is done by running the `GvsExtractCallset` workflow with the following parameters:
This workflow does not use the Terra data model to run, so be sure to select `Run workflow with inputs defined by file paths`.

This is done by running the `GvsExtractCallset` workflow with the following parameters:

| Parameter | Description |
| ----------------- |--------------------------|
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| output_file\_base\_name | Base name for generated VCFs |
| filter\_set_name | The name of the filter set identifier created in step #3 |
| fq_samples_to_extract_table | The fully qualified table name of the samples to extract (e.g. <project>.<dataset>.sample_info) |
| scatter_count | The scatter count for extract (e.g. 100 for quickstart) |
| wgs_intervals | Intervals to load (Use `gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list` for WGS) |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`
| Parameter | Description |
| -------------------- | -------------------------|
| dataset_name | the name of the dataset you created above |
| extract_table_prefix | the unique, descriptive name for the tables containing the callset you chose in step 4 |
| filter_set_name | the name of the filter set created in step 3 |
| project_id | the name of the google project containing the dataset |
| scatter_count | how wide to scatter the extract task (use 100 for the Quickstart) |

## 6. Your VCF is ready!!
## 6. Your VCF files are ready!

The sharded VCF output files are listed in the `ExtractTask.output_vcf` workflow output, and the associated index files are listed in `ExtractTask.output_vcf_index`
The sharded VCF output files are listed in the `ExtractTask.output_vcf` workflow output, and the associated index files are listed in `ExtractTask.output_vcf_index`.
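
If you want to pull the shards down locally, one hedged approach is to list them from the workspace bucket with the `google-cloud-storage` client; the bucket name and submission prefix below are placeholders for wherever your `GvsExtractCallset` run actually wrote its outputs:

```python
from google.cloud import storage

client = storage.Client(project="my-gvs-project")  # hypothetical project_id
# Placeholder bucket/prefix: use your workspace bucket and submission directory.
prefix = "submissions/<submission-id>/GvsExtractCallset/"
for blob in client.list_blobs("my-workspace-bucket", prefix=prefix):
    if blob.name.endswith((".vcf.gz", ".vcf.gz.tbi")):
        print(f"gs://my-workspace-bucket/{blob.name}")
        # blob.download_to_filename(blob.name.rsplit("/", 1)[-1])  # uncomment to download
```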