Clean up optional and inconsistently named inputs [VS-294] [VS-218] #7715

Merged
merged 36 commits into ah_var_store from rsa_optional_inputs on Mar 21, 2022
Commits
c1a4a70
GvsAssignIds
rsasch Mar 3, 2022
fd6631f
more GvsAssignIds
rsasch Mar 3, 2022
36afed1
GvsImportGenomes
rsasch Mar 3, 2022
900b03b
GvsAssignIds and GvsImportGenomes
rsasch Mar 3, 2022
f1244a6
bit more for GvsImportGenomes
rsasch Mar 3, 2022
8b6c26e
gatk_override for GvsImportGenomes
rsasch Mar 3, 2022
24cf6f5
neaten GvsImportGenomes
rsasch Mar 3, 2022
cac91e3
fix gatk for GvsImportGenomes
rsasch Mar 3, 2022
5975cf1
fix gatk for GvsImportGenomes
rsasch Mar 3, 2022
e00fac7
GvsCreateAltAllele
rsasch Mar 3, 2022
7663e4c
GvsCreateFilterSet
rsasch Mar 4, 2022
0863de0
starting on GvsCreateFilterSet
rsasch Mar 4, 2022
9f411e5
moreGvsCreateFilterSet
rsasch Mar 4, 2022
a037dd8
even more GvsCreateFilterSet
rsasch Mar 4, 2022
bf2df5c
fix sample_info table name
rsasch Mar 4, 2022
ec4134c
fix typo in reference path
rsasch Mar 7, 2022
4dc643a
GvsPrepareRangesCallset
rsasch Mar 7, 2022
e1ee0d9
wrong dockstore -- d'oh
rsasch Mar 7, 2022
95d80bd
modify readme
rsasch Mar 7, 2022
9681e5d
cleaned up extract, quickstart md for prepare and extract
rsasch Mar 7, 2022
e6a7106
remove default for extract_table_prefix
rsasch Mar 7, 2022
a535a5f
quickstart and typo fix
rsasch Mar 7, 2022
3d71b7d
meta wdls
rsasch Mar 7, 2022
7c4dcbd
more neatening
rsasch Mar 10, 2022
8349161
reodering
rsasch Mar 10, 2022
7603cbc
Merge branch 'ah_var_store' into rsa_optional_inputs
rsasch Mar 10, 2022
b3a0a09
Merge branch 'ah_var_store' into rsa_optional_inputs
rsasch Mar 14, 2022
1d1a1c0
unused inputs for GvsExtractCohortFromSampleNames
rsasch Mar 14, 2022
cec1f70
spacing for GvsExtractCallset
rsasch Mar 18, 2022
c0bcd99
some nits
rsasch Mar 21, 2022
4c013cc
take out preemptibles from CreateTables
rsasch Mar 21, 2022
ec3f04a
spacing and interval_weights_bed
rsasch Mar 21, 2022
752f8f1
removed unused read_project_id from CreateFilterSet
rsasch Mar 21, 2022
923d59f
documentation on scatter width
rsasch Mar 21, 2022
a3a5e8a
formatting
rsasch Mar 21, 2022
e902512
dockstore
rsasch Mar 21, 2022
15 changes: 8 additions & 7 deletions .dockstore.yml
@@ -65,7 +65,7 @@ workflows:
branches:
- master
- ah_var_store
- kc_quoting_bug
- rsa_optional_inputs
- name: GvsAoUReblockGvcf
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsAoUReblockGvcf.wdl
@@ -80,6 +80,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_optional_inputs
- name: GvsCreateFilterSet
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateFilterSet.wdl
@@ -89,7 +90,7 @@ workflows:
branches:
- master
- ah_var_store
- rc-vs-222-dataset-id
- rsa_optional_inputs
- name: GvsCreateAltAllele
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateAltAllele.wdl
@@ -99,7 +100,7 @@ workflows:
branches:
- master
- ah_var_store
- kc_quoting_bug
- rsa_optional_inputs
- name: GvsCreateTables
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateTables.wdl
@@ -118,7 +119,7 @@ workflows:
branches:
- master
- ah_var_store
- rc-split-intervals-odd
- rsa_optional_inputs
- name: GvsImportGenomes
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsImportGenomes.wdl
@@ -128,7 +129,7 @@ workflows:
branches:
- master
- ah_var_store
- kc_quoting_bug
- rsa_optional_inputs
- name: GvsPrepareCallset
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsPrepareCallset.wdl
@@ -138,7 +139,6 @@ workflows:
branches:
- master
- ah_var_store
- ah_flag_in_prepare
- name: GvsPrepareRangesCallset
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsPrepareRangesCallset.wdl
@@ -148,6 +148,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_optional_inputs
- name: GvsCreateVAT
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateVAT.wdl
@@ -173,7 +174,7 @@ workflows:
branches:
- master
- ah_var_store
- rc-split-intervals-odd
- rsa_optional_inputs
- name: MitochondriaPipeline
subclass: WDL
primaryDescriptorPath: /scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl
125 changes: 53 additions & 72 deletions scripts/variantstore/TERRA_QUICKSTART.md
@@ -1,125 +1,106 @@
# Quickstart - Joint Calling with the Broad Genomic Variant Store

**Note** The markdown source for this quickstart is maintained in the the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/TERRA_QUICKSTART.md). Submit any feedback, corrections or improvements in a pull request there. Do not edit this file directly.
**Note:** The markdown source for this quickstart is maintained in the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/TERRA_QUICKSTART.md). Submit any feedback, corrections, or improvements in a pull request there. Do not edit this file directly.

## Overview
Through this QuickStart you will learn how to use the Broad Genomic Variant Store to create a GATK VQSR Filtered joint callset VCF for whole genome samples.

The sequencing data in this quickstart came from the [AnVIL 1000G High Coverage workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019)
Through this QuickStart you will learn how to use the Broad Genomic Variant Store to create a GATK VQSR Filtered joint callset VCF for 10 whole genome samples. The sequencing data in this quickstart came from the [AnVIL 1000G High Coverage workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019).


## Prerequisites

This quickstart assumes that you are familiar with Terra workspaces, the data model, providing input parameters, and launching workflows.

1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on).
2. Grant the "BigQuery Editor" role on that **dataset** to your Terra PROXY group. Your proxy group name can be found on your Terra Profile page and look something like `PROXY_3298237498237948372@firecloud.org`
3. Grant the following roles on the Google **project** containing the dataset to your proxy group
1. You will need to have or create a BigQuery dataset (we'll call this `dataset_name` later on).
2. Grant the "BigQuery Data Editor" role on that **dataset** to your Terra PROXY group. Your proxy group name can be found on your Terra Profile page and looks something like `PROXY_12345678901234567890@firecloud.org`.
3. Grant the following roles on the Google **project** (we'll call this `project_id` later on) containing the dataset to your proxy group (a scripted sketch of these grants follows this list):
- BigQuery Data Editor
- BigQuery Job User
- BigQuery Read Session User
4. These tools expect re-blocked gVCF files as input, which are provided in this workspace.
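
If you prefer to script the dataset-level grant instead of clicking through the console, a minimal sketch with the `google-cloud-bigquery` client might look like the following. The project, dataset, and proxy-group values are placeholders, and the project-level roles would typically be granted separately (for example with `gcloud projects add-iam-policy-binding`).

```python
from google.cloud import bigquery

# Hypothetical placeholders: substitute your own project_id, dataset_name,
# and the proxy group shown on your Terra profile page.
client = bigquery.Client(project="my-gvs-project")
dataset = client.get_dataset("my-gvs-project.my_gvs_dataset")

# A dataset-level WRITER entry corresponds to the "BigQuery Data Editor"
# role granted through the console.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="WRITER",
        entity_type="groupByEmail",
        entity_id="PROXY_12345678901234567890@firecloud.org",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```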

## 1. Import Data

A sample set for the quickstart has already been created with 10 samples and paths to re-blocked gVCFs for each sample. Run the two import workflows against this sample set by selecting "sample_set" as the root entity type ("Step 1") and `gvs_demo_10` for the data ("Step 2"). If you are creating your own sample set, note that the sample table should have a column for the re-blocked gVCFs (`hg38_reblocked_gvcf` or `reblocked_gvcf_path`), and their index files need to be in the same location.

## 1.1 Assign Gvs IDs
## 1.1 Assign Gvs IDs and Create Loading Tables
To optimize the internal queries, each sample must have a unique and consecutive integer ID assigned. Run the `GvsAssignIds` workflow, which will create an appropriate ID for each sample in the sample set and update the BigQuery dataset with the sample name to ID mapping info.

This workflow should be run on a **sample set** as the root entity, for the quickstart that is the `gvs_demo_10` sample set.
This workflow should be run on a **sample set** as the root entity; for this workspace, that is the `gvs_demo_10` sample set.

These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| --------------------- | ----------- |
| project_id | The name of the google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| external_sample_names | datamodel (e.g `this.samples.sample_id`) |
| dataset_name | the name of the dataset you created above |
| external_sample_names | `this.samples.sample_id` (the sample identifier column from the `gvs_demo_10` sample set) |
| project_id | the name of the google project containing the dataset |
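
Once the workflow completes, you can sanity-check the name-to-ID mapping. A hedged sketch, assuming the mapping lands in a `sample_info` table with `sample_name` and `sample_id` columns (names inferred from parameter hints elsewhere in this guide, so verify them against your dataset):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")  # hypothetical project_id
sql = """
    SELECT sample_name, sample_id
    FROM `my-gvs-project.my_gvs_dataset.sample_info`
    ORDER BY sample_id
"""
for row in client.query(sql).result():
    print(row.sample_name, row.sample_id)
```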

## 1.2 Load data
Next, your re-blocked gVCF files will be copied into the `ref_ranges_*` and `vet_*` tables by running the `GvsImportGenomes` workflow.

Next, your re-blocked gVCF files should be imported into GVS by running the `GvsImportGenomes` workflow.

This workflow should be run on a **sample set** as the root entity, for the quickstart that is the `gvs_demo_10` sample set.
This workflow should be run on a **sample set** as the root entity; for this workspace, that is the `gvs_demo_10` sample set.

These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| ----------------- | ----------- |
| dataset_name | The name of the dataset you created above |
| project_id | The name of the google project containing the dataset |
| external_sample_names | from datamodel (e.g `this.samples.sample_id`) |
| input_vcf | reblocked gvcf for this sample; from datamodel (e.g. `this.samples.hg38_reblocked_gvcf`) |
| input_vcf_indexes | reblocked gvcf indexes for this sample; from datamodel (e.g. `this.samples.hg38_reblocked_gvcf_index`) |
| interval_list | Intervals to load (Use `gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list` for WGS) |

| Parameter | Description |
| --------------------- | ----------- |
| dataset_name | the name of the dataset you created above |
| external_sample_names | `this.samples.sample_id` (the sample identifier from the `gvs_demo_10` sample set) |
| input_vcf_indexes | `this.samples.hg38_reblocked_gvcf_index` (reblocked gvcf index file for each sample) |
| input_vcfs | `this.samples.hg38_reblocked_gvcf` (reblocked gvcf file for each sample) |
| project_id | the name of the google project containing the dataset |
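
To confirm the load, you can count rows in the `ref_ranges_*` and `vet_*` tables. A quick sketch against BigQuery's `__TABLES__` metadata (project and dataset names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")
sql = """
    SELECT table_id, row_count
    FROM `my-gvs-project.my_gvs_dataset.__TABLES__`
    WHERE table_id LIKE 'ref_ranges_%' OR table_id LIKE 'vet_%'
    ORDER BY table_id
"""
for row in client.query(sql).result():
    print(f"{row.table_id}: {row.row_count} rows")
```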

## 2. Create Alt Allele Table
This step loads data into the ALT_ALLELE table from the `vet_*` tables.

This workflow does not use the Terra data model to run, so be sure to select `Run workflow with inputs defined by file paths`.

This is done by running the `GvsCreateAltAllele` workflow with the following parameters:

| Parameter | Description |
| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`
| dataset_name | the name of the dataset you created above |
| project_id | the name of the google project containing the dataset |
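
As a quick sanity check that every sample contributed rows, you can count distinct sample IDs in the table (the `ALT_ALLELE` name is taken from this guide's description; the `sample_id` column is an assumption, so adjust names and casing if your dataset differs):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")  # hypothetical project_id
sql = "SELECT COUNT(DISTINCT sample_id) AS n_samples FROM `my-gvs-project.my_gvs_dataset.ALT_ALLELE`"
row = next(iter(client.query(sql).result()))
print(f"{row.n_samples} samples present in ALT_ALLELE")  # expect 10 for the demo set
```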

## 3. Create Filter Set
This step calculates features from the ALT_ALLELE table, trains the VQSR filtering model along with site-level QC filters, and loads them into BigQuery into a series of `filter_set_*` tables.

This step calculates features from the ALT_ALLELE table, and trains the VQSR filtering model along with site-level QC filters and loads them into BigQuery into a series of `filter_set_*` tables.
This workflow does not use the Terra data model to run, so be sure to select `Run workflow with inputs defined by file paths`.

This is done by running the `GvsCreateFilterSet` workflow with the following parameters:

| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| filter_set_name | A unique name to identify this filter set (e.g. `my_demo_filters` ); you will want to make note of this for use in step 4 |
| output_file_base_name | TODO: should be defaulted and optional |
| SNPsVariantRecalibratorClassic.max-gaussians | 4 |
| wgs_intervals | Intervals to load (Use `gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list` for WGS) |

**Note:** VQSR dies with the default/recommended configuration, so we set SNPsVariantRecalibratorClassic.max-gaussians to 4 here.

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`

## 5. Extract Cohort

This step extracts the data in BigQuery into a sharded joint called VCF

To prepare the dataset for extraction by putting it into a single vet and ref_ranges table, run the `GvsPrepareRangesCallset` workflow with the following parameters:
| Parameter | Description |
| --------------------------------- | ----------- |
| dataset_name | the name of the dataset you created above |
| filter_set_name | a unique name to identify this filter set (e.g. `my_demo_filters`); you will want to make note of this for use in step 5 |
| INDEL_VQSR_max_gaussians_override | you don't need to set this unless a previous run of the IndelsVariantRecalibrator task failed to converge; start with 3 and lower as needed |
| project_id | the name of the google project containing the dataset |
| SNP_VQSR_max_gaussians_override | you don't need to set this unless a previous run of the SNPsVariantRecalibratorClassic task failed to converge; start with 5 and lower as needed |
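
After the workflow finishes, you can confirm the filter set landed in BigQuery. A sketch assuming one of the `filter_set_*` tables is named `filter_set_sites` and is keyed by `filter_set_name` (the exact table names in the series are an assumption; list your dataset's tables to confirm):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")  # hypothetical project_id
sql = """
    SELECT COUNT(*) AS n_sites
    FROM `my-gvs-project.my_gvs_dataset.filter_set_sites`
    WHERE filter_set_name = 'my_demo_filters'
"""
row = next(iter(client.query(sql).result()))
print(f"{row.n_sites} filtered sites recorded for my_demo_filters")
```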

## 4. Prepare Callset
This step performs the heavy lifting in BigQuery to gather all the data required to create a jointly called VCF.

| Parameter | Description |
| ---------------- |---------------------------------------------------|
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| destination_cohort_table_prefix | The name of the preparation table that this step is creating (e.g. <destination_cohort_table_prefix>__REF_RANGES) |
| fq_sample_mapping_table | The fully qualified table name of the samples to extract (e.g. <project>.<dataset>.sample_info) |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`
This is done by running the `GvsPrepareRangesCallset` workflow with the following parameters:

| Parameter | Description |
|--------------------- | ----------- |
| dataset_name | the name of the dataset you created above |
| extract_table_prefix | A unique, descriptive name for the tables containing the callset (for simplicity, you can use the same name you used for `filter_set_name` in step 3); you will want to make note of this for use in the next step |
| project_id | the name of the google project containing the dataset |
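
To verify the preparation step, you can list the tables that share your `extract_table_prefix`. A sketch using `INFORMATION_SCHEMA` (the prefix, project, and dataset names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gvs-project")
sql = """
    SELECT table_name
    FROM `my-gvs-project.my_gvs_dataset.INFORMATION_SCHEMA.TABLES`
    WHERE STARTS_WITH(table_name, 'my_demo_filters__')
"""
for row in client.query(sql).result():
    print(row.table_name)
```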

## 5. Extract Cohort
Now the data is ready to be extracted!

This is done by running the `GvsExtractCallset` workflow with the following parameters:
This workflow does not use the Terra data model to run, so be sure to select `Run workflow with inputs defined by file paths`.

This is done by running the `GvsExtractCallset` workflow with the following parameters:

| Parameter | Description |
| ----------------- |--------------------------|
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| output_file\_base\_name | Base name for generated VCFs |
| filter\_set_name | The name of the filter set identifier created in step #3 |
| fq_samples_to_extract_table | The fully qualified table name of the samples to extract (e.g. <project>.<dataset>.sample_info) |
| scatter_count | The scatter count for extract (e.g. 100 for quickstart) |
| wgs_intervals | Intervals to load (Use `gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list` for WGS) |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`
| Parameter | Description |
| -------------------- | -------------------------|
| dataset_name | the name of the dataset you created above |
| extract_table_prefix | the unique, descriptive name for the tables containing the callset you chose in step 4 |
| filter_set_name | the name of the filter set created in step 3 |
| project_id | the name of the google project containing the dataset |
| scatter_count | how wide to scatter the extract task (use 100 for the Quickstart) |

## 6. Your VCF is ready!!
## 6. Your VCF files are ready!

The sharded VCF output files are listed in the `ExtractTask.output_vcf` workflow output, and the associated index files are listed in `ExtractTask.output_vcf_index`
The sharded VCF output files are listed in the `ExtractTask.output_vcf` workflow output, and the associated index files are listed in `ExtractTask.output_vcf_index`.
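
If you want to pull the shards down locally, one hedged approach is to list them from the workspace bucket with the `google-cloud-storage` client; the bucket name and submission prefix below are placeholders for wherever your `GvsExtractCallset` run actually wrote its outputs:

```python
from google.cloud import storage

client = storage.Client(project="my-gvs-project")  # hypothetical project_id
# Placeholder bucket/prefix: use your workspace bucket and submission directory.
prefix = "submissions/<submission-id>/GvsExtractCallset/"
for blob in client.list_blobs("my-workspace-bucket", prefix=prefix):
    if blob.name.endswith((".vcf.gz", ".vcf.gz.tbi")):
        print(f"gs://my-workspace-bucket/{blob.name}")
        # blob.download_to_filename(blob.name.rsplit("/", 1)[-1])  # uncomment to download
```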