Small updates to GvsExtractCallset from beta callset, new workflow for re-scattered shards #7493

Merged Oct 7, 2021 (27 commits)
Changes from 19 commits
Commits
acf1549
run SplitIntervals with localization enabled
rsasch Sep 23, 2021
7cdbf85
dockstore
rsasch Sep 23, 2021
1145547
removed retries and updated gatk docker for ExtractTask
rsasch Sep 23, 2021
67d5fc6
make use of retry with more memory and associated env variables
rsasch Sep 23, 2021
6b8b573
more fun with memory env vars
rsasch Sep 24, 2021
e48fad1
Merge branch 'ah_var_store' into rsa_split_intervals
rsasch Sep 27, 2021
c1d5d4b
roll back other changes to GvsExtractCallset besides SplitIntervals l…
rsasch Sep 27, 2021
6f8c0d2
added split_intervals_disk_size_override
rsasch Sep 27, 2021
75e6250
try to make the right thing parameterized
rsasch Sep 27, 2021
8b10698
Merge branch 'ah_var_store' into rsa_split_intervals
rsasch Oct 4, 2021
3014290
add service account json support to CreateManifest
rsasch Oct 4, 2021
6004773
copy-paste error
rsasch Oct 4, 2021
9856572
added wdl for sharded shards and associated python script
rsasch Oct 4, 2021
9989624
added for realsies
rsasch Oct 4, 2021
450b2ee
dockstore
rsasch Oct 5, 2021
d2ffbef
minor tweaks
rsasch Oct 5, 2021
9293c98
variable name fix
rsasch Oct 5, 2021
8c35677
use range correctly
rsasch Oct 5, 2021
ee9c96e
right docker
rsasch Oct 5, 2021
c678035
d'oh, forgot to set OUTPUT_GCS_DIR in ExtractTask
rsasch Oct 5, 2021
5910ee6
try without indexes
rsasch Oct 5, 2021
8bccff9
try without indexes input
rsasch Oct 5, 2021
647b5dd
try without indexes input
rsasch Oct 5, 2021
46ee9e2
try without indexes input again
rsasch Oct 5, 2021
b43ade8
turns out we don't need the index files to pass to GatherVcfsCloud
rsasch Oct 5, 2021
da61b5a
added merge_disk_override
rsasch Oct 5, 2021
9f42f59
change handling of merge_disk_override
rsasch Oct 5, 2021
10 changes: 10 additions & 0 deletions .dockstore.yml
@@ -65,6 +65,14 @@ workflows:
      branches:
        - master
        - ah_var_store
  - name: GvsMergeScatteredCallsetShards
    subclass: WDL
    primaryDescriptorPath: /scripts/variantstore/wdl/GvsMergeScatteredCallsetShards.wdl
    filters:
      branches:
        - master
        - ah_var_store
        - rsa_split_intervals
  - name: GvsCreateFilterSet
    subclass: WDL
    primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateFilterSet.wdl
@@ -74,6 +82,7 @@ workflows:
      branches:
        - master
        - ah_var_store
        - rsa_split_intervals
  - name: GvsCreateAltAllele
    subclass: WDL
    primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateAltAllele.wdl
@@ -92,6 +101,7 @@ workflows:
      branches:
        - master
        - ah_var_store
        - rsa_split_intervals
  - name: GvsImportGenomes
    subclass: WDL
    primaryDescriptorPath: /scripts/variantstore/wdl/GvsImportGenomes.wdl
18 changes: 15 additions & 3 deletions scripts/variantstore/wdl/GvsExtractCallset.wdl
@@ -29,6 +29,7 @@ workflow GvsExtractCallset {
File? excluded_intervals
Boolean? emit_pls = false
Int? extract_preemptible_override
Int? split_intervals_disk_size_override

String? service_account_json_path

@@ -49,6 +50,7 @@ workflow GvsExtractCallset {
ref_dict = reference_dict,
scatter_count = scatter_count,
output_gcs_dir = output_gcs_dir,
split_intervals_disk_size_override = split_intervals_disk_size_override,
service_account_json_path = service_account_json_path
}

@@ -105,7 +107,8 @@ workflow GvsExtractCallset {
call CreateManifest {
input:
manifest_lines = ExtractTask.manifest,
- output_gcs_dir = output_gcs_dir
+ output_gcs_dir = output_gcs_dir,
+ service_account_json_path = service_account_json_path
}

output {
@@ -253,13 +256,15 @@ task ExtractTask {
File ref_dict
Int scatter_count
String? split_intervals_extra_args
Int? split_intervals_disk_size_override
String? output_gcs_dir

File? gatk_override
String? service_account_json_path
}

String has_service_account_file = if (defined(service_account_json_path)) then 'true' else 'false'
Int disk_size = if (defined(split_intervals_disk_size_override)) then split_intervals_disk_size_override else 10

parameter_meta {
intervals: {
@@ -306,7 +311,7 @@ task ExtractTask {
docker: "us.gcr.io/broad-gatk/gatk:4.2.0.0"
bootDiskSizeGb: 15
memory: "3 GB"
- disks: "local-disk 10 HDD"
+ disks: "local-disk ~{disk_size} HDD"
preemptible: 3
cpu: 1
}
@@ -399,8 +404,11 @@ task CreateManifest {
input {
Array[String] manifest_lines
String? output_gcs_dir
String? service_account_json_path
}

String has_service_account_file = if (defined(service_account_json_path)) then 'true' else 'false'

command <<<
set -e
MANIFEST_LINES_TXT=~{write_lines(manifest_lines)}
@@ -410,7 +418,11 @@ task CreateManifest {
# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

- if [ -n "${OUTPUT_GCS_DIR}" ]; then
+ if [ -n "$OUTPUT_GCS_DIR" ]; then
+ if [ ~{has_service_account_file} = 'true' ]; then
+ gsutil cp ~{service_account_json_path} local.service_account.json
+ gcloud auth activate-service-account --key-file=local.service_account.json
+ fi
gsutil cp manifest.txt ${OUTPUT_GCS_DIR}/
fi
>>>
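The CreateManifest command above trims one trailing slash from output_gcs_dir and uploads the manifest only when the result is non-empty. A minimal Python sketch of that decision follows, for illustration only. The function names normalize_gcs_dir and manifest_destination are hypothetical and not part of this PR, and the sketch assumes an undefined WDL optional renders as an empty string in the command.

```python
from typing import Optional


def normalize_gcs_dir(output_gcs_dir: Optional[str]) -> str:
    # Assumption: an undefined optional input shows up as an empty string.
    path = output_gcs_dir or ""
    # Drop exactly one trailing slash, like the sed expression in the task.
    return path[:-1] if path.endswith("/") else path


def manifest_destination(output_gcs_dir: Optional[str]) -> Optional[str]:
    # Returns the upload target, or None when no output directory was given
    # (the case where the shell test on OUTPUT_GCS_DIR fails).
    directory = normalize_gcs_dir(output_gcs_dir)
    return f"{directory}/manifest.txt" if directory else None
```

For example, manifest_destination("gs://bucket/out/") gives "gs://bucket/out/manifest.txt", while manifest_destination(None) gives None and no upload would be attempted.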
123 changes: 123 additions & 0 deletions scripts/variantstore/wdl/GvsMergeScatteredCallsetShards.wdl
@@ -0,0 +1,123 @@
version 1.0

workflow GvsMergeScatteredCallsetShards {
input {
String input_vcfs_directory_plus_prefix
Int num_shards = 500
String output_vcf_base_name
String output_directory
String? service_account_json_path
}

call GenerateOrderedPaths as VCFpaths {
input:
root_path = input_vcfs_directory_plus_prefix,
num_files = num_shards,
path_suffix = ".vcf.gz"
}

call GenerateOrderedPaths as VCFIndexpaths {
input:
root_path = input_vcfs_directory_plus_prefix,
num_files = num_shards,
path_suffix = ".vcf.gz.tbi"
}

call MergeVCFs {
input:
input_vcfs = VCFpaths.paths,
input_vcfs_indexes = VCFIndexpaths.paths,
output_vcf_name = "${output_vcf_base_name}.vcf.gz",
output_directory = output_directory,
service_account_json_path = service_account_json_path
}
}

task MergeVCFs {
input {
Array[File] input_vcfs
Array[File] input_vcfs_indexes
String gather_type = "BLOCK"
String output_vcf_name
String output_directory
String? service_account_json_path
File? gatk_override
}

Int disk_size = ceil(size(input_vcfs, "GiB") * 2.5) + 10

parameter_meta {
input_vcfs: {
localization_optional: true
}
input_vcfs_indexes: {
localization_optional: true
}
}

String has_service_account_file = if (defined(service_account_json_path)) then 'true' else 'false'

command {
export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

if [ ~{has_service_account_file} = 'true' ]; then
gsutil cp ~{service_account_json_path} local.service_account.json
gcloud auth activate-service-account --key-file=local.service_account.json
export GOOGLE_APPLICATION_CREDENTIALS=local.service_account.json
fi

gatk --java-options -Xmx3g GatherVcfsCloud \
--ignore-safety-checks --gather-type ~{gather_type} \
--create-output-variant-index false \
-I ~{sep=' -I ' input_vcfs} \
--output ~{output_vcf_name}

tabix ~{output_vcf_name}

# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_directory} | sed 's/\/$//')

gsutil cp ~{output_vcf_name} $OUTPUT_GCS_DIR/
gsutil cp ~{output_vcf_name}.tbi $OUTPUT_GCS_DIR/
}

runtime {
docker: "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_d8a72b825eab2d979c8877448c0ca948fd9b34c7_change_to_hwe"
preemptible: 1
memory: "3 GiB"
disks: "local-disk ~{disk_size} HDD"
}

output {
File output_vcf = "~{output_vcf_name}"
File output_vcf_index = "~{output_vcf_name}.tbi"
}
}

task GenerateOrderedPaths {
input {
String root_path
String num_files
String path_suffix
}

command <<<
set -e

python3 /app/generate_ordered_paths.py \
--root_path ~{root_path} \
--path_suffix ~{path_suffix} \
--number ~{num_files} > file_names.txt
>>>

output {
Array[File] paths = read_lines("file_names.txt")
}

runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_20211005_2"
memory: "3 GB"
disks: "local-disk 10 HDD"
cpu: 1
}
}
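The MergeVCFs task above sizes its disk as ceil(size(input_vcfs, "GiB") * 2.5) + 10. A small Python sketch of the same arithmetic may help when estimating the disk request for a given set of shards. The function name merge_disk_size_gb is hypothetical, and the sketch takes the total input size in bytes as an assumption about how a caller would supply it.

```python
import math

GIB = 1024 ** 3  # WDL's "GiB" unit is 2**30 bytes


def merge_disk_size_gb(total_input_bytes: int) -> int:
    # Mirrors the WDL expression: 2.5x the combined input size, rounded up,
    # plus a fixed 10 GB of headroom.
    size_gib = total_input_bytes / GIB
    return math.ceil(size_gib * 2.5) + 10
```

With 100 GiB of input shards this yields 260, and with no input it falls back to the 10 GB floor.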
1 change: 1 addition & 0 deletions scripts/variantstore/wdl/extract/Dockerfile
@@ -13,6 +13,7 @@ COPY populate_alt_allele_table.py /app
COPY alt_allele_positions.sql /app
COPY alt_allele_temp_function.sql /app
COPY utils.py /app
COPY generate_ordered_paths.py /app

WORKDIR /app
ENTRYPOINT ["/bin/bash"]
17 changes: 17 additions & 0 deletions scripts/variantstore/wdl/extract/generate_ordered_paths.py
@@ -0,0 +1,17 @@
import argparse

def generate_ordered_paths(root_path, path_suffix, number):
    for i in range(0, number):
        print(f"{root_path}{i}{path_suffix}")

if __name__ == '__main__':
    parser = argparse.ArgumentParser(allow_abbrev=False, description='Generate an ordered list of shard file paths')
    parser.add_argument('--root_path',type=str, metavar='string', help='path plus file prefix', required=True)
    parser.add_argument('--path_suffix',type=str, metavar='string', help='path suffix', required=True)
    parser.add_argument('--number',type=int, metavar='integer', help='number of files', required=True)

    args = parser.parse_args()

    generate_ordered_paths(args.root_path,
                           args.path_suffix,
                           args.number)
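The script prints paths built as prefix + unpadded index + suffix, in numeric order. A hedged sketch of a list-returning variant is shown here, which is easier to unit test than the printing version. The name ordered_paths and the example bucket path are hypothetical and not part of this PR.

```python
from typing import List


def ordered_paths(root_path: str, path_suffix: str, number: int) -> List[str]:
    # Same naming scheme as the script: prefix + unpadded index + suffix,
    # with indexes running from 0 to number - 1.
    return [f"{root_path}{i}{path_suffix}" for i in range(number)]
```

Calling ordered_paths("gs://bucket/callset_", ".vcf.gz", 3) returns the three shard paths for indexes 0, 1, and 2. Because the indexes are not zero-padded, a lexicographic sort of the same names would place index 10 before index 2, so generating the list in numeric order preserves shard order where an alphabetical listing would not.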