File of callset samples -> samples marked as 'withdrawn' in GVS [VS-436] #8009

Merged: 13 commits, Sep 8, 2022
1 change: 1 addition & 0 deletions .dockstore.yml
@@ -179,6 +179,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_vs_436_withdrawn_wdl
- name: GvsUnified
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsUnified.wdl
53 changes: 34 additions & 19 deletions scripts/variantstore/wdl/GvsWithdrawSamples.wdl
@@ -6,14 +6,19 @@ workflow GvsWithdrawSamples {
String dataset_name
String project_id

Array[String] sample_names
String sample_name_column_name_in_file
File sample_names_to_include_file
# should be in the format "2022-01-01 00:00:00 UTC"
String withdrawn_timestamp
}

call WithdrawSamples {
input:
project_id = project_id,
dataset_name = dataset_name,
sample_names = sample_names
sample_name_column_name_in_file = sample_name_column_name_in_file,
sample_names_to_include_file = sample_names_to_include_file,
withdrawn_timestamp = withdrawn_timestamp
}

output {
@@ -26,40 +31,50 @@ task WithdrawSamples {
String project_id
String dataset_name

Array[String] sample_names
String sample_name_column_name_in_file
File sample_names_to_include_file
# should be in the format "2022-01-01 00:00:00 UTC"
String withdrawn_timestamp
}

meta {
description: "Withdraw Samples from GVS by marking them as 'withdrawn' in the sample_info table"
description: "Given a list of samples for a callset, withdraw samples from GVS that are not included by marking them as 'withdrawn' in the sample_info table with a passed in timestamp."
# Might not be strictly necessary to make this volatile, but just in case:
volatile: true
}

command <<<
set -e
set -x
set -o errexit -o nounset -o xtrace -o pipefail
echo "project_id = ~{project_id}" > ~/.bigqueryrc

# get just the sample_name values from sample_names_to_include_file based on the
# sample_name_column_name_in_file into sample_names.tsv
col_num=$(head -n1 ~{sample_names_to_include_file} | tr '\t' '\n' | grep -Fxn '~{sample_name_column_name_in_file}' | cut -f1 -d:)
awk "{print \$$col_num}" ~{sample_names_to_include_file} | sed 1d > sample_names.tsv

# make sure that sample names were actually passed, warn and exit if empty
num_samples=~{length(sample_names)}
# make sure that we end up with some samples to make the temp table, warn and exit if empty
num_samples=$(cat sample_names.tsv | wc -l)
if [ $num_samples -eq 0 ]; then
echo "No sample names passed. Exiting"
echo "No sample names for callset produced. Exiting"
exit 0
fi

echo "project_id = ~{project_id}" > ~/.bigqueryrc
# create the temp table (expires in 1 day)
bq --project_id=~{project_id} mk --expiration=86400 ~{dataset_name}.current_callset_samples "sample_name:STRING"
# populate the temp table
bq load --project_id=~{project_id} -F "tab" ~{dataset_name}.current_callset_samples sample_names.tsv

# perform actual update
# join on the temp table to figure out which samples should be marked as withdrawn
bq --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
'UPDATE `~{dataset_name}.sample_info` SET withdrawn = CURRENT_TIMESTAMP() WHERE sample_name IN ("~{sep='\", \"' sample_names}")' > log_message.txt;
"UPDATE \`~{dataset_name}.sample_info\` AS samples SET withdrawn = '~{withdrawn_timestamp}' \
WHERE NOT EXISTS \
(SELECT * \
FROM \`~{project_id}.~{dataset_name}.current_callset_samples\` AS callset \
WHERE \
samples.sample_name = callset.sample_name \
AND NOT samples.is_control)" > log_message.txt

cat log_message.txt | sed -e 's/Number of affected rows: //' > rows_updated.txt
typeset -i rows_updated=$(cat rows_updated.txt)

if [ $num_samples -ne $rows_updated ]; then
echo "Error: Expected to update $num_samples rows - but only updated $rows_updated."
exit 1
fi

>>>
runtime {
docker: "us.gcr.io/broad-gatk/gatk:4.2.5.0"
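
The header lookup in the command block (find the column number by name, then print that column without the header) can be tried in isolation. This is a minimal sketch, assuming a tab-separated file with a header row; the function name `extract_column` and the sample data are hypothetical and not part of this PR. One deliberate difference from the diff: the `awk` call here splits on tabs only, whereas the diff relies on awk's default whitespace splitting, which would shift columns if any value contained a space.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# Hypothetical helper (not in the PR): print the values of the named column
# from a tab-separated file that has a header row.
extract_column() {
  local file="$1" column="$2" col_num
  # Find the 1-based index of the header cell that matches the name exactly.
  # "|| true" keeps errexit/pipefail from aborting when grep finds no match.
  col_num=$(head -n1 "$file" | tr '\t' '\n' | grep -Fxn -- "$column" | cut -f1 -d: | head -n1) || true
  if [ -z "$col_num" ]; then
    echo "Column '$column' not found in $file" >&2
    return 1
  fi
  # Split on tabs only so values containing spaces stay intact; skip the header.
  awk -F'\t' -v col="$col_num" 'NR > 1 { print $col }' "$file"
}

# Demo with made-up data: extract the second column.
demo=$(mktemp)
printf 'sample_id\tresearch_id\nS1\tR 1\nS2\tR2\n' > "$demo"
extract_column "$demo" research_id
rm -f "$demo"
```

Returning a non-zero status when the column is missing makes a misspelled `sample_name_column_name_in_file` fail loudly instead of producing an empty `sample_names.tsv`.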
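
The `NOT EXISTS` update is an anti-join: rows of `sample_info` with no matching name in `current_callset_samples` get the withdrawn timestamp. The sketch below emulates that selection locally with `awk` so the logic can be checked without BigQuery; it is a swapped-in technique, not the query the workflow runs, and the function name `samples_to_withdraw` and the two-column input layout are assumptions made for illustration. Read literally, the `AND NOT samples.is_control` predicate in the diff sits inside the subquery, so for a control sample the subquery matches nothing and `NOT EXISTS` is true. The sketch assumes the intent is to leave control samples alone, which corresponds to applying that predicate in the outer `WHERE`.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# Hypothetical helper, local emulation only (no BigQuery involved).
#   $1: file of callset sample names, one per line
#   $2: tab-separated rows of "sample_name<TAB>is_control", no header
# Prints the names that would be marked as withdrawn.
samples_to_withdraw() {
  awk -F'\t' '
    # First file: remember every callset sample name.
    FILENAME == ARGV[1] { in_callset[$1] = 1; next }
    # Second file: anti-join against the callset and skip control samples.
    !($1 in in_callset) && $2 != "true" { print $1 }
  ' "$1" "$2"
}

# Demo with made-up data.
callset=$(mktemp)
info=$(mktemp)
printf 'S1\nS3\n' > "$callset"
printf 'S1\tfalse\nS2\tfalse\nS3\tfalse\nCTRL\ttrue\n' > "$info"
samples_to_withdraw "$callset" "$info"   # prints S2
rm -f "$callset" "$info"
```

Comparing `FILENAME` against `ARGV[1]` rather than using the common `NR == FNR` idiom keeps the result correct when the callset file is empty.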