Yana Eglit; 30 Nov 2021
Document to cite for Snakemake: Sustainable data analysis with Snakemake | F1000Research
Attempting snakemake. Installed on local Ubuntu via conda.
make
itself relies on prespecified targets and dependencies.
target : dependency
There are shortcuts ($^
and $@
) and variables $(VAR)
but apparently Snakemake is much more intuitive for modern scientific computing.
To install Snakemake, first use conda to install (apparently much faster, written in C++) Mamba:
conda install mamba -n base -c conda-forge
Then run the following from the base conda environment:
mamba create -c conda-forge -c bioconda -n snakemake snakemake
(this is dizzyingly fast compared to regular conda, savour the parallel installation!)
Then, of course, source activate snakemake
Preexisting Snakemake workflow catalog and The Snakemake Wrappers repository.
Snakemake uses bash 'strict mode' (Bash Strict Mode) by default!
Basics: An example workflow -- example using readmapping with bwa:
#./Snakefile
rule bwa_map:
input:
"data/genome.fa",
"data/samples/A.fastq"
output:
"mapped_reads/A.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
{input}
and {output}
are lists of strings that are by default concatenated with a whitespace. I.e. here {input}
becomes data/genome.fa data/samples/A.fastq
. A dry run $ snakemake -np mapped_reads/A.bam
(-n
is dry run, -p
is print shell commands that will be executed) produces:
bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
Note that substituting mapped_reads/B.bam
or dependency data/genome.fa
leads to error as snakemake expects the target.
(note to self: use shell script to generate Snakefiles in batch?)
Snakemake defines 'job' as an application of a rule to generate a set of output files or targets.
Now execute the workflow -- the number of cores must always be specified:
$ snakemake --cores 1 mapped_reads/A.bam
If run again, snakemake will not recreate A.bam, unless one of the unput files is newer than the latest output file. Rerunning the command anyway returns the following:
(snakemake-tutorial) /workspace/snakemake-tutorial-data$ snakemake --cores 1 mapped_reads/A.bam
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
Complete log: /workspace/snakemake-tutorial-data/.snakemake/log/2021-12-01T000335.641810.snakemake.log
Rules can be generalised by applying the wildcard {sample}
:
#./Snakefile
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
You can have multiple wildcards, but if a rule has multiple output files, it must have the same wildcard to prevent two parallel jobs overwriting the same file.
Now executing $ snakemake -np mapped_reads/B.bam
returns:
bwa mem data/genome.fa data/samples/B.fastq | samtools view -Sb - > mapped_reads/B.bam
Snakemake matches the B in mapped_reads/B.bam
to replace {sample}
in the shell command. Multiple targets can be specified as well: $ snakemake -np mapped_reads/A.bam mapped_reads/B.bam
; or, even combined with Bash brace expansion (or other Shell expansions), for example:
$ snakemake -np mapped_reads/{A,B}.bam
-- which will produce both options. Note: executing the latter only produces the shell command for B.bam as A.bam has already been created. You can touch
A.bam to update timestamp and re-run the rule:
$ touch data/samples/A.fastq
To sort alignments using samtools
, add this block below bwa_map
rule:
# ./Snakefile (continued)
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
Files from input directory taken, sorted, and saved as files in output directory.
Note: Snakemake creates new directories where any are missing!
Note {wildcards.sample}
: this runs the wildcard within the shell command; this is probably because {}
already means something in Bash itself, so Snakemake looks for the wildcards instruction?
Running snakemake -np sorted_reads/B.bam
now skips over the bwa_map
rule as their targets are already up to date. Snakemake automatically determines dependencies via filenames.
Now using samtools to index read alignments by genomic position:
#./Snakefile (continued)
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
To look at interdependencies between jobs, a directed acyclic graph can be generated with:
$ snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
The dot
command by Graphviz creates the visualisation (Snakemake's output is in .dot format, in standard output). Jobs that currently need not be run are represented by dashed frames. Example before and after running samtools_sort
on B:
Expand() with samples takes inputs and formats into an array of strings with a specified name:
eg expand("sorted_reads/{sample}.bam", sample=SAMPLES)
returns ["sorted_reads/A.bam", "sorted_reads/B.bam"]
under the name SAMPLES
.
while expand("sorted_reads/{sample}.{replicate}.bam", sample=SAMPLES, replicate=[0, 1])
iterates the substitutions for replicate over those of sample to produce: ["sorted_reads/A.0.bam", "sorted_reads/A.1.bam", "sorted_reads/B.0.bam", "sorted_reads/B.1.bam"]
This is exceptionally useful especially since Snakemake works backward from requested output, not available input files like a for loop over *.fas for example.
To the top of the snakefile add, in straight Python:
SAMPLES = ["A", "B"]
And the following rule:
# ./Snakefile (continued)
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
Note that order of files is not preserved, and positional input files precede named ones (don't fully understand this yet...).
In the example above, multiple inputs can be specifically referred to via {input.VARNAME}
, eg {input.fa}
. Note the shell command is split into multiple indented lines, to be merged together by Python. (I think this requires that pipe... otherwise use \
?) Input/output file lists can have any Python statements as long as they produce strings. Here, all three samples are iterated over each input file list.
$ snakemake --dag calls/all.vcf | dot -Tsvg > dag.svg
shows our graph now looks like this:
Snakemake has a script:
directive. Added to end of Snakefile, to plot histograms:
# ./Snakefile (continued)
rule plot_quals:
input:
"calls/all.vcf"
output:
"plots/quals.svg"
script:
"scripts/plot-quals.py"
(wow, this does inputs and outputs for you instead of executing python script.blah input output
) Script path is relative to Snakefile (pay attention to location!) . In this case, under scripts/
create plot-quals.py
:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pysam import VariantFile
quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
This way, there is no need to include command line argument parsing code in the script!!! Note snakemake.input[0]
(Suppose this also means the python script can be tailored to the Snakemake flow, eg positional arguments pre-defined in input:
files list. Could be risky though...)
R likewise contains an S4 object snakemake
, accessing the first element of input using snakemake@input[[1]]
(note first element starts with 1 in R, not zero -- this can be a source of error!) Named input/output can also be accessed via snakemake@input[["myfile"]]
(not sure if this means if the Snakefile has input: "myfile"
in the script rule...?) More info here.
If the requested rule does not have wildcards, Snakemake accepts rulenames as targets. By default, if no target is specified, Snakemake defines the 1st rule as target. It is good practice to have a rule all:
at the top of the Snakefile with the typically desired [final?] targets as input files. To top of Snakefile, add:
# ./Snakefile (continued)
rule all:
input:
"plots/quals.svg"
Simply running $ snakemake -n
will run the entire path of rules to generate the target plot from Step 6. Otherwise, the order of rules is arbitrary in Snakemake. One can also generate multiple target rules if a set of reasonable endpoints exists, which can be called via snakemake -n mytarget
.
A dry run -n
of Snakemake generates a planned job stats table; with rule all:
it is summarised. (indeed, commenting out rule all:
breaks this!)
snakemake --dag all | dot -Tsvg > dag.svg
leads to the full map:
snakemake --help
generates a fairly long helpfile. Eg:
--forcerun [TARGET ...], -R [TARGET ...]
Force the re-execution or creation of the given rules or files. Use
this option if you changed a rule and want to have all its output
in your workflow updated. (default: None)
snakemake --cores 1 --forcerun samtools_sort
executes up to and including the rule samtools_sort:
. adding --reason
adds a field explaining why this rule was run, eg:
rule plot_quals:
input: calls/all.vcf
output: plots/quals.svg
jobid: 1
reason: Input files updated by another job: calls/all.vcf
resources: tmpdir=/tmp
SAMPLES = ["A", "B"]
rule all:
input:
"plots/quals.svg"
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}" #this is a continuation of above line
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
rule plot_quals:
input:
"calls/all.vcf"
output:
"plots/quals.svg"
script:
"scripts/plot-quals.py"
Next Tutorial: Advanced: Decorating the example workflow
Shell block must only contain one line within quotes (accepts only 1 argument!)... or a quoteblock like so:
shell:
"""
shell command 1
shell command 2
...
"""
Commas after inputs and outputs important:
input:
file1,
file2
output:
file3
shell:
"shell command 1"
The input line actually reads:
input: file1, file2
When running Snakemake:
-p
prints commands!
-n
is a dry run
-f
forces a run
Best way to run the whole pipeline from scratch is to touch
the first input files
Use the following function to extract shell scripts mentioned in file:
function genSnakeShellScripts{
sed -n '/^\#/! s/^.*[ "]\(.*\.sh\)[ $].*$/\1/p' Snakefile | uniq
}