Skip to content

TNTurnerLab/fitDNM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fitDNM

Evin Padhi

Washington University in St. Louis

Turner Lab

fitDNM was originally developed by the Allen lab (http://people.duke.edu/~asallen/Software.html) in Jiang et al 2015, Am. J. Hum. Genet. (https://www.cell.com/ajhg/fulltext/S0002-9297(15)00277-3) to incoporate functional information in test of excess de novo mutational load. Here we've adapted the pipeline to utilize CADD scores instead of PolyPhen-2 scores in order to run in noncoding regions of the genome and implemented a scalable verision of the pipeline to test many elements at once. Given a bedfile that contains the regions of interest one wants to test for a significant excess of de novo mutations and the corresponding variants to use, this pipeline will output two summary files that contain the p values and scores calculated by fitDNM for each element in the bed file in the .fitDNM.report file and a summary of all mutations found in these genomic regions in the .mutation.report file

Publication

Check out our usage of fitDNM in our paper in Human Genomics. See link.

Input files and download links for CADD score files:

File name Source MD5Sum annotation in configfile
whole_genome_SNVs.tsv.gz https://cadd.gs.washington.edu/download faaa80ef3948cf44e56a3629a90cdaaa cadd_score_file
whole_genome_SNVs.tsv.gz.tbi https://cadd.gs.washington.edu/download 4843cab24dd4992bb0cc5f1a7ebc807a NA
mutation_rate_by_trinucleotide_matrix.txt Here ca2faad78f1055c266b5a8451bebf1cb trinucleotide_mut_rate
Variant file User provided NA mutation_calls
Bed file User provided NA regions_of_interest

All CADD score files can be downloaded from https://cadd.gs.washington.edu/download using the All possible SNVs of GRCh38/hg38 US link, make sure to download both the score file and tabix index file or alternatively use the wget commands below and be sure to check MD5sums after downloading to ensure the download was successful.

To clone the repository, set up directories and download CADD scores please run:

git clone https://github.com/TNTurnerLab/fitDNM.git
cd fitDNM/input_data
mkdir -p CADD_scores
mkdir -p variants
cd CADD_scores
wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz
wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz.tbi

Currently we only support release v1.6 for GRCh38/hg38

Formatting the user provided inputs:
Bed file: For the user provided bed file please format using standard bed file format with an extra annotation column for each entry:
chr +'\t' + start + '\t' + stop + '\t' + annotation + '\n' and make sure that each entry has a unique annotation

Variant file: for the variant file to integrate automatically into the workflow the file must be ordered in the following way: holder_column + " " + chromosome + " " + position + " " + reference + " " + alternate + '\n' where the holder column can take on any value. They must also be on hg38

Usage details:

Given that the double saddle point model in fitDNM relies upon using variants that have been by scored through CADD, this pipeline requires that indels are filtered out. Either the user can provide a variant file that has been prefiltered, or this pipeline by default will do the filtering by default.

Additionally, here we use a 1-based BED coordinate system for retriving the sequences and calculating the length.

Requirements and usage:

Tabix (http://www.htslib.org/doc/tabix.html)
Snakemake (https://snakemake.readthedocs.io/en/stable/)
Bedtools version 2.27.0 or above (https://bedtools.readthedocs.io/en/latest/)
R and the following R packages:
foreach (https://cran.r-project.org/web/packages/foreach/foreach.pdf)
iterators (https://cran.r-project.org/web/packages/iterators/iterators.pdf)
doParallel (https://cran.r-project.org/web/packages/doParallel/doParallel.pdf)

Outline of how to run:

  1. Clone repository using git clone https://github.com/TNTurnerLab/fitDNM.git
  2. Download CADD scores and mutation rate file and check md5sums
  3. Build dockerfile and push to dockerhub, alternatively pull from tnturnerlab/fitdnm_snakemake:V1.0
  4. Format bed and variant file
  5. Modify fitDNM-main/fitDNM_code/fitDNM_snakemake/fitDNM_genome_wide.json to point to the described input files and change parameters also outlined below
  6. Run fitDNM snakemake. Once finished running, two files should be generated, .fitDNM.report and .muts.report

Example setups

Running an example analysis on the hs737 enhancer from Padhi et al 2020 (https://www.biorxiv.org/content/10.1101/2020.08.28.270751v1): After cloning the repository from github using git clone https://github.com/TNTurnerLab/fitDNM.git and creating the CADD_scores directory and downloading all files and place your variant call file within the variants directory inside of input data. The directory set up should then resemble the one below, where the . represents the parent directory fitDNM

.
├── Dockerfile
├── README.md
├── fitDNM_code
│   ├── fitDNM_R_code
│   │   ├── double_saddle_point_approx_8_7_2014.R
│   │   └── fitDNM.CADD.R
│   └── fitDNM_snakemake
│       ├── fitDNM_genome_wide.json
│       ├── fitDNM_genome_wide.smk
│       └── transform_CADD_scores.R
├── input_data
│   ├── CADD_scores
│   │   ├── whole_genome_SNVs.tsv.gz
│   │   └── whole_genome_SNVs.tsv.gz.tbi
│   ├── input_regions
│   │   └── hs737.bed
│   ├── mutation_rate_by_trinucleotide_matrix.txt
│   └── variants
│       └── user_variants.txt
└── run_fitDNM_snake.sh

Using this setup, change the config file to the same as below:

{
  "mutation_calls": "/input_data/variants/variants.txt",
  "cadd_score_file": "/input_data/CADD_scores/whole_genome_SNVs.tsv.gz",
  "trinucleotide_mut_rate": "/input_data/mutation_rate_by_trinucleotide_matrix.txt",
  "fitDNM_R_path": "/fitDNM_code/fitDNM_R_code/",
  "saddle_point_path": "/fitDNM_code/fitDNM_R_code/double_saddle_point_approx_8_7_2014.R",
  "males": "2666",
  "females": "0",
  "transform_cadd_scores_script_path":"/fitDNM_code/fitDNM_snakemake",
  "regions_of_interest": "/input_data/input_regions/hs737.bed"
}

Then to execute the code run one of the following

If running on an LSF server run the following (can be run from anywhere):

export LSF_DOCKER_VOLUMES="/home/user/fitDNM/fitDNM_code:/fitDNM_code /home/user/fitDNM/input_data:/input_data"
bsub  -R 'rusage[mem=10GB]' -n 1 -oo fitDNM_run.log  -a 'docker(tnturnerlab/fitdnm_snakemake:V1.0)' /opt/conda/envs/snakemake/bin/snakemake -s /fitDNM_code/fitDNM_snakemake/fitDNM_genome_wide.smk --cores 1

Alternatively, running it locally and assuming the same file structure the command would look like:

docker run -v "/home/user/fitDNM/fitDNM_code:/fitDNM_code" -v "/home/user/fitDNM/input_data:/input_data" tnturnerlab/fitdnm_snakemake:V1.0 /opt/conda/envs/snakemake/bin/snakemake -s /fitDNM_code/fitDNM_snakemake/fitDNM_genome_wide.smk --cores 1

For fitDNM to run it is essential the correct paths are mounted in docker image. To determine the correct path please run pwd in the top level fitDNM directory to get the path. Then in both the /home/user/fitDNM/fitDNM_code:/fitDNM_code and /home/user/fitDNM/input_data:/input_data part of the code, replace the the lefthand side of the semicolon with the output from pwd. This should be done regardless of wether or not you are using an LSF server.

Running on user files To run on user files, please place your variant file in the variants directory and your input bedfile into the input_regions directory and change the config file to point to these files.

Wrapper script: we also provide an example of a wrapper script that could be used on an LSF sever after updating the config file to point to all files needed except for the bedfile. The idea of this script is that it makes it easier to analyze different bed files using the same set of variants by using -f instead of having to change the config file each time. To use this script first change the memory and cpu usage to the desired setting and update the paths and then run: bash run_fitDNM.sh -f /path/to/bed/file it will then create a directory for the run execution of the job and when finished contain all necessary output plus a copy of the bed file

Pipeline overview:

For each entry in the bedfile, this pipeline creates the following temporary files that are eventually used by fitDNM

  1. <annotation>.CADD.txt: comprehensive list of CADD scores, this table is in a different format then the one downloaded from the CADD website, where instead of each possible change for a given nucleotide is an entry, each nucleotide is an entry and the changes are columns. The PHRED score is used for fitDNM, not the raw score.
  2. <annotation>.lis: list of mutations in your region of interest, columns should be chr, pos, ref, alt, gene
  3. <annotation>.mu.lis: utilizes the trinucleotide mutation rate frequencies to calculate the mutation rate for every possible change

After generating all neccesarry files and running fitDNM, it combines the fitDNM output of all entries into one file and all of the identified mutations into one file

  1. .fitDNM.report contains the results of fitDNM for all elements in the bedfile, which should consist of 8 columns for elements that have SNVs
  2. .muts.report summarizes the mutations in each element.

Docker:

For those not familiar with docker please see https://docs.docker.com/get-started/overview/

  • Example docker build command, first create a folder called fitDNM_snakemake that contains the Dockerfile from this repository in it. Then run
    1. Docker build fitDNM_snakemake
    2. Docker images to get the image ID
    3. Docker tag <image_ID> <user_name>/fitDNM_snakemake:initial
    4. Docker push <user_name>/fitDNM_snakemake:initial