-
Notifications
You must be signed in to change notification settings - Fork 100
Downstreamer
🚧 Under construction 🚧
Downstreamer can be used to to perform pathway enrichments and core-gene prioritizations using GWAS summary statistics. Downstreamer does not rely on known pathway assignments of genes but instead uses predicted gene-pathway assignments [1]. The core-gene prioritizations of genes are based on co-regulation. This co-regulation is a measure related to co-expression but less driven by tissue differences [2]. Both the gene-pathway assignments and co-expression scores are calculated using 31,499 public RNA-seq samples from many different tissues [1].
A manuscript is in preparation
1️⃣ Getting started 2️⃣ Running Downstreamer 3️⃣ Key gene enrichment analysis 5️⃣ Most important output files
Download tool here: TODO
Download reference data here: TODO
The main function of Dowstreamer is using GWAS summary statistics to predict relevant pathways and genes. The commands needed to do this using the by us provided pathway databases are listed below. Other modes of Downstreamer are listed here: DS additional modes
All Downstreamer commands have the following basis:
java -Xmx10g -jar Downstreamer.jar --mode [MODE] --output
On top of this you might need to allocate more memory when running Downstream. If you get an heapspace error, then increase the value of -Xmx10g to increase the amout of memory that downstreamer can use.
First the GWAS summary statistics should be saved in the folling manner:
- TAB seperated text file
- First column must contain the variant identifiers as used in the LD reference panel. If the reference panel uses RS identifiers then these should be used and if chr:pos indentifiers are used by the reference panel then these should be used here.
- The following columns should contain the GWAS variant p-values. The header of these columns should be name of the GWAS
- It is not needed to filter on variants present in the reference data, these will automaticly be excluded.
Secondly this text file needs to be convered to a binary file for quick access by Downsreamer. That can be done using the CONVERT_TXT
mode with the following command:
❗ While it is possible to combine multiple GWAS summary statistics in a single file this is only recommended if they are performed on the same cohort using the same genotyping data.
java -Xmx10g -jar Downstreamer.jar --mode CONVERT_TXT --output [PATH_TO_OUTPUT_FILE] --pvalueToZscore --gwas [PATH_TO_GWAS_TEXT_FILE]
You now get 3 files, 1 binary file with the summary statistics matrix and 2 text files depicting the rows and columns. If al went well the columns files will contain the names of the studies in the orignal text file.
The first part of Downstreamer, STEP1
mode, is the most computationally and memory intensive. Here the GWAS SNP p-values are converted to GWAS gene p-values. Additionally this step does a very basis pruning to identify independent top GWAS hits. It is possible to provide the STEP2 arguments when running STEP1, in this case Downstreamer will automatically continue with STEP2.
Arguments used by STEP1
Argument | ShortArgument | Type | Describtion |
---|---|---|---|
--correctLambda | -cl | flag | Correct the GWAS for the lambda inflation |
--debug | -d | flag | Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis. |
--gwas | -g | path | GWAS Z-sccore binary matrix. Rows variants, Cols phenotypes. Without .dat |
--genes | -ge | path | File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm |
--maf | -maf | double | Minimum MAF |
--output | -o | path | The output path |
--permutations | -p | int | Number of initial permutations before using Farebrother's RUBEN algorithm to determine gene p-values |
--permutationFDR | -pfdr | int | Number of random phenotypes to use to determine null distribution for FDR calculation |
--permutationGeneCorrelations | -pgc | int | Number of random phenotypes to use to determine gene correlations |
--permutationPathwayEnrichment | -ppe | int | Number of random phenotypes to use to determine null distribution pathway enrichment |
--permutationsRescue | -pr | int | Number of permutations to use as fallback incase Farebrother's RUBEN algorithm failed. Optional but recommende to do atleast: 100,000,000. |
--referenceGenotypeFormat | -R | type | The reference genotype data type. If not defined will attempt to automatically select the first matching dataset on the specified path * PED_MAP - plink PED MAP files. * PLINK_BED - plink BED BIM FAM files. * VCF - bgziped vcf with tabix index file * VCFFOLDER - matches all bgziped vcf files + tabix index in a folder * SHAPEIT2 - shapeit2 phased haplotypes .haps & .sample * GEN - Oxford .gen & .sample * BGEN - Oxford .bgen & optionally .sample * TRITYPER - TriTyper format folder |
--referenceGenotypes | -r | basePath | The reference genotypes |
--referenceSamples | -rs | path | Samples to include from reference genotypes |
--threads | -t | int | Maximum number of calculation threads |
--saveUsedVariantsPerGene | -uvg | flag | Save all the variants used per gene to calculate the gene p-value (Warning this will create a very large file) |
--variantCorrelation | -v | double | Max correlation between variants to use (recommend = 0.95) |
--variantFilter | -vf | path | File with variants to include in gene p-value calculation (optional) |
--variantGene | -vg | path | Variant gene mapping. col 1: variant ID col 2: ENSG ID. (optional) |
--window | -w | int | Number of bases to add left and right of gene window for step 1, use -1 to not have any variants in window |
To run the STEP2
mode the programs needs the gene p-values of the real GWASes and the randomized null GWASes. It also takes the top hits as determined in STEP1
but it is possible to provide your own list of top hits --alternaitveTopHits
.
The gene p-values are the basis of the pathway enrichment analysis and gene prioritizations in STEP2. Because it is not needed to recalcute the gene p-values when performing multiple enrichment analysis this step only needs to be run once.
Argument | ShortArgument | Type | Describtion |
---|---|---|---|
--annotDb | -adb | String | Name of Pathway databases that you want to annotate with GWAS info. Rownames MUST be enembl Id's |
--alternaitveTopHits | -ath | name=path | Alternative file with top GWAS hits id chr pos p-value. name must be name as in oter output files for trait |
--covariates | -cov | path | File with covariates used to correct the gene p-values. Works in conjunction with -rgl. Residuals of this regression are used as input for the GLS |
--cisWindowExtent | -cwe | int | Cis window is defined as [startOfGene-cwe, endOfGene+cwe]. Defaults to: 250000 |
--debug | -d | flag | Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis. |
--excludeHla | -eh | flag | Exclude HLA locus during pathway enrichments (chr6 20mb - 40mb) |
--forceNormalGenePvalues | -fngp | flag | Force normal gene p-values before pathway enrichtment |
--forceNormalPathwayPvalues | -fnpp | flag | Force normal pathway p-values before pathway enrichtment |
--geneCorrelationWindow | -gcw | int | Window in bases to calculate gene correlations for GLS of pathway enrichment |
--genes | -ge | path | File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm |
--genePruningR | -gpr | double | Exclude correlated genes in pathway enrichments |
--ignoreGeneCorrelations | -igc | flag | Ignore gene correlations in pathway enrichment (not recommended) |
--output | -o | path | The output path |
--pathwayDatabase | -pd | name=path | Pathway databases, binary matrix with either z-scores for predicted gene pathway associations or 0 / 1 for gene assignments |
--permutationFDR | -pfdr | int | Number of random phenotypes to use to determine null distribution for FDR calculation |
--permutationGeneCorrelations | -pgc | int | Number of random phenotypes to use to determine gene correlations |
--permutationPathwayEnrichment | -ppe | int | Number of random phenotypes to use to determine null distribution pathway enrichment |
--qnorm | -qn | flag | Quantile normalize the permutations gene p-values before pathway enrichments to fit the real gene p-value distribution |
--regress-gene-lengths | -rgl | flag | Linearly correct for the effect of gene lengths on gene p-values |
--saveExcel | -se | flag | Save enrichement results also as excel files. Will generate 1 file per input phenotype |
--stepOneOutput | -soo | path | The output path of STEP1 |
--threads | -t | int | Maximum number of calculation threads |
After the key gene prioritization performed in STEP2
it is posible to test determine the enrichments of these key-genes in pathways or HPO-terms. This is done using the PRIO_GENE_ENRICH
mode.
❗ The --pathwayDatabase
beheaves slightly different here. Now it is used to define on which predictions the enrichment should be performed. This ofcrouse can only be done on gene prioritizations. If you also performed pahtway or tissue prioritization these should be omited here.
Argument | ShortArgument | Type | Describtion |
---|---|---|---|
--debug | -d | flag | Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis. |
--excludeHla | -eh | flag | Exclude HLA locus during pathway enrichments (chr6 20mb - 40mb) |
--genes | -ge | path | File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm |
--output | -o | path | The output path |
--pathwayDatabase | -pd | name=path | Pathway databases, binary matrix with either z-scores for predicted gene pathway associations or 0 / 1 for gene assignments |
--pathwayDatabase2 | -pd2 | name=path | Pathway databases, binary matrix with 0 or 1 for gene pathway assignments. Only used by PRIO_GENE_ENRICH mode |
--stepOneOutput | -soo | path | The output path of STEP1 |
Among the different intermediate files, two files contain the primary output.
This file contains the results of the different enrichment analysis. Each sheet contains the results of single enrichment source. By default we run enrichment on the following types of datasets. But in principle it can be done on any genes times X matrix.
As an input we use a gene by gene matrix to expression the relation between genes. We typically use co-regulation to do this (see introduction above). This means that for each gene we have a metric how it relates to all other genes, there scores are then correlated to the GWAS gene scores. The idea being that genes that are co-regulated with many genes that are picked up by the GWAS are more important to the studied trait. Since this is done in a genome wide manner we can prioritize trans acting genes outside of the GWAS loci.
The pathways enrichments are performed using the predicted pathways assignments from [1]. Here we test if the predicted gene assignments to a pathway are correlate to GWAS gene signal.
The sample enrichments (in the expression tab) indicate the correlation between expression levels of each of the 31,499 samples in our gene-network and the GWAS gene p-values. We found that to top enriched samples are often very relevant for the trait studied trait.
The GTEx sheet is used to determine the enrichment of primary tissues by correlating the GWAS to the average expression profile of each tissue in GTEx.
The co-regulation scores can also be used to prioritize candidate genes within the loci identified by the GWAS. In this file we list all the genes within a window around the independent top variants and we prioritize these variants based on the overal gene prioritization. These prioritization scores do not nesearally have to be very strong since they these genes might only have a small effect on the overall outcome of the traits. We do however suspect that the causal genes within a cis locus will, on average, have higher scores. We are currently investigating this further.
This file is created using the PRIO_GENE_ENRICH
mode and contains the enrichments of the Bonferonni signficant and FDR 5% signficant key genees. The bracked part of the file name is arbitrary and depends on the name you have used for these predictions in the --pathwayDatabase
argument. This file can contain multiple sheets, one for each database specified using --pathwayDatabase2
.
Additionally, there will also be a file with GenePvalues instead of a database name. These are the enrichments when using the PASCAL based gene p-values directly.
- QTL mapping pipeline
- Genotype Harmonizer
- Genotype IO
- ASE
- GADO Command line
- Downstreamer
- GeneNetwork Analysis
Analysis plans
Other