Skip to content

Downstreamer

Patrick Deelen edited this page Sep 19, 2021 · 55 revisions

🚧 Under construction 🚧

Downstreamer can be used to to perform pathway enrichments and core-gene prioritizations using GWAS summary statistics. Downstreamer does not rely on known pathway assignments of genes but instead uses predicted gene-pathway assignments [1]. The core-gene prioritizations of genes are based on co-regulation. This co-regulation is a measure related to co-expression but less driven by tissue differences [2]. Both the gene-pathway assignments and co-expression scores are calculated using 31,499 public RNA-seq samples from many different tissues [1].

A manuscript is in preparation

Content

1️⃣ Getting started 2️⃣ Running Downstreamer 3️⃣ Key gene enrichment analysis 5️⃣ Most important output files

1. Getting started

Download tool here: TODO

Download reference data here: TODO

2. Running Downstreamer

The main function of Dowstreamer is using GWAS summary statistics to predict relevant pathways and genes. The commands needed to do this using the by us provided pathway databases are listed below. Other modes of Downstreamer are listed here: DS additional modes

All Downstreamer commands have the following basis:

java -Xmx10g -jar Downstreamer.jar --mode [MODE] --output

On top of this you might need to allocate more memory when running Downstream. If you get an heapspace error, then increase the value of -Xmx10g to increase the amout of memory that downstreamer can use.

2.a. Preparing GWAS summary statistics

First the GWAS summary statistics should be saved in the folling manner:

  • TAB seperated text file
  • First column must contain the variant identifiers as used in the LD reference panel. If the reference panel uses RS identifiers then these should be used and if chr:pos indentifiers are used by the reference panel then these should be used here.
  • The following columns should contain the GWAS variant p-values. The header of these columns should be name of the GWAS
  • It is not needed to filter on variants present in the reference data, these will automaticly be excluded.

Secondly this text file needs to be convered to a binary file for quick access by Downsreamer. That can be done using the CONVERT_TXT mode with the following command:

❗ While it is possible to combine multiple GWAS summary statistics in a single file this is only recommended if they are performed on the same cohort using the same genotyping data.

java -Xmx10g -jar Downstreamer.jar --mode CONVERT_TXT --output [PATH_TO_OUTPUT_FILE] --pvalueToZscore --gwas [PATH_TO_GWAS_TEXT_FILE]

You now get 3 files, 1 binary file with the summary statistics matrix and 2 text files depicting the rows and columns. If al went well the columns files will contain the names of the studies in the orignal text file.

2.b. Dowstreamer STEP1

The first part of Downstreamer, STEP1 mode, is the most computationally and memory intensive. Here the GWAS SNP p-values are converted to GWAS gene p-values. Additionally this step does a very basis pruning to identify independent top GWAS hits. It is possible to provide the STEP2 arguments when running STEP1, in this case Downstreamer will automatically continue with STEP2.

Arguments used by STEP1

Argument ShortArgument Type Describtion
--correctLambda -cl flag Correct the GWAS for the lambda inflation
--debug -d flag Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis.
--gwas -g path GWAS Z-sccore binary matrix. Rows variants, Cols phenotypes. Without .dat
--genes -ge path File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm
--maf -maf double Minimum MAF
--output -o path The output path
--permutations -p int Number of initial permutations before using Farebrother's RUBEN algorithm to determine gene p-values
--permutationFDR -pfdr int Number of random phenotypes to use to determine null distribution for FDR calculation
--permutationGeneCorrelations -pgc int Number of random phenotypes to use to determine gene correlations
--permutationPathwayEnrichment -ppe int Number of random phenotypes to use to determine null distribution pathway enrichment
--permutationsRescue -pr int Number of permutations to use as fallback incase Farebrother's RUBEN algorithm failed. Optional but recommende to do atleast: 100,000,000.
--referenceGenotypeFormat -R type The reference genotype data type. If not defined will attempt to automatically select the first matching dataset on the specified path * PED_MAP - plink PED MAP files. * PLINK_BED - plink BED BIM FAM files. * VCF - bgziped vcf with tabix index file * VCFFOLDER - matches all bgziped vcf files + tabix index in a folder * SHAPEIT2 - shapeit2 phased haplotypes .haps & .sample * GEN - Oxford .gen & .sample * BGEN - Oxford .bgen & optionally .sample * TRITYPER - TriTyper format folder
--referenceGenotypes -r basePath The reference genotypes
--referenceSamples -rs path Samples to include from reference genotypes
--threads -t int Maximum number of calculation threads
--saveUsedVariantsPerGene -uvg flag Save all the variants used per gene to calculate the gene p-value (Warning this will create a very large file)
--variantCorrelation -v double Max correlation between variants to use (recommend = 0.95)
--variantFilter -vf path File with variants to include in gene p-value calculation (optional)
--variantGene -vg path Variant gene mapping. col 1: variant ID col 2: ENSG ID. (optional)
--window -w int Number of bases to add left and right of gene window for step 1, use -1 to not have any variants in window

2.c.Downstreamer STEP2

To run the STEP2 mode the programs needs the gene p-values of the real GWASes and the randomized null GWASes. It also takes the top hits as determined in STEP1 but it is possible to provide your own list of top hits --alternaitveTopHits.

The gene p-values are the basis of the pathway enrichment analysis and gene prioritizations in STEP2. Because it is not needed to recalcute the gene p-values when performing multiple enrichment analysis this step only needs to be run once.

Argument ShortArgument Type Describtion
--annotDb -adb String Name of Pathway databases that you want to annotate with GWAS info. Rownames MUST be enembl Id's
--alternaitveTopHits -ath name=path Alternative file with top GWAS hits id chr pos p-value. name must be name as in oter output files for trait
--covariates -cov path File with covariates used to correct the gene p-values. Works in conjunction with -rgl. Residuals of this regression are used as input for the GLS
--cisWindowExtent -cwe int Cis window is defined as [startOfGene-cwe, endOfGene+cwe]. Defaults to: 250000
--debug -d flag Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis.
--excludeHla -eh flag Exclude HLA locus during pathway enrichments (chr6 20mb - 40mb)
--forceNormalGenePvalues -fngp flag Force normal gene p-values before pathway enrichtment
--forceNormalPathwayPvalues -fnpp flag Force normal pathway p-values before pathway enrichtment
--geneCorrelationWindow -gcw int Window in bases to calculate gene correlations for GLS of pathway enrichment
--genes -ge path File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm
--genePruningR -gpr double Exclude correlated genes in pathway enrichments
--ignoreGeneCorrelations -igc flag Ignore gene correlations in pathway enrichment (not recommended)
--output -o path The output path
--pathwayDatabase -pd name=path Pathway databases, binary matrix with either z-scores for predicted gene pathway associations or 0 / 1 for gene assignments
--permutationFDR -pfdr int Number of random phenotypes to use to determine null distribution for FDR calculation
--permutationGeneCorrelations -pgc int Number of random phenotypes to use to determine gene correlations
--permutationPathwayEnrichment -ppe int Number of random phenotypes to use to determine null distribution pathway enrichment
--qnorm -qn flag Quantile normalize the permutations gene p-values before pathway enrichments to fit the real gene p-value distribution
--regress-gene-lengths -rgl flag Linearly correct for the effect of gene lengths on gene p-values
--saveExcel -se flag Save enrichement results also as excel files. Will generate 1 file per input phenotype
--stepOneOutput -soo path The output path of STEP1
--threads -t int Maximum number of calculation threads

3. Key gene enrichment analysis

After the key gene prioritization performed in STEP2 it is posible to test determine the enrichments of these key-genes in pathways or HPO-terms. This is done using the PRIO_GENE_ENRICH mode.

❗ The --pathwayDatabase beheaves slightly different here. Now it is used to define on which predictions the enrichment should be performed. This ofcrouse can only be done on gene prioritizations. If you also performed pahtway or tissue prioritization these should be omited here.

Argument ShortArgument Type Describtion
--debug -d flag Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis.
--excludeHla -eh flag Exclude HLA locus during pathway enrichments (chr6 20mb - 40mb)
--genes -ge path File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm
--output -o path The output path
--pathwayDatabase -pd name=path Pathway databases, binary matrix with either z-scores for predicted gene pathway associations or 0 / 1 for gene assignments
--pathwayDatabase2 -pd2 name=path Pathway databases, binary matrix with 0 or 1 for gene pathway assignments. Only used by PRIO_GENE_ENRICH mode
--stepOneOutput -soo path The output path of STEP1

4. Most important output files

Among the different intermediate files, two files contain the primary output.

TraitName_enrichtments.xlsx

This file contains the results of the different enrichment analysis. Each sheet contains the results of single enrichment source. By default we run enrichment on the following types of datasets. But in principle it can be done on any genes times X matrix.

Gene co-regulation enrichments

As an input we use a gene by gene matrix to expression the relation between genes. We typically use co-regulation to do this (see introduction above). This means that for each gene we have a metric how it relates to all other genes, there scores are then correlated to the GWAS gene scores. The idea being that genes that are co-regulated with many genes that are picked up by the GWAS are more important to the studied trait. Since this is done in a genome wide manner we can prioritize trans acting genes outside of the GWAS loci.

Pathway enrichments

The pathways enrichments are performed using the predicted pathways assignments from [1]. Here we test if the predicted gene assignments to a pathway are correlate to GWAS gene signal.

Sample & tissue enrichments

The sample enrichments (in the expression tab) indicate the correlation between expression levels of each of the 31,499 samples in our gene-network and the GWAS gene p-values. We found that to top enriched samples are often very relevant for the trait studied trait.

The GTEx sheet is used to determine the enrichment of primary tissues by correlating the GWAS to the average expression profile of each tissue in GTEx.

TraitName_cisPrio.xlsx

The co-regulation scores can also be used to prioritize candidate genes within the loci identified by the GWAS. In this file we list all the genes within a window around the independent top variants and we prioritize these variants based on the overal gene prioritization. These prioritization scores do not nesearally have to be very strong since they these genes might only have a small effect on the overall outcome of the traits. We do however suspect that the causal genes within a cis locus will, on average, have higher scores. We are currently investigating this further.

TraitName_[KeyGene]_Enrichment.xlsx

This file is created using the PRIO_GENE_ENRICH mode and contains the enrichments of the Bonferonni signficant and FDR 5% signficant key genees. The bracked part of the file name is arbitrary and depends on the name you have used for these predictions in the --pathwayDatabase argument. This file can contain multiple sheets, one for each database specified using --pathwayDatabase2.

Additionally, there will also be a file with GenePvalues instead of a database name. These are the enrichments when using the PASCAL based gene p-values directly.

References

[1] https://www.nature.com/articles/s41467-019-10649-4

[2] https://www.nature.com/articles/ng.3173

Clone this wiki locally