
Satellite Functions

Marie Oestreich edited this page May 31, 2022 · 5 revisions

For more information on each function's parameters, use R's help function: help("NameOfFunction") or ?NameOfFunction.

Data-Based Definition of Top Most Variant Genes

To determine the most variable genes used as input for the network calculations, suggest_topvar() identifies the inflection point in the curve of the log-transformed variances of the ranked genes. Calculating the inflection point is a fast way to obtain a first threshold for potentially interesting genes that describe differences in the data, while excluding non-variable genes and thereby substantially reducing the computation time of subsequent analyses.
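
The idea can be illustrated in base R. The following is only a minimal sketch and not the implementation used by suggest_topvar(): the helper names (ranked_log_variance, cubic_inflection), the toy count matrix and the choice to approximate the curve with a fitted cubic polynomial are all assumptions made for illustration.

```r
# Log-transformed variance of each gene, ranked from most to least variable.
# Genes with zero variance are dropped because log(0) is not finite.
ranked_log_variance <- function(mat) {
  v <- apply(mat, 1, var)
  v <- v[v > 0]
  log(sort(v, decreasing = TRUE))
}

# Approximate the curve with a cubic polynomial a + b*x + c*x^2 + d*x^3.
# Its second derivative is zero at x = -c / (3 * d), the inflection point.
cubic_inflection <- function(y) {
  x <- seq_along(y)
  fit <- lm(y ~ x + I(x^2) + I(x^3))
  b <- unname(coef(fit))   # b[3] is the quadratic, b[4] the cubic coefficient
  -b[3] / (3 * b[4])
}

# Hypothetical toy data: 300 genes (rows) measured in 8 samples (columns).
set.seed(1)
counts <- matrix(rexp(300 * 8), nrow = 300,
                 dimnames = list(paste0("gene", 1:300), paste0("s", 1:8)))

lv <- ranked_log_variance(counts)
infl <- cubic_inflection(lv)

# Turn the estimate into a valid rank and keep the genes above it.
cutoff <- min(max(round(infl), 1), length(lv))
top_genes <- names(lv)[seq_len(cutoff)]
```

On real data the ranked curve is rarely an exact cubic, so an estimate like this should be treated as a starting value to inspect visually rather than as a fixed threshold.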

Check Data Distribution

To provide a better intuition for the datasets and to detect possible outliers or prominent differences between datasets, the distribution of count values in each sample can be visualized for all datasets, either as boxplots or as frequency distributions, with plot_sample_distributions().

Principal Component Analysis

The function PCA() conducts a Principal Component Analysis (PCA) for each dataset and returns a plot of the samples in the space of the first two principal components (PCs). The PCA calculations can be based on three different types of data: i) all genes present in the dataset; ii) the top most variant genes as filtered for in the main script; iii) the genes present in the layer-specific networks built in the pre-integration phase. The data points can be coloured by any meta information present in the provided annotation files.
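
As a rough illustration of option ii), the following base R sketch runs a PCA on the most variable genes of a toy matrix. It uses stats::prcomp() and does not reproduce the interface of PCA(); the data, the number of retained genes and the variable names are hypothetical.

```r
# Hypothetical toy data: 100 genes (rows) in 10 samples (columns).
set.seed(42)
expr <- matrix(rnorm(100 * 10), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:10)))

# Keep the 20 genes with the highest variance across samples.
gene_var <- apply(expr, 1, var)
top <- names(sort(gene_var, decreasing = TRUE))[1:20]

# prcomp() expects samples in rows, so the matrix is transposed.
pca <- prcomp(t(expr[top, ]), center = TRUE, scale. = TRUE)

# Sample coordinates on the first two principal components.
scores <- pca$x[, 1:2]

# Fraction of the total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The scores matrix can then be passed to plot(), with point colours taken from any column of the sample annotation.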

Meta Data Plots

After choosing the sample grouping variable, or variable of interest, for the analysis (for example, condition, which takes the value case or control for each sample), it is often of interest how other meta information, e.g. sex or age, is distributed among those sample groups. The function meta_plot() displays this as a stacked bar plot for a categorical variable and as box plots for a numerical variable.

Cytoscape Export

Cytoscape is a software tool for visualizing large networks. Its layout algorithms allow for intuitive 2D visualizations that often outperform the layouts produced by algorithms accessible within R. The function export_to_cytoscape() exports the gene expression network created during the integration phase to Cytoscape, and import_layout_from_cytoscape() re-imports it after a user-defined layout algorithm has been applied. Note that a Cytoscape installation is required for this step and that the software must be running for the communication to succeed.

Correlate Numerical Meta Data with Clusters

meta_correlation_num() evaluates how numeric meta information, e.g. age or height, correlates with the expression patterns of the detected gene clusters to find potential drivers of the cluster formation. For each cluster, it calculates the Pearson correlation between the mean expression of its genes across all samples and the values of the numeric meta information across all samples. The result is a matrix with clusters as rows and meta categories, e.g. age and height, as columns. The cells hold the Pearson correlation coefficients and are shaded grey if the correlation is not significant with respect to a user-defined p-value threshold.
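
The calculation described above can be sketched in base R as follows. This is an illustration under assumed inputs, not the code of meta_correlation_num(): the expression matrix, the cluster assignment, the meta data and the threshold name p_threshold are all hypothetical.

```r
# Hypothetical inputs: expression (genes x samples), gene-to-cluster
# assignment, and numeric meta data (samples x variables).
expr <- rbind(g1 = c(1, 2, 3, 5), g2 = c(2, 3, 4, 5), g3 = c(4, 3, 2, 1))
colnames(expr) <- paste0("s", 1:4)
clusters <- c(g1 = "red", g2 = "red", g3 = "blue")
meta <- data.frame(age = c(21, 30, 42, 50), height = c(180, 160, 175, 165),
                   row.names = colnames(expr))

# Mean expression of each cluster's genes per sample (clusters x samples).
cluster_means <- t(sapply(split(names(clusters), clusters),
                          function(genes) colMeans(expr[genes, , drop = FALSE])))

# Pearson correlation of every cluster profile with every meta variable.
cor_mat <- cor(t(cluster_means), meta, method = "pearson")

# Matching p-values, used to flag non-significant cells.
p_mat <- sapply(colnames(meta), function(v)
  apply(cluster_means, 1, function(m) cor.test(m, meta[[v]])$p.value))

p_threshold <- 0.05                  # assumed user-defined threshold
significant <- p_mat < p_threshold
```

cor_mat has clusters as rows and meta variables as columns; cells where significant is FALSE correspond to the entries that would be shaded grey.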

Correlate Categorical Meta Data with Clusters

meta_correlation_cat() is similar to the numerical case, but the calculation differs due to the non-numerical nature of the data. Again, the output is a matrix whose cells hold Pearson correlation coefficients. However, the correlations are calculated between i) the mean expression value of each cluster's genes in the sample groups defined by the variable of interest and ii) the fraction of samples in each variable-of-interest group that have a particular value of a categorical variable. For example, if the variable of interest has the values case and control and the categorical variable is sex with the values female and male, the function determines the fraction of females among all cases and among all controls and does the same for males. In this example, the columns of the final matrix are female and male and the rows are clusters.
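
A minimal base R sketch of this calculation is shown below. It is not the code of meta_correlation_cat(); the variable names and the toy data are hypothetical, and three groups are used because with only two groups any Pearson correlation is necessarily +1 or -1 (or undefined).

```r
# Hypothetical annotation: variable of interest and a categorical variable.
voi <- rep(c("ctrl", "mild", "severe"), each = 3)
sex <- c("f", "f", "f",  "f", "f", "m",  "f", "m", "m")

# Hypothetical mean expression of two clusters per sample (clusters x samples).
cluster_means <- rbind(red  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                       blue = c(2, 2, 2, 1, 1, 1, 2, 2, 2))

# i) Mean expression per cluster in each group (groups x clusters).
group_means <- apply(cluster_means, 1, function(m) tapply(m, voi, mean))

# ii) Fraction of samples per group with each category value
#     (groups x categories); each row sums to 1.
fractions <- unclass(prop.table(table(voi, sex), margin = 1))

# Pearson correlation across groups: clusters as rows, categories as columns.
cor_mat <- cor(group_means, fractions)
```

In this toy example the red cluster increases from ctrl to severe together with the fraction of male samples, so it correlates positively with m and negatively with f.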

Calculate Cluster Scores

The cluster scores quantify how distinct the found clusters are based on the network structure. get_cluster_scores() assigns a score close to 1 to very isolated clusters whereas clusters that are highly connected to others and therefore do not form a distinct structural component in the network are given a value closer to 0.

Hub Gene Detection

Hub genes are genes that are considered to play a central role based on their location in the network topology. Here, they are determined by find_hubs() using a combined ranking based on weighted degree centrality, weighted closeness centrality and weighted betweenness centrality. These are modified versions of the measures introduced in this paper by additionally incorporating the edge weights into the computation. The user can set the clusters for which to find hub genes and the maximum number of hub genes to be returned per cluster. A table of the hub genes in each cluster is exported to an Excel file, and the scaled expression values of each hub gene across the groups of the variable of interest are plotted as a heatmap.

Colour Single Cluster or Specific Gene Set

This option allows for the selective colouring of clusters (colour_single_cluster()) or provided lists of genes (highlight_geneset()) in the integrated gene expression network.

Evaluating Different Community Detection Algorithms

The post-integration phase of the main analysis offers a series of different community detection algorithms to determine clusters of strongly co-expressed genes based on the network topology. If the user is not certain which algorithm to use, these functions allow for the comparison of the one chosen in the main analysis (by default the Leiden algorithm) with the other ones offered by the tool. To this end, algo_alluvial() generates alluvial plots that depict how the genes are distributed across the clusters of the new algorithm compared to the reference. Combined with information from the coloured integrated network, this illustrates which clusters might merge or split and whether that is reasonable from a structural perspective. Further, for every other algorithm, PCA_algo_compare() calculates a PCA based on the samples' expression across clusters as well as a PCA based simply on the top most variant genes, independent of clustering. If the clustering algorithm captures the underlying structure of the data with the clusters it detected, then samples that are placed in close proximity in the PCA based on the most variant genes are expected to also be placed in close proximity in the clustering-based PCA. Finally, the clustering algorithm can be changed using update_clustering_algorithm().

Export Clustering and Import Clustering

These functions allow you to export (export_clusters()) the current clustering model or import (import_clusters()) a model from a previous analysis. This makes it easy to share the model but also to use a model from another dataset and study its behaviour in the context of different data.

Network Comparisons

Option 1

Two functions are available to compare two different co-expression networks. network_comparison_1() accepts the exported 'gtc' files (created by export_clusters()) of two networks and compares their clusters. These 'gtc' files can also be created manually if you have gene clusters from an analysis other than hCoCena: they simply need to be tab-separated text files with two columns each, the first containing gene symbols and the second containing, for each gene symbol, the cluster it belongs to (this can be a colour as in hCoCena, but custom naming schemes are possible). The function then calculates the Jaccard index between all pairs of clusters of the two networks and visualizes the result. The Jaccard index is calculated as follows: (number of genes in the intersection of clusters A and B)/(number of genes in the union of clusters A and B). The constructed figure looks like this:
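
The pairwise Jaccard index can be reproduced in a few lines of base R. The sketch below is an illustration of the formula only, not the code of network_comparison_1(); the two gene-to-cluster tables, their column names and the helper jaccard() are hypothetical.

```r
# Two hypothetical gene-to-cluster tables in the two-column layout
# described above (gene symbol, cluster name).
gtc_a <- data.frame(gene = c("A", "B", "C", "D", "E"),
                    cluster = c("red", "red", "red", "blue", "blue"),
                    stringsAsFactors = FALSE)
gtc_b <- data.frame(gene = c("A", "B", "D", "E", "F"),
                    cluster = c("c1", "c1", "c2", "c2", "c2"),
                    stringsAsFactors = FALSE)

# Jaccard index: size of the intersection divided by size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# One gene set per cluster in each network.
sets_a <- split(gtc_a$gene, gtc_a$cluster)
sets_b <- split(gtc_b$gene, gtc_b$cluster)

# Matrix with the clusters of network A as rows and those of B as columns.
jaccard_mat <- sapply(sets_b, function(b)
  sapply(sets_a, function(a) jaccard(a, b)))
```

Here the cluster red shares two of three genes with c1 and blue shares two of three genes with c2, so both pairs obtain an index of 2/3, while the remaining pairs have no genes in common.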

Option 2

The second function, network_comparison_2(), accepts two igraph objects of the networks and, as a third parameter, a vector of gene names. It then retrieves the degree of each gene in both networks and calculates the Jaccard index of the gene's neighbourhoods in the two networks. The function then generates a plot with the log2(degree) in the two networks on the x- and y-axis. Genes are represented as dots, and the size and colour of each dot indicate the Jaccard index of the gene's neighbourhoods. The diagonal line represents degree equality; thus, if a gene lies above or below that line, it has a higher degree in one of the networks. The generated plot looks like this:
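
The per-gene quantities behind this plot can be sketched without igraph by working on plain edge lists. The following base R example is an illustration under that simplifying assumption and does not reflect the interface of network_comparison_2(); the edge lists and the helpers neighbours() and compare_gene() are hypothetical.

```r
# Hypothetical undirected networks given as two-column edge lists.
edges_1 <- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("B", "C"))
edges_2 <- rbind(c("A", "B"), c("A", "C"), c("B", "D"))

# Neighbours of a gene: every node that shares an edge with it.
neighbours <- function(edges, gene) {
  unique(c(edges[edges[, 1] == gene, 2], edges[edges[, 2] == gene, 1]))
}

# Degree in both networks and Jaccard index of the two neighbourhoods.
compare_gene <- function(gene) {
  n1 <- neighbours(edges_1, gene)
  n2 <- neighbours(edges_2, gene)
  union_size <- length(union(n1, n2))
  data.frame(gene = gene,
             degree_1 = length(n1),
             degree_2 = length(n2),
             jaccard = if (union_size == 0) NA_real_
                       else length(intersect(n1, n2)) / union_size,
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(c("A", "B", "D"), compare_gene))

# Axis values of the scatter plot described above.
result$log2_degree_1 <- log2(result$degree_1)
result$log2_degree_2 <- log2(result$degree_2)
```

Gene A has degree 3 in the first and degree 2 in the second network, so it would lie off the diagonal, and two of its three distinct neighbours are shared, giving a neighbourhood Jaccard index of 2/3.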

Regrouping Samples

The variable of interest is usually chosen under the assumption that it is the driving force of the signatures found in the data. For example, if a study has been designed such that one sample group is exposed to a stimulus whereas the other is not, the stimulus would be the expected driving force of the gene expression changes. However, especially in cases that are more multi-dimensional or when knowledge about the studied condition is still limited, there may be unknown underlying factors that dominate the observed signature. In such scenarios, the samples can be regrouped in a data-driven way and assigned labels for which no prior knowledge exists. The regrouping can be performed using cut_hclust() on different resolutions of the underlying data: based on i) the expression of all present genes, ii) the genes present in the co-expression network or iii) the samples’ mean expression patterns across modules.
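
Data-driven regrouping of this kind can be illustrated with hierarchical clustering in base R. The sketch below uses stats::hclust() and stats::cutree() on a toy matrix; it does not show the parameters of cut_hclust(), and the data, the distance measure, the linkage method and the number of groups are assumptions.

```r
# Hypothetical toy data: 50 genes in 8 samples, where the last four
# samples have a shifted expression level.
set.seed(7)
expr <- cbind(matrix(rnorm(50 * 4, mean = 0), nrow = 50),
              matrix(rnorm(50 * 4, mean = 5), nrow = 50))
colnames(expr) <- paste0("sample", 1:8)

# Euclidean distances between samples (samples must be in rows for dist()).
d <- dist(t(expr))

# Build the dendrogram and cut it into two data-driven groups.
tree <- hclust(d, method = "complete")
new_groups <- cutree(tree, k = 2)
```

new_groups is a named integer vector that assigns each sample to a group label without using any prior annotation; the same steps apply when the input matrix holds only network genes or per-module mean expression values.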

Module Analysis and Meta Annotation

user_specific_cluster_profiling() allows for further functional annotation of the co-expression clusters. In the module analysis section, the user has two options: either provide a file with custom gene sets for which the clusters will be scanned, or choose a database, e.g. Gene Ontology, and provide a keyword by which the enriched terms are filtered, with the results presented per cluster. Besides annotating the clusters as just described, there is also the option to use col_anno_categorical() and col_anno_numerical() to add metadata annotation with respect to the sample groups as they are defined by the variable of interest. Categorical or numerical data from the annotation file(s) can be used and will be depicted as bar or line plots underneath the module heatmap.

Change Grouping Parameter

With change_grouping_parameter() you can change the variable by which the samples are grouped and based on which the GFCs are calculated. This is particularly useful in cases where different variables are potential candidates for driving the genes’ expression changes in the data and an explorative approach is required to decide on the most suitable. Note that any previously generated column annotation will not be plotted, since the grouping will change. If you eventually decide on another grouping variable, please run the analysis again entirely with the changed variable-of-interest from the very beginning.

Plot Network Coloured by GFC

For visualization of the GFC for every gene under the different observed groups, the network can be additionally replotted once for every group, with nodes being coloured according to their GFC value using plot_GFC_network(). This provides a more detailed resolution of the information acquired from the module heatmap.

Transcription Factor Query

check_tf() leverages the information collected during the transcription factor enrichment analysis with TF_overrep() and TF_enrich_all() in the main markdown to allow the user to query specific transcription factors of interest and see how their top targets are spread across modules. The goal is to uncover potential co-regulation between clusters.

Write Session Info

The parameters of the analysis session are written to a text file to enhance reproducibility without keeping a markdown for every analysis. write_session_info() documents the names and locations of the files used as count and annotation files, the global settings set in the session, the layer settings set for each dataset, as well as the cut-offs and the clustering algorithm used.