Infusing structural assumptions into dimension reduction for single-cell RNA sequencing data to identify small gene sets

Overview

This GitHub repository contains the code and scripts to define and train a boosting autoencoder (BAE) and to reproduce the results presented in our manuscript.

What is it about?

Dimension reduction approaches are widely used for exploring cellular heterogeneity in single-cell RNA sequencing (scRNA-seq) data, e.g., for identifying two-dimensional visual representations in which cell groups can be disentangled, followed by post-hoc analyses. While most approaches are purely data-driven or challenging to interpret, it can be useful to incorporate assumptions that reflect intuition about the underlying structure or the experimental design directly into the dimension reduction. For example, dimensions that help to distinguish between cell groups should intuitively be characterized by distinct small sets of genes, and the design of a time-series experiment should be incorporated such that temporal changes of cell states are characterized by gradual changes in the corresponding gene sets.
We combine the advantages of two machine learning approaches, namely autoencoders for dimension reduction via deep learning and boosting for formalizing assumptions. Specifically, we use a componentwise boosting approach, which selects small sets of characteristic genes for each dimension, and allows for tailoring the selection logic to encode further assumptions, such as distinct cell groups or temporal patterns. Our approach facilitates interpretability by selecting different small sets of genes during optimization, where the gene sets explain the learned patterns in latent dimensions.
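
To give a feel for the selection step, the following is a minimal, generic sketch of componentwise L2 boosting in plain Python. It is not the BAE implementation from src, which is written in Julia and couples boosting with an autoencoder. The function name componentwise_boost, the toy matrix X, and the response y are made up for illustration. In each step, only the single feature (gene) that best fits the current residual receives a small coefficient update, which is why only a few genes end up with nonzero weights.

```python
def componentwise_boost(X, y, steps=10, step_size=0.1):
    """Generic componentwise L2 boosting (illustrative, not the BAE code).

    X is a list of rows (cells x genes), y is a list of responses.
    Returns one coefficient per gene; most stay exactly zero.
    """
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    beta = [0.0] * p
    for _ in range(steps):
        # Residual of the current linear fit.
        residual = [
            y[i] - sum(beta[j] * X[i][j] for j in range(p)) for i in range(n)
        ]
        best_j, best_gain, best_coef = None, 0.0, 0.0
        for j, col in enumerate(columns):
            norm = sum(v * v for v in col)
            if norm == 0.0:
                continue  # skip genes without any signal
            dot = sum(v * r for v, r in zip(col, residual))
            gain = dot * dot / norm  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_gain, best_coef = j, gain, dot / norm
        if best_j is None:
            break  # residual is orthogonal to every gene
        # Update only the selected gene, damped by the step size.
        beta[best_j] += step_size * best_coef
    return beta


# Toy data (hypothetical): 6 cells, 4 genes; the response depends on gene 2 only.
X = [
    [1.0, 0.5, -1.0, 0.2],
    [-1.0, 0.3, 1.0, -0.4],
    [0.5, -0.8, 2.0, 0.1],
    [-0.5, 0.2, -2.0, 0.3],
    [1.5, -0.1, 0.5, -0.2],
    [-1.5, -0.1, -0.5, 0.0],
]
y = [3.0 * row[2] for row in X]

beta = componentwise_boost(X, y, steps=20)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected genes:", selected)
print("coefficients:", beta)
```

On this toy example, the procedure selects only the gene that actually drives the response, and its coefficient approaches the true value gradually because of the damped updates.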

We illustrate the approach on an scRNA-seq dataset of cortical neurons, where it captures different cell types in distinct dimensions and identifies corresponding marker genes. In particular, it also captures very small cell groups. Similarly, encoding assumptions that reflect the experimental design allowed us to extract temporal development patterns and corresponding gene programs in an application to time-series data. These examples demonstrate the general benefit of incorporating structural knowledge into dimension reduction for scRNA-seq data.

Repository structure

The scripts subfolder consists of scripts for:

1. Preprocessing: Julia and Python scripts for downloading and preprocessing the scRNA-seq datasets.
2. Simulation: Julia scripts for generating two different scRNA-seq-like datasets.
3. BAE application: Julia scripts for the BAE and timeBAE application to the preprocessed simulated and real-world scRNA-seq datasets.

The tutorials subfolder contains a Julia Jupyter notebook illustrating the functionality of the BAE on simulated scRNA-seq data.

The src subfolder contains the Julia source code files for the BAE approach.

All plots and data downloaded or generated while running the scripts are stored in the subfolder figures or data, respectively.

Installation

Git should be installed on your computer. You can download and install it from Git's official website.

0.1. Open your terminal

On macOS or Linux, open the Terminal application.
On Windows, you can use Command Prompt, PowerShell, or Git Bash.

0.2. Navigate to your desired directory

Use the cd command to change to the directory where you want to clone the repository.
Example (macOS): To change to a directory named MyProjects on your desktop, you would use:
```
cd ~/Desktop/MyProjects
```
Example (Windows): To change to a directory named MyProjects on your desktop, you would use:
```
cd C:\Users\[YourUsername]\Desktop\MyProjects
```

0.3. Clone the repository

Use the git clone command followed by the URL of the repository.
You can find the URL on the repository's GitHub page.

Example:

git clone https://github.com/NiklasBrunn/BoostingAutoencoder.git

Install Julia
- To run the Julia scripts, download and install Julia v1.6.7 manually. The required packages and their versions are specified in the Project.toml and Manifest.toml files in the main folder and are installed and loaded automatically at the beginning of each script via the Pkg.activate() and Pkg.instantiate() commands. See the Julia documentation on environments for more information.
Install Python
- To run the Python scripts, we provide a conda environment file (environment.yml) that specifies the Python version and the required packages. A new conda environment can be created from this file; see the conda documentation on managing environments for details. Follow these steps to set up the environment:

2.1. Navigate to the project directory

Navigate to the directory of the cloned GitHub repository (macOS):
```
cd ~/BoostingAutoencoder
```
(Windows):
```
cd \BoostingAutoencoder
```

2.2. Create the conda environment

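Create a new conda environment from the environment.yml file. The environment name is taken from that file (here, BAE-env). File names are case-sensitive on Linux, so if the file in your checkout is named Environment.yml, pass that name instead: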
```
conda env create -f environment.yml
```

2.3. Use the BAE conda environment for running Python code

Once the environment is created, activate it with conda activate BAE-env or select it as the kernel for running the Python code in the repository.

Instructions for running scripts

Simulated scRNA-seq data
- To run the BAE and timeBAE analyses on the simulated scRNA-seq data, you can directly run the files main_sim10stagesScRNAseq.jl, modelcomparison_sim10stagesScRNAseq.jl, and main_sim3cellgroups3stagessScRNAseq.jl.
Cortical mouse scRNA-seq data
- To run the BAE analysis on the cortical mouse data from Tasic et al., first run the script get_corticalMouseData.jl, followed by preprocess_corticalMouseData.py, to download and preprocess the data. Subsequently, the analysis can be performed by running the scripts main_corticalMouseData.jl and subgroupAnalysis_corticalMouseData.jl.
Embryoid body scRNA-seq data
- To run the timeBAE analysis on the embryoid body data from Moon et al., first run the preprocessing scripts get_and_preprocess_embryoidBodyData.py and cluster_and_filter_embryoidBodyData.py, followed by main_embryoidBodyData.jl to generate the results.
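
The run order described above can be captured in a small driver script. The following is a sketch and not part of the repository. It assumes that the scripts sit directly in the scripts subfolder (the actual layout may use further subfolders), that julia and python are on the PATH, that the BAE-env conda environment is active for the Python steps, and that the script is started from the repository root. The names PIPELINES, build_command, and run_pipeline are hypothetical. By default, it only prints the commands instead of executing them.

```python
import subprocess
from pathlib import Path

# Assumption: adjust this to the actual location of the scripts.
SCRIPTS_DIR = Path("scripts")

# Script order per dataset, as listed in the instructions above.
PIPELINES = {
    "simulated": [
        "main_sim10stagesScRNAseq.jl",
        "modelcomparison_sim10stagesScRNAseq.jl",
        "main_sim3cellgroups3stagessScRNAseq.jl",
    ],
    "cortical_mouse": [
        "get_corticalMouseData.jl",
        "preprocess_corticalMouseData.py",
        "main_corticalMouseData.jl",
        "subgroupAnalysis_corticalMouseData.jl",
    ],
    "embryoid_body": [
        "get_and_preprocess_embryoidBodyData.py",
        "cluster_and_filter_embryoidBodyData.py",
        "main_embryoidBodyData.jl",
    ],
}


def build_command(script):
    """Pick the interpreter from the file extension."""
    path = str(SCRIPTS_DIR / script)
    if script.endswith(".jl"):
        return ["julia", path]
    return ["python", path]


def run_pipeline(name, dry_run=True):
    """Print (and optionally execute) the scripts of one dataset in order."""
    commands = [build_command(script) for script in PIPELINES[name]]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # Stop at the first failing script; later steps need its output.
            subprocess.run(command, check=True)
    return commands


commands = run_pipeline("cortical_mouse")
```

Calling run_pipeline("cortical_mouse", dry_run=False) would execute the four steps one after another. Keeping the dry run as the default makes it easy to check the paths before anything is downloaded or computed.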
