Single-cell and single-cell lineage tracing analysis in Python (soon a Nextflow pipeline).
The aim of this project is to provide a toolkit for the exploration of scRNA-seq data. These tools perform common single-cell analysis tasks (i.e., data pre-processing and integration, cell clustering and annotation (*)) with the following features:
- Multiple methods for each analytical step, to ensure that, for a given task, the best-performing method has a fair chance of being selected
- Evaluation steps, to benchmark method performance at a given task (N.B., without any ground-truth reference available)
- Fine control at minimal effort, thanks to flexible Command Line Interfaces (CLIs) through which users may perform either individual analytical tasks or a complete analysis of their data, with minimal (or no) coding required
- Focus on gene modules and signatures scoring
- Focus on classification models, used to prioritize "distinguishing features" (i.e., transcriptional features with high discriminative power in classification tasks defined over categorical cell metadata) and to validate clustering results (*)
- Utils for lentiviral-based single-cell lineage tracing (sclt) data analysis
- Scalability to thousands of cells
- Automatic handling of folder creation/removal: individual CLIs are designed to create and populate folders at a user-defined location, and to switch among analysis versions without the user having to handle Input/Output operations manually
- Graphical User Interfaces (GUIs), to visually explore results
For the time being, the main Cellula workflow implements the following tasks:
- (By sample) cell and gene Quality Control (QC), followed by expression matrix merging (`qc.py`), data pre-processing (`pp.py`) and batch effects assessment (`kBET.py`)
- (Optional, if needed) correction of batch effects (`integration.py`), followed by the assembly of the final pre-processed data (`integration_evaluation.py`)
- (Leiden) cell clustering at multiple, tunable resolutions, coupled to cluster markers computation (`clustering.py`)
- Clustering solutions evaluation and choice (`clustering_diagnostics.py`)
- Signatures (i.e., gene sets, either defined by the user or retrieved by data-driven approaches) scoring (`signatures.py`)
- Distinguishing features ranking, through Differential Expression (DE) and classification methods (`dist_features.py`)
- Interactive exploration of the results (`cellula_app.py`)
`Cellula` has been designed for command-line usage. However, individual functions and classes can also be imported by users looking for even more flexibility.
Complete documentation (with tutorials and so on) will be provided once the project reaches sufficient stability to be packaged and released (we will get there soon :)).
For now, the following serves as a simple quickstart, including instructions to install (*) and run `Cellula`. To get a better understanding of individual CLIs, modules, classes and functions, consult the CLIs' help messages, source code comments and docstrings, and the (temporary) documentation provided in this repo.
Even if `Cellula` cannot yet be installed from source, it is already possible to download its code and make it work on a local machine or an HPC cluster with a few simple commands.
To do that, first clone this repo locally:
git clone [email protected]:andrecossa5/Cellula.git
# or git clone https://github.com/andrecossa5/Cellula.git
Then, `cd` to `./Cellula/envs` and create the conda environment for your operating system (N.B.: Cellula has been tested only on Linux and macOS machines. In `Cellula/envs` you can find different OSX and Linux `.yml` files, storing recipes for both OS environments. `mamba` is used here for performance reasons, but `conda` works fine as well).
For a Linux machine:
cd ./Cellula/envs
mamba env create -f environment_Linux.yml -n cellula_example
After that, you have to link the cloned repository path to your newly created environment:
mamba activate cellula_example
mamba develop . # you have to be in the cloned repo path
That's it. To check that you are able to run Cellula's code, start the newly installed `python` interpreter

python

and run

import Cellula

If you do not see any errors, you are ready to go.
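As a quick programmatic sanity check, you can also test whether a module is findable on the current environment's path without importing it. This is a minimal sketch; the helper name `is_importable` is made up for illustration:

```python
import importlib.util

def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be found on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None

# After `mamba develop .`, is_importable("Cellula") should hold as well.
print(is_importable("os"))                    # True: stdlib is always findable
print(is_importable("surely_not_a_module"))   # False
```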
To begin a new single-cell analysis, `cd` to a location of choice on your machine and create a new folder. This folder will host all data and results of your project. We will refer to this main folder by its absolute path, and assign this path to a bash environment variable, `$path_main`.
cd <your_choice_here>
mkdir $main_folder_name
cd $main_folder_name
path_main=`pwd`/
Once in `$path_main`, we need to set up this folder for the analysis. At the bare minimum, the user needs to create two new folders in `$path_main`: `matrices` and `data`.
- `matrices` hosts all sample matrices for the project (i.e., CellRanger [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger] or STARsolo [https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md] outputs, including, for each sample, 3 files: barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz). If one has to deal with sclt data, each sample directory needs to store additional lentiviral-clone info (see below). In this repo, `test_data` contains a minimal example of a `matrices` folder, with data from two samples, a and b. Please follow the same directory structure to create your `matrices` folder with your data.
- `data` will host all the intermediate files from `Cellula` analysis steps. In the simplest case, one can just initialize this as an empty folder. However, one may want to include other project-specific data, e.g., a list of curated gene sets to score. In this repo, the `test_data` folder contains a simple example of how `data` needs to be structured in this case, with 6 manually curated gene sets stored in `data/curated_signatures` in `.txt` format. Please use the same folder structure with your data.
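The minimal layout described above can also be created programmatically. A small stdlib sketch (the sample names `a` and `b` mirror the demo data; adapt them to your own samples):

```python
from pathlib import Path

def make_skeleton(path_main: str, samples: list[str]) -> None:
    """Create the minimal $path_main layout described above:
    matrices/<sample>/filtered_gene_bc_matrix/ plus an empty data/ folder."""
    root = Path(path_main)
    (root / "data").mkdir(parents=True, exist_ok=True)
    for sample in samples:
        (root / "matrices" / sample / "filtered_gene_bc_matrix").mkdir(
            parents=True, exist_ok=True
        )
        # The three 10x files (barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz)
        # must then be copied/linked into each filtered_gene_bc_matrix folder.

# Example:
# make_skeleton("/abs/path/to/project", ["a", "b"])
```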
To set up your `$path_main` folder:
- `cd` to `$path_main`
- create and fill the `matrices` and `data` folders (for the following demo, just copy/link `test_data/matrices` and `test_data/data` to `$path_main`)
- `cd` to your Cellula repository clone, `cd` to the `scripts` folder, and run:
bash prepare_folder.sh $path_main
You should now see two new folders created at `$path_main`: `results_and_plots` and `runs`.
A properly configured `$path_main` folder for a Cellula analysis should look something like this (using `tree`):
├── data
│ ├── curated_signatures
│ │ ├── Inflammation.txt
│ │ ├── Invasion.txt
│ │ ├── Metastasis.txt
│ │ ├── Proliferation.txt
│ │ ├── Quiescence.txt
│ │ └── Stemness.txt
│ └── removed_cells
├── matrices
│ ├── a
│ │ └── filtered_gene_bc_matrix
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ └── b
│ └── filtered_gene_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── results_and_plots
│ ├── clustering
│ ├── dist_features
│ ├── pp
│ ├── signatures
│ └── vizualization
│ ├── QC
│ ├── clustering
│ ├── dist_features
│ ├── pp
│ └── signatures
└── runs
With $path_main correctly configured, we can proceed with the analysis.
We will first perform Quality Control and matrix pre-processing.
Note 1: In a single Cellula workflow (i.e., its CLI calls and results) one chooses a unique set of options for each task. These options will likely affect the final results. Therefore, one is commonly interested in varying them and comparing their results without losing previously computed analyses. To this end, all `Cellula` CLIs have a `--version` (or `-v`) argument to activate and write to a specific version folder. This way, a single place (i.e., the main folder) can store and organize all the results obtained on the same data with different, user-defined strategies. Run the same task changing `-v` and see how the `$path_main` folder structure is modified. We are currently implementing a new CLI to create a new branch starting from another, existing one (i.e., without having to re-run all the steps from `qc.py` on).
Note 2: One might want to inspect every output of a Cellula CLI before running the next one, or might want to run an entire analysis with the fewest CLI calls possible, inspecting results only at the end. In this quickstart, we propose a recipe for the second scenario, but human inspection is always encouraged (especially at this stage of the project).
For now, all CLIs must be called from the `Cellula/scripts` directory (i.e., one still has to `cd` to this folder to launch these scripts in a batch job on an HPC cluster).
To perform cell and gene QC and merge expression data, run:
python qc.py -p $path_main -v default --mode filtered --qc_mode seurat
Here we have specifically activated the 'default' version. You should see two newly created files in `data/default`: `QC.h5ad` and `cells_meta.csv`. At this point, you have two choices:
- Go on with pre-processing (as we will do in this demo), using the default cell metadata
- Format your cell metadata before pre-processing, adding/removing columns to `cells_meta.csv`. This is important in complex single-cell studies where cells/samples are grouped by a number of categorical covariates dependent on the study design. The newly formatted `cells_meta.csv` file will be read by `pp.py` afterwards if `--custom_meta` is specified
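As an example of the second option, custom columns can be appended to `cells_meta.csv` with any tabular tool. A minimal stdlib sketch (the helper name, the `treatment` column and its values are made up for illustration):

```python
import csv

def add_column(in_path: str, out_path: str, col_name: str, value_fn) -> None:
    """Read a cell metadata CSV, append one column computed per row,
    and write the result to a new file."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        fieldnames = list(reader.fieldnames) + [col_name]
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            row[col_name] = value_fn(row)
            writer.writerow(row)

# Hypothetical example: derive a 'treatment' covariate from the sample name
# add_column("cells_meta.csv", "cells_meta_new.csv", "treatment",
#            lambda row: "treated" if row["sample"] == "a" else "control")
```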
For this demo, we will go with the default cell metadata, and run:
python pp.py -p $path_main -v default --norm scanpy --n_HVGs 2000 --score scanpy --embs
After pre-processing, in this case we will skip the batch effects evaluation and data integration sections, as samples a and b come from the same experiment, lab and sequencing run (tutorials on how to handle more complicated situations leveraging `Cellula` functionalities in full will soon be available). Here, we will choose to retain the original 'PCA' embedding obtained by reducing (and scaling) the full gene expression matrix to the top 2000 highly variable genes (HVGs), a common choice in single-cell analysis (see `pp.py`, `kBET.py` and the integration scripts for further details and alternatives). This data representation will be used for kNN graph construction, multiple-resolution clustering and markers computation. All clustering solutions will then be evaluated for their quality. These three steps (i.e., choice of a cell representation, clustering and initial clustering diagnostics) can be performed by running:
python integration_diagnostics.py -p $path_main -v default --chosen scaled:original
python clustering.py -p $path_main -v default --range 0.2:1.0 --markers
python clustering_diagnostics.py -p $path_main -v default
The user can inspect the clustering and clustering visualization folders to visualize the properties of the "best" clustering solutions obtained, and then choose one to perform the last steps of the Cellula workflow. In this case we will select the 30_NN_30_0.29 solution.
python clustering_diagnostics.py -p $path_main -v default --chosen 30_NN_30_0.29
Lastly, we will retrieve and score potentially meaningful gene sets in our data, and we will search for features (i.e., single genes, Principal Components or Gene Set scores) able to distinguish groups of cells in our data. First, we will retrieve and score gene sets with
python signatures.py -p $path_main -v default --Hotspot --barkley --wu --curated --scoring scanpy
Then, we will look for distinguishing features. Specifically, here we will look for distinguishing features discriminating individual samples and Leiden clusters (chosen solution) with respect to all the other cells. We will make use of DE and classification models for both tasks. To do that, we need to pass a configuration file to `dist_features.py`, encoding all the info needed to retrieve cell groups and specifying the types of features and models one would like to use to rank distinguishing features.
For this demo, we will pass the example configuration file stored in `test_data/contrasts`, `sample_and_leiden.yml`. `dist_features.py` looks for .yml files in `$path_main/contrasts/`, so:
- Create a `contrasts` folder in `$path_main`
- Copy `test_data/contrasts/sample_and_leiden.yml` to `$path_main/contrasts/` and run
python dist_features.py -p $path_main -v default --contrasts sample_and_leiden
If you want to explore other distinguishing features, just create and pass your own file. Arbitrary analyses can be specified by changing the provided .yml file.
For example, consider the case in which one is interested in the distinguishing features between cells of clusters 0 and 1 from sample a and those from sample b, using both DE and classification, and all the available feature types. The related .yml file would look something like:
custom: # Contrast "family" name
<example_query>: # Name you would like to give to the new contrast
query:
a: leiden in ["0", "1"] & sample == "a" # Cell groups. <name> : string eval expression
b: leiden in ["0", "1"] & sample == "b"
methods: # Methods used
DE: wilcoxon # Uses only genes, by default
ML:
features: # Features to use as input of ML models
- genes # may be added, but requires >> time in full mode
- PCs
- signatures # (i.e., gene sets from signatures.py)
models: # Classifiers used
- logit
- xgboost
mode: fast # Training mode. For a full hyperparameters optimization, write 'full' here
Save your custom .yml files in `$path_main/contrasts/` and pass them to `dist_features.py` to run your analyses.
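The `query` strings in the example above are boolean expressions selecting groups of cells from the cell metadata. A pure-Python sketch of what such a contrast defines (the `leiden` and `sample` columns come from the demo; the hand-rolled filter is an illustration, not Cellula's actual query parser):

```python
def select_group(cells: list[dict], clusters: set, sample: str) -> list[dict]:
    """Select cells matching: leiden in <clusters> & sample == <sample>,
    mirroring the query strings in the .yml example."""
    return [c for c in cells if c["leiden"] in clusters and c["sample"] == sample]

cells = [
    {"barcode": "AAAC", "leiden": "0", "sample": "a"},
    {"barcode": "TTTG", "leiden": "1", "sample": "b"},
    {"barcode": "GGGA", "leiden": "2", "sample": "a"},
]
group_a = select_group(cells, {"0", "1"}, "a")  # cells compared as group 'a'
group_b = select_group(cells, {"0", "1"}, "b")  # cells compared as group 'b'
print(len(group_a), len(group_b))  # 1 1
```

Distinguishing features are then ranked by how well they separate the two selected groups.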
Lastly, `Cellula` comes with two `streamlit` GUIs to interactively explore its outputs. This may also be useful for non-computational users. Indeed, once `Cellula` has been set up (see the Installation paragraph) and run on some data (see the above demo), results can be shared and queried as follows (always in the same environment that you created during installation):
- `cd` to your Cellula repository clone, `cd` to the `scripts` folder, and run:
python prepare_archive.py -p $path_main -n <your-project-name-here>
This will generate a `<your-project-name-here>.tar.gz` file in `$path_main` containing all the info needed by the GUIs.
- Upload the `<your-project-name-here>.tar.gz` file to `<some-path-here>` and un-tar the archive:
tar -xf <your-project-name-here>.tar.gz
rm <your-project-name-here>.tar.gz
- In the same environment, `cd` to the locally cloned `Cellula` repo, `cd` to `apps` and launch one of the two GUIs by running:
streamlit run cellula_app.py <some-path-here>
A multi-page GUI will automatically start in your web browser.
This repository is organized as follows:
.
├── Cellula
├── apps
├── docs
├── envs
├── scripts
└── tests
- `envs` contains the .yml files of the conda environments needed for package setup.
- `docs` contains all documentation files.
- `tests` contains all package unit tests.
- `apps` contains the .py scripts that launch the `streamlit` GUIs.
- `scripts` contains all the CLIs that make up the Cellula workflow.
- `Cellula` contains all the modules needed by `scripts`.
- This is still a preliminary version of this project, undergoing major and minor refactoring.
- The Cellula.drawio.html sketch represents the data flow across Cellula CLIs, along with their dependencies.
- `tests`, `docs` and `setup.py` have yet to be fully implemented.