Source code (R statistical programming language, v3.6) to reproduce the results described in the article:
Avila Cobos F, Alquicira-Hernandez J, Powell JE, Mestdagh P and De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. (Nature Communications;
Here we provide an example folder (named "example"; see "Folder requirements & running the deconvolution") that can be directly used. It contains an artificial single-cell RNA-seq dataset made of 5 artificial cell types; 200 cells per cell type and 80 genes.
The other five external datasets (together with the necessary metadata) can be downloaded from their respective sources:
- Baron: (specifically, GSM2230757 to GSM2230760 for human pancreatic islets)
- GSE81547:
- E-MTAB-5061:
- PBMCs:
- kidney.HCL:
Regarding E-MTAB-5061: cells labelled "not_applicable", "unclassified" or "co-expression_cell" were excluded, and only cells from the six healthy (non-diabetic) patients were kept.
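The filtering described above could be sketched in R as follows. This is only an illustration on toy metadata: the column names "cellType" and "disease" and the value "normal" are assumptions, and the actual E-MTAB-5061 annotation may use different names.

```r
# Toy metadata; column names and the "normal" label are assumptions
# made for illustration, not the actual E-MTAB-5061 annotation.
pheno <- data.frame(
  cellID   = paste0("cell_", 1:6),
  cellType = c("alpha", "not_applicable", "beta", "unclassified",
               "co-expression_cell", "delta"),
  disease  = c("normal", "normal", "type II diabetes mellitus",
               "normal", "normal", "normal"),
  stringsAsFactors = FALSE
)

excluded_labels <- c("not_applicable", "unclassified", "co-expression_cell")

# Keep cells with an informative label that come from non-diabetic donors
keep <- !(pheno$cellType %in% excluded_labels) & pheno$disease == "normal"
pheno_filtered <- pheno[keep, ]

print(pheno_filtered$cellID)
```

The same logical vector can be used to subset the columns of the expression matrix so that matrix and metadata stay aligned.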
The following line is needed for fresh installations of Linux (Debian):
sudo apt-get install curl libcurl4-openssl-dev libssl-dev zlib1g-dev r-base-dev libxml2-dev
Run the following in R (>= 3.6.0) before running any deconvolution:
packages <- c("devtools", "BiocManager","data.table","ggplot2","tidyverse",
"foreach","doMC","doSNOW", #for parallelism
"Seurat","sctransform", #sc-specific normalization
"nnls","FARDEEP","MASS","glmnet","ComICS","dtangle") #bulk deconvolution methods
install.packages(packages) # install.packages accepts a character vector; "character.only" is not one of its arguments
# Installation using BiocManager:
# Some packages that didn't work with install.packages (e.g. may not be present in a CRAN repository chosen by the user)
packages3 = c('limma','edgeR','DESeq2','pcaMethods','BiocParallel','preprocessCore','scater','SingleCellExperiment','Linnorm','DeconRNASeq','multtest','GSEABase','annotate','genefilter','graph','MAST','Biobase') #last two are required by DWLS and MuSiC, respectively.
BiocManager::install(packages3) # accepts a character vector of package names
# Dependencies for CellMix: 'NMF', 'csSAM', 'GSEABase', 'annotate', 'genefilter', 'preprocessCore', 'limSolve', 'corpcor', 'graph', 'BiocInstaller'
packages2 = c('NMF','csSAM','limSolve','corpcor')
install.packages(packages2)
# Special instructions for CellMix and DSA
install.packages("BiocInstaller", repos="")
system("R CMD INSTALL CellMix_1.6.2.tar.gz")
system("R CMD INSTALL DSA_1.0.tar.gz")
# Following packages come from Github
devtools::install_github("GfellerLab/EPIC", build_vignettes=TRUE) #requires knitr
devtools::install_bitbucket("yuanlab/dwls", ref="default")
devtools::install_github("dviraran/[email protected]")
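Before reinstalling everything, it may help to check which packages are already available. The snippet below is a minimal sketch using base R only; "missing_packages" is a hypothetical helper (not part of this repository) and the package list is just a subset of the CRAN packages named above.

```r
# Hypothetical helper: return the packages in `pkgs` that cannot be loaded
# from the current library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

cran_pkgs <- c("data.table", "ggplot2", "nnls", "glmnet")
to_install <- missing_packages(cran_pkgs)

if (length(to_install) > 0) {
  message("Missing CRAN packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
} else {
  message("All listed CRAN packages are available.")
}
```

The same check can be applied to the Bioconductor list, passing the result to BiocManager::install instead.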
Users interested in the generation of pseudo-bulk mixtures from scRNA-seq data can use the "Generator" function that is located inside helper_functions.R
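To give an idea of what generating a pseudo-bulk mixture involves, here is a simplified sketch on simulated counts. It is not the repository's "Generator" function: "make_pseudobulk", its arguments, the random proportions and the sampling with replacement are all assumptions made for illustration.

```r
set.seed(1)

# Toy single-cell count matrix: 20 genes x 300 cells, three cell types
n_genes <- 20
cell_types <- rep(c("A", "B", "C"), each = 100)
counts <- matrix(rpois(n_genes * length(cell_types), lambda = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene_", seq_len(n_genes)),
                                 paste0("cell_", seq_along(cell_types))))

# Hypothetical helper: build one mixture by drawing `pool_size` cells
# according to random cell type proportions and summing their counts.
make_pseudobulk <- function(counts, cell_types, pool_size = 100) {
  types <- sort(unique(cell_types))
  props <- runif(length(types))
  props <- props / sum(props)
  # Number of cells drawn per cell type (sums to pool_size)
  n_per_type <- as.vector(rmultinom(1, size = pool_size, prob = props))
  names(n_per_type) <- types
  picked <- unlist(lapply(types, function(ct) {
    idx <- which(cell_types == ct)
    # Sampling with replacement is an assumption of this sketch
    idx[sample.int(length(idx), n_per_type[[ct]], replace = TRUE)]
  }))
  list(mixture = rowSums(counts[, picked, drop = FALSE]),
       proportions = n_per_type / pool_size)
}

res <- make_pseudobulk(counts, cell_types, pool_size = 100)
print(res$proportions)
```

The returned proportions are the ground truth against which a deconvolution result for this mixture can be compared.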
While our work has a BSD (3-clause) license, you may need to obtain a license to use the individual normalization/deconvolution methods (e.g. CIBERSORT). The source code for CIBERSORT must be requested from its authors at
a) Folder structure:
├── example
│ ├── example.rds
│ └── example_phenoData.txt
├── baron
│ ├── sc_baron.rds
│ └── baron_phenoData.txt
├── GSE81547
│ ├── sc_GSE81547.rds
│ └── GSE81547_phenoData.txt
├── helper_functions.R
├── Master_deconvolution.R
b) The metadata file must contain at least the following (tab-separated) columns: "cellID", "cellType", "sampleID". Other columns (e.g. "gender", "disease") are optional.
# For the baron dataset, it should look like:
cellID cellType sampleID
human1_lib3.final_cell_0178 delta human1
human1_lib2.final_cell_0498 delta human1
c) Each single-cell RNA-seq input ("sc_input") dataset is an integer matrix with gene names as row names and cellIDs as column names.
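Requirements a) to c) can be checked before launching a run. The following is a minimal sketch that builds a toy dataset in a temporary folder and validates it; "check_inputs" is a hypothetical helper and is not part of helper_functions.R.

```r
# Build a toy dataset in a temporary folder that mimics the expected layout
dataset <- "toy"
dir_path <- file.path(tempdir(), dataset)
dir.create(dir_path, showWarnings = FALSE)

counts <- matrix(c(3L, 0L, 1L, 7L, 2L, 5L), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("c1", "c2", "c3")))
rds_file <- file.path(dir_path, paste0(dataset, ".rds"))
saveRDS(counts, rds_file)

pheno <- data.frame(cellID = c("c1", "c2", "c3"),
                    cellType = c("alpha", "alpha", "delta"),
                    sampleID = c("s1", "s1", "s2"))
pheno_file <- file.path(dir_path, paste0(dataset, "_phenoData.txt"))
write.table(pheno, pheno_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Hypothetical checker: stops with an error if a requirement is not met
check_inputs <- function(rds_file, pheno_file) {
  sc_input <- readRDS(rds_file)
  pheno <- read.delim(pheno_file, stringsAsFactors = FALSE)
  required <- c("cellID", "cellType", "sampleID")
  stopifnot(
    is.matrix(sc_input), is.integer(sc_input),
    !is.null(rownames(sc_input)), !is.null(colnames(sc_input)),
    all(required %in% colnames(pheno)),
    all(colnames(sc_input) %in% pheno$cellID)
  )
  invisible(TRUE)
}

check_inputs(rds_file, pheno_file)
```

If any condition fails, stopifnot reports the first unmet requirement, which makes it easier to spot a missing column or a non-integer matrix.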
d) Make the following choices:
i) a specific dataset (from "example","baron","GSE81547","E-MTAB-5061","PBMCs")
ii) data transformation (from "none","log","sqrt","vst"); with "none" meaning linear scale
iii) type of deconvolution method (from "bulk","sc")
iii.1) For "bulk" methods:
iii.1.1) choose normalization method among: "column","row","mean","column_z-score","global_z-score","column_min-max","global_min-max","LogNormalize","QN","TMM","UQ", "median_ratios", "TPM"
iii.1.2) choose marker selection strategy among: "all", "pos_fc", "top_50p_logFC", "bottom_50p_logFC", "top_50p_AveExpr", "bottom_50p_AveExpr", "top_n2", "random5" (see main manuscript for more details).
iii.1.3) choose deconvolution method among: "CIBERSORT","DeconRNASeq","OLS","nnls","FARDEEP","RLR","DCQ","elastic_net","lasso","ridge","EPIC","DSA","ssKL","ssFrobenius","dtangle".
iii.2) For "sc" methods:
iii.2.1) choose normalization method for both the reference matrix (scC) and the pseudo-bulk matrix (scT) among: "column","row","mean","column_z-score","global_z-score","column_min-max","global_min-max","LogNormalize","QN","TMM","UQ", "median_ratios", "TPM", "SCTransform","scran","scater","Linnorm" (last 4 are single-cell-specific)
iii.2.2) choose deconvolution method among: "MuSiC","BisqueRNA","DWLS","deconvSeq","SCDC"
iv) Number of cells to be used to make the pseudo-bulk mixtures (multiple of 100)
v) Cell type to be removed from the reference matrix ("none" for the full matrix; this is dataset dependent: e.g. "alpha" from baron dataset)
vi) Number of available cores (1 by default; can be increased if more resources are available)
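The choices above are passed to Master_deconvolution.R as nine positional arguments. The sketch below shows one way such arguments could be mapped to named settings; "parse_deconv_args" and the field names are hypothetical, and the order of the two "sc" normalizations (reference matrix first, pseudo-bulk matrix second) is an assumption rather than something taken from the script.

```r
# Hypothetical parser for the nine positional arguments (illustrative names)
parse_deconv_args <- function(args) {
  stopifnot(length(args) == 9)
  common <- list(dataset = args[1],
                 transformation = args[2],
                 deconv_type = args[3])
  if (args[3] == "bulk") {
    specific <- list(normalization = args[4], marker_strategy = args[5])
  } else if (args[3] == "sc") {
    # Assumed order: reference matrix (scC), then pseudo-bulk matrix (scT)
    specific <- list(normalization_scC = args[4], normalization_scT = args[5])
  } else {
    stop("deconvolution type must be 'bulk' or 'sc'")
  }
  number_cells <- as.integer(args[7])
  stopifnot(number_cells %% 100 == 0)  # multiple of 100, as stated in iv)
  c(common, specific,
    list(method = args[6],
         number_cells = number_cells,
         to_remove = args[8],
         num_cores = as.integer(args[9])))
}

# Inside a script one would typically use commandArgs(trailingOnly = TRUE)
settings <- parse_deconv_args(c("example", "none", "bulk", "TMM", "all",
                                "nnls", "100", "none", "1"))
str(settings)
```

Naming the arguments early makes it harder to confuse, for instance, the marker selection strategy with the second normalization when switching between "bulk" and "sc" runs.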
# With the example we provided with this repository + no cell type removed:
Rscript Master_deconvolution.R example none bulk TMM all nnls 100 none 1
#Expected output:
# RMSE Pearson
#1 0.0351 0.9866
# With the example we provided with this repository + "cell_type_1" removed:
Rscript Master_deconvolution.R example none bulk TMM all nnls 100 cell_type_1 1
#Expected output:
# RMSE Pearson
#1 0.1038 0.9379
# With baron (or GSE81547, E-MTAB-5061, PBMCs) + no cell type removed:
Rscript Master_deconvolution.R baron none bulk TMM all nnls 100 none 1
#Expected output:
# RMSE Pearson
#1 0.0724 0.8961
# With baron + delta cells removed:
Rscript Master_deconvolution.R baron none bulk TMM all nnls 100 delta 1
#Expected output:
# RMSE Pearson
#1 0.0887 0.8197
# With the example we provided with this repository + no cell type removed:
Rscript Master_deconvolution.R example none sc TMM TMM MuSiC 100 none 1
#Expected output:
# RMSE Pearson
#1 0.0351 0.9866
# With the example we provided with this repository + "cell_type_1" removed:
Rscript Master_deconvolution.R example none sc TMM TMM MuSiC 100 cell_type_1 1
#Expected output:
# RMSE Pearson
#1 0.1044 0.9376
# With baron (or GSE81547, E-MTAB-5061, PBMCs) + no cell type removed:
Rscript Master_deconvolution.R baron none sc TMM TMM MuSiC 100 none 1
#Expected output:
# RMSE Pearson
#1 0.0488 0.953
# With baron + delta cells removed:
Rscript Master_deconvolution.R baron none sc TMM TMM MuSiC 100 delta 1
#Expected output:
# RMSE Pearson
#1 0.073 0.8799
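Each run reports an RMSE and a Pearson correlation. As a reminder of what these two metrics measure, here is a minimal sketch comparing expected and estimated cell type proportions; the numbers are made up, and how the script pools values across mixtures and cell types is not shown here.

```r
# Made-up proportions for five cell types in a single mixture
expected  <- c(0.10, 0.25, 0.40, 0.20, 0.05)
estimated <- c(0.12, 0.22, 0.43, 0.18, 0.05)

# Root-mean-square error: lower is better
rmse <- sqrt(mean((expected - estimated)^2))

# Pearson correlation: closer to 1 is better
pearson <- cor(expected, estimated, method = "pearson")

print(data.frame(RMSE = round(rmse, 4), Pearson = round(pearson, 4)))
```

Removing a cell type from the reference matrix is expected to raise the RMSE and lower the correlation, which matches the pattern in the outputs listed above.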
Please see "sessionInfo_Linux.txt" and "sessionInfo_macOS.txt" in this repository.