-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathscfind.Rmd
325 lines (233 loc) · 14.3 KB
/
scfind.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
---
title: "`scfind` package vignette"
author: "Vladimir Kiselev and Jimmy Tsz Hang Lee"
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
toc: true
vignette: >
%\VignetteIndexEntry{`scfind` package vignette}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE}
library("knitr")
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
op <- options(gvis.plot.tag='chart')
```
# Introduction
The number of cell atlases is growing rapidly, as a result of advances in single-cell sequencing techniques and cheaper sequencing cost. To search through these large single-cell datasets for analysis, however, is time-consuming and inefficient because these types of dataset usually take up large amounts of memory. `scfind` has adopted an efficient compression strategy which makes it suitable for real-time queries of millions of cells.
`scfind` is a method for searching specific cell types from large single-cell datasets by a query of gene list, in which `scfind` can suggest subquries score by TF-IDF method. `scfind` can perform hypergeometric test which allows the evaluation of marker genes specific to each cell type within a dataset. A manuscript describing `scfind` in details is available in [bioRxiv](https://doi.org/10.1101/788596)
# `SingleCellExperiment` class
If you already have an `SingleCellExperiment` object, then proceed to the next chapter.
`scfind` is built on top of the Bioconductor’s [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment) class. `scfind` operates on objects of class `SingleCellExperiment` and writes all of its results back to an object. Please read corresponding vignettes on how to create a `SingleCellExperiment` from your own data. For illustrative purposes, a list of `SingleCellExperiment` objects of the `Tabula Muris (FACS)` and the `Tabula Muris (10X)` datasets are provided. The investigators ([The Tabula Muris Consortium](https://doi.org/10.1038/s41586-018-0590-4)) have profiled almost every cell-type in the mouse using high-coverage FACS-sorted cells + Smartseq2. in the original publication. We will combine the indices of 18 tissues into a super-index.
```{r}
library("scfind")
suppressPackageStartupMessages(library("SingleCellExperiment"))
# List of `Tabula Muris (FACS)` `SingleCellExperiment` objects
data(tmfacs)
# List of `Tabula Muris (10X)` `SingleCellExperiment` objects
data(tm10x)
# Download and load `SingleCellExperiment` object of the `Heart` dataset
sce.heart <- readRDS(url(tmfacs["Heart"]))
sce.heart
colData(sce.heart)
```
# `scfind` Input
## Index
Once we have a `SingleCellExperiment` object, we can run `scfind`. Firstly, we need to build the `scfind` index from our input dataset.
By default `scfind` uses the `cell_type1` column of the `colData` slot in the reference to identify cell type names.
```{r}
scfind.index <- buildCellTypeIndex(sce = sce.heart,
cell.type.label = "cell_type1", # If you have your own clustering result or manually defined groups, you may change this parameters with the corresponding column name of the `colData`
dataset.name = "Heart",
assay.name = "counts")
# if the dataset contains more than one category, users can build index for each tissue individually and merge all indices into 1 super-index using `mergeDataset` as following:
sce.thymus <- readRDS(url(tmfacs["Thymus"]))
scfind.index.new <- buildCellTypeIndex(sce = sce.thymus,
cell.type.label = "cell_type1", # If you have your own clustering result or manually defined groups, you may change this parameters with the corresponding column name of the `colData`
dataset.name = "Thymus",
assay.name = "counts")
merged.index <- mergeDataset(scfind.index, scfind.index.new)
merged.index
```
```{r eval=FALSE, include=FALSE}
# The `scfind` index can be saved as a RDS object
saveObject(object = merged.index,
file = "scfindIndex.rds")
```
# Retrieve quantized expression matrix
```{r}
merged.index@index$getCellTypeExpression("Thymus.T cell")[1:5, 1:5]
```
# Cell Type Search
Once the `scfind` index is built, one can view all existing genes in the datasets using `scfindGenes`
```{r}
sample(scfindGenes(merged.index), 20)
```
or view all existing cell type names in the database using `cellTypeNames`
```{r}
cellTypeNames(merged.index)
# To specify the dataset
cellTypeNames(merged.index, "Thymus")
```
## Evaluate Marker Genes
Once we have a `scfind` index, we can find the cell types that most likely represent your gene set from your dataset very quickly:
For illustrative purposes, we will use a preprocessed scfind index of the `Tabula Muris (FACS)`.
```{r}
# `scfind` indexes of the `Tabula Muris (FACS)` & `Tabula Muris (10X)` datasets
data(ExampleIndex)
geneIndex <- loadObject(file = url(ExampleIndex["TabulaMurisFACS"]))
# try `geneIndex <- loadObject(file = url(ExampleIndex["TabulaMuris10X"]))` for another Tabula Muris Dataset
geneIndex@datasets
```
`scfind` is a search engine for single cell datasets. The central operation carried out by `scfind` is to identify the set of cells that express a set of genes or peaks (i.e. the query) specified by the user.
```{r}
query <- c("Il2ra", "Ptprc", "Il7r", "Ctla4")
# The p-values is calculated by hypergeometric test
hyperQueryCellTypes(object = geneIndex,
gene.list = query)
# Use the `datasets` argument to specify the datasets
result <- hyperQueryCellTypes(object = geneIndex,
gene.list = query,
datasets = c("Marrow", "Thymus", "Fat"))
# The calculation shows that the query gene set is specific for the `Marrow.T cell` cell type with the lowest p-value.
barplot(-log10(result$pval), ylab = "-log10(pval)", las = 2, names.arg = result$cell_type)
```
## Query optimisation routine for long query
To allow search of enriched cell type from a long list of gene query, `scfind` features a query optimization routin. First, the function `markerGenes` will counter suggest subqueries with the highest support in the dataset. The TF-IDF score for each gene set allows user to identify the best subquery for finding the most relevant cell type. We will use the marker genes identified in an original publication Yanbin et al. 2015. Cardiomyocyte-specific markers used in immunostaining.
```{r}
long.query <- c("Mef2c", "Gata4", "Nkx2.5", "Myh6", "tnnt2", "tnni3", "CDH2", "Cx43", "GJA1")
# In which, `scfind` will suggest subqueries for the initial gene sets along with number of cells and cell types that express the genes
result <- markerGenes(object = geneIndex,
gene.list = long.query)
# Showing first 10 rows of the subqueries
head(result, 10)
subQueriesTfidf <- setNames(result$tfidf, gsub(",", "&", result$Query))
UpSetR::upset(UpSetR::fromExpression(subQueriesTfidf), mainbar.y.label = "TFIDF score", order.by = "freq", show.numbers = FALSE)
subQueriesCells <- setNames(result$Cells, gsub(",", "&", result$Query))
UpSetR::upset(UpSetR::fromExpression(subQueriesCells), mainbar.y.label = "No. of cells", order.by = "freq", show.numbers = FALSE)
```
By ranking the tfidf score, the best subquery can be used to search for cell types that is enriched.
```{r}
bestQuery <- strsplit(as.character(result[which.max(result$tfidf),'Query']), ",")[[1]]
bestQuery
# The calculation above shows that a list of genes containing `Myh6` and `Tnni3` is specific for the `Heart.cardiac muscle cell` cell type with the lowest p-value.
enrichedCellTypes <- hyperQueryCellTypes(object = geneIndex,
gene.list = bestQuery)
# Showing first 10 rows of the enriched cell types
head(enrichedCellTypes, 10)
```
To further evaluate a specific query by calculating the precision recall metrics
```{r}
evaluateGenes <- evaluateMarkers(object = geneIndex,
gene.list = bestQuery,
cell.types = "Heart.cardiac muscle cell")
evaluateGenes
ggplot2::qplot(precision, recall, data = evaluateGenes,colour = f1, label = genes) +
ggplot2::geom_text(vjust = 0, nudge_y = 5e-3, colour = "black", check_overlap = T)
```
# Marker Gene Search
If one is more interested in finding out which marker genes best represent a cell type in the dataset, `cellTypeMarkers` function should be used for searching the index:
```{r}
interestedCellType <- c("Kidney.leukocyte")
findMarkers <- cellTypeMarkers(object = geneIndex,
cell.types = interestedCellType)
ggplot2::qplot(precision, recall, data = findMarkers,colour = f1, label = genes) +
ggplot2::geom_text(vjust = 0, nudge_y = 5e-3, colour = "black", check_overlap = T)
# You can also specify the background cell types
findMarkers <- cellTypeMarkers(object = geneIndex,
cell.types = interestedCellType,
background.cell.types = cellTypeNames(object = geneIndex,
datasets = "Kidney"))
ggplot2::qplot(precision, recall, data = findMarkers,colour = f1, label = genes) +
ggplot2::geom_text(vjust = 0, nudge_y = 5e-3, colour = "black", check_overlap = T)
```
# *In silico* cell sorting with logical operators
`scfind` keeps track on individual cells. This allows cell sorting by adding the logical operators ("*" as OR, "-" as NOT) in front of each gene name.
```{r}
## To find cells that has the expression pattern of NOT expressing the gene `Il2ra` and OR expressing the gene `Ptprc` in the Thymus dataset
findCellTypes(object = geneIndex,
gene.list = c("-Il2ra", "*Ptprc", "Il7r"),
datasets = "Thymus")
hyperQueryCellTypes(object = geneIndex,
gene.list = c("-Il2ra", "*Ptprc", "Il7r"),
datasets = "Thymus")
## Without logical operators
findCellTypes(object = geneIndex,
gene.list = c("Il2ra", "Ptprc", "Il7r"),
datasets = "Thymus")
hyperQueryCellTypes(object = geneIndex,
gene.list = c("Il2ra", "Ptprc", "Il7r"),
datasets = "Thymus")
```
# Super Index
With the function `mergeDataset`, users can build a super index which contains multiple datasets for analysis. The example in [bioRxiv](https://doi.org/10.1101/788596) shows that `scfind` is an ideal tool for multimodal analysis. Here, we are using the example super index to demonstrate how one can perform analysis on multi-datasets at the same time.
```{r}
superIndex <- loadObject(url(ExampleIndex["TabulaMurisSuperIndex"]))
## Now you could view all tissues from both datasets here
superIndex@datasets
```
```{r}
## And view the cell types here
sample(cellTypeNames(superIndex), 20)
```
Let's use our long query as example again to perform search with the super index.
```{r}
long.query <- c("Mef2c", "Gata4", "Nkx2.5", "Myh6", "tnnt2", "tnni3", "CDH2", "Cx43", "GJA1")
result.superIndex <- markerGenes(object = superIndex,
gene.list = long.query)
head(result.superIndex, 10)
subQueriesTfidf.superIndex <- setNames(result.superIndex$tfidf, gsub(",", "&", result.superIndex$Query))
UpSetR::upset(UpSetR::fromExpression(subQueriesTfidf.superIndex), mainbar.y.label = "TFIDF score", order.by = "freq", show.numbers = FALSE)
subQueriesCells.superIndex <- setNames(result.superIndex$Cells, gsub(",", "&", result.superIndex$Query))
UpSetR::upset(UpSetR::fromExpression(subQueriesCells.superIndex), mainbar.y.label = "No. of cells", order.by = "freq", show.numbers = FALSE)
```
To do analysis with multiple cell types, users are allow to select background datasets or cell types as following. For example, we can narrow down our analysis to one of the dataset `TMFACS` only.
```{r}
## Obtain all cell type names in `TM10X`
select.dataset <- grep("TM10X-", superIndex@datasets, value = T)
select.celltypes <- cellTypeNames(superIndex, select.dataset)
## Run query optimization routine
result.superIndex.tm10x <- markerGenes(object = superIndex,
gene.list = long.query,
dataset = select.dataset)
head(result.superIndex.tm10x, 10)
subQueriesTfidf.superIndex.tm10x <- setNames(result.superIndex.tm10x$tfidf, gsub(",", "&", result.superIndex.tm10x$Query))
UpSetR::upset(UpSetR::fromExpression(subQueriesTfidf.superIndex.tm10x), mainbar.y.label = "TFIDF score", order.by = "freq", show.numbers = FALSE)
subQueriesCells.superIndex.tm10x <- setNames(result.superIndex.tm10x$Cells, gsub(",", "&", result.superIndex.tm10x$Query))
UpSetR::upset(UpSetR::fromExpression(subQueriesCells.superIndex.tm10x), mainbar.y.label = "No. of cells", order.by = "freq", show.numbers = FALSE)
```
Here, with the super index, we can now evaluate the genes in both `TM10X` and `TMFACS` simultaneously.
```{r}
## Obtain all cell type names in `TM10X`
select.dataset <- grep("TM10X-", superIndex@datasets, value = T)
select.celltypes <- cellTypeNames(superIndex, select.dataset)
evaluate.genes.tm10x <- evaluateMarkers(object = superIndex,
gene.list = long.query,
cell.types = "TM10X-Heart.cardiac muscle cell",
background.cell.types = select.celltypes) ## using all cells in `TMFACS` as background cell types
## Obtain all cell type names in `TMFACS`
select.dataset <- grep("TMFACS-", superIndex@datasets, value = T)
select.celltypes <- cellTypeNames(superIndex, select.dataset)
evaluate.genes.tmfacs <- evaluateMarkers(object = superIndex,
gene.list = long.query,
cell.types = "TMFACS-Heart.cardiac muscle cell",
background.cell.types = select.celltypes) ## using all cells in `TMFACS` as background cell types
evaluate.result <- rbind(
data.frame(evaluate.genes.tm10x, dataset = "TM10X"),
data.frame(evaluate.genes.tmfacs, dataset = "TMFACS")
)
ggplot2::qplot(precision, recall, data = evaluate.result,colour = dataset, size = f1, label = genes) +
ggplot2::geom_text(vjust = 0, nudge_y = 5e-3, colour = "black", check_overlap = T)
```
# Interactive Session
__Note__ `scfind` can also be run in an interactive `Shiny` session:
```{r eval=FALSE, include=TRUE}
scfindShiny(object = geneIndex)
```
To enhancer user experience, the [Hemberg-lab](https://scfind.sanger.ac.uk/) has an interactive website of 9 collections of `scfind` indexes.
# sessionInfo()
```{r echo=FALSE}
sessionInfo()
```