This repository has been archived by the owner on Jan 10, 2021. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathstringApp.Rmd
373 lines (260 loc) · 15 KB
/
stringApp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
---
title: "Cytoscape StringApp"
author: "Kozo Nishida, Kristina Hanspers and Alex Pico"
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
toc: true
toc_depth: 4
df_print: paged
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
eval=FALSE
)
```
*The R markdown is available from the pulldown menu for* Code *at the upper-right, choose "Download Rmd", or [download the Rmd from GitHub](https://raw.githubusercontent.com/nrnb/gsod2019_kozo_nishida/master/stringApp.Rmd).*
<hr />
In these exercises, we will use the [stringApp](http://apps.cytoscape.org/apps/stringApp) for [Cytoscape](http://cytoscape.org/) to retrieve molecular networks from the [STRING](https://string-db.org/) and [STITCH](http://stitch-db.org/) databases. The exercises will teach you how to:
- retrieve networks for proteins or small-molecule compounds of interest
- retrieve networks for a disease or an arbitrary topics in PubMed
- layout and visually style the resulting networks
- import external data and map them onto a network
- perform enrichment analyses and visualize the results
- merge and compare networks
- select proteins by attributes
- identify functional modules through network clustering
The original version of this tutorial was developed by Lars Juhl Jensen of the Novo Nordisk Center for Protein Research at the University of Copenhagen. We thank professor Jensen for his gracious willingness to allow us to repackage the content for delivery as a Cytoscape tutorial.
<hr />
# Installation
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if(!"RCy3" %in% installed.packages())
BiocManager::install("RCy3")
library(RCy3)
```
# Getting started
First, launch Cytoscape and keep it running whenever using RCy3. Confirm that you have everything installed and running:
```{r}
cytoscapePing()
cytoscapeVersionInfo()
```
# Prerequisites
The exercises require you to have certain Cytoscape apps and R packages installed.
```{r}
installApp('stringApp')
installApp('yFiles Layout Algorithms')
installApp('enhancedGraphics')
installApp('clusterMaker2')
install.packages("httr")
install.packages("readxl")
```
If you are not already familiar with the STRING database, we highly recommend that you go through the short [STRING exercises](https://jensenlab.org/training/string/) provided by the [Jensen lab](https://jensenlab.org/) to learn about the underlying data before working with them in these exercises.
# Exercise 1
In this exercise, we will perform some simple queries to retrieve molecular networks based on a protein, a small-molecule compound, a disease, and a topic in PubMed.
## Protein queries
You can query **STRING** as protein data source (in the following query the **protein name** is **SORCS2**).
You can select the appropriate organism by setting the **species** option (e.g. **Homo sapiens**).
(And the following chunk also imports the result into Cytoscape.)
```{r}
string.cmd = 'string protein query query="SORCS2" species="Homo sapiens"'
commandsRun(string.cmd)
```
The limit (Maximum number of interactors) option determines how many interaction partners of your protein(s) of interest will be added to the network.
By default, if you enter only one protein name, the resulting network will contain 10 additional interactors. If you enter more than one protein name, the network will contain only the interactions among these proteins, unless you explicitly ask for additional proteins.
```{r}
string.cmd = 'string protein query query="SORCS2" species="Homo sapiens" limit=100'
commandsRun(string.cmd)
```
How many nodes are in the resulting network?
How does this compare to the maximum number of interactors you specified?
What types of information does the **Node Table** provide?
## Compound queries
You can query **STITCH** as protein/compound data source (in the following query the **compound name** is **imatinib**).
You can select the organism and number of additional interactors just like for the protein query above.
```{r}
string.cmd = 'string compound query query="imatinib" species="Homo sapiens"'
commandsRun(string.cmd)
```
How is this network different from the protein-only network with respect to node types and the information provided in the Node Table?
## Disease queries
You can query **STRING** as disease data source (in the following query the **disease term** is **Alzheimer**).
The stringApp will retrieve a STRING network for the top-N proteins (by default 100) associated with the disease.
```{r}
string.cmd = 'string disease query disease="Alzheimer"'
commandsRun(string.cmd)
```
Which additional attribute column do you get in the Node Table for a disease query compared to a protein query? (Hint: check the last column.)
## PubMed queries
You can query **STRING** as **PubMed** data source (in the following query the query representing a topic of interest is **jet-lag**).
You can use any query that would work on the PubMed website, but it should obviously a topic with related genes or proteins. The stringApp will query PubMed for the abstracts, find the top-N proteins (by default 100) associated with these abstracts, and retrieve a STRING network for them.
```{r}
string.cmd = 'string pubmed query pubmed="jet-lag"'
commandsRun(string.cmd)
```
Which attribute column do you get in the Node Table for a PubMed query compared to a disease query? (Hint: check the last columns.)
# Exercise 2
In this exercise, we will work with a list of 78 proteins that interact with TrkA (tropomyosin-related kinase A) in neuroblastoma cells 10 min after stimulation with NGF (nerve growth factor) ([Emdal et al., 2015](http://stke.sciencemag.org/content/8/374/ra40)). An adapted table with the data from this study is available [here](https://cytoscape.org/cytoscape-tutorials/protocols/data/Emdal2015SciSignal.xlsx).
## Protein network retrieval
Here, we run a query with the first column (UniProt IDs) in the table.
```{r}
library(httr)
library(readxl)
xlsx_url = "https://cytoscape.org/cytoscape-tutorials/protocols/data/Emdal2015SciSignal.xlsx"
GET(xlsx_url, write_disk(tf <- tempfile(fileext = ".xlsx")))
df <- read_excel(tf)
```
```{r}
string.cmd = paste('string protein query query="', paste(df$UniProt, collapse = '\n'), '"', sep = "")
commandsRun(string.cmd)
```
How many nodes and edges are there in the resulting network? Do the proteins all form a connected network? Why?
Cytoscape provides several network layout options.
For example, you can try the **Degree Sorted Circle Layout**
```{r}
layoutNetwork('degree-circle')
```
and the **Prefuse Force Directed Layout** with **score** as edge weight
```{r}
layoutNetwork('force-directed edgeAttribute="score"')
```
*Can you find a layout that allows you to easily recognize patterns in the network? What about the Edge-weighted Spring Embedded (Kamada-Kawai) Layout with the attribute ‘score’, which is the combined STRING interaction score.*
```{r}
layoutNetwork('kamada-kawai edgeAttribute="score"')
```
**Notice about yFiles layout automation**
[**yFiles** Layout Algorithms App](http://apps.cytoscape.org/apps/yfileslayoutalgorithms) does not support any automation.
## Discrete color mapping
Cytoscape allows you to map attributes of the nodes and edges to visual properties such as node color and edge width.
Here, we will map the target family data to the node color.
Run this code chunk to change the node **Fill Color** with **target family** column in the node table.
```{r}
column <- "target::family"
values <- c('Kinase', 'GPCR')
colors <- c('#FF0000', '#0000FF')
setNodeColorMapping(column, values, colors, mapping.type = "d", style.name = "STRING style v1.5")
```
You can find the column names in the node table with
```{r}
getTableColumnNames()
```
You can find the visual style names with
```{r}
getVisualStyleNames()
```
This action will remove the rainbow coloring of the nodes and present you with a list of all the different values of the attribute that are exist in the network.
*Which target families are present in the network?*
(Actually you already used the target family names in `column`. You can check it with
```{r}
getTableColumns('node','target::family')
```
)
You should see the attribute values **GPCR** or **Kinase**.
You can also set a default color, e.g. for all nodes that do not have a target family annotation with
```{r}
setNodeColorDefault('#FFFFFF', style.name = "STRING style v1.5")
```
*How many of the proteins in the network are kinases?*
Note that the retrieved network contains a lot of additional information associated with the nodes and edges, such as the protein sequence, tissue expression data (Node Table) as well as the confidence scores for the different interaction evidences (Edge Table). In the following, we will explore these data using Cytoscape.
## Data import
Network nodes and edges can have additional information associated with them that we can load into Cytoscape and use for visualization.
We already imported the data from an Excel spreadsheet derived from data provided in the paper mentioned above. Here we check it again with
```{r}
head(df)
```
Now you need to map unique identifiers between the entries in the data and the nodes in the network.
The key point of this is to identify which nodes in the network are equivalent to which entries in the table.
This enables mapping of data values into visual properties like Fill Color and Shape.
This kind of mapping is typically done by comparing the unique Identifier attribute value for each node (Key Column for Network) with the unique Identifier value for each data value (key symbol).
The **Key Column** for Network allows you to set the node attribute column that is to be used as key to map to.
In this case it is **query term** because this attribute contains the UniProt accession numbers you entered when retrieving the network.
To import the node attributes file into Cytoscape, run
```{r}
loadTableData(as.data.frame(df), data.key.column = "UniProt", table = "node", table.key.column = "query term")
```
If there is a match between the value of a Key in the dataset and the value the Key Column for Network field in the network, all attribute–value pairs associated with the element in the dataset are assigned to the matching node in the network.
You will find the imported columns at the end of the Node Table.
## Continuous color mapping
Now, we want to color the nodes according to the protein abundance (log ratio) compared to the cells before NGF treatment.
Here we set **Column** to the node column containing the data that you want to use (10 min log ratio).
```{r}
logratio.score.table <- getTableColumns('node', "10 min log ratio")
```
Since this is a numeric value, we will use the Continuous Mapping as the Mapping Type, and set a color gradient for how abundant each protein is.
```{r}
logratio.min <- min(logratio.score.table, na.rm = T)
logratio.max <- max(logratio.score.table, na.rm = T)
```
```{r}
if (abs(logratio.min) < logratio.max) {
setNodeColorMapping("10 min log ratio", c(-logratio.max, 0, logratio.max), c('#0000FF', '#FFFFFF', '#FF0000'), style.name = "STRING style v1.5")
} else {
setNodeColorMapping("10 min log ratio", c(logratio.min, 0, -logratio.min), c('#0000FF', '#FFFFFF', '#FF0000'), style.name = "STRING style v1.5")
}
```
The color gradient blue-white-red gives a visualization of the negative-to-positive abundance ratio.
*Are the up-regulated nodes grouped together?*
## Functional enrichment
Next, we will retrieve functional enrichment for the proteins in our network.
After making sure that no nodes are selected in the network, go to the menu **Apps → STRING Enrichment → Retrieve functional enrichment**.
A new STRING Enrichment tab will appear in the Table Panel on the bottom.
It contains a table of enriched terms and corresponding information for each enrichment category.
*Which are the three most statistically significant terms?*
To explore only specific types of terms, e.g. GO terms, and to remove redundant terms from the table, click on the filter icon in the **Table panel** (leftmost icon).
Select the three types of GO terms, enable the option to **Remove redundant terms** and set **Redundancy cutoff** to 0.2.
In this way, you will see only the statistically significant GO terms that do not represent largely the same set of proteins within the network.
You can see which proteins are annotated with a given term by selecting the term in the **STRING Enrichment** panel.
*Do the functional terms assigned to a protein correlate with whether it is up- or down-regulated?*
Next, we will visualize the top-3 enriched terms in the network by using split charts.
Click the settings icon (rightmost icon) and set the **Number of terms** to chart to 3; you can optionally also **Change Color Palette** before clicking **OK** to confirm the new settings.
Click the colorful chart icon to show the terms as the charts on the network.
*Do these terms give good coverage of the proteins in network?*
Finally, save the list of enriched terms and associated p-values as a text file by
```{r}
commandsPOST(paste0('table export table="STRING Enrichment: All" options="CSV" outputFile="',paste(getwd(),"string-enrichment-all.csv",sep = "/"),'"'))
```
# Exercise 3
We are going to use the stringApp to query the [DISEASES](https://diseases.jensenlab.org/) database for proteins associated with Parkinson’s disease and with Alzheimer’s disease, retrieve STRING networks for both, created a combined network for the two neurodegenerative diseases, identify clusters in the network, and color it to compare the diseases.
## Disease network retrieval
Retrieve a network for **Parkinson’s disease** and another for **Alzheimer’s disease**.
```{r}
string.cmd = 'string disease query disease="Parkinson"'
commandsRun(string.cmd)
```
```{r}
string.cmd = 'string disease query disease="Alzheimer"'
commandsRun(string.cmd)
```
## Working with node attributes
Browse through the node attributes table and find the disease score column.
```{r}
getTableColumnNames()
```
Sort it descending by values to see the highest disease scores.
```{r}
disease.score <- getTableColumns('node', "stringdb::disease score")
disease.score.sorted <- disease.score[order(-disease.score$`stringdb::disease score`),,drop=FALSE]
head(disease.score.sorted)
```
The row names are SUIDs for the network nodes.
You can highlight the corresponding nodes by selecting the rows in the table,
```{r}
selectNodes(rownames(head(disease.score.sorted)))
```
Look for an example for a node with a disease score of 5 and one with a disease score below 4.
Rename the **disease score** columns in the two networks to **PD score** and **AD score**, respectively, by
```{r}
setCurrentNetwork("String Network - Parkinson's disease")
renameTableColumn("stringdb::disease score", "stringdb::PD score")
```
```{r}
setCurrentNetwork("String Network - Alzheimer's disease")
renameTableColumn("stringdb::disease score", "stringdb::AD score")
```
## Merging networks
Cytoscape provides functionality to merge two or more networks, building either their union, intersection or difference.
To merge the two disease networks
```{r}
```
*How many nodes and edges are there in the merged network compared to the two individual disease networks?*