rbeagrie · rbeagrie · Jul 23, 2014 · Jul 23, 2014 · Jul 23, 2014 · Jul 23, 2014
diff --git a/novice/capstones/deseq_capstone/analysis.Rmd b/novice/capstones/deseq_capstone/analysis.Rmd
@@ -17,7 +17,17 @@ install.packages("ggplot2")
 install.packages("calibrate")
 ```
 
-Import the data as a `data.frame` and examine it. The data is stored in a text file with the first line being a header giving the column names, and the row names in the first column. 
+# Introduction and data import
+
+The analysis of an RNAseq experiment begins with sequencing reads in the form of FASTQ files with reads and quality scores. These then need to be aligned to a reference genome or transcriptome. There are many different alignment tools available, but the process of alignment is both computationally intensive and time-consuming, so we won't cover it today. Once reads are aligned, the number of reads mapped to each gene can be counted to produce a counts matrix. Again, there are several ways of doing this. The best way to find out about the tools that are available and suitable for your research is to look for recent review papers that comapre the different tools.
+
+The data for this tutorial comes from a PLOS ONE paper, [Genome-Wide Transcriptional Profiling of Skin and Dorsal Root Ganglia after Ultraviolet-B-Induced Inflammation](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0093338)[1], and the raw data can be downloaded from [GEO](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54413). 
+
+This data has already been downloaded and aligned. The command line tool [featureCounts](http://bioinf.wehi.edu.au/featureCounts/) was used to count reads mapped to human genes from the Ensembl annotation (available for download [here](http://www.ensembl.org/info/data/ftp/index.html)). 
+
+The output from this tool is provided in the `counts.txt` file. Have a look at this file in the shell, using `head`.
+
+Import the data into R as a `data.frame` and examine it again. You can set the arguments of `read.table` to import the first row as a header giving the column names, and the first column as row names. 
 
 ```{r data_input}
 # Filename with output from featureCounts
@@ -28,6 +38,7 @@ head(countData)
 colnames(countData)
 class(countData)
 ```
+
 It contains information about genes (one gene per row) with the gene positions in the first five columns and then information about the number of reads aligning to the gene in each experimental sample. We don't need the information on gene position, so we can remove it from the data frame.
 
 ```{r}
@@ -38,6 +49,7 @@ colnames(countData)
 ```
 
 We can rename the columns to something a bit more readable.
+
 ```{r eval=FALSE}
 # Manually
 c("ctl1", "ctl2", "ctl3", "uvb1", "uvb2", "uvb3")
@@ -47,7 +59,7 @@ paste("ctl", 1:3)
 paste("ctl", 1:3, sep="")
 ?paste0
 paste0("ctl", 1:3)
-c(paste0("ctl", 1:4), paste0("uvb", 1:5))
+c(paste0("ctl", 1:3), paste0("uvb", 1:3))
 ```
 
 An easier way to do this, especially for files with many columns, is to use the `gsub` command to strip out the extra information. This is also more robust to introduced errors, for example if the column order changes at some point in the future.
@@ -60,9 +72,15 @@ head(countData)
 ```
 
 ## Exercise 1
-Find the gene with the highest expression in any sample. Extract  the expression data for this gene for all samples. In which sample does it have the highest expression? What is the function of the gene? Can you suggest why this is the top expressed gene?
+Find the gene with the highest expression in any sample. Extract the expression data for this gene for all samples. In which sample does it have the highest expression? 
+
+What is the function of the gene? Can you suggest why this is the top expressed gene?
 
-```{r exercise_1}
+Hint 1: use the `apply` function from the introductory R lessons.
+
+Hint 2: try `?which.max`.
+
+```{r exercise_1, echo=FALSE, include=FALSE}
 max(apply(countData, 1, max)) # max expression is 7013
 topGene <- which.max(apply(countData, 1, max)) # gene is EEF1A1P9
 countData[topGene, ] # get other sample data - max is in uvb1
@@ -84,16 +102,16 @@ colnames(countData2)
 grep("ctl", colnames(countData2))
 ctlCols <- grep("ctl", colnames(countData2))
 head(countData2[,ctlCols])
-apply(countData2[, ctlCols], 1, mean) 
-# here we'll use rowMeans instead, its a convenient shortcut, and also faster!
+head(apply(countData2[, ctlCols], 1, mean))
+# here we'll use rowMeans instead, it's a convenient shortcut, and also faster!
 countData2$ctlMean <- rowMeans(countData2[, ctlCols])
 
 # same for uvb
 uvbCols <- grep("uvb", colnames(countData2))
 countData2$uvbMean <- rowMeans(countData2[, uvbCols])
 ```
 
-Plot the mean expression of each gene in control against the UVB sample mean. Look for outliers.
+Plot the mean expression of each gene in control against the UVB sample mean. Are there any outliers?
 
 ```{r plot_means}
 plot(countData2$ctlMean, countData2$uvbMean)
@@ -105,24 +123,27 @@ ggplot(countData2, aes(x=ctlMean, y=uvbMean)) + geom_point()
 ```
 
 ## Exercise 2
-How could you make this plot more informative and look more professional? Hint: try using a log scale. Try changing colours, transparencies, sizes, or shapes of points. 
+How could you make this plot more informative and look more professional? 
 
-`help(par)` will give you information on lots of graphical parameters that can be set. Help for ggplot2 can be found [here](http://docs.ggplot2.org/current/).
+Hint: try using a log scale. You can also changing colours, transparencies, sizes, or shapes of points. 
 
-```{r exercise2_1}
-plot(countData2$ctlMean, countData2$uvbMean, log="xy")
+`?par` will give you information on lots of graphical parameters that can be set. Help for ggplot2 can be found [here](http://docs.ggplot2.org/current/).
+
+```{r exercise2_1, echo=FALSE, include=FALSE}
+plot(countData2$ctlMean, countData2$uvbMean, log="xy", pch=16)
 ```
 
-```{r exercise2_2}
+```{r exercise2_2, echo=FALSE, include=FALSE}
 ggplot(countData2, aes(x=ctlMean, y=uvbMean)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw()
 ```
 There are lots more options you can use to alter the appearance of these plots.
 
 #Find candidate differentially expressed genes
 
-We can find candidate differentially expressed genes by looking for genes with a large change between control and UVB samples. A common threshold used is log2 fold change more than 2 fold. We will calculate log2 fold change for all the genes and colour the genes with log2 fold change more than 2 fold on the plot.
+We can find candidate differentially expressed genes by looking for genes with a large change between control and UVB samples. A common threshold used is log2 fold change more than 2 or less than -2. We will calculate log2 fold change for all the genes and colour the genes with log2 fold change of more than 2 or less than -2 on the plot.
 
 ```{r remove_unexpressed}
+# discuss: why to remove zeroes (NAs produced)
 # explain: whats happening here, is for each ctlMean there is a TRUE/FALSE for 
 # if it is greater than 0. TRUE and FALSE can be summed: FALSE is typically 
 # considered a 0, and TRUE is considered to be 1. So when we call
@@ -139,6 +160,7 @@ nrow(countData2)
 
 ```{r log2FC}
 # explain: what is fold change? why do we use log2 to quantify the fold change?
+# discuss: Inf / -Inf may be produced in some cases. Concept of adding pseudocounts.
 countData2$log2FC <- log2(countData2$uvbMean / countData2$ctlMean)
 # again, reinforce that summing a logical vector gives you the number of 
 # occurences of TRUE.
@@ -154,14 +176,16 @@ countData2$outlier[countData2$log2FC < -2] <- TRUE
 ```
 
 ```{r plot_outliers}
-plot(countData2$ctlMean, countData2$uvbMean, log="xy")
-points(countData2$ctlMean[countData2$outlier==TRUE], countData2$uvbMean[countData2$outlier==TRUE], col="red")
+plot(countData2$ctlMean, countData2$uvbMean, log="xy", pch=16)
+points(countData2$ctlMean[countData2$outlier==TRUE], countData2$uvbMean[countData2$outlier==TRUE], col="red", pch=16)
 ```
 
 ```{r ggplot_outliers}
-ggplot(countData2, aes(x=ctlMean, y=uvbMean, colour=outlier)) + geom_point() + scale_x_log10() + scale_y_log10() 
+ggplot(countData2, aes(x=ctlMean, y=uvbMean, colour=outlier)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw()
 ```
 
+What do you notice about the positions of the outliers on these plots? How would you interpret this?
+
 # DESeq2 analysis
 
 DESeq2 is an R package for analysis of RNAseq data. It is available from [Bioconductor](http://www.bioconductor.org/). Bioconductor is a project to provide tools for analysing high-throughput genomic data including RNA-seq, ChIP-seq and arrays. You can explore Bioconductor packages [here](http://www.bioconductor.org/packages/release/BiocViews.html#___Software). 
@@ -230,6 +254,8 @@ dir.create("results")
 write.table(sig, file="results/sig.txt", sep="\t") # tab delim data
 ```
 
+You can open this file in Excel or any text editor (try it now).
+
 # Data Visualization
 
 We can also do some exploratory plotting of the data.
@@ -291,3 +317,7 @@ legend("topleft", legend=c("FDR<0.05", "|LFC|>1", "both"), pch=16, col=c("red","
 # Label points
 with(subset(res, padj<.05 & abs(log2FoldChange)>1), textxy(log2FoldChange, -log10(pvalue), labs=Gene, cex=1))
 ```
+
+# References
+
+1. Dawes JM, Antunes-Martins A, Perkins JR, Paterson KJ, Sisignano M, et al. (2014) Genome-Wide Transcriptional Profiling of Skin and Dorsal Root Ganglia after Ultraviolet-B-Induced Inflammation. PLoS ONE 9(4): e93338. doi: 10.1371/journal.pone.0093338