diff --git a/novice/capstones/deseq_capstone/analysis.Rmd b/novice/capstones/deseq_capstone/analysis.Rmd
index b249d5bfa..94822ebe6 100644
--- a/novice/capstones/deseq_capstone/analysis.Rmd
+++ b/novice/capstones/deseq_capstone/analysis.Rmd
@@ -17,7 +17,17 @@ install.packages("ggplot2")
 install.packages("calibrate")
 ```
 
-Import the data as a `data.frame` and examine it. The data is stored in a text file with the first line being a header giving the column names, and the row names in the first column. 
+# Introduction and data import
+
+The analysis of an RNAseq experiment begins with sequencing reads in the form of FASTQ files with reads and quality scores. These then need to be aligned to a reference genome or transcriptome. There are many different alignment tools available, but the process of alignment is both computationally intensive and time-consuming, so we won't cover it today. Once reads are aligned, the number of reads mapped to each gene can be counted to produce a counts matrix. Again, there are several ways of doing this. The best way to find out about the tools that are available and suitable for your research is to look for recent review papers that comapre the different tools.
+
+The data for this tutorial comes from a PLOS ONE paper, [Genome-Wide Transcriptional Profiling of Skin and Dorsal Root Ganglia after Ultraviolet-B-Induced Inflammation](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0093338)[1], and the raw data can be downloaded from [GEO](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54413). 
+
+This data has already been downloaded and aligned. The command line tool [featureCounts](http://bioinf.wehi.edu.au/featureCounts/) was used to count reads mapped to human genes from the Ensembl annotation (available for download [here](http://www.ensembl.org/info/data/ftp/index.html)). 
+
+The output from this tool is provided in the `counts.txt` file. Have a look at this file in the shell, using `head`.
+
+Import the data into R as a `data.frame` and examine it again. You can set the arguments of `read.table` to import the first row as a header giving the column names, and the first column as row names. 
 
 ```{r data_input}
 # Filename with output from featureCounts
@@ -28,6 +38,7 @@ head(countData)
 colnames(countData)
 class(countData)
 ```
+
 It contains information about genes (one gene per row) with the gene positions in the first five columns and then information about the number of reads aligning to the gene in each experimental sample. We don't need the information on gene position, so we can remove it from the data frame.
 
 ```{r}
@@ -38,6 +49,7 @@ colnames(countData)
 ```
 
 We can rename the columns to something a bit more readable.
+
 ```{r eval=FALSE}
 # Manually
 c("ctl1", "ctl2", "ctl3", "uvb1", "uvb2", "uvb3")
@@ -47,7 +59,7 @@ paste("ctl", 1:3)
 paste("ctl", 1:3, sep="")
 ?paste0
 paste0("ctl", 1:3)
-c(paste0("ctl", 1:4), paste0("uvb", 1:5))
+c(paste0("ctl", 1:3), paste0("uvb", 1:3))
 ```
 
 An easier way to do this, especially for files with many columns, is to use the `gsub` command to strip out the extra information. This is also more robust to introduced errors, for example if the column order changes at some point in the future.
@@ -60,9 +72,15 @@ head(countData)
 ```
 
 ## Exercise 1
-Find the gene with the highest expression in any sample. Extract  the expression data for this gene for all samples. In which sample does it have the highest expression? What is the function of the gene? Can you suggest why this is the top expressed gene?
+Find the gene with the highest expression in any sample. Extract the expression data for this gene for all samples. In which sample does it have the highest expression? 
+
+What is the function of the gene? Can you suggest why this is the top expressed gene?
 
-```{r exercise_1}
+Hint 1: use the `apply` function from the introductory R lessons.
+
+Hint 2: try `?which.max`.
+
+```{r exercise_1, echo=FALSE, include=FALSE}
 max(apply(countData, 1, max)) # max expression is 7013
 topGene <- which.max(apply(countData, 1, max)) # gene is EEF1A1P9
 countData[topGene, ] # get other sample data - max is in uvb1
@@ -84,8 +102,8 @@ colnames(countData2)
 grep("ctl", colnames(countData2))
 ctlCols <- grep("ctl", colnames(countData2))
 head(countData2[,ctlCols])
-apply(countData2[, ctlCols], 1, mean) 
-# here we'll use rowMeans instead, its a convenient shortcut, and also faster!
+head(apply(countData2[, ctlCols], 1, mean))
+# here we'll use rowMeans instead, it's a convenient shortcut, and also faster!
 countData2$ctlMean <- rowMeans(countData2[, ctlCols])
 
 # same for uvb
@@ -93,7 +111,7 @@ uvbCols <- grep("uvb", colnames(countData2))
 countData2$uvbMean <- rowMeans(countData2[, uvbCols])
 ```
 
-Plot the mean expression of each gene in control against the UVB sample mean. Look for outliers.
+Plot the mean expression of each gene in control against the UVB sample mean. Are there any outliers?
 
 ```{r plot_means}
 plot(countData2$ctlMean, countData2$uvbMean)
@@ -105,24 +123,27 @@ ggplot(countData2, aes(x=ctlMean, y=uvbMean)) + geom_point()
 ```
 
 ## Exercise 2
-How could you make this plot more informative and look more professional? Hint: try using a log scale. Try changing colours, transparencies, sizes, or shapes of points. 
+How could you make this plot more informative and look more professional? 
 
-`help(par)` will give you information on lots of graphical parameters that can be set. Help for ggplot2 can be found [here](http://docs.ggplot2.org/current/).
+Hint: try using a log scale. You can also changing colours, transparencies, sizes, or shapes of points. 
 
-```{r exercise2_1}
-plot(countData2$ctlMean, countData2$uvbMean, log="xy")
+`?par` will give you information on lots of graphical parameters that can be set. Help for ggplot2 can be found [here](http://docs.ggplot2.org/current/).
+
+```{r exercise2_1, echo=FALSE, include=FALSE}
+plot(countData2$ctlMean, countData2$uvbMean, log="xy", pch=16)
 ```
 
-```{r exercise2_2}
+```{r exercise2_2, echo=FALSE, include=FALSE}
 ggplot(countData2, aes(x=ctlMean, y=uvbMean)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw()
 ```
 There are lots more options you can use to alter the appearance of these plots.
 
 #Find candidate differentially expressed genes
 
-We can find candidate differentially expressed genes by looking for genes with a large change between control and UVB samples. A common threshold used is log2 fold change more than 2 fold. We will calculate log2 fold change for all the genes and colour the genes with log2 fold change more than 2 fold on the plot.
+We can find candidate differentially expressed genes by looking for genes with a large change between control and UVB samples. A common threshold used is log2 fold change more than 2 or less than -2. We will calculate log2 fold change for all the genes and colour the genes with log2 fold change of more than 2 or less than -2 on the plot.
 
 ```{r remove_unexpressed}
+# discuss: why to remove zeroes (NAs produced)
 # explain: whats happening here, is for each ctlMean there is a TRUE/FALSE for 
 # if it is greater than 0. TRUE and FALSE can be summed: FALSE is typically 
 # considered a 0, and TRUE is considered to be 1. So when we call
@@ -139,6 +160,7 @@ nrow(countData2)
 
 ```{r log2FC}
 # explain: what is fold change? why do we use log2 to quantify the fold change?
+# discuss: Inf / -Inf may be produced in some cases. Concept of adding pseudocounts.
 countData2$log2FC <- log2(countData2$uvbMean / countData2$ctlMean)
 # again, reinforce that summing a logical vector gives you the number of 
 # occurences of TRUE.
@@ -154,14 +176,16 @@ countData2$outlier[countData2$log2FC < -2] <- TRUE
 ```
 
 ```{r plot_outliers}
-plot(countData2$ctlMean, countData2$uvbMean, log="xy")
-points(countData2$ctlMean[countData2$outlier==TRUE], countData2$uvbMean[countData2$outlier==TRUE], col="red")
+plot(countData2$ctlMean, countData2$uvbMean, log="xy", pch=16)
+points(countData2$ctlMean[countData2$outlier==TRUE], countData2$uvbMean[countData2$outlier==TRUE], col="red", pch=16)
 ```
 
 ```{r ggplot_outliers}
-ggplot(countData2, aes(x=ctlMean, y=uvbMean, colour=outlier)) + geom_point() + scale_x_log10() + scale_y_log10() 
+ggplot(countData2, aes(x=ctlMean, y=uvbMean, colour=outlier)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw()
 ```
 
+What do you notice about the positions of the outliers on these plots? How would you interpret this?
+
 # DESeq2 analysis
 
 DESeq2 is an R package for analysis of RNAseq data. It is available from [Bioconductor](http://www.bioconductor.org/). Bioconductor is a project to provide tools for analysing high-throughput genomic data including RNA-seq, ChIP-seq and arrays. You can explore Bioconductor packages [here](http://www.bioconductor.org/packages/release/BiocViews.html#___Software). 
@@ -230,6 +254,8 @@ dir.create("results")
 write.table(sig, file="results/sig.txt", sep="\t") # tab delim data
 ```
 
+You can open this file in Excel or any text editor (try it now).
+
 # Data Visualization
 
 We can also do some exploratory plotting of the data.
@@ -291,3 +317,7 @@ legend("topleft", legend=c("FDR<0.05", "|LFC|>1", "both"), pch=16, col=c("red","
 # Label points
 with(subset(res, padj<.05 & abs(log2FoldChange)>1), textxy(log2FoldChange, -log10(pvalue), labs=Gene, cex=1))
 ```
+
+# References
+
+1. Dawes JM, Antunes-Martins A, Perkins JR, Paterson KJ, Sisignano M, et al. (2014) Genome-Wide Transcriptional Profiling of Skin and Dorsal Root Ganglia after Ultraviolet-B-Induced Inflammation. PLoS ONE 9(4): e93338. doi: 10.1371/journal.pone.0093338 
\ No newline at end of file
diff --git a/novice/capstones/deseq_capstone/analysis.md b/novice/capstones/deseq_capstone/analysis.md
index f01d234cf..a39646662 100644
--- a/novice/capstones/deseq_capstone/analysis.md
+++ b/novice/capstones/deseq_capstone/analysis.md
@@ -2,7 +2,7 @@
 
 This is an introduction to RNAseq analysis for use at Software Carpentry bootcamps that have covered novice R. It involves reading in some count data from an RNAseq experiment, exploring the data using base R functions and then analysis with the package DESeq2.
 
-## Install required CRAN packages
+# Install required CRAN packages
 
 First, install some packages that you'll use.
 
@@ -13,13 +13,23 @@ install.packages("ggplot2")
 install.packages("calibrate")
 ```
 
-Import the data as a `data.frame` and examine it. The data is stored in a text file with the first line being a header giving the column names, and the row names in the first column. 
+# Introduction and data import
+
+The analysis of an RNAseq experiment begins with sequencing reads in the form of FASTQ files with reads and quality scores. These then need to be aligned to a reference genome or transcriptome. There are many different alignment tools available, but the process of alignment is both computationally intensive and time-consuming, so we won't cover it today. Once reads are aligned, the number of reads mapped to each gene can be counted to produce a counts matrix. Again, there are several ways of doing this. The best way to find out about the tools that are available and suitable for your research is to look for recent review papers that comapre the different tools.
+
+The data for this tutorial comes from a PLOS ONE paper, [Genome-Wide Transcriptional Profiling of Skin and Dorsal Root Ganglia after Ultraviolet-B-Induced Inflammation](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0093338)[1], and the raw data can be downloaded from [GEO](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54413). 
+
+This data has already been downloaded and aligned. The command line tool [featureCounts](http://bioinf.wehi.edu.au/featureCounts/) was used to count reads mapped to human genes from the Ensembl annotation (available for download [here](http://www.ensembl.org/info/data/ftp/index.html)). 
+
+The output from this tool is provided in the `counts.txt` file. Have a look at this file in the shell, using `head`.
+
+Import the data into R as a `data.frame` and examine it again. You can set the arguments of `read.table` to import the first row as a header giving the column names, and the first column as row names. 
 
 
 ```r
-## Filename with output from featureCounts
+# Filename with output from featureCounts
 countFile <- "data/counts.txt"
-## Read in the data
+# Read in the data
 countData <- read.table(countFile, header=TRUE, row.names=1)
 head(countData)
 ```
@@ -54,47 +64,47 @@ head(countData)
 ## OR4G4P                           +;+    947
 ## OR4G11P                            +    940
 ##            ctl1.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    337
-## WASH7P                                     319
-## MIR1302-10                                 313
-## FAM138A                                    321
-## OR4G4P                                     322
-## OR4G11P                                    343
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            ctl2.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    322
-## WASH7P                                     377
-## MIR1302-10                                 348
-## FAM138A                                    333
-## OR4G4P                                     311
-## OR4G11P                                    339
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            ctl3.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    325
-## WASH7P                                     299
-## MIR1302-10                                 328
-## FAM138A                                    340
-## OR4G4P                                     364
-## OR4G11P                                    335
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            uvb1.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    131
-## WASH7P                                     213
-## MIR1302-10                                 250
-## FAM138A                                    253
-## OR4G4P                                     378
-## OR4G11P                                    418
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            uvb2.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    158
-## WASH7P                                     194
-## MIR1302-10                                 247
-## FAM138A                                    234
-## OR4G4P                                     315
-## OR4G11P                                    450
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            uvb3.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    136
-## WASH7P                                     203
-## MIR1302-10                                 263
-## FAM138A                                    276
-## OR4G4P                                     405
-## OR4G11P                                    437
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ```
 
 ```r
@@ -133,47 +143,47 @@ head(countData)
 
 ```
 ##            ctl1.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    337
-## WASH7P                                     319
-## MIR1302-10                                 313
-## FAM138A                                    321
-## OR4G4P                                     322
-## OR4G11P                                    343
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            ctl2.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    322
-## WASH7P                                     377
-## MIR1302-10                                 348
-## FAM138A                                    333
-## OR4G4P                                     311
-## OR4G11P                                    339
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            ctl3.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    325
-## WASH7P                                     299
-## MIR1302-10                                 328
-## FAM138A                                    340
-## OR4G4P                                     364
-## OR4G11P                                    335
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            uvb1.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    131
-## WASH7P                                     213
-## MIR1302-10                                 250
-## FAM138A                                    253
-## OR4G4P                                     378
-## OR4G11P                                    418
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            uvb2.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    158
-## WASH7P                                     194
-## MIR1302-10                                 247
-## FAM138A                                    234
-## OR4G4P                                     315
-## OR4G11P                                    450
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ##            uvb3.fastq_tophat.accepted_hits.bam
-## DDX11L1                                    136
-## WASH7P                                     203
-## MIR1302-10                                 263
-## FAM138A                                    276
-## OR4G4P                                     405
-## OR4G11P                                    437
+## DDX11L1                                      0
+## WASH7P                                       0
+## MIR1302-10                                   0
+## FAM138A                                      0
+## OR4G4P                                       0
+## OR4G11P                                      0
 ```
 
 ```r
@@ -192,21 +202,21 @@ colnames(countData)
 We can rename the columns to something a bit more readable.
 
 ```r
-## Manually
+# Manually
 c("ctl1", "ctl2", "ctl3", "uvb1", "uvb2", "uvb3")
-## Using paste
+# Using paste
 ?paste
 paste("ctl", 1:3)
 paste("ctl", 1:3, sep="")
 ?paste0
 paste0("ctl", 1:3)
-c(paste0("ctl", 1:4), paste0("uvb", 1:5))
+c(paste0("ctl", 1:3), paste0("uvb", 1:3))
 ```
 
 An easier way to do this, especially for files with many columns, is to use the `gsub` command to strip out the extra information. This is also more robust to introduced errors, for example if the column order changes at some point in the future.
 
 ```r
-## Using gsub -- reproducible
+# Using gsub -- reproducible
 ?gsub
 gsub(pattern=".fastq_tophat.accepted_hits.bam", replacement="", x=colnames(countData))
 ```
@@ -222,41 +232,26 @@ head(countData)
 
 ```
 ##            ctl1 ctl2 ctl3 uvb1 uvb2 uvb3
-## DDX11L1     337  322  325  131  158  136
-## WASH7P      319  377  299  213  194  203
-## MIR1302-10  313  348  328  250  247  263
-## FAM138A     321  333  340  253  234  276
-## OR4G4P      322  311  364  378  315  405
-## OR4G11P     343  339  335  418  450  437
+## DDX11L1       0    0    0    0    0    0
+## WASH7P        0    0    0    0    0    0
+## MIR1302-10    0    0    0    0    0    0
+## FAM138A       0    0    0    0    0    0
+## OR4G4P        0    0    0    0    0    0
+## OR4G11P       0    0    0    0    0    0
 ```
 
-### Exercise 1
-Find the gene with the highest expression in any sample. Extract  the expression data for this gene for all samples. In which sample does it have the highest expression? What is the function of the gene? Can you suggest why this is the top expressed gene?
-
+## Exercise 1
+Find the gene with the highest expression in any sample. Extract the expression data for this gene for all samples. In which sample does it have the highest expression? 
 
-```r
-max(apply(countData, 1, max)) #max expression is 7013
-```
+What is the function of the gene? Can you suggest why this is the top expressed gene?
 
-```
-## [1] 450
-```
+Hint 1: use the `apply` function from the introductory R lessons.
 
-```r
-topGene <- which.max(apply(countData, 1, max)) #gene is EEF1A1P9
-countData[topGene, ] #get other sample data - max is in uvb1
-```
+Hint 2: try `?which.max`.
 
-```
-##         ctl1 ctl2 ctl3 uvb1 uvb2 uvb3
-## OR4G11P  343  339  335  418  450  437
-```
 
-```r
-#this is a pseudogene - maybe an artefact of only aligning reads to a single chromosome?
-```
 
-## Data investigation using base R
+# Data investigation using base R
 
 We can investigate this data a bit more using some of the basic R functions before going on to use more sophisticated analysis tools.
 
@@ -266,7 +261,7 @@ We will calculate the mean for each gene for each condition. First make a copy o
 ```r
 countData2 <- countData
 
-#get Control columns
+# get Control columns
 colnames(countData2)
 ```
 
@@ -289,33 +284,33 @@ head(countData2[,ctlCols])
 
 ```
 ##            ctl1 ctl2 ctl3
-## DDX11L1     337  322  325
-## WASH7P      319  377  299
-## MIR1302-10  313  348  328
-## FAM138A     321  333  340
-## OR4G4P      322  311  364
-## OR4G11P     343  339  335
+## DDX11L1       0    0    0
+## WASH7P        0    0    0
+## MIR1302-10    0    0    0
+## FAM138A       0    0    0
+## OR4G4P        0    0    0
+## OR4G11P       0    0    0
 ```
 
 ```r
-apply(countData2[, ctlCols], 1, mean) 
+head(apply(countData2[, ctlCols], 1, mean))
 ```
 
 ```
 ##    DDX11L1     WASH7P MIR1302-10    FAM138A     OR4G4P    OR4G11P 
-##      328.0      331.7      329.7      331.3      332.3      339.0
+##          0          0          0          0          0          0
 ```
 
 ```r
-# here we'll use rowMeans instead, its a convenient shortcut, and also faster!
+# here we'll use rowMeans instead, it's a convenient shortcut, and also faster!
 countData2$ctlMean <- rowMeans(countData2[, ctlCols])
 
-#same for uvb
+# same for uvb
 uvbCols <- grep("uvb", colnames(countData2))
 countData2$uvbMean <- rowMeans(countData2[, uvbCols])
 ```
 
-Plot the mean expression of each gene in control against the UVB sample mean. Look for outliers.
+Plot the mean expression of each gene in control against the UVB sample mean. Are there any outliers?
 
 
 ```r
@@ -332,37 +327,35 @@ ggplot(countData2, aes(x=ctlMean, y=uvbMean)) + geom_point()
 
 ![plot of chunk ggplot_means](./analysis_files/figure-html/ggplot_means.png) 
 
-### Exercise 2
-How could you make this plot more informative and look more professional? Hint: try using a log scale. Try changing colours, transparencies, sizes, or shapes of points. 
+## Exercise 2
+How could you make this plot more informative and look more professional? 
 
-`help(par)` will give you information on lots of graphical parameters that can be set. Help for ggplot2 can be found [here](http://docs.ggplot2.org/current/).
+Hint: try using a log scale. You can also changing colours, transparencies, sizes, or shapes of points. 
 
+`?par` will give you information on lots of graphical parameters that can be set. Help for ggplot2 can be found [here](http://docs.ggplot2.org/current/).
 
-```r
-plot(countData2$ctlMean, countData2$uvbMean, log="xy")
-```
 
-![plot of chunk exercise2_1](./analysis_files/figure-html/exercise2_1.png) 
 
 
-```r
-ggplot(countData2, aes(x=ctlMean, y=uvbMean)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw()
-```
-
-![plot of chunk exercise2_2](./analysis_files/figure-html/exercise2_2.png) 
 There are lots more options you can use to alter the appearance of these plots.
 
-##Find candidate differentially expressed genes
+#Find candidate differentially expressed genes
 
-We can find candidate differentially expressed genes by looking for genes with a large change between control and UVB samples. A common threshold used is log2 fold change more than 2 fold. We will calculate log2 fold change for all the genes and colour the genes with log2 fold change more than 2 fold on the plot.
+We can find candidate differentially expressed genes by looking for genes with a large change between control and UVB samples. A common threshold used is log2 fold change more than 2 or less than -2. We will calculate log2 fold change for all the genes and colour the genes with log2 fold change of more than 2 or less than -2 on the plot.
 
 
 ```r
+# discuss: why to remove zeroes (NAs produced)
+# explain: whats happening here, is for each ctlMean there is a TRUE/FALSE for 
+# if it is greater than 0. TRUE and FALSE can be summed: FALSE is typically 
+# considered a 0, and TRUE is considered to be 1. So when we call
+# `sum(countData2$ctlMean > 0)`, we're really asking, "how many genes have a 
+# mean above 0 in the control group?"
 sum(countData2$ctlMean > 0)
 ```
 
 ```
-## [1] 6
+## [1] 474
 ```
 
 ```r
@@ -370,7 +363,7 @@ sum(countData2$uvbMean > 0)
 ```
 
 ```
-## [1] 6
+## [1] 528
 ```
 
 ```r
@@ -378,27 +371,31 @@ nrow(countData2)
 ```
 
 ```
-## [1] 6
+## [1] 56638
 ```
 
 ```r
 countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0))
-#explain | operator meaning OR in this context?
+# explain: | operator meaning OR in this context?
 nrow(countData2)
 ```
 
 ```
-## [1] 6
+## [1] 565
 ```
 
 
 ```r
+# explain: what is fold change? why do we use log2 to quantify the fold change?
+# discuss: Inf / -Inf may be produced in some cases. Concept of adding pseudocounts.
 countData2$log2FC <- log2(countData2$uvbMean / countData2$ctlMean)
+# again, reinforce that summing a logical vector gives you the number of 
+# occurences of TRUE.
 sum(countData2$log2FC > 2)
 ```
 
 ```
-## [1] 0
+## [1] 170
 ```
 
 ```r
@@ -406,7 +403,7 @@ sum(countData2$log2FC < -2)
 ```
 
 ```
-## [1] 0
+## [1] 46
 ```
 Make a new column to store this information in.
 
@@ -419,27 +416,37 @@ countData2$outlier[countData2$log2FC < -2] <- TRUE
 
 
 ```r
-plot(countData2$ctlMean, countData2$uvbMean, log="xy")
-points(countData2$ctlMean[countData2$outlier==TRUE], countData2$uvbMean[countData2$outlier==TRUE], col="red")
+plot(countData2$ctlMean, countData2$uvbMean, log="xy", pch=16)
+```
+
+```
+## Warning: 91 x values <= 0 omitted from logarithmic plot
+## Warning: 37 y values <= 0 omitted from logarithmic plot
+```
+
+```r
+points(countData2$ctlMean[countData2$outlier==TRUE], countData2$uvbMean[countData2$outlier==TRUE], col="red", pch=16)
 ```
 
 ![plot of chunk plot_outliers](./analysis_files/figure-html/plot_outliers.png) 
 
 
 ```r
-ggplot(countData2, aes(x=ctlMean, y=uvbMean, colour=outlier)) + geom_point() + scale_x_log10() + scale_y_log10() 
+ggplot(countData2, aes(x=ctlMean, y=uvbMean, colour=outlier)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw()
 ```
 
 ![plot of chunk ggplot_outliers](./analysis_files/figure-html/ggplot_outliers.png) 
 
-## DESeq2 analysis
+What do you notice about the positions of the outliers on these plots? How would you interpret this?
+
+# DESeq2 analysis
 
 DESeq2 is an R package for analysis of RNAseq data. It is available from [Bioconductor](http://www.bioconductor.org/). Bioconductor is a project to provide tools for analysing high-throughput genomic data including RNA-seq, ChIP-seq and arrays. You can explore Bioconductor packages [here](http://www.bioconductor.org/packages/release/BiocViews.html#___Software). 
 
 
 ```r
-#install and have a break to check everyone is up to date?
-#explain bioconductor?
+# install and have a break to check everyone is up to date?
+# explain bioconductor?
 source("http://bioconductor.org/biocLite.R")
 biocLite("DESeq2")
 ```
@@ -481,6 +488,10 @@ library(DESeq2)
 ## Loading required package: RcppArmadillo
 ```
 
+```
+## Warning: package 'RcppArmadillo' was built under R version 3.1.1
+```
+
 ```r
 citation("DESeq2")
 ```
@@ -507,7 +518,8 @@ It requires the count data to be in matrix form, and an additional dataframe des
 
 
 ```r
-# Convert to matrix
+# countData is currently a data.frame, but DESeq2 expects its input to be in 
+# matrix format, so we will convert our countData to a matrix.
 class(countData)
 ```
 
@@ -530,17 +542,17 @@ head(countData)
 
 ```
 ##            ctl1 ctl2 ctl3 uvb1 uvb2 uvb3
-## DDX11L1     337  322  325  131  158  136
-## WASH7P      319  377  299  213  194  203
-## MIR1302-10  313  348  328  250  247  263
-## FAM138A     321  333  340  253  234  276
-## OR4G4P      322  311  364  378  315  405
-## OR4G11P     343  339  335  418  450  437
+## DDX11L1       0    0    0    0    0    0
+## WASH7P        0    0    0    0    0    0
+## MIR1302-10    0    0    0    0    0    0
+## FAM138A       0    0    0    0    0    0
+## OR4G4P        0    0    0    0    0    0
+## OR4G11P       0    0    0    0    0    0
 ```
 
 ```r
 # construct colData dataframe
-#three replicates of control and UVB.
+# three replicates of control and UVB.
 colData <- data.frame(condition=c(rep("ctl", 3), rep("uvb",3)), row.names=colnames(countData))
 ```
 
@@ -548,26 +560,18 @@ DESeq works on a particular type of object called a DESeqDataSet.
 
 
 ```r
-#introduce how DESeq2 works - type of object it works on etc
+# introduce how DESeq2 works - type of object it works on etc
 # instantiate the DESeqDataSet
 dds <- DESeqDataSetFromMatrix(countData=countData, colData=colData, design=~condition)
-```
-
-```
-## Warning: some variables in design formula are characters, converting to
-## factors
-```
-
-```r
 dds
 ```
 
 ```
 ## class: DESeqDataSet 
-## dim: 6 6 
+## dim: 56638 6 
 ## exptData(0):
 ## assays(1): counts
-## rownames(6): DDX11L1 WASH7P ... OR4G4P OR4G11P
+## rownames(56638): DDX11L1 WASH7P ... MT-TT MT-TP
 ## rowData metadata column names(0):
 ## colnames(6): ctl1 ctl2 ... uvb2 uvb3
 ## colData names(1): condition
@@ -586,21 +590,6 @@ dds <- DESeq(dds)
 ## estimating dispersions
 ## gene-wise dispersion estimates
 ## mean-dispersion relationship
-```
-
-```
-## Warning: the parametric fit of dispersion estimates over the mean of counts
-## failed, which occurs when the trend is not well captured by the
-## function y = a/x + b. A local regression fit is automatically performed,
-## and the analysis can continue. You can specify fitType='local' or 'mean'
-## to avoid this message if re-running the same data.
-## When using local regression fit, the user should examine plotDispEsts(dds)
-## to make sure the fitted line is not sharply curving up or down based on
-## the position of individual points.
-## Warning: Estimated rdf < 1.0; not estimating variance
-```
-
-```
 ## final dispersion estimates
 ## fitting model and testing
 ```
@@ -617,20 +606,20 @@ head(res)
 ## DataFrame with 6 rows and 6 columns
 ##             baseMean log2FoldChange     lfcSE      stat    pvalue
 ##            <numeric>      <numeric> <numeric> <numeric> <numeric>
-## DDX11L1        225.4      -0.800662   0.11188  -7.15620 8.295e-13
-## WASH7P         261.6      -0.316571   0.10258  -3.08609 2.028e-03
-## MIR1302-10     289.2      -0.001925   0.09442  -0.02039 9.837e-01
-## FAM138A        290.4      -0.005237   0.09441  -0.05546 9.558e-01
-## OR4G4P         354.2       0.498655   0.09741   5.11892 3.073e-07
-## OR4G11P        397.6       0.725214   0.07570   9.58062 9.647e-22
+## DDX11L1            0             NA        NA        NA        NA
+## WASH7P             0             NA        NA        NA        NA
+## MIR1302-10         0             NA        NA        NA        NA
+## FAM138A            0             NA        NA        NA        NA
+## OR4G4P             0             NA        NA        NA        NA
+## OR4G11P            0             NA        NA        NA        NA
 ##                 padj
 ##            <numeric>
-## DDX11L1    2.488e-12
-## WASH7P     3.042e-03
-## MIR1302-10 9.837e-01
-## FAM138A    9.837e-01
-## OR4G4P     6.146e-07
-## OR4G11P    5.788e-21
+## DDX11L1           NA
+## WASH7P            NA
+## MIR1302-10        NA
+## FAM138A           NA
+## OR4G4P            NA
+## OR4G11P           NA
 ```
 
 ```r
@@ -640,11 +629,11 @@ table(res$padj<0.05)
 ```
 ## 
 ## FALSE  TRUE 
-##     2     4
+##   337    50
 ```
 
 ```r
-## Order by adjusted p-value
+# Order by adjusted p-value
 res <- res[order(res$padj), ]
 head(res)
 ```
@@ -653,22 +642,14 @@ head(res)
 ## log2 fold change (MAP): condition uvb vs ctl 
 ## Wald test p-value: condition uvb vs ctl 
 ## DataFrame with 6 rows and 6 columns
-##             baseMean log2FoldChange     lfcSE      stat    pvalue
-##            <numeric>      <numeric> <numeric> <numeric> <numeric>
-## OR4G11P        397.6       0.725214   0.07570   9.58062 9.647e-22
-## DDX11L1        225.4      -0.800662   0.11188  -7.15620 8.295e-13
-## OR4G4P         354.2       0.498655   0.09741   5.11892 3.073e-07
-## WASH7P         261.6      -0.316571   0.10258  -3.08609 2.028e-03
-## MIR1302-10     289.2      -0.001925   0.09442  -0.02039 9.837e-01
-## FAM138A        290.4      -0.005237   0.09441  -0.05546 9.558e-01
-##                 padj
-##            <numeric>
-## OR4G11P    5.788e-21
-## DDX11L1    2.488e-12
-## OR4G4P     6.146e-07
-## WASH7P     3.042e-03
-## MIR1302-10 9.837e-01
-## FAM138A    9.837e-01
+##         baseMean log2FoldChange     lfcSE      stat    pvalue      padj
+##        <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
+## ANXA3     312.32          3.910    0.3475    11.251 2.287e-29 8.849e-27
+## SFRP2    1711.27         -3.356    0.3407    -9.852 6.735e-23 1.303e-20
+## MT2P1     167.36          4.067    0.4582     8.878 6.834e-19 8.816e-17
+## IL8        31.70          5.976    0.9456     6.320 2.620e-10 2.535e-08
+## PTPN13    358.34         -1.386    0.2362    -5.869 4.388e-09 3.396e-07
+## CXCL1      42.47          3.691    0.6448     5.724 1.040e-08 6.705e-07
 ```
 
 Combine DEseq results with the original counts data. Write significant results to a file.
@@ -680,20 +661,20 @@ head(resData)
 ```
 
 ```
-##    Row.names baseMean log2FoldChange   lfcSE     stat    pvalue      padj
-## 1    OR4G11P    397.6       0.725214 0.07570  9.58062 9.647e-22 5.788e-21
-## 2    DDX11L1    225.4      -0.800662 0.11188 -7.15620 8.295e-13 2.488e-12
-## 3     OR4G4P    354.2       0.498655 0.09741  5.11892 3.073e-07 6.146e-07
-## 4     WASH7P    261.6      -0.316571 0.10258 -3.08609 2.028e-03 3.042e-03
-## 5 MIR1302-10    289.2      -0.001925 0.09442 -0.02039 9.837e-01 9.837e-01
-## 6    FAM138A    290.4      -0.005237 0.09441 -0.05546 9.558e-01 9.837e-01
-##    ctl1  ctl2  ctl3  uvb1  uvb2  uvb3
-## 1 313.1 288.2 292.5 480.9 541.6 469.3
-## 2 307.6 273.7 283.8 150.7 190.2 146.1
-## 3 293.9 264.4 317.9 434.9 379.1 435.0
-## 4 291.2 320.5 261.1 245.1 233.5 218.0
-## 5 285.7 295.8 286.4 287.6 297.3 282.5
-## 6 293.0 283.1 296.9 291.1 281.6 296.4
+##   Row.names baseMean log2FoldChange  lfcSE   stat    pvalue      padj
+## 1     ANXA3   312.32          3.910 0.3475 11.251 2.287e-29 8.849e-27
+## 2     SFRP2  1711.27         -3.356 0.3407 -9.852 6.735e-23 1.303e-20
+## 3     MT2P1   167.36          4.067 0.4582  8.878 6.834e-19 8.816e-17
+## 4       IL8    31.70          5.976 0.9456  6.320 2.620e-10 2.535e-08
+## 5    PTPN13   358.34         -1.386 0.2362 -5.869 4.388e-09 3.396e-07
+## 6     CXCL1    42.47          3.691 0.6448  5.724 1.040e-08 6.705e-07
+##      ctl1     ctl2    ctl3   uvb1   uvb2   uvb3
+## 1   26.33   35.131   48.83 378.95 783.44 601.27
+## 2 1847.65 4865.598 2683.82 382.63 325.05 162.88
+## 3   19.02    8.107   23.63 438.34 373.97 141.08
+## 4    0.00    0.000    0.00  66.12  44.18  79.93
+## 5  485.69  535.067  538.65 203.25 205.13 182.26
+## 6   11.70    2.702    0.00  89.99  50.49  99.91
 ```
 
 ```r
@@ -702,20 +683,20 @@ head(resData)
 ```
 
 ```
-##       GeneID baseMean log2FoldChange   lfcSE     stat    pvalue      padj
-## 1    OR4G11P    397.6       0.725214 0.07570  9.58062 9.647e-22 5.788e-21
-## 2    DDX11L1    225.4      -0.800662 0.11188 -7.15620 8.295e-13 2.488e-12
-## 3     OR4G4P    354.2       0.498655 0.09741  5.11892 3.073e-07 6.146e-07
-## 4     WASH7P    261.6      -0.316571 0.10258 -3.08609 2.028e-03 3.042e-03
-## 5 MIR1302-10    289.2      -0.001925 0.09442 -0.02039 9.837e-01 9.837e-01
-## 6    FAM138A    290.4      -0.005237 0.09441 -0.05546 9.558e-01 9.837e-01
-##    ctl1  ctl2  ctl3  uvb1  uvb2  uvb3
-## 1 313.1 288.2 292.5 480.9 541.6 469.3
-## 2 307.6 273.7 283.8 150.7 190.2 146.1
-## 3 293.9 264.4 317.9 434.9 379.1 435.0
-## 4 291.2 320.5 261.1 245.1 233.5 218.0
-## 5 285.7 295.8 286.4 287.6 297.3 282.5
-## 6 293.0 283.1 296.9 291.1 281.6 296.4
+##   GeneID baseMean log2FoldChange  lfcSE   stat    pvalue      padj    ctl1
+## 1  ANXA3   312.32          3.910 0.3475 11.251 2.287e-29 8.849e-27   26.33
+## 2  SFRP2  1711.27         -3.356 0.3407 -9.852 6.735e-23 1.303e-20 1847.65
+## 3  MT2P1   167.36          4.067 0.4582  8.878 6.834e-19 8.816e-17   19.02
+## 4    IL8    31.70          5.976 0.9456  6.320 2.620e-10 2.535e-08    0.00
+## 5 PTPN13   358.34         -1.386 0.2362 -5.869 4.388e-09 3.396e-07  485.69
+## 6  CXCL1    42.47          3.691 0.6448  5.724 1.040e-08 6.705e-07   11.70
+##       ctl2    ctl3   uvb1   uvb2   uvb3
+## 1   35.131   48.83 378.95 783.44 601.27
+## 2 4865.598 2683.82 382.63 325.05 162.88
+## 3    8.107   23.63 438.34 373.97 141.08
+## 4    0.000    0.00  66.12  44.18  79.93
+## 5  535.067  538.65 203.25 205.13 182.26
+## 6    2.702    0.00  89.99  50.49  99.91
 ```
 
 ```r
@@ -728,10 +709,12 @@ dir.create("results")
 ```
 
 ```r
-write.table(sig, file="results/sig.txt", sep="\t") #tab delim data
+write.table(sig, file="results/sig.txt", sep="\t") # tab delim data
 ```
 
-## Data Visualization
+You can open this file in Excel or any text editor (try it now).
+
+# Data Visualization
 
 We can also do some exploratory plotting of the data.
 
@@ -744,7 +727,7 @@ plotDispEsts(dds, main="Dispersion plot")
 
 
 ```r
-## Regularized log transformation for clustering/heatmaps, etc
+# Regularized log transformation for clustering/heatmaps, etc
 rld <- rlogTransformation(dds)
 plotPCA(rld)
 ```
@@ -757,13 +740,13 @@ head(assay(rld))
 ```
 
 ```
-##             ctl1  ctl2  ctl3  uvb1  uvb2  uvb3
-## DDX11L1    7.947 7.888 7.906 7.636 7.724 7.624
-## WASH7P     8.078 8.126 8.026 7.998 7.977 7.947
-## MIR1302-10 8.170 8.187 8.171 8.173 8.189 8.165
-## FAM138A    8.186 8.170 8.192 8.183 8.167 8.191
-## OR4G4P     8.369 8.321 8.406 8.565 8.493 8.565
-## OR4G11P    8.500 8.463 8.470 8.718 8.785 8.705
+##            ctl1 ctl2 ctl3 uvb1 uvb2 uvb3
+## DDX11L1       0    0    0    0    0    0
+## WASH7P        0    0    0    0    0    0
+## MIR1302-10    0    0    0    0    0    0
+## FAM138A       0    0    0    0    0    0
+## OR4G4P        0    0    0    0    0    0
+## OR4G11P       0    0    0    0    0    0
 ```
 
 ```r
@@ -771,12 +754,12 @@ assay(rld)[1:5,1:5]
 ```
 
 ```
-##             ctl1  ctl2  ctl3  uvb1  uvb2
-## DDX11L1    7.947 7.888 7.906 7.636 7.724
-## WASH7P     8.078 8.126 8.026 7.998 7.977
-## MIR1302-10 8.170 8.187 8.171 8.173 8.189
-## FAM138A    8.186 8.170 8.192 8.183 8.167
-## OR4G4P     8.369 8.321 8.406 8.565 8.493
+##            ctl1 ctl2 ctl3 uvb1 uvb2
+## DDX11L1       0    0    0    0    0
+## WASH7P        0    0    0    0    0
+## MIR1302-10    0    0    0    0    0
+## FAM138A       0    0    0    0    0
+## OR4G4P        0    0    0    0    0
 ```
 
 ```r
@@ -785,11 +768,11 @@ t(assay(rld))[1:5,1:5]
 
 ```
 ##      DDX11L1 WASH7P MIR1302-10 FAM138A OR4G4P
-## ctl1   7.947  8.078      8.170   8.186  8.369
-## ctl2   7.888  8.126      8.187   8.170  8.321
-## ctl3   7.906  8.026      8.171   8.192  8.406
-## uvb1   7.636  7.998      8.173   8.183  8.565
-## uvb2   7.724  7.977      8.189   8.167  8.493
+## ctl1       0      0          0       0      0
+## ctl2       0      0          0       0      0
+## ctl3       0      0          0       0      0
+## uvb1       0      0          0       0      0
+## uvb2       0      0          0       0      0
 ```
 
 ```r
@@ -797,12 +780,12 @@ dist(t(assay(rld)))
 ```
 
 ```
-##         ctl1    ctl2    ctl3    uvb1    uvb2
-## ctl2 0.10004                                
-## ctl3 0.08184 0.13488                        
-## uvb1 0.43456 0.45264 0.40115                
-## uvb2 0.39578 0.42692 0.37861 0.13597        
-## uvb3 0.44902 0.46984 0.40809 0.05509 0.15427
+##       ctl1  ctl2  ctl3  uvb1  uvb2
+## ctl2 12.18                        
+## ctl3 12.49 13.61                  
+## uvb1 15.07 16.22 17.29            
+## uvb2 20.01 19.73 20.43 15.05      
+## uvb3 18.70 18.94 20.21 13.75 11.83
 ```
 
 ```r
@@ -810,13 +793,13 @@ as.matrix(dist(t(assay(rld))))
 ```
 
 ```
-##         ctl1   ctl2    ctl3    uvb1   uvb2    uvb3
-## ctl1 0.00000 0.1000 0.08184 0.43456 0.3958 0.44902
-## ctl2 0.10004 0.0000 0.13488 0.45264 0.4269 0.46984
-## ctl3 0.08184 0.1349 0.00000 0.40115 0.3786 0.40809
-## uvb1 0.43456 0.4526 0.40115 0.00000 0.1360 0.05509
-## uvb2 0.39578 0.4269 0.37861 0.13597 0.0000 0.15427
-## uvb3 0.44902 0.4698 0.40809 0.05509 0.1543 0.00000
+##       ctl1  ctl2  ctl3  uvb1  uvb2  uvb3
+## ctl1  0.00 12.18 12.49 15.07 20.01 18.70
+## ctl2 12.18  0.00 13.61 16.22 19.73 18.94
+## ctl3 12.49 13.61  0.00 17.29 20.43 20.21
+## uvb1 15.07 16.22 17.29  0.00 15.05 13.75
+## uvb2 20.01 19.73 20.43 15.05  0.00 11.83
+## uvb3 18.70 18.94 20.21 13.75 11.83  0.00
 ```
 
 ```r
@@ -827,10 +810,14 @@ heatmap(sampleDists)
 ![plot of chunk plot_heatmaps](./analysis_files/figure-html/plot_heatmaps2.png) 
 
 ```r
-## better heatmap with gplots
+# better heatmap with gplots
 library(gplots)
 ```
 
+```
+## Warning: package 'gplots' was built under R version 3.1.1
+```
+
 ```
 ## KernSmooth 2.23 loaded
 ## Copyright M. P. Wand 1997-2009
@@ -878,7 +865,7 @@ heatmap.2(sampleDists, col=colorpanel(64, "red", "white", "blue"), key=FALSE, tr
 
 
 ```r
-## Examine plot of p-values
+# Examine plot of p-values
 hist(res$pvalue, breaks=50, col="grey")
 ```
 
@@ -887,10 +874,17 @@ hist(res$pvalue, breaks=50, col="grey")
 
 
 ```r
-#These are the plots that are most recognisable from papers
+# These are the plots that are most recognisable from papers
 # MA Plot
 par(pch=16)
 with(res, plot(baseMean, log2FoldChange, pch=20, cex=.5, log="x"))
+```
+
+```
+## Warning: 56073 x values <= 0 omitted from logarithmic plot
+```
+
+```r
 with(subset(res, padj<.05), points(baseMean, log2FoldChange, col="red", pch=16))
 library(calibrate)
 ```
@@ -910,19 +904,20 @@ with(subset(res, padj<.05), textxy(baseMean, log2FoldChange, labs=Gene, cex=1, c
 
 ```r
 # Volcano plot
-## Set point character
+# Set point character
 par(pch=16)
 with(res, plot(log2FoldChange, -log10(pvalue), main="Volcano plot"))
 with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), col="red"))
 with(subset(res, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), col="orange"))
 with(subset(res, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), col="green"))
-## Add legend
+# Add legend
 legend("topleft", legend=c("FDR<0.05", "|LFC|>1", "both"), pch=16, col=c("red","orange","green"))
+# Label points
+with(subset(res, padj<.05 & abs(log2FoldChange)>1), textxy(log2FoldChange, -log10(pvalue), labs=Gene, cex=1))
 ```
 
 ![plot of chunk volcano_plot](./analysis_files/figure-html/volcano_plot.png) 
 
-```r
-## Label points
-with(subset(res, padj<.05 & abs(log2FoldChange)>1), textxy(log2FoldChange, -log10(pvalue), labs=Gene, cex=1))
-```
+# References
+
+1. Dawes JM, Antunes-Martins A, Perkins JR, Paterson KJ, Sisignano M, et al. (2014) Genome-Wide Transcriptional Profiling of Skin and Dorsal Root Ganglia after Ultraviolet-B-Induced Inflammation. PLoS ONE 9(4): e93338. doi: 10.1371/journal.pone.0093338