Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bioinfo r capstone #3

Merged
merged 4 commits into from
Jul 23, 2014
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 46 additions & 16 deletions novice/capstones/deseq_capstone/analysis.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,17 @@ install.packages("ggplot2")
install.packages("calibrate")
```

Import the data as a `data.frame` and examine it. The data is stored in a text file with the first line being a header giving the column names, and the row names in the first column.
# Introduction and data import

The analysis of an RNAseq experiment begins with sequencing reads in the form of FASTQ files with reads and quality scores. These then need to be aligned to a reference genome or transcriptome. There are many different alignment tools available, but the process of alignment is both computationally intensive and time-consuming, so we won't cover it today. Once reads are aligned, the number of reads mapped to each gene can be counted to produce a counts matrix. Again, there are several ways of doing this. The best way to find out about the tools that are available and suitable for your research is to look for recent review papers that comapre the different tools.

The data for this tutorial comes from a PLOS ONE paper, [Genome-Wide Transcriptional Profiling of Skin and Dorsal Root Ganglia after Ultraviolet-B-Induced Inflammation](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0093338)[1], and the raw data can be downloaded from [GEO](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54413).

This data has already been downloaded and aligned. The command line tool [featureCounts](http://bioinf.wehi.edu.au/featureCounts/) was used to count reads mapped to human genes from the Ensembl annotation (available for download [here](http://www.ensembl.org/info/data/ftp/index.html)).

The output from this tool is provided in the `counts.txt` file. Have a look at this file in the shell, using `head`.

Import the data into R as a `data.frame` and examine it again. You can set the arguments of `read.table` to import the first row as a header giving the column names, and the first column as row names.

```{r data_input}
# Filename with output from featureCounts
Expand All @@ -28,6 +38,7 @@ head(countData)
colnames(countData)
class(countData)
```

It contains information about genes (one gene per row) with the gene positions in the first five columns and then information about the number of reads aligning to the gene in each experimental sample. We don't need the information on gene position, so we can remove it from the data frame.

```{r}
Expand All @@ -38,6 +49,7 @@ colnames(countData)
```

We can rename the columns to something a bit more readable.

```{r eval=FALSE}
# Manually
c("ctl1", "ctl2", "ctl3", "uvb1", "uvb2", "uvb3")
Expand All @@ -47,7 +59,7 @@ paste("ctl", 1:3)
paste("ctl", 1:3, sep="")
?paste0
paste0("ctl", 1:3)
c(paste0("ctl", 1:4), paste0("uvb", 1:5))
c(paste0("ctl", 1:3), paste0("uvb", 1:3))
```

An easier way to do this, especially for files with many columns, is to use the `gsub` command to strip out the extra information. This is also more robust to introduced errors, for example if the column order changes at some point in the future.
Expand All @@ -60,9 +72,15 @@ head(countData)
```

## Exercise 1
Find the gene with the highest expression in any sample. Extract the expression data for this gene for all samples. In which sample does it have the highest expression? What is the function of the gene? Can you suggest why this is the top expressed gene?
Find the gene with the highest expression in any sample. Extract the expression data for this gene for all samples. In which sample does it have the highest expression?

What is the function of the gene? Can you suggest why this is the top expressed gene?

```{r exercise_1}
Hint 1: use the `apply` function from the introductory R lessons.

Hint 2: try `?which.max`.

```{r exercise_1, echo=FALSE, include=FALSE}
max(apply(countData, 1, max)) # max expression is 7013
topGene <- which.max(apply(countData, 1, max)) # gene is EEF1A1P9
countData[topGene, ] # get other sample data - max is in uvb1
Expand All @@ -84,16 +102,16 @@ colnames(countData2)
grep("ctl", colnames(countData2))
ctlCols <- grep("ctl", colnames(countData2))
head(countData2[,ctlCols])
apply(countData2[, ctlCols], 1, mean)
# here we'll use rowMeans instead, its a convenient shortcut, and also faster!
head(apply(countData2[, ctlCols], 1, mean))
# here we'll use rowMeans instead, it's a convenient shortcut, and also faster!
countData2$ctlMean <- rowMeans(countData2[, ctlCols])

# same for uvb
uvbCols <- grep("uvb", colnames(countData2))
countData2$uvbMean <- rowMeans(countData2[, uvbCols])
```

Plot the mean expression of each gene in control against the UVB sample mean. Look for outliers.
Plot the mean expression of each gene in control against the UVB sample mean. Are there any outliers?

```{r plot_means}
plot(countData2$ctlMean, countData2$uvbMean)
Expand All @@ -105,24 +123,27 @@ ggplot(countData2, aes(x=ctlMean, y=uvbMean)) + geom_point()
```

## Exercise 2
How could you make this plot more informative and look more professional? Hint: try using a log scale. Try changing colours, transparencies, sizes, or shapes of points.
How could you make this plot more informative and look more professional?

`help(par)` will give you information on lots of graphical parameters that can be set. Help for ggplot2 can be found [here](http://docs.ggplot2.org/current/).
Hint: try using a log scale. You can also changing colours, transparencies, sizes, or shapes of points.

```{r exercise2_1}
plot(countData2$ctlMean, countData2$uvbMean, log="xy")
`?par` will give you information on lots of graphical parameters that can be set. Help for ggplot2 can be found [here](http://docs.ggplot2.org/current/).

```{r exercise2_1, echo=FALSE, include=FALSE}
plot(countData2$ctlMean, countData2$uvbMean, log="xy", pch=16)
```

```{r exercise2_2}
```{r exercise2_2, echo=FALSE, include=FALSE}
ggplot(countData2, aes(x=ctlMean, y=uvbMean)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw()
```
There are lots more options you can use to alter the appearance of these plots.

#Find candidate differentially expressed genes

We can find candidate differentially expressed genes by looking for genes with a large change between control and UVB samples. A common threshold used is log2 fold change more than 2 fold. We will calculate log2 fold change for all the genes and colour the genes with log2 fold change more than 2 fold on the plot.
We can find candidate differentially expressed genes by looking for genes with a large change between control and UVB samples. A common threshold used is log2 fold change more than 2 or less than -2. We will calculate log2 fold change for all the genes and colour the genes with log2 fold change of more than 2 or less than -2 on the plot.

```{r remove_unexpressed}
# discuss: why to remove zeroes (NAs produced)
# explain: whats happening here, is for each ctlMean there is a TRUE/FALSE for
# if it is greater than 0. TRUE and FALSE can be summed: FALSE is typically
# considered a 0, and TRUE is considered to be 1. So when we call
Expand All @@ -139,6 +160,7 @@ nrow(countData2)

```{r log2FC}
# explain: what is fold change? why do we use log2 to quantify the fold change?
# discuss: Inf / -Inf may be produced in some cases. Concept of adding pseudocounts.
countData2$log2FC <- log2(countData2$uvbMean / countData2$ctlMean)
# again, reinforce that summing a logical vector gives you the number of
# occurences of TRUE.
Expand All @@ -154,14 +176,16 @@ countData2$outlier[countData2$log2FC < -2] <- TRUE
```

```{r plot_outliers}
plot(countData2$ctlMean, countData2$uvbMean, log="xy")
points(countData2$ctlMean[countData2$outlier==TRUE], countData2$uvbMean[countData2$outlier==TRUE], col="red")
plot(countData2$ctlMean, countData2$uvbMean, log="xy", pch=16)
points(countData2$ctlMean[countData2$outlier==TRUE], countData2$uvbMean[countData2$outlier==TRUE], col="red", pch=16)
```

```{r ggplot_outliers}
ggplot(countData2, aes(x=ctlMean, y=uvbMean, colour=outlier)) + geom_point() + scale_x_log10() + scale_y_log10()
ggplot(countData2, aes(x=ctlMean, y=uvbMean, colour=outlier)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw()
```

What do you notice about the positions of the outliers on these plots? How would you interpret this?

# DESeq2 analysis

DESeq2 is an R package for analysis of RNAseq data. It is available from [Bioconductor](http://www.bioconductor.org/). Bioconductor is a project to provide tools for analysing high-throughput genomic data including RNA-seq, ChIP-seq and arrays. You can explore Bioconductor packages [here](http://www.bioconductor.org/packages/release/BiocViews.html#___Software).
Expand Down Expand Up @@ -230,6 +254,8 @@ dir.create("results")
write.table(sig, file="results/sig.txt", sep="\t") # tab delim data
```

You can open this file in Excel or any text editor (try it now).

# Data Visualization

We can also do some exploratory plotting of the data.
Expand Down Expand Up @@ -291,3 +317,7 @@ legend("topleft", legend=c("FDR<0.05", "|LFC|>1", "both"), pch=16, col=c("red","
# Label points
with(subset(res, padj<.05 & abs(log2FoldChange)>1), textxy(log2FoldChange, -log10(pvalue), labs=Gene, cex=1))
```

# References

1. Dawes JM, Antunes-Martins A, Perkins JR, Paterson KJ, Sisignano M, et al. (2014) Genome-Wide Transcriptional Profiling of Skin and Dorsal Root Ganglia after Ultraviolet-B-Induced Inflammation. PLoS ONE 9(4): e93338. doi: 10.1371/journal.pone.0093338
Loading