A capstone example for Bioinformatics in R #609

rbeagrie · 2014-07-22T16:49:50Z

The idea here is to show learners how to apply novice R lessons to bioinformatics. This is a very hastily prepared pull request for work in progress, but contributions and suggestions are very welcome!

drlabratory · 2014-07-22T17:53:02Z

Working from analysis.Rmd, you didn't commit data/counts.txt so the second code chunk fails.

gvwilson · 2014-07-22T18:09:46Z

Does the Python lesson in #608 have that file?

sritchie73 · 2014-07-23T00:33:14Z

data/counts.txt doesn't appear to be in #608 either.

rbeagrie · 2014-07-23T08:30:01Z

Oops sorry about that! Included the counts.txt file now

rbeagrie · 2014-07-23T13:47:40Z

OK this PR is also up to date with development in London

added explanation of TRUE/FALSE being summable

naupaka · 2014-07-23T19:13:02Z

novice/capstones/deseq_capstone/analysis.Rmd

+
+# Introduction and data import
+
+The analysis of an RNAseq experiment begins with sequencing reads in the form of FASTQ files with reads and quality scores. These then need to be aligned to a reference genome or transcriptome. There are many different alignment tools available, but the process of alignment is both computationally intensive and time-consuming, so we won't cover it today. Once reads are aligned, the number of reads mapped to each gene can be counted to produce a counts matrix. Again, there are several ways of doing this. The best way to find out about the tools that are available and suitable for your research is to look for recent review papers that comapre the different tools.


comapre -> compare

naupaka · 2014-07-23T19:19:29Z

Nice exercise!

BernhardKonrad · 2014-07-23T20:17:16Z

Hi, thanks for this work!

With regard to my comments: I'm not in bioinformatics, so if you assume prior subject knowledge then some of these can be disregarded. In that case, I suggest stating who the lecture is tailored for at the beginning.

It would be nice to have a one-line comment on what the libraries that you load are going to be useful for.
In the Introduction, I don't know the terms FASTQ, quality scores, or counts matrix, and in general the motivation is quite brief. It would also be nice to mention what we are trying to achieve in this lecture.
When you mention that you can look at the file using `head`, why not just do it there and show the result?
"We don't need the information on gene position." Why not? What are we doing instead?
"We can rename the columns to something a bit more readable." And then we choose `ctl1` and `uvb`? Are the abbreviations common enough? Could you please remind the reader what they stand for?
"# Using gsub -- reproducible": I think this should read "robust" instead.
In Exercise 1, can you remind me that each row is a gene (if that's the case)?
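
To make the `gsub` point concrete, here is a minimal sketch of pattern-based renaming. The real headers of data/counts.txt are not shown in this thread, so the column names and the pattern below are assumptions for illustration only.

```r
# Hypothetical column names, roughly what a read-counting tool might emit;
# the real names in data/counts.txt may differ.
rawNames <- c("Geneid",
              "aligned/ctl1.sorted.bam", "aligned/ctl2.sorted.bam",
              "aligned/uvb1.sorted.bam", "aligned/uvb2.sorted.bam")

# Describe what to remove (a directory prefix and a file suffix)
# instead of typing the new names by hand.
cleanNames <- gsub("^aligned/|\\.sorted\\.bam$", "", rawNames)
print(cleanNames)
```

Because the pattern describes what to strip, the same line keeps working if samples are added or reordered, which is the "robust" argument rather than the "reproducible" one.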

BernhardKonrad · 2014-07-23T20:43:58Z

Comments on Data investigation using base R

Do the students know `grep`? If not, a few words about what it is would be helpful.
I would like to see a comment/reminder on `apply(countData2[, ctlCols], 1, mean)`. Even though they know `apply` already, it is nice to get a reminder of what we are doing and why, and what the `1` does.
Please mention that we add new columns (`ctlMean` and `uvbMean`) to the data.frame and why (for ggplot, I assume).
From the plot, how do you answer your question "Are there any outliers?"
"#Find candidate differentially expressed genes" needs a space after `#` to render properly.
Could you please mention that the reason why `log2` and 0 produce NA/Inf/-Inf is mathematical (and not related to e.g. R or the dataset)?
`countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0))`: should this be an AND instead, to avoid dividing by zero later?
The outlier analysis is really nice.
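
As a reminder of what `grep` and the `1` in `apply` do, here is a small sketch on a toy table. The counts are invented; only the names `countData2`, `ctlCols`, `ctlMean` and `uvbMean` come from the lesson, and `uvbCols` is assumed by analogy.

```r
# Toy count table; the values are invented for illustration.
countData2 <- data.frame(
  ctl1 = c(10, 0, 5), ctl2 = c(14, 0, 7),
  uvb1 = c(30, 0, 6), uvb2 = c(34, 2, 4),
  row.names = c("geneA", "geneB", "geneC")
)

# grep() returns the positions of the column names matching a pattern.
ctlCols <- grep("^ctl", colnames(countData2))
uvbCols <- grep("^uvb", colnames(countData2))

# MARGIN = 1 applies the function to each row (one gene);
# MARGIN = 2 would apply it to each column (one sample).
countData2$ctlMean <- apply(countData2[, ctlCols], 1, mean)
countData2$uvbMean <- apply(countData2[, uvbCols], 1, mean)

# rowMeans() is a shortcut for the same per-row calculation.
uvbShortcut <- rowMeans(countData2[, uvbCols])
print(countData2)
```

Note that the column positions are looked up before the mean columns are added; running `grep("^ctl", ...)` afterwards would also match `ctlMean`.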

sritchie73 · 2014-07-24T00:52:07Z

@BernhardKonrad lots of good points. A couple of additions and explanations:

It's not clear to me why we create countData2 and add new columns to it. I was going to rewrite that section yesterday, but didn't get time.
Agreed on "robust" instead of "reproducible". I believe @gvwilson can back me up here, but most scientists don't find reproducibility compelling enough to change their workflow (I forget whether I read it in one of the SWC papers or heard it in @gvwilson's PyCon talk), so it's better to use another motivating argument.

gavinsimpson · 2014-07-24T15:53:21Z

novice/capstones/deseq_capstone/analysis.Rmd

+```
+
+```{r ggplot_means}
+library(ggplot2)


`library("ggplot2")` is considered better style, for example in the JStatSoft guidelines. `ggplot2` doesn't exist as an object, and how/why the unquoted form works may as well be magic to people at this stage. So from a consistency argument (why can't you do `install.packages(ggplot2)`? (rhetorical)) the quoted version is preferred.
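
A short sketch of why the unquoted form is a special case. It uses the `stats` package only because it ships with R, and it assumes no package literally named `pkgName` is installed; `pkgName` is a made-up variable name.

```r
# The quoted form passes the package name as an ordinary string.
library("stats")

# The unquoted form also works, but only because library() captures
# the symbol itself instead of evaluating it.
library(stats)

# The difference shows up when the name is stored in a variable.
pkgName <- "stats"
library(pkgName, character.only = TRUE)  # uses the value "stats"

# Without character.only, library() looks for a package that is
# literally called "pkgName" and fails (assuming none is installed).
result <- tryCatch(
  {
    library(pkgName)
    "loaded"
  },
  error = function(e) "error"
)
print(result)
```

Always quoting keeps `library()` consistent with functions such as `install.packages()`, which expect a character vector.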

liz-is · 2014-07-27T16:07:03Z

Thanks for all the comments and suggestions! My responses to you all are mixed together but I've tried to put them in the order of the lesson itself...

You can see my most recent edits here as I don't think it's been merged yet.

First, a note on audience. I believe that as with the Python capstone, the audience is intended to be biologists with an idea of why we'd do an RNAseq experiment but no / very little programming experience.

Introduction - I removed jargon by shortening sentences rather than re-writing, as I realised it was unnecessary :) I think that covering the full workflow of sequencing analysis, including explaining what FASTQ files are, would be too much for a 1.5-2hr lesson so opted to avoid mentioning them instead.

Is there a way to get head output from the shell embedded into an Rmd file? If there is I can add this.

Gene position information - 'what are we doing instead?' I don't really know what you mean here. We don't need to know where the genes are for this analysis. I've expanded that sentence a bit which might make it clearer.

Copying countData - my original idea was to add the mean columns to the data.frame and then select only the necessary columns when later converting to a matrix for DESeq (although all the rows with a mean expression of zero will have been removed...), then creating and working on a copy was suggested. Alternatively we could create a new data frame just for the mean data? I've left this bit as is for now as I'm not sure what would be preferable.
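
One way the "new data frame just for the mean data" alternative could look, as a sketch with invented counts. The names `meanData` and `countMatrix` are hypothetical, and nothing here is specific to DESeq.

```r
# Toy counts; the values are invented for illustration.
countData <- data.frame(
  ctl1 = c(10, 0, 5), ctl2 = c(14, 0, 7),
  uvb1 = c(30, 0, 6), uvb2 = c(34, 0, 4),
  row.names = c("geneA", "geneB", "geneC")
)

# Keep the per-condition means in their own data frame ...
meanData <- data.frame(
  ctlMean = rowMeans(countData[, c("ctl1", "ctl2")]),
  uvbMean = rowMeans(countData[, c("uvb1", "uvb2")])
)

# ... so the original table still holds only the raw counts and can be
# converted to a matrix without first dropping extra columns or rows.
countMatrix <- as.matrix(countData)
print(meanData)
```

The trade-off is that learners track two objects instead of one, but the count table passed on to later steps is never modified.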

Answer to 'are there any outliers?' is kinda subjective (and hard to tell on these plots!) :) this is why we use DESeq! It's more a point for discussion than supposed to have a definitive answer.

`countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0))`: no, this was left in to allow for discussion of adding pseudocounts. Maybe adding pseudocounts should be part of the lesson rather than just for discussion?
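
To ground the OR versus AND discussion, here is a sketch with invented means showing which genes each filter keeps and what a pseudocount changes. The pseudocount value of 1 is an arbitrary choice for illustration, not a recommendation from the lesson.

```r
# Invented per-condition means for four genes.
ctlMean <- c(geneA = 12, geneB = 0, geneC = 0, geneD = 8)
uvbMean <- c(geneA = 32, geneB = 0, geneC = 4, geneD = 0)

# Unfiltered log2 fold change: 0/0 gives NaN, x/0 gives Inf,
# and log2(0) gives -Inf. This is arithmetic, not an R quirk.
lfc <- log2(uvbMean / ctlMean)
print(lfc)

# OR keeps genes expressed in at least one condition;
# AND keeps only genes expressed in both.
keepOr  <- ctlMean > 0 | uvbMean > 0
keepAnd <- ctlMean > 0 & uvbMean > 0

# TRUE counts as 1 and FALSE as 0, so sum() counts the genes kept.
nOr  <- sum(keepOr)
nAnd <- sum(keepAnd)

# Adding a pseudocount to both conditions keeps every ratio finite.
pseudo <- 1
lfcPseudo <- log2((uvbMean + pseudo) / (ctlMean + pseudo))
print(lfcPseudo)
```

With the OR filter, genes that are switched on or off entirely survive but still give infinite fold changes, which is the opening for discussing pseudocounts; the AND filter avoids the infinities by discarding those genes.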

Explaining Bioconductor - what additional explanation do you think is needed here?

BernhardKonrad · 2014-07-27T23:26:21Z

Hi and thanks for these changes, this is already much better!

You can get the output of head with system("head data/counts.txt", intern = TRUE)
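
If relying on a shell `head` is a concern (it may not be available on every platform, for example a default Windows setup), a base R alternative is `readLines()`. The sketch below writes a small stand-in file so it runs on its own; in the lesson the path would be data/counts.txt, and the file contents here are invented.

```r
# Write a small stand-in file so the example is self-contained.
tmpFile <- tempfile(fileext = ".txt")
writeLines(
  c("Geneid\tctl1\tuvb1", paste0("gene", 1:9, "\t", 1:9, "\t", 9:1)),
  tmpFile
)

# readLines() with n = 6 returns at most the first six lines,
# without calling out to the shell.
firstLines <- readLines(tmpFile, n = 6)
print(firstLines)
```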

With "what are we doing instead?" I wanted to ask you to motivate and define the next step in the analysis.

Similarly with the answer to "are there outliers?": your point is that it is hard to tell, so I suggest stating that in the document. This way you drive home the point of motivating DESeq. Ideally you refer back to this plot after your improved analysis and plot.

I don't remember what you mean by pseudocounts, but I suggest either making the point here clearer or dropping it altogether.

added suggestions from pull request

jdblischak · 2014-10-01T03:43:31Z

I assume this lesson is in the same state as #608. We'll close for now and this can be revisited after #759.

Added lessons for DeSeq

29157a8

jdblischak added the R label Jul 22, 2014

sritchie73 and others added 14 commits July 23, 2014 12:06

Expanded on motivation for using gsub to strip extra information on column names

5d13af8

removed magic number, replacing with variable that stores the gene with the maximum number of aligning reads across all samples.

5a01412

Introduced the rowMeans function as a shortcut, while reinforcing the concept of the `apply` family of functions.

0244f50

Added required call to "dir.create" so that the Rmd file can be knit'd (how do I verb knitr?)

38572c3

Reconstructed count data from previously knit'd md file.

7596511

Spoofed some differential RNAseq count data

025540e

Changed some variable names to camelCase for consistency with rest of the naming scheme

5dec78c

knit'd md file now reflects Rmd file

89fafde

Added explanation for a non-obvious function call.

db8e7fb

Added some more teacher comments

519288b

added explanation for why we want to convert to a matrix

ce241cb

Fixed comment formatting for consistency

249c34f

reorganized files more sensibly

6bbd4fd

Text refers to data/counts.txt

5bbba76

rbeagrie and others added 7 commits July 23, 2014 10:44

Merge branch 'bioinfo-r-capstone-aus' of github.com:sritchie73/bc into sritchie73-bioinfo-r-capstone-aus

aae6fe6

Merge branch 'sritchie73-bioinfo-r-capstone-aus' into bioinfo-r-capstone

b803492

feedback from @rbeagrie, mostly additional explanation detail

1e7a414

echo=FALSE on exercise answers

1dc61b6

and include=FALSE

d27528c

expanded introduction so this can be standalone

ae66b70

Merge pull request #3 from liz-is/bioinfo-r-capstone

c57ae6b

Bioinfo r capstone

liz-is and others added 2 commits July 23, 2014 15:05

added explanation of TRUE/FALSE being summable

78623f6

Merge pull request #4 from liz-is/bioinfo-r-capstone

9444c0f

naupaka reviewed Jul 23, 2014

gavinsimpson reviewed Jul 24, 2014

added suggestions from pull request

fb5e849

liz-is mentioned this pull request Jul 27, 2014

added suggestions from pull request rbeagrie/bc#5

Merged

Merge pull request #5 from liz-is/bioinfo-r-capstone

b07db43

gvwilson assigned jdblischak Sep 29, 2014

jdblischak closed this Oct 1, 2014
