Skip to content
This repository has been archived by the owner on Jan 3, 2018. It is now read-only.

A capstone example for Bioinformatics in R #609

Closed
wants to merge 26 commits into from

Conversation

rbeagrie
Copy link
Contributor

The idea here is to show learners how to apply novice R lessons to bioinformatics. This is a very hastily prepared pull request for work in progress but contributions and suggestions very welcome!

@drlabratory
Copy link
Contributor

Working from analysis.Rmd, you didn't commit data/counts.txt so the second code chunk fails.

@jdblischak jdblischak added the R label Jul 22, 2014
@gvwilson
Copy link
Contributor

Does the Python lesson in #608 have that file?

@sritchie73
Copy link
Contributor

data/counts.txt doesn't appear to be in #608 either.

@rbeagrie
Copy link
Contributor Author

Oops sorry about that! Included the counts.txt file now

@rbeagrie
Copy link
Contributor Author

OK this PR is also up to date with development in London


# Introduction and data import

The analysis of an RNAseq experiment begins with sequencing reads in the form of FASTQ files with reads and quality scores. These then need to be aligned to a reference genome or transcriptome. There are many different alignment tools available, but the process of alignment is both computationally intensive and time-consuming, so we won't cover it today. Once reads are aligned, the number of reads mapped to each gene can be counted to produce a counts matrix. Again, there are several ways of doing this. The best way to find out about the tools that are available and suitable for your research is to look for recent review papers that comapre the different tools.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comapre -> compare

@naupaka
Copy link
Member

naupaka commented Jul 23, 2014

Nice exercise!

@BernhardKonrad
Copy link
Contributor

Hi, thanks for this work!

With regards to my comment: I'm not in bioinformatics, so if you assume prior subject knowledge then some of these can be disregarded. In that case, I suggest stating who the lecture is tailored for at the beginning.

  1. It would be nice to have a one-line comment on what the libraries that you load are going to be useful for.
  2. In the Introduction, I don't know the terms FASTQ, quality scores, counts matrix, and in general the motivation is quite brief. It would also be nice to mention what we are trying to achieve in this lecture.
  3. When you mention that you can look at the file using head, why not just do it there and show the result?
  4. We don't need the information on gene position. Why not? What are we doing instead?
  5. We can rename the columns to something a bit more readable. and the we choose ctl1 and uvb? Are the abbreviations common enough, could you please remind the reader what they stand for?
  6. # Using gsub -- reproducible I think this should read robust instead.
  7. In Exercise 1, can you remind me that the each row is a gene (if that's the case)?

@BernhardKonrad
Copy link
Contributor

Comments on Data investigation using base R

  1. Do the students know grep, otherwise a few words about what it is would be helpful.
  2. I would like to see a comment/reminder on apply(countData2[, ctlCols], 1, mean), even though they know apply already it is nice to get a reminder of what we are doing and why, and what the 1 does.
  3. Please mention that we add new columns (ctlMean and uvbMean) to the data.frame and why (for ggplot, I assume).
  4. From the plot, how do you answer your question Are there any outliers??
  5. #Find candidate differentially expressed genes needs a space after # to render properly
  6. Could you please mention that the reason why log2 and 0 produce NA/Inf/-Inf is mathematical (and not related to eg R or the dataset).
  7. countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0)) should this be an AND instead to avoid dividing by zero later?
  8. The outlier analysis is really nice.

@sritchie73
Copy link
Contributor

@BernhardKonrad lots of good points. A couple of additions and explanations:

  • It's not clear to me why we create countData2 and add new columns to it. I was going to rewrite that section yesterday, but didn't get time.
  • Agreed on the robust instead of reproducible. I believe @gvwilson can back me up here, but most scientists don't find reproducibility compelling enough to change their workflow (i forget whether I read it in one of the SWC papers or heard it in @gvwilson's PyCon talk), so it's better to use another motivating argument.

```

```{r ggplot_means}
library(ggplot2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

library("ggplot2") is considered better style, for example in the JStatSoft guidelines. ggplot2 doesn't exist as an object and how/why the unquoted form works may as well be magic to people at this stage. So from a consistency argument (why can't you do: install.packages(ggplot2) ? (rhetorical)) the quoted version is preferred.

@liz-is
Copy link

liz-is commented Jul 27, 2014

Thanks for all the comments and suggestions! My responses to you all are mixed together but I've tried to put them in the order of the lesson itself...

You can see my most recent edits here as I don't think it's been merged yet.

First, a note on audience. I believe that as with the Python capstone, the audience is intended to be biologists with an idea of why we'd do an RNAseq experiment but no / very little programming experience.

Introduction - I removed jargon by shortening sentences rather than re-writing, as I realised it was unnecessary :) I think that covering the full workflow of sequencing analysis, including explaining what FASTQ files are, would be too much for a 1.5-2hr lesson so opted to avoid mentioning them instead.

Is there a way to get head output from the shell embedded into an Rmd file? If there is I can add this.

Gene position information - 'what are we doing instead?' I don't really know what you mean here. We don't need to know where the genes are for this analysis. I've expanded that sentence a bit which might make it clearer.

Copying countData - my original idea was to add the mean columns to the data.frame and then select only the necessary columns when later converting to a matrix for DESeq (although all the rows with a mean expression of zero will have been removed...), then creating and working on a copy was suggested. Alternatively we could create a new data frame just for the mean data? I've left this bit as is for now as I'm not sure what would be preferable.

Answer to 'are there any outliers?' is kinda subjective (and hard to tell on these plots!) :) this is why we use DESeq! It's more a point for discussion than supposed to have a definitive answer.

countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0)) No, this was left in to allow for discussion of adding pseudocounts. Maybe adding pseudocounts should be part of the lesson rather than just for discussion?

Explaining Bioconductor - what additional explanation do you think is needed here?

@BernhardKonrad
Copy link
Contributor

Hi and thanks for these changes, this is already much better!

You can get the output of head with system("head data/counts.txt", intern = TRUE)

With "what are we doing instead?" I wanted to ask you to motivate and define the next step in the analysis.

Similar with the answer to "are there outliers?": Your point is that it is hard to tell, so I suggest stating that in the document. This way you drive home the point of motivating DESeq. Ideally you refer back to this plot after your improved analysis and plot.

I don't remember what you mean by pseudocounts, but I suggest either making the point here more clear or drop it altogether.

added suggestions from pull request
@jdblischak
Copy link
Contributor

I assume this lesson is in the same state as #608. We'll close for now and this can be revisited after #759.

@jdblischak jdblischak closed this Oct 1, 2014
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants