A capstone example for Bioinformatics in R #609
Working from analysis.Rmd, you didn't commit
Does the Python lesson in #608 have that file?
…th the maximum number of aligning reads across all samples.
…he concept of the `apply` family of functions.
…d (how do I verb knitr?)
… the naming scheme
Oops sorry about that! Included the counts.txt file now
…o sritchie73-bioinfo-r-capstone-aus
Bioinfo r capstone
OK this PR is also up to date with development in London
added explanation of TRUE/FALSE being summable
# Introduction and data import

The analysis of an RNAseq experiment begins with sequencing reads in the form of FASTQ files with reads and quality scores. These then need to be aligned to a reference genome or transcriptome. There are many different alignment tools available, but the process of alignment is both computationally intensive and time-consuming, so we won't cover it today. Once reads are aligned, the number of reads mapped to each gene can be counted to produce a counts matrix. Again, there are several ways of doing this. The best way to find out about the tools that are available and suitable for your research is to look for recent review papers that comapre the different tools.
comapre -> compare
Nice exercise!
Hi, thanks for this work! With regards to my comment: I'm not in bioinformatics, so if you assume prior subject knowledge then some of these can be disregarded. In that case, I suggest stating who the lecture is tailored for at the beginning.
Comments on
@BernhardKonrad lots of good points. A couple of additions and explanations:
    ```

    ```{r ggplot_means}
    library(ggplot2)
`library("ggplot2")` is considered better style, for example in the JStatSoft guidelines. `ggplot2` doesn't exist as an object, and how/why the unquoted form works may as well be magic to people at this stage. So from a consistency argument (why can't you do `install.packages(ggplot2)`? (rhetorical)) the quoted version is preferred.
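To make the consistency argument concrete, here is a minimal sketch. It assumes a plain R session and uses the base `stats` package as a stand-in so that nothing needs to be installed. The variable names `pkg_as_object` and `pkg` are made up for illustration.

```r
# Both forms attach the package. The unquoted form only works because
# library() captures the symbol instead of evaluating it.
library(stats)
library("stats")

# Most functions evaluate their arguments normally, so a bare package
# name is looked up as an object. Here that lookup is expected to fail,
# and we capture the error message instead of stopping.
pkg_as_object <- tryCatch(
  identity(ggplot2),
  error = function(e) conditionMessage(e)
)
print(pkg_as_object)

# The quoted form also carries over to package names held in a variable.
pkg <- "stats"
library(pkg, character.only = TRUE)
```

Sticking to the quoted form means learners only have to remember one rule for `library()` and `install.packages()`.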
Thanks for all the comments and suggestions! My responses to you all are mixed together but I've tried to put them in the order of the lesson itself... You can see my most recent edits here as I don't think it's been merged yet.

First, a note on audience. I believe that, as with the Python capstone, the audience is intended to be biologists with an idea of why we'd do an RNAseq experiment but no / very little programming experience.

Introduction - I removed jargon by shortening sentences rather than re-writing, as I realised it was unnecessary :) I think that covering the full workflow of sequencing analysis, including explaining what FASTQ files are, would be too much for a 1.5-2hr lesson so opted to avoid mentioning them instead.

Is there a way to get

Gene position information - 'what are we doing instead?' I don't really know what you mean here. We don't need to know where the genes are for this analysis. I've expanded that sentence a bit which might make it clearer.

Copying countData - my original idea was to add the mean columns to the data.frame and then select only the necessary columns when later converting to a matrix for DESeq (although all the rows with a mean expression of zero will have been removed...), then creating and working on a copy was suggested. Alternatively we could create a new data frame just for the mean data? I've left this bit as is for now as I'm not sure what would be preferable.

Answer to 'are there any outliers?' is kinda subjective (and hard to tell on these plots!) :) this is why we use DESeq! It's more a point for discussion than supposed to have a definitive answer.

Explaining Bioconductor - what additional explanation do you think is needed here?
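On the countData question, the two options could be sketched roughly as below. This is only an illustration on a toy table: the gene names, sample names, values and the `ctl_mean` / `trt_mean` column names are all made up, and the real lesson data will look different.

```r
# Toy counts table; gene names, sample names and values are made up.
countData <- data.frame(
  ctl1 = c(10, 0, 250),
  ctl2 = c(12, 0, 300),
  trt1 = c(30, 0, 120),
  trt2 = c(28, 0, 100),
  row.names = c("geneA", "geneB", "geneC")
)
ctl <- c("ctl1", "ctl2")
trt <- c("trt1", "trt2")

# Option 1: work on a copy with the mean columns appended, then keep
# only the sample columns when building the matrix for DESeq.
countsWithMeans <- countData
countsWithMeans$ctl_mean <- rowMeans(countData[, ctl])
countsWithMeans$trt_mean <- rowMeans(countData[, trt])
countMatrix <- as.matrix(countsWithMeans[, c(ctl, trt)])

# Option 2: leave countData untouched and keep the means in a
# separate data frame that shares its row names.
meanData <- data.frame(
  ctl_mean = rowMeans(countData[, ctl]),
  trt_mean = rowMeans(countData[, trt]),
  row.names = rownames(countData)
)
```

Option 2 avoids the later column selection, at the cost of having two objects to keep in sync if rows are filtered.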
Hi and thanks for these changes, this is already much better! You can get the output of

With "what are we doing instead?" I wanted to ask you to motivate and define the next step in the analysis. Similarly with the answer to "are there outliers?": your point is that it is hard to tell, so I suggest stating that in the document. This way you drive home the point of motivating DESeq. Ideally you refer back to this plot after your improved analysis and plot.

I don't remember what you mean by pseudocounts, but I suggest either making the point here clearer or dropping it altogether.
added suggestions from pull request
The idea here is to show learners how to apply novice R lessons to bioinformatics. This is a very hastily prepared pull request for work in progress, but contributions and suggestions are very welcome!