week4.Rmd

---
title: "Week 4"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Confounding Exercises


```{r }
library(dagdata)
data(admissions)
```

```{r }
print( admissions )
```
## Confounding Exercises #1

Let's compute the proportion of men who were accepted:
```{r }
index = which(admissions$Gender==1)
accepted = sum(admissions$Number[index] * admissions$Percent[index]/100)
applied = sum(admissions$Number[index])
accepted/applied
```

What is the proportion of women that were accepted?
Note: The code sample above gives the proportion of men accepted. Alter the code to find the proportion of women instead.
```{r }
index = which(admissions$Gender==0)
accepted = sum(admissions$Number[index] * admissions$Percent[index]/100)
applied = sum(admissions$Number[index])
accepted/applied
```

## Confounding Exercises #2
Now that we have observed different acceptance rates between genders, test for the significance of this result.

If you perform a chi-square independence test, what is the p-value?
Hint: create a table that has the totals for accepted and not-accepted by gender then use chisq.test().
```{r }
index = admissions$Gender==1
men = admissions[index,]
women = admissions[!index,]
menYes = sum(men$Number*men$Percent/100)
menNo = sum(men$Number*(1-men$Percent/100))
womenYes = sum(women$Number*women$Percent/100)
womenNo = sum(women$Number*(1-women$Percent/100))
tab = matrix(c(menYes,womenYes,menNo,womenNo),2,2)
chisq.test(tab)$p.value
```
This difference actually led to a lawsuit.

Now notice that looking at the data by major, the differences disappear.
```{r }
index = admissions$Gender==1
men = admissions[index,]
women = admissions[!index,]
print( data.frame( major=admissions[1:6,1],men=men[,3], women=women[,3]) )
```
How can this be? This is referred to as Simpson's Paradox.

In the following questions we will try to decipher why this is happening.

## Confounding Exercises #3
We can quantify how "hard" a major is using the percent of students that were accepted. Compute the percent that were accepted (regardless of gender) to each major and call this vector H.

Which is the hardest major?
```{r }
H <- admissions[,c("Major","Percent")]
min_per <- which.min(H$Percent)
H$Major[min_per]
```

Another way to solve it
```{r }
major = admissions[1:6,1]
men = admissions[1:6,]
women =admissions[7:12,]
H = (men$Number*men$Percent/100 + women$Number*women$Percent/100) / (men$Number+women$Number)
major[which.min(H)]
```

## Confounding Exercises #4

What proportion of students is admitted for the hardest major from Confounding Exercises #3?
```{r }
min(H)
```

## Confounding Exercises #5
For men, what is the correlation between the number of applications across majors and H?
```{r }
cor(H,men$Number)
```

## Confounding Exercises #6

For women, what is the correlation between the number of applications across majors and H?
```{r }
cor(H,women$Number)
```

## Confounding in Genomics Exercises

```{r }
library(Biobase)
library(GSE5859)
data(GSE5859)
```

We can extract the gene expression data and sample information table using the Bioconductor functions exprs() and pData() like this:
```{r }
geneExpression = exprs(e)
sampleInfo = pData(e)
```

## Confounding in Genomics Exercises #1
Familiarize yourself with the sampleInfo table. Note that some samples were processed at different times. This is an extraneous variable and should not affect the values in geneExpression. However, as we have seen in previous analyses, it does appear to have an effect, so we will explore this here.

You can extract the year from each date like this:
```{r }
year = format(sampleInfo$date,"%y")
```

```{r }
length( unique(year) )
```
unique years for which we have data.

For how many of these years do we have more than one ethnicity represented?
```{r }
table(year, sampleInfo$ethnicity)
```
Two of the years have more than one ethnicity represented.

## Confounding in Genomics Exercises #2

Repeat the above exercise but now instead of year consider the month as well. Specifically, instead of the year variable defined above, use:
```{r }
month.year = format(sampleInfo$date, "%m%y")
```

For what proportion of these month.year values do we have more than one ethnicity represented?
```{r }
tab = table(month.year, sampleInfo$ethnicity)
print(tab)
x = rowSums (tab != 0)
mean (x >=2)
```
Note that this implies that month.year and ethnicity are almost completely confounded. This means that it is hard to separate effects due to date from effects due to our outcome of interest.

## Confounding in Genomics Exercises #3
Perform a t-test (use rowttests() from the genefilter package) comparing CEU samples processed in 2002 to those processed in 2003. Then use the qvalue package to obtain q-values for each gene.

How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
year = factor( format(sampleInfo$date,"%y") )
index = which(year%in% c("02","03") & sampleInfo$ethnicity=="CEU")
year = droplevels(year[index])#remove unused levels from the factor variable.
pval = rowttests(geneExpression[ ,index], year)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05) #counts the number of genes with q-values below 0.05
```

What is the estimate of pi0 provided by qvalue()?
```{r }
print(qvalue(pval)$pi0 )
```
Note that the estimated percentage of genes that are differentially expressed is above 30%. This is one way to show the magnitude of the effect processing date has on the measurements.

## Confounding in Genomics Exercises #4

Now perform a t-test (use rowttests()) comparing CEU samples processed in 2003 to CEU samples processed in 2004. Then use the qvalue package to obtain q-values for each gene.

How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
year = factor( format(sampleInfo$date,"%y") )
index = which(year%in% c("03","04") & sampleInfo$ethnicity=="CEU")
year = droplevels(year[index])#remove unused levels from the factor variable.
pval = rowttests(geneExpression[ ,index], year)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05) #counts the number of genes with q-values below 0.05
```

## Confounding in Genomics Exercises #5

Now we are going to compare ethnicities as was done in the original publication in which these data were first presented. Use the rowttests() function to compare the ASN population to the CEU population. Once again, use the qvalue() function to obtain q-values.

How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
index = which(sampleInfo$ethnicity%in% c("CEU","ASN"))
g = droplevels(sampleInfo$ethnicity[index])
pval = rowttests(geneExpression[ ,index], g)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05)
```

## Confounding in Genomics Exercises #6

Note that over 80% of genes are called differentially expressed between ethnic groups. However, due to the confounding with processing date, we need to confirm these differences are actually due to ethnicity. This will not be easy due to the almost perfect confounding. However, above we noted that two groups were represented in 2005. Just like we stratified by majors to remove the "major effect" in our admissions example, here we can stratify by year and perform a t-test comparing ASN and CEU, but only for samples processed in 2005.

How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
year = factor( format(sampleInfo$date,"%y") )
index = which(sampleInfo$ethnicity%in% c("CEU","ASN") & year=="05")
g = droplevels(sampleInfo$ethnicity[index])
pval = rowttests(geneExpression[ ,index], g)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05)
```
Note the dramatic drop in the number of genes with q-value < 0.05 when we fix the year. However, the sample size is much smaller in this latest analysis which means we have less power:
```{r }
table(sampleInfo$ethnicity[index])
```

## Confounding in Genomics Exercises #7

To provide a more balanced comparison, we repeat the analysis but now by taking 3 random CEU samples from 2002. Repeat the analysis above but comparing the ASN from 2005 to three random CEU samples from 2002. Set the seed at 3, set.seed(3), before random sampling.

How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
year = factor(format(sampleInfo$date, "%y"))
index1 = which(sampleInfo$ethnicity=="ASN" & year=="05")
set.seed(3)
index2 = sample(which(sampleInfo$ethnicity =="CEU" & year=="02"),3)
index = c(index1, index2)
g = droplevels(sampleInfo$ethnicity[index])
pval = rowttests(geneExpression[,index],g)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05)
```

## EDA with PCA

## Variance explained

```{r }
library(RColorBrewer)
cols=colorRampPalette(rev(brewer.pal(11,"RdBu")))(100)
n <- ncol(y)
image(1:n,1:n,cor(y),xlab="samples",ylab="samples",col=cols,zlim=c(-1,1))
```

One simple exploratory plot we make to determine how many principal components we need to describe this _structure_ is the variance-explained plot. This is what the variance explained for the PCs would look like if data were independent :
```{r }
y0 <- matrix(rnorm(nrow(y)*ncol(y)), nrow(y),ncol(y))
d0 <- svd(y0)$d
plot(d0^2/sum(d0^2),ylim=c(0,.25))
```
Instead we see this:
```{r }
plot(s$d^2/sum(s$d^2))
```

## MDS plot
One way to explore the relationship
between variables of interest and PCs is to use color to denote these variables. For example, here are the first two PCs with color representing ethnicity:
```{r }
cols = as.numeric(eth)
mypar()
plot(s$v[,1],s$v[,2],col=cols,pch=16,
     xlab="PC1",ylab="PC2")
legend("bottomleft",levels(eth),col=seq(along=levels(eth)),pch=16)
```

There is a very clear association between the first PC and ethnicity. However, we also see that for the orange points there are sub-clusters. We know from previous analyses that ethnicity and preprocessing date are correlated:

```{r }
year = factor(format(dates,"%y"))
table(year,eth)
```
look at the year:
```{r }
cols = as.numeric(year)
mypar()
plot(s$v[,1],s$v[,2],col=cols,pch=16,
     xlab="PC1",ylab="PC2")
legend("bottomleft",levels(year),col=seq(along=levels(year)),pch=16)
```

## Boxplot of PCs

```{r }
month <- format(dates, "%y%m")
length(unique(month))
```
Because there are so many months (21), it becomes complicated to use color. Instead we can stratify by month and look at boxplots of our PCs:

```{r }
variable <- as.numeric(month)
mypar(2,2)
for(i in 1:4){
  boxplot(split(s$v[,i],variable),las=2,range=0)
  stripchart(split(s$v[,i],variable),add=TRUE,vertical=TRUE,pch=1,cex=.5,col=1)
}
```
## Modeling Batch Effects Exercises

```{r }
library(GSE5859Subset)
data(GSE5859Subset)
```

Here we purposely confounded month and group (sex) but not completely:
```{r }
sex = sampleInfo$group
month = factor(format(sampleInfo$date,"%m"))
table(sampleInfo$group, month)
```
## Modeling Batch Effects Exercises #1

Using the functions rowttests() and qvalue() compare the two groups, in this case males and females coded in sex. Because this is a smaller dataset, which decreases our power, we will use a more lenient FDR cut-off of 10%.

How many gene have q-values less than 0.1?
```{r }
library(qvalue)
library(genefilter)
pvals <- rowttests(geneExpression,factor(sampleInfo$g))$p.value
qvals = qvalue(pvals)$qvalues
sum(qvals<0.1)
```

## Modeling Batch Effects Exercises #2

Note that sampleInfo$group here represents males and females. Thus we expect differences to be on chrY and, for genes that escape inactivation, chrX. Note that we do not expect many autosomal genes to be different between males and females. This gives us an opportunity to evaluate false and true positives with experimental data. For example, we evaluate results using the proportion genes of the list that are on chrX or chrY.

For the list of genes with q<0.1 calculated in Modeling Batch Effects Exercises #1, what proportion of genes are on chrX or chrY?
```{r }
index = geneAnnotation$CHR[qvals<0.1]%in%c("chrX","chrY")
mean(index)
```

## Modeling Batch Effects Exercises #3
Now for the autosomal genes (not on chrX and chrY) for which q-value < 0.1 perform a t-test comparing samples processed in June to those processed in October.

What proportion of these have p-values < 0.05?
```{r }
index = which(qvals<0.1 & !geneAnnotation$CHR%in%c("chrX","chrY"))
month = factor(format(sampleInfo$date, "%m"))
pvals = rowttests(geneExpression[index,],month)$p.value
mean(pvals<0.05)
```

## Modeling Batch Effects Exercises #5

Now use the X defined above to fit a regression model using lm for each gene. Note that you can obtain p-values for estimated parameters using summary(). Here is an example:
```{r }
X = model.matrix(~sex+month)
i = 234
y = geneExpression[i,]
fit = lm(y~X-1)
summary(fit)$coef
```

```{r }
library(qvalue)
pvals = sapply(1:nrow(geneExpression),function(i){
    y = geneExpression[i,]
    fit = lm(y~X-1) 
    summary(fit)$coef[2,4]
})
qvals = qvalue(pvals)$qvalue
sum(qvals<0.1)
```

## Modeling Batch Effects Exercises #6
With this new list, what proportion of these are chrX and chrY?
```{r }
index = geneAnnotation$CHR[qvals<0.1]%in%c("chrX","chrY")
mean(index)
```

## Modeling Batch Effects Exercises #7

Now, from the linear model in Modeling Batch Effects Exercises #6, extract the p-values related to the coefficient representing the October versus June differences using the same linear model.

How many of the q-values for the month comparison are < 0.1 now?
```{r }
pvals = sapply(1:nrow(geneExpression),function(i){
  y = geneExpression[i,]
  fit = lm(y~X)
  summary(fit)$coef[3,4]
})
qvals = qvalue(pvals)$qvalue
sum(qvals<0.1)
```

## Factor Analysis Exercises

```{r }
library(Biobase)
library(GSE5859Subset)
data(GSE5859Subset)
```

```{r }
y = geneExpression - rowMeans(geneExpression)
```

Compute and plot an image of the correlation for each sample. Make two image plots of these correlations. In the first one, plot the correlation as image. In the second, order the samples by date and then plot the an image of the correlation. The only difference in these plots is the order in which the samples are plotted.
```{r }
library(rafalib)
sex = sampleInfo$group
mypar(1,2)
cors = cor(y)
image(cors)
o = order(sampleInfo$date)
image(cors[o,o])
```
A: The fact that in the plot ordered by month we see two groups mainly driven by month and within these, we see subgroups driven by date seems to suggest date more than month per se are the hidden factors.

## Factor Analysis Exercises #2

Based on the correlation plots above, we could argue that there are at least two hidden factors. Using PCA estimate these two factors. Specifically, apply the svd() to y and use the first two PCs as estimates.

Which command gives us these estimates?
```{r }
pcs = svd(y)$v[,1:2]
```

## Factor Analysis Exercises #3

Plot each of the estimated factor ordered by date. Use color to denote month. The first factor is clearly related to date.

Which of the following appear to be most different according to this factor?
```{r }
pcs = svd(y)$v[,1:2]
o = order(sampleInfo$date)
cols = as.numeric(month)[o]
mypar(2,1)
for(i in 1:2){
  plot(pcs[o,i],col=cols,xaxt="n",xlab="")
  label = gsub("2005-","",sampleInfo$date[o])
  axis(1,1:ncol(y),label,las=2)
}
```
we see that the first factor changes
A: June 23 and June 27

## Factor Analysis Exercises #4
Use the svd() function to obtain the principal components (PCs) for our detrended gene expression data y.

How many principal components (PCs) explain more than 10% each of the variability?

varexplained is then calculated as the ratio of each squared singular value to the total sum of squared singular values. It represents the proportion of variability explained by each principal component.
```{r }
s = svd(y)
varexplained = s$d^2/ sum(s$d^2)
plot(varexplained)
sum(varexplained>0.10)
```
## Factor Analysis Exercises #5

Which PC most correlates (negative or positive correlation) with month?
```{r }
month = factor( format(sampleInfo$date,"%m"))
cors = cor( as.numeric(month),s$v)
plot(t(cors))
which.max(abs(cors))
max(abs(cors))
```
## Factor Analysis Exercises #6

```{r }
sex = factor( format(sampleInfo$group))
cors = cor( as.numeric(sex),s$v)
plot(t(cors))
which.max(abs(cors))
max(abs(cors))
```
## Factor Analysis Exercises #7

Now instead of using month, which we have shown does not quite describe the batch, add the two estimated factors in Factor Analysis Exercises #6 to the linear model we used in previous exercises:
```{r }
X <- model.matrix(~sex+s$v[,1:2])
```

Apply this model to each gene, and compute q-values for the sex difference.

How many q-values are <0.1 for the sex comparison?
```{r }
pvals = sapply(1:nrow(geneExpression),function(i){
  y = geneExpression[i,]
  fit = lm(y~X-1) #-1 means no intercept
  summary(fit)$coef[2,4]
})
qvals = qvalue(pvals)$qvalue
sum(qvals<0.1)
```

What proportion of the genes are on chrX and chrY?
```{r }
index = geneAnnotation$CHR[qvals<0.1]%in%c("chrX","chrY")
mean(index)
```

## Surrogate Variable Analysis (SVA)

```{r }
#BiocManager::install("sva")
library(sva)
library(Biobase)
library(GSE5859Subset)
data(GSE5859Subset)
```
## SVA Exercises #1
In the previous section we estimated factors using PCA. But we noted that the first factor was correlated with our outcome of interest:
```{r }
s <- svd(geneExpression-rowMeans(geneExpression))
cor(sampleInfo$group,s$v[,1])
```

As in the previous questions we are interested in finding genes that are differentially expressed between the two groups (males and females in this case). Here we learn to use SVA to estimate these effects while using a factor analysis approach to account for batch effects.

The svafit() function estimates factors, but downweighting the genes that appear to correlate with the outcome of interest. It also tries to estimate the number of factors and returns the estimated factors like this:
```{r }
sex = sampleInfo$group
mod = model.matrix(~sex)
svafit = sva(geneExpression,mod)
head(svafit$sv)
```
Note that the resulting estimated factors are not that different from the PCs:
```{r }
for(i in 1:ncol(svafit$sv)){
  print( cor(s$v[,i],svafit$sv[,i]) )
}
```
Now fit a linear model to estimate the difference between males and females for each gene but that instead of accounting for batch effects using month it includes the factors estimated by sva in the model. Use the qvalue() function to estimate q-values.

How many genes have q-value < 0.1?
```{r }
library(qvalue)
library(sva)
X = model.matrix(~sex+svafit$sv)
pvals = sapply(1:nrow(geneExpression),function(i){
  y = geneExpression[i,]
  fit = lm(y~X-1)
  summary(fit)$coef[2,4]
})
qvals = qvalue(pvals)$qvalue
sum(qvals<0.1)
```

## SVA Exercises #2
What proportion of the genes from SVA Exercises #1 are from chrY or chrX?
```{r }
index = geneAnnotation$CHR[qvals<0.1]%in%c("chrX","chrY")
mean(index)
```

```{r }

```