-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathweek4.Rmd
568 lines (471 loc) · 18.4 KB
/
week4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
---
title: "Week 4"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Confounding Exercises
```{r }
library(dagdata)
data(admissions)
```
```{r }
print( admissions )
```
## Confounding Exercises #1
Let's compute the proportion of men who were accepted:
```{r }
index = which(admissions$Gender==1)
accepted = sum(admissions$Number[index] * admissions$Percent[index]/100)
applied = sum(admissions$Number[index])
accepted/applied
```
What is the proportion of women that were accepted?
Note: The code sample above gives the proportion of men accepted. Alter the code to find the proportion of women instead.
```{r }
index = which(admissions$Gender==0)
accepted = sum(admissions$Number[index] * admissions$Percent[index]/100)
applied = sum(admissions$Number[index])
accepted/applied
```
## Confounding Exercises #2
Now that we have observed different acceptance rates between genders, test for the significance of this result.
If you perform a chi-square independence test, what is the p-value?
Hint: create a table that has the totals for accepted and not-accepted by gender then use chisq.test().
```{r }
index = admissions$Gender==1
men = admissions[index,]
women = admissions[!index,]
menYes = sum(men$Number*men$Percent/100)
menNo = sum(men$Number*(1-men$Percent/100))
womenYes = sum(women$Number*women$Percent/100)
womenNo = sum(women$Number*(1-women$Percent/100))
tab = matrix(c(menYes,womenYes,menNo,womenNo),2,2)
chisq.test(tab)$p.value
```
This difference actually led to a lawsuit.
Now notice that looking at the data by major, the differences disappear.
```{r }
index = admissions$Gender==1
men = admissions[index,]
women = admissions[!index,]
print( data.frame( major=admissions[1:6,1],men=men[,3], women=women[,3]) )
```
How can this be? This is referred to as Simpson's Paradox.
In the following questions we will try to decipher why this is happening.
## Confounding Exercises #3
We can quantify how "hard" a major is using the percent of students that were accepted. Compute the percent that were accepted (regardless of gender) to each major and call this vector H.
Which is the hardest major?
```{r }
H <- admissions[,c("Major","Percent")]
min_per <- which.min(H$Percent)
H$Major[min_per]
```
Another way to solve it
```{r }
major = admissions[1:6,1]
men = admissions[1:6,]
women =admissions[7:12,]
H = (men$Number*men$Percent/100 + women$Number*women$Percent/100) / (men$Number+women$Number)
major[which.min(H)]
```
## Confounding Exercises #4
What proportion of students is admitted for the hardest major from Confounding Exercises #3?
```{r }
min(H)
```
## Confounding Exercises #5
For men, what is the correlation between the number of applications across majors and H?
```{r }
cor(H,men$Number)
```
## Confounding Exercises #6
For women, what is the correlation between the number of applications across majors and H?
```{r }
cor(H,women$Number)
```
## Confounding in Genomics Exercises
```{r }
library(Biobase)
library(GSE5859)
data(GSE5859)
```
We can extract the gene expression data and sample information table using the Bioconductor functions exprs() and pData() like this:
```{r }
geneExpression = exprs(e)
sampleInfo = pData(e)
```
## Confounding in Genomics Exercises #1
Familiarize yourself with the sampleInfo table. Note that some samples were processed at different times. This is an extraneous variable and should not affect the values in geneExpression. However, as we have seen in previous analyses, it does appear to have an effect, so we will explore this here.
You can extract the year from each date like this:
```{r }
year = format(sampleInfo$date,"%y")
```
```{r }
length( unique(year) )
```
unique years for which we have data.
For how many of these years do we have more than one ethnicity represented?
```{r }
table(year, sampleInfo$ethnicity)
```
Two of the years have more than one ethnicity represented.
## Confounding in Genomics Exercises #2
Repeat the above exercise but now instead of year consider the month as well. Specifically, instead of the year variable defined above, use:
```{r }
month.year = format(sampleInfo$date, "%m%y")
```
For what proportion of these month.year values do we have more than one ethnicity represented?
```{r }
tab = table(month.year, sampleInfo$ethnicity)
print(tab)
x = rowSums (tab != 0)
mean (x >=2)
```
Note that this implies that month.year and ethnicity are almost completely confounded. This means that it is hard to separate effects due to date from effects due to our outcome of interest.
## Confounding in Genomics Exercises #3
Perform a t-test (use rowttests() from the genefilter package) comparing CEU samples processed in 2002 to those processed in 2003. Then use the qvalue package to obtain q-values for each gene.
How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
year = factor( format(sampleInfo$date,"%y") )
index = which(year%in% c("02","03") & sampleInfo$ethnicity=="CEU")
year = droplevels(year[index])#remove unused levels from the factor variable.
pval = rowttests(geneExpression[ ,index], year)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05) #counts the number of genes with q-values below 0.05
```
What is the estimate of pi0 provided by qvalue()?
```{r }
print(qvalue(pval)$pi0 )
```
Note that the estimated percentage of genes that are differentially expressed is above 30%. This is one way to show the magnitude of the effect processing date has on the measurements.
## Confounding in Genomics Exercises #4
Now perform a t-test (use rowttests()) comparing CEU samples processed in 2003 to CEU samples processed in 2004. Then use the qvalue package to obtain q-values for each gene.
How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
year = factor( format(sampleInfo$date,"%y") )
index = which(year%in% c("03","04") & sampleInfo$ethnicity=="CEU")
year = droplevels(year[index])#remove unused levels from the factor variable.
pval = rowttests(geneExpression[ ,index], year)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05) #counts the number of genes with q-values below 0.05
```
## Confounding in Genomics Exercises #5
Now we are going to compare ethnicities as was done in the original publication in which these data were first presented. Use the rowttests() function to compare the ASN population to the CEU population. Once again, use the qvalue() function to obtain q-values.
How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
index = which(sampleInfo$ethnicity%in% c("CEU","ASN"))
g = droplevels(sampleInfo$ethnicity[index])
pval = rowttests(geneExpression[ ,index], g)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05)
```
## Confounding in Genomics Exercises #6
Note that over 80% of genes are called differentially expressed between ethnic groups. However, due to the confounding with processing date, we need to confirm these differences are actually due to ethnicity. This will not be easy due to the almost perfect confounding. However, above we noted that two groups were represented in 2005. Just like we stratified by majors to remove the "major effect" in our admissions example, here we can stratify by year and perform a t-test comparing ASN and CEU, but only for samples processed in 2005.
How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
year = factor( format(sampleInfo$date,"%y") )
index = which(sampleInfo$ethnicity%in% c("CEU","ASN") & year=="05")
g = droplevels(sampleInfo$ethnicity[index])
pval = rowttests(geneExpression[ ,index], g)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05)
```
Note the dramatic drop in the number of genes with q-value < 0.05 when we fix the year. However, the sample size is much smaller in this latest analysis which means we have less power:
```{r }
table(sampleInfo$ethnicity[index])
```
## Confounding in Genomics Exercises #7
To provide a more balanced comparison, we repeat the analysis but now by taking 3 random CEU samples from 2002. Repeat the analysis above but comparing the ASN from 2005 to three random CEU samples from 2002. Set the seed at 3, set.seed(3), before random sampling.
How many genes have q-values < 0.05?
```{r }
library(qvalue)
library(genefilter)
year = factor(format(sampleInfo$date, "%y"))
index1 = which(sampleInfo$ethnicity=="ASN" & year=="05")
set.seed(3)
index2 = sample(which(sampleInfo$ethnicity =="CEU" & year=="02"),3)
index = c(index1, index2)
g = droplevels(sampleInfo$ethnicity[index])
pval = rowttests(geneExpression[,index],g)$p.value
qval = qvalue(pval)
sum(qval$qvalue < 0.05)
```
## EDA with PCA
## Variance explained
```{r }
library(RColorBrewer)
cols=colorRampPalette(rev(brewer.pal(11,"RdBu")))(100)
n <- ncol(y)
image(1:n,1:n,cor(y),xlab="samples",ylab="samples",col=cols,zlim=c(-1,1))
```
One simple exploratory plot we make to determine how many principal components we need to describe this _structure_ is the variance-explained plot. This is what the variance explained for the PCs would look like if data were independent :
```{r }
y0 <- matrix(rnorm(nrow(y)*ncol(y)), nrow(y),ncol(y))
d0 <- svd(y0)$d
plot(d0^2/sum(d0^2),ylim=c(0,.25))
```
Instead we see this:
```{r }
plot(s$d^2/sum(s$d^2))
```
## MDS plot
One way to explore the relationship
between variables of interest and PCs is to use color to denote these variables. For example, here are the first two PCs with color representing ethnicity:
```{r }
cols = as.numeric(eth)
mypar()
plot(s$v[,1],s$v[,2],col=cols,pch=16,
xlab="PC1",ylab="PC2")
legend("bottomleft",levels(eth),col=seq(along=levels(eth)),pch=16)
```
There is a very clear association between the first PC and ethnicity. However, we also see that for the orange points there are sub-clusters. We know from previous analyses that ethnicity and preprocessing date are correlated:
```{r }
year = factor(format(dates,"%y"))
table(year,eth)
```
look at the year:
```{r }
cols = as.numeric(year)
mypar()
plot(s$v[,1],s$v[,2],col=cols,pch=16,
xlab="PC1",ylab="PC2")
legend("bottomleft",levels(year),col=seq(along=levels(year)),pch=16)
```
## Boxplot of PCs
```{r }
month <- format(dates, "%y%m")
length(unique(month))
```
Because there are so many months (21), it becomes complicated to use color. Instead we can stratify by month and look at boxplots of our PCs:
```{r }
variable <- as.numeric(month)
mypar(2,2)
for(i in 1:4){
boxplot(split(s$v[,i],variable),las=2,range=0)
stripchart(split(s$v[,i],variable),add=TRUE,vertical=TRUE,pch=1,cex=.5,col=1)
}
```
## Modeling Batch Effects Exercises
```{r }
library(GSE5859Subset)
data(GSE5859Subset)
```
Here we purposely confounded month and group (sex) but not completely:
```{r }
sex = sampleInfo$group
month = factor(format(sampleInfo$date,"%m"))
table(sampleInfo$group, month)
```
## Modeling Batch Effects Exercises #1
Using the functions rowttests() and qvalue() compare the two groups, in this case males and females coded in sex. Because this is a smaller dataset, which decreases our power, we will use a more lenient FDR cut-off of 10%.
How many gene have q-values less than 0.1?
```{r }
library(qvalue)
library(genefilter)
pvals <- rowttests(geneExpression,factor(sampleInfo$g))$p.value
qvals = qvalue(pvals)$qvalues
sum(qvals<0.1)
```
## Modeling Batch Effects Exercises #2
Note that sampleInfo$group here represents males and females. Thus we expect differences to be on chrY and, for genes that escape inactivation, chrX. Note that we do not expect many autosomal genes to be different between males and females. This gives us an opportunity to evaluate false and true positives with experimental data. For example, we evaluate results using the proportion genes of the list that are on chrX or chrY.
For the list of genes with q<0.1 calculated in Modeling Batch Effects Exercises #1, what proportion of genes are on chrX or chrY?
```{r }
index = geneAnnotation$CHR[qvals<0.1]%in%c("chrX","chrY")
mean(index)
```
## Modeling Batch Effects Exercises #3
Now for the autosomal genes (not on chrX and chrY) for which q-value < 0.1 perform a t-test comparing samples processed in June to those processed in October.
What proportion of these have p-values < 0.05?
```{r }
index = which(qvals<0.1 & !geneAnnotation$CHR%in%c("chrX","chrY"))
month = factor(format(sampleInfo$date, "%m"))
pvals = rowttests(geneExpression[index,],month)$p.value
mean(pvals<0.05)
```
## Modeling Batch Effects Exercises #5
Now use the X defined above to fit a regression model using lm for each gene. Note that you can obtain p-values for estimated parameters using summary(). Here is an example:
```{r }
X = model.matrix(~sex+month)
i = 234
y = geneExpression[i,]
fit = lm(y~X-1)
summary(fit)$coef
```
```{r }
library(qvalue)
pvals = sapply(1:nrow(geneExpression),function(i){
y = geneExpression[i,]
fit = lm(y~X-1)
summary(fit)$coef[2,4]
})
qvals = qvalue(pvals)$qvalue
sum(qvals<0.1)
```
## Modeling Batch Effects Exercises #6
With this new list, what proportion of these are chrX and chrY?
```{r }
index = geneAnnotation$CHR[qvals<0.1]%in%c("chrX","chrY")
mean(index)
```
## Modeling Batch Effects Exercises #7
Now, from the linear model in Modeling Batch Effects Exercises #6, extract the p-values related to the coefficient representing the October versus June differences using the same linear model.
How many of the q-values for the month comparison are < 0.1 now?
```{r }
pvals = sapply(1:nrow(geneExpression),function(i){
y = geneExpression[i,]
fit = lm(y~X)
summary(fit)$coef[3,4]
})
qvals = qvalue(pvals)$qvalue
sum(qvals<0.1)
```
## Factor Analysis Exercises
```{r }
library(Biobase)
library(GSE5859Subset)
data(GSE5859Subset)
```
```{r }
y = geneExpression - rowMeans(geneExpression)
```
Compute and plot an image of the correlation for each sample. Make two image plots of these correlations. In the first one, plot the correlation as image. In the second, order the samples by date and then plot the an image of the correlation. The only difference in these plots is the order in which the samples are plotted.
```{r }
library(rafalib)
sex = sampleInfo$group
mypar(1,2)
cors = cor(y)
image(cors)
o = order(sampleInfo$date)
image(cors[o,o])
```
A: The fact that in the plot ordered by month we see two groups mainly driven by month and within these, we see subgroups driven by date seems to suggest date more than month per se are the hidden factors.
## Factor Analysis Exercises #2
Based on the correlation plots above, we could argue that there are at least two hidden factors. Using PCA estimate these two factors. Specifically, apply the svd() to y and use the first two PCs as estimates.
Which command gives us these estimates?
```{r }
pcs = svd(y)$v[,1:2]
```
## Factor Analysis Exercises #3
Plot each of the estimated factor ordered by date. Use color to denote month. The first factor is clearly related to date.
Which of the following appear to be most different according to this factor?
```{r }
pcs = svd(y)$v[,1:2]
o = order(sampleInfo$date)
cols = as.numeric(month)[o]
mypar(2,1)
for(i in 1:2){
plot(pcs[o,i],col=cols,xaxt="n",xlab="")
label = gsub("2005-","",sampleInfo$date[o])
axis(1,1:ncol(y),label,las=2)
}
```
we see that the first factor changes
A: June 23 and June 27
## Factor Analysis Exercises #4
Use the svd() function to obtain the principal components (PCs) for our detrended gene expression data y.
How many principal components (PCs) explain more than 10% each of the variability?
varexplained is then calculated as the ratio of each squared singular value to the total sum of squared singular values. It represents the proportion of variability explained by each principal component.
```{r }
s = svd(y)
varexplained = s$d^2/ sum(s$d^2)
plot(varexplained)
sum(varexplained>0.10)
```
## Factor Analysis Exercises #5
Which PC most correlates (negative or positive correlation) with month?
```{r }
month = factor( format(sampleInfo$date,"%m"))
cors = cor( as.numeric(month),s$v)
plot(t(cors))
which.max(abs(cors))
max(abs(cors))
```
## Factor Analysis Exercises #6
```{r }
sex = factor( format(sampleInfo$group))
cors = cor( as.numeric(sex),s$v)
plot(t(cors))
which.max(abs(cors))
max(abs(cors))
```
## Factor Analysis Exercises #7
Now instead of using month, which we have shown does not quite describe the batch, add the two estimated factors in Factor Analysis Exercises #6 to the linear model we used in previous exercises:
```{r }
X <- model.matrix(~sex+s$v[,1:2])
```
Apply this model to each gene, and compute q-values for the sex difference.
How many q-values are <0.1 for the sex comparison?
```{r }
pvals = sapply(1:nrow(geneExpression),function(i){
y = geneExpression[i,]
fit = lm(y~X-1) #-1 means no intercept
summary(fit)$coef[2,4]
})
qvals = qvalue(pvals)$qvalue
sum(qvals<0.1)
```
What proportion of the genes are on chrX and chrY?
```{r }
index = geneAnnotation$CHR[qvals<0.1]%in%c("chrX","chrY")
mean(index)
```
## Surrogate Variable Analysis (SVA)
```{r }
#BiocManager::install("sva")
library(sva)
library(Biobase)
library(GSE5859Subset)
data(GSE5859Subset)
```
## SVA Exercises #1
In the previous section we estimated factors using PCA. But we noted that the first factor was correlated with our outcome of interest:
```{r }
s <- svd(geneExpression-rowMeans(geneExpression))
cor(sampleInfo$group,s$v[,1])
```
As in the previous questions we are interested in finding genes that are differentially expressed between the two groups (males and females in this case). Here we learn to use SVA to estimate these effects while using a factor analysis approach to account for batch effects.
The svafit() function estimates factors, but downweighting the genes that appear to correlate with the outcome of interest. It also tries to estimate the number of factors and returns the estimated factors like this:
```{r }
sex = sampleInfo$group
mod = model.matrix(~sex)
svafit = sva(geneExpression,mod)
head(svafit$sv)
```
Note that the resulting estimated factors are not that different from the PCs:
```{r }
for(i in 1:ncol(svafit$sv)){
print( cor(s$v[,i],svafit$sv[,i]) )
}
```
Now fit a linear model to estimate the difference between males and females for each gene but that instead of accounting for batch effects using month it includes the factors estimated by sva in the model. Use the qvalue() function to estimate q-values.
How many genes have q-value < 0.1?
```{r }
library(qvalue)
library(sva)
X = model.matrix(~sex+svafit$sv)
pvals = sapply(1:nrow(geneExpression),function(i){
y = geneExpression[i,]
fit = lm(y~X-1)
summary(fit)$coef[2,4]
})
qvals = qvalue(pvals)$qvalue
sum(qvals<0.1)
```
## SVA Exercises #2
What proportion of the genes from SVA Exercises #1 are from chrY or chrX?
```{r }
index = geneAnnotation$CHR[qvals<0.1]%in%c("chrX","chrY")
mean(index)
```
```{r }
```