## This document is presented by Dr. Liam O'Connor Mueller for a lecture demonstration at the University of California, San Diego on the 25th of May 2021
### This document should be used in concert with your lecture notes! It is meant to help you separate the R code I demonstrate in class for theory purposes from the R code that has practical application. Use this document as a starting guide for the coding in your assignments.
#####The Binomial#####
#For nominal data with two possible outcomes: alive or dead, treatment or refusal, test positive or test negative, etc.
#The parameter is pi, which we estimate with p = x/n, where x is the number of "successes" and n is the number of trials
#The shape of the binomial can be estimated with the probability mass function:
dbinom(x = ,size = ,prob = ) #Do not worry about the log argument, we will look at that much later
#The cumulative mass function can be explored with:
pbinom(q = ,size = ,prob = ,lower.tail = ) #q is the quantile we are interested in
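#A quick sketch with made-up numbers: 10 trials, each a "success" with probability 0.6
dbinom(x = 7, size = 10, prob = 0.6)                    #P(exactly 7 successes)
pbinom(q = 7, size = 10, prob = 0.6, lower.tail = TRUE) #P(7 or fewer successes)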
#####The Poisson#####
#For count data, especially useful with data containing lots of zeros
#The parameter is lambda, which we estimate with l = sum(Y)/length(Y) (the sample mean), where Y is your count data
# The shape of a Poisson distribution can be estimated with the PMF:
dpois(x = ,lambda = )
#The Cumulative mass function:
ppois(q = ,lambda = ,lower.tail = )
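#A quick sketch with made-up numbers: counts with an average rate of 2.5 per sample
dpois(x = 0, lambda = 2.5)                    #P(a count of exactly zero)
ppois(q = 4, lambda = 2.5, lower.tail = TRUE) #P(a count of 4 or fewer)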
#####The Normal#####
#For continuous data with symmetry around a central tendency. The "bell" curve
#The location parameter mu is estimated with mean= sum(Y)/length(Y)
mean(Y)
#The shape parameter sigma is estimated with the standard deviation
sd(Y)
#The probability density function for a normal is:
dnorm(x = ,mean = ,sd = )
#The cumulative distribution function for a normal is:
pnorm(q = ,mean = ,sd = ,lower.tail = )
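#A quick sketch with made-up numbers: a trait with mean 170 and sd 10
dnorm(x = 180, mean = 170, sd = 10)                     #density at 180 (the height of the curve, not a probability!)
pnorm(q = 180, mean = 170, sd = 10, lower.tail = FALSE) #P(a value greater than 180)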
#Z transformations
#There is a special type of normal distribution called the standard normal, which we like because it has a mean of 0 and a standard deviation of 1
#We can transform our data to z units:
ZY<- (Y-mean(Y))/sd(Y)
#or
ZY<-scale(x = Y, center = TRUE, scale = TRUE)
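#Quick check on simulated (made-up) data: after the z transformation the mean is ~0 and the sd is 1
Ysim<- rnorm(n = 50, mean = 170, sd = 10)
ZYsim<- as.numeric(scale(x = Ysim, center = TRUE, scale = TRUE))
mean(ZYsim) #~0 (floating point rounding keeps it from being exactly 0)
sd(ZYsim)   #1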
#####Central Limit Theorem, law of large numbers, and the SE#####
#These two concepts are the foundation of probability. You won't need to do anything explicitly in R with them; just understand the theory.
#The standard error is the standard deviation of the distribution of sample means (from the central limit theorem) and is calculated as:
SE<- sd(Y)/sqrt(length(Y))
#We can use the standard error to determine the confidence intervals around our mean
CI<-mean(Y)+qnorm(p = )*SE #Where p is the quantile(s) you want (e.g. p = c(0.025,0.975) when the goal is to determine the 95% CI)
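#A worked sketch on simulated (made-up) data:
Ysim<- rnorm(n = 30, mean = 10, sd = 2)
SEsim<- sd(Ysim)/sqrt(length(Ysim))
mean(Ysim)+qnorm(p = c(0.025,0.975))*SEsim #lower and upper bounds of the 95% CI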
#####t tests#####
#The goal of a t test is to determine the probability of your data given the null hypothesis is true
# P(data|null=TRUE)
#We do this by calculating a t statistic and determining how likely it would be to have that t statistic or greater in magnitude given a certain t distribution.
theta<-#your null hypothesized value
SE<- #your calculated standard error
tstat<- (mean(Y)-theta)/SE
p.value<-pt(q = abs(tstat),df = ,lower.tail = FALSE)*2 #Using abs() with the upper tail works whatever sign your tstat has. Multiplied by 2 for a two-tailed t test (the one you often want to do!).
#Or, if you know what kind of t test you are doing, use the shortcuts (only for what are traditionally considered t tests, i.e., tests of a mean. We will see later that t tests are much broader)
#single sample
t.test(x = ,mu= )
#Paired
t.test(x = ,y= ,paired = TRUE)
#Two sample
t.test(x = ,y= )
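#A sketch of a two sample t test on simulated (made-up) groups:
GroupA<- rnorm(n = 20, mean = 5, sd = 1)
GroupB<- rnorm(n = 20, mean = 6, sd = 1)
t.test(x = GroupA, y = GroupB) #compare the two group means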
#####Tests for normality and transformations#####
#When using the normal distribution, we want the mean and standard deviation to be meaningful estimates, and that is only true if the data are (roughly) normally distributed. There are a few tests we can use to determine whether our data are normally distributed.
#The "eyeball" test
hist(Y) #Do our data look like a bell?
#The "best" "actual" test, the D'Agostino test. I like it because it tests for multiple possible issues with your data.
install.packages("fBasics") # you only need to do this once, ever.
library(fBasics) # you only need to do this once per session. Run this each time you are restarting R.
#This is a package; it increases the functionality of R.
normalTest(x = Y,method = "da")
?normalTest
#So, your data are not normal. You might want to use a transformation to make the data more closely resemble a normal distribution.
#Common transformations
LY<-log(Y+1)
#Back transform with
Y<-exp(LY)-1
SY<-sqrt(Y) #As long as all values in Y are >= 0
#Back transform with
(SY)^2
#There are plenty of others! Don't be afraid to look up other ones.
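#A sketch of the test-transform-retest loop on simulated (made-up) right-skewed data:
Yskew<- rexp(n = 100, rate = 1) #strongly right skewed
normalTest(x = Yskew, method = "da")  #will likely reject normality
LYskew<- log(Yskew+1)
normalTest(x = LYskew, method = "da") #hopefully closer to normal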
#Last ditch effort. If you can't find a transformation to work, the rank transformation is used:
RY<- rank(Y)
#Back transformation
#impossible, that's why we don't like it!
#####Covariation and correlation####
#Covariation is how two variables rise and fall together around a center point (the means of X and Y).
cov(X,Y)
#correlation is the covariation on the z-transformed data:
cov(scale(X),scale(Y))
cor(X,Y)
#A t test can be used to determine if the r we calculate is different from the null hypothesized value (zero, by default).
cor.test(X,Y)
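#A sketch on simulated (made-up) data where Y rises with X:
Xsim<- rnorm(50)
Ysim<- 2*Xsim + rnorm(50) #signal plus noise
cov(Xsim, Ysim)
cor(Xsim, Ysim)           #the same value as cov(scale(Xsim), scale(Ysim))
cor.test(Xsim, Ysim)      #t test of H0: the true correlation is 0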
#####General Linear Models#####
# A linear model where the variation in Y is some function of X
model<-lm(Y~X) #A linear model can be saved as an object
summary(model) #Test if the slope and intercept are different from zero
anova(model) #sums of squares
names(model) # A lot of useful information is hidden in the model object!
plot(model) #check residuals
#power
ModelR2<- summary(model)$r.squared #or compute it as SSregression over SStotal
f2<-ModelR2/(1-ModelR2) #useful for power calculation
install.packages("pwr")
library(pwr)
pwr.f2.test(u = ,v = ,f2 = ,sig.level = ,power = ) #Just like the other ones, leave one out to solve for.
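#A sketch solving for the error df (v) needed for 80% power, with a made-up "medium" effect size:
pwr.f2.test(u = 1, f2 = 0.15, sig.level = 0.05, power = 0.80) #v is left out, so the function solves for it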
#####Plots#####
par(mfrow=c(1,1)) #Set the number of plots in your window: c(how many rows, how many columns)
plot(X,Y) #see https://www.statmethods.net/graphs/index.html for all sorts of help.
abline(model) #draw the slope of your regression
pred.frame<-data.frame(X=seq(from=min(X),to=max(X),length=100))#creating a series of values of X
pred.c<-predict(model,interval="confidence",newdata=pred.frame) #fitted values with 95% confidence intervals
matlines(pred.frame$X,pred.c) #plot the fit and the 95% CI of the model; many graphical parameters can spruce this up.
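#A sketch putting those plotting steps together on simulated (made-up) data:
X<- rnorm(50, mean = 10, sd = 2)
Y<- 3 + 0.5*X + rnorm(50)
model<-lm(Y~X)
plot(X,Y)
abline(model)
pred.frame<-data.frame(X=seq(from=min(X),to=max(X),length=100))
pred.c<-predict(model,interval="confidence",newdata=pred.frame)
matlines(pred.frame$X,pred.c,lty=c(1,2,2),col="black")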
#####Multiple Linear Regression#####
#Remember the steps!
###1. Identify the question
###2. Examine the data
###3. Choose the appropriate test
###4. Build a full model
fullmod<-lm(Y~X1+X2+X3)
###5. Test model assumptions and modify model as necessary
library(car)
vif(fullmod)
?vif
#Remove factors which are collinear
###6. Use model selection methods and your brain to determine which model should be analyzed
Mymod<-step(fullmod) #Use AIC to determine the "best" model(s); save the selected model for the steps below
###7. Examine your final model for problems in your error structure and possible outliers
plot(Mymod)
###8. Test your hypothesis with the model chosen
summary(Mymod)
###9. Report the meaningful results
library(QuantPsyc)
lm.beta(Mymod)
###10. Visualize the model and data
### https://www.datamentor.io/r-programming/3d-plot/
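#A compact sketch of the whole workflow on simulated (made-up) data:
X1<- rnorm(100); X2<- rnorm(100); X3<- rnorm(100)
Ymlr<- 1 + 2*X1 - X2 + rnorm(100) #X3 has no real effect
fullmod<-lm(Ymlr~X1+X2+X3)
vif(fullmod)          #collinearity check (car package)
Mymod<-step(fullmod)  #AIC selection; should usually drop X3
summary(Mymod)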
#####ANOVA/ANCOVA/Factorial design#####
#ANOVA all factors are categorical
#Same steps as above, with modifications:
#Use Anova() instead of summary. Anova() is in the car package
library(car)
Anova(model) #This tests which parameters in your model are significant
#To determine which groups are different, use a Tukey test:
aov.mod<-aov(Y~X) #create aov object which is required for TukeyHSD()
TukeyHSD(aov.mod) #Not for ancova, just ANOVA
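#A sketch using R's built-in PlantGrowth data (plant weight across a control and two treatments):
aov.mod<-aov(weight~group, data = PlantGrowth)
TukeyHSD(aov.mod) #pairwise group differences with adjusted p-values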
#####Interactions#####
Factormod<-lm(Y~X1*X2) # is shorthand for
Factormod<-lm(Y~X1+X2+X1:X2) # this
vif(Factormod) #Still important to examine collinearity in your main effects
step(Factormod) # Model selection still useful as well! AIC is an important tool, but make sure the model you choose is relevant to your hypothesis!
Anova(Factormod) #Consider what the p-values mean in the full model. A significant interaction term does not mean that all groups have slopes different from 0.
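#A sketch using R's built-in mtcars data: does the mpg~wt slope depend on transmission type (am)?
Factormod<-lm(mpg~wt*factor(am), data = mtcars)
Anova(Factormod) #the wt:factor(am) row tests the interaction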
#####Mixed Effect Models/Random Effects#####
library(lme4)
Random_inter <-lmer(Y~X1 + (1|R)) #where X1 is our fixed effect and R is our random intercept.
Random_slope <-lmer(Y~X1 + (1+X1|R)) #Now both the intercept and slope can vary among groups in the random effect.
Nested_Mod <- lm(Y~X1/R) #Where R is our hierarchical grouping factor; X1/R is shorthand for X1 + X1:R (R nested within X1)
###Hypothesis testing can be difficult in mixed effect models, especially when considering the error degrees of freedom (go back to your lecture notes!). Sometimes the best choice is a likelihood ratio test between our full model and a null model.
Null_Mod <- lmer(Y~1 + (1|R)) #A model with no fixed effect, a null model (the random effect must stay in parentheses)
anova(Random_inter,Null_Mod) #Performs a likelihood ratio test.
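#A sketch using lme4's built-in sleepstudy data (reaction time over days, grouped by subject):
Random_inter<-lmer(Reaction~Days + (1|Subject), data = sleepstudy)
Null_Mod<-lmer(Reaction~1 + (1|Subject), data = sleepstudy)
anova(Random_inter, Null_Mod) #likelihood ratio test; lmer models are refit with ML automatically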