---
title: "Multiple Regression with R"
author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_document:
    theme: united
    highlight: tango
    code_folding: show
    toc: true
    toc_depth: 2
    toc_float: true
    fig_width: 8
    fig_height: 6
---
<!--- rmarkdown::render("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/multiple_regression.Rmd") --->
```{r message = FALSE, warning = FALSE, echo = FALSE}
# change working directory: should be the directory containing the Markdown files:
#setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/")
```
# Section 1: Multiple Regression
The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Now we will expand this by considering `Height` as another predictor.
Start by plotting the dataset:
```{r}
plot(trees)
```
This plots every pair of variables against each other, giving a visual overview of the correlations within the dataset.
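As a numerical complement to the scatterplot matrix (a supplementary step, not part of the core walkthrough), the pairwise correlations can also be printed:

```{r}
# Pairwise correlations between Girth, Height and Volume
cor(trees)
```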
Re-create the original model of `Volume` against `Girth`:
```{r}
m1 = lm(Volume~Girth,data=trees)
summary(m1)
```
Now include `Height` as an additional variable:
```{r}
m2 = lm(Volume~Girth+Height,data=trees)
summary(m2)
```
Note that the R^2 has improved, yet the `Height` term is less significant than the other two parameters.
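As an aside, the improvement from adding `Height` can also be assessed formally with a nested-model F-test, which compares the two fits directly:

```{r}
# F-test: does adding Height significantly improve on the Girth-only model?
anova(m1, m2)
```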
Try including the interaction term between `Girth` and `Height`:
```{r}
m3 = lm(Volume~Girth*Height,data=trees)
summary(m3)
```
All terms are highly significant. Note that `Height` is more significant than in the previous model, despite the introduction of an additional parameter.
We'll now try a different functional form - rather than looking for an additive model, we can explore a multiplicative model by applying a log-log transformation (leaving out the interaction term for now).
```{r}
m4 = lm(log(Volume)~log(Girth)+log(Height),data=trees)
summary(m4)
```
All terms are significant. Note that the residual standard error is much lower than for the previous models. However, this value cannot be compared with the previous models due to transforming the response variable. The R^2 value has increased further, despite reducing the number of parameters from four to three.
```{r}
confint(m4)
```
Looking at the confidence intervals for the parameters reveals that the estimated power of `Girth` is around 2, and `Height` around 1. This makes a lot of sense, given the well-known dimensional relationship between `Volume`, `Girth` and `Height`!
For completeness, we'll now add the interaction term.
```{r}
m5 = lm(log(Volume)~log(Girth)*log(Height),data=trees)
summary(m5)
```
The R^2 value has increased (of course, as all we've done is add an additional parameter), but interestingly none of the four terms is significant. This means that no individual term alone is vital to the model: there is duplication of information between the variables. So we will revert to the previous model.
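The lack of individual significance reflects collinearity: the log-transformed predictors and their interaction carry overlapping information. As a quick supplementary check:

```{r}
# The predictors and their interaction are strongly correlated, which
# inflates standard errors and masks individual significance
cor(cbind(lg   = log(trees$Girth),
          lh   = log(trees$Height),
          lglh = log(trees$Girth) * log(trees$Height)))
```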
Given that it would be reasonable to expect the power of `Girth` to be 2, and that of `Height` to be 1, we will now fix those parameters and estimate only the one remaining parameter.
```{r}
m6 = lm(log(Volume)-log((Girth^2)*Height)~1,data=trees)
summary(m6)
```
Note that there is no R^2 (as only the intercept was included in the model), and that the Residual Standard Error is incomparable with previous models due to changing the response variable.
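Since the model was fitted on the log scale, the intercept estimates the logarithm of the constant in `Volume = beta0 * Girth^2 * Height`. As a supplementary step, back-transforming gives the constant on the original scale:

```{r}
# The intercept is log(beta0); exponentiate to interpret it on the
# original Volume scale, along with its confidence interval
exp(coef(m6))
exp(confint(m6))
```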
We can alternatively construct a model with the untransformed `Volume` as the response, making the error term additive rather than multiplicative.
```{r}
m7 = lm(Volume~0+I(Girth^2):Height,data=trees)
summary(m7)
```
Note that the parameter estimates for the last two models differ slightly; this is due to the difference in the error model.
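To see how similar the two fits are in practice, the fitted volumes can be compared on the original scale (a suggested sketch; note that for `m6` the fitted values must be back-transformed):

```{r}
# Back-transform m6's fitted values to the Volume scale, then compare
# both sets of fitted values against the observed volumes
fit6 = exp(fitted(m6) + log(trees$Girth^2 * trees$Height))
fit7 = fitted(m7)
plot(trees$Volume, fit6, xlab = "Observed Volume", ylab = "Fitted Volume")
points(trees$Volume, fit7, pch = 2)
abline(0, 1)
```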
# Section 2: Model Selection
Of the last two models, the one with the log-Normal error model would seem to have the more Normal residuals. This can be inspected by looking at diagnostic plots, and by using `shapiro.test()`:
```{r}
plot(m6)
plot(m7)
shapiro.test(residuals(m6))
shapiro.test(residuals(m7))
```
The Akaike Information Criterion (AIC) can help to make decisions regarding which model is the most appropriate. Now calculate the AIC for each of the above models:
```{r}
summary(m1)
AIC(m1)
summary(m2)
AIC(m2)
summary(m3)
AIC(m3)
summary(m4)
AIC(m4)
summary(m5)
AIC(m5)
summary(m6)
AIC(m6)
summary(m7)
AIC(m7)
```
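The same values can be collected more compactly (a supplementary step; recall that `m4`-`m6` model `log(Volume)`, so their AICs are not comparable with the others):

```{r}
# Gather all AIC values into a single named vector
sapply(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4,
            m5 = m5, m6 = m6, m7 = m7), AIC)
```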
Whilst the AIC can help differentiate between similar models, it cannot be used to choose between models with different response variables. Which model would you select as the most appropriate?
# Section 3: Stepwise Regression
The in-built dataset `swiss` contains data pertaining to fertility, along with a variety of socioeconomic indicators. We want to select a sensible model using stepwise regression. First regress `Fertility` against all available indicators:
```{r}
m8 = lm(Fertility~.,data=swiss)
summary(m8)
```
Are all terms significant?
Now use stepwise regression, performing backward elimination in order to automatically remove inappropriate terms:
```{r}
library(MASS)
summary(stepAIC(m8))
```
Are all terms significant? Is this model suitable? What are the pros and cons of this approach?
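`stepAIC()` can also search forwards or in both directions. As a supplementary experiment, `direction = "both"` allows a removed term to re-enter the model at a later step:

```{r}
# Stepwise search in both directions (trace = FALSE suppresses the
# step-by-step output)
summary(stepAIC(m8, direction = "both", trace = FALSE))
```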
# Section 4: Non-Linear Models
The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees.
In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Using Multiple Regression we trialled various models, including some that had multiple predictor variables and/or involved log-log transformations to explore power relationships.
However, due to limitations of the method, we were not able to explore other options such as a parameterised power relationship with an additive error model. We will now attempt to fit this model:
$y = \beta_0x_1^{\beta_1}x_2^{\beta_2}+\varepsilon$
Parameters for non-linear models may be estimated using the `nls()` function in R (part of the built-in `stats` package).
```{r}
volume = trees$Volume
height = trees$Height
girth = trees$Girth
m9 = nls(volume~beta0*girth^beta1*height^beta2,start=list(beta0=1,beta1=2,beta2=1))
summary(m9)
```
Note that the parameters `beta0`, `beta1` and `beta2` weren't defined prior to the function call - `nls` knew what to do with them. Also note that we had to provide starting points for the parameters. What happens if you change them?
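One informal check of the fit (a suggested aside) is to compare observed and fitted volumes:

```{r}
# Observed versus fitted values for the nls model; points should lie
# close to the identity line if the model fits well
plot(volume, fitted(m9), xlab = "Observed Volume", ylab = "Fitted Volume")
abline(0, 1)
```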
Are all terms significant? Is this model appropriate? What else could be tried to achieve a better model?
# Section 5: Practical Exercises
## Puromycin
The in-built R dataset `Puromycin` contains data regarding the reaction velocity versus
substrate concentration in an enzymatic reaction involving untreated cells or cells
treated with Puromycin.
- Plot `conc` (concentration) against `rate`. What is the nature of the relationship
between `conc` and `rate`?
- Find a transformation that linearises the data and stabilises the variance,
making it possible to use linear regression. Create the corresponding linear
regression model. Are all terms significant?
- Add the `state` term to the model. What type of variable is this? Is the
inclusion of this term appropriate?
- Now add a term representing the interaction between `rate` and `state`. Are all
terms significant? What can you conclude?
- Given this information, create the regression model you believe to be the most
appropriate for modelling `conc`. Regenerate the plot of `conc` against `rate`.
Draw curves corresponding to the fitted values of the final model onto this
plot (note that two separate curves should be drawn, corresponding to the
two levels of `state`).
## Attitude
The in-built R dataset `attitude` contains data from a survey of clerical employees.
- Create a linear model regressing `rating` on `complaints`, and store the model
in a variable.
- Use the step function to perform forward selection stepwise regression, in order to automatically add appropriate terms, using a command similar to:
`new_model = step(original_model,.~.+privileges+learning+raises+critical+advance)`
- Which term(s) were added? What is Akaike's Information Criterion (AIC) corresponding to the final model? Are all terms in the resulting model significant? Check diagnostic plots. Do you think this is a suitable model?