-
Notifications
You must be signed in to change notification settings - Fork 23
/
Copy pathacath.qmd
271 lines (217 loc) · 9.42 KB
/
acath.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
---
title: "Diagnostic Model Using Duke Cardiac Catheterization Data"
author:
- name: Frank Harrell
affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
html:
self-contained: true
number-sections: true
number-depth: 3
anchor-sections: true
smooth-scroll: true
theme: journal
toc: true
toc-depth: 3
toc-title: Contents
toc-location: left
code-link: false
code-tools: true
code-fold: show
code-block-bg: "#f1f3f5"
code-block-border-left: "#31BAE9"
reference-location: margin
fig-cap-location: margin
execute:
warning: false
message: false
major: regression modeling; rms
minor: logistic
---
```{r setup}
require(rms)
require(ggplot2)
require(lattice)
require(qreport)
options(prType='html')
```
# Problem
* Probabilistic diagnosis of significant (coronary flow-limiting) coronary artery stenosis $~~\rightarrow~ \geq 70$ percent diameter narrowing in $\geq 1$ major coronary artery
+ Process leads to a model for pre-test probability of CAD
* Patients presenting with chest pain to Duke University Medical Center and undergoing left heart catheterization/angiogram for it
* Dataset: `acath`
* Number of observations: 2258 patients
* Response: `sigdz`
* Predictors: age, sex, cholesterol
* Statistical model: binary logistic regression model
* See [here](http://hbiostat.org/rmsc/lrm.html#assessment-of-model-fit) and [here](http://hbiostat.org/bbr/dx.html#example-diagnosis-of-coronary-artery-disease-cad)
* Question [Thanks to Tom Stewart for an earlier version of this example.]{.aside}
* Translate question to model
* Identify covariates for adjustment
+ Missing data strategy
* Determine parameter budget
+ Effective sample size --> 15:1 rule --> budget
+ Data reduction needed?
+ Interactions?
+ Non-linear terms
+ Allocation of parameters according to clinical understanding
* Fit the model
* Model diagnostics
+ Overly influential observations (dfbetas, dfits)
+ Residual diagnostics (linear model usually)
* Measures of model performance (Optimism corrected)
+ Discrimination
+ Calibration
* Understanding the model (Partial effect plots)
* Hypothesis tests
+ Total predictor impact
+ Linearity
+ Interaction
# Preparation
## Data Import
Exclude patients having missing cholesterol. This is not good statistical practice and is used only for simplicity.
```{r import}
getHdata(acath)
d <- upData(acath,
subset = ! is.na(choleste),
sex = factor(sex, 0:1, c('male', 'female')))
print(contents(d), levelType='table')
```
## Descriptive Statistics
```{r desc}
describe(d)
```
# Marginal Relationships
Estimate the probability that significant CAD will be found, using moving overlapping proportions and other methods. The probability estimates will be a function of age and sex then of cholesterol and age. Use the `movStats` function in `Hmisc`. Estimate probabilities using the `loess` nonparametric smoother, logistic regression (LR), and moving proportions in overlapping windows. Use the `addggLayers` function to add spike histograms and extended box plots to depict age and cholesterol marginal distributions. For the second plot, since age is a continuous variable, if we want to stratify on it vertically we must use an arbitrary information-losing categorization (we will model this interaction as continuous later). We use tertiles of age.
```{r results='asis'}
#| fig.height: 7.5
u <- movStats(sigdz ~ age + sex, loess=TRUE, lrm=TRUE,
data=d, melt=TRUE, pr='margin')
g <- ggplot(u, aes(x=age, y=sigdz, col=sex)) + geom_line() +
facet_grid(Type ~ Statistic) +
xlab(hlab(age)) + ylab(hlab(sigdz)) +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom')
g <- addggLayers(g, d, by='sex', value='age', pos='top')
addggLayers(g, d, by='sex', value='age', type='spike')
d <- upData(d, ageTertile = cut2(age, g=3))
u <- movStats(sigdz ~ choleste + ageTertile, loess=TRUE, lrm=TRUE,
data=d, melt=TRUE, pr='margin')
g <- ggplot(u, aes(x=choleste, y=sigdz, col=ageTertile)) + geom_line() +
facet_grid(Type ~ Statistic) +
xlab(hlab(choleste)) + ylab(hlab(sigdz)) +
guides(color=guide_legend(title='Age Tertile')) +
theme(legend.position='bottom')
g <- addggLayers(g, d, by='ageTertile', value='choleste', pos='top')
addggLayers(g, d, by='ageTertile', value='choleste', type='spike')
```
# Parameter Budget
* Smaller of the number of events and number of non-events: `r (m <- min(table(d$sigdz)))`
* Rough rule of thumb: this number / 15: `r round(m / 15, 1)`
* Data reduction/model simplification probably not needed
* Allocation of parameters according to clinical knowledge
* Age and cholesterol not known to act linearly
* Interactions:
+ Cholesterol known to be weaker risk factor as one ages
+ Women and men have different risk trajectories $\rightarrow$ age $\times$ sex interaction likely
* Let u and v be restricted cubic spline functions with 4 knots
+ 3 parameters = 1 linear + 2 nonlinear
* u is for age and v is for cholesterol (ch)
* Base model omits ch $\times$ age interaction
* Base model: logit = intercept + sex + u(age) + u(age) $\times$ sex + v(ch)
* Parameters: 1 + 1 + 3 + 3 + 3 = 11
* How to model age $\times$ ch interaction?
+ full spline interactions: u(age) $\times$ v(ch)
+ $3\times 3 = 9$ more parameters
+ restricted interaction surface (singly nonlinear): u(age) $\times$ ch + age $\times$ v(ch)
+ $3\times 1 + 1\times 3 - 1 = 5$ parameters (-1 so as not not include the linear $\times$ linear term twice)
+ linear $\times$ linear interaction: age $\times$ ch
+ 1 parameter
# Fit Three Models
For model goodness of fit comparisons we focus on AIC and the third $R^2$ measure `R2(p,1520.4)` which is adjusted for `p` model parameters excluding the intercept, and uses an effective sample size of 1520.[The last adjusted $R^2$ is computed in a way that recognizes that binary outcomes have less information than continuous ones so that the effective sample size here is not the total sample size of 2258. AIC is inversely related to the model LR $\chi^2$ statistic $-2 \times p$.]{.aside}
```{r}
dd <- datadist(d); options(datadist='dd')
f1 <- lrm(sigdz ~ rcs(age, 4) * (sex + rcs(choleste, 4)),
data=d, x=TRUE, y=TRUE) # x, y for test='LR' below
f1
f2 <- lrm(sigdz ~ rcs(age, 4) * sex + rcs(choleste, 4) +
rcs(age, 4) %ia% rcs(choleste, 4), data=d)
f2
f3 <- lrm(sigdz ~ rcs(age, 4) * sex + rcs(choleste, 4) +
age %ia% choleste, data=d)
f3
```
```{r}
c('AIC f1'=AIC(f1), 'AIC f2'=AIC(f2), 'AIC f3'=AIC(f3))
```
According to AIC, the third model (with a linear $\times$ linear interaction) is most likely to predict future observations the best. $R^{2}_\text{adj}$ does not penalize for complexity as heavily as AIC, and favors the first model.
## Likelihood Ratio $\chi^2$ Tests Against the Full Model
```{r}
lrtest(f1, f2)
lrtest(f1, f3)
```
* Interpret these
To get a LR test for the overall effect of ch (main effort or interacting with age) one could compare `f1` with a model that ignores ch:
```{r}
f4 <- lrm(sigdz ~ rcs(age, 4) * sex, data=d)
lrtest(f1, f4)
```
* Compare this with the appropriate line in the ANOVA table below
## Likelihood Ratio Tests on the Full Model
Print and plot LR chunk tests. A second plot shows relative explained outcome variation by the various risk factors. This is a variable importance measure.
```{r}
#| fig.height: 2.85
a <- anova(f1, test='LR')
a
plot(a)
plot(a, what='proportion chisq')
```
Also show the ANOVA table with dots indicating which parameters are being tested in each row.
```{r}
#| column: page-inset-left
print(a, which='dots')
```
* Interpret the tests
The ANOVA includes tests of linearities of interactions, and needed chunk tests that recognize that a main effects-only test is not meaningful when the main effect is contained in an interaction.
# Diagnostics
From here on we use the simplest model (with respect to age $\times$ ch), `f3`. Use a threshold for DFFITS of 0.25.
```{r}
# Refit f3 to include raw data for diagnostics and validation
f <- update(f3, x=TRUE, y=TRUE)
w <- which.influence(f, 0.25)
w
show.influence(w, d)
```
* Do you do anything about this?
# Partial Effect Plots
We don't use the default `rms` output of showing all partial effects on one large plot, because the ch effect is age-specific. For the ch plot evaluate predictions at selected ages.
```{r}
agec <- c(40, 45, 50, 55, 60)
ggplot(Predict(f, age, sex))
ggplot(Predict(f, choleste, age=agec))
```
Keep age continuous for make three types of 3-d plots. Transform from predicted logits to predicted risks. For continuous $\times$ continuous displays it is important not to extrapolate. The `rms` `perimeter` function finds the boundary in two dimensions where the sample size is 10 subjects beyond the boundary.
```{r}
perim <- with(d, perimeter(choleste, age))
p <- Predict(f, choleste, age, fun=plogis)
bplot(p, perim=perim) # is in rms; type ?bplot for details
bplot(p, lfun=wireframe, alpha.regions=.2, perim=perim)
bplot(p, lfun=contourplot, perim=perim)
```
# Internal Validation
* Use the bootstrap with 300 resamples
* This is a strong internal validation except for not taking into account the tests and AIC that were used to arrive a model `f3`
```{r}
validate(f, B=300)
```
* Interpret
```{r}
cal <- calibrate(f, B=300)
plot(cal)
```
* Interpret
# Computing Environment
```{r echo=FALSE,results='asis'}
markupSpecs$html$session()
```