\chapter{Modeling for Observational Treatment Comparisons}
\ddisc{18}
\soundm{ps-1}
The randomized clinical trial is the gold standard for developing
evidence about treatment effects, but on rare occasion an RCT is not
feasible, or one needs to make clinical decisions while waiting years
for an RCT to complete. Observational treatment comparisons can
sometimes help, though many published ones provide information that is
worse than having no information at all due to missing confounder
variables or poor statistical practice.
\section{Challenges}
\soundm{ps-2}
\bi
\item Attempt to estimate the effect of a treatment A using data on
patients who \emph{happen} to get the treatment or a comparator B
\item Confounding by indication
\bi
\item indications exist for prescribing A; not a random process
\item those getting A (or B) may have failed an earlier treatment
\item they may be less sick, or more sick
\item what makes them sicker may not be measured
\ei
\item Many researchers have attempted to use data collected for other
purposes to compare A and B
\bi
\item they rationalize adequacy of the data after seeing what is
available
\item they do not design the study prospectively, guided by unbiased
experts who understand the therapeutic decisions
\ei
\item If the data are adequate for the task, the goal is to adjust for all
  potential confounders as measured in those data
\item Easy to lose sight of parallel goal: adjust for outcome heterogeneity
\ei
\section{Propensity Score}\label{sec:propens}
\soundm{ps-3}
\bi
\item In observational studies comparing treatments, need to adjust
for nonrandom treatment selection
\item Number of confounding variables can be quite large
\item May be too large to adjust for them using multiple regression,
due to overfitting (may have more potential confounders than
outcome events)
\item Assume that all factors related to treatment selection that are
prognostic are collected
\item Use them in a flexible regression model to predict treatment
actually received (e.g., logistic model allowing nonlinear effects)
\item \textbf{Propensity score} (PS) = estimated probability of getting
treatment B vs.\ treatment A
\item Use of the PS allows one to aggressively adjust for measured
potential confounders
\item Doing an adjusted analysis where the adjustment variable is the
PS simultaneously adjusts for all the variables in the
score \emph{insofar} as confounding is concerned
(but \textbf{not with regard to outcome heterogeneity})
\item If after adjusting for the score there were a residual imbalance
for one of the variables, that would imply that the variable was
not correctly modeled in the PS
\item E.g.: after holding PS constant there are more subjects above
  age 70 in treatment B; this means that age$>70$ is still predictive of
  treatment received after adjusting for PS, or that age$>70$ was not
  modeled correctly.
\item An additive (in the logit) model where all continuous baseline
variables are splined will result in adequate adjustment in the
majority of cases---certainly better than categorization. Lack of
fit will then come only from omitted interaction effects. E.g.: if
older males are much more likely to receive treatment B than
  treatment A than would be expected from the effects of age and
  sex alone, adjustment for the additive propensity would not
  adequately balance for age and sex.
\ei
A nice discussion of problems with propensity scores is at \href{https://stats.stackexchange.com/questions/481110}{stats.stackexchange.com/questions/481110}.
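As a concrete illustration of the PS-estimation step described above, the sketch below fits a logistic model by Newton--Raphson to simulated data in which age drives selection of treatment B. It is a minimal sketch under stated assumptions: age enters linearly for brevity, whereas the text recommends splining continuous variables; the data-generating numbers and function names are all hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_logistic(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = max(-30.0, min(30.0, sum(b * v for b, v in zip(beta, xi))))
            mu = 1.0 / (1.0 + math.exp(-eta))   # P(treatment B | covariates)
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

# Simulated observational data: older patients are more likely to get treatment B
random.seed(1)
ages = [random.uniform(40, 80) for _ in range(200)]
treat = [1 if random.random() < 1.0 / (1.0 + math.exp(-0.08 * (a - 60))) else 0
         for a in ages]

X = [[1.0, a] for a in ages]          # intercept + age (linear only, for brevity)
beta = fit_logistic(X, treat)
ps = [1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * a))) for a in ages]
```

Each element of \co{ps} is the estimated probability of receiving treatment B given age; these are the values carried into the outcome model discussed later in the chapter.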
\section{Misunderstandings About Propensity Scores}
\soundm{ps-4}
\bi
\item PS can be used as a building block to causal inference but PS is not a
causal inference tool \emph{per se}
\item PS is a \emph{confounding focuser}
\item It is a \emph{data reduction} tool that may reduce the number of
parameters in the outcome model
\item PS analysis is not a simulated randomized trial
\bi
\item randomized trials depend only on chance for treatment
assignment
\item RCTs do not depend on measuring all relevant variables
\ei
\item Adjusting for PS is adequate for adjusting for \textbf{measured}
confounding if the PS model fits observed treatment selection
patterns well
\item But adjusting only for PS is inadequate
\bi
\item to get proper conditioning so that the treatment effect can
generalize to a population with a different covariate mix, one
must condition on important prognostic factors
\item non-collapsibility of hazard and odds ratios is not addressed
by PS adjustment
\ei
\item PS is not necessary if the effective sample size (e.g., the number of
  outcome events) exceeds roughly $5p$, where $p$ is the number of measured covariates
\item Stratifying for PS does not remove all the measured confounding
\item Adjusting only for PS can hide interactions with treatment
\item When judging covariate balance (as after PS matching) it is
\textbf{not} sufficient to examine the mean covariate value in the
treatment groups
\ei
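To illustrate the final bullet above, the toy example below constructs two groups whose covariate means are nearly identical while their distributions are very different; a two-sample Kolmogorov--Smirnov statistic (the maximum gap between empirical CDFs) exposes the imbalance that the means hide. The numbers are fabricated purely for illustration.

```python
import bisect
import random

def ks_stat(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d

random.seed(0)
# Same mean covariate value in both groups, but very different spread
grp_a = [random.gauss(60.0, 2.0) for _ in range(500)]
grp_b = [random.gauss(60.0, 12.0) for _ in range(500)]

mean_gap = abs(sum(grp_a) / len(grp_a) - sum(grp_b) / len(grp_b))
ks = ks_stat(grp_a, grp_b)
# mean_gap is small, yet ks is large: the groups are badly imbalanced in spread
```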
\section{Assessing Treatment Effect}
\soundm{ps-5}
\bi
\item Eliminate patients in intervals of PS where there is no overlap
between A and B, or include an interaction between treatment and a
baseline characteristic\footnote{To quote Gelman and
Hill Section 10.3~\cite{gel06dat}, ``Ultimately, one good
solution may be a multilevel model that includes treatment
interactions so that inferences explicitly recognize the decreased
precision that can be obtained outside the region of overlap.'' For example, if one included an interaction between age and treatment and there were no patients greater than 70 years old receiving treatment B, the B:A difference for age greater than 70 would have an extremely wide confidence interval as it depends on extrapolation. So the estimates that are based on extrapolation are not misleading; they are just not informative.}
\item Many researchers stratify the PS into quintiles, get treatment
differences within the quintiles, and average these to get
adjustment treatment effects
\item Often results in imbalances in outer quintiles due to skewed
distributions of PS there
\item Can do a matched-pairs analysis, but results depend on the matching
  tolerance, and many patients will be discarded once their potential
  match has already been used
\item Inverse probability weighting by PS is a high variance/low power
approach, like matching
\item Usually better to adjust for PS in a regression model
\item Model: $Y = \textrm{treat} + \log\frac{PS}{1-PS} +$
nonlinear functions of $\log\frac{PS}{1-PS} +$ important prognostic
variables
\item Prognostic variables need to be in outcome ($Y$) model even
though they are also in the PS, to account for subject outcome heterogeneity
(susceptibility bias)
\item If outcome is binary and can afford to ignore prognostic
variables, use nonparametric regression to relate PS to outcome
separately in actual treatment A vs.\ B groups
\item Plotting these two curves with PS on $x$-axis and looking at
vertical distances between curves is an excellent
way to adjust for PS continuously without assuming a model
\ei
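The modeling strategy above can be sketched as follows. In the simulation, a single confounder drives both treatment choice and outcome; the naive difference in means is badly biased, while least squares on treatment plus the logit of the PS recovers the true effect. For brevity the sketch uses the true (rather than estimated) PS, a linear logit-PS term, and no additional prognostic variables, all of which the text advises handling more carefully.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

random.seed(2)
TRUE_EFFECT = 1.0
data = []
for _ in range(2000):
    x = random.gauss(0.0, 1.0)                      # confounder
    ps = 1.0 / (1.0 + math.exp(-1.5 * x))           # true propensity (known here)
    t = 1 if random.random() < ps else 0
    y = TRUE_EFFECT * t + 2.0 * x + random.gauss(0.0, 1.0)
    data.append((t, math.log(ps / (1.0 - ps)), y))

# Naive comparison ignores confounding by x
y1 = [y for t, l, y in data if t == 1]
y0 = [y for t, l, y in data if t == 0]
naive = sum(y1) / len(y1) - sum(y0) / len(y0)

# OLS: y ~ intercept + treat + logit(PS), fit via the normal equations
X = [(1.0, float(t), l) for t, l, _ in data]
Y = [y for _, _, y in data]
XtX = [[sum(r[j] * r[k] for r in X) for k in range(3)] for j in range(3)]
Xty = [sum(r[j] * yv for r, yv in zip(X, Y)) for j in range(3)]
beta = solve(XtX, Xty)
adjusted = beta[1]            # close to TRUE_EFFECT; naive is far from it
```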
\subsection{Problems with Propensity Score Matching}\label{sec:psmatch}
\soundm{ps-6}
\bi
\item The choice of the matching algorithm is not principled, so it is
  mainly arbitrary. Most matching algorithms depend on the
  order of observations in the dataset. The arbitrariness of matching
  algorithms creates a type of non-reproducibility.
\item Non-matched observations are discarded, resulting in a loss of
precision and power.
\item Matching not only discards hard-to-match observations (thus
helping the analyst correctly concentrate on the propensity overlap
region) but also discards many ``good'' matches in the overlap
region.
\item Matching does not do effective interpolation on the interior of
the overlap region.
\item The choice of the main analysis when matching is used is not
well worked out in the statistics literature. Most analysts just
ignore the matching during the outcome analysis.
\item Even with matching one must use covariate adjustment for strong
prognostic factors to get the right treatment effects, due to
non-collapsibility of odds and hazards ratios.
\item Matching hides interactions with treatment and covariates.
\ei
Most users of propensity score matching do not even entertain the
notion that the treatment effect may interact with propensity to
treat, much less entertain the thought of individual patient
characteristics interacting with treatment.
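The order dependence noted in the first bullet above is easy to demonstrate. Below, a greedy nearest-neighbor matcher (1:1, without replacement, with a caliper) is run on the same hypothetical PS values in two different orders and returns two different matched sets; all numbers are made up.

```python
def greedy_match(treated_ps, control_ps, caliper=0.10):
    """Greedy 1:1 nearest-neighbor PS matching without replacement."""
    used = set()
    pairs = []
    for t in treated_ps:
        best, best_d = None, caliper
        for i, c in enumerate(control_ps):
            if i in used:
                continue
            d = abs(t - c)
            if d <= best_d:
                best, best_d = i, d
        if best is not None:
            used.add(best)
            pairs.append((t, control_ps[best]))
    return pairs

controls = [0.30, 0.50]
pairs_1 = greedy_match([0.32, 0.29], controls)  # 0.32 claims 0.30; 0.29 unmatched
pairs_2 = greedy_match([0.29, 0.32], controls)  # 0.29 claims 0.30; 0.32 unmatched
# Same subjects, different processing order, different matched samples
```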
\section{Recommended Statistical Analysis Plan}
\soundm{ps-7}
\be
\item Be very liberal in selecting a large list of potential
confounder variables that are measured pre-treatment. But respect
causal pathways and avoid collider and other biases.
\item If the number of potential confounders is not large in
comparison with the effective sample size, use direct covariate
adjustment instead of propensity score adjustment. For example, if
the outcome is binary and you have more than 5 events per covariate,
full covariate adjustment probably works OK.
\item Model the probability of receiving treatment using a flexible
statistical model that makes minimal assumptions (e.g., rich
additive model that assumes smooth predictor effects). If there are
more than two treatments, you will need as many propensity scores as
there are treatments, less one, and all of these logit propensity
scores will need to be adjusted for in what follows.
\item Examine the distribution of estimated propensity score
separately for the treatment groups.
\item If there is a non-overlap region of the two distributions, and
you don't want to use a more conservative interaction analysis (see
below), exclude those subjects from the analysis. Recursive
partitioning can be used to predict membership in the non-overlap
region from baseline characteristics so that the research findings
with regard to applicability/generalizability can be better
understood.
\item Overlap must be judged on absolute sample sizes, not proportions.
\item Use covariate adjustment for propensity score for subjects in
the overlap region. Expand logit propensity using a restricted
cubic spline so as to not assume linearity in the logit in relating
propensity to outcome. Also include pre-specified important
prognostic factors in the model to account for the majority of
outcome heterogeneity. It is not a problem that these prognostic
variables are also in the propensity score.
\item As a secondary analysis use a chunk test to assess whether there
  is an interaction between logit propensity to treat and actual treatment.
For example, one may find that physicians are correctly judging that
one subset of patients should usually be treated a certain way.
\item Instead of removing subjects outside the overlap region, you \soundm{ps-8}
could allow
propensity or individual predictors to interact with treatment.
Treatment effect estimates in the presence of interactions are
self-penalizing for not having sufficient overlap. Suppose for
example that age were the only adjustment covariate and a propensity
score was not needed. Suppose that for those with age less than 70
there were sufficiently many subjects from either treatment for
every interval of age but that when age exceeded 70 there were only
5 subjects on treatment B. Including an age $\times$ treatment
interaction in the model and obtaining the estimated outcome difference
for treatment A vs.\ treatment B as a function of age will have
a confidence band with minimum width at the mean age, and
above age 70 the confidence band will be very wide. This is to be
expected and is an honest way to report what we know about the
treatment effect adjusted for age. If there were no age $\times$
treatment interaction, omitting the interaction term would yield a
proper model with a relatively narrow confidence interval, and if
the shape of the age relationship were correctly specified the
treatment effect estimate would be valid. So one can say that not
having comparable subjects on both treatments for some intervals of
covariates means that either (1) inference should be restricted to
the overlap region, or (2) the inference is based on model
assumptions.
\item See \href{https://fharrell.com/post/ia}{fharrell.com/post/ia}
for details about interaction, confidence interval width, and
relationship to generalizability.
\ee
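The exclusion of non-overlap subjects described above can be sketched with one simple (and not unique) definition of the overlap region: the intersection of the two groups' observed PS ranges. The PS values below are hypothetical, and quantile-based trimming is a common alternative.

```python
ps_a = [0.05, 0.12, 0.30, 0.55, 0.70]   # estimated PS, treatment A (hypothetical)
ps_b = [0.25, 0.40, 0.60, 0.85, 0.95]   # estimated PS, treatment B (hypothetical)

# Overlap region = intersection of the two observed PS ranges
lo = max(min(ps_a), min(ps_b))           # 0.25
hi = min(max(ps_a), max(ps_b))           # 0.70

keep_a = [p for p in ps_a if lo <= p <= hi]   # [0.30, 0.55, 0.70]
keep_b = [p for p in ps_b if lo <= p <= hi]   # [0.25, 0.40, 0.60]
```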
Using a full regression analysis allows interactions to be explored,
as briefly described above.
Suppose that one uses a restricted cubic spline in the logit
propensity to adjust for confounding, and all these spline terms are
multiplied by the indicator variable for getting a certain treatment.
One can make a plot with predicted outcome on the $y$-axis and PS on
the $x$-axis, with one curve per treatment. This allows inspection of
parallelism (which can easily be formally tested with the chunk test)
and whether there is a very high or very low PS region where treatment
effects are different from the average effect. For example, if
physicians have a very high probability of always selecting a certain
treatment for patients that actually get the most benefit from the
treatment, this will be apparent from the plot.
\section{Reasons for Failure of Propensity Analysis}
\soundm{ps-9}
Propensity analysis may not sufficiently adjust for confounding in non-randomized studies when
\bi
\item prognostic factors that are confounders are not measured and are not highly correlated with factors that are measured
\item the propensity modeling was too parsimonious (e.g., if the researchers excluded baseline variables just because they were insignificant)
\item the propensity model assumed linearity of effects when some were really nonlinear (imbalances in aspects of a covariate's distribution other than the mean would then go unhandled)
\item the propensity model should have had important interaction terms that were not included (e.g., if there is only an age imbalance in males)
\item the researchers attempted to extrapolate beyond ranges of overlap in propensity scores in the two groups (this happens with covariate adjustment sometimes, but can happen with quantile stratification if outer quantiles are very imbalanced)
\ei
\section{Sensitivity Analysis}
\soundm{ps-10}
\bi
\item For $n$ patients in the analysis, generate $n$ random values of
a hypothetical unmeasured confounder $U$
\item Constrain $U$ so that the effect of $U$ on the response $Y$ is
given by an adjusted odds ratio of $OR_{Y}$ and so that $U$'s distribution is
unbalanced in group A vs.\ B to the tune of an odds ratio of
$OR_{treat}$.
\item Solve for how large $OR_{Y}$ and $OR_{treat}$ must be before the
adjusted treatment effect reverses sign or changes in statistical
significance
\item The larger are $OR_Y$ and $OR_{treat}$ the less plausible it is
that such an unmeasured confounder exists
\ei
See the \R\ \co{rms} package \co{sensuc} function.
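The treatment-imbalance half of this construction can be sketched as follows: given a target $OR_{treat}$ and an assumed prevalence of $U$ in group A, solve for the group-B prevalence on the odds scale and simulate. Linking $U$ to $Y$ with a target $OR_{Y}$ is analogous; the \co{sensuc} function mentioned above performs the full procedure. All numbers here are illustrative.

```python
import random

random.seed(3)
OR_TREAT = 3.0          # target imbalance of U between treatment groups
p_a = 0.30              # assumed prevalence of U = 1 in group A

# Solve for the group-B prevalence that yields the target odds ratio
odds_b = OR_TREAT * p_a / (1.0 - p_a)
p_b = odds_b / (1.0 + odds_b)

n = 20000
in_b = [random.random() < 0.5 for _ in range(n)]          # treatment group
u = [int(random.random() < (p_b if b else p_a)) for b in in_b]

# Check the empirical odds ratio of U (B vs. A) against the target
n11 = sum(1 for b, ui in zip(in_b, u) if b and ui)
n10 = sum(1 for b, ui in zip(in_b, u) if b and not ui)
n01 = sum(1 for b, ui in zip(in_b, u) if not b and ui)
n00 = sum(1 for b, ui in zip(in_b, u) if not b and not ui)
emp_or = (n11 * n00) / (n10 * n01)       # close to OR_TREAT
```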
\section{Reasons To Not Use Propensity Analysis}
\soundm{ps-11}
Chen et al.~\cite{che16too} demonstrated advantages of using a unified
regression model to adjust for ``too many'' predictors by using
penalized maximum likelihood estimation, where the exposure variable
coefficients are not penalized but all the adjustment variable
coefficients have a quadratic (ridge) penalty.
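The idea in Chen et al.\ can be sketched in miniature. The code below uses a linear-model analogue (ridge least squares rather than penalized maximum-likelihood logistic regression, purely for brevity): the exposure coefficient is left unpenalized while the ten adjustment-covariate coefficients receive a quadratic penalty. The simulated data and penalty value are arbitrary.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

random.seed(4)
n_covs, lam = 10, 5.0
rows, ys = [], []
for _ in range(500):
    z = [random.gauss(0.0, 1.0) for _ in range(n_covs)]
    t = 1.0 if random.random() < 0.5 else 0.0
    y = 1.0 * t + 0.3 * z[0] + random.gauss(0.0, 1.0)   # true exposure effect = 1
    rows.append([1.0, t] + z)
    ys.append(y)

p = n_covs + 2
XtX = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
for j in range(2, p):         # ridge penalty on adjustment covariates only;
    XtX[j][j] += lam          # intercept and exposure (columns 0, 1) unpenalized
Xty = [sum(r[j] * y for r, y in zip(rows, ys)) for j in range(p)]
beta = solve(XtX, Xty)
exposure_effect = beta[1]     # near 1.0 despite shrinkage of the other coefficients
```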
\section{Further Reading}
\href{http://www.stat.columbia.edu/~gelman/arm/chap10.pdf}{Gelman} has
a nice chapter on causal inference and matching from Gelman and
Hill~\cite{gel06dat}.
Gary King has expressed a number of reservations about PS
matching; see\\ \href{https://gking.harvard.edu/files/gking/files/psnot.pdf}{gking.harvard.edu/files/gking/files/psnot.pdf}.