Update Chapter 6 #321

Merged · 26 commits · Feb 13, 2025
Changes from 15 commits
200 changes: 183 additions & 17 deletions chapters/06-stats-models-ci.qmd
@@ -96,6 +96,7 @@ satisfaction1 <- tibble(
)
) |>
mutate(
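# standardize satisfaction to mean 0, sd 1; as.numeric() drops scale()'s attributes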
satisfaction = as.numeric(scale(satisfaction)),
update_frequency = factor(
update_frequency,
labels = c("weekly", "daily")
@@ -223,6 +224,7 @@ satisfaction2 <- tibble(
15 * customer_service + rnorm(n),
) |>
mutate(
satisfaction = as.numeric(scale(satisfaction)),
customer_type = factor(
customer_type,
labels = c("free", "premium")
@@ -331,13 +333,14 @@ Organizations with more users get more updates and have slightly lower satisfact
```{r}
satisfaction3 <- tibble(
# Number of users
num_users = runif(n, min = 1, max = 500),
# Larger customers more likely to have daily updates
update_frequency = rbinom(n, 1, plogis(num_users / 100)),
# with more users comes less satisfaction
satisfaction = 70 - 0.2 * num_users + rnorm(n)
) |>
mutate(
satisfaction = as.numeric(scale(satisfaction)),
update_frequency = factor(
update_frequency,
labels = c("weekly", "daily")
@@ -376,17 +379,94 @@ satisfaction3_strat |>
summarise(estimate = mean(daily - weekly))
```

As opposed to binary and categorical confounders, grouping by bins does not completely account for a continuous confounder: the coarser the bins, the more residual confounding there will be; the finer the bins, the closer we get to the continuous version (but the fewer values we'll have per bin; see @tip-bins).

::: {#tip-bins .callout-tip}

## What would happen if we change the number of bins?

Let's see what happens if we increase the number of bins. In the figure below, we vary the number of bins from 3 to 20 (the example in the text used 5). Notice that as the number of bins increases, the bias decreases.

```{r}
#| code-fold: true
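# For a given number of bins: stratify num_users into quantile bins,
# average satisfaction within bin x exposure group, then estimate the
# mean daily - weekly difference across bins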
update_bins <- function(bins) {
satisfaction3 |>
mutate(num_users_q = ntile(num_users, bins)) |>
group_by(num_users_q, update_frequency) |>
summarise(
avg_satisfaction = mean(satisfaction),
.groups = "drop"
) |>
ungroup() |>
pivot_wider(
names_from = update_frequency,
values_from = avg_satisfaction
) |>
summarise(
bins = bins,
estimate = mean(daily - weekly)
)
}

map_dfr(3:20, update_bins) |>
ggplot(aes(x = bins, y = abs(estimate))) +
geom_point() +
geom_line() +
labs(y = "Bias", x = "Number of bins")
```

For example, looking at the output below, we see that the estimate is much closer to the truth (0) with 20 bins than with 5.

```{r}
satisfaction3 |>
mutate(num_users_q = ntile(num_users, 20)) |>
group_by(num_users_q, update_frequency) |>
summarise(
avg_satisfaction = mean(satisfaction),
.groups = "drop"
) |>
ungroup() |>
pivot_wider(
names_from = update_frequency,
values_from = avg_satisfaction
) |>
summarise(estimate = mean(daily - weekly))
```

As with many good things, however, there is a limit to the utility of increasing the number of bins. For example, let's see what happens if we use 30 bins.

```{r}
satisfaction3 |>
mutate(num_users_q = ntile(num_users, 30)) |>
group_by(num_users_q, update_frequency) |>
summarise(
avg_satisfaction = mean(satisfaction),
.groups = "drop"
) |>
ungroup() |>
pivot_wider(
names_from = update_frequency,
values_from = avg_satisfaction
) |>
summarise(estimate = mean(daily - weekly))
```

The estimate is `NA` because some bins didn't have anyone in one of the exposure groups, making the difference within those bins inestimable. This analysis now violates our *positivity* assumption. Here, the violation is stochastic: it is a consequence of our sample size, `r scales::comma(n)`, relative to the number of bins, in this case 30. By chance, at least one of the 30 bins ended up without anyone in one of the exposure groups, making our causal effect inestimable. This non-parametric method, while flexible, is limited by the sample size. Parametric models are useful because they allow us to extrapolate under certain assumptions, which makes them more efficient (assuming those assumptions hold; we'll learn more in @sec-parametric).
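
One quick way to confirm this is to count the observations in each bin by exposure group. Here's a sketch of such a check, using the same verbs as the examples above, that flags the bins missing an exposure group:

```{r}
# Flag bins where one exposure group is empty (a positivity violation)
satisfaction3 |>
  mutate(num_users_q = ntile(num_users, 30)) |>
  count(num_users_q, update_frequency) |>
  pivot_wider(
    names_from = update_frequency,
    values_from = n,
    values_fill = 0
  ) |>
  filter(weekly == 0 | daily == 0)
```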


:::

The approach we've been using with `group_by()` and `summarize()` is often called **stratification**.
You can also think of it as a type of non-parametric approach.
We aren't using any parameterization from a statistical model to restrict the form of the variables the way we might with, say, a linear regression.
(This is only partially true for continuous confounders because it's not practical to stratify by all values of the continuous variable).



Stratification can be powerful for simple problems or when you have lots of data because you can sometimes avoid model misspecification problems.
However, with many confounders (especially continuous ones), we quickly encounter the curse of dimensionality: we end up with too few observations within each combination of confounder levels, making stratification impractical, as the sketch below illustrates.
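
To get a sense of the scale, here's a quick back-of-the-envelope sketch, assuming 10 bins per confounder: the number of strata grows exponentially with the number of confounders, so the observations per stratum vanish almost immediately.

```{r}
# Illustrative arithmetic only: strata counts with 10 bins per confounder
tibble(
  n_confounders = 1:6,
  n_strata = 10^n_confounders,
  # n is the simulation sample size defined earlier in the chapter
  avg_obs_per_stratum = n / n_strata
)
```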

## Parametric outcome models {#sec-parametric}

You can think of stratification as calculating conditional means.
A more general extension of conditional means is multivariable linear regression.
@@ -420,9 +500,94 @@ lm(
This generalization doesn't come for free, though: we've now introduced a parametric statistical model to make estimates across the sparse regions of our data.
The estimate we get for `satisfaction ~ update_frequency + num_users` gives us exactly the right answer because the statistical model underlying `lm()` perfectly matches our simulation.
For example, the relationship between `satisfaction` and `num_users` is linear, so when we fit this model, we don't suffer from the problem of dimensionality (although linear regression has its own limits in terms of the number of rows and columns).
In other words, we're now dependent on the correct **functional form**, the mathematical representation of the relationship between variables in a model (see @tip-functional-form for more details).
We need the correct functional form for both the exposure and the confounders.
Modeling this well requires an understanding of the nature of the relationship between these variables and the outcome.

::: {#tip-functional-form .callout-warning}

## Functional form in parametric models

In the text, we simulated the relationship between the outcome and the confounders to be linear; that is, it exactly met the assumptions underlying `lm()`, so we got the right answer when we fit our parametric model. What would happen if our simulation did not match the assumptions underlying `lm()`? Let's take a look.

```{r}
set.seed(1)
satisfaction4 <- tibble(
# Number of users
num_users = runif(n, 1, 500),
# Larger customers more likely to have daily updates
update_frequency = rbinom(n, 1, plogis(num_users / 100)),
# non-linear relationship between satisfaction and number of users
satisfaction = 70 - 0.001 * (num_users - 300)^2 - 0.001 * (num_users - 300)^3
) |>
mutate(
satisfaction = as.numeric(scale(satisfaction)),
update_frequency = factor(
update_frequency,
labels = c("weekly", "daily")
)
)
ggplot(satisfaction4, aes(x = num_users, y = satisfaction)) +
geom_point()
```

In the figure above, we see that there is now a non-linear relationship between our confounder, number of users, and our outcome, satisfaction. Let's see what happens if we fit an (incorrect) parametric model to these data.

```{r}
lm(
satisfaction ~ update_frequency + num_users,
data = satisfaction4
) |>
tidy(conf.int = TRUE) |>
filter(term == "update_frequencydaily") |>
select(estimate, starts_with("conf"))
```

Now our estimate is far from the truth (which should be zero); the truth is not even contained in the confidence interval. What went wrong? Our parametric model assumed that the functional form of the relationship between the number of users and satisfaction was linear, but we generated it non-linearly. There is a solution that still allows us to use a parametric model: if we knew the true functional form, we could use that. Let's see how that looks.

```{r}
lm(
satisfaction ~ update_frequency + poly(num_users, 3),
data = satisfaction4
) |>
tidy(conf.int = TRUE) |>
filter(term == "update_frequencydaily") |>
select(estimate, starts_with("conf"))
```

Beautiful! This model was fit *exactly* as the data were generated, and again we get exactly the right answer. In the real world, we often don't know the data-generating mechanism, but we can still fit flexible parametric models. A great way to do this is with natural cubic splines.

```{r}
lm(
satisfaction ~ update_frequency + splines::ns(num_users, 3),
data = satisfaction4
) |>
tidy(conf.int = TRUE) |>
filter(term == "update_frequencydaily") |>
select(estimate, starts_with("conf"))
```

We can also use our original non-parametric method: if we stratify into 20 bins, we again get a less biased estimate (i.e., one very close to the true value, 0).

```{r}
satisfaction4_strat <- satisfaction4 |>
mutate(num_users_q = ntile(num_users, 20)) |>
group_by(num_users_q, update_frequency) |>
summarise(
avg_satisfaction = mean(satisfaction),
.groups = "drop"
)

satisfaction4_strat |>
ungroup() |>
pivot_wider(
names_from = update_frequency,
values_from = avg_satisfaction
) |>
summarise(estimate = mean(daily - weekly))
```


:::

We'll also later explore ways to use data-adaptive methods like machine learning to reduce this assumption (@sec-causal-ml).

<!-- TODO: link appendix on model checks here -->
@@ -456,15 +621,15 @@ Whether increasing the update frequency to daily for everyone is beneficial depe

- If 50% of the customers are premium and 50% are free, the average effect of switching to daily updates would be:

    $(0.5 \times 5) + (0.5 \times -5) = 0$

- If 100% of the customers are premium, the average effect would be:

    $(1 \times 5) + (0 \times -5) = 5$

- If 100% of the customers are free, the average effect would be:

    $(0 \times 5) + (1 \times -5) = -5$

Marginalization tells us the average effect for the distribution of covariates in the data.
Of course, we might want to estimate the causal effect *by* customer type; we'll discuss interaction effects in depth in @sec-interaction.
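
This weighted-average arithmetic is simple to express directly; here's a minimal sketch using the subgroup effects from the bullets above:

```{r}
# Subgroup effects of switching to daily updates, from the bullets above
effects <- c(premium = 5, free = -5)

# Average effect under each hypothetical customer mix
sum(c(premium = 0.5, free = 0.5) * effects)
sum(c(premium = 1.0, free = 0.0) * effects)
sum(c(premium = 0.0, free = 1.0) * effects)
```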
@@ -489,10 +654,10 @@ For the rest of the book, however, we will focus on other causal methods, which
Here's a brief summary of some of the unconfoundedness methods we'll cover and what they do.

- *Unconfoundedness methods*
    - **Inverse probability weighting** (propensity score weighting): Using a propensity score (predicted probability of treatment), we reweight units to create a pseudo-population where exchangeability holds. Extends to time-varying treatments. (See the sketch after this list.)
    - **Matching** (propensity score matching and other methods): Find treated and untreated units with similar propensity scores (or other measures of similarity) to match, creating a subpopulation where exchangeability holds.
    - **G-computation** (also called standardization or marginal effects): Fit an outcome model but marginalize to get a marginal effect estimate. Extends to time-varying treatments.
    - **Doubly robust methods**: Fit models for both the outcome and treatment. Using doubly robust methods, only one of these models needs to be correct for the estimate to be correct. Doubly robust methods also allow us to use machine learning algorithms. We'll discuss **targeted learning (TMLE)** and **augmented propensity scores**.
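
As a rough preview of the first of these, here's a minimal hand-rolled sketch of inverse probability weighting on the `satisfaction3` data; the `ps_model` and `satisfaction3_wt` names are illustrative, and later chapters do this properly (including correct standard errors).

```{r}
# Sketch only: model the exposure, weight each unit by the inverse
# probability of the exposure it actually received, then compare
# weighted outcome means
ps_model <- glm(
  update_frequency ~ num_users,
  data = satisfaction3,
  family = binomial()
)

satisfaction3_wt <- satisfaction3 |>
  mutate(
    ps = predict(ps_model, type = "response"), # P(daily | num_users)
    wt = ifelse(update_frequency == "daily", 1 / ps, 1 / (1 - ps))
  )

lm(
  satisfaction ~ update_frequency,
  data = satisfaction3_wt,
  weights = wt
) |>
  tidy() |>
  filter(term == "update_frequencydaily") |>
  select(estimate)
```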

While the book focuses primarily on unconfoundedness methods, we later cover methods that make other assumptions (@sec-iv-friends and @sec-did).
Here's a brief summary of when we might want to explore these methods instead of trying to achieve exchangeability:
@@ -542,6 +707,7 @@ satisfaction_randomized <- tibble(
15 * customer_service + rnorm(n),
) |>
mutate(
satisfaction = as.numeric(scale(satisfaction)),
customer_type = factor(
customer_type,
labels = c("free", "premium")
@@ -570,7 +736,7 @@ plot_estimates <- function(d) {
)) |>
filter(term == "update_frequency") |>
mutate(model = "unadjusted")

adj_model <- lm(
satisfaction ~ update_frequency + business_hours +
customer_type,
@@ -584,12 +750,12 @@ plot_estimates <- function(d) {
)) |>
filter(term == "update_frequency") |>
mutate(model = "direct\nadjustment")

df <- d |>
mutate(across(where(is.factor), as.integer)) |>
mutate(update_frequency = update_frequency - 1) |>
as.data.frame()

x <- PSW::psw(
df,
"update_frequency ~ business_hours + customer_type",
@@ -607,7 +773,7 @@ plot_estimates <- function(d) {
p.value = NA,
model = "inverse\nprobability\nweighting"
)

models <- bind_rows(unadj_model, adj_model, psw_model) |>
mutate(model = factor(
model,
Expand All @@ -617,7 +783,7 @@ plot_estimates <- function(d) {
"inverse\nprobability\nweighting"
)
))

models |>
select(model, estimate, std.error, starts_with("conf")) |>
pivot_longer(