
# GEE, Binary Outcome: Respiratory Illness

```{r, include=FALSE}
knitr::opts_chunk$set(comment     = "",
                      echo        = TRUE, 
                      warning     = FALSE, 
                      message     = FALSE,
                      fig.align   = "center", # center all figures
                      fig.width   = 6,        # set default figure width to 6 inches
                      fig.height  = 4)        # set default figure height to 4 inches
```

## Packages

### CRAN

```{r, message=FALSE, error=FALSE}
library(tidyverse)    # all things tidy
library(pander)       # nice looking general tabulations
library(furniture)    # nice table1() descriptives
library(texreg)       # Convert Regression Output to LaTeX or HTML Tables
library(psych)        # contains some useful functions, like headTail

library(interactions)
library(performance)
library(lme4)         # Linear, generalized linear, & nonlinear mixed models

library(corrplot)     # Visualize correlation matrix
library(gee)          # Generalized Estimation Equation Solver
library(geepack)      # Generalized Estimation Equation Package 
library(MuMIn)        # Multi-Model Inference (calculate QIC)

library(HSAUR)        # package with the dataset
```

### GitHub

Helper `extract` functions for exponentiating parameters from generalized regression models within a `texreg` table of model parameters.

```{r, message=FALSE, error=FALSE}
# remotes::install_github("sarbearschwartz/texreghelpr") # first time

library(texreghelpr)
```


## Background

> This dataset was used as an example in Chapter 11 of "A Handbook of Statistical Analyses Using R" by Brian S. Everitt and Torsten Hothorn.  The authors include this data set in their **HSAUR** package on `CRAN`. 


**The Background**

In each of two centers, eligible patients were randomly assigned to active treatment or placebo. During the treatment, the respiratory status *(categorized poor or good)* was determined at each of four, monthly visits. The trial recruited 111 participants *(54 in the active group, 57 in the placebo group)* and there were no missing data for either the responses or the covariates. 


**The Research Question**

The question of interest is to assess whether the **treatment** is effective and to estimate its effect.  

> Note: We are NOT interested in change over time, but rather mean differences in the treatment group compared to the placebo group, net of any potential confounding due to age, sex, and site.


**The Data**

Note that the data *(555 observations on the following 7 variables)* are in long form, i.e., repeated measurements are stored as additional rows in the data frame.

* Indicators    

    + `subject` the patient ID, a factor with levels 1 to 111
    + `centre` the study center, a factor with levels 1 and 2    
    + `month` the month, each patient was examined at months 0, 1, 2, 3 and 4  
    
    
* Outcome or dependent variable    

    + `status` the respiratory status (response variable), a factor with levels poor and good
    
    
* Main predictor or independent variable of interest   

    + `treatment` the treatment arm, a factor with levels placebo and treatment  
    
    
* Time-invariant Covariates to control for    

    + `sex` a factor with levels female and male    
    + `age` the age of the patient


### Read in the data

```{r}
data("respiratory", package = "HSAUR")

str(respiratory)
```


```{r}
psych::headTail(respiratory)
```

### Wide Format

> Wide format has one line per participant.

```{r}
data_wide <- respiratory %>% 
  tidyr::spread(key = month,
                value = status,
                sep = "_") %>% 
  dplyr::rename("BL_status" = "month_0") %>% 
  dplyr::arrange(subject) %>% 
  dplyr::select(subject, centre, 
                sex, age, treatment, 
                BL_status, starts_with("month"))  

tibble::glimpse(data_wide)
```


```{r}
psych::headTail(data_wide)
```

### Long Format


> Long format has one line per observation.

```{r}
data_long <- data_wide %>% 
  tidyr::gather(key = month,
                value = status,
                starts_with("month")) %>% 
  dplyr::mutate(month = str_sub(month, start = -1) %>% as.numeric) %>% 
  dplyr::mutate(status = case_when(status == "poor" ~ 0,
                                   status == "good" ~ 1)) %>% 
  dplyr::arrange(subject, month) %>% 
  dplyr::select(subject, centre, sex, age, treatment, BL_status, month, status) 


tibble::glimpse(data_long)
```


```{r}
psych::headTail(data_long)
```


## Exploratory Data Analysis

### Summary Statistics

#### Demographics and Baseline Measure

Notice that numerical summaries are computed for all variables, even the categorical variables *(i.e. factors)*.  They have an `*` after the variable name to remind you that the `mean`, `sd`, and `se` are of limited use.

> Notice: the mean age is 33

```{r}
data_wide %>% 
  psych::describe(skew = FALSE)
```




```{r}
data_wide %>% 
  dplyr::group_by(treatment) %>% 
  furniture::table1("Center" = centre, 
                    "Sex" = sex, 
                    "Age" = age, 
                    "Baseline Status" = BL_status, 
                    caption = "Participant Demographics",
                    output = "markdown",
                    na.rm = FALSE,
                    total = TRUE,
                    test = TRUE)
```

#### Status  Over Time

```{r}
data_wide %>% 
  dplyr::group_by(treatment) %>% 
  furniture::table1("Month One" = month_1, 
                    "Month Two" = month_2, 
                    "Month Three" = month_3, 
                    "Month Four" = month_4, 
                    caption = "Respiratory Status Over Time",
                    output = "markdown",
                    na.rm = FALSE,
                    total = TRUE,
                    test = TRUE)
```


Correlation between repeated observations:

```{r}
data_wide %>% 
  dplyr::select(starts_with("month")) %>% 
  dplyr::mutate_all(function(x) x == "good") %>% 
  cor() %>% 
  corrplot::corrplot.mixed()
```



```{r}
data_month_trt_prop <- data_long %>% 
  dplyr::group_by(treatment, month) %>% 
  dplyr::summarise(n = n(),
                   prop_good = mean(status),
                   prop_sd = sd(status),
                   prop_se = prop_sd/sqrt(n))

data_month_trt_prop
```
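Because `status` is coded 0/1, the standard error computed above from `sd()` is essentially the usual binomial standard error of a proportion. The following is a minimal base R sketch of that relationship; the vector `toy_status` is made up for illustration and is not the trial data.

```{r}
# Hypothetical 0/1 outcomes for illustration only (not the trial data)
toy_status <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1)
toy_n <- length(toy_status)
toy_p <- mean(toy_status)              # proportion of "good" visits

# Approach used above: sample SD divided by sqrt(n)
se_from_sd <- sd(toy_status) / sqrt(toy_n)

# Textbook binomial standard error of a proportion
se_binomial <- sqrt(toy_p * (1 - toy_p) / toy_n)

# sd() divides by (n - 1), so the SD-based value matches this form exactly
se_n_minus_1 <- sqrt(toy_p * (1 - toy_p) / (toy_n - 1))

c(se_from_sd   = se_from_sd,
  se_binomial  = se_binomial,
  se_n_minus_1 = se_n_minus_1)
```

With roughly 50 to 60 participants per arm, the difference between the two denominators is negligible for plotting purposes.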






### Visualization

#### Age

```{r}
data_wide %>% 
  ggplot(aes(age)) +
  geom_histogram(binwidth = 5,
                 color = "black",
                 alpha = .3) +
  geom_vline(xintercept = c(20, 35, 45),
             color = "red") +
  theme_bw() +
  facet_grid(treatment ~ .)
```

#### Status by Age 


```{r}
data_wide %>% 
  dplyr::mutate(n_good = furniture::rowsums(month_1 == "good", 
                                            month_2 == "good",
                                            month_3 == "good",
                                            month_4 == "good")) %>% 
  ggplot(aes(x = age,
             y = n_good)) +
  geom_count() +
  geom_smooth() +
  theme_bw() +
  labs(x = "Age in Years",
       y = "Number of Visits out of Four, with 'Good' Respiration")
```


#### Status Over Time

It appears that status is fairly constant over time.  

```{r}
data_month_trt_prop %>% 
  ggplot(aes(x = month,
             y = prop_good,
             group = treatment,
             color = treatment)) +
  geom_errorbar(aes(ymin = prop_good - prop_se,
                    ymax = prop_good + prop_se),
                width = .25) +
  geom_point() +
  geom_line() +
  theme_bw() +
  labs(x = "Time, months",
       y = "Proportion of Participants\nwith 'Good' Respiratory Status") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1.5, "cm"))
```



```{block type='rmdimportant', echo=TRUE}
It is NOT the purpose of this analysis to investigate change over time! 

Since status is largely stable over time, no linear *(or even quadratic)* effect of the `month` variable will be modeled.  

Instead, the four observations on each subject are treated as correlated *(at least with non-independent correlation structure in GEE)*, but no time trend will be included.
```




## Logistic Regression (GLM)

> **Note:** IT IS NEVER APPROPRIATE TO CONDUCT A GLM ON REPEATED MEASURES.  THIS IS DONE FOR ILLUSTRATION PURPOSES ONLY!
 
Since participants in the middle ages seem to do worse than either extreme, the potential quadratic effect of age will be included.  Age is also being grand-mean centered to make the intercept more meaningful.
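To see what centering does and does not change, here is a minimal sketch on simulated data. Everything in it (`toy_df`, the generating coefficients, the seed) is an assumption made for illustration; it is not the respiratory data. Centering age at 33 changes the intercept and the linear coefficient, but leaves the quadratic coefficient and the fitted probabilities unchanged.

```{r}
set.seed(1)

# Simulated toy data: ages spanning roughly the same range as the study
toy_df <- data.frame(toy_age = sample(11:68, size = 200, replace = TRUE))
toy_eta <- -1 - 0.05 * (toy_df$toy_age - 33) + 0.004 * (toy_df$toy_age - 33)^2
toy_df$toy_y <- rbinom(n = 200, size = 1, prob = plogis(toy_eta))

# Same quadratic model, with raw age versus age centered at 33
fit_raw <- glm(toy_y ~ toy_age + I(toy_age^2),
               data = toy_df, family = binomial(link = "logit"))
fit_ctr <- glm(toy_y ~ I(toy_age - 33) + I((toy_age - 33)^2),
               data = toy_df, family = binomial(link = "logit"))

# Only the intercept and linear term differ between parameterizations
cbind(raw = coef(fit_raw), centered = coef(fit_ctr))

# The centered intercept is the predicted logit for a 33 year old
logit_at_33 <- predict(fit_raw, newdata = data.frame(toy_age = 33))
c(centered_intercept = unname(coef(fit_ctr)[1]),
  raw_model_at_33    = unname(logit_at_33))
```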

### Fit the Model

```{r}
resp_glm <- glm(status ~ centre + treatment + sex + BL_status + 
                  I(age-33) + I((age-33)^2),
                data = data_long,
                family = binomial(link = "logit"))

summary(resp_glm)
```

### Tabulate the Model Parameters

```{r, results='asis'}
texreg::knitreg(list(resp_glm,
                     texreghelpr::extract_glm_exp(resp_glm,
                                                  include.any = FALSE)),
                custom.model.names = c("b (SE)", "OR [95 CI]"),
                caption = "Logistic Generalized Linear Regression (GLM)",
                single.row = TRUE,
                digits = 3,
                ci.test = 1)
```



## Generalized Estimating Equations (GEE)

Again, since the relationship between age and respiratory status appears to be non-linear, the potential quadratic effect of age will be modeled.  Age is also being grand-mean centered to make the intercept more meaningful.

### Independence

Using an `"independence"` correlation structure is equivalent to using a GLM analysis *(logistic regression in this case)* and is never appropriate for repeated measures data.  It is only being done here for comparison purposes.

```{r}
resp_gee_in <- gee::gee(status ~ centre + treatment + sex + BL_status + 
                          I(age-33) + I((age-33)^2),
                        data = data_long,
                        family = binomial(link = "logit"),
                        id = subject,
                        corstr = "independence",
                        scale.fix = TRUE,
                        scale.value = 1)

summary(resp_gee_in)
```

A GEE fit with the independence correlation structure produces parameter estimates that are nearly identical to those from the GLM.

The **robust (sandwich) standard errors** are, however, much larger than the naive standard errors.


### Exchangeable

```{r}
resp_gee_ex <- gee::gee(status ~ centre + treatment + sex + BL_status + 
                          I(age-33) + I((age-33)^2),
                        data = data_long,
                        family = binomial(link = "logit"),
                        id = subject,
                        corstr = "exchangeable",
                        scale.fix = TRUE,
                        scale.value = 1)

summary(resp_gee_ex)
```

Notice that the naive standard errors are now more similar to the robust (sandwich) standard errors, which supports this being a better-fitting model.
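For reference, the two working correlation structures compared so far can be written out as matrices for the four post-baseline visits. This is only a sketch: `alpha` is an arbitrary illustrative value, not the correlation estimated by the model.

```{r}
n_visits <- 4
alpha    <- 0.3   # assumed value, for illustration only

# Independence: no correlation between any two visits
work_cor_indep <- diag(n_visits)

# Exchangeable: every pair of visits shares the same correlation
work_cor_exch <- matrix(alpha, nrow = n_visits, ncol = n_visits)
diag(work_cor_exch) <- 1

work_cor_indep
work_cor_exch
```

The exchangeable structure costs a single extra parameter, which is one reason it is a common default when there is no time trend of interest.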



### Parameter Estimates Table

The GEE models display the robust (sandwich) standard errors.

#### Raw Estimates (Logit Scale)

```{r, results="asis"}
texreg::knitreg(list(resp_glm, 
                     resp_gee_in, 
                     resp_gee_ex),
                custom.model.names = c("GLM", 
                                       "GEE-INDEP", 
                                       "GEE-EXCH"),
                caption = "Estimates on Logit Scale",
                single.row = TRUE,
                digits = 4)
```

Comparing the two GEE models: parameters are identical and so are the robust (sandwich) standard errors.



#### Exponentiate the Estimates (odds ratio scale)

```{r, results="asis"}
texreg::knitreg(list(extract_glm_exp(resp_glm), 
                     extract_gee_exp(resp_gee_in), 
                     extract_gee_exp(resp_gee_ex)),
                custom.model.names = c("GLM", 
                                       "GEE-INDEP", 
                                       "GEE-EXCH"),
                caption = "Estimates on Odds-Ratio Scale",
                single.row = TRUE,
                ci.test = 1,
                digits = 3)
```


#### Manual Extraction

```{r}
resp_gee_ex %>% coef() %>% exp()
```


```{r}
trt_EST <- summary(resp_gee_ex)$coeff["treatmenttreatment", "Estimate"] 
trt_EST
exp(trt_EST)
```


```{r}
Trt_SE <- summary(resp_gee_ex)$coeff["treatmenttreatment", "Robust S.E."] 
Trt_SE
```

```{r}
trt_95ci <- trt_EST + c(-1, +1)*1.96*Trt_SE
trt_95ci
exp(trt_95ci)
```


### Final Model

#### Estimates on both the logit and odds-ratio scales

```{r,  results='asis'}
texreg::knitreg(list(resp_gee_ex, 
                     texreghelpr::extract_gee_exp(resp_gee_ex,
                                                  include.any = FALSE)),
                custom.model.names = c("b (SE)", 
                                       "OR [95 CI]"),
                custom.coef.map = list("(Intercept)" = "(Intercept)",
                                       centre2 = "Center 2 vs. 1",
                                       sexmale = "Male vs. Female",
                                       BL_statusgood = "BL Good vs. Poor",
                                       "I(age - 33)" = "Age (Yrs post 33)",
                                       "I((age - 33)^2)" = "Age-Squared",
                                       treatmenttreatment = "Treatment"), 
                caption = "GEE: Final Model (exchangeable)",
                ci.test = 1,
                single.row = TRUE,
                digits = 3)
```


#### Interpretation

* `centre`: Controlling for baseline status, sex, age, and treatment, participants at the two centers did not significantly differ in respiratory status during the intervention.


* `sex`: Controlling for baseline status, center, age, and treatment, a participant's respiratory status did not differ between the two sexes.

* `BL_status`: Controlling for sex, center, age, and treatment, those with good baseline status had nearly 7 times higher odds of having a good respiratory status, compared to participants who started out poor.

* `age`:  Controlling for baseline status, sex, center, and treatment, the role of age was non-linear, such that the odds of a good respiratory status were lowest for patients aged 40 and better for those who were either younger or older.

**Most importantly:**

* `treatment`: Controlling for baseline status, sex, age, and center, those on the treatment had 3.88 times higher odds of having a good respiratory status, compared to similar participants who were randomized to the placebo.
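The age at which the quadratic age effect turns around follows directly from the two age coefficients: for a logit of the form b1(age - 33) + b2(age - 33)^2, the turning point is at age = 33 - b1 / (2 b2), and it is a minimum when b2 is positive. The sketch below illustrates the arithmetic; the helper `age_turning_point()` and the coefficient values passed to it are hypothetical placeholders, not estimates from this model. In practice, substitute the two age coefficients from `coef(resp_gee_ex)`.

```{r}
# Hypothetical helper: vertex of a quadratic in (age - center)
age_turning_point <- function(b_lin, b_quad, center = 33) {
  center - b_lin / (2 * b_quad)
}

# Placeholder coefficients for illustration only (NOT the fitted values)
age_turning_point(b_lin = -0.1, b_quad = 0.01)
```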




### Predicted Probabilities


#### Refit with the `geepack` package

This package lets you plot, but not put the results in a table.

```{r}
resp_geeglm_ex <- geepack::geeglm(status ~ centre + treatment + sex + BL_status + 
                                    I(age-33) + I((age-33)^2),
                                  data = data_long,
                                  family = binomial(link = "logit"),
                                  id = subject,
                                  waves = month,
                                  corstr = "exchangeable")
```



#### Make predictions

What is the chance that a 40 year old man in poor condition at center 1 is rated as being in "Good" respiratory condition?


```{r}
resp_geeglm_ex %>% 
  emmeans::emmeans(pairwise ~ treatment,
                   at = list(centre = "1",
                             sex = "male",
                             age = 40,
                             BL_status = "poor"),
                   type = "response")
```


A 40 year old man in poor condition at center 1 has a 15.8% chance of being rated as being in "Good" respiratory condition if he was randomized to placebo.


A 40 year old man in poor condition at center 1 has a 42.1% chance of being rated as being in "Good" respiratory condition if he was randomized to the active treatment.

The odds ratio for treatment is:

```{r}
(.421/(1 -.421) )/( .158/(1 - .158))
```
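The same relationship can be run in the other direction: starting from the placebo probability and the treatment odds ratio of about 3.88 reported for the final model, the treatment probability can be recovered. This is a small sketch; the helper names `prob_to_odds()` and `odds_to_prob()` are illustrative and do not come from any package used here.

```{r}
# Illustrative helpers for moving between probabilities and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

p_placebo <- 0.158   # predicted probability under placebo (from above)
or_trt    <- 3.88    # treatment odds ratio from the final model

# Multiply the placebo odds by the odds ratio, then convert back
p_treat <- odds_to_prob(prob_to_odds(p_placebo) * or_trt)
round(p_treat, 3)
```

The result should land close to the 42.1% reported above, with any small gap due to rounding of the inputs.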


### Marginal Plot to Visualize the Model


#### Quickest Version


The `interactions::interact_plot()` function can only investigate 3 variables at once:

* `pred` the x-axis *(must be continuous)*
* `modx` different lines *(may be categorical or continuous)*
* `mod2` side-by-side panels *(may be categorical or continuous)*

All other variables must be held constant.

```{r}
interactions::interact_plot(model = resp_geeglm_ex,   # model name
                            pred = age,               # x-axis
                            modx = treatment,         # lines
                            mod2 = sex,               # panels
                            data = data_long,
                            at = list(centre = "1",
                                      BL_status = "good"),  # hold constant
                            type = "mean_subject") +
  theme_bw()
```




#### Full Version



This makes a dataset with all combinations of the predictors:


```{r}
resp_geeglm_ex %>% 
  emmeans::emmeans(~ centre + treatment + sex + age + BL_status,
                   at = list(age = c(20, 35, 50)),
                   type = "response") %>% 
  data.frame()
```



In order to include all **FIVE** variables, we must do it the LONG way...


```{r}
resp_geeglm_ex %>% 
  emmeans::emmeans(~ centre + treatment + sex + age + BL_status,
                   at = list(age = seq(from = 11, to = 68, by = 1)),
                   type = "response",
                   level = .68) %>% 
  data.frame() %>% 
  ggplot(aes(x = age,
             y = prob,
             group = interaction(sex, treatment))) +
  geom_ribbon(aes(ymin = asymp.LCL,
                  ymax = asymp.UCL,
                  fill = fct_rev(sex)),
              alpha = .3) +
  geom_line(aes(color = fct_rev(sex),
                linetype = fct_rev(treatment))) +
  theme_bw() + 
  facet_grid(centre ~ BL_status, labeller = label_both) +
  labs(x = "Age, years",
       y = "Predicted Probability of GOOD Respiratory Status",
       color = "Sex:",
       fill = "Sex:",
       linetype = "Assignment:") +
  theme(legend.position = "bottom")
```


#### Females in Center 1


This example uses default settings.

```{r}
interactions::interact_plot(model = resp_geeglm_ex,
                            pred = age,
                            modx = treatment,
                            mod2 = BL_status,
                            at = list(sex = "female",
                                      centre = "1")) 
```




#### Males in Center 2

This example is better prepared for publication.




```{r}
interactions::interact_plot(model = resp_geeglm_ex,
                            pred = age,
                            modx = treatment,
                            mod2 = BL_status,
                            at = list(sex = "male",
                                      centre = "2"),
                            x.label = "Age in Years",
                            y.label = "Predicted Probability of 'Good' Respiratory Status",
                            legend.main = "Intervention: ",
                            mod2.labels = c("Poor at Baseline",
                                            "Good at Baseline"),
                            colors = rep("black", times = 2)) +
  theme_bw() +
  theme(legend.position = c(1, 0),
        legend.justification = c(1.1, -0.1),
        legend.background = element_rect(color = "black"),
        legend.key.width = unit(1.5, "cm")) +
  labs(caption = "Note: Probabilities shown are specific to males at center 2")
```





```{r}
resp_geeglm_ex %>% 
  emmeans::emmeans(~ centre + treatment + sex + age + BL_status,
                   at = list(age = seq(from = 11, to = 68, by = 1),
                             sex = "male",
                             centre = "2"),
                   type = "response",
                   level = .68) %>% 
  data.frame() %>% 
  ggplot(aes(x = age,
             y = prob,
             group = interaction(sex, treatment))) +
  geom_ribbon(aes(ymin = asymp.LCL,
                  ymax = asymp.UCL),
              alpha = .2) +
  geom_line(aes(linetype = fct_rev(treatment))) +
  theme_bw() + 
  facet_grid(~ BL_status) +
  labs(x = "Age, years",
       y = "Predicted Probability of\nGOOD Respiratory Status",
       color = "Sex:",
       fill = "Sex:",
       linetype = "Assignment:") +
  theme(legend.position = "bottom")
```



## Conclusion


**The Research Question**

The question of interest is to assess whether the **treatment** is effective and to estimate its effect.  


**The Conclusion**

After accounting for baseline status, age, sex and center, participants in the active treatment group had nearly four times higher odds of having 'good' respiratory status, when compared to the placebo, exp(b) = 3.881, p<.001, 95% CI [1.85,  8.14].
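As a rough consistency check on the reported result, the Wald statistic can be reconstructed from the published odds ratio and interval alone. The sketch assumes a symmetric, normal-based 95% interval on the logit scale (multiplier 1.96, as in the manual extraction earlier); the standard error here is back-calculated from the interval rather than taken from the model output, so expect small rounding differences.

```{r}
or_hat <- 3.881            # reported odds ratio for treatment
ci_or  <- c(1.85, 8.14)    # reported 95% CI on the odds-ratio scale

b_hat  <- log(or_hat)                       # estimate on the logit scale
se_hat <- diff(log(ci_or)) / (2 * 1.96)     # back-calculated standard error

z_wald <- b_hat / se_hat                    # Wald z statistic
p_wald <- 2 * pnorm(-abs(z_wald))           # two-sided p-value

c(b = b_hat, se = se_hat, z = z_wald, p = p_wald)
```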