index.rmd

---
title: "Open Case Studies: Influence of Multicollinearity on Measured Impact of Right-to-Carry Gun Laws"
css: style.css
output:
  html_document:
    includes:
      in_header: GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---

<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/icon-bahi.png");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>


---


```{r setup, include=FALSE}
knitr::opts_chunk$set(
  include = TRUE, comment = NA, echo = TRUE,
  message = FALSE, warning = FALSE, cache = FALSE, fig.width = 10, fig.height = 7,
  fig.align = "center", out.width = "90%"
)
library(here)
library(knitr)
library(magrittr)
remotes::install_github("opencasestudies/OCSdata")
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)
library(OCSdata)

rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```

<div id="google_translate_element"></div>

<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>

<script type="text/javascript">
function googleTranslateElementInit() {
  new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>


#### {.outline}

```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "mainplot.png"))
```

#### 

#### {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

####

#### {.license_block}

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"}  United States License.

####

#### {.reference_block}

To cite this case study please use:

Wright, Carrie and Ontiveros, Michael and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). [https://github.com//opencasestudies/ocs-bp-RTC-analysis](https://github.com//opencasestudies/ocs-bp-RTC-analysis). Influence of Multicollinearity on Measured Impact of Right-to-Carry Gun Laws (Version v1.0.0).

####

To access the GitHub repository for this case study see here: https://github.com//opencasestudies/ocs-bp-RTC-analysis.  

You may also access and download the data using our `OCSdata` package. To learn more about this package including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:

```{r, eval=FALSE}
install.packages("OCSdata")
```

This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).

***

The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below: 

```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
  dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
  dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
  dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```

***

**Readability Score: **

A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age. 

```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>%  # reading table into dataframe, selecting relevant factors
  tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
                                     age = round(as.numeric(age))
                                     )
knitr::kable(df, format="markdown")
```

***

Please help us by filling out our survey.


<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>


# **Motivation**
*** 

This case study will introduce the topic of [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity){target="_blank"}, which occurs in regression when one or more independent variables can be predicted by other independent variables. 

We will do so by showcasing a real world example where multicollinearity in part resulted in historically controversial and conflicting findings about the influence of the adoption of right-to-carry (RTC) concealed handgun laws on violent crime rates in the United States. 

We will focus on two articles:

1. The first analysis by [Mustard and Lott](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"} published in 1996 suggests that RTC laws reduce violent crime. Lott authored a book extending these findings in 1998 called [***More Guns, Less Crime***](https://en.wikipedia.org/wiki/More_Guns,_Less_Crime){target="_blank"}.

```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "Lott.png"))
```

##### [[source]](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"}

2. The second analysis is a recent article by [Donohue, et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} published in 2017 that suggests that RTC laws increase violent crime. Donohue has also published previous articles with titles such as [***Shooting down the "More Guns, Less Crime" Hypothesis***](https://www.jstor.org/stable/1229603?seq=1){target="_blank"}. 

```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "Donohue.png"))
```

##### [[source]](https://www.nber.org/papers/w23510.pdf){target="_blank"}

This has been a controversial topic as many other analyses also produced conflicting results. See [here](https://en.wikipedia.org/wiki/More_Guns,_Less_Crime){target="_blank"} for a list of studies.

The [Donohue, et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} article discusses how there are many other important methodological aspects besides [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity){target="_blank"} (which occurs when predictor or input variables are highly related in a regression analysis) that could account for the historically conflicting results in these previous manuscripts.

In fact, nearly every aspect of the data analysis process was different between the [Donohue, et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} and [Mustard and Lott](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"} analyses.

```{r, echo=FALSE, out.height = '75%', out.width = '75%', fig.align='center'}
knitr::include_graphics(here("img", "Educational_Graphic1.jpg"))
```


However, we will focus particularly on multicollinearity and how it can influence the results we get from linear regression. 
Specifically, this analysis will demonstrate how methodological details can be critically influential for our overall conclusions and can result in important policy related consequences. The [Donohue, et al. article]((https://www.nber.org/papers/w23510.pdf){target="_blank"}) will provide the motivation and illustration. 

#### {.reference_block}

John J. Donohue et al., Right‐to‐Carry Laws and Violent Crime: A Comprehensive Assessment Using Panel Data and a State‐Level Synthetic Control Analysis. *Journal of Empirical Legal Studies*, 16,2 (2019).

David B. Mustard & John Lott. Crime, Deterrence, and Right-to-Carry Concealed Handguns. *Coase-Sandor Institute for Law & Economics* Working Paper No. 41, (1996).

####


Before we leave this section, we provide a high-level overview of what variables were (or were not) included in the [Donohue, Aneja and Weber](https://www.nber.org/papers/w23510.pdf){target="_blank"} (DAW) paper and the [Mustard and Lott](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"} (ML) paper:


```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "Donohue_Table2_edited.png"))
```

##### [[source]](https://www.nber.org/papers/w23510.pdf){target="_blank"}


###### *ML is abbreviated as LM in the source article

**Note**: We are not attempting to re-create the analyses from the original authors. Instead, we aim to use a subset of the listed explanatory variables in this case study to demonstrate multicollinearity. These variables will be consistent for both analyses that we will perform, with the exception that one analysis will have 6 demographic variables as in the analysis in the [Donohue, et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} article and the other will have 36 demographic variables, grouping individuals into more specific categories, as in the analysis in the [Mustard and Lott](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"} article.


# **Main Question**
*** 

#### {.main_question_block}
<b><u> Our main question: </u></b>

What is the effect of multicollinearity on coefficient estimates from linear regression models when analyzing right to carry laws and violence rates?

####


Specifically, we will consider the two ways to define the demographic variables (as described above) and investigate how the inclusion of different numbers of age groups influences the results of an analysis of right to carry laws and violence rates.


# **Learning Objectives** 
*** 

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

<u>**Data Science Learning Objectives:**</u>  

1. Create correlation scatterplots and heatmaps (`GGally`, `ggcorrplot`)  
2. Create interactive tables (`DT`)  
3. Sample subsets of data (`rsample`)  
4. Combine multiple plots (`cowplot`)  
5. Create data visualizations with equations and text(`ggplot2` and `latex2exp`)  

<u>**Statistical Learning Objectives:**</u> 

1. Understand what multicollinearity is and how it can influence linear regression coefficients
2. Recognize signs for the presence of multicollinearity and determine its severity
3. Illustrate the difference between multicollinearity and correlation 
4. Implement panel regression analysis in R (`plm`)
5. Relate variance inflation factors (VIFs) to multicollinearity and calculate VIFs in R (`car`)

To see another case study about how the original raw data was imported and wrangled please see [here](https://www.opencasestudies.org/ocs-bp-RTC-wrangling/){target="_blank"}.

We will especially focus on using packages and functions from the [`tidyverse`](https://www.tidyverse.org/){target="_blank"}, such as `dplyr` and `ggplot2`. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially legible and intuitive.

```{r, out.width = "20%", echo = FALSE, fig.align ="center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```

# **Context**
***

So what exactly is a **right-to-carry law**?

It is a law that specifies _if_ and _how_ citizens are allowed to have a firearm on their person or nearby (for example, in a citizen's car) in public. In this discussion, we will use the [National Rifle Association (NRA)](https://www.nraila.org/gun-laws/){target="_blank"} terminology. Please keep in mind that there are other terms that people use. 

The [Second Amendment](https://en.wikipedia.org/wiki/Second_Amendment_to_the_United_States_Constitution){target="_blank"} to the United States Constitution guarantees the right to "keep and bear arms". The amendment was ratified in 1791 as part of the [Bill of Rights](https://en.wikipedia.org/wiki/United_States_Bill_of_Rights){target="_blank"}.

```{r, echo=FALSE, out.height = '50%', out.width = '50%', fig.align='center'}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/7/79/Bill_of_Rights_Pg1of1_AC.jpg")
```

##### [[source]](https://upload.wikimedia.org/wikipedia/commons/7/79/Bill_of_Rights_Pg1of1_AC.jpg){target="_blank"}

However, there are no federal laws about carrying firearms in public. 

These laws are created and enforced at the US state level. 
States vary greatly in their laws about the right to carry firearms. 
Some require extensive effort to obtain a permit to legally carry a firearm, while other states require very minimal effort to do so. An increasing number of states do not require permits at all.

<details> <summary> Click here for more information on history of right-to-carry policies in the US. </summary>

According to the [Wikipedia entry](https://en.wikipedia.org/wiki/History_of_concealed_carry_in_the_U.S.){target="_blank"} about the history of right-to-carry policies in the United States:

> Public perception on concealed carry vs open carry has largely flipped. In the early days of the United States, open carrying of firearms, long guns and revolvers was a common and well-accepted practice. Seeing guns carried openly was not considered to be any cause for alarm. Therefore, anyone who would carry a firearm but attempt to conceal it was considered to have something to hide, and presumed to be a criminal. For this reason, concealed carry was denounced as a detestable practice in the early days of the United States.

> Concealed weapons bans were passed in Kentucky and Louisiana in 1813. (In those days open carry of weapons for self-defense was considered acceptable; concealed carry was denounced as the practice of criminals.) By 1859, Indiana, Tennessee, Virginia, Alabama, and Ohio had followed suit. By the end of the nineteenth century, similar laws were passed in places such as Texas, Florida, and Oklahoma, which protected some gun rights in their state constitutions. Before the mid 1900s, most U.S. states had passed concealed carry laws rather than banning weapons completely. Until the late 1990s, many Southern states were either "No-Issue" or "Restrictive May-Issue". Since then, these states have largely enacted "Shall-Issue" licensing laws, with numerous states legalizing "Unrestricted concealed carry".

</details>

There are five broad categories of right-to-carry laws according to the NRA:

```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "RTC.png"))
```

##### [[source]](https://www.nraila.org/gun-laws/){target="_blank"}

```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "RTC_map.png"))
```

##### [[source]](https://www.nraila.org/gun-laws/){target="_blank"}

You can see that no state in the US currently (this map is from 2020) has a "Rights Infringed/Non-Issue" law (the gray category) -- meaning that all 50 states in the US allow the right to carry firearms at least in some way. 
However the level of restrictions is dramatically different from one state to another.
For comparison purposes, many authors use the terms "shall issue", "shall issue with discretion", "no permit required", "may issue", and "non-issue" to describe these categories instead of the NRA terminology.


<details> <summary> Click here for more information about how restrictions vary from one state to another. </summary>

There is variation from state to state even within the same general category:

For example here is an abridged version of the [current carry laws in Idaho](https://www.nraila.org/gun-laws/state-gun-laws/idaho/) which is considered an "Unrestricted - no permit required" state:

> State law ... allows any resident of Idaho or a current member of the armed forces of the United States to carry a concealed handgun without a license to carry, provided the person is over 18 years old and not disqualified from being issued a license to carry concealed weapons under state law. An amendment to state law that takes effect on July 1, 2020 changes the reference in the above law from “a resident of Idaho” to “any citizen of the United States.”  


And here are is an abridged version of the [current carry laws in Arizona](https://www.nraila.org/gun-laws/state-gun-laws/arizona/) which is also considered an "Unrestricted - no permit required" state:

> Any person 21 years of age or older, who is not prohibited possessor, may carry a weapon openly or concealed without the need for a license...

Notice that citizens in Idaho only need to be 18 to carry a firearm, whereas they must be 21 in Arizona. 

</details>


# **Limitations**
*** 

There are some important considerations regarding this data analysis to keep in mind: 

1. We do not use all of the data used by either the [Mustard and Lott](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"} or [Donohue, et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} analyses, nor do we perform the same analysis as in each article. We instead perform a much simpler analysis with fewer variables for the purposes of illustration of the concept of multicollinearity and its influence on regression coefficients, not to reproduce either analysis.

2. Our analysis accounts for either the adoption or lack of adoption of a permissive right-to-carry law in each state, but does not account for differences in the level of permissiveness of the laws.

Recall that these are the categories of right to carry laws:
```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "RTC.png"))
```

States with laws of the category rights restricted - very limited issue (red) are considered as not having a permissive right-to-carry law. Recall that no states currently have a rights infringed/non-issue law.

States of all other categories (shall issue, discretionary/reasonable issue, and no permit required, all shades of blue) are considered the same in our analysis, as having a permissive right-to-carry law.

3) Because our analysis is an oversimplification, the results presented here should not be used for determining policy changes; instead we suggest that users interested in such a determination consult with a specialist.


4) The inclusion of race as an explanatory variable in an epidemiological study can be useful in certain circumstances. However, there are limitations and issues around defining, determining, and reporting race, as well as in interpreting differences in public health outcomes by race. For more information on this topic, we have included a [link](https://academic.oup.com/epirev/article/22/2/187/456942) to a paper on the use of race as a measure in epidemiology. We include race in this analysis to demonstrate and consider the limitations of what the previous papers have done to analyze the influence of RTC laws on violent crime, with a focus on multicollinearity. Thus in our analysis we have also defined race as was previously done in these papers. Furthermore, we want to point out that reporting analyses about crime with race as a variable can have very unexpected consequences and thus care should be taken. See [here](https://journals.sagepub.com/doi/full/10.1177/0963721418763931) for suggestions. Any association between demographic variables (indicating the proportion of the population from specific race and age groups) and violent crime does not necessarily indicate that the two are linked causally, as aside from the issues presented in the [article]((https://academic.oup.com/epirev/article/22/2/187/456942)), this may instead indicate higher rates of police engagement with certain racial groups due to [racial profling](https://www.aclu.org/other/racial-profiling-definition).

The ACLU defines racial profiling as:

>"Racial Profiling" refers to the discriminatory practice by law enforcement officials of targeting individuals for suspicion of crime based on the individual's race, ethnicity, religion or national origin.

***


We will begin by loading the packages that we will need:

```{r}
library(here)
library(dplyr)
library(magrittr)
library(purrr)
library(tibble)
library(ggplot2)
library(naniar)
library(ggrepel)
library(plm)
library(broom)
library(GGally)
library(ggcorrplot)
library(rsample)
library(DT)
library(car)
library(stringr)
library(cowplot)
library(latex2exp)
library(OCSdata)
```

<u>**Packages used in this case study:** </u>  

  Package   | Use in this case study                                                                        
---------- |-------------
[`here`](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[`dplyr`](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[`magrittr`](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} | to use the compound assignment pipe operator `%<>%`
[`purrr`](https://purrr.tidyverse.org/){target="_blank"}   | to import the data in all the different excel and csv files efficiently  
[`tibble`](https://tibble.tidyverse.org/){target="_blank"}     | to create data objects that we can manipulate with `dplyr`/`stringr`/`tidyr`/`purrr`  
[`ggplot2`](https://ggplot2.tidyverse.org/){target="_blank"}     | to create plots 
[`naniar`](https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html){target="_blank"}  | to quickly visualize missing data     
[`ggrepel`](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html){target="_blank"}    | to allow labels in figures not to overlap  
[`plm`](https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html){target="_blank"} | to work with panel data fitting fixed effects and linear regression models  
[`broom`](https://cran.r-project.org/web/packages/broom/vignettes/broom.html){target="_blank"} | to create nicely formatted model output  
[`GGally`](https://github.com/ggobi/ggally){target="_blank"} | to extend ggplot2 functionality to easily create more complex plots  
[`ggcorrplot`](https://www.rdocumentation.org/packages/ggcorrplot/versions/0.1.3){target="_blank"} | to easily visualize a correlation matrix  
[`rsample`](https://rsample.tidymodels.org){target="_blank"} | to split our sample for the simulation analysis   
[`DT`](https://rstudio.github.io/DT/){target="_blank"}  | to create interactive and searchable tables  
[`car`](https://cran.r-project.org/web/packages/car/vignettes/embedding.pdf){target="_blank"}  | to calculate VIF values on linear model output  
[`stringr`](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the character strings within the data  
[`cowplot`](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined 
[`latex2exp`](https://cran.r-project.org/web/packages/latex2exp/vignettes/using-latex2exp.html){target="_blank"} | to convert latex math formulas to R's plotmath expressions  
[`OCSdata`](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files

The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.


# **What are the data?**
***

Below is a table from the [Donohue, et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} paper that shows the data used in both analyses, where DAW stands for [Donohue, et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} and LM stands for [Mustard and Lott](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"}.


```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "Donohue_AppendixJ.png"))
```

We will be using a subset of these variables, which are highlighted in green:


```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "ourdata.png"))
```


# **Data Import and Wrangling**
***

See [this case study](https://www.opencasestudies.org/ocs-bp-RTC-wrangling/){target="_blank"} for details about data import and wrangling. We include data sources here for completeness, but will start from processed data in this case study.

## **Demographic and population data**
***

To obtain information about age, sex, race, and overall population size we will use US Census Bureau data, as was done in both of the articles. The census data is available for different time spans. Here are the links for the years used in our analysis. We will use data from 1977 to 2010.

Data   | Link                                                                        
---------- |-------------
**years 1977 to 1979**  | [link](https://www2.census.gov/programs-surveys/popest/tables/1900-1980/state/asrh/)  
**years 1980 to 1989**  | [link](https://www2.census.gov/programs-surveys/popest/tables/1980-1990/counties/asrh/) * county data was used for this decade which also has state information
**years 1990 to 1999**  | [link](https://www2.census.gov/programs-surveys/popest/tables/1990-2000/state/asrh/)
**years 2000 to 2010**  | [link](https://www.census.gov/data/datasets/time-series/demo/popest/intercensal-2000-2010-state.html) <br> [technical documentation](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2000-2010/intercensal/state/st-est00int-alldata.pdf){target="_blank"}

## **State FIPS codes**
***

The  data was downloaded from the [US Census Bureau](https://www.census.gov/geographies/reference-files/2014/demo/popest/2014-geocodes-state.html){target="_blank"}.

## **Police staffing data**
***

The following data was downloaded from the [Federal Bureau of Investigation](https://crime-data-explorer.fr.cloud.gov/downloads-and-docs){target="_blank"}. 

## **Unemployment data**
***

The following data was downloaded from the [U.S. Bureau of Labor Statistics](https://data.bls.gov/cgi-bin/dsrv?la){target="_blank"}. 

## **Poverty data**
***

Extracted from Table 21 from [US Census Bureau Poverty Data ](https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-poverty-people.html){target="_blank"}.

## **Violent crime**
***

Violent crime data was obtained from [here](https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm){target="_blank"}. 

## **Right-to-carry data**
***

This data is extracted from table in [Donohue paper](https://www.nber.org/papers/w23510.pdf){target="_blank"}. 


***

Here is the table from the [Donohue paper](https://www.nber.org/papers/w23510.pdf){target="_blank"} that compares the data used in the analyses:


```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "Donohue_Table2_edited.png"))
```

##### [[source]](https://www.nber.org/papers/w23510.pdf){target="_blank"}


###### *ML is abbreviated as LM in the source article

We can see that only the percentage of males that were from age 15-39 of the race groups (black, white, and other) were used in the Donohue analysis, whereas the Mustard and Lott paper, individuals from ages 10 to over 65 were used.

The final products from our data wrangling which is described in [another case study](https://www.opencasestudies.org/ocs-bp-RTC-wrangling/){target="_blank"} are two tibbles of data each with variables selected to be similar to those used in either the Donohue or Mustard and Lott analysis. The overarching idea of this case study is to compare the model results from these two datasets, which differ only in the demographic variables. This is an oversimplification of the actual differences between the datasets and approaches taken by Donohue and Lott/Mustard, but will be useful for illustrating the impact of multicollinearity on our modeling results, which is the main objective of this case study.

If you have trouble accessing the GitHub Repository, the "Wrangled_data.rda" data can be downloaded from [here](https://github.com/opencasestudies/ocs-bp-RTC-analysis/blob/master/data/Wrangled_data.rda).

We will load this data now. 

In our case, we downloaded this data and put it within a "wrangled" subdirectory of the "data" directory for our project. If you use an RStudio project, then you can use the `here()` function of the `here` package to make the path for loading this data simpler. The `here` package automatically starts looking for files based on where you have a `.Rproj` file which is created when you start a new RStudio project. We can specify that we want to look for the file within the "wrangled" subdirectory of the "data" directory within a directory where our `.Rproj` file is located by separating the name of the "data" directory and "wrangled" and the file name using commas.

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects. 

</details>

***

```{r}
load(file = here::here("data", "wrangled", "Wrangled_data.rda"))
```

You may also download the data files with the `OCSdata` package. If you prefer this method, use the following code chunk to download and load the data:

```{r, eval=FALSE}
# install.packages("OCSdata")
library(OCSdata)
wrangled_rda("ocs-bp-RTC-analysis", outpath = getwd())
# This will save the wrangled data files in a "OCSdata/data/wrangled/" 
# sub-folder in your current working directory
load(file = here::here("OCSdata", "data", "wrangled", "Wrangled_data.rda"))
# this will load the data into R
```

We will check the dimensions of each tibble using the base `dim()` function:
```{r}
dim(LOTT_DF)
dim(DONOHUE_DF)
```

As expected the `Lott_DF` is 30 columns larger, due to the 30 additional demographic variables. We can check those now as well.

```{r}
LOTT_DF %>%
  colnames()

DONOHUE_DF %>%
  colnames()
```

Lastly, we will check that the `YEAR` values are the same, i.e., that the two tibbles contain data from the same set of years. We can use the `setequal()` function of the `dplyr` package to see if the values are the same. 

```{r}
setequal(
  DONOHUE_DF %>% distinct(YEAR),
  LOTT_DF %>% distinct(YEAR)
)
```


# **Data Exploration**
***

Let's do some quick visualizations to get a sense of our outcome of interest, the violent crime data. 
 
First we will plot the rate of violent crime over time to get a sense of the general trend.

To do so we need to summarize the data for each year across all of the states. 
Thus we will use the `group_by()` function and the `summarize()` functions to calculate an overall number of violent crimes relative to the total population for each year. In fact, we will calculate the log of the number of violent crimes per 100,000 individuals, as this will make it easier to see the trend in the data. In addition, by looking at the rate per 100,000 individuals, we can more directly compare crime statistics across groups like states with different populations, as you will see below.

Then we will use the `ggplot2` package to plot the data. The first step in creating a plot with this package is to use the `ggplot()` function and the `aes()` argument to specify what data should be plotted on the x-axis and what data should be plotted in on the y-axis. Then we select what type of plot we would like to make using one of the `geom_*()` functions. Please see [this case study](https://opencasestudies.github.io/ocs-bp-co2-emissions/){target="_blank"}  for more details.

We can use the `scale_x_continuous()` and `scale_y_continuous()` functions to modify the axis tick marks and their labels.

The `labs()` function can be used to add labels to the plot, while the `theme()` function allows for manipulation of the details of the labels, like size and angle. 

All of these functions are part of the `ggplot2` package.
 
```{r}

DONOHUE_DF %>%
  group_by(YEAR) %>%
  summarize(
    Viol_crime_count = sum(Viol_crime_count),
    Population = sum(Population),
    .groups = "drop"
  ) %>%
  mutate(Viol_crime_rate_100k_log = log((Viol_crime_count * 100000) / Population)) %>%
  ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log)) +
  geom_line() +
  scale_x_continuous(
    breaks = seq(1980, 2010, by = 1),
    limits = c(1980, 2010),
    labels = c(seq(1980, 2010, by = 1))
  ) +
  scale_y_continuous(
    breaks = seq(5.75, 6.75, by = 0.25),
    limits = c(5.75, 6.75)
  ) +
  labs(
    title = "Crime rates fluctuate over time",
    x = "Year",
    y = "ln(violent crimes per 100,000 people)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))
```
Interesting! It appears that there was an overall national peak in violent crime in the early 1990s that has since then declined.  


Now let's take a look at each state.

We will use the `ggrepel` package to add text to the plot using the `geom_text_repel()` function. This is especially useful when there is a lot of text, as this function reduces the overlap of text labels. Again see [this case study](https://opencasestudies.github.io/ocs-bp-co2-emissions/){target="_blank"}  for more details on how to add labels to elements of plots.
 
```{r}

DONOHUE_DF %>%
  mutate(Viol_crime_rate_100k_log = log((Viol_crime_count * 100000) / Population)) %>%
  ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log, color = STATE)) +
  geom_point(size = 0.5) +
  geom_line(aes(group = STATE),
    size = 0.5,
    show.legend = FALSE
  ) +
  geom_text_repel(
    data = DONOHUE_DF %>%
      mutate(Viol_crime_rate_100k_log = log((Viol_crime_count * 100000) / Population)) %>%
      filter(YEAR == last(YEAR)),
    aes(
      label = STATE,
      x = YEAR,
      y = Viol_crime_rate_100k_log
    ),
    size = 3,
    alpha = 1,
    nudge_x = 10,
    direction = "y",
    hjust = 1,
    vjust = 1,
    segment.size = 0.25,
    segment.alpha = 0.25,
    force = 1,
    max.iter = 9999
  ) +
  guides(color = FALSE) +
  scale_x_continuous(
    breaks = seq(1980, 2015, by = 1),
    limits = c(1980, 2015),
    labels = c(seq(1980, 2010, by = 1), rep("", 5))
  ) +
  scale_y_continuous(
    breaks = seq(3.5, 8.5, by = 0.5),
    limits = c(3.5, 8.5)
  ) +
  labs(
    title = "States have different levels of crime",
    x = "Year",
    y = "ln(violent crimes per 100,000 people)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))
```

It looks like the crime rates vary quite a bit from one state to another. Some states show increased crime over time while others show decreased crime. 


Now let's take a closer look at some of our other variables.

As we do this, we may start to run into issues with missing data for some of our variables, and it is always good to be aware of missing data in any data analysis. We can use the `vis_miss()` function of the `naniar` package to confirm that there are no missing values.

```{r}
DONOHUE_DF %>%
  naniar::vis_miss()
```

Looks like no missing data! 

```{r}
LOTT_DF %>%
  naniar::vis_miss()
```
Same for the `LOTT_DF`.

We can use the `skim()` of the `skimr` package to get a better sense of the data. This also shows missingness, as well as standard deviations, spread, and means for our data. Also notice that there is a small histogram of each variable in the final column.

```{r, eval=FALSE}
skimr::skim(DONOHUE_DF)
```

```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "skim1.png"))
knitr::include_graphics(here("img", "skim2.png"))
```

```{r, eval=FALSE}
skimr::skim(LOTT_DF)
```

```{r, echo=FALSE, out.height = '100%', out.width = '100%', fig.align='center'}
knitr::include_graphics(here("img", "skim3.png"))
knitr::include_graphics(here("img", "skim4.png"))
knitr::include_graphics(here("img", "skim5.png"))
```

We can see from this function that we have the number of variables of the class types that we expect for each tibble. We can also see that the means of the variables that should be the same for each tibble are in fact the same. 
We can also tell that the values for the variables are in general what we would expect.

# **Data Analysis**
***

## **Panel Analysis**
***

In these datasets, we have what is called [panel data](https://en.wikipedia.org/wiki/Panel_data){target="_blank"}, a special type of longitudinal data. Longitudinal data are data measurements taken over time. Panel data are data repeatedly measured for multiple panel members or individuals over time. This is in contrast with time series data, which measures one individual over time and cross sectional data, which measures multiple individuals at one point in time.  In other words, panel data is a combination of both, with measurements for multiple individuals/units of observation over multiple time periods.  In our case, we have measurements of violent crime and other variables for each state over many years. Therefore we are using measurements about the same states over time.  

In a panel analysis there are $N$ individual panel members and $T$ time points.  

There are two types of panels:  
1. **Balanced** - At each time point ($T$), there are data points for each individual($N$). 

Time Points ($T$)  | Individuals ($N$)                                                                      
---------- |-------------
1977  | Nevada 
1977  | Alabama
1977  | Kansas
1978 | Nevada
1978 | Alabama
1978  | Kansas
1979 | Nevada
1979 | Alabama
1979 | Kansas

2. **Unbalanced** - There may be data points missing for some individuals ($N$) at some time points ($T$).

Time Points ($T$)  | Individuals ($N$)                                                                         
---------- |-------------
1977  | Nevada 
1977  | Alabama
1978 | Nevada
1978 | Alabama
1979 | Nevada
1979 | Alabama
1979 | Kansas


Overall in a balanced panel, we have $n$ observations, where $n = N*T$.  

In an unbalanced panel, the number of observations is less than $N*T$.


In our case we have:  
$N$ = 45 states (in the data wrangling process we removed those who had adopted an RTC law before 1980)  
$T$ =  31 years (1980 - 2010)  

In every year we have measurements for each state (as we just saw above), thus our panel is balanced.

So, our total observations $n = 45*31$, thus $n$ = `r 45*31`.  


We will be performing a **panel linear regression model analysis**. 


In such an analysis we will model our data according to this generic model:

$$Y_{it}=β_{0}+β_{1}X_{1it}+...+β_{K}X_{Kit}+e_{it}$$

Where $i$ is the individual dimension (in our case individual states) and $t$ is the time dimension.

Some explanatory/independent variables or regressors $X_{it}$ will vary across individuals and time, while others will be fixed across the time of the study (or don't change over time), while others still will be fixed across individuals but vary across time periods.


There are three general sub-types of [panel regression analysis](https://en.wikipedia.org/wiki/Panel_analysis){target="_blank"}.

Overall, they assume that the different individuals are independent, however the same data for the same individual may be correlated across time.

The main difference between the three sub-types are the assumptions about unobserved differences between individuals.

If you are familiar with fixed and random effects in the context in the statistical literature the following econometric definitions will sound a bit different.

From the `plm` package [vignette](https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html):

> In the mixed models literature,...fixed effect indicates a parameter that is assumed constant, while random effects are parameters that vary randomly around zero according to a joint multivariate normal distribution.

> ...Having fixed effects in an **econometric model** has the meaning of allowing the intercept to vary with group, or time, or both, while the other parameters are generally still assumed to be homogeneous. Having random effects means having a group– (or time–, or both) specific component in the error term.


OK, so now that we know not to expect the typical mixed model definitions of fixed and random effects, let's get back to the three sub-types of panel regression analysis:

1) **independently pooled panels**  - assumes that there are no individual effects that are independent of time and also no effect of time on all the individuals. In other words, the independent variables are not correlated with any error term. This is essentially an ordinary least squares linear regression. In our setting, it would mean treating each year-state observation as unrelated to the others, which probably does not make sense, since we expect observations within a state to be related to one another. This type of panel regression makes the most assumptions and is therefore typically not used for panel data. In this case the model formulation is:
$$Y_{it}=\beta_{0}+\beta_{1}x_{1it}+...+\beta_{K}X_{Kit} + e_{it}$$
where the intercept $\beta_{0it}=\beta_0$for all $i,t$ and slope $\beta_{kit}=\beta_k$ for all $i,t$.

2) **fixed effects** - assumes that there are unknown or unobserved unique aspects about the individuals or heterogeneity among individuals $a_i$ that are not explained by the independent variables but influence the outcome variable of interest. They do not vary with time or in other words are fixed over time but may be **correlated** with independent variables $X_{it}$.  
 
In this case the intercept can be different for each individual $\beta_{0i}$, but the other coefficients are assumed to be the same across all the individuals.
 
These individual $a_i$ effects can be correlated with the independent variables $X$. This model can be expressed as
$$Y_{it}=\beta_{0}+\beta_{1}X_{1it}+...\beta_{K}X_{Kit}+ a_i +e_{it}$$
or alternatively the individual effects can be absorbed into an individual-specific intercept term $\beta_{0i}$:
$$Y_{it}=\beta_{0i}+\beta_{1}X_{1it}+...\beta_{k}X_{kit} +e_{it}$$
This type of panel regression makes the fewest assumptions.


3) **random effects** - assumes that there are unknown or unobserved unique qualities about the individuals that influence the outcome variable of interest that are **not correlated** with the independent variables. Thus, the random effects model actually makes **more assumptions** than the fixed effect model. We will not consider this model here.

There is quite a lot of discussion in econometrics around which model is appropriate to use and when, but we are not going to dwell on this topic here, as our main focus in this case study is on multicollinearity. However, here are some additional references on these topics for further reading:
See [here](https://www.bauer.uh.edu/rsusmel/phd/ec1-15.pdf) and [here](https://sites.google.com/site/econometricsacademy/econometrics-models/panel-data-models) and [here](https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html) for more information about these different models.


## **Donohue, et al.**
***

OK! We are now ready to start analyzing our data! 

In our case, we will be performing a fixed effect panel regression analysis, as we do in fact think that some of the unobserved qualities about the different states may be correlated with some of our independent variables. For example, the level of economic opportunity might be an unobserved feature about the states that influences violent crime rate and would be possibly correlated with poverty rate and unemployment. There are statistical tests for evaluating which model is the most appropriate, which are implemented in the `plm` package that we will be using.


To perform our analysis we will be using the `plm` package. This stands for Panel Linear Model.

We need to use a special type of data to use this package, called a `pdata.frame` which is short for panel data frame. This allows us to specify that we are using panel data and what the panel structure looks like, i.e., how the different observations are meant to be grouped together, both by state and by time. 

We need to indicate which variable should be used to identify the individuals in our panel, and what variable should be used to identify the time periods in our panel. In our case the `STATE` variable identifies the individuals and the `YEAR` variable identifies the time periods.

We can specify this structure using the `pdata.frame()` function of the `plm` package, by using the `index` argument, where the individual variable is specified first followed by the time variable, like so: `index=c("Individual_Variable_NAME", "Time_Period_Variable_NAME")`.

```{r}
d_panel_DONOHUE <- pdata.frame(DONOHUE_DF, index = c("STATE", "YEAR"))

class(d_panel_DONOHUE)

slice_head(d_panel_DONOHUE, n = 3)
```

Indeed we have now created a pdata.frame object and we can see that the row names show the individual states and time period years.

OK, now we are ready to run our panel linear model on our panel data frame. 

To do so we will use the `plm()` function and we will specify the formula for our model, where the dependent variable `Viol_crime_rate_1k_log` will be on the left of our `~` sign and all of the independent variables will be listed on the right with `+` signs in between each.

As discussed above, there are different types of panel data analysis, and which type of model is fit is controlled by the `effect` and `model` arguments to `plm()`. So we need to specify what type of `effect` we would like to model and what type of `model` we would like to use.

There are three main options for the `effect` argument:
1) individual  - model for the effect of individual identity
2) time - model for the effect of time
3) twoways - meaning modeling for the effect of both individual identity and time  

There are four main options for the `model` argument:  
1) pooling - standard pooled ordinary least squares regression model  
2) within - fixed effects model (variation between individuals is ignored, model compares individuals to themselves at different periods of time)  
3) between - fixed effects model (variation within individuals from one time point to another is ignored, model compares different individuals at each point of time)  
4) random - random effects (each state has a different intercept but force it to follow a normal distribution - requires more assumptions)

Typically it is best to think about what you are trying to evaluate with your data in trying to choose how to model your data. However, there are also some tests that can help to assess this which we will briefly cover.

We are interested in how violence in each state varied over time, thus we are interested in within `STATE`variation, so we will perform our PLM analysis with the `model = within` argument to perform this particular type of fixed effects model. 

We also speculate that there is an effect of individual `STATE` identity and time on violent crime rate. In other words, we expect some states to have high rates of crime, and others to have low rates of crime. We also expect crime to change over time. This means we want to use the `effect = "twoways"` argument to `plm()`.

Here is how we would implement this for the Donohue data:

```{r}
DONOHUE_OUTPUT <- plm(Viol_crime_rate_1k_log ~
                      RTC_LAW +
                      White_Male_15_to_19_years +
                      White_Male_20_to_39_years +
                      Black_Male_15_to_19_years +
                      Black_Male_20_to_39_years +
                      Other_Male_15_to_19_years +
                      Other_Male_20_to_39_years +
                      Unemployment_rate +
                      Poverty_rate +
                      Population_log +
                      police_per_100k_lag,
                      effect = "twoways",
                      model = "within",
                      data = d_panel_DONOHUE
                      )
```

To see the results we can use the base `summary()` function. We can view this output in tidy format using the `tidy()` function of the `broom` package.

We will add an `analysis` variable as a label for plots.

```{r}
summary(DONOHUE_OUTPUT)

DONOHUE_OUTPUT_TIDY <- tidy(DONOHUE_OUTPUT, conf.int = 0.95)

DONOHUE_OUTPUT_TIDY

DONOHUE_OUTPUT_TIDY$Analysis <- "Analysis 1"
```

As discussed above, we are choosing to use a fixed effects model because we believe it is most appropriate for our data. However, there are statistical tests implemented in the `plm` package that would allow us to test whether one of the other models would be as appropriate for our data. For example, we could perform a test to determine if we could have simply used a pooled model, i.e., the test evaluates whether the coefficients (including the intercepts) are equal across individuals. This test is implemented in the `pooltest()` function of the `plm` package and performs a comparison between the pooled model to the fixed effect within model. To test if using a random effect model would be more appropriate compared to the fixed effect model, one could use the [Hausman test](https://en.wikipedia.org/wiki/Durbin%E2%80%93Wu%E2%80%93Hausman_test){target="_blank"} (also called the Durbin-Wu-Hausman test). This test is implemented using the `phtest()` function of the `plm` package.

Finally, we could also perform a test to evaluate if there is indeed an individual (state) effect and a time effect in our model using the `plmtest()` function of the `plm` package. 

For more information on these tests and this package, see [here](https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html){target="_blank"}  and [here](http://www.princeton.edu/~otorres/Panel101R.pdf){target="_blank"}.


## **Mustard and Lott**
***

OK, now we will do the same for the Mustard and Lott analysis. In this case we would have a very large formula to write. So instead, we can use the `as.formula()` function of the `stats` package and the base `paste()` function to combine all of our explanatory variables into one formula without making a mistake. First we will create an object where we select only the explanatory variables. 

```{r}
LOTT_variables <- LOTT_DF %>%
  dplyr::select(
    RTC_LAW,
    contains(c("White", "Black", "Other")),
    Unemployment_rate,
    Poverty_rate,
    Population_log,
    police_per_100k_lag
  ) %>%
  colnames()


LOTT_fmla <- as.formula(paste(
  "Viol_crime_rate_1k_log ~",
  paste(LOTT_variables, collapse = " + ")
))

LOTT_fmla
```

That is quite the formula!

OK, now again we will make a panel data frame and we will fit a fixed effect two-way model for time and individuals (`STATE`) with this data as well.

```{r}

d_panel_LOTT <- pdata.frame(LOTT_DF, index = c("STATE", "YEAR"))

LOTT_OUTPUT <- plm(LOTT_fmla,
  model = "within",
  effect = "twoways",
  data = d_panel_LOTT
)

summary(LOTT_OUTPUT)

LOTT_OUTPUT_TIDY <- tidy(LOTT_OUTPUT, conf.int = 0.95)

LOTT_OUTPUT_TIDY$Analysis <- "Analysis 2"
```


## **RTC coefficient comparison**
***

Now let's make a plot to compare the coefficient estimate for the Right-to-carry law adoption variable in each model.

First we will combine model fit information for this coefficient for each model.

```{r}
comparing_analyses <- DONOHUE_OUTPUT_TIDY %>%
  bind_rows(LOTT_OUTPUT_TIDY) %>%
  filter(term == "RTC_LAWTRUE")

comparing_analyses
```

We can see that for the first analysis (similar to the [Donohue et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} study) the coefficient estimate for the presence of a permissive right-to-carry law is positive, while for the second analysis (similar to the [Mustard and Lott](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"} study) the coefficient estimate is negative. Thus in the first analysis we could conclude that the effect of adopting permissive right-to-carry laws  may be associated with increases in violent crime (although this was not a significant result (in contrast with the real [Donohue et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"} study )); while in the other analysis we could conclude that the laws may be associated with decreases in violent crime.

Let's make a plot of this finding. We will show error bars for the coefficient estimates for both analyses using the `geom_errorbar()` function of the `ggplot2` package. This requires specifying the minimum and maximum for our error bar, which in our case will be the low and high values of our confidence intervals for the coefficient estimates. We will also add a horizontal line at y = 0 using the `geom_hline()` function of the `ggplot2` package. 

Finally we will add arrows to emphasize the difference in the direction of the findings using the `geom_segment()` function of the `ggplot2` package. Using the `arrow()` function, we can specify details about the arrow we would like to add.

```{r}
comparing_analyses_plot <- ggplot(comparing_analyses) +
  geom_point(aes(x = Analysis, y = estimate)) +
  geom_errorbar(aes(x = Analysis, ymin = conf.low, ymax = conf.high), width = 0.25) +
  geom_hline(yintercept = 0, color = "red") +
  scale_y_continuous(
    breaks = seq(-0.2, 0.2, by = 0.05),
    labels = seq(-0.2, 0.2, by = 0.05),
    limits = c(-0.2, 0.2)
  ) +
  geom_segment(aes(x = 1, y = 0.125, xend = 1, yend = 0.175),
    arrow = arrow(angle = 45, ends = "last", type = "open"),
    size = 2,
    color = "green",
    lineend = "butt",
    linejoin = "mitre"
  ) +
  geom_segment(aes(x = 2, y = -0.125, xend = 2, yend = -0.175),
    arrow = arrow(angle = 45, ends = "last", type = "open"),
    size = 2,
    color = "red",
    lineend = "butt",
    linejoin = "mitre"
  ) +
  theme_minimal() +
  theme(
    axis.title.x = element_blank(),
    axis.text = element_text(size = 8, color = "black")
  ) +
  labs(
    title = "Effect estimate on ln(violent crimes per 100,000 people)",
    y = "  Effect estimate (95% CI)"
  )

comparing_analyses_plot
```

We can see that the confidence interval from analysis 1 is mostly covering positive values, while the entire confidence interval is negative for analysis 2.

# **Multicollinearity analysis**
***

How did the above happen?

The analysis data frames are very similar yet yielded very different results. 

Recall that the only difference between the two models is the number of demographic variables included as covariates. The number of rows or observations is the same, as are the outcome and the other covariates included in the model. We can use the `all_equal()` function of the `dplyr` package to compare the number of columns of our Donohue-like data and our Mustard and Lott-like data.

```{r}
all_equal(
  target = DONOHUE_DF,
  current = LOTT_DF,
  ignore_col_order = TRUE,
  ignore_row_order = TRUE
)
```

Using the base `dim()` function we can also look at the number of rows for each and see that the number of observations is the same for both datasets.

```{r}
dim(DONOHUE_DF)[1]
dim(LOTT_DF)[1]
```

The only difference between the two data frames rests in how the demographic variables were parameterized.

```{r}
DONOHUE_DF %>%
  dplyr::select(contains("years")) %>%
  colnames()

LOTT_DF %>%
  dplyr::select(contains("years")) %>%
  colnames()
```

Clearly, this had an effect on the results of the analysis. 

Let's explore how this occurred. 

When seemingly independent variables are highly related to one another, the relationships estimated in an analysis may be distorted. 

In regression analysis, this distortion is often a by-product of a violation of the independence assumption. This distortion, if large enough, can impact statistical inference. 

The phenomenon called multicollinearity occurs when independent variables are highly related to one another.

There are several ways we can diagnose multicollinearity.

### **Correlation**
***

One way we can evaluate the relationships between variables is by examining the correlation between variable pairs.

<style>
div.blue { background-color:#e6f0ff; border-radius: 5px; padding: 20px;}
</style>
<div class = "blue">

It is important to note that multicollinearity and correlation are not the same concept. [Correlation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5079093/) can be thought of as the strength of a linear relationship between variables in general. On the other hand, we usually use the term collinearity in the context of a multiple regression model, where two independent variables are collinear if they have a linear relationship or association. Multicollinearity can be thought of as collinearity among multiple (3+) regressors (independent variables) in a regression analysis, which can occur when regressors are highly correlated.

</div>


According to [Wikipedia](https://en.wikipedia.org/wiki/Multicollinearity){target="_blank"}:

> multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. 

Thus collinearity describes linear prediction or association between variables. Often those variables will be highly correlated.

The issue with this in linear regression, is that linear regression aims to determine how a one unit change in a regressor influences a one unit change in the dependent variable. In fact, this is what the coefficient estimates aim to tell us for each regressor.

However, if our regressors are also linearly related or correlated, then it becomes difficult to determine the effect of each regressor on the dependent variable, variable independent of the other regressors, and multicollinearity can cause instability in our [coefficient estimates](https://statisticsbyjim.com/glossary/regression-coefficient/){target="_blank"}, making them unreliable. In the presence of multicollinearity, coefficient estimates may be inflated, deflated, or their signs may change. 

For example, say both waist and hip circumference are included as predictors of BMI in a multiple regression model. As in any multiple regression model, the coefficient of waist circumference represents the average change in BMI for a one unit change in waist circumference, holding the other predictors in the model constant. Since it is not very meaningful to consider variation in waist circumference while holding hip circumference constant (as waist and hip circumference are indeed related to one another), our estimate of this coefficient may be uncertain. 


If you want to read further on this topic, [here](https://blog.clairvoyantsoft.com/correlation-and-collinearity-how-they-can-make-or-break-a-model-9135fbe6936a){target="_blank"} and [here](https://medium.com/swlh/multicollinearity-and-variance-inflation-factor-bc74af36b1c9){target="_blank"} are a couple of interesting discussions.

In the next sections, we will describe ways to detect multicollinearity in our covariates both using visual displays of data and using computational techniques.

#### Scatter plots

One of the ways to diagnose multicollinearity in a regression model is to first see if there are regressors that are highly correlated. If so, this suggests that we should investigate further to see if these variables are in fact linearly predicting one another.

One way to look at correlation across pairs of variables is to use the `ggpairs()` function of the `GGally` package.

```{r}
colnames(DONOHUE_DF)

DONOHUE_DF %>%
  dplyr::select(
    RTC_LAW,
    Viol_crime_rate_1k_log,
    Unemployment_rate,
    Poverty_rate,
    Population_log
  ) %>%
  ggpairs(.,
    columns = c(2:5),
    lower = list(continuous = wrap("smooth_loess",
      color = "red",
      alpha = 0.5,
      size = 0.1
    ))
  )
```
We can see that for the non-demographic variables, there is very little correlation between the pairs of variables. Only the unemployment rate and the poverty rate show relatively strong correlation, as one might expect. 


#### Heatmaps

Another way to look at correlation if we have many variables is to show the strength of correlation between pairs of variables using a heatmap, where the intensity of the color indicates the strength of the correlation between two variables.

Let's do this now for the demographic variables for each analysis.

The `ggcorrplot()` function of the `ggpcorrplot` package is one way to create such a heatmap.

As input to the plotting function, we first need to calculate the correlation values, which we will do using the `cor()` function of the `stats` package.

To label our legend with the Greek letter $\rho$, we will use the base `expression()` function, which will convert the written form of `"rho"` to the Greek letter.

```{r}
cor_DONOHUE_dem <- cor(DONOHUE_DF %>% dplyr::select(contains("_years")))

corr_mat_DONOHUE <- ggcorrplot(cor_DONOHUE_dem,
  tl.cex = 6,
  hc.order = TRUE,
  colors = c(
    "red",
    "white",
    "red"
  ),
  outline.color = "transparent",
  title = "Correlation Matrix, Analysis 1",
  legend.title = expression(rho)
)


corr_mat_DONOHUE

cor_LOTT_dem <- cor(LOTT_DF %>% dplyr::select(contains("_years")))

corr_mat_LOTT <- ggcorrplot(cor_LOTT_dem,
  tl.cex = 6,
  hc.order = TRUE,
  colors = c(
    "red",
    "white",
    "red"
  ),
  outline.color = "transparent",
  title = "Correlation Matrix, Analysis 2",
  legend.title = expression(rho)
)

corr_mat_LOTT
```
We can see that many of the demographic variables are highly correlated with one another, either positively or negatively. In this case, the sign does not matter, in terms of the effect the collinearity could have on our modeling results. 


The presence of correlation between variables suggests that we might have multicollinearity. However it does not necessarily mean that we do. So how can we assess this?


### **Coefficient estimate instability**
***

One way to look at the possible influence of multicollinearity is to look at the stability of the coefficient estimates under perturbations of the data.

We will focus on the `RTC_LAW` variable coefficient estimate, as this is of particular interest in our case.

To do so we will perform a process called [resampling](https://en.wikipedia.org/wiki/Resampling_(statistics)){target="_blank"}. This involves performing multiple iterations of our analysis, but with only a subset or sub-sample of our original data.  In our case we will remove **one** observation and see if that changes our coefficient estimate results. 


To do this we will use some functions in the `rsample` package which is very useful for splitting data in various ways.

We will use the `loo_cv()` function which stands for **leave one out** [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"}. This will allow us to split our data into every possible subset where a unique observation is left out of the data.

This function will however only prepare the data to be split.

To get the remaining data after the removal of the observation that is left out we will use a function called `training()`. These function names arise from the fact that these functions are often used for in machine learning applications where the data is split between a larger training set and a smaller testing set. Thus we want the larger $n-1$ subset, as opposed to the single value that is removed, (which we could get with the `testing()` function).

***
<details> <summary> Click here to see an example of how this works. </summary>

First we will make a toy dataset that is very simple called `test` using the `tibble()` function of the `tibble` package:
```{r}
test <- tibble::tibble(x = c(1, 2, 3))
test
```

Now we will use the `loo_cv()` to create leave one out splits:
```{r}
test_samples <- test %>% rsample::loo_cv()
test_samples
```

We can take a look at a single split of the data using the `pull()` function:

```{r}
pull(test_samples, splits)
```

Here you can see that 2 values are intended for the training set (also called Analysis set), 1 value is intended for the testing set (also called Assessment set), and 3 values were present initially.

Now we will use the `training()` function to get the data without the observation that is set aside. Here is the data for the first subset:
```{r}
rsample::training(pull(test_samples, splits)[[1]])
```

Now we will use the `map()` function of `purrr` to get all possible `training` subset of the data.
```{r}

test_subsets <- map(pull(test_samples, splits), training)
test_subsets
```

We can see that there are 3 possible subsets that leave one value out. All 3 possible subsets are created using this method. This method will always create the same number of subsets as there are unique values or rows in the data.

</details>
***

Now we will use this method with the data from our Donohue-like analysis, and since this data has `r dim(d_panel_DONOHUE)[1]` rows, `r dim(d_panel_DONOHUE)[1]` subsets will be created that leave out one row. The idea is to fit our panel regression model on each subset of the data, and then examine how the coefficient estimates from each of these model fits vary as the sample changes slightly. With collinear predictors, we expect that our coefficient estimates may be unstable and subject to change under even small perturbations of the data.

First we will create the splits using the `loo_cv()` function. This will ultimately make all unique possible subsets leaving one sample out - however the choice of which sample is left for which iteration is a random process (maybe the 12th is left out first then the 3rd etc.).  Thus, we use the `set.seed()` function to ensure reproducibility of the results when we take a look at the first subset, so that it will be the same for everyone who looks at this case study. You can change the seed to play around with this.

```{r}
set.seed(124)
DONOHUE_splits <- d_panel_DONOHUE %>% loo_cv()
DONOHUE_splits
```

Now we will use the `training()` function to select the remaining data without the value that was removed for each split:

```{r}
# To get all the data subsets
DONOHUE_subsets <- map(pull(DONOHUE_splits, splits), training)

glimpse(DONOHUE_subsets[[1]])
length(DONOHUE_subsets)
```

As expected the first subset has 1394 rows and there are 1395 subsets.

Let's see what observation was left out in the first subset:

```{r}
d_panel_DONOHUE %>%
  filter(!rownames(d_panel_DONOHUE) %in% rownames(DONOHUE_subsets[[1]]))

# Another way to check is to use:
DONOHUE_removed <- map(pull(DONOHUE_splits, splits), testing)

DONOHUE_removed[[1]]
```

It looks like the Texas data from 1988 was removed from the first split. Again, if you try different seeds you will see a different sample removed from the first split.

OK, so now let's fit our panel regression on the first subset of data like we did previously. Note that this causes our data to be an **unbalanced** panel. This does not require any adjustment to the code to model the data, but you will notice that the output will now say "unbalanced".

```{r}
subset_1_result <- plm(Viol_crime_rate_1k_log ~
RTC_LAW +
  White_Male_15_to_19_years +
  White_Male_20_to_39_years +
  Black_Male_15_to_19_years +
  Black_Male_20_to_39_years +
  Other_Male_15_to_19_years +
  Other_Male_20_to_39_years +
  Unemployment_rate +
  Poverty_rate +
  Population_log +
  police_per_100k_lag,
data = DONOHUE_subsets[[1]],
index = c("STATE", "YEAR"),
model = "within",
effect = "twoways"
)
summary(subset_1_result)
```

Indeed, we can see that now we have an unbalanced panel with N = 1394 observations instead of 1395, as expected.

Now that we have our subsets, we want to write a function to fit a panel regression using `plm()`on each subset. See [this case study](https://opencasestudies.github.io/ocs-bloomberg-vaping-case-study/){target="_blank"} for more information on writing functions.

```{r}
fit_nls_on_bootstrap_DONOHUE <- function(subset) {
  plm(Viol_crime_rate_1k_log ~
  RTC_LAW +
    White_Male_15_to_19_years +
    White_Male_20_to_39_years +
    Black_Male_15_to_19_years +
    Black_Male_20_to_39_years +
    Other_Male_15_to_19_years +
    Other_Male_20_to_39_years +
    Unemployment_rate +
    Poverty_rate +
    Population_log +
    police_per_100k_lag,
  data = data.frame(subset),
  index = c("STATE", "YEAR"),
  model = "within",
  effect = "twoways"
  )
}
```

Now we can apply this function to each of our subsets simultaneously using the `map()` function of the `purrr` package.  

```{r, eval = FALSE}

subsets_models_DONOHUE <- map(DONOHUE_subsets, fit_nls_on_bootstrap_DONOHUE)

subsets_models_DONOHUE <- subsets_models_DONOHUE %>%
  map(tidy)
```

We will save this data in our `data` directory within a subdirectory called `wrangled` like so:

```{r, eval = FALSE}
save(subsets_models_DONOHUE,
  file = here::here("data", "wrangled", "DONOHUE_simulations.rda")
)
```

If you have trouble with the previous step you can load the data like so:

```{r}
load(here::here("data", "wrangled", "DONOHUE_simulations.rda"))
```


Great! Now we want to do the same thing for the Mustard and Lott data.

```{r}
set.seed(124)
LOTT_splits <- d_panel_LOTT %>% loo_cv()

# To get all the data subsets:
LOTT_subsets <- map(pull(LOTT_splits, splits), training)
```


We need to create a different function to fit the data to account for the larger number of demographic variables. We will use the formula that we made previously.

```{r}
fit_nls_on_bootstrap_LOTT <- function(split) {
  plm(LOTT_fmla,
    data = data.frame(split),
    index = c("STATE", "YEAR"),
    model = "within",
    effect = "twoways"
  )
}
```


```{r, eval = FALSE}
subsets_models_LOTT <- map(LOTT_subsets, fit_nls_on_bootstrap_LOTT)

subsets_models_LOTT <- subsets_models_LOTT %>%
  map(tidy)
```

Again we will save this in the `wrangled` subdirectory of our `data` directory:

```{r, eval = FALSE}
save(subsets_models_LOTT,
  file = here::here("data", "wrangled", "LOTT_simulations.rda")
)
```

If you have trouble with the previous steps you can load this data like so:

```{r}
load(here::here("data", "wrangled", "LOTT_simulations.rda"))
```


Now we will combine the output so that we can make a plot to visualize the results that we obtained, i.e., to look at how the variation in our coefficient estimates across data subsets varies between the Donohue and the Lott and Mustard models. First let's name each subset that we created.

```{r}
names(subsets_models_DONOHUE) <- paste0("DONOHUE_", seq_len(length(subsets_models_DONOHUE)))

names(subsets_models_LOTT) <-
  paste0("LOTT_", 1:length(subsets_models_LOTT))
```

Now we can combine the tibbles within the list of tibbles for the `subsets_models_DONOHUE` and `subsets_models_LOTT` data.

To do this we will use the `bind_rows()` function of the `dplyr` package with the `.id = "ID"` argument, which will create a new variable called `ID` that will list the name of the tibble the data came from.

Then we will combine the data from both the Donohue and Lott simulations.

```{r}

simulations_DONOHUE <- subsets_models_DONOHUE %>%
  bind_rows(.id = "ID") %>%
  mutate(Analysis = "Analysis 1")

simulations_LOTT <- subsets_models_LOTT %>%
  bind_rows(.id = "ID") %>%
  mutate(Analysis = "Analysis 2")

simulations <- bind_rows(
  simulations_DONOHUE,
  simulations_LOTT
)

head(simulations)
tail(simulations)
```

Now we will make a set of parallel boxplots using the `geom_boxplot()` function of the coefficient estimates of the `RTC_LAWTRUE` variable for each simulation.

Since there are many variables in both analyses, we will use the `facet_grid()` function of the `ggplot2` package to allow us to separate the data for each analysis into subplots. The argument `scale = "free_x"` and `drop = TRUE` allow us to only include the variables that were present in Analysis 1, as opposed to empty spots for the variables that were in Analysis 2 but not in Analysis 1. The `space = "free"` argument removes the extra space from the dropped variables. 

#### {.question_block}
<b><u> Question Opportunity </u></b>

What happens if you don't use the `drop = TRUE` argument or the `space = "free"` argument? 

####

```{r}
simulation_plot <- simulations %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_boxplot() +
  facet_grid(. ~ Analysis, scale = "free_x", space = "free", drop = TRUE) +
  labs(
    title = "Coefficient estimates",
    subtitle = "Estimates across leave-one-out analyses",
    x = "Term",
    y = "Coefficient",
    caption = "Results from simulations"
  ) +
  theme_linedraw() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_text(angle = 70, hjust = 1),
    strip.text.x = element_text(size = 14, face = "bold")
  )

simulation_plot
```

Here, we can start to see that there is a bit more variability in the coefficient estimates from the leave-one-out results for Analysis 2.

For our display purposes, we would like to order the covariates so that they are displayed similarly across the two panels. This will allow us to better observe how the coefficients of the same covariate behave in the different analyses. We use the `mutate` function to convert the `term` variable to a factor, where we assign the non-demographic variables to be the first levels, with the sorted (in alphabetical order using the base `sort()` function) demographic variables coming afterwards.

You might notice that the names of the demographic variable values in the `term` variable of the `simulations` data all have the word "years". We can use the `str_subset()` function of the `stringr` package to select just the demographic variables based on the word "years. In contrast, we can use the `negate = TRUE` argument to select all the other variables, the non-demographic variable values. We can use the `unique()` base function to grab just the unique values of the `term` variable.

```{r}
str_subset(unique(pull(simulations, term)), "years", negate = TRUE)
```

```{r}
str_subset(unique(pull(simulations, term)), "years")
```

Now we can create the order of the values of the `term` variable using the `factor` argument and the `levels` argument.

```{r}

simulations <- simulations %>%
  mutate(term = factor(term,
    levels = c(
      str_subset(unique(pull(simulations, term)), "years", negate = TRUE),
      sort(str_subset(unique(pull(simulations, term)), "years"))
    )
  ))

levels(pull(simulations, term))
```

Looks good!


Now we just need to run the same code again to create the plot, but now the order of the x axis values will be different. 

```{r}

simulation_plot <- simulations %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_boxplot() +
  facet_grid(. ~ Analysis, scale = "free_x", space = "free", drop = TRUE) +
  labs(
    title = "Coefficient estimates",
    subtitle = "Estimates across leave-one-out analyses",
    x = "Term",
    y = "Coefficient",
    caption = "Results from simulations"
  ) +
  theme_linedraw() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_text(angle = 70, hjust = 1),
    strip.text.x = element_text(size = 14, face = "bold")
  )

simulation_plot
```


We can see that the range of coefficient estimates when only one observation is removed is much larger in Analysis 2 for nearly all variables, but particularly for many of the additional demographic variables.

Let's make a plot showing the summary of the overall coefficient instability.

To do this we will calculate the standard deviation of coefficient estimates for each variable across all of the simulations. Thus  we will group  by  the `Analysis` and  the `term` variables now that our data is in long format. We will use the `sd()` function of the `stats` package to calculate the standard deviation.

```{r}
coeff_sd <- simulations %>%
  group_by(Analysis, term) %>%
  summarize("SD" = sd(estimate))
```

First, we will display an interactive table of these standard deviations. Try searching for "RTC", and you can compare the standard deviations of the coefficients for the `RTC_LAWTRUE` variable across the two analyses. To take a better look at our data we will use the `datatable()` function of the `DT` package which will create an interactive searchable table.

```{r}
DT::datatable(coeff_sd)
```

Now we will make a plot of this data, including SDs from all coefficients in each model.

```{r}
simulation_plot <- coeff_sd %>%
  ggplot(aes(x = Analysis, y = SD)) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  labs(
    title = "Coefficient variability",
    subtitle = "SDs of coefficient estimates from leave-one-out analysis",
    x = "Term",
    y = "Coefficient Estimate \n Standard Deviations",
    caption = "Results from simulations"
  ) +
  theme_minimal() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_text(size = 8, color = "black"),
    axis.text.y = element_text(color = "black")
  )
simulation_plot
```

Here we can clearly see that overall the coefficient estimates are much less stable in Analysis 2. This is an indication that we may have multicollinearity in our data.


### **VIF**
***

Another way of evaluating the presence and severity of multicollinearity is to calculate the [variance inflation factor (VIF)](https://en.wikipedia.org/wiki/Variance_inflation_factor){target="_blank"} . 

According to [Wikipedia](https://en.wikipedia.org/wiki/Variance_inflation_factor){target="_blank"}:

>It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. 

> The variance inflation factor (VIF) is the **quotient of the variance in a model with multiple terms by the variance of a model with one term alone**.

VIF values can be calculated for each explanatory variable in a model by performing the following calculation:

1) Run another ordinary least squares (OLS) linear regression with one of the explanatory variables of your model of interest ($X_i$) as the dependent variable and keep the remaining explanatory variables as explanatory variables.

So generically speaking say this is our model:

$$Y = β_0 + β_1X_1 + β_2X_2 + β_3X_3 + e $$

We have three explanatory variables ($X_1$, $X_2$, and $X_3$).

If we want to calculate the VIF value for $X_1$ we would need to perform another OLS model, where $X_1$ is now the dependent variable explained by the other explanatory variables.

$$X_1 = β_0 +  β_2X_2 + β_3X_3 + e$$ 

The [$R^2$ coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination){target="_blank"} (also called R squared value) from this regression is then used to calculate the VIF as follows:

$$\frac{1}{1-R^{2}}$$ 

The $R^2$ value is in this case the proportion of variance in $X_1$ explained by the other variables ($X_2$ and $X_3$).


VIF values are typically calculated for each explanatory variable when evaluating multicollinearity of a model.

The calculation for a single variable is:
$$VIF_i = \frac{1}{1-R_i^{2}}$$
Where $i$ is the index of each explanatory variable.


Recall that according to [Wikipedia](https://en.wikipedia.org/wiki/Variance_inflation_factor){target="_blank"}:

> The variance inflation factor (VIF) is the **quotient of the variance in a model with multiple terms by the variance of a model with one term alone**.

The $R^2$ value ranges from 0 to 1, and if the variation of one variable is highly explained by the other variables, the $R^2$ will approach 1. Thus the denominator in the VIF calculation $1-R_i^{2}$ (which is sometimes referred to as tolerance) will be smaller and the VIF value will be larger.

Thus, **higher VIF vales** indicate **more severe multicollinearity**. Typically a threshold of a tolerance of less than 0.10 and/or a VIF of 10 or above is used as a rule of thumb to determine if the presence of multicollinearity might be problematic.

Please see this [article](https://link.springer.com/content/pdf/10.1007/s11135-006-9018-6.pdf){target="_blank"} for a thorough explanation of how to interpret VIF values and how to decide what to do if your model has high multicollinearity.

So how do we calculate VIF values in R?

We could do this manually creating many linear regressions, but that would obviously be time consuming. Luckily, the `car` package has a function called `vif()` that will calculate VIF values. However, there is one wrinkle: the `vif()` function is not compatible with the output of the `plm` function. There is however a workaround that allows us to fit a similar model using the standard `lm()` function on data where we have removed the within-individual means. While this won't give us exactly the same results in terms of the standard errors of our estimates, it will give us some idea of the VIF values for the covariates in our model. We are following the steps outlined [here](http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html){target="_blank"}, a really nice summary of panel data modeling in R.

Once we have calculated our VIF values, we will create nicer looking output of the data using the `as_tibble()` function of the `tibble` package to create a tibble and add the variable names as another column.


Recall that we previously created the `DONOHUE_OUTPUT` object like so:
```{r}
DONOHUE_OUTPUT <- plm(Viol_crime_rate_1k_log ~
RTC_LAW +
  White_Male_15_to_19_years +
  White_Male_20_to_39_years +
  Black_Male_15_to_19_years +
  Black_Male_20_to_39_years +
  Other_Male_15_to_19_years +
  Other_Male_20_to_39_years +
  Unemployment_rate +
  Poverty_rate +
  Population_log +
  police_per_100k_lag,
effect = "twoways",
model = "within",
data = d_panel_DONOHUE
)
```

What we have modeled is how a state's violent crime rate has changed with modifications of RTC law status and over time, relative to itself, and how this compares to similar changes in violent crime of another state relative to itself. 

The coefficients from this model then are, in an oversimplified explanation, centered across all states and across all time points. This is called "demeaned" data.

We will now use this model output to create a data frame of demeaned data (where the effect of time is accounted for as are the within-individuals effects, in this case the different states).  We will make a model matrix of this data by using the `model.matrix()` function of the `stats` package and then we will create a data frame from this using the base `as.data.frame()` function.

```{r}
lm_DONOHUE_data <- as.data.frame(model.matrix(DONOHUE_OUTPUT))
```

```{r}
glimpse(lm_DONOHUE_data)
```

Notice that this does not contain any outcome data. We will add this by taking the outcome of the `Within()` function of the `plm` package to get the violent crime data after accounting for the state specific effects. According to the documentation for this package:

>  Within returns a vector containing the values in deviation from the individual means (if effect = "individual", from time means if effect = "time"), the so called demeaned data.

Also recall that the `d_panel_DONOHUE` data is just the Donohue data in panel format. 

```{r}
lm_DONOHUE_data %<>%
  mutate(Viol_crime_rate_1k_log = plm::Within(pull(
    d_panel_DONOHUE, Viol_crime_rate_1k_log
  )), effect = "twoways")
```


Now we will fit the demeaned data to the model:
```{r}
lm_DONOHUE <- lm(Viol_crime_rate_1k_log ~
RTC_LAWTRUE +
  White_Male_15_to_19_years +
  White_Male_20_to_39_years +
  Black_Male_15_to_19_years +
  Black_Male_20_to_39_years +
  Other_Male_15_to_19_years +
  Other_Male_20_to_39_years +
  Unemployment_rate +
  Poverty_rate +
  Population_log +
  police_per_100k_lag,
data = lm_DONOHUE_data
)
```

Now we are ready to use the `vif()` function of the `car` package to calculate the VIF values:

```{r}

vif_DONOHUE <- vif(lm_DONOHUE)

vif_DONOHUE
```

Now we will use the `as_tibble()` function of the `tibble` package to nicely put this together. 

```{r}
vif_DONOHUE <- vif_DONOHUE %>%
  as_tibble() %>%
  cbind(., names(vif_DONOHUE)) %>%
  as_tibble()

colnames(vif_DONOHUE) <- c("VIF", "Variable")

vif_DONOHUE
```

Now we will do the same for the Lott and Mustard data.

We will need to use the `rename()` function (you may recall this is part of the `dplyr` package) to replace `RTC_LAWTRUE` with `RTC_LAW`, as the variable name in the model output is appended by `TRUE` because it was a logical variable. Because the model formula for the Lott analysis is so complex, it is easier to change this variable name in our new data frame, rather than rewrite the formula for this data. 

Recall that we already saved the formula for this data:
```{r}
LOTT_fmla
```


```{r}
lm_LOTT_data <- as.data.frame(model.matrix(LOTT_OUTPUT))
lm_LOTT_data %<>%
  mutate(Viol_crime_rate_1k_log = plm::Within(pull(
    d_panel_LOTT, Viol_crime_rate_1k_log
  ), effect = "twoways")) %>%
  rename(RTC_LAW = RTC_LAWTRUE)

lm_LOTT <- lm(LOTT_fmla,
  data = lm_LOTT_data
)

vif_LOTT <- vif(lm_LOTT)

vif_LOTT

vif_LOTT <- vif_LOTT %>%
  as_tibble() %>%
  cbind(., names(vif_LOTT)) %>%
  as_tibble()

colnames(vif_LOTT) <- c("VIF", "Variable")
```

Now to have consistent variable names in the VIF data sets we will rename `RTC_LAW` back to `RTC_LAWTRUE` using the `str_replace()` function of the `stringr` package. This function replaces a pattern.
```{r}
vif_LOTT %>% mutate(Variable = str_replace(
  string = Variable,
  pattern = "RTC_LAW",
  replacement = "RTC_LAWTRUE"
))
vif_LOTT
```


```{r}

DT::datatable(vif_LOTT)
```

We can see that some of the VIF values are very high!

Now we will make a plot of the VIF values for both analyses. We will add text to a specific location on the plot using the `geom_text()` function. Typically a threshold of 10 is used to identify if the VIF are problematically high.

```{r}

vif_DONOHUE %<>%
  mutate(Analysis = "Analysis 1")
vif_LOTT %<>%
  mutate(Analysis = "Analysis 2")

vif_df <- bind_rows(
  vif_DONOHUE,
  vif_LOTT
)
datatable(vif_df)
```

You can also search the above table of results for "RTC" to see how the VIF values differ for the RTC variable between the two analyses. They are close to one another, although the value is slightly higher for Analysis 2.

Next, we will make a couple of plots to illustrate how the VIF values compare between the two models.

```{r}
vif_plot <- vif_df %>%
  ggplot(aes(x = Analysis, y = VIF)) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  geom_hline(yintercept = 10, color = "red") +
  geom_text(aes(.7, 18, label = "typical cutoff of 10")) +
  labs(title = "Variance inflation factors", y = "VIF") +
  theme_minimal() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_text(color = "black"),
    axis.text.y = element_text(color = "black")
  )

vif_plot
```

We can see that analysis 2 has variables with much higher multicollinearity.


Let's make this plot a little easier to see using the `coord_trans()` function of the `ggplot2` package with the `y ="log10"` argument. This does not change the values, but adjusts the way the y axis is displayed with diminishing distance between grid lines. 

```{r}

vif_plot <- vif_df %>%
  ggplot(aes(x = Analysis, y = VIF)) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  geom_hline(yintercept = 10, color = "red") +
  geom_text(aes(.75, 13, label = "typical cutoff of 10")) +
  coord_trans(y = "log10") +
  labs(title = "Variance inflation factors") +
  theme_minimal() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_text(color = "black"),
    axis.text.y = element_text(color = "black")
  )

vif_plot
```


In many cases it would be advisable to remove one or more of these variables and reassess the VIF values. There are also other options, such as [ridge regression](https://en.wikipedia.org/wiki/Tikhonov_regularization){target="_blank"}. However, both of these options need to be done with care as they can also introduce bias into the model.

In any case the presence of multicollinearity should encourage further investigation about the design of the model, as the results may not be reliable due to the increased level of instability of the coefficient estimates.

See this [article](https://link.springer.com/content/pdf/10.1007/s11135-006-9018-6.pdf){target="_blank"} for a detailed discussion about what to consider when your model has variables with high VIF values. 


# **Data Visualization**
*** 

Now lets make a plot that summarizes all of our findings.

We will use the `cowplot` package to put our plots together. 

We will use the `ggdraw()` function of this package. This allows you to add labels and other plot aspects on top of existing plots. Thus if we want to add a title element to our overall plot that we will add to our combined plot we can use `ggdraw()` to start and then the `draw_label()` function to add text.


```{r, fig.height=10, echo=FALSE, message=FALSE, warning=FALSE}
title_plots <- ggdraw() +
  draw_label(
    "Multicollinearity and its effects",
    fontface = "bold",
    size = 18,
    x = 0,
    hjust = -0.01
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

title_plots
class(title_plots)
```

As you can see we know have plot object that just has the text `"Multicollinearity and its effects"`.

Now we will create a subtitle in the same way.

```{r}

forward <- ggdraw() +
  draw_label(
    "Analysis 1: 6 demographic variables\nAnalysis 2: 36 demographic variables",
    fontface = "bold",
    size = 10,
    x = 0,
    hjust = -0.02
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

forward
```

Now we will recreate our correlation plots with some slight alterations. We want to remove our labels, because they will be too small to see when we combine our plots. To do this we will use the `theme_void()` function of the `ggplot2` package.

Note that because we are layering ggplot2 objects we can't use the `%>%` pipe to start with the existing correlation plots.

```{r}

corr_mat_DONOHUE <- corr_mat_DONOHUE +
  theme_void() +
  theme(plot.title = element_text(size = 8, color = "black")) +
  labs(title = "Analysis 1")

corr_mat_LOTT <- corr_mat_LOTT +
  theme_void() +
  theme(plot.title = element_text(size = 8, color = "black")) +
  labs(title = "Analysis 2")
```

OK we want to arrange our correlation plots to be in the top row of our larger plot. Now we will use the `plot_grid()` function to arrange the plots.

```{r}
row_A <- plot_grid(corr_mat_DONOHUE,
  corr_mat_LOTT,
  nrow = 1
)
row_A
```


Nice! We have combined plots!


Now let's add a title for these plots.

```{r}

title_A <- ggdraw() +
  draw_label(
    "Correlation between variables can induce multicollinearity",
    fontface = "bold",
    size = 14,
    x = 0,
    hjust = -0.01
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plot_A <- plot_grid(title_A,
  row_A,
  ncol = 1,
  rel_heights = c(0.1, 1)
)
```


For our second row in our larger plot we want to have the formula for calculating VIF values on the left and the plot that we created previously showing VIF values on the right.

First we will create a plot object that just has the formula.

To do so we are going to create a plot with a large label in the middle containing the formula. Then we will use the `theme_void()` function again to remove the axis labels and background.

To create our plot we will first plot values from 1-10 for both the x and y axis, allowing us to center the formula at the x and y values of 5.

To type the formula we will use [LaTeX mathematical notation](https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html){target="_blank"}.

The start and end of inline mathematical formulas are specified using dollar signs (`$`).   
Subscripts are written by using an underscore (`_`) and brackets (`{}`) indicate the start and end of the subscript.   

Fractions are indicated using `\frac{numerator}{denominator}`.

Superscripts are created using the carrot symbol (`^`) and brackets (`{}`) indicate the start and end of the superscript. 

Greek letters  are created by using, for example, `\beta`.

In the case of the fraction and Greek letters an additional `\` is needed in the `Tex()` function.

We will use the `TeX()` function of the `latex2exp` package to convert our LaTeX string to a [plotmath expression](https://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/plotmath.html){target="_blank"} (a mathematical notation in R to be used in plots).

```{r}
empty_df <- cbind(c(1:10), c(1:10)) %>%
  as.data.frame()

colnames(empty_df) <- c("X", "Y")

plot_B1 <- ggplot(empty_df, aes(x = X, y = Y)) +
  annotate("text",
    x = 5,
    y = 8,
    label = TeX("$X_{1} = \\beta_{0} + \\beta_{2}X_{2} + \\beta_{3}X_{3}...+\\beta_{k}X_{k}+e$"),
    size = 7
  ) +
  ylim(0, 10) +
  xlim(0, 10) +
  annotate("text",
    x = 5.9,
    y = 5.5,
    label = TeX("$R_1^{2}$"),
    size = 7
  ) +
  ylim(0, 10) +
  xlim(0, 10) +
  geom_segment(aes(x = 5, y = 6, xend = 5, yend = 4.5),
    arrow = arrow(angle = 45, ends = "last", type = "open"),
    size = 1.8,
    color = "black",
    lineend = "butt",
    linejoin = "mitre"
  ) +
  annotate("text",
    x = 5,
    y = 2,
    label = TeX("$VIF_{1} = \\frac{1}{1-R_{1}^{2}}$"),
    size = 7
  )


plot_B1

plot_B1 <- plot_B1 +
  theme_void()

plot_B1
```

Now we will combine this with the VIF plot.

```{r}
plot_B2 <- vif_plot +
  theme(axis.text.x = element_text(size = 8))

row_B <- plot_grid(plot_B1,
  plot_B2,
  nrow = 1
)

title_B <- ggdraw() +
  draw_label(
    "Variance inflation factors can be used to identify multicollinearity when present",
    fontface = "bold",
    size = 14,
    x = 0,
    hjust = -.01,
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plot_B <- plot_grid(title_B,
  row_B,
  ncol = 1,
  rel_heights = c(0.1, 1)
)

plot_B
```


Now for the third row we want to include the `comparing_analyses_plot` and the `simulation_plot`.

```{r}
plot_C1 <- comparing_analyses_plot +
  theme(
    axis.text.x = element_text(size = 8),
    axis.title.x = element_blank()
  ) +
  labs(
    title = "Results in different estimates",
    subtitle = "Different demographic groupings can change direction of estimate"
  )

plot_C2 <- simulation_plot +
  labs(title = "Reduces precision in estimates")

row_C <- plot_grid(plot_C1,
  plot_C2,
  nrow = 1
)

title_C <- ggdraw() +
  draw_label(
    "Multicollinearity can have an effect on statistical inference",
    fontface = "bold",
    size = 14,
    x = 0,
    hjust = -0.01
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plot_C <- plot_grid(title_C,
  row_C,
  ncol = 1,
  rel_heights = c(0.1, 1)
)

plot_C
```


Now that we have all of our rows we can combine everything together. We will also make the background of the plot white, which would be transparent otherwise.

```{r}
plots <- plot_grid(plot_A,
  plot_B,
  plot_C,
  ncol = 1,
  rel_heights = c(1, 1, 1)) 


mainplot <- plot_grid(title_plots,
  forward,
  plots,
  ncol = 1,
  rel_heights = c(
    0.05,
    0.05,
    1
  )
) +
  theme(plot.background = element_rect(fill = "white"))

mainplot
```


```{r, echo=FALSE, include=FALSE}
ggsave(here::here("img", "mainplot.png"))
```


# **Summary**
*** 

This case study has introduced the concept of multicollinearity by exploring data related to violent crimes and right-to-carry gun laws. We also introduced the topic of panel data as a special type of longitudinal data that includes data of 2 or more individuals or groups over 2 or more time points. We learned that we can use the `plm` package to perform panel linear regression analysis. We learned that the fixed effect model in panel analysis actually makes the least assumptions, and is therefore often the most appropriate test.

By evaluating two analyses that were identical except for the inclusion of extra demographic variables (Analysis 1 included 6, while Analysis 2 included 36), we discovered that redundant and collinear variables can change the directionality and magnitude of our findings. 

We learned that by looking at the correlation between pairs of explanatory variables we can get a sense about whether multicollinearity may exist in our data.

We learned that we can evaluate the stability of our coefficient estimates across sub-samples or calculate variance inflation factor (VIF) values to get a sense of the presence and severity of multicollinearity.

We learned that often a rule of thumb of >10 is used as a threshold for raising concern about the severity of multicollinearity. However, we also learned that (as often is the case with thresholds) more care may be required. 

Overall we learned that multicollinearity can bias our regression findings and it is good practice to check for multicollinearity when performing regression analysis.  It is something to keep in mind when we encounter coefficient estimates that are unexpected. 

Importantly this case study showcases how methodological details, like how we decide to parse our demographic variables, can have great consequences on the results of our analyses. 


# **Suggested Homework**
*** 

Ask students to remove one or more of the demographic variables with high VIF values from the Mustard and Lott-like panel data and perform the panel linear regression analysis again, as well as calculate the VIF values. 

Ask the students to discuss how this possibly changed the results.


# **Additional Information**
***

## **Helpful Links**
***

[Tidyverse](https://www.tidyverse.org/){target="_blank"}  
Please see [this case study](https://opencasestudies.github.io/ocs-bp-co2-emissions/){target="_blank"}  for more details on using `ggplot2`     
[Longitudinal studies](https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated/7-longitudinal-studies){target="_blank"}   
[Panel data](https://en.wikipedia.org/wiki/Panel_data){target="_blank"}    
[Confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval){target="_blank"}   
[Linear regression](https://en.wikipedia.org/wiki/Linear_regression){target="_blank"}  
[panel regression analysis](https://en.wikipedia.org/wiki/Panel_analysis){target="_blank"}   
[Hausman test](https://en.wikipedia.org/wiki/Durbin%E2%80%93Wu%E2%80%93Hausman_test){target="_blank"} 
[Resampling](https://en.wikipedia.org/wiki/Resampling_(statistics)){target="_blank"}   
[Variance inflation factor (VIF)](https://en.wikipedia.org/wiki/Variance_inflation_factor){target="_blank"}   
[$R^2$ coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination){target="_blank"}   
[Ridge regression](https://en.wikipedia.org/wiki/Tikhonov_regularization){target="_blank"}  
[LaTeX mathematical notation](https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html){target="_blank"}   

For more information on linear regression see this [book](https://rafalab.github.io/dsbook/linear-models.html#linear-regression-in-the-tidyverse){target="_blank"} and this [case study](https://opencasestudies.github.io/ocs-bp-diet/){target="_blank"}.

For more information on the different types of panel regression models see this [book](https://bookdown.org/ccolonescu/RPoE4/panel-data-models.html),  [here](https://www.bauer.uh.edu/rsusmel/phd/ec1-15.pdf), and [here](https://sites.google.com/site/econometricsacademy/econometrics-models/panel-data-models).

For more information on implementing panel regression in R using the `plm` package, see [here](https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html){target="_blank"}  and [here](http://www.princeton.edu/~otorres/Panel101R.pdf){target="_blank"}.

For more information on multicollinearity and VIF, see this [article](https://link.springer.com/content/pdf/10.1007/s11135-006-9018-6.pdf){target="_blank"}.DOI 10.1007/s11135-006-9018-6

The articles used to motivate this case study are:   
[Mustard and Lott](https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1150&context=law_and_economics){target="_blank"}  
[Donohue, et al.](https://www.nber.org/papers/w23510.pdf){target="_blank"}     
[See here for a list of studies on this topic ](https://en.wikipedia.org/wiki/More_Guns,_Less_Crime){target="_blank"}  

For tutorials about using fixed effect models in panel analysis with the `plm` package, see [here](https://rpubs.com/rslbliss/fixed_effects) 
and [here](http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html)  

<u>**Packages used in this case study:** </u>  

  Package   | Use in this case study                                                                        
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[magrittr](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} | to use the compound assignment pipe operator `%<>%`
[purrr](https://purrr.tidyverse.org/){target="_blank"}   | to import the data in all the different excel and csv files efficiently  
[tibble](https://tibble.tidyverse.org/){target="_blank"}     | to create data objects that we can manipulate with `dplyr`/`stringr`/`tidyr`/`purrr`  
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}     | to create plots   
[ggrepel](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html){target="_blank"}    | to allow labels in figures not to overlap  
[plm](https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html){target="_blank"} | to work with panel data fitting fixed effects and linear regression models  
[broom](https://cran.r-project.org/web/packages/broom/vignettes/broom.html){target="_blank"} | to create nicely formatted model output  
[GGally](https://github.com/ggobi/ggally){target="_blank"} | to extend ggplot2 functionality to easily create more complex plots  
[ggcorrplot](https://www.rdocumentation.org/packages/ggcorrplot/versions/0.1.3){target="_blank"} | to easily visualize a correlation matrix  
[rsample](https://rsample.tidymodels.org){target="_blank"} | to split our sample for the simulation analysis   
[DT](https://rstudio.github.io/DT/){target="_blank"}  | to create interactive and searchable tables  
[car](https://cran.r-project.org/web/packages/car/vignettes/embedding.pdf){target="_blank"}  | to calculate VIF values on linear model output  
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the character strings within the data  
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined 
[latex2exp](https://cran.r-project.org/web/packages/latex2exp/vignettes/using-latex2exp.html){target="_blank"} | to convert latex math formulas to R's plotmath expressions  

## **Session Info**
***

```{r}
sessionInfo()
```

**Estimate of RMarkdown Compilation Time: **

```{r, echo=FALSE}
rmarkdown:::perf_timer_stop("render")
pts = rmarkdown:::perf_timer_summary()
cat("About", round(pts$time[1]/1000 + 5), "-", round(pts$time[1]/1000 + 15),"seconds")
```

This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems. 

## **Acknowledgments**
***

We would like to acknowledge [Daniel Webster](https://www.jhsph.edu/faculty/directory/profile/739/daniel-webster) for assisting in framing the major direction of the case study. We would also like to thank [Elizabeth Stuart](https://www.jhsph.edu/faculty/directory/profile/1792/elizabeth-a-stuart) and [Aboozar Hadavand](https://www.minerva.kgi.edu/people/aboozar-hadavand-phd-assistant-professor-computational-sciences/) and [Alexander McCourt](https://publichealth.jhu.edu/faculty/3794/alexander-mccourt) for reviewing the case study. 

We would like to acknowledge [Michael Breshock](https://mbreshock.github.io/) for his contributions to this case study and developing the `OCSdata` package.

We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work. 

<script type='text/javascript' id='clustrmaps' src='//cdn.clustrmaps.com/map_v2.js?cl=080808&w=a&t=tt&d=9oKh9t-1qM5YL5nLhTMeMvoQgwtjeIQv69wuDhYUdaw&co=ffffff&cmo=3acc3a&cmn=ff5353&ct=808080'></script>