women-in-stem.qmd

---
title: "Practicality of Using Transformations in MLR"
author: "Lydia Gibson"
format: 
  revealjs:
    theme: [default, style.scss]
    footer: Cal State East Bay STAT 694 Final Presentation
---

# Introduction

```{r echo=FALSE, warning=FALSE, message=FALSE}
dat1 <- read.csv("women-stem.csv")

library(pacman)

suppressWarnings(p_load(dplyr, ggplot2, ggpubr, scales, MASS, car, lmtest, 
                        ggrepel, faraway, ggcorrplot, GGally, lindia, see, 
                        performance, ggstatsplot, rstantools, PMCMRplus, 
                        cowplot, parameters, report, modelbased))

options(scipen = 100) # remove scientific notation
```

-   In previous research, I've used multiple linear regression **(MLR)** to explore the relationship between median salary of STEM majors and gender demographics.
-   The final reduced model used the inverse transformation of the response variable, due to the skewness of it's distribution, to improve the model fit.
-   While transforming response variables can lead to better fitting models, these models are not easy to explain.

# Outline

Problem

Data Source

Methods

Results

Conclusion

Further Research

# Problem

-   How much prediction power is lost by ***not*** using a transformed response variable in a linear regression model?

-   Is it worth the inability to easily explain your model?

# Data Source

-   The [data set](https://github.com/fivethirtyeight/data/blob/master/college-majors/women-stem.csv) was obtained from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series (PUMS).
-   It has 76 observations and 9 variables.

```{r echo=FALSE, include=FALSE}
options(scipen=3)

# remove Rank, Major_code, and Major
dat2 <- dat1[,-c(1,2,3)] 

#datawizard::standardize(dat2)

lm1 <- lm(Median ~., data=dat2[-c(2)])
lm1_reduced<-step(lm1)
lm1_reduced
lm2 <- lm((Median^(-1)) ~., data=dat2[-c(2)])
lm2_reduced<-step(lm2)
lm2_reduced
```

```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}


# bar chart
p7 <- dat1 %>% ggplot(aes(x=Major_category, y=Proportion, fill=Sex)) +
  stat_summary(geom = "bar", position="fill") +
 ggtitle("Proportion of Sexes across Major Categories") +
  scale_y_continuous(labels = scales::percent_format()) + 
  labs(x="Major Category", y="Proportion of Sexes") +
  scale_x_discrete(labels = scales::label_wrap(15)) +
  scale_fill_manual(values = c( "#F1F1F1", "#FCA3B5")
                    )

p7 <- p7 + theme(legend.position = "top")
p7

```

## Exploratory Data Analysis

```{r}
# violin plot
p8 <- ggstatsplot::ggbetweenstats(
  data  = dat1,
  x     = Major_category,
  y     = Median,
  xlab = "Major Category",
  ylab = " Median Salary($)",
  type = "robust",
  p.adjust.method = "bonferroni",
  results.subtitle = FALSE,
  title = "Median Salary across Major Categories",
  outlier.tagging = TRUE,
  outlier.label = Major,
  package = "ggsci",
  palette = "default_jco", 
  pairwise.comparisons = FALSE
  
  )

p8
```

# Method


## Model *Without* Transformation

```{r, size= "tiny"}
parameters::model_parameters(lm1_reduced)
```

## Why a Transformation?

```{r echo=FALSE, warning=FALSE, message=FALSE}
p1 <- ggpubr::gghistogram(dat2$Median) + ggtitle("Response Variable Without Transformation")

p2 <-lindia::gg_boxcox(lm1_reduced) + ggtitle("Box Cox for Model Without Transformation")

cowplot::plot_grid(p1, p2)
```

## Model *With* Transformation

```{r, size="tiny"}
parameters::model_parameters(lm2_reduced)

```

# Results

## Diagnostic Plots for Model *Without* Transformation

```{r}
#lindia::gg_diagnose(lm_full)
performance::check_model(lm1_reduced)
```

## Diagnostic Plots for Model *With* Transformation

```{r}
#lindia::gg_diagnose(lm_reduced)
performance::check_model(lm2_reduced)
```

## Metrics Comparison

```{r}
performance::compare_performance(lm1_reduced, lm2_reduced, metrics = "common")
```

# Conclusion


-   Regression models with an inverse-transformation dependent response variable are not easy to explain to individuals without a statistics background, which is likely to occur in statistical consulting.
-   Based on the adjusted $R^2$ values of the two models, there is a less than 10% loss of ability to explain the variability between our response variable and explanatory variables by using a linear regression model **without** an inverse-transformation dependent response variable to one **with**.


# Further Research

-   I would like to redo my analysis with a data set of a more quantitative nature, and run the regression models using the [TidyModels](https://github.com/tidymodels) framework.
-   I would like to do a comparison of the data visualizations and analyses available in the various ggplot2 extension packages ([`ggpubr`](https://github.com/kassambara/ggpubr), [`easystats`](https://github.com/easystats), [`lindia`](https://github.com/yeukyul/lindia), [`ggstatsplot`](https://github.com/IndrajeetPatil/ggstatsplot)) used in this presentation.

# Acknowledgements

-   I would like to thank my colleagues, Sara Hatter and Ken Vu, with whom I collaborated on the previous research projects, [*Gender Wage Inequality in STEM*](https://github.com/lgibson7/Gender-Wage-Inequality-in-STEM) and [*Unemployment in STEM*](https://github.com/lgibson7/Unemployment-in-STEM), which laid the groundwork for this project.
-   I would like to acknowledge the FiveThirtyEight blog for uploading the data behind their story, [*The Economic Guide To Picking A College Major*](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/), which was used for this analysis.
-   I would like to thank Prof. Eric Suess for his guidance with this research and the ***#rstats*** community members who helped with the styling of this presentation.

# Appendix

-   This presentation can be viewed at: <https://lgibson7.quarto.pub/women-in-stem>.
-   Code for this presentation can be found at: <https://github.com/lgibson7/Women-in-STEM>.