-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathwomen-in-stem.qmd
183 lines (123 loc) · 5.81 KB
/
women-in-stem.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
---
title: "Practicality of Using Transformations in MLR"
author: "Lydia Gibson"
format:
revealjs:
theme: [default, style.scss]
footer: Cal State East Bay STAT 694 Final Presentation
---
# Introduction
```{r echo=FALSE, warning=FALSE, message=FALSE}
dat1 <- read.csv("women-stem.csv")
library(pacman)
suppressWarnings(p_load(dplyr, ggplot2, ggpubr, scales, MASS, car, lmtest,
ggrepel, faraway, ggcorrplot, GGally, lindia, see,
performance, ggstatsplot, rstantools, PMCMRplus,
cowplot, parameters, report, modelbased))
options(scipen = 100) # remove scientific notation
```
- In previous research, I've used multiple linear regression **(MLR)** to explore the relationship between median salary of STEM majors and gender demographics.
- The final reduced model used the inverse transformation of the response variable, due to the skewness of it's distribution, to improve the model fit.
- While transforming response variables can lead to better fitting models, these models are not easy to explain.
# Outline
Problem
Data Source
Methods
Results
Conclusion
Further Research
# Problem
- How much prediction power is lost by ***not*** using a transformed response variable in a linear regression model?
- Is it worth the inability to easily explain your model?
# Data Source
- The [data set](https://github.com/fivethirtyeight/data/blob/master/college-majors/women-stem.csv) was obtained from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series (PUMS).
- It has 76 observations and 9 variables.
```{r echo=FALSE, include=FALSE}
options(scipen=3)
# remove Rank, Major_code, and Major
dat2 <- dat1[,-c(1,2,3)]
#datawizard::standardize(dat2)
lm1 <- lm(Median ~., data=dat2[-c(2)])
lm1_reduced<-step(lm1)
lm1_reduced
lm2 <- lm((Median^(-1)) ~., data=dat2[-c(2)])
lm2_reduced<-step(lm2)
lm2_reduced
```
```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}
# bar chart
p7 <- dat1 %>% ggplot(aes(x=Major_category, y=Proportion, fill=Sex)) +
stat_summary(geom = "bar", position="fill") +
ggtitle("Proportion of Sexes across Major Categories") +
scale_y_continuous(labels = scales::percent_format()) +
labs(x="Major Category", y="Proportion of Sexes") +
scale_x_discrete(labels = scales::label_wrap(15)) +
scale_fill_manual(values = c( "#F1F1F1", "#FCA3B5")
)
p7 <- p7 + theme(legend.position = "top")
p7
```
## Exploratory Data Analysis
```{r}
# violin plot
p8 <- ggstatsplot::ggbetweenstats(
data = dat1,
x = Major_category,
y = Median,
xlab = "Major Category",
ylab = " Median Salary($)",
type = "robust",
p.adjust.method = "bonferroni",
results.subtitle = FALSE,
title = "Median Salary across Major Categories",
outlier.tagging = TRUE,
outlier.label = Major,
package = "ggsci",
palette = "default_jco",
pairwise.comparisons = FALSE
)
p8
```
# Method
## Model *Without* Transformation
```{r, size= "tiny"}
parameters::model_parameters(lm1_reduced)
```
## Why a Transformation?
```{r echo=FALSE, warning=FALSE, message=FALSE}
p1 <- ggpubr::gghistogram(dat2$Median) + ggtitle("Response Variable Without Transformation")
p2 <-lindia::gg_boxcox(lm1_reduced) + ggtitle("Box Cox for Model Without Transformation")
cowplot::plot_grid(p1, p2)
```
## Model *With* Transformation
```{r, size="tiny"}
parameters::model_parameters(lm2_reduced)
```
# Results
## Diagnostic Plots for Model *Without* Transformation
```{r}
#lindia::gg_diagnose(lm_full)
performance::check_model(lm1_reduced)
```
## Diagnostic Plots for Model *With* Transformation
```{r}
#lindia::gg_diagnose(lm_reduced)
performance::check_model(lm2_reduced)
```
## Metrics Comparison
```{r}
performance::compare_performance(lm1_reduced, lm2_reduced, metrics = "common")
```
# Conclusion
- Regression models with an inverse-transformation dependent response variable are not easy to explain to individuals without a statistics background, which is likely to occur in statistical consulting.
- Based on the adjusted $R^2$ values of the two models, there is a less than 10% loss of ability to explain the variability between our response variable and explanatory variables by using a linear regression model **without** an inverse-transformation dependent response variable to one **with**.
# Further Research
- I would like to redo my analysis with a data set of a more quantitative nature, and run the regression models using the [TidyModels](https://github.com/tidymodels) framework.
- I would like to do a comparison of the data visualizations and analyses available in the various ggplot2 extension packages ([`ggpubr`](https://github.com/kassambara/ggpubr), [`easystats`](https://github.com/easystats), [`lindia`](https://github.com/yeukyul/lindia), [`ggstatsplot`](https://github.com/IndrajeetPatil/ggstatsplot)) used in this presentation.
# Acknowledgements
- I would like to thank my colleagues, Sara Hatter and Ken Vu, with whom I collaborated on the previous research projects, [*Gender Wage Inequality in STEM*](https://github.com/lgibson7/Gender-Wage-Inequality-in-STEM) and [*Unemployment in STEM*](https://github.com/lgibson7/Unemployment-in-STEM), which laid the groundwork for this project.
- I would like to acknowledge the FiveThirtyEight blog for uploading the data behind their story, [*The Economic Guide To Picking A College Major*](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/), which was used for this analysis.
- I would like to thank Prof. Eric Suess for his guidance with this research and the ***#rstats*** community members who helped with the styling of this presentation.
# Appendix
- This presentation can be viewed at: <https://lgibson7.quarto.pub/women-in-stem>.
- Code for this presentation can be found at: <https://github.com/lgibson7/Women-in-STEM>.