-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.Rmd
231 lines (152 loc) · 15.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
---
output: md_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# Open Case Studies: Disparities in Youth Disconnection
<!-- badges: start -->
[![render-README](https://github.com/opencasestudies/ocs-bp-youth-disconnection/workflows/render-README/badge.svg)](https://github.com/opencasestudies/ocs-bp-youth-disconnection/actions)
[![render-index](https://github.com/opencasestudies/ocs-bp-youth-disconnection/workflows/render-index/badge.svg)](https://github.com/opencasestudies/ocs-bp-youth-disconnection/actions)
<!-- badges: end -->
### Important links
- HTML: https://www.opencasestudies.org/ocs-bp-youth-disconnection
- GitHub: https://github.com/opencasestudies/ocs-bp-youth-disconnection
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
### Disclaimer
The purpose of the [Open Case Studies](https://www.opencasestudies.org)
project is **to demonstrate
the use of various data science methods, tools, and software in the
context of messy, real-world data**. A given case study does not cover
all aspects of the research process, is not claiming to be the most
appropriate way to analyze a given dataset, and should not be used in
the context of making policy decisions without external consultation
from scientific experts.
### License
This case study is part of the [Open Case Studies](https://www.opencasestudies.org) project.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 ([CC BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/us/)) United States License.
### Citation
To cite this case study:
Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie C. (2020). [https://github.com/opencasestudies/ocs-youth-disconnection-case-study](https://github.com/opencasestudies/ocs-youth-disconnection-case-study). Disparities in Youth Disconnection.
### Acknowledgments
We would like to acknowledge [Tamar Mendelson](https://www.jhsph.edu/faculty/directory/profile/1770/tamar-mendelson) for assisting in framing the major direction of the case study.
We would like to acknowledge [Qier Meng](https://www.opencasestudies.org/authors/qmeng/) and [Michael Breshock](https://mbreshock.github.io/) for their contributions to this case study.
We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work.
### Reading Metrics
The total reading time for this case study was calculated with [koRpus](https://github.com/unDocUMeantIt/koRpus): **About 85 minutes**
The Flesch-Kincaid Readability Index was also calculated with [koRpus](https://github.com/unDocUMeantIt/koRpus): **Grade 8, Age 13**
### Title
Disparities in Youth Disconnection
### Motivation
According to this [report](https://ssrc-static.s3.amazonaws.com/moa/Making%20the%20Connection.pdf) youth disconnection (defined as "young people between the ages of **16 and 24** who are **neither working nor in school**" according to the [Measure of America](https://www.ssrc.org/programs/view/moa/) (a nonpartisan project) although generally showing decreasing trends for the past 7 years, shows **racial and ethnic disparities**, where some groups are showing increased rates of disconnection.
Thus in this case study we aim to look further at youth disconnection rates among gender and racial and ethnic subgroups to identify groups that may be particularly vulnerable.
### Motivating questions
1) How have youth disconnection rates in American youth changed since 2008?
2) In particular, how has this changed for different gender and ethnic groups? Are any groups particularly disconnected?
### Data
In this case study we will be using data related to youth disconnection from the two following reports from the [Measure of America](https://www.ssrc.org/programs/view/moa/) project:
> Measure of America is a nonpartisan project of the nonprofit [Social Science Research Council](https://www.ssrc.org/) founded in 2007 to create easy-to-use yet methodologically sound tools for understanding well-being and opportunity in America. Through reports, interactive apps, and custom-built dashboards, Measure of America works with partners to breathe life into numbers, using data to identify areas of highest need, pinpoint levers for change, and track progress over time.
1. Lewis, Kristen. [Making the Connection: Transportation and Youth Disconnection](https://ssrc-static.s3.amazonaws.com/moa/Making%20the%20Connection.pdf). New York: Measure of America, Social Science Research Council, 2019. (Data up to 2017)
2. : Lewis, Kristen. [A Decade Undone: Youth Disconnection in the Age of Coronavirus](https://ssrc-static.s3.amazonaws.com/moa/ADecadeUndone.pdf).
New York: Measure of America, Social Science Research Council, 2020. (Data up to 2018)
These reports use data from the [American Community Survey (ASC)](https://www.census.gov/programs-surveys/acs).
#### Learning Objectives
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
<u>**Data Science Learning Objectives:**</u>
1. Importing text from PDF files using images and the `magick` package
2. Apply action verbs in `dplyr` for data wrangling
3. How to reshape data by pivoting between "long" and "wide" formats and separating columns into additional columns (`tidyr`)
4. How to fill in data based on previous values (`tidyr`)
5. How to create data visualizations with `ggplot2` that are in a similar style to an existing image
6. How to add images to plots using `cowplot`
7. How to create effective bar plots to for multiple comparisons, including adding gaps between bars in bar plots, adding figure legends to the plot area, and adding comparison lines (`ggplot2`)
<u>**Statistical Learning Objectives:**</u>
1. Implementation of the Mann-Kendall trend test
2. Interpretation of the Mann-Kendall trend test
3. Difference between linear regression and Mann-Kendall trend test
#### Data import
Data is imported from several tables within two PDF documents by taking screenshots of the tables of interest and using the `magick` package to import the text from the screenshots.
#### Data wrangling
This case study particularly focuses on renaming variables, modifying variables, creating new variables, and modifying the shape of the data using functions such as as: `pivot_longer()`, and `pivot_wider()`, as well as modifying specific variables using the `mutate()` and `across()` functions of the `dplyr` package.
This case study also covers combining data with `bind_rows()` and `add_rows()` functions of the `dplyr` package.
We also cover removing NA values with the `drop_na()` function of the `tidyr` package, separating one column into multiple columns using the `separate()` function of the `tidyr` package, filling in `NA` values based on previous values using the `fill()` and replacing `NA` values with the `replace_na()` function, both of the `tidyr` package, as well as arranging levels of factors using the `forcats` package.
Finally, this case study also covers many of the `stringr` functions to manipulate character strings, including `str_extract()`, `str_to_title()`, `str_replace()`, `str_remove()`.
#### Data Visualization
We include an example of creating a plot to match the style of a plot in an existing report. We also demonstrate how to make effective bar plots, by demonstrating details such as creating gaps between groups, taking advantage of these gaps to move the legend to within the plot area, and to use horizontal lines to allow for additional comparisons among groups. We also demonstrate how to add images to plots and combine plots using the `patchwork` package.
### Analysis
The analysis in this case study covers some basics about probability and hypothesis testing, as well as the Mann-Kendall trend test and the difference between this test and simple linear regression. In this analysis we use the Mann-Kendall to test if there has been a trend within the disconnection rates of particular groups of youths over time.
### Other notes and resources
[RStudio](https://rstudio.com/products/rstudio/features/){target="_blank"}
[Cheatsheet on RStuido IDE](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf){target="_blank"}
[Other RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}
[Tidyverse](https://www.tidyverse.org/){target="_blank"}
[Response bias](https://en.wikipedia.org/wiki/Response_bias)
[Cross-Sectional data](https://en.wikipedia.org/wiki/Cross-sectional_data?oldformat=true)
[Population](https://en.wikipedia.org/wiki/Population?oldformat=true)
[Sample](https://en.wikipedia.org/wiki/Sampling_(statistics)?oldformat=true)
[Sampling methods](https://en.wikipedia.org/wiki/Sampling_(statistics)?oldformat=true){target="_blank"}
[Inference](https://www.britannica.com/science/inference-statistics)
[American Community Survey (ASC)](https://www.census.gov/programs-surveys/acs)
See [here](https://en.wikipedia.org/wiki/American_Community_Survey) for more detailed information about the survey
[Measure of America](https://www.ssrc.org/programs/view/moa/)
[Social Science Research Council](https://www.ssrc.org/)
[Piping in R](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}
[Writing functions](https://r4ds.had.co.nz/functions.html)
Also see [this case study](https://www.opencasestudies.org/ocs-bp-vaping-case-study/){target="_blank"} for more information on writing functions.
[String manipulation cheatsheet](https://rstudio.com/resources/cheatsheets/){target="_blank"}
[Table formats](https://en.wikipedia.org/wiki/Wide_and_narrow_data){target="_blank"}
[Regression](https://lindeloev.github.io/tests-as-linear/)
[simple linear regression](https://en.wikipedia.org/wiki/Simple_linear_regression)
[monotonic association](https://www.statisticshowto.com/monotonic-relationship/)
[Kendall rank correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient?oldformat=true)
[Null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis)
[Alternative hypothesis](https://en.wikipedia.org/wiki/Alternative_hypothesis)
[Probability](https://en.wikipedia.org/wiki/Probability)
[one-sided and two-sided hypotheses](https://en.wikipedia.org/wiki/One-_and_two-tailed_tests)
[Nonparametric](https://en.wikipedia.org/wiki/Nonparametric_statistics) [Parametric](https://en.wikipedia.org/wiki/Parametric_statistics)
[significance threshold](https://en.wikipedia.org/wiki/Statistical_significance)
[Z score](https://en.wikipedia.org/wiki/Standard_score)
[Z score table](https://socratic.org/questions/5986a3e1b72cff6fd48a5408)
[Z score to p-value calculator](https://www.socscistatistics.com/pvalues/normaldistribution.aspx)
[`ggplot2` package](http://ggplot2.tidyverse.org){target="_blank"}
Please see [this case study](https://www.opencasestudies.org/ocs-bp-co2-emissions/) for more details on using `ggplot2`
[grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html){target="_blank"}
[`ggplot2` themes](https://ggplot2.tidyverse.org/reference/ggtheme.html){target="_blank"}
[`directlabels` package methods](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"}
[Hmong people](https://en.wikipedia.org/wiki/Hmong_people)
[Intersections](https://www.vox.com/the-highlight/2019/5/20/18542843/intersectionality-conservatism-law-race-gender-discrimination)
[Motivating article for this case study about youth disconnection/opportunity youth](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6243446/)
To learn more about importing and wrangling PDFs using the `pdftools` package see this [case study](https://www.opencasestudies.org/ocs-bp-rural-and-urban-obesity/).
To learn more about what you can do with the `magick` package see this [vingette](https://cran.r-project.org/web/packages/magick/vignettes/intro.html).
To learn more about the **Mann-Kendall trend test** see
[here](https://www.statisticshowto.com/mann-kendall-trend-test/) and [here](https://www.statisticshowto.com/wp-content/uploads/2016/08/Mann-Kendall-Analysis-1.pdf).
To learn more about hypothesis testing, see this [case study](https://www.opencasestudies.org/ocs-bp-rural-and-urban-obesity/).
<u>**Packages used in this case study:** </u>
Package | Use in this case study
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data
[pdftools](https://readr.tidyverse.org/) | to import PDF documents
[magick](https://cran.r-project.org/web/packages/magick/vignettes/intro.html#Kernel_convolution) | for importing images and extracting text from images
[tesseract](https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html) | for extracting text from images with `magick`
[knitr](https://yihui.org/knitr/) | for showing images in reports
[dplyr](https://dplyr.tidyverse.org/){target="_blank"} | to filter, subset, join, add rows to, and modify the data
[stringr](https://stringr.tidyverse.org/){target="_blank"} | to manipulate strings
[magrittr](https://magrittr.tidyverse.org/){target="_blank"} | to pipe sequential commands
[tidyr](https://tidyr.tidyverse.org/){target="_blank"} | to change the shape or format of tibbles to wide and long, to drop rows with `NA` values, to separate a column into additional columns, and to fill out values based on previous values
[tibble](https://tibble.tidyverse.org/){target="_blank"} | to create tibbles
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} | to create plots
[directlabels](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"} | to add labels directly to lines in plots
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to add images to plots
[forcats](https://forcats.tidyverse.org/){target="_blank"} | to reorder factor for plot
[kendall](https://cran.r-project.org/web/packages/Kendall/Kendall.pdf) | to implement the Mann-Kendall trend test in R
[patchwork](https://github.com/thomasp85/patchwork) | to combine plots
[DT](https://cran.r-project.org/web/packages/DT/index.html) | Interactive tables
#### For instructors
Instructors can start at the Data Visualization or Data Analysis sections. Instructors can also skip the *Subgroup plots* section if they don't wish to instruct students about making bar plots in depth.
#### Target audience
This case study is appropriate for those new to R programming and new to statistics. It is also appropriate for more advanced R users who are new to the Tidyverse. This particular case study may require some fundamental knowledge of statistics.
#### Suggested homework
- For the Asian and Latinx subgroup bar plots made across year, modify these plots to consider gender differences (instead of across time).
- Taking the plot you made above, modify the plot to facet across years.
- Find another table in one of the reports to import using the `magick` package (for example perhaps the data about different states over time in the 2019 report called [Making the Connection](https://ssrc-static.s3.amazonaws.com/moa/Making%20the%20Connection.pdf)). Look for differences between groups by plotting the data and evaluating with the Mann-Kendall test.
#### Estimate of RMarkdown Compilation Time:
~ About 36 - 46 seconds
This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems.