- HTML: https://www.opencasestudies.org/ocs-bp-youth-disconnection
- GitHub: https://github.com/opencasestudies/ocs-bp-youth-disconnection
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This case study is part of the Open Case Studies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study:
Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie C. (2020). https://github.com/opencasestudies/ocs-youth-disconnection-case-study. Disparities in Youth Disconnection.
We would like to acknowledge Tamar Mendelson for assisting in framing the major direction of the case study.
We would like to acknowledge Qier Meng and Michael Breshock for their contributions to this case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.
The total reading time for this case study was calculated with koRpus: About 85 minutes
The Flesch-Kincaid Readability Index was also calculated with koRpus: Grade 8, Age 13
Disparities in Youth Disconnection
According to this report, youth disconnection (defined by Measure of America, a nonpartisan project, as “young people between the ages of 16 and 24 who are neither working nor in school”), although generally showing decreasing trends for the past 7 years, shows racial and ethnic disparities, with some groups showing increased rates of disconnection.
Thus in this case study we aim to look further at youth disconnection rates among gender and racial and ethnic subgroups to identify groups that may be particularly vulnerable.
- How have youth disconnection rates in American youth changed since 2008?
- In particular, how have these rates changed for different gender and ethnic groups? Are any groups particularly disconnected?
In this case study we will be using data related to youth disconnection from the following two reports from the Measure of America project:
Measure of America is a nonpartisan project of the nonprofit Social Science Research Council founded in 2007 to create easy-to-use yet methodologically sound tools for understanding well-being and opportunity in America. Through reports, interactive apps, and custom-built dashboards, Measure of America works with partners to breathe life into numbers, using data to identify areas of highest need, pinpoint levers for change, and track progress over time.
- Lewis, Kristen. Making the Connection: Transportation and Youth Disconnection. New York: Measure of America, Social Science Research Council, 2019. (Data up to 2017)
- Lewis, Kristen. A Decade Undone: Youth Disconnection in the Age of Coronavirus. New York: Measure of America, Social Science Research Council, 2020. (Data up to 2018)
These reports use data from the American Community Survey (ACS).
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data Science Learning Objectives:
- Importing text from PDF files using images and the `magick` package
- Apply action verbs in `dplyr` for data wrangling
- How to reshape data by pivoting between “long” and “wide” formats and separating columns into additional columns (`tidyr`)
- How to fill in data based on previous values (`tidyr`)
- How to create data visualizations with `ggplot2` that are in a similar style to an existing image
- How to add images to plots using `cowplot`
- How to create effective bar plots for multiple comparisons, including adding gaps between bars in bar plots, adding figure legends to the plot area, and adding comparison lines (`ggplot2`)
Statistical Learning Objectives:
- Implementation of the Mann-Kendall trend test
- Interpretation of the Mann-Kendall trend test
- Difference between linear regression and Mann-Kendall trend test
Data is imported from several tables within two PDF documents by taking screenshots of the tables of interest and using the `magick` package to import the text from the screenshots.
This case study particularly focuses on renaming variables, modifying variables, creating new variables, and modifying the shape of the data using functions such as `pivot_longer()` and `pivot_wider()` of the `tidyr` package, as well as modifying specific variables using the `mutate()` and `across()` functions of the `dplyr` package.
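The reshaping and modifying steps described above can be sketched roughly as follows. This is a minimal illustration, not the case study's actual code: the table `rates_wide`, its column names, and all values are hypothetical and made up for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: values are made up for illustration only.
# Numbers are stored as text, as they often are after importing from images.
rates_wide <- tibble::tibble(
  group  = c("Group A", "Group B"),
  `2017` = c("11.5", "17.2"),
  `2018` = c("11.2", "17.4")
)

# Wide -> long: one row per group and year
rates_long <- rates_wide %>%
  pivot_longer(cols = -group, names_to = "year", values_to = "rate") %>%
  # Convert both text columns to numeric in one step
  mutate(across(c(year, rate), as.numeric))

# Long -> wide: back to one column per year
rates_wide_again <- rates_long %>%
  pivot_wider(names_from = year, values_from = rate)
```

The long format is usually the more convenient one for plotting with `ggplot2`, while the wide format tends to be easier to read as a table.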
This case study also covers combining data with the `bind_rows()` function of the `dplyr` package and the `add_row()` function of the `tibble` package.
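As a rough sketch of combining data, the example below stacks two small tables and then appends one extra row. The tables `report_2019` and `report_2020`, the group names, and all values are hypothetical stand-ins rather than data from the reports.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables standing in for data taken from two reports
report_2019 <- tibble(group = c("Group A", "Group B"), year = 2017, rate = c(11.5, 17.2))
report_2020 <- tibble(group = c("Group A", "Group B"), year = 2018, rate = c(11.2, 17.4))

# Stack the two tables; columns are matched by name
combined <- bind_rows(report_2019, report_2020)

# Append a single hand-entered row (for example, a value missed during import)
combined <- combined %>%
  add_row(group = "Group C", year = 2018, rate = 9.9)
```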
We also cover removing `NA` values with the `drop_na()` function of the `tidyr` package, separating one column into multiple columns using the `separate()` function of the `tidyr` package, filling in `NA` values based on previous values using the `fill()` function and replacing `NA` values with the `replace_na()` function, both of the `tidyr` package, as well as arranging levels of factors using the `forcats` package.
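A minimal sketch of these cleaning steps is shown below. The table `raw`, its columns, and its values are hypothetical and only mimic the kind of output that importing a table from an image might produce.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical toy data: the group label appears only on the first row
# of each block, and the last row is empty
raw <- tibble::tibble(
  group     = c("Group A", NA, "Group B", NA, NA),
  year_rate = c("2017: 11.5", "2018: 11.2", "2017: 17.2", "2018: 17.4", NA),
  note      = c(NA, "revised", NA, NA, NA)
)

tidy <- raw %>%
  drop_na(year_rate) %>%                        # remove rows with no data
  fill(group, .direction = "down") %>%          # carry group labels downward
  separate(year_rate, into = c("year", "rate"),
           sep = ": ", convert = TRUE) %>%      # split one column into two
  mutate(
    note  = replace_na(note, "none"),           # replace remaining NA values
    group = fct_relevel(factor(group), "Group B")  # make "Group B" the first level
  )
```

The order matters here: the empty row is dropped before `fill()` so that it does not inherit a group label.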
Finally, this case study also covers many of the `stringr` functions to manipulate character strings, including `str_extract()`, `str_to_title()`, `str_replace()`, and `str_remove()`.
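To give a sense of how these string functions behave, here is a small sketch applied to a single line of text. The input string and the regular expressions are hypothetical and chosen only for illustration.

```r
library(stringr)

# Hypothetical line of imported text; not taken from the reports
line <- "united states  11.2%"

# Pull out the first decimal number
rate <- str_extract(line, "[0-9]+\\.[0-9]+")

# Drop the trailing number and percent sign, including preceding whitespace
label <- str_remove(line, "\\s*[0-9.]+%$")

# Capitalize each word
label <- str_to_title(label)

# Swap one piece of text for another
short_label <- str_replace(label, "United States", "US")
```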
We include an example of creating a plot to match the style of a plot in an existing report. We also demonstrate how to make effective bar plots, covering details such as creating gaps between groups, taking advantage of these gaps to move the legend to within the plot area, and using horizontal lines to allow for additional comparisons among groups. We also demonstrate how to add images to plots and combine plots using the `patchwork` package.
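One possible way to get gaps between bars and a comparison line in `ggplot2` is sketched below. This is not necessarily the approach used later in the case study, and the data frame `rates`, the `overall` value, and the axis label are all hypothetical.

```r
library(ggplot2)

# Hypothetical toy data; values are made up
rates <- data.frame(
  group = rep(c("Group A", "Group B", "Group C"), each = 2),
  year  = factor(rep(c(2017, 2018), times = 3)),
  rate  = c(11.5, 11.2, 17.2, 17.4, 9.8, 9.9)
)
overall <- 12.0  # hypothetical overall rate used as a comparison line

p <- ggplot(rates, aes(x = group, y = rate, fill = year)) +
  # Bars narrower than the dodge width leave a small gap between bars
  geom_col(position = position_dodge(width = 0.9), width = 0.7) +
  # Horizontal reference line for comparing every group to the same value
  geom_hline(yintercept = overall, linetype = "dashed") +
  labs(x = NULL, y = "Disconnection rate (%)", fill = "Year") +
  theme_minimal()
```

Printing `p` draws the plot. A shared reference line lets readers compare each bar to a common value without reading exact heights off the axis.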
The analysis in this case study covers some basics about probability and hypothesis testing, as well as the Mann-Kendall trend test and the difference between this test and simple linear regression. In this analysis we use the Mann-Kendall trend test to determine whether there has been a trend in the disconnection rates of particular groups of youths over time.
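To make the idea behind the Mann-Kendall trend test concrete, the sketch below computes its main quantities by hand in base R for a series without ties, using the normal approximation for the p-value. The `rate` values are hypothetical. The case study itself relies on the `Kendall` package, whose `MannKendall()` function may report a slightly different p-value because it need not use the same approximation.

```r
# Hypothetical yearly rates for one group; values are made up
rate <- c(14.7, 14.6, 14.1, 13.8, 13.2, 12.3, 11.7, 11.5, 11.2)
n <- length(rate)

# S: over all pairs of time points, count increases minus decreases
pairs <- combn(n, 2)  # each column is a pair (i, j) with i < j
S <- sum(sign(rate[pairs[2, ]] - rate[pairs[1, ]]))

# Variance of S under the null hypothesis of no trend (valid without ties)
var_S <- n * (n - 1) * (2 * n + 5) / 18

# Continuity-corrected Z score and two-sided p-value (normal approximation)
Z <- if (S > 0) (S - 1) / sqrt(var_S) else if (S < 0) (S + 1) / sqrt(var_S) else 0
p_value <- 2 * pnorm(-abs(Z))

# Kendall's tau: S divided by the number of pairs (again assuming no ties)
tau <- S / choose(n, 2)
```

Because the test uses only the signs of the pairwise differences, it detects any monotonic trend. Simple linear regression, in contrast, assumes a straight-line relationship.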
- RStudio
- Cheatsheet on RStudio IDE
- Other RStudio cheatsheets
- Tidyverse
- Response bias
- Cross-Sectional data
- Population
- Sample
- Sampling methods
- Inference
- American Community Survey (ACS). See here for more detailed information about the survey.
- Measure of America
- Social Science Research Council
- Piping in R
- Writing functions. Also see this case study for more information on writing functions.
- String manipulation cheatsheet
- Table formats
- Regression
- simple linear regression
- monotonic association
- Kendall rank correlation coefficient
- Null hypothesis
- Alternative hypothesis
- Probability
- one-sided and two-sided hypotheses
- Nonparametric
- Parametric
- significance threshold
- Z score
- Z score table
- Z score to p-value calculator
- `ggplot2` package. Please see this case study for more details on using `ggplot2`.
- grammar of graphics
- `ggplot2` themes
- `directlabels` package methods
- Hmong people
- Intersections
- Motivating article for this case study about youth disconnection/opportunity youth
To learn more about importing and wrangling PDFs using the `pdftools` package see this case study.
To learn more about what you can do with the `magick` package see this vignette.
To learn more about the Mann-Kendall trend test see here and here.
To learn more about hypothesis testing, see this case study.
Packages used in this case study:
Package | Use in this case study |
---|---|
here | to easily load and save data |
pdftools | to import PDF documents |
magick | for importing images and extracting text from images |
tesseract | for extracting text from images with magick |
knitr | for showing images in reports |
dplyr | to filter, subset, join, add rows to, and modify the data |
stringr | to manipulate strings |
magrittr | to pipe sequential commands |
tidyr | to change the shape or format of tibbles to wide and long, to drop rows with NA values, to separate a column into additional columns, and to fill out values based on previous values |
tibble | to create tibbles |
ggplot2 | to create plots |
directlabels | to add labels directly to lines in plots |
cowplot | to add images to plots |
forcats | to reorder factor for plot |
Kendall | to implement the Mann-Kendall trend test in R |
patchwork | to combine plots |
DT | Interactive tables |
Instructors can start at the Data Visualization or Data Analysis sections. Instructors can also skip the Subgroup plots section if they don’t wish to instruct students about making bar plots in depth.
This case study is appropriate for those new to R programming and new to statistics. It is also appropriate for more advanced R users who are new to the Tidyverse. This particular case study may require some fundamental knowledge of statistics.
- For the Asian and Latinx subgroup bar plots made across years, modify these plots to consider gender differences (instead of differences across time).
- Taking the plot you made above, modify the plot to facet across years.
- Find another table in one of the reports to import using the `magick` package (for example perhaps the data about different states over time in the 2019 report called Making the Connection). Look for differences between groups by plotting the data and evaluating with the Mann-Kendall test.
About 36 to 46 seconds
This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems.