Open Case Studies: Disparities in Youth Disconnection

Important links

Disclaimer

The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.

License

This case study is part of the Open Case Studies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

Citation

To cite this case study:

Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie C. (2020). Disparities in Youth Disconnection. https://github.com/opencasestudies/ocs-youth-disconnection-case-study

Acknowledgments

We would like to acknowledge Tamar Mendelson for assisting in framing the major direction of the case study.

We would like to acknowledge Qier Meng and Michael Breshock for their contributions to this case study.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

Reading Metrics

The total reading time for this case study was calculated with koRpus: About 85 minutes

The Flesch-Kincaid Readability Index was also calculated with koRpus: Grade 8, Age 13

Title

Disparities in Youth Disconnection

Motivation

According to this report, youth disconnection (defined by Measure of America, a nonpartisan project, as “young people between the ages of 16 and 24 who are neither working nor in school”) has generally shown decreasing trends over the past 7 years. However, it also shows racial and ethnic disparities, with some groups showing increased rates of disconnection.

Thus, in this case study we aim to look further at youth disconnection rates among gender, racial, and ethnic subgroups to identify groups that may be particularly vulnerable.

Motivating questions

  1. How have youth disconnection rates in American youth changed since 2008?
  2. In particular, how have these rates changed for different gender and ethnic groups? Are any groups particularly disconnected?

Data

In this case study we will be using data related to youth disconnection from the following two reports from the Measure of America project:

  1. Lewis, Kristen. Making the Connection: Transportation and Youth Disconnection. New York: Measure of America, Social Science Research Council, 2019. (Data up to 2017)
  2. Lewis, Kristen. A Decade Undone: Youth Disconnection in the Age of Coronavirus. New York: Measure of America, Social Science Research Council, 2020. (Data up to 2018)

Measure of America is a nonpartisan project of the nonprofit Social Science Research Council founded in 2007 to create easy-to-use yet methodologically sound tools for understanding well-being and opportunity in America. Through reports, interactive apps, and custom-built dashboards, Measure of America works with partners to breathe life into numbers, using data to identify areas of highest need, pinpoint levers for change, and track progress over time.

These reports use data from the American Community Survey (ACS).

Learning Objectives

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science Learning Objectives:

  1. How to import text from PDF files using images and the magick package
  2. How to apply action verbs in dplyr for data wrangling
  3. How to reshape data by pivoting between “long” and “wide” formats and separating columns into additional columns (tidyr)
  4. How to fill in data based on previous values (tidyr)
  5. How to create data visualizations with ggplot2 that are in a similar style to an existing image
  6. How to add images to plots using cowplot
  7. How to create effective bar plots for multiple comparisons, including adding gaps between bars, adding figure legends to the plot area, and adding comparison lines (ggplot2)

Statistical Learning Objectives:

  1. Implementation of the Mann-Kendall trend test
  2. Interpretation of the Mann-Kendall trend test
  3. Difference between linear regression and Mann-Kendall trend test

Data import

Data is imported from several tables within two PDF documents by taking screenshots of the tables of interest and using the magick package to import the text from the screenshots.
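
As a rough sketch of this import step, assuming the magick and tesseract packages are installed: the example below builds a small stand-in image in memory instead of using a real screenshot (the text in the image is made up for illustration), then extracts the text with image_ocr(). With a real screenshot you would start from image_read() and the path to your own file instead.

```r
library(magick)  # image_ocr() also requires the tesseract package

# Stand-in for a table screenshot: a white canvas with one line of text.
# In practice: img <- image_read("path/to/screenshot.png")  (hypothetical path)
img <- image_blank(width = 600, height = 100, color = "white")
img <- image_annotate(img, "Group A 12.5", size = 40,
                      color = "black", gravity = "center")

# Extract the text from the image as a character string
raw_text <- image_ocr(img)

# Split the string into one element per line of the table
rows <- strsplit(raw_text, "\n")[[1]]
rows
```

OCR output is rarely perfect, so the extracted text usually needs the kind of cleanup described in the data wrangling section.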

Data wrangling

This case study particularly focuses on renaming variables, modifying variables, creating new variables, and modifying the shape of the data using functions such as pivot_longer() and pivot_wider() of the tidyr package, as well as modifying specific variables using the mutate() and across() functions of the dplyr package.
This case study also covers combining data with the bind_rows() function of the dplyr package and the add_row() function of the tibble package.
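
A minimal sketch of these reshaping and combining steps is shown here. The table and its values are made up for illustration and are not taken from the reports; the column names are assumptions as well.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up illustrative values, stored as text the way OCR output would be
wide <- tibble(
  group  = c("Group A", "Group B"),
  `2017` = c("10.1", "12.3"),
  `2018` = c("9.8", "12.9")
)

# Wide to long: one row per group and year
long <- wide %>%
  pivot_longer(cols = -group, names_to = "year", values_to = "percent") %>%
  # Convert several text columns to numbers in one step
  mutate(across(c(year, percent), as.numeric))

# Long back to wide: one column per year
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = percent)

# Stack a second table with the same columns, then append a single row
more <- tibble(group = "Group C", year = 2018, percent = 11.0)
combined <- bind_rows(long, more) %>%
  add_row(group = "Group C", year = 2017, percent = 11.4)

combined
```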

We also cover removing NA values with the drop_na() function, separating one column into multiple columns with the separate() function, filling in NA values based on previous values with the fill() function, and replacing NA values with the replace_na() function, all of the tidyr package. In addition, we cover arranging the levels of factors using the forcats package.
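
The following sketch shows how these functions might fit together on a small table that mimics text extracted from an image. All values and column names are made up for illustration, and treating a missing gender as "All" is an assumption of this example.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Made-up rows: the race label appears only on the first row of each block
raw <- tibble(
  race   = c("Asian", NA, NA, "Latinx", NA, NA, NA),
  gender = c(NA, "Male", "Female", NA, "Male", "Female", NA),
  value  = c("2017 6.6", "2017 6.5", "2017 6.7",
             "2017 13.2", "2017 12.5", "2017 13.9", NA)
)

clean <- raw %>%
  drop_na(value) %>%                             # drop rows with no value
  fill(race) %>%                                 # carry the last seen race downward
  mutate(gender = replace_na(gender, "All")) %>% # label the overall rows
  separate(value, into = c("year", "percent"),   # split one column into two
           sep = " ", convert = TRUE) %>%
  mutate(race = fct_relevel(race, "Latinx"))     # put Latinx first in plots

clean
levels(clean$race)
```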

Finally, this case study also covers many of the stringr functions to manipulate character strings, including str_extract(), str_to_title(), str_replace(), and str_remove().
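
As a small illustration of these string functions, the sketch below cleans one made-up line of text of the kind OCR might produce. The line and the regular expressions are assumptions for this example only.

```r
library(stringr)

# Made-up line of extracted text
line <- "united states 11.2%"

# Pull out the number and convert it from text
value <- as.numeric(str_extract(line, "[0-9]+\\.[0-9]+"))

# Remove the trailing number and percent sign, then capitalize each word
name <- str_to_title(str_remove(line, "\\s[0-9.]+%$"))

# Replace the first match of a pattern with other text
spelled_out <- str_replace(line, "%", " percent")

value
name
spelled_out
```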

Data Visualization

We include an example of creating a plot to match the style of a plot in an existing report. We also demonstrate how to make effective bar plots, covering details such as creating gaps between groups, taking advantage of these gaps to move the legend into the plot area, and using horizontal lines to allow for additional comparisons among groups. Finally, we demonstrate how to add images to plots using the cowplot package and how to combine plots using the patchwork package.
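
The sketch below illustrates a few of these ideas with made-up values: bars narrower than their slot leave a gap between groups, the legend sits inside the plot area, a dashed horizontal line gives a reference for comparison, and patchwork places two plots side by side. This is one possible approach and not necessarily how the case study builds its figures; adding images with cowplot is left out to keep the example self-contained.

```r
library(ggplot2)
library(patchwork)

# Made-up illustrative values, not data from the reports
rates <- data.frame(
  group   = rep(c("Group A", "Group B"), each = 2),
  gender  = rep(c("Female", "Male"), times = 2),
  percent = c(9.5, 10.4, 13.1, 12.2)
)
overall <- 11  # hypothetical reference value for the comparison line

bars <- ggplot(rates, aes(x = group, y = percent, fill = gender)) +
  # Bar widths below 1 leave a visible gap between the two groups
  geom_col(position = position_dodge(width = 0.7), width = 0.6) +
  # Horizontal line for comparing every bar against one value
  geom_hline(yintercept = overall, linetype = "dashed") +
  labs(x = NULL, y = "Disconnection rate (%)", fill = NULL) +
  theme_minimal() +
  # Legend inside the plot area; newer ggplot2 versions prefer
  # legend.position = "inside" with legend.position.inside
  theme(legend.position = c(0.15, 0.9))

# The same data drawn with flipped axes, for a second panel
flipped <- bars + coord_flip() + theme(legend.position = "none")

# patchwork combines plots with the + operator
combined <- bars + flipped
combined
```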

Analysis

The analysis in this case study covers some basics about probability and hypothesis testing, as well as the Mann-Kendall trend test and the difference between this test and simple linear regression. In this analysis we use the Mann-Kendall test to determine whether there has been a trend over time in the disconnection rates of particular groups of youths.
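
To make the contrast concrete, here is a minimal sketch using a made-up yearly series (the values are not from the reports). The Mann-Kendall statistic S only uses the sign of each later-minus-earlier difference, so it tests for a monotonic trend without assuming a straight line, while linear regression estimates a slope under a linearity assumption. The example assumes the Kendall package is installed.

```r
library(Kendall)

# Made-up illustrative rates, one value per year
year <- 2008:2017
rate <- c(15.0, 14.6, 14.8, 14.1, 13.9, 13.5, 13.6, 13.0, 12.7, 12.4)

# Mann-Kendall trend test: tau measures the strength of the monotonic
# trend and sl is the two-sided p-value
mk <- MannKendall(rate)
mk$tau
mk$sl

# The S statistic by hand: for every pair of time points, add +1 if the
# later value is higher and -1 if it is lower
s_stat <- sum(sapply(seq_along(rate), function(i) {
  sum(sign(rate[-seq_len(i)] - rate[i]))
}))
s_stat

# Simple linear regression for comparison: estimates change per year
fit <- lm(rate ~ year)
coef(fit)[["year"]]
```

A negative S (and tau) indicates that later values tend to be lower than earlier ones, which in this made-up series agrees with the negative regression slope.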

Other notes and resources

RStudio
Cheatsheet on RStudio IDE
Other RStudio cheatsheets
Tidyverse

Response bias
Cross-sectional data
Population
Sample
Sampling methods
Inference

American Community Survey (ACS)

See here for more detailed information about the survey
Measure of America
Social Science Research Council

Piping in R
Writing functions
Also see this case study for more information on writing functions.
String manipulation cheatsheet
Table formats

Regression
simple linear regression
monotonic association
Kendall rank correlation coefficient
Null hypothesis
Alternative hypothesis
Probability
one-sided and two-sided hypotheses
Nonparametric
Parametric
Significance threshold
Z score
Z score table
Z score to p-value calculator

ggplot2 package
Please see this case study for more details on using ggplot2
grammar of graphics
ggplot2 themes
directlabels package methods
Hmong people
Intersections

Motivating article for this case study about youth disconnection/opportunity youth

To learn more about importing and wrangling PDFs using the pdftools package see this case study.

To learn more about what you can do with the magick package see this vignette.

To learn more about the Mann-Kendall trend test see here and here.

To learn more about hypothesis testing, see this case study.

Packages used in this case study:

| Package | Use in this case study |
|---|---|
| here | to easily load and save data |
| pdftools | to import PDF documents |
| magick | to import images and extract text from images |
| tesseract | to extract text from images with magick |
| knitr | to show images in reports |
| dplyr | to filter, subset, join, add rows to, and modify the data |
| stringr | to manipulate strings |
| magrittr | to pipe sequential commands |
| tidyr | to change the shape or format of tibbles to wide and long, to drop rows with NA values, to separate a column into additional columns, and to fill out values based on previous values |
| tibble | to create tibbles |
| ggplot2 | to create plots |
| directlabels | to add labels directly to lines in plots |
| cowplot | to add images to plots |
| forcats | to reorder factor levels for plots |
| Kendall | to implement the Mann-Kendall trend test in R |
| patchwork | to combine plots |
| DT | to create interactive tables |

For instructors

Instructors can start at the Data Visualization or Data Analysis sections. Instructors can also skip the Subgroup plots section if they don’t wish to instruct students about making bar plots in depth.

Target audience

This case study is appropriate for those new to R programming and new to statistics. It is also appropriate for more advanced R users who are new to the Tidyverse. This particular case study may require some fundamental knowledge of statistics.

Suggested homework

  • Modify the Asian and Latinx subgroup bar plots, which compare years, so that they compare genders instead.
  • Taking the plot you made above, modify the plot to facet across years.
  • Find another table in one of the reports to import using the magick package (for example perhaps the data about different states over time in the 2019 report called Making the Connection). Look for differences between groups by plotting the data and evaluating with the Mann-Kendall test.

Estimate of RMarkdown Compilation Time:

About 36 to 46 seconds

This compilation time was measured on a PC running Windows 10. This range should only be used as an estimate, as compilation time will vary with different machines and operating systems.