---
title: "Open Case Studies: Mental Health of American Youth"
css: style.css
output:
  html_document:
    includes:
      in_header:
        - header.html
        - GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---
<style>
#TOC {
background: url("https://opencasestudies.github.io/img/icon-bahi.png");
background-size: contain;
padding-top: 240px !important;
background-repeat: no-repeat;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
message = FALSE, warning = FALSE, cache = FALSE,
fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(magick) # to create gif
```
#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "mainplot.png"))
```
####
#### {.disclaimer_block}
**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
####
#### {.license_block}
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"} United States License.
####
#### {.reference_block}
To cite this case study please use:
Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie C. (2020). https://github.com/opencasestudies/ocs-bp-youth-mental-health. Mental Health of American Youth.
####
#### {.emphasis_block}
**Please be advised that the material in this case study describes and discusses rates of suicide, as well as rates and symptoms of depression.**
According to the [National Institute of Mental Health (NIMH)](https://www.nimh.nih.gov/health/publications/teen-depression/index.shtml){target="_blank"}:
If you are in crisis and need help, call this toll-free number for the **National Suicide Prevention Lifeline (NSPL)**, available 24 hours a day, every day: **1-800-273-TALK (8255)**. The service is available to everyone. The deaf and hard of hearing can contact the Lifeline via TTY at 1-800-799-4889. All calls are confidential. You can also visit the Lifeline’s website at [www.suicidepreventionlifeline.org](https://www.suicidepreventionlifeline.org){target="_blank"}.
The **Crisis Text Line** is another free, confidential resource available 24 hours a day, seven days a week. Text “HOME” to **741741** and a trained crisis counselor will respond to you with support and information over text message. Visit [www.crisistextline.org](https://www.crisistextline.org){target="_blank"}.
Also see [here](https://www.mhanational.org/depression-teens-0){target="_blank"} for more information about how to recognize and help youths experiencing symptoms of depression.
####
To access the GitHub repository for this case study see here: https://github.com/opencasestudies/ocs-bp-youth-mental-health.
This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).
## **Motivation**
***
Rates of depression appear to have been increasing among American youths since around 2010 according to a recent [report](https://content.apa.org/record/2019-12578-001){target="_blank"}. A [recent study](https://pubmed.ncbi.nlm.nih.gov/24285382/){target="_blank"} also shows that youths appear to be seeking more care from mental health services.
This case study will explore how rates of major depressive episodes have changed since the early 2000s and across different youth subgroups (age, gender, ethnicity).
We will also explore how rates of treatment for depression among youths have changed over time.
```{r,echo = FALSE, out.width="40%"}
knitr::include_graphics(here::here("img", "k-mitch-hodge-IqSaG9zv2e0-unsplash.jpg"))
```
<span>Photo by <a href="https://unsplash.com/@kmitchhodge?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">K. Mitch Hodge</a> on <a href="https://unsplash.com/s/photos/depression?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></span>
The major symptoms of a major depressive episode include:
**S**leep disorder (increased or decreased)
**I**nterest deficit (anhedonia)
**G**uilt (worthlessness, hopelessness, regret)
**E**nergy deficit
**C**oncentration deficit
**A**ppetite disorder (increased or decreased)
**P**sychomotor retardation or agitation
**S**uicidality
##### [[source]](https://www.icsi.org/guideline/depression/diagnose-and-characterize-major-depression-persistent-depressive-disorder-with-clinical-interview/){target="_blank"}
```{r, echo = FALSE, out.width="80%"}
knitr::include_graphics(here::here("img", "depression-symptoms-and-treatment-768x768.jpg"))
```
##### [[source]](https://newmilfordcounselingcenter.com/depression-2/){target="_blank"}
***
<details> <summary> Click here to see the diagnostic requirements for a major depressive episode (MDE) according to the [DSM 5](https://en.wikipedia.org/wiki/DSM-5){target="_blank"}.</summary>
A. Five or more of the following symptoms have been present and documented during the same two-week period and represent a change from previous functioning; at least one of the symptoms is either (1) depressed mood or (2) loss of interest or pleasure.
**Note**: Do not include symptoms that are clearly attributable to another medical condition.
1. Depressed mood most of the day, nearly every day, as indicated by either subjective report (e.g., feels sad, empty, hopeless) or observation made by others (e.g., appears tearful)
2. Markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly every day (as indicated by either subjective account or observation)
3. Significant weight loss when not dieting or weight gain (e.g., a change of more than 5% of body weight in a month), or decrease or increase in appetite nearly every day
4. Insomnia or hypersomnia nearly every day
5. Psychomotor agitation or retardation nearly every day (observable by others, not merely subjective feelings of restlessness or being slowed down)
6. Fatigue or loss of energy nearly every day
7. Feelings of worthlessness or excessive or inappropriate guilt (which may be delusional) nearly every day (not merely self-reproach or guilt about being sick)
8. Diminished ability to think or concentrate, or indecisiveness, nearly every day (either by subjective account or as observed by others)
9. Recurrent thoughts of death (not just fear of dying), recurrent suicidal ideation without a specific plan, or a suicide attempt or a specific plan for committing suicide
B. The symptoms do not meet criteria for a mixed episode.
C. The episode is not attributable to the physiological effects of a substance or to another medical condition.
**Note**: Criteria A-C represent a major depressive episode.
**Note**: Responses to a significant loss (e.g., bereavement, financial ruin, losses from a natural disaster, a serious medical illness or disability) may include feelings of intense sadness, rumination about the loss, insomnia, poor appetite and weight loss noted in Criterion A, which may resemble a depressive episode. Although such symptoms may be understandable or considered appropriate to the loss, the presence of a major depressive episode in addition to the normal response to a significant loss should also be carefully considered. This decision inevitably requires the exercise of clinical judgment based on the individual’s history of and the cultural norms for the expression of distress in the context of loss.
D. The occurrence of the major depressive episode is not better explained by schizoaffective disorder, schizophrenia, schizophreniform disorder, delusional disorder, or other specified and unspecified schizophrenia spectrum and other psychotic disorders.
E. There has never been a manic episode or a hypomanic episode.
Note: This exclusion does not apply if all of the manic-like or hypomanic-like episodes are substance-induced or are attributable to the physiological effects of another medical condition.
#### [[source]](https://www.icsi.org/guideline/depression/diagnose-and-characterize-major-depression-persistent-depressive-disorder-with-clinical-interview/){target="_blank"}
</details>
***
This case study is motivated by the following two articles:
#### {.reference_block}
Twenge JM, Cooper AB, Joiner TE, Duffy ME, Binau SG. Age, period, and cohort trends in mood disorder indicators and suicide-related outcomes in a nationally representative dataset, 2005-2017. *J Abnorm Psychol*.128,3 (2019):185-199. doi:10.1037/abn0000410
Olfson, M., Blanco, C., Wang, S., Laje, G. & Correll, C. U. National Trends in the Mental Health Care of Children, Adolescents, and Adults by Office-Based Physicians. *JAMA Psychiatry*. 71, 81 (2014):81-90. doi: 10.1001/jamapsychiatry.2013.3074.
####
The main findings of the first [article](https://content.apa.org/record/2019-12578-001){target="_blank"} are:
> Rates of major depressive episode in the last year increased 52% 2005–2017 (from 8.7% to 13.2%) among adolescents aged 12 to 17 and 63% 2009–2017 (from 8.1% to 13.2%) among young adults 18–25.
> Serious psychological distress in the last month and suicide-related outcomes (suicidal ideation, plans, attempts, and deaths by suicide) in the last year also increased among young adults 18–25 from 2008–2017 (with a 71% increase in serious psychological distress), with less consistent and weaker increases among adults ages 26 and over.
> Cultural trends contributing to an increase in mood disorders and suicidal thoughts and behaviors since the mid-2000s, including the rise of electronic communication and digital media and declines in sleep duration, may have had a larger impact on younger people, creating a cohort effect.
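
The percentages quoted above (52% and 63%) are relative increases, not differences in percentage points. As a quick check of that arithmetic, the sketch below recomputes them from the start and end rates given in the quote. The helper name `relative_increase` is ours and is not part of the article or of any package.

```{r}
# Relative (percent) change between a starting rate and an ending rate
relative_increase <- function(start, end) {
  100 * (end - start) / start
}

# Rates of major depressive episode taken from the quoted findings
adolescents  <- relative_increase(8.7, 13.2)  # ages 12 to 17, 2005 to 2017
young_adults <- relative_increase(8.1, 13.2)  # ages 18 to 25, 2009 to 2017

round(c(adolescents = adolescents, young_adults = young_adults))
```

In percentage points, the same changes are 4.5 (adolescents) and 5.1 (young adults), which is why it matters which kind of increase is being reported.
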
The main findings of the second [article](https://pubmed.ncbi.nlm.nih.gov/24285382/){target="_blank"} are:
> Compared with adult mental health care, the mental health
care of young people has increased more rapidly...
This means that the number of youths receiving mental health care has increased faster than the number of adults receiving mental health care.
> Between 1995-1998 and 2007-2010, visits resulting in mental disorder diagnoses
... increased significantly faster for youths (from 7.78 to 15.30 visits) than for
adults (from 23.23 to 28.48 visits) (interaction: P < .001).
> Psychiatrist visits also increased
significantly faster for youths (from 2.86 to 5.71 visits).
**Summary**: While depression appears to be on the rise for youths, youths also appear to be seeking more mental health care.
In this case study, we will be using data from the [National Survey on Drug Use and Health (NSDUH)](https://nsduhweb.rti.org/respweb/homepage.cfm){target="_blank"} on rates of treatment and of major depressive episodes (MDE) to explore how trends in mental health have changed over time and how different groups compare.
These data were also used in the first referenced article.
## **Main Questions**
***
#### {.main_question_block}
<b><u> Our main questions: </u></b>
1. How have depression rates in American youth changed since 2004, according to the NSDUH data? How have rates differed between different youth subgroups (age, gender, ethnicity)?
2. Do mental health services appear to be reaching more youths? Again, how have rates differed between different youth subgroups (age, gender, ethnicity)?
####
## **Learning Objectives**
***
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
<u>**Data Science Learning Objectives:**</u>
1. Scrape data directly from a website (`rvest`)
2. Subset and filter data (`dplyr`)
3. Write functions to wrangle data repetitively
4. Work with character strings (`stringr`)
5. Reshape data into different formats (`tidyr`)
6. Data visualizations (`ggplot2`) with labels (`directlabels`) and facets for different groups
7. Combine multiple plots (`cowplot`)
8. Optional: Create an animated gif (`magick`)
<u>**Statistical Learning Objectives:**</u>
1. Discuss the impact of self-reporting bias on survey responses
2. Define and create a contingency table
3. Implement a chi-squared test for independence
4. Interpret the results of a chi-squared test for independence
In this case study, we will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"}, such as [`rvest`](https://github.com/tidyverse/rvest){target="_blank"}. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R more legible and intuitive.
```{r, out.width = "20%", echo = FALSE, fig.align ="center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```
***
We will begin by loading the packages that we will need:
```{r}
library(here)
library(rvest)
library(dplyr)
library(magrittr)
library(stringr)
library(tidyr)
library(tibble)
library(purrr)
library(ggplot2)
library(directlabels)
library(scales)
library(forcats)
library(ggthemes)
library(cowplot)
```
<u>**Packages used in this case study:** </u>
Package | Use in this case study
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data
[rvest](https://github.com/tidyverse/rvest){target="_blank"} | to scrape web pages
[dplyr](https://dplyr.tidyverse.org/){target="_blank"} | to subset and filter the data for specific groups, to replace specific values with `NA`, rename variables, and perform functions on multiple variables
[magrittr](https://magrittr.tidyverse.org/){target="_blank"} | to use and reassign data objects using the `%<>%` pipe operator
[stringr](https://stringr.tidyverse.org/){target="_blank"} | to manipulate strings
[tidyr](https://tidyr.tidyverse.org/){target="_blank"} | to change the shape or format of tibbles to wide and long
[tibble](https://tibble.tidyverse.org/){target="_blank"} | to create tibbles and convert values of a column to row names
[purrr](https://purrr.tidyverse.org/){target="_blank"} | to apply a function to each column of a tibble or each tibble in a list
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} | to create plots
[directlabels](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"} | to add labels directly to lines in plots
[scales](https://cran.r-project.org/web/packages/scales/scales.pdf) | to get the current linetype options
[forcats](https://forcats.tidyverse.org/){target="_blank"} | to reorder factors for plots
[ggthemes](https://cran.r-project.org/web/packages/ggthemes/ggthemes.pdf) | to create a plot to see what the different linetypes look like
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to combine plots together
The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
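
As a small illustration of why `::` can matter, both the `stats` package (attached by default in R) and `dplyr` define a function named `filter()`. The sketch below uses a made-up data frame called `toy`, which is not part of the case study data, and calls the `dplyr` version explicitly.

```{r}
# A toy data frame, made up for this illustration only
toy <- data.frame(group = c("a", "b", "a"), value = c(1, 2, 3))

# `dplyr::filter()` keeps the rows that match a condition.
# Writing the package name avoids any confusion with `stats::filter()`,
# which applies a linear filter to a time series instead.
toy_a <- dplyr::filter(toy, group == "a")
toy_a
```
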
## **Context**
***
To motivate the examination of the mental health of American youths, we begin by exploring the rate of suicide in the United States (US).
According to the CDC, the rate of suicide has increased for both genders.
```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig1.png")
```
##### [[source]](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}
While suicide does appear to be increasing among youths, it also appears to be increasing among most age groups in the US over the past decade and a half for both females and males:
```{r image_grobs, echo = FALSE, fig.show = "hold", out.width = "50%", fig.align = "default"}
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig2.png")
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig3.png")
```
##### [[source]](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}
According to the [CDC](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}:
> Since 2008, suicide has ranked as the 10th leading cause of death for **all ages** in the United States.
```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","mortality.png"))
```
##### [[source]](https://www.cdc.gov/nchs/data/databriefs/db293.pdf){target="_blank"}
Furthermore, according to the [CDC](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}:
>In 2016, suicide became the **second leading cause of death** among youths.
**So although suicide is on the rise for most age groups, suicide is one of the top *two* contributors to death for youths.**
Thus, this warrants further examination of the mental health of American youths.
```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","mortality_age.png"))
```
##### [[source]](https://www.cdc.gov/nchs/data/nvsr/nvsr68/nvsr68_06-508.pdf){target="_blank"}
Historically, suicide rates were much higher before 1950; however, they have been increasing over the last 20 years.
```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","suicide.png"))
```
##### [[source]](https://time.com/5609124/us-suicide-rate-increase/){target="_blank"}
Besides the US, [other countries](https://academic.oup.com/ije/article/48/5/1650/5366210){target="_blank"} are also experiencing increased rates of depression in youths.
See [this report](https://apps.who.int/iris/bitstream/handle/10665/254610/WHO-MSD-MER-2017.2-eng.pdf;jsessionid=E44360055DD83EAC472AA40C2853DBFA?sequence=1){target="_blank"} from the World Health Organization (WHO) about rates of depression in other countries.
See [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3330161/){target="_blank"} for an interesting discussion about what may be causing increased depression rates.
## **Limitations**
***
There are some important considerations regarding this data analysis to keep in mind:
1. The data that we will use come from a survey and are therefore values from a sample that estimate those of the true population. In our statistical analysis, we use these sample values as if they were the population values (because this is all we have access to). Thus, our results are not necessarily indicative of true population differences.
2. Furthermore, the sampling mechanism utilized for the survey can introduce [selection bias](https://en.wikipedia.org/wiki/Selection_bias?oldformat=true){target="_blank"} in cases where the [sampling methods do not produce a representative sample](https://en.wikipedia.org/wiki/Sampling_(statistics)?oldformat=true){target="_blank"}.
3. Data are collected from human participants; this presents the *potential* for information bias, as participants in the [sampling frame](https://en.wikipedia.org/wiki/Sampling_frame?oldformat=true){target="_blank"} may, for a variety of reasons, report inaccurate information.
4. Data about certain group [intersections](https://www.vox.com/the-highlight/2019/5/20/18542843/intersectionality-conservatism-law-race-gender-discrimination){target="_blank"} (for example, individuals of a particular gender and ethnicity), and about some groups in general, such as specific ethnicities, LGBTQIA+ (lesbian, gay, bisexual, transgender, queer or questioning, intersex, and asexual) populations, and non-binary gender populations, are unfortunately not available in the data used in this analysis or in most research on this topic.
Note: While [gender and sex](https://www.who.int/genomics/gender/en/index1.html) are not actually binary, the data used in this analysis unfortunately only contain information for groups of individuals who self-reported as male or female. We also acknowledge that unfortunately not all ethnicities or group intersections are represented in the data either. More research should be devoted to collecting data about the mental health of these groups.
## **What are the data?**
***
We will be using data from the [National Survey on Drug Use and Health (NSDUH)](https://nsduhweb.rti.org/respweb/homepage.cfm){target="_blank"} which is directed by the [Substance Abuse and Mental Health Services Administration (SAMHSA)](https://www.samhsa.gov/){target="_blank"}, an agency in the [U.S. Department of Health and Human Services (DHHS)](https://www.hhs.gov/){target="_blank"}.
This survey started in 1971 and is conducted annually in all 50 states and the District of Columbia. Approximately 70,000 people (ages 12 and up) are interviewed each year about health-related issues. Only civilian, non-institutionalized individuals are included. Households are randomly selected, and a professional interviewer then visits each address and asks one or two of the residents to be interviewed. The interviewer brings a laptop that participants use to fill out the survey, which typically takes an hour to complete. Residents who choose to participate receive $30 in cash. All collected information is confidential and is used for disease surveillance and to guide public policy, particularly policy focused on drug and alcohol use and on mental health. See [here](https://nsduhweb.rti.org/respweb/about_nsduh.html){target="_blank"} for more details about the survey.
The data are made available publicly online on the [Substance Abuse & Mental Health Data Archive](https://datafiles.samhsa.gov/){target="_blank"}.
```{r, out.width = "100%", echo = FALSE, fig.align ="center"}
include_graphics(here::here("img", "nsudh_screenshot_webpage.png"))
```
On the [website](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"} with the survey data, you can see that the results are displayed in many tables. Importantly, there is no obvious way to download the data directly from this particular website.
```{r, out.width = "100%", echo = FALSE, fig.align ="center"}
include_graphics(here::here("img", "website_overview.png"))
```
If you click on the TOC button in the upper left corner, you will be directed to another [website](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetailedTabsTOC2018.htm#toc){target="_blank"}, where a large [pdf document](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetailedTabs2018.pdf){target="_blank"} containing all of the results can be downloaded.
We are interested in investigating how depression rates have changed and how youths are interacting with mental health services. Thus, the following tables are of interest to us:
Table | Details
---|-------------
Table 11.1A | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Numbers in Thousands, 2002-2018
Table 11.1B | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Percentages, 2002-2018
Table 11.2A | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.2B | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2004-2018
Table 11.3A | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2006-2018
Table 11.3B | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2006-2018
Table 11.4A | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.4B | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Percentages, 2004-2018
Our goal is to bring these data into R so we can explore them.
***
<details> <summary>Click here to see how the NSDUH defines a major depressive episode (MDE) </summary>
According to the [NSDUH 2018 report](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHNationalFindingsReport2018/NSDUHNationalFindingsReport2018.pdf){target="_blank"}:
> Respondents were defined as having had an MDE in
the past 12 months if they had at least one period of 2 weeks
or longer in the past year when they experienced a depressed
mood or loss of interest or pleasure in daily activities,
accompanied by problems with sleeping, eating, energy,
concentration, or self-worth. The MDE questions are based
on diagnostic criteria from DSM-5. Some of the wordings
of the depression questions for adolescents aged 12 to 17
and adults aged 18 or older differed slightly to make the
questions more developmentally appropriate for adolescents.
> Adolescents were defined as having an MDE with
severe impairment if their depression caused severe problems
with their ability to do chores at home, do well at work or
school, get along with their family, or have a social life.
</details>
***
## **Data Import**
***
Data are often made available online. Sometimes, the data we are interested in are available for download on a web page as a delimited text file or an Excel file. However, sometimes data are not made available in this manner, as is the case for the [NSDUH survey data](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"}.
How do we proceed in this scenario?
We can manually copy each cell of data; however, this process is often inefficient, subject to error, and not reproducible. For example, if we wanted to rerun the analysis next year on next year's data, we would have to copy every cell by hand again, even if the data were formatted in exactly the same way.
Alternatively, we could use `R` to scrape the data from the web!
Formally, [web scraping](https://en.wikipedia.org/wiki/Web_scraping?oldformat=true){target="_blank"} is the process of extracting data from a webpage. Let's learn how to do this for our case study.
### **Basic steps of web scraping**
***
There are two main steps to web scraping:
1. Identify **location** of data on the webpage that will be scraped.
2. Save the webpage **element** to an R **object**.
We accomplish STEP 1 with our web browser.
We accomplish STEP 2 in the `R` programming environment.
The **location** of the data on the webpage that will be scraped can be identified using a language called [XPath](https://en.wikipedia.org/wiki/XPath), which is short for XML Path Language. It is used to identify pieces (in this case called **nodes**) of a document written in the [XML](https://en.wikipedia.org/wiki/XML) language. XML, which is short for Extensible Markup Language, is frequently used for documents on the internet, similar to [HTML](https://en.wikipedia.org/wiki/HTML). One of the [major differences](https://techdifferences.com/difference-between-xml-and-html.html) between the two is that HTML is mainly designed to describe how content is displayed, while XML is designed to describe the structure of the data itself. Both organize a document as nested elements, and this structural information can be used to parse documents so that we can scrape only the data that we are interested in from a website.
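
To make the two steps concrete, here is a minimal sketch that runs on a tiny, made-up HTML page stored in a string rather than on the NSDUH website. The page content, the `id` values, and the object names (`toy_page`, `toy_xpath`, `toy_node`, `toy_table`) are all assumptions for illustration; the XPath for a real page would come from inspecting it in a web browser.

```{r}
library(rvest)

# A tiny, made-up page with two tables (hypothetical content, not NSDUH data)
toy_page <- rvest::read_html('
<html><body>
  <table id="tabA">
    <tr><th>Year</th><th>Percent</th></tr>
    <tr><td>2004</td><td>1.5</td></tr>
  </table>
  <table id="tabB">
    <tr><th>Year</th><th>Percent</th></tr>
    <tr><td>2005</td><td>2.5</td></tr>
  </table>
</body></html>')

# STEP 1: the XPath expression gives the location of the node we want,
# here the table whose id attribute is "tabB"
toy_xpath <- '//table[@id="tabB"]'

# STEP 2: save that element to an R object, then convert it to a data frame
toy_node  <- rvest::html_node(toy_page, xpath = toy_xpath)
toy_table <- rvest::html_table(toy_node)
toy_table
```

Only the second table is returned, even though the page contains two, because the XPath expression points at a single node.
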
#### {.resource_block}
<u>Additional resources for web scraping</u>:
- [Vignette](https://rstudio-pubs-static.s3.amazonaws.com/266430_f3fd4660b2744751ab144aa130768a06.html){target="_blank"}
- [Blog](http://blog.corynissen.com/2015/01/using-rvest-to-scrape-html-table.html){target="_blank"}
- [Blog](http://research.libd.org/rstatsclub/post/introduction-to-scraping-and-wranging-tables-from-research-articles/#.Xw878ZNKhQJ){target="_blank"}
- [Selectorgadget Tool](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html){target="_blank"}
####
### **The `rvest` package**
***
In this case study, we will scrape data from the tables on the [NSDUH survey](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"} website.
Note that these data are available in a large PDF with all the results by year if you wish to use the data from this particular source.
One option would be to import the data from the PDF. However, this PDF is not easy to find, and it would be difficult and time-consuming to locate our tables of interest and extract the data from them. If one wanted to do this, say if the tables were not available online, they could use the `pdftools` package. See this other [case study](https://www.opencasestudies.org/ocs-bp-diet/) and this other [case study](https://www.opencasestudies.org/ocs-bp-youth-disconnection/) for two methods to work with PDFs.
Another option could be to copy and paste the data from the website to another file that we would also need to import. But this would not be as efficient or reproducible and might result in errors.
Alternatively, we will use the `rvest` package to [scrape](https://en.wikipedia.org/wiki/Web_scraping?oldformat=true){target="_blank"} the data directly from the tables on the website.
Assuming next year's data are displayed in a similar manner, we could simply update the URL in our code to run the same analysis on the new data.
However, it is important to keep in mind that one downside of scraping data directly from the web is that the website could change. This can be a good thing if the website adds additional data and keeps the same formatting, because we would get the additional data very easily. However, if the website changes its formatting, we would need to update our code.
### **Scraping tables into R**
***
The two web scraping steps for these tables can be broken down even further:
1. Identify location of data that will be scraped
    + right-click to inspect element (webpage)
    + hover pointer over components of element (webpage) until the data has been found
    + copy XPath of data sought
2. Save webpage element to an object in R
    + import HTML code for the webpage
    + extract pieces of HTML documents (webpage) using XPath
    + parse the extracted data into a data frame
Below is an animated overview of the process.
<details><summary> Click here if you want to see how this animation was made!</summary>
First the images need to be imported into R using the `image_read` function of the `magick` package.
```{r}
step1 <- magick::image_read(here::here("img", "webpage_screenshot.png"))
step2 <- image_read(here::here("img", "table_screenshot_inspect.png"))
step3 <- image_read(here::here("img", "table_screenshot_inspect_table.png"))
step4 <- image_read(here::here("img", "table_screenshot_inspect_table_xpath.png"))
step5 <- image_read(here::here("img", "table_screenshot_xpath_copy_r.png"))
step5_zoom <- image_read(here::here("img", "table_screenshot_xpath_copy_r_zoom.png"))
```
The last image is smaller than the others. To get a sense of the size, we can use the `image_info()` function of the `magick` package.
```{r}
step5
step5_zoom
image_info(step5)
image_info(step5_zoom)
```
First let's resize the zoomed image (`step5_zoom`) to make it a bit larger using the `image_resize()` function of the `magick` package. We will set the width to match the width of the previous image and keep the aspect ratio for the height by using "1440x". If we wanted to set only the height instead, we would use "x900".
```{r}
step5_zoom <- image_resize(step5_zoom, "1440x")
step5_zoom
```
We can add a white border around the last image to make its height more similar to the others using the `image_border()` function of the `magick` package. There are many image modification functions in the `magick` package! See [here](https://cran.r-project.org/web/packages/magick/vignettes/intro.html){target="_blank"} to learn more.
```{r}
step5_zoom <- image_border(step5_zoom, "white", "2x334")
step5_zoom
```
Looks good!
Now we will make the sequence of images for our animation. We also want to indicate how long we want to spend on each relative to the others. We want to linger on the last image so we include it two times.
```{r}
img <- c(step1,
         step2,
         step3,
         step4,
         step5,
         step5_zoom,
         step5_zoom)
```
Now, we are ready to create our gif! But first we want to modify our images a bit more.
First we want to make all images within `img` exactly the same size using the `image_resize()` function. To do this for all images we can add `!` at the end of the size, which forces the exact width and height and ignores the aspect ratio.
```{r}
image_info(img)
img <- image_resize(img, "1440x900!")
image_info(img)
```
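The three size formats used so far can be compared side by side. The sketch below is only an illustration: it assumes the `magick` package is installed and uses a blank 640 x 480 canvas (`toy_img`, a made-up object) rather than our screenshots.

```{r}
# A blank 640 x 480 canvas, used only for illustration
toy_img <- magick::image_blank(width = 640, height = 480, color = "white")

# "320x": set the width to 320 and scale the height to keep the aspect ratio
magick::image_info(magick::image_resize(toy_img, "320x"))

# "x120": set the height to 120 and scale the width to keep the aspect ratio
magick::image_info(magick::image_resize(toy_img, "x120"))

# "100x100!": force exactly 100 x 100, ignoring the aspect ratio
magick::image_info(magick::image_resize(toy_img, "100x100!"))
```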
We also want to morph, or blend, each image into the next so that the transitions appear smooth. The `image_morph()` function lets us specify how many frames to include in each morph, which speeds up or slows down the blend from one image to another. We will use 4 frames.
To make the final animation we use the `image_animate()` function.
Importantly, we want to pause on each frame for 70/100 of a second (0.7 seconds) to give people a chance to see what is happening. We do this with the `delay` argument, which is given in hundredths of a second. The `optimize` argument of this function requires that all images are the same size (luckily we did this!) and it causes only the differences between frames to be stored.
```{r}
educational_gif <-
  image_morph(img, frames = 4) %>%
  image_animate(delay = 70,
                optimize = TRUE)
```
Now to save our gif we can use the `image_write()` function of the `magick` package and the `here()` function of the `here` package to easily save it in a directory called `img` within the directory that contains our .Rproj file. We will name the file educational.gif.
```{r, eval = FALSE}
image_write(educational_gif,
            here::here("img", "educational.gif"))
```
</details>
```{r, echo = FALSE}
knitr::include_graphics(here::here("img","educational.gif"))
```
***
Now let's go through each step together:
#### 1) Identify location of data that will be scraped
First, let's go to the [web page](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"} with all the tables we are interested in scraping.
```{r, step1, echo=FALSE}
step1
```
Once on the webpage, there are not any visible options to download the data.
Right-click and select "Inspect".
```{r, step2, echo=FALSE}
step2
```
A window opens.
This window allows us to glance at the internal mechanics of the webpage. To scrape the data from the webpage, we first need to learn a little bit about the components that the page is built from.
Hovering our mouse over the elements of the webpage highlights the respective section of the webpage it represents.
By hovering over several elements—and clicking on the elements on the right side of the screen—we can identify the element that contains the data we are looking for.
Another option for locating elements is to use the [selectorgadget tool](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html){target="_blank"}, which interactively builds a CSS selector for the elements you click on.
```{r,step3, echo=FALSE}
step3
```
Right-click on the element and copy the XPath. We will need this XPath for the next step.
```{r, step4, echo=FALSE}
step4
```
Now we can return to the `R` programming environment.
```{r, step5, echo=FALSE}
step5
```
***
#### 2) Save webpage element to an object in R
For the first table we want to scrape, the XPath is `/html/body/div[4]/div[1]/table`. We use this XPath with functions from the `rvest` package to scrape the data from this table.
```{r,step5_zoom, echo=FALSE}
step5_zoom
```
Let's explore this step in greater detail:
We need to:
+ import html code for the webpage
+ extract pieces (table) out of HTML documents (webpage) using XPath
+ parse the html table into a data frame
To do this:
+ We import the HTML code using the `read_html()` function, which the `rvest` package re-exports from the `xml2` package
+ We extract specific components of the webpage using the `html_nodes()` function of the `rvest` package
+ We convert this HTML table into a data frame using the `html_table()` function of the `rvest` package
**The `rvest` package provides wrappers around the `xml2` and `httr` packages. Installing `rvest` therefore also installs these dependencies, and loading `rvest` gives us access to the functions we need from them, such as `read_html()`.**
In fact, when we load `rvest` the first time we see:
```{r, out.width= "60%", echo=FALSE}
knitr::include_graphics(here::here("img", "rvest.png"))
```
In this case, we are scraping Table 11.1A from the website.
First, we assign the URL to a character string to use within the `read_html()` function in the `xml2` package.
```{r}
NSDUH_url <- "https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm"
```
One could also pass the URL directly to `read_html()`, but storing it in an object is more convenient for piping.
***
<details> <summary>Click here if you are unfamiliar with piping in R, which uses this `%>%` operator</summary>
By [piping](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} we mean using the `%>%` pipe operator which is accessible after loading the `tidyverse` or several of the packages within the tidyverse like `dplyr` because they load the [`magrittr` package](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}.
This allows us to perform multiple sequential steps on one data input.
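As a small illustration (with made-up numbers and object names), the two versions below compute the same value. The sketch assumes the `magrittr` package is installed.

```{r}
library(magrittr)  # provides the %>% operator

toy_numbers <- c(1, 4, 9)

# Nested function calls are read from the inside out
toy_nested <- sum(sqrt(toy_numbers))

# The same steps written with the pipe are read from top to bottom
toy_piped <- toy_numbers %>%
  sqrt() %>%
  sum()

toy_nested
toy_piped
```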
</details>
***
Then, the `read_html()` function allows us to save the html document for the webpage inside R.
```{r}
webpage <- NSDUH_url %>%
  xml2::read_html()
webpage
```
Then, we use the `html_nodes()` function of the `rvest` package to select just the Table 11.1A element of the webpage.
The `html_nodes()` function can select elements using either a CSS selector or an XPath (supplied through the `xpath` argument, as we do here). See this [tutorial](http://flukeout.github.io/#){target="_blank"} (and the [answers](https://gist.github.com/chrisman/fcb0a88459cd98239dbe6d2d200b02d1){target="_blank"} in case you get stuck) on CSS selectors to understand more about how elements of interest can be selected from a webpage.
```{r}
webpage_element <- webpage %>%
  rvest::html_nodes(xpath = '/html/body/div[4]/div[1]/table')
webpage_element
```
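To see how an XPath and a CSS selector can point to the same element, here is a minimal sketch on a made-up HTML snippet (not the NSDUH page). The class names `intro` and `results` and all `toy_` objects are invented for illustration; the sketch assumes the `xml2` and `rvest` packages are installed.

```{r}
# A tiny, made-up page with two <div> elements
toy_page <- xml2::read_html(
  "<html><body>
     <div class='intro'><p>Some text</p></div>
     <div class='results'><table><tr><th>Year</th></tr><tr><td>2018</td></tr></table></div>
   </body></html>"
)

# Select the table by its position in the document tree (XPath)
toy_by_xpath <- rvest::html_nodes(toy_page, xpath = "/html/body/div[2]/table")

# Select the same table by the class of its parent <div> (CSS selector)
toy_by_css <- rvest::html_nodes(toy_page, css = "div.results > table")

# Both approaches find exactly one node
length(toy_by_xpath)
length(toy_by_css)
```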
Finally, the `html_table()` function of the `rvest` package parses the html object into a data frame. We can use the `glimpse()` function of the `dplyr` package to take a look at the data. This function shows data frames sideways with the columns listed on the far left.
```{r}
table11.1a <- webpage_element %>%
  rvest::html_table()
print(table11.1a, max = 2)
glimpse(table11.1a)
```
We can see that the output is a list with one element. To extract the data from the list, we will use double brackets `[[]]` to select the first element of the list.
```{r}
table11.1a <- table11.1a[[1]]
```
Putting all of this together, we can do the entire process like this with our pipe operator `%>%`.
```{r}
table11.1a <- NSDUH_url %>%
  xml2::read_html() %>%
  rvest::html_nodes(xpath = '/html/body/div[4]/div[1]/table') %>%
  rvest::html_table()
table11.1a <- table11.1a[[1]]
```
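The same three steps can be tried without an internet connection. The sketch below is only an illustration: it runs the pipeline on a small, made-up HTML table (the values are invented and are not NSDUH estimates) and assumes the `magrittr`, `xml2`, and `rvest` packages are installed.

```{r}
library(magrittr)  # provides the %>% operator

toy_table_list <- "<html><body><div><table>
    <tr><th>Year</th><th>Percent</th></tr>
    <tr><td>2017</td><td>4.1</td></tr>
    <tr><td>2018</td><td>4.3</td></tr>
  </table></div></body></html>" %>%
  xml2::read_html() %>%                                       # import the HTML
  rvest::html_nodes(xpath = "/html/body/div[1]/table") %>%    # select the table
  rvest::html_table()                                         # parse the table

# html_table() returns a list with one element per selected table
class(toy_table_list)
length(toy_table_list)

# Double brackets extract the data frame itself
toy_table <- toy_table_list[[1]]
toy_table
```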
Now, we need to repeat the above process for the other tables we are interested in.
### **Writing a function to scrape multiple tables**
***
One option is to copy and paste code we wrote above.
However, this is not very efficient and is error prone.
Alternatively, we can create an R function to accomplish this succinctly.
Functions allow us to perform the same process on multiple data inputs.
See [this other case study](https://opencasestudies.github.io/ocs-bloomberg-vaping-case-study/){target="_blank"} for more details about how to write a function.
In general, the process of writing functions involves first specifying an input that is used within the function to create an output. In this case, `XPATH` will be used as an "input argument" to the function, which will be replaced by an actual XPath and then used in the subsequent steps to scrape the data from each table that an XPath is supplied for.
We will call this function `scraper`.
```{r}
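scraper <- function(XPATH){
  # URL of the webpage that contains all of the tables
  NSDUH_url <- "https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm"
  # Import the page, select the table at XPATH, and parse it
  table <- NSDUH_url %>%
    read_html() %>%
    html_nodes(xpath = XPATH) %>%
    html_table()
  # html_table() returns a list, so return its first (and only) element
  output <- table[[1]]
  output
}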
```
Now we can apply the function we created to each of the XPaths for each of the tables on the website that we would like to use in our data analysis.
```{r}
table11.1b <- scraper(XPATH = "/html/body/div[4]/div[2]/table")
table11.2a <- scraper(XPATH = '/html/body/div[4]/div[3]/table')
table11.2b <- scraper(XPATH = '/html/body/div[4]/div[4]/table')
table11.3a <- scraper(XPATH = '/html/body/div[4]/div[5]/table')
table11.3b <- scraper(XPATH = '/html/body/div[4]/div[6]/table')
table11.4a <- scraper(XPATH = '/html/body/div[4]/div[7]/table')
table11.4b <- scraper(XPATH = '/html/body/div[4]/div[8]/table')
```
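Because the XPaths above differ only in the index of the inner `div`, they could also be generated rather than typed out. The sketch below is one possible approach, not the method used in this case study: it builds the seven XPaths with `sprintf()` and names them. The object names `toy_table_names` and `toy_table_xpaths` are made up, and the final (commented) line assumes the `purrr` package and the `scraper()` function defined above.

```{r}
# Names for the seven remaining tables, in the order they appear on the page
toy_table_names <- c("table11.1b", "table11.2a", "table11.2b", "table11.3a",
                     "table11.3b", "table11.4a", "table11.4b")

# The inner div index runs from 2 to 8 for these tables
toy_table_xpaths <- sprintf("/html/body/div[4]/div[%d]/table", 2:8)
names(toy_table_xpaths) <- toy_table_names
toy_table_xpaths

# With internet access, all tables could then be scraped into one named list:
# scraped_tables <- purrr::map(toy_table_xpaths, scraper)
```

Keeping the tables as separate, explicitly named objects, as done above, is easier to follow for a small number of tables. A named list becomes more convenient as the number of tables grows.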
Great! We have successfully scraped the data from the web into R!
Next, we need to wrangle the data.
We will save the data now in case the website gets removed in the future. To do this we will use the base `save()` function.
```{r, eval = FALSE}
save(table11.1a, table11.1b, table11.2a, table11.2b,
     table11.3a, table11.3b, table11.4a, table11.4b,
     file = here::here("data", "raw_data.rda"))
```
### **Exercise**
***
<!---MH_DI_Quiz-->
<iframe style="margin:0 auto; min-width: 100%;" id="MH_DI_QuizIframe" class="interactive" src="https://rsconnect.biostat.jhsph.edu/OCS_MH_DI_Quiz/" scrolling="no" frameborder="no"></iframe>
<!---------------->
## **Data Wrangling**
***
```{r, echo = FALSE}
# for instructors that wish to start at this section, they can just load the imported data.
load(file = here::here("data", "raw_data.rda"))
```
Now that we have imported the data, let's see if we can wrangle a table.
What do we mean by "wrangle"? We mean that we intend to get the data into what is called "tidy" format.
This means that the data:
1) has each observation in a single row
2) has a single aspect of each observation in a single column
3) is rectangular (meaning there are no empty cells)
4) has values within the cells in a format that is useful for making visualizations and for analysis
Check out this [paper](https://vita.had.co.nz/papers/tidy-data.pdf) for more information about tidy data.
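As a small illustration of what "tidy" means, the sketch below reshapes a made-up table in which the years are spread across columns into one row per observation. The values and the `toy_` object names are invented, and the use of `pivot_longer()` from the `tidyr` package is just one way to do this; it assumes the `tibble` and `tidyr` packages are installed.

```{r}
# Made-up "wide" table: one column per year
toy_wide <- tibble::tibble(
  setting = c("Inpatient", "Outpatient"),
  `2017` = c(1.0, 13.5),
  `2018` = c(1.1, 14.0)
)

# Tidy version: one row per setting and year, one column per aspect
toy_tidy <- tidyr::pivot_longer(toy_wide,
                                cols = c(`2017`, `2018`),
                                names_to = "year",
                                values_to = "percent")
toy_tidy
```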
Since the data appear to be formatted in a similar way in each of the tables, it is likely that whatever steps we take to wrangle this first table will also be necessary in the wrangling of subsequent tables.
This is because well-maintained data sources often format different datasets similarly.
We can take advantage of this similarity to speed up the wrangling process.
### **Table11.1a**
***
Let's take a look at our table on the website to see what it needs to get it into "tidy" format.
First, we can see that we want to remove the legend of our table.
```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "table11.1a.png"))
```
We can take a look at the last row using the `tail()` function, which is part of base R (the `utils` package). We can specify that we only want to see the last row by using the `n = 1` argument. We will first make this table into a tibble, which is the `tidyverse` version of a data frame and prints more compactly.
Currently `table11.1a` is a typical data frame, which we can confirm using the `class()` function, which reports the class of an object in R.
```{r}
class(table11.1a)
```
We can use the `as_tibble()` function, which the `dplyr` package re-exports from the `tibble` package, to convert `table11.1a` into a tibble.
```{r}
table11.1a %>%
  dplyr::as_tibble() %>%
  tail(n = 1)
```
We can see that the legend is repeated for every column.
Now, let's take a look at the year 2004 column.
```{r}
table11.1a %>%
  dplyr::as_tibble() %>%
  dplyr::select(`2004`) %>%
  tail(n = 1)
```
Let's save this to an object called `legend` so that we can refer back to it later:
```{r}
legend <- table11.1a %>%
  as_tibble() %>%
  select(`2004`) %>%
  tail(n = 1)
```
Another way to look at the last row is to use the `n()` function of the `dplyr` package.
This function can be used inside other `dplyr` functions, and it returns the number of observations in the current group (here, the number of rows in the whole table).
Within the [`slice()` function](https://dplyr.tidyverse.org/reference/slice.html){target="_blank"} of the `dplyr` package, it therefore allows you to refer to the last row of the object.
```{r}
table11.1a %>%
  dplyr::as_tibble() %>%
  dplyr::slice(n())
```
We can use the `slice()` function of the `dplyr` package to remove this row by selecting from the first row (using `1:`) to the second-to-last row (using `n()-1`).
We are also going to use a special pipe operator from the [`magrittr` package](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}, `%<>%`, called the compound assignment pipe operator or sometimes the double pipe operator.
This allows us to use `table11.1a` as our input and reassign the result to `table11.1a` after all the steps have been performed.
We will also first change the data to be a [tibble](https://tibble.tidyverse.org/), which is the `tidyverse` version of a data frame, using the `as_tibble()` function, which is available from both the `dplyr` and `tibble` packages.
```{r}
table11.1a %<>%
  dplyr::as_tibble() %>%
  slice(1:(n()-1))
```
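To see the compound assignment pipe and `slice()` at work on something smaller, here is a minimal sketch with a made-up three-row tibble (`toy_tbl`) whose last row plays the role of the legend. It assumes the `magrittr`, `tibble`, and `dplyr` packages are installed.

```{r}
library(magrittr)  # provides the %<>% operator

# Made-up table whose last row is not data
toy_tbl <- tibble::tibble(value = c("4.1", "4.3", "legend text"))

# Equivalent to: toy_tbl <- toy_tbl %>% dplyr::slice(1:(dplyr::n() - 1))
toy_tbl %<>%
  dplyr::slice(1:(dplyr::n() - 1))

toy_tbl
```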
Now let's take a look at the data:
```{r}
slice_head(table11.1a, n = (length(pull(table11.1a, `2002`))))
```
Great! We can see that the legend is no longer part of the data.
Now let's use the legend to recode the data.
There are many different values used for missing data that we would like to replace with `NA` instead.
We can use the `pull()` function of the `dplyr` package to take a look at the legend data:
```{r}
dplyr::pull(legend, `2004`)
```
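Looks like we want to replace values that are: `*`, `--`, `da`, `nc`, and `nr`. We will also replace empty strings (`""`).
We can use the `na_if()` function of the `dplyr` package to recode these values to `NA`. Since `na_if()` is designed to work on a vector, we apply it to every column using the `mutate()` and `across()` functions of the `dplyr` package.
```{r}
table11.1a %<>%
  # across() applies the na_if() steps to each column in turn
  dplyr::mutate(dplyr::across(dplyr::everything(), ~ .x %>%
                                dplyr::na_if("*") %>%
                                dplyr::na_if("--") %>%
                                dplyr::na_if("da") %>%
                                dplyr::na_if("nc") %>%
                                dplyr::na_if("nr") %>%
                                dplyr::na_if("")))
head(table11.1a)
```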
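The effect of `na_if()` is easiest to see on a single vector. The sketch below uses a made-up character vector (`toy_estimates`) and assumes the `dplyr` package is installed.

```{r}
# Made-up values that mix estimates with two of the legend codes
toy_estimates <- c("4.1", "*", "nc", "13.5")

# Each call replaces one specific value with NA and leaves the rest unchanged
toy_estimates <- dplyr::na_if(toy_estimates, "*")
toy_estimates <- dplyr::na_if(toy_estimates, "nc")
toy_estimates
```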
Let's look at the column names in our table:
```{r}
colnames(table11.1a)
```
Let's rename the first column using the `rename()` function of the `dplyr` package.
This requires listing the new name first like so: `new_name = old_name`.
```{r}
table11.1a %<>%
  dplyr::rename(MHS_setting = `Setting Where Mental Health ServiceWas Received`)
head(table11.1a)
```
Nice!
Now you may notice that the individual values for the year columns have an `"a"` after the numeric value.
According to the legend this indicates if "the difference between this estimate and the 2018 estimate is significant at the $\alpha$=.05 level."
While this is useful information, it makes it difficult to work with our numeric values, so we want to remove them.
Since lower case `"a"` values occur in the values of the `MHS_setting` column (like outp**a**tient), we want to make sure that we don't just remove all `"a"` values from the table, just those in the subsequent columns.
So how can we do this? We can use the `stringr` package to modify character strings.
Also, we can use the `dplyr` functions `mutate()`, `select()` and `across()` to specify which columns we want to change.
Currently all of our data are character strings as indicated by the `<chr>` under the column names.
***
<details> <summary> Click here for an explanation about data classes in R and about character strings. </summary>
There are several [classes of data in R programming](https://en.wikipedia.org/wiki/R_(programming_language)), meaning that certain objects will be treated or interpreted differently.
Character is one of these classes.
A character string is an individual data value made up of characters.
This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter `"a"` or the number `"3"`.
If data are of class character, then any numeric values will not be processed like numbers in a mathematical sense.
If you want your numeric values to be interpreted that way, they need to be converted to a numeric class.
The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).
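A short sketch of these classes, using made-up values and base R only:

```{r}
# The number 3 stored as a character string
toy_chr <- "3"
class(toy_chr)

# Arithmetic such as toy_chr + 1 would fail, so we convert to numeric first
toy_num <- as.numeric(toy_chr)
class(toy_num)
toy_num + 1

# Integers are written with an L suffix; other numbers are stored as doubles
class(3L)
typeof(3.5)
```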
</details>
***
The `stringr` package has functions that allow us to replace (using the `str_replace()` function) or remove (using the `str_remove()` function) characters.
To use these, we need to be able to specify what we want to remove and replace.
Here is a part of a [cheatsheet](https://rstudio.com/resources/cheatsheets/){target="_blank"} about string manipulation from RStudio.
```{r, echo=FALSE}
knitr::include_graphics(here::here("img", "regex.png"))
```
We can see that we can refer to any digit (such as 1, 2, 3 etc.) as `[:digit:]`.
We can also see that we can refer to any punctuation mark as `[:punct:]`.
Finally, we see that spaces and tabs can be referred to as `[:blank:]`.
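To get a feel for these character classes, here is a minimal sketch on made-up strings (the `toy_` objects are invented and are not values from our table). It assumes the `stringr` package is installed and uses `str_extract()` and `str_remove_all()`, the latter being the variant of `str_remove()` that removes every match rather than only the first.

```{r}
toy_strings <- c("4.1a", "13.5")

# Extract the first run of one or more digits from each string
toy_digits <- stringr::str_extract(toy_strings, "[:digit:]+")
toy_digits

# Remove every digit, leaving only the other characters
toy_no_digits <- stringr::str_remove_all(toy_strings, "[:digit:]")
toy_no_digits

# Remove punctuation (here, a comma) and blanks (here, a space)
toy_no_punct <- stringr::str_remove_all("2,345", "[:punct:]")
toy_no_blank <- stringr::str_remove_all("1 234", "[:blank:]")
toy_no_punct
toy_no_blank
```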