-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlab03.Rmd
268 lines (232 loc) · 10.1 KB
/
lab03.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
---
title: "Shisham Adhikari Lab 3"
output:
pdf_document: default
html_document:
df_print: paged
date: "Math 241, Week 4"
urlcolor: blue
---
```{r setup, include=TRUE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
```{r}
library(tidyverse)
```
```{r}
crash <- read_csv("/home/courses/math241s20/Data/pdx_crash_2018_CRASH.csv")
partic <- read_csv("/home/courses/math241s20/Data/pdx_crash_2018_PARTIC.csv")
polls <- read_csv("/home/courses/math241s20/Data/generic_topline.csv") %>%
select(subgroup, modeldate, dem_estimate, rep_estimate)
```
### Problem 1: Git Control
Done!
### Problem 2: `dplyr` madness
####a.
```{r}
library(xtable)
crash_coll <- crash %>%
count(COLLIS_TYP_SHORT_DESC) %>%
arrange(desc(n))
print(crash_coll)
```
###Answer:
We see from the table that the REAR type of collision was most common and OTH type of collision was least common.
b.
```{r}
weather <- crash %>%
select(COLLIS_TYP_SHORT_DESC,
WTHR_COND_SHORT_DESC) %>%
group_by(WTHR_COND_SHORT_DESC) %>%
count(COLLIS_TYP_SHORT_DESC) %>%
arrange(desc(n)) %>%
ungroup()
weather1 <- weather %>%
filter(COLLIS_TYP_SHORT_DESC == "REAR"|
COLLIS_TYP_SHORT_DESC == "TURN"|
COLLIS_TYP_SHORT_DESC == "ANGL") %>%
group_by(WTHR_COND_SHORT_DESC)%>%
mutate(prop = (n / sum(n)))
print (weather1)
```
c.
```{r}
weekday <- crash %>%
mutate(Wkday_or_Wkend = if_else( CRASH_WK_DAY_CD == 1|
CRASH_WK_DAY_CD == 7, "Weekend", "Weekday")) %>%
select(COLLIS_TYP_SHORT_DESC,
Wkday_or_Wkend) %>%
count(COLLIS_TYP_SHORT_DESC,
Wkday_or_Wkend) %>%
group_by(COLLIS_TYP_SHORT_DESC) %>%
mutate(Comparision = (n / sum(n)))
print(weekday)
```
Analysis: We see that more accidents happen during weekdays in comparision to weekends for each type of collision .
d.
```{r}
pedes <- crash %>%
mutate(ped_involved = if_else( COLLIS_TYP_SHORT_DESC == "PED", TRUE, FALSE)) %>%
group_by(ped_involved) %>%
summarise(n = n()) %>%
mutate (prop = n/sum(n))
print(pedes)
join_df <- left_join(crash, partic, by = c("CRASH_ID" = "CRASH_ID"))
driver_with_ped <- join_df %>%
mutate(if_ped = if_else(COLLIS_TYP_SHORT_DESC=="PED", TRUE, FALSE)) %>%
select(DRVR_LIC_STAT_CD, if_ped) %>%
count(DRVR_LIC_STAT_CD, if_ped) %>%
group_by(DRVR_LIC_STAT_CD) %>%
mutate(Proportion = (n / sum(n)))
print(driver_with_ped)
```
###Analysis:
From table 1, 0.05 proportion of crashes involve pedestrians. Above table shows for each driver license status what proportion of crashes involve pedestrians.Diver License with code 1, which is Valid Oregon license or permit has the highest rate of crashes that involve pedestrians.
e.
```{r}
age_coll <- join_df %>%
select(PARTIC_TYP_SHORT_DESC, AGE_VAL, COLLIS_TYP_SHORT_DESC) %>%
filter(PARTIC_TYP_SHORT_DESC == "DRVR") %>%
mutate(AGE_VAL = as.numeric(AGE_VAL)) %>%
filter(AGE_VAL != 0) %>%
summarize(Median_Age = median(AGE_VAL, na.rm = TRUE),
Average_Age = mean(AGE_VAL, na.rm = TRUE)) %>%
print(age_coll)
```
The average age of drivers(excluding the ones with unknown age) is 40.90184 and median age is 38.The average and median age of drivers by collision type is given by:
```{r}
age_colltype <- join_df %>%
select(PARTIC_TYP_SHORT_DESC, AGE_VAL, COLLIS_TYP_SHORT_DESC) %>%
filter(PARTIC_TYP_SHORT_DESC == "DRVR") %>%
mutate(AGE_VAL = as.numeric(AGE_VAL)) %>%
filter(AGE_VAL != 0) %>%
group_by(COLLIS_TYP_SHORT_DESC) %>%
summarize(Median_Age = median(AGE_VAL, na.rm = TRUE),
Mean_Age = mean(AGE_VAL, na.rm = TRUE))
print(age_colltype)
```
A graph of driver ages is calculated below:
```{r}
age_coll1 <- join_df %>%
select(PARTIC_TYP_SHORT_DESC, AGE_VAL, COLLIS_TYP_SHORT_DESC) %>%
filter(PARTIC_TYP_SHORT_DESC == "DRVR") %>%
mutate(AGE_VAL = as.numeric(AGE_VAL)) %>%
filter(AGE_VAL != 0)
qplot(age_coll1$AGE_VAL,
geom="histogram",
binwidth = 5,
main = "Histogram of driver ages",
xlab = "Age")
```
###Analysis:
We see that 25 seems to be the most common age for drivers. The most drivers are aged in mid-20s to late-30s which makes sense based on our earlier calculation of mean and median. There are very few very old and very yound drivers which also makes sense.
A graph of driver ages by collision type is given below:
```{r}
age_colltype1 <- join_df %>%
select(PARTIC_TYP_SHORT_DESC, AGE_VAL, COLLIS_TYP_SHORT_DESC) %>%
filter(PARTIC_TYP_SHORT_DESC == "DRVR") %>%
mutate(AGE_VAL = as.numeric(AGE_VAL)) %>%
filter(AGE_VAL != 0) %>%
group_by(COLLIS_TYP_SHORT_DESC)
ggplot(age_colltype1, aes(COLLIS_TYP_SHORT_DESC, AGE_VAL)) +
geom_boxplot(fill="lavender")
```
###Analysis:
For almost all collission types, age values looks rightskewed which intituively makes sense. Looking at our three most common types: REAR, ANGL, and TURN, we see some interesting result. REAR looks like is the only collission type with outliers. ANGL and TURN type of collissions have the highest range in the age values, denoted by the length of whiskers in the box plot. Also, our least common category OTH is the most skewed and with the smallest range of all collission types. The median for almost all the collission types from the boxplot is also soemwhere near 38 which matches our calculation above.
### Problem 3:
```{r}
# CDC data
CDC <- read_csv("/home/courses/math241s20/Data/CDC2.csv")
# Regional data
USregions <- read_csv("/home/courses/math241s20/Data/USregions.csv")
```
a.
```{r}
twenty_sixteen <- CDC %>%
filter(YearStart == 2016)
length(unique(twenty_sixteen$Topic))
```
####There are 16 distinct topics that were tracked.
b.
```{r}
influenza <- CDC %>%
filter(YearStart %in% 2010:2016) %>%
filter(DataValueType == "Age-adjusted Prevalence", Question == "Influenza vaccination among noninstitutionalized adults aged >= 18 years") %>%
filter(LocationDesc == "Oregon" | LocationDesc == "United States")
print(influenza)
```
c.
```{r}
influenza1 <- CDC %>%
filter(Topic == "Immunization", LocationAbbr == "US"| LocationAbbr == "PA")
ggplot(influenza1, aes(x=YearStart, y=DataValue, group=LocationDesc)) +
geom_smooth(aes(color=LocationDesc))
```
Comments: Based on the plot, the immunization rates of Pennsylvania was lower than the US till shortly after mid-2014 or closer to 2015 but after that it is higher than the US. It's surprising how there is a sharp decline in the US immunization rates in between 2015 and 2016.Before 2014, the two rates roughly seems to have moved together.
d. Let's see how immunization rates vary by region of the country. Join the regional dataset to our CDC dataset so that we have a column signifying the region of the country.
```{r}
regional <- left_join(CDC, USregions, by = c("LocationDesc" = "State"))
```
e. There are NAs in the region column of the new dataset vecause now we have countries (like US, Puerto Rico), islands (like Virgin Islands), and cities like District of Columbia which aren't states and don't have any assigned regions.
f.
```{r}
influenza2 <- regional %>%
filter(YearStart == "2016", DataValueType == "Age-adjusted Prevalence", Topic == "Immunization") %>%
group_by(LocationDesc) %>%
arrange(desc(DataValue)) %>%
select(LocationDesc, Region, YearStart, DataValue, Topic, DataValueType)
print(influenza2)
```
South Dakota has the highest immunization.
g. Construct a graphic of the 2016 influenza immunization rates by region of the country. Don't include locations without a region. Comment on your graphic.
```{r}
influenza2 <- regional %>%
filter(YearStart == "2016", Topic == "Immunization") %>%
drop_na(Region)
ggplot(influenza2, aes(Region, DataValue)) +
geom_boxplot(fill='lavender', color="black")
```
Comment on the graph: MidWest, South, and West all have rightskewed immunization rates while NorthEast has left skewed immunization rates but NE also has higher value in general and the range of the data is shorter than the rest.
### Problem 4:
###a.
This data is not currently in a tidy format because two variables are stored in a column. It has both party and estimate variables placed in the same column. When column names are not variables but values of a variable, each row represents multiple observations.
```{r}
polls
```
###b.
```{r}
polls2 <- polls %>%
filter(subgroup == "All polls") %>%
rename(Democratic = dem_estimate, Republican = rep_estimate) %>%
pivot_longer(cols = c(Democratic, Republican),
names_to = "Party",
values_to = "Estimate")
print(polls2)
```
###c.
```{r}
democrats <- polls %>%
select(subgroup, modeldate, dem_estimate) %>%
pivot_wider(names_from = modeldate,values_from = dem_estimate)
print(democrats)
```
###d.
Someone might want to transform the data like we did in part because if it is just for democrats estimate it would be easier to compare estimates across different subgroups, especially in between a certain time period.
### Problem 5: YOUR TURN!
Now it is your turn. Pick one (or multiple) of the datasets used on this lab. Ask a question of the data. Do some data wrangling to produce statistics (use at least two wrangling verbs) and a graphic to answer the question. Then comment on any conclusions you can draw about your question.
Dataset Used: Polls
Question: How do democratic and reoublican estimates vary for each subgroups in the polls data?
```{r}
by_subgroup <- polls %>%
rename(Democratic = dem_estimate, Republican = rep_estimate) %>%
pivot_longer(cols = c(Democratic, Republican), names_to = "Party", values_to = "Estimate") %>%
select(subgroup, Party, Estimate)
summary <- by_subgroup %>%
group_by(subgroup, Party) %>%
summarize(Mean_Estimate = mean(Estimate))
print(summary)
ggplot(by_subgroup, aes(x = subgroup, y = Estimate,
color = Party)) +
geom_boxplot()
```
Interpretation: From both the summary table and the box plot, we see that all subgroups have higher estimation for Democrats. Among the three subgroups, adults seems to have the most fluctuating estimates with a wider range. All polls and Voters have very similar estimate data for both parties. It was good to know that at least the estimate favors the Democratic Party, phew; I hope their estimates come true.