---
title: "Tidy Data: Why and How"
author: "Ted Laderas"
date: "5/24/2017"
output:
  html_document:
    code_download: true
    code_folding: hide
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```
## What is Tidy Data?
- each row corresponds to an observation
- each variable is a column
- each type of observation is in a different table
![](figure/tidy-1.png)
## Why Tidy Data?
Tidy data enables us to do lots of things!
1) Great ggplots
2) Summarize/slice the data in multiple ways
3) Enable Exploratory Data Analysis
4) Ensure assumptions are met for methods
5) Enable Confirmatory Data Analysis
## Beware of columns masquerading as variables!
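In the fertility data loaded below, every year has its own column. Those columns are not separate variables: they are categories (values) of a single variable, the year.
Ask yourself: do these columns go together as a single observation for your analysis?
Also ask yourself: what is the unit of observation?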
```{r}
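library(tidyr)
#check.names = FALSE keeps the column names exactly as they appear in the file
#(by default, read.csv() adds an `X` prefix to names that start with a digit)
fertilityData <- read.csv("data/total_fertility.csv", check.names = FALSE)
#look at the first 10 rows and the first 10 columns
fertilityData[1:10, 1:10]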
```
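
To see what `check.names = FALSE` changes, here is a minimal sketch on a tiny made-up CSV string. The two countries and their values are invented for illustration and are not taken from `total_fertility.csv`.

```{r}
# Hypothetical CSV text with year-like column names (values are made up)
csvText <- "Total fertility rate,1800,1801\nA,7.0,6.9\nB,5.5,5.4"

# Default behaviour: names are made syntactically valid
defaultNames <- names(read.csv(text = csvText))

# With check.names = FALSE: names are kept as written in the file
keptNames <- names(read.csv(text = csvText, check.names = FALSE))

defaultNames
keptNames
```

Keeping the original names means the years can later be used as values of a `Year` column without stripping a prefix first.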
## Making data tidy: `gather()`
Use `gather()` when you need to collapse a bunch of columns into two columns: one holding the old column names (the key) and one holding their values.
```{r}
library(tidyr)
fertilityData <- read.csv("data/total_fertility.csv", check.names = FALSE)
#gather() takes three main arguments: data, key, and value
#key is what you want your new categorical column to be named
#value is the name of the new column that holds the actual values
#We don't want the `Total fertility rate` column to be included as part of the
#gather() operation, so we use the `-` notation to exclude it.
fertilityDataTidy <-
  gather(fertilityData, "Year", "fertilityRate", -`Total fertility rate`) %>%
  select(Country = `Total fertility rate`, Year, fertilityRate) %>%
  #remove NA values (there are countries that have no information)
  filter(!is.na(fertilityRate))
fertilityDataTidy[1:10,]
```
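
In current versions of `tidyr`, `gather()` is superseded by `pivot_longer()`. The sketch below shows what the same reshaping might look like with `pivot_longer()`, run on a small made-up data frame (`toyFertilityWide`, a hypothetical stand-in for `fertilityData` with invented values).

```{r}
library(dplyr)
library(tidyr)

# Hypothetical stand-in for fertilityData; countries and values are made up
toyFertilityWide <- data.frame(
  `Total fertility rate` = c("A", "B"),
  `1800` = c(7.0, NA),
  `1801` = c(6.9, 5.4),
  check.names = FALSE
)

toyFertilityLong <- toyFertilityWide %>%
  pivot_longer(
    cols = -`Total fertility rate`,  # reshape every column except this one
    names_to = "Year",               # old column names go here
    values_to = "fertilityRate",     # cell values go here
    values_drop_na = TRUE            # drop rows with no information
  ) %>%
  rename(Country = `Total fertility rate`) %>%
  # column names arrive as character, so convert the years to integers
  mutate(Year = as.integer(Year))

toyFertilityLong
```

Both functions produce the same long layout; `pivot_longer()` simply names its arguments more explicitly.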
## Your Task: using tidy data
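How would we find the average fertility rate within each year? How about only from 1860 on? Hint: `gather()` stores `Year` as character, so convert it to a number before comparing it to 1860.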
```{r}
```
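
If you get stuck, here is one possible approach, sketched on a small made-up data frame (`toyYearly`, with invented values) rather than on the real data. It assumes the same three columns as `fertilityDataTidy`.

```{r}
library(dplyr)

# Hypothetical tidy data with the same columns as fertilityDataTidy
toyYearly <- data.frame(
  Country = c("A", "B", "A", "B"),
  Year = c("1859", "1859", "1860", "1860"),
  fertilityRate = c(7, 5, 6, 4),
  stringsAsFactors = FALSE
)

# Average across countries within each year
toyYearlyMean <- toyYearly %>%
  mutate(Year = as.numeric(Year)) %>%
  group_by(Year) %>%
  summarize(meanFertility = mean(fertilityRate))

# Keep only the years from 1860 on
toyFrom1860 <- toyYearlyMean %>%
  filter(Year >= 1860)

toyYearlyMean
toyFrom1860
```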
## Making one column into many: `spread()`
Sometimes you will need to go the other direction: take a long-format dataset and turn it into a wider, more matrix-like format. This is necessary for functions such as `heatmap()`, which expect a numeric matrix.
Let's change things around and turn the values of the `Country` column into the variables (columns) of the dataset.
```{r}
fertilityCountryColumns <- fertilityDataTidy %>%
  #spread takes a key (Country) and value (fertilityRate) argument
  spread(Country, fertilityRate)
fertilityCountryColumns[1:10, 1:10]
```
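
As with `gather()`, `spread()` is superseded in current `tidyr` by `pivot_wider()`. The sketch below uses a made-up long data frame (`toyLong`, invented values) to show the equivalent call, followed by one way the result could be turned into a numeric matrix of the kind `heatmap()` expects.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical long-format data; countries and values are made up
toyLong <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year = c(1800, 1801, 1800, 1801),
  fertilityRate = c(7.0, 6.9, 5.5, 5.4)
)

# One column per country, one row per year
toyCountryColumns <- toyLong %>%
  pivot_wider(names_from = Country, values_from = fertilityRate)

# Drop the Year column and keep it as row names instead
toyMatrix <- as.matrix(toyCountryColumns[, -1])
rownames(toyMatrix) <- toyCountryColumns$Year

toyMatrix
# heatmap(toyMatrix) could now be called on this matrix
```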
## Your Task - Who is the most democratic?
Load the `dem_score.csv` dataset in the `data/` folder. Tidy it up. Which countries had the highest democracy score in 2007?
Hint: you'll have to use your `dplyr` skills as well.
```{r}
#enter your answer here
demScore <- read.csv("data/dem_score.csv", check.names = FALSE)
```
## Challenge - if that was too easy...
Take a look at the `who` dataset (it ships with `tidyr`, which is part of the `tidyverse`).
```{r}
data(who)
who
```
## Make it look like this:
```{r}
load("data/who_tidy.rda")
who_tidy
```
## Some Hints on the Challenge
Look at the documentation for `separate()`. You will first have to gather a bunch of the columns into a single column. Then you will have to apply `separate()` twice, with different parameters.
What does each column mean? Here's some info from the data dictionary:
1) The first three letters of entries in the key column indicate whether the cases are new or old cases of TB.
2) The next two or three letters (after the first _) correspond to TB type:
    + `rel` for relapse
    + `ep` for extrapulmonary TB
    + `sn` for smear negative
    + `sp` for smear positive
3) The next letter after the second _ corresponds to the sex of the TB patient.
4) The remaining numbers correspond to age group:
    + `014` for 0 to 14 years
    + `65` for 65 or older
    + etc.
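
To illustrate the two flavours of `separate()` mentioned above, here is a minimal sketch on two made-up keys. The data frame `toyKeys`, its counts, and the new column names are all hypothetical. It is not a full solution: it skips the gathering step and any cleanup the real column names may need.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical keys that follow the pattern described in the data dictionary
toyKeys <- data.frame(
  key = c("new_sp_m014", "new_ep_f65"),
  cases = c(10, 3),
  stringsAsFactors = FALSE
)

toySplit <- toyKeys %>%
  # first pass: split on the underscore character
  separate(key, into = c("new", "type", "sexage"), sep = "_") %>%
  # second pass: split by position, after the first character
  separate(sexage, into = c("sex", "age"), sep = 1)

toySplit
```

A character `sep` splits wherever that pattern occurs, while a numeric `sep` splits at a fixed position, which is why the two calls need different parameters.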
## What's Next?
We've shown you the bare basics of the tidyverse. There's a ton more!
Where to go next?
- <http://tidyverse.org>
- `lubridate` for dealing with dates
- `stringr` for manipulating strings
- `haven`/`readr`/`readxl` for importing data
- Modern Dive (by Chester and Albert Kim): http://moderndive.com
- R for Data Science: http://r4ds.had.co.nz
- [Variety of courses on DataCamp](https://www.datacamp.com/)
## Keep in Touch!
Ted: @tladeras https://laderast.github.io
Chester: @old_man_chester https://ismayc.github.io
## Give us Feedback!
Let us know what you thought of the workshop!
Was it too easy? Too hard?
What else would you like to see? [Add Feedback here](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse/issues/1)