title | description |
---|---|
The Whys and Hows of Tidy Data |
Why we need tidy data and using `tidyr` to make messy data tidy |
--- type:NormalExercise lang:r xp:0 skills:1 key:2b27344379
- How to identify tidy data
- How to make messy data tidy using
tidyr
- Long vs. Wide data
- The tidy manifesto
- Loading Data
- Joining tables on a common identifier
After this chapter, you will be able to handle about 90% of your data visualization, data manipulation, and data cleaning problems. Hopefully you'll have gained enough confidence to start going beyond this Bootcamp and learn some more.
*** =instructions Just move on to the next exercise. (CTRL+K)
*** =hint
*** =pre_exercise_code
*** =sample_code
*** =solution
*** =sct
--- type:MultipleChoiceExercise lang:r xp:100 skills:1 key:80e373da6a
Tidy data is defined as data that has the following properties:
- each row corresponds to an observation
- each variable is a column
- each type of observation is in a different table
One thing to keep in mind is when columns are not separate variables. One clue is that the column names could be considered categories of a single variable. This means that the columns themselves could each correspond to a single observation.
Look at the dem_score
dataset in the console by using either head()
or summary()
Would you consider this dataset tidy?
*** =instructions
- Yes, we consider each row to be a separate observation
- No, each column is not a separate observation, but actually multiple observations
*** =hint
*** =pre_exercise_code
dem_score <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_3864/datasets/dem_score.csv")
total_fertility <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_3864/datasets/total_fertility.csv")
*** =sample_code
*** =solution
*** =sct
test_mc(correct = 2)
--- type:NormalExercise lang:r xp:100 skills:1 key:c1c536cd5d
gather()
is a special function in the tidyr
package that takes columns and combines them into a single column.
gather()
has the following syntax:
dem_score %>% gather(key=year, value=score, -country)
The first argument, key
, is the name of our 'gathered' variable. We're smushing all of the columns that have year names and
calling the column names, or the key
the name of year
. The second argument, value
, is the actual observations. Finally, if we don't want a column to be gathered, we can leave it out with the -
notation (here, -country
).
*** =instructions
- Try out the above
gather
statement. - Use a
mutate
expression to remove theX
in front ofyear
. (Look up the documentation forstr_replace
to see how to remove theX
in the mutate statement). - Assign the output to
gatheredData
.
*** =hint
The mutate
expression to remove the X
is mutate(year=str_replace(year, "X", "")
.
*** =pre_exercise_code
dem_score <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_3864/datasets/dem_score.csv")
*** =sample_code
library(dplyr)
library(tidyr)
library(stringr)
gatheredData <-
*** =solution
library(dplyr)
library(tidyr)
library(stringr)
gatheredData <- dem_score %>% gather(key=year, value=score, -country) %>%
mutate(year=str_replace(year, "X",""))
head(gatheredData)
*** =sct
test_object("gatheredData", incorrect_msg = "Not quite. Did you run `gather` and `mutate`?")
--- type:NormalExercise lang:r xp:100 skills:1 key:0c7200c72b
spread()
does the opposite of gather()
. It "unbundles" a column into multiple columns.
This situation can happen because related measurements that consist of an observation are
collected separately, or someone has gather
ed the data a little too enthusiastically.
spread()
has the following syntax:
gatheredData %>% spread(key=country, value=score)
We see that spread()
takes two arguments: the first is the key column, which is the
variable name that contains our column names of interest; the second is the value
column, which is the variable that contains the values we want to fill with.
*** =instructions
- Let's transform our
gatheredData
into a matrix again, but with each column having acountry
. - Assign the output to
spreadData
.
*** =hint
*** =pre_exercise_code
dem_score <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_3864/datasets/dem_score.csv")
library(dplyr)
library(tidyr)
library(stringr)
gatheredData <- dem_score %>% gather(key=year, value=score, -country) %>%
mutate(year=str_replace(year, "X",""))
*** =sample_code
spreadData <-
head(spreadData)
*** =solution
spreadData <- gatheredData %>% spread(key=country, value=score)
head(spreadData)
*** =sct
success_msg("Great! Now you know how to spread!")
test_object("spreadData", incorrect_msg = "Almost, but not quite")
--- type:NormalExercise lang:r xp:100 skills:1 key:fbd075070a
Have you ever had a table which had a column that actually consisted of bunch of other columns together, separated by something? Something like this?
"value1/value2/value3"
separate()
is made to make this one variable into many other variable. Separate takes the following arguments:
health_code_separated <-
health_code_example %>%
separate(col=HealthCodeEncounterCode,
into=c("HealthCode", "EncounterCode"), sep="/")
Here we specify the variable name with col
, the names of the new columns with into
, and
the delimiter, or separator in the variable to separate on, with sep
. Note that sep
can
be any string.
*** =instructions
- Run the above code and
separate
HealthCodeEncounterCode
intoHealthCode
andEncounterCode
forhealth_code_example
, assigning the output tohealth_code_separated
. - Show those patients that had
HealthCode==410
, assigning them topatients410
. - Show
patients410
.
*** =hint
*** =pre_exercise_code
library(tidyr)
library(dplyr)
library(readr)
health_code_example <-
read_csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_3864/datasets/healthExample.csv")
*** =sample_code
health_code_separated <-
patients410 <- health_code_separated %>% filter()
patients410
*** =solution
health_code_separated <-
health_code_example %>%
separate(col=HealthCodeEncounterCode,
into=c("HealthCode", "EncounterCode"),
sep="/")
patients410 <- health_code_separated %>% filter(HealthCode==410)
patients410
*** =sct
test_object("health_code_separated", incorrect_msg="Did you set up your `separate()` statement correctly?")
test_object("patients410", incorrect_msg="Not quite. Make Sure your filter statement was correct")
--- type:NormalExercise lang:r xp:300 skills:1 key:3d8e137017
In addition to tidy data, we can have long data versus wide data. We call a dataset as long data
because the format of the data has many more rows than columns, and we call data wide data,
because it has more columns than rows. You have seen how to transform a wide dataset (dem_score
)
into a long one with gather()
and transform it into a different wide format with (spread
).
In general, I tend to work with long data because this format makes it easeir to aggregate the data for plots when I have a lot of covariates. Let's look at what's possible because the data is in a long format.
Let's practice with another dataset in long format, called fertilityTidy
. You can look at the
original data as fertilityData
. We'll summarize it in two different ways.
*** =instructions
- Look at
fertilityTidy
. Show the average fertility by country to present day by usingdplyr
verbs, calling this variablemeanCountryRate
. - Assign the summarized data to
fertilityMeanByCountry
. - Show
fertilityMeanByCountry
. - Next, show average fertility by
Year
, usinggroup_by/summarize()
assigning the summarized data tofertilityMeanByYear
. - Show
fertilityMeanByYear
.
*** =hint
You'll have to first use a filter()
statement, and then a group_by/summarize
statement.
*** =pre_exercise_code
library(tidyr)
library(dplyr)
library(ggplot2)
fertilityData <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_3864/datasets/total_fertility.csv", check.names = FALSE)
colnames(fertilityData)[1] <- "Total fertility rate"
fertilityTidy <-
gather(fertilityData, "Year", "fertilityRate", -`Total fertility rate`) %>%
select(Country = `Total fertility rate`, Year, fertilityRate) %>%
#remove na values (there are countries that have no information)
filter(!is.na(fertilityRate))
*** =sample_code
fertilityMeanByCountry <- fertilityTidy %>%
#show fertlityMeanByCountry
fertilityMeanByCountry
fertilityMeanByYear <- fertilityTidy %>%
#show fertilityMeanByYear
fertilityMeanByYear
*** =solution
fertilityMeanByCountry <- fertilityTidy %>%
group_by(Country) %>% summarize(meanCountryRate=mean(fertilityRate))
#show fertlityMeanByCountry
fertilityMeanByCountry
fertilityMeanByYear <- fertilityTidy %>%
group_by(Year) %>% summarize(meanYearRate=mean(fertilityRate))
#show fertilityMeanByYear
fertilityMeanByYear
*** =sct
--- type:NormalExercise lang:r xp:500 skills:1 key:651465f93b
Let's put everything we've learned together on a new data.frame called MouseBalanceTimeSeries
.
*** =instructions
- Look at the
MouseBalanceTimeSeries
data.frame
. This is a widedata.frame
where each column corresponds to the time (in seconds) a mouse stayed on a balance beam pre and post treatment. gather()
the measurements into a single column calledtime
with a key calledinterventionStatus
.- Use
separate()
to makemeasurementStatus
into two columns (intervention
andreplicate
) by specifyingsep="Treat"
. - Remove any observations that are
NA
. Assign the output togatheredMouse
. - Finally, make a boxplot of
gatheredMouse
, plottingtime
versusintervention
usinggeom_boxplot
.
*** =hint
*** =pre_exercise_code
load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_3864/datasets/module4.RData"))
MouseBalanceTimeSeries <- data.frame(MouseBalanceTimeSeries)
MouseBalanceTimeSeries <- data.frame(mouseID=rownames(MouseBalanceTimeSeries), MouseBalanceTimeSeries)
colnames(MouseBalanceTimeSeries)[3] <- "PreTreat2"
*** =sample_code
library(tidyr)
library(dplyr)
gatheredMouse <- MouseBalanceTimeSeries %>%
*** =solution
library(tidyr)
library(dplyr)
library(ggplot2)
gatheredMouse <- MouseBalanceTimeSeries %>% gather(key=measurementStatus, value=time, -mouseID) %>%
filter(!is.na(time)) %>% separate(measurementStatus, c("intervention","replicate"), sep="Treat")
ggplot(gatheredMouse, aes(x=intervention, y=time)) + geom_boxplot()
*** =sct
--- type:MultipleChoiceExercise lang:r xp:50 skills:1 key:26c910c309
Was there a difference in mean time spent on the balance beam pre and post treatment?
*** =instructions
- No, the times were too close to tell.
- Yes, the intervals overlapped, but the means were clearly different
*** =hint
*** =pre_exercise_code
load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_3864/datasets/module4.RData"))
MouseBalanceTimeSeries <- data.frame(MouseBalanceTimeSeries)
MouseBalanceTimeSeries <- data.frame(mouseID=rownames(MouseBalanceTimeSeries), MouseBalanceTimeSeries)
colnames(MouseBalanceTimeSeries)[3] <- "PreTreat2"
library(tidyr)
library(dplyr)
library(ggplot2)
gatheredMouse <- MouseBalanceTimeSeries %>% gather(key=measurementStatus, value=time, -mouseID) %>%
filter(!is.na(time)) %>% separate(measurementStatus, c("intervention","replicate"), sep="Treat")
ggplot(gatheredMouse, aes(x=intervention, y=time)) + geom_boxplot()
*** =sct
test_mc(correct=2)
--- type:NormalExercise lang:r xp:50 skills:1 key:5945f330d7
- The tidy manifesto
gather()
spread()
- Joining data
*** =instructions
*** =hint
*** =pre_exercise_code
*** =sct