title | description |
---|---|
Introduction to dplyr |
Learn how to manipulate data into a ggplot2 friendly format |
--- type:NormalExercise lang:r xp:0 skills:1 key:e0b4f34c19
Remember: Please Join our RBootcamp OHSU Group!
We've been looking at datasets that fit the ggplot2
paradigm nicely; however, most data we encounter
is really messy (missing values), or is a completely different format. In this chapter, we'll look
at one of the most powerful tools in the tidyverse
: dplyr
, which lets you manipulate data frames.
There is a function/action for most of the annoying tasks you have to use in data cleaning, which
makes it super useful.
In particular, we're going to look at six fundamental verbs/actions in dplyr
:
filter()
mutate()
group_by()
/summarize()
arrange()
select()
Along the way, we'll do some data manipulation challenges. You'll be a dplyr
expert in no time!
You will want to keep this dplyr
cheat sheet open in a separate window to remind you about the syntax:
dplyr cheat sheet
Also, remember: if you need to know the variables in a data.frame
called biopics
you can always use
colnames(biopics)
If you want more information on a function such as mutate()
, you can always ask for help:
?mutate
*** =instructions Just move on to the next exercise! (CTRL+K)
*** =hint
*** =pre_exercise_code
*** =sample_code
*** =solution
*** =sct
--- type:NormalExercise lang:r xp:100 skills:1 key:0643d8c567
In order to do the following exercises, we need to learn a little bit about how to assign the output of a function to a variable.
For example, we can assign the output of the operation 1 + 2
to a variable called sumOfTwoNumbers
using the <-
operator. This is called the assignment
operator.
You can also use =
to assign a value to a variable, but I find it makes my code a bit confusing, because
there is also ==
, which tests for equality.
sumOfTwoNumbers <- 1 + 2
Once we have something assigned to a variable, we can use it in other expressions:
sumOfThreeNumbers <- sumOfTwoNumbers + 3
This is the bare basics of assignment. We'll use it in the next exercises to evaluate the output
of our dplyr
cleaning.
*** =instructions
- Assign
newValue
the value of10
. - Then use
newValue
to calculate the value ofmultValue
by calculatingnewValue * 5
. - Show
multValue
.
*** =hint
*** =pre_exercise_code
*** =sample_code
##assign newValue
newValue <-
## use newValue to calculate multValue
multValue <-
##show multValue
multValue
*** =solution
##assign newValue
newValue <- 10
## use newValue to calculate multValue
multValue <- newValue * 5
##show multValue
*** =sct
success_msg("Now you know how to assign variables and use them in expressions. Well done!")
test_object("newValue", incorrect_msg = "Not quite. Did you assign `newValue` to 10?")
test_object("multValue", incorrect_msg = "Not quite. Did you use `newValue` in your expression?")
--- type:NormalExercise lang:r xp:100 skills:1 key:cca48c6abd
We're going to use the biopics
dataset in the fivethirtyeight
package to do learn dplyr
. This is a dataset of 761 different biopic movies.
This is how we loaded the data into the session. Note that I've already done this for you, so you don't need to do this.
#load fivethirtyeight package
library(fivethirtyeight)
#access biopics data
data(biopics)
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
*** =instructions
- Run a
summary
on thebiopics
dataset. It's already loaded up for you. - How many categories are in the
country
variable?
*** =hint
Use the levels()
function.
*** =sample_code
##run summary here
##show length of country categories here
*** =solution
##run summary here
summary(biopics)
##length of country category here
length(levels(biopics$country))
*** =sct
success_msg("Nice work!")
test_function("summary", incorrect_msg = "you need to use the summary() function on biopics")
test_function("levels", incorrect_msg = "you need to use the levels() function on biopics$country")
--- type:NormalExercise lang:r xp:100 skills:1 key:1929755973
filter()
is a very useful dplyr
command. It allows you to subset a data.frame
based on variable criteria.
For example, if we wanted to subset biopics
to those movies that were made in the UK
we'd use the following statement:
#subset the data using filter
biopicsUK <- filter(biopics, country=="UK")
#confirm that we have subsetted correctly
biopicsUK
Three things to note here:
- The first argument to
filter()
is the dataset. We'll see another variation of this in a second. - For those who are used to accessing
data.frame
variables by$
, notice we don't have to usebiopics$country
. Instead, we can just use the variable namecountry
. - Our filter statement uses
==
. Remember that==
is an equality test, and=
is to assign something. (confusing the two will happen to you from time to time.)
*** =instructions
- Filter
biopics
so that it only showsCriminal
movies (you'll have to use thetype_of_subject
variable inbiopics
. - Show how many rows are left using
nrow(crimeMovies)
.
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sample_code
#add your filter statement here
crimeMovies <- filter()
#show number of crime movies
*** =solution
#add your filter statement here
crimeMovies <- filter(biopics, type_of_subject == "Criminal")
#show number of crime movies
nrow(crimeMovies)
*** =sct
success_msg("Good job! That's the right number.")
test_function("filter", incorrect_msg = "Did you try using the `filter()` function?")
test_object("crimeMovies", incorrect_msg = "Your filter statement is incorrect. Try again.")
test_function("nrow", incorrect_msg = "You should use `nrow()` to show the number of rows in `crimeMovies`")
--- type:NormalExercise lang:r xp:100 skills:1 key:119cf149a3
Let's look at the following filter()
statement:
filter(biopics, year_release > 1980 &
type_of_subject == "Criminal")
Three things to note:
- We used the comparison operator
>
. The basic comparisons you'll use are>
(greater than),<
(less than),==
(equals to) and!=
(not equal to) - We also chained on another expression,
type_of_subject == "Criminal"
using&
(and). The other chaining operator that you'll use is|
, which corresponds to OR. - Chaining expressions is where
filter()
becomes super powerful. However, it's also the source of headaches, so you will need to carefully test your chain of expressions.
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =instructions
- Add another comparison to the chain using
&
. Useperson_of_color == FALSE
. - Show how many rows are left from your filter statement.
*** =sample_code
#add your comparison to the end of this filter statement
crimeFilms <- filter(biopics, year_release > 1980 &
type_of_subject == "Criminal")
#show number of rows in crimeFilms
*** =solution
#add your comparison to the end of this filter statement
crimeFilms <- filter(biopics, year_release > 1980 &
type_of_subject == "Criminal" & person_of_color == FALSE)
#show number of rows in crimeFilms
nrow(crimeFilms)
*** =sct
success_msg("Good job! Now you know how to chain comparisons!")
test_object("crimeFilms", incorrect_msg = "Not quite. Did you add your comparison correctly?")
test_function("nrow", incorrect_msg = "Show the number of rows in the crimeFilms using `nrow()`")
--- type:MultipleChoiceExercise lang:r xp:50 skills:1 key:e6843338b8
Which statement should be the larger subset? Try them out in the console if you're not sure.
*** =instructions
filter(biopics, year_release > 1980 & type_of_subject == "criminal")
filter(biopics, year_release > 1980 | type_of_subject == "criminal")
*** =hint
Think about the difference between |
(or) and &
(and)
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sct
success_msg("Good Job! Let's move on.")
test_mc(correct = 2, feedback_msgs = c("Nope. This should be the smaller subset (because you're applying both criteria)",
"Correct! Because you accept either criteria, this is the larger subset"))
--- type:NormalExercise lang:r xp:100 skills:1 key:aa9b117998
What if you wanted to select for multiple values? You can use the %in%
operator. Here we
put the values into a vector
with the c()
function, which concatentates the
values together into a form that R can manipulate. Note that these values have to be exact.
biopicsUSUK <- biopics %>% filter(country %in% c("US", "UK"))
*** =instructions
- Pick out the
Musician
,Artist
andSinger
movies fromtype_of_subject
. - Assign the output to
biopicsArt
.
*** =hint
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
biopics$type_of_subject <- factor(biopics$type_of_subject)
options(tibble.width = Inf)
*** =sample_code
biopicsArt <- biopics %>%
*** =solution
biopicsArt <- biopics %>% filter(type_of_subject %in% c("Musician", "Artist", "Singer"))
*** =sct
success_msg("Now you know how to filter on multiple values! Let's move on.")
test_object("biopicsArt", incorrect_msg = "Not Quite.")
--- type:NormalExercise lang:r xp:100 skills:1 key:7b9a4c4b99
One trick you can use filter()
for is to remove missing values. Usually missing values are
coded as NA
in data. You can remove rows that contain NAs
by using is.na()
. For example:
filter(biopics, !is.na(box_office))
Note the !
in front of is.na(box_office)
. This !
is known as the NOT operator. Basically,
it switches the values in our is.na
statement, making everything that was TRUE
into FALSE
,
and everything FALSE
into TRUE
. We want to keep everything that is not NA
, so that's why
we use the !
.
*** =instructions
- Filter
biopics
to remove the NAs, and assign the output tofilteredBiopics
. - Compare the number of rows in
biopics
tofilteredBiopics
. - How many missing values did we remove?
*** =hint
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sample_code
filteredBiopics <- filter()
#show number of rows in biopics
#show number of rows in filteredBiopics
*** =solution
filteredBiopics <- filter(biopics, !is.na(box_office))
#show number of rows in biopics
nrow(biopics)
#show number of rows in filteredBiopics
nrow(filteredBiopics)
*** =sct
success_msg("Now you know how to filter out missing values! Let's move on.")
test_object("filteredBiopics", incorrect_msg = "Did you use !is.na() correctly?")
test_function("nrow", incorrect_msg = "Did you call `nrow()`?")
--- type:NormalExercise lang:r xp:100 skills:1 key:7d9b5ddfe5
mutate()
is one of the most useful dplyr
commands. You can use it to transform data
(variables in your data.frame
) and add it as a new variable into the data.frame
.
For example, let's calculate the total box_office
divided by the number_of_subjects
to normalize our comparison as normalized_box_office
:
biopics2 <- mutate(biopics, normalized_box_office = box_office/number_of_subjects)
What did we do here? First, we used the mutate()
function to add a new column into our
data.frame
called normalized_box_office
. This new variable is calculated per row by dividing
box_office
by number_of_subjects
.
*** =instructions
- Try defining a new variable
race_and_gender
by pasting togethersubject_race
andsubject_sex
into a newdata_frame
calledbiopics2
. - Show the first few rows using
head()
so you can confirm that you added this new variable correctly.
Remember, you can use the paste()
function to paste two strings together.
*** =hint
paste(subject_race, subject_sex)
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sample_code
#assign new variable race_and_gender here using mutate()
biopics2 <- mutate()
#show first rows of biopics2 using head()
*** =solution
#assign new variable race_and_gender here using mutate()
biopics2 <- mutate(biopics, race_and_gender = paste(subject_race, subject_sex))
#show first rows using head
head(biopics2)
*** =sct
success_msg("Great! Now you know how to use `mutate()`!")
test_object("biopics2", incorrect_msg = "Not quite. Did you use `paste()`?")
test_function("head", incorrect_msg = "Almost there. Did you use `head()`")
--- type:NormalExercise lang:r xp:100 skills:1 key:dfc4eaade1
The nifty thing about mutate()
is that once you define the variables in the statement,
you can use them right away, in the same mutate
statement. For example, look at this code:
mutate(biopics, box_office_year = year_release * box_office, box_office_subject = paste0(box_office_year, subject))
Notice that we first defined box_office_year
in the first part of the mutate()
statement,
and then used it right away to define a new variable, box_office_subject
.
*** =instructions
- Define another variable called
box_office_y_s_num
in the samemutate()
statement by takingbox_office_year
and dividing it bynumber_of_subjects
. - Assign the output to
mutatedBiopics
.
*** =hint
Add box_office_y_s_num=box_office_year/number_of_subjects
to the statement below.
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sample_code
mutatedBiopics <- mutate(biopics, box_office_year = year_release * box_office, box_office_subject = paste0(box_office_year, subject))
*** =solution
mutatedBiopics <- mutate(biopics, box_office_year = year_release * box_office, box_office_subject = paste0(box_office_year, subject),
box_office_y_s_num = box_office_year/number_of_subjects)
*** =sct
success_msg("You used a variable that was defined in a mutate statement. Great!")
test_object("mutatedBiopics", incorrect_msg = "Not quite. Did you define a new variable?")
--- type:MultipleChoiceExercise lang:r xp:50 skills:1 key:bfd4f10b15
What is this statement doing? Try it out in the console if you're not sure.
mutate(biopics, subject= paste(subject, year_release))
*** =instructions
- We are defining a brand-new variable with the same name in our dataset and keeping the old variable as well
- We are processing the variable
subject
and saving it in place
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sct
success_msg("Great! Now you know you can also use `mutate()` to process variables in place.")
msg1 = "Try it out. Did we add another variable?"
msg2 = "Correct! We're processing `subject` in place."
test_mc(correct=2, feedback_msgs = c(msg1, msg2))
--- type:MultipleChoiceExercise lang:r xp:50 skills:1 key:3b247ce89c
What is the difference between these two statements? Try them out in the console if you're not sure.
biopics %>% filter(year_release > 1998)
biopics %>% mutate(isNewer = year_release > 1998)
*** =instructions
- The first statement should have a larger number of rows than the second one
- The first statement filters the data, whereas the second statement defines a new boolean variable.
- The second statement is more confusing.
*** =hint Compare the number of rows and the output of each of these statements.
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sct
success_msg("Yes! I'm glad you understand the difference.")
msg1 = "Not the case! Try comparing the number of rows."
msg2 = "Yes, this is correct! We're identifying a new variable that we can use to flag the data."
msg3 = "It is a little confusing, but think it out."
test_mc(correct = 2, feedback_msgs = c(msg1, msg2, msg3))
--- type:TabExercise lang:r xp:100 key:02924208d4
We're going to introduce another bit of dplyr
syntax, the %>%
operator. %>%
is called a pipe
operator.
You can think of it as being similar to the +
in a ggplot2
statement.
What %>%
does is that it takes the output of one statement and makes it the input of the next statement. When
I'm describing it, I think of it as a "THEN". For example, I read the following expression
biopics %>% filter(race_known == "Known") %>%
mutate(poc_code = as.numeric(person_of_color))
As: I took the biopics
data, THEN
I filtered it down with the race_known == "Known"
criteria and THEN
I defined a new variable called poc_code
.
Note that filter()
doesn't have a data
argument, because the data
is piped
into filter()
. Same thing for mutate()
.
%>%
allows you to chain multiple verbs in the tidyverse
. It's one of the most powerful things about the tidyverse
. In fact,
having a standardized chain of processing actions is called a pipeline. Making pipelines for a data
format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2
friendly format.
*** =pre_exercise_code
library(fivethirtyeight)
library(tidyverse)
library(stringr)
data(biopics)
options(tibble.width = Inf)
*** =sample_code```{r} richardUS <- biopics %>%
*** =type1:NormalExercise
*** =key1: e0b8c4068c
*** =xp1: 100
*** =instructions1
+ Use `%>%` to chain `biopics` into a `filter` to select (`country=="US"`)
*** =solution1
```{r}
richardUS <-
biopics %>% filter(country=="US")
*** =sct1
test_object("richardUS", incorrect_msg="Not quite. Use %>% filter()")
*** =type2:NormalExercise *** =key2: 7a97ffb1f6 *** =xp2: 100
*** =instructions2
Let's put another verb into filter
: str_detect()
If you use this in a filter
statement, you
can use it to search for a string within a variable.
Add this to your dplyr statement (don't forget the %>%
):
filter(str_detect(lead_actor_actress, "Richard"))
*** =solution2
richardUS <-
biopics %>% filter(country=="US") %>%
filter(str_detect(lead_actor_actress, "Richard"))
*** =sct2
test_object("richardUS", incorrect_msg="Not quite. Add the above filter() statement")
--- type:NormalExercise lang:r xp:100 skills:1 key:8c5431911a
group_by()
doesn't do anything by itself. But when combined with summarize()
, you can
calculate metrics (such as mean
, max
- the maximum, min
, sd
- the standard deviation) across groups. For example:
countryMeans <- biopics %>%
filter(!is.na(box_office)) %>%
group_by(country) %>%
summarize(mean_box_office = mean(box_office))
Here we want to calculate the mean box_office
by country
. However, in order to do that, we first need to remove
any rows that have NA
values in box_office
that may confound our calculation.
*** =instructions
Let's ask a tough question. Is there a difference between mean box_office
between the two subject_sex
categories?
First use filter()
to remove the NA values. Then, use group_by()
and summarize()
to
calculate the mean box_office
by subject_sex
, naming the summary
variable as mean_bo_by_gender
. Assign the output to gender_box_office
.
*** =hint
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
options(tibble.width = Inf)
*** =sample_code
gender_box_office <- biopics %>%
filter() %>%
group_by() %>%
summarize(mean_bo_by_gender=)
##show gender_box_office
head(gender_box_office)
*** =solution
gender_box_office <- biopics %>%
filter(!is.na(box_office)) %>%
group_by(subject_sex) %>%
summarize(mean_bo_by_gender=mean(box_office))
##show gender_box_office
head(gender_box_office)
*** =sct
success_msg("Yes! You're summarizing like crazy! Let's move on.")
test_object("gender_box_office", incorrect_msg = "almost, but not quite! Check each dplyr verb to make sure it's correct.")
--- type:MultipleChoiceExercise lang:r xp:50 skills:1 key:915110e4a5
What does the following code do? Try it out in the console!
biopics %>% group_by(type_of_subject) %>% summarize(count=n())
*** =instructions
- just shows the regular
biopics
data.frame
- counts each
type_of_subject
and puts it in another table
*** =hint
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
options(tibble.width = Inf)
*** =sct
success_msg("Now you know how to use `group_by()` and `summarize()` to count categories.")
resp1 <- "Nope. Try it out and see what it does."
resp2 <- "Correct! This is a recipe for counting categories."
test_mc(correct=2, feedback_msgs = c(resp1, resp2))
--- type:NormalExercise lang:r xp:100 skills:1 key:6586c80668
arrange()
lets you sort by a variable. If you provide multiple variables, the variables are
arranged within each other. For example:
biopics %>% arrange(country, year_release)
This statement will sort the data by country
first, and then within each country
category,
it will sort by year_release
.
*** =instructions
Sort biopics
by year_release
then by lead_actor_actress
. Assign the output to biopics_sorted
.
*** =hint
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
options(tibble.width = Inf)
*** =sample_code
biopics_sorted <- biopics %>%
*** =solution
biopics_sorted <- biopics %>% arrange(year_release, lead_actor_actress)
*** =sct
success_msg("Now you know how to sort data using dplyr. Nifty, eh?")
test_object("biopics_sorted", incorrect_msg = "Not quite. Make sure you are arranging variables in order")
--- type:NormalExercise lang:r xp:100 skills:1 key:86c314b14c
The final verb we'll learn is select()
. select()
allows you to 1) extract columns,
2) reorder columns or 3) remove columns from your data, as well as 4) rename your data.
For example, look at the following code:
biopics %>% select(movieTitle=title, box_office)
Here, we're just extracting two columns (title_of_movie
, box_office
). Notice we also renamed
title
to movieTitle
.
*** =instructions
Use select
to extract the following variables: title
(rename it movieTitle
),
box_office
and subject_sex
and assign them to a new table called threeVarTable
*** =hint
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
options(tibble.width = Inf)
*** =sample_code
threeVarTable <- biopics %>% select()
*** =solution
threeVarTable <- biopics %>% select(movieTitle=title, box_office, subject_sex)
*** =sct
success_msg("Great! Now you know how to select columns from your tables.")
test_object("threeVarTable", incorrect_msg = "Not quite. Remember names and order matters.")
--- type:MultipleChoiceExercise lang:r xp:50 skills:1 key:f9e951b711
What is the difference between select()
and filter()?
*** =instructions
select()
works on booleans, whereasfilter()
works on all data typesselect()
only works afterfilter()
select()
works on columns,filter()
works on rows
*** =pre_exercise_code
*** =sct
success_msg("Welcome to the cult of `dplyr`! Your secret decoder ring is in the mail.")
msg1 <- "Nope. Both of these verbs don't care what data type you use."
msg2 <- "Not true. You can use `filter()` and `select()` in any order!"
msg3 <- "Yup. Repeat this 10 times every day so you know the difference."
test_mc(correct = 3, feedback_msgs = c(msg1, msg2, msg3))
--- type:TabExercise lang:r xp:100 key:7d73050cc1
Now here comes the fun part. Chaining dplyr
verbs together to accomplish some
data cleaning and transformation.
For a reference while you work, you can use the dplyr
cheatsheet here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sample_code
biopics_by_country <- biopics %>%
*** =type1:NormalExercise *** =key1: e453083e3c *** =key1: 9daff25f19
*** =xp1: 100
*** =instructions1
For the biopics
data, filter()
the data so that we only cover movies from 2000 to 2014. Then
use mutate()
to code a new variable, box_office_per_subject
.
*** =solution1
biopics_by_country <- biopics %>%
filter(year_release >= 2000 & year_release <= 2014) %>%
mutate(box_office_per_subject = box_office / number_of_subjects)
*** =sct1
test_object("biopics_by_country", incorrect_msg="Almost - did you set up your `filter()` and your `mutate()` statement?")
*** =type2:NormalExercise *** =key2: 5a8228cb3c *** =key2: 0d53edf905
*** =xp2: 100
*** =instructions2
Now filter
to remove the NA values inbox_office_per_subject
and group_by()
country
and summarize()
the mean box_office_per_subject
by country
as bops_by_country
.
*** =solution2
biopics_by_country <- biopics %>%
filter(year_release >= 2000 & year_release <= 2014) %>%
mutate(box_office_per_subject = box_office / number_of_subjects) %>%
filter(!is.na(box_office_per_subject)) %>%
group_by(country) %>%
summarize(bops_by_country = mean(box_office_per_subject))
*** =sct2
# no sct yet
*** =type3:NormalExercise *** =key3: d017afe283 *** =key3: 006e415450
*** =xp3: 100
*** =instructions3
Finally, arrange biopics_by_country
by your new bops_by_country
variable.
*** =solution3
biopics_by_country <- biopics %>%
filter(year_release >= 2000 & year_release <= 2014) %>%
mutate(box_office_per_subject = box_office / number_of_subjects) %>%
filter(!is.na(box_office_per_subject)) %>%
group_by(country) %>%
summarize(bops_by_country = mean(box_office_per_subject)) %>%
arrange(bops_by_country)
*** =sct3
success_msg("Wow! You've really come really far. I bestow upon you the title of `dplyr` ninja!")
test_object("biopics_by_country", incorrect_msg = "Not quite. Check your code.")
--- type:NormalExercise lang:r xp:300 skills:1 key:a4c8d46d41
Answer the question: Do movies where we know the race is known (race_known
== TRUE) make more
money than movies where the race is not known (race_known
== FALSE) grouped by country?
Which race_known
/country
combination made the highest amount of money?
*** =instructions
- You'll need to do a
filter
step first to removeNA
values frombox_office
before you do anything. - Then think of what variables you need to
group_by
. - Finally, figure out what do you need to
summarize
(assign the value tomean_box_office
) andarrange
on (don't forget to usedesc
!)? - Assign the output to
race_country_box_office
.
*** =hint You can do this!
*** =pre_exercise_code
library(fivethirtyeight)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sample_code
race_country_box_office <- biopics %>%
*** =solution
race_country_box_office <- biopics %>%
filter(!is.na(box_office)) %>%
group_by(race_known, country) %>%
summarize(mean_box_office=mean(box_office)) %>%
arrange(desc(mean_box_office))
*** =sct
success_msg("Your `dplyr` skills are ever improving!")
test_object("race_country_box_office", incorrect_msg =
"Not quite. Think about the order in which you need to do these operations.")
--- type:NormalExercise lang:r xp:300 skills:1 key:b7b6b449de
Now we're cooking with fire. You can directly pipe the output of a dplyr
pipeline
into a ggplot2
statement. For example:
biopics %>%
filter(year_release >= 2000 & year_release <= 2014) %>%
mutate(box_office_per_subject = box_office / number_of_subjects) %>%
ggplot(aes(x = year_release, y=box_office_per_subject)) +
geom_point()
Note that we use %>%
to pipe our statement into the ggplot()
function. The
tricky thing to remember is that everything after the ggplot()
is connected with
+
, and not %>%
.
Also note: we don't assign a data
variable in the ggplot()
statement. We are piping
in the data.
*** =instructions
Are you sick of biopics
yet? I promise this is the last time we use this dataset.
- First, filter
biopics
to haveyear_release
< 1990 and removeNA
values. - Then pipe that into a
ggplot()
statement that plots an x-y plot ofbox_office
(usegeom_point()
) wherex=year_release
andy=log(box_office)
. - Color the points by
person_of_color
. - Assign the output to
bPlot
and print it to the screen usingprint(bPlot)
.
*** =pre_exercise_code
library(fivethirtyeight)
library(ggplot2)
library(dplyr)
data(biopics)
biopics$country <- factor(biopics$country)
options(tibble.width = Inf)
*** =sample_code
bPlot <- biopics %>%
print(bPlot)
*** =solution
bPlot <- biopics %>% filter(year_release < 1990) %>% filter(!is.na(box_office)) %>%
ggplot(aes(x=year_release, y=log(box_office), color=person_of_color)) +
geom_point()
print(bPlot)
*** =sct
success_msg("Now you know how to chain dplyr statements straight into a ggplot!")
#test_ggplot(index = 0, check_aes = TRUE, aes_fail_msg = "Make sure you're mapping the right variables")
#test_ggplot(index=0, check_data = TRUE, data_fail_msg = "Did you do the filter step correctly?")
--- type:NormalExercise lang:r xp:0 skills:1 key:749b2485e7
- How to use
%>%
(the pipe) dplyr::filter()
dplyr::mutate()
dplyr::group_by()/dplyr::summarize()
dplyr::arrange()
dplyr::select()
- How to put it all together!
- Chester's Mantra
*** =instructions Please fill out our feedback form! We're trying to improve these workshops for future students.
Just move on to the next chapter. (CTRL+K)
Good job for making it through this chapter! You're well on your way
to becoming a tidyverse
ninja!
*** =hint
*** =pre_exercise_code
*** =sample_code
*** =solution
*** =sct