Add example of group_by + summarise and the list() trick
jennybc committed Apr 11, 2018
1 parent 23ab029 commit f0ffc39
Showing 3 changed files with 107 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -22,6 +22,7 @@ Not all are used in webinar
* **Row-wise thinking vs. column-wise thinking.** [`ex05_attack-via-rows-or-columns`](ex05_attack-via-rows-or-columns.md) Data rectangling example. Both are possible, but I find building a tibble column-by-column is less aggravating than building rows, then row binding.
* **Iterate over rows of a data frame.** [`iterate-over-rows`](iterate-over-rows.md) Empirical study of reshaping a data frame into this form: a list with one component per row. Revisiting a study originally done by Winston Chang. Run times for different number of [rows](row-benchmark.png) or [columns](col-benchmark.png).
* **Generate data from different distributions via `purrr::pmap()`.** [`ex06_runif-via-pmap`](ex06_runif-via-pmap.md) Use `purrr::pmap()` to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame.
* **Group and summarise.** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column.
* **Split-apply-combine.** Nesting vs splitting.
- Downside of `split()`: First-class grouping variable(s) --> character vector of names --> variable is a big drag. Integer-y numerics must be coerced back, factors must be recreated, with original levels. Transiting data through attributes is an anti-pattern.
- Downside of `nest()`: When you inspect the list-column, you can't see values of grouping (key) variables. Grouping variables not necessarily/easily available for simple map (coolbutuseless's posts and PR).
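
The `split()` downside above can be seen in a small base R sketch. It assumes `mtcars$cyl` as a stand-in for an "integer-y numeric" grouping variable; that dataset is not part of the original example.

```r
# Assumption: mtcars$cyl plays the role of a numeric grouping variable.
by_cyl <- split(mtcars, mtcars$cyl)

# After splitting, the grouping values survive only as list names,
# and names are always character strings.
names(by_cyl)
class(names(by_cyl))

# Recovering the original type takes an explicit coercion step.
cyl_values <- as.numeric(names(by_cyl))
cyl_values
```

A factor grouping variable would need the same kind of repair, plus re-establishing its original levels.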
45 changes: 45 additions & 0 deletions ex07_group-by-summarise.R
@@ -0,0 +1,45 @@
#' ---
#' title: "Work on groups of rows via dplyr::group_by() + summarise()"
#' author: "Jenny Bryan"
#' date: "`r format(Sys.Date())`"
#' output: github_document
#' ---

#+ setup, include = FALSE, cache = FALSE
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = TRUE
)
options(tidyverse.quiet = TRUE)

#+ body
# ----

#' What if you need to work on groups of rows, such as the groups induced by
#' the levels of a factor?
#'
#' You do not need to ... split the data frame into mini-data-frames, loop over
#' them, and glue it all back together.
#'
#' Instead, use `dplyr::group_by()`, followed by `dplyr::summarise()`, to
#' compute group-wise summaries.

library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(pl_avg = mean(Petal.Length), pw = mean(Petal.Width))

#' What if you want to return summaries that are not just a single number?
#'
#' This does not "just work": `summarise()` expects each summary to be a
#' single value per group.
iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))

#' Solution: package as a length-1 list that contains 3 values, creating a
#' list-column.
iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
61 changes: 61 additions & 0 deletions ex07_group-by-summarise.md
@@ -0,0 +1,61 @@
Work on groups of rows via dplyr::group\_by() + summarise()
================
Jenny Bryan
2018-04-10

What if you need to work on groups of rows, such as the groups induced
by the levels of a factor?

You do not need to … split the data frame into mini-data-frames, loop
over them, and glue it all back together.

Instead, use `dplyr::group_by()`, followed by `dplyr::summarise()`, to
compute group-wise summaries.

``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

iris %>%
  group_by(Species) %>%
  summarise(pl_avg = mean(Petal.Length), pw = mean(Petal.Width))
#> # A tibble: 3 x 3
#>   Species    pl_avg    pw
#>   <fct>       <dbl> <dbl>
#> 1 setosa       1.46 0.246
#> 2 versicolor   4.26 1.33
#> 3 virginica    5.55 2.03
```
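
For contrast, here is a rough sketch of the manual split, loop, and glue approach that the text above says you can skip. It uses only base R; the object names (`pieces`, `summaries`, `manual`) are illustrative and not from the original example.

```r
# Split: one mini-data-frame per species.
pieces <- split(iris, iris$Species)

# Apply: compute the same two means on each piece.
summaries <- lapply(pieces, function(d) {
  data.frame(
    pl_avg = mean(d$Petal.Length),
    pw     = mean(d$Petal.Width)
  )
})

# Combine: row-bind the pieces, then rebuild the grouping variable
# from the list names, restoring the original factor levels by hand.
manual <- do.call(rbind, summaries)
manual$Species <- factor(names(pieces), levels = levels(iris$Species))
rownames(manual) <- NULL
manual <- manual[, c("Species", "pl_avg", "pw")]
manual
```

The last few lines exist only to recover `Species` as a proper factor column, which is the bookkeeping that `group_by()` + `summarise()` handles for you.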

What if you want to return summaries that are not just a single number?

This does not “just work”: `summarise()` expects each summary to be a
single value per group.

``` r
iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))
#> Error in summarise_impl(.data, dots): Column `pl_qtile` must be length 1 (a summary value), not 3
```

Solution: package as a length-1 list that contains 3 values, creating a
list-column.

``` r
iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
#> # A tibble: 3 x 2
#>   Species    pl_qtile
#>   <fct>      <list>
#> 1 setosa     <dbl [3]>
#> 2 versicolor <dbl [3]>
#> 3 virginica  <dbl [3]>
```
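
Once the quantiles live in a list-column, you will probably want to get values back out. The following is one possible sketch, not part of the original example: it uses base `vapply()` inside `dplyr::mutate()`, and the names `iris_q`, `iris_wide`, `q25`, `q50`, and `q75` are made up for illustration.

```r
library(dplyr)

iris_q <- iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))

# Each element of the list-column is a named numeric vector of length 3,
# with names "25%", "50%", "75%" supplied by quantile().
iris_q$pl_qtile[[1]]

# Pull individual quantiles out into ordinary numeric columns.
iris_wide <- iris_q %>%
  mutate(
    q25 = vapply(pl_qtile, function(q) q[["25%"]], numeric(1)),
    q50 = vapply(pl_qtile, function(q) q[["50%"]], numeric(1)),
    q75 = vapply(pl_qtile, function(q) q[["75%"]], numeric(1))
  )
iris_wide
```

`vapply()` is used here because it checks that every extracted value is a single number; `purrr::map_dbl()` would be a natural alternative if purrr is already loaded.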
