Add example of group_by + summarise and the list() trick

thx @hadley
jennybc · Apr 11, 2018 · f0ffc39
1 parent 23ab029
commit f0ffc39

Showing 3 changed files with 107 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -22,6 +22,7 @@ Not all are used in webinar
   * **Row-wise thinking vs. column-wise thinking.** [`ex05_attack-via-rows-or-columns`](ex05_attack-via-rows-or-columns.md) Data rectangling example. Both are possible, but I find building a tibble column-by-column is less aggravating than building rows, then row binding.
   * **Iterate over rows of a data frame.** [`iterate-over-rows`](iterate-over-rows.md) Empirical study of reshaping a data frame into this form: a list with one component per row. Revisiting a study originally done by Winston Chang. Run times for different number of [rows](row-benchmark.png) or [columns](col-benchmark.png).
   * **Generate data from different distributions via `purrr::pmap()`.** [`ex06_runif-via-pmap`](ex06_runif-via-pmap.md) Use `purrr::pmap()` to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame.
+  * **Group and summarise.** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column.
   * **Split-apply-combine.** Nesting vs splitting.
     - Downside of `split()`: First-class grouping variable(s) --> character vector of names --> variable is a big drag. Integer-y numerics must be coerced back, factors must be recreated, with original levels. Transiting data through attributes is an anti-pattern.
     - Downside of `nest()`: When you inspect the list-column, you can't see values of grouping (key) variables. Grouping variables not necessarily/easily available for simple map (coolbutuseless's posts and PR).
diff --git a/ex07_group-by-summarise.R b/ex07_group-by-summarise.R
@@ -0,0 +1,45 @@
+#' ---
+#' title: "Work on groups of rows via dplyr::group_by() + summarise()"
+#' author: "Jenny Bryan"
+#' date: "`r format(Sys.Date())`"
+#' output: github_document
+#' ---
+
+#+ setup, include = FALSE, cache = FALSE
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  error = TRUE
+)
+options(tidyverse.quiet = TRUE)
+
+#+ body
+# ----
+
+#' What if you need to work on groups of rows, such as the groups induced by
+#' the levels of a factor?
+#'
+#' You do not need to ... split the data frame into mini-data-frames, loop over
+#' them, and glue it all back together.
+#'
+#' Instead, use `dplyr::group_by()`, followed by `dplyr::summarise()`, to
+#' compute group-wise summaries.
+
+library(dplyr)
+
+iris %>%
+  group_by(Species) %>%
+  summarise(pl_avg = mean(Petal.Length), pw = mean(Petal.Width))
+
+#' What if you want to return summaries that are not just a single number?
+#'
+#' This does not "just work".
+iris %>%
+  group_by(Species) %>%
+  summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))
+
+#' Solution: package as a length-1 list that contains 3 values, creating a
+#' list-column.
+iris %>%
+  group_by(Species) %>%
+  summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
diff --git a/ex07_group-by-summarise.md b/ex07_group-by-summarise.md
@@ -0,0 +1,61 @@
+Work on groups of rows via dplyr::group\_by() + summarise()
+================
+Jenny Bryan
+2018-04-10
+
+What if you need to work on groups of rows, such as the groups induced
+by the levels of a factor?
+
+You do not need to … split the data frame into mini-data-frames, loop
+over them, and glue it all back together.
+
+Instead, use `dplyr::group_by()`, followed by `dplyr::summarise()`, to
+compute group-wise summaries.
+
+``` r
+library(dplyr)
+#> 
+#> Attaching package: 'dplyr'
+#> The following objects are masked from 'package:stats':
+#> 
+#>     filter, lag
+#> The following objects are masked from 'package:base':
+#> 
+#>     intersect, setdiff, setequal, union
+
+iris %>%
+  group_by(Species) %>%
+  summarise(pl_avg = mean(Petal.Length), pw = mean(Petal.Width))
+#> # A tibble: 3 x 3
+#>   Species    pl_avg    pw
+#>   <fct>       <dbl> <dbl>
+#> 1 setosa       1.46 0.246
+#> 2 versicolor   4.26 1.33 
+#> 3 virginica    5.55 2.03
+```
+
+What if you want to return summaries that are not just a single number?
+
+This does not “just work”.
+
+``` r
+iris %>%
+  group_by(Species) %>%
+  summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))
+#> Error in summarise_impl(.data, dots): Column `pl_qtile` must be length 1 (a summary value), not 3
+```
+
+Solution: package as a length-1 list that contains 3 values, creating a
+list-column.
+
+``` r
+iris %>%
+  group_by(Species) %>%
+  summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
+#> # A tibble: 3 x 2
+#>   Species    pl_qtile 
+#>   <fct>      <list>   
+#> 1 setosa     <dbl [3]>
+#> 2 versicolor <dbl [3]>
+#> 3 virginica  <dbl [3]>
+```
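
A possible follow-up to the list-column result in `ex07_group-by-summarise.md`, not part of this commit: once the quantiles are packaged with `list()`, individual values can be pulled back out into ordinary columns. The sketch below uses base `vapply()` for that. The names `qt`, `qt_wide`, `pl_q25`, `pl_q50`, and `pl_q75` are made up for illustration, and the code assumes `quantile()` labels its result `25%`, `50%`, `75%`, which is its default `names = TRUE` behaviour.

```r
library(dplyr)

# Same list() trick as in the example: one length-3 vector per group,
# stored as a single element of a list-column.
qt <- iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))

# Each element of the list-column is a named numeric vector of length 3.
qt$pl_qtile[[1]]

# Extract each quantile by name into its own ordinary numeric column.
# vapply() checks that every extracted value is a single number.
qt_wide <- qt %>%
  mutate(
    pl_q25 = vapply(pl_qtile, function(q) q[["25%"]], numeric(1)),
    pl_q50 = vapply(pl_qtile, function(q) q[["50%"]], numeric(1)),
    pl_q75 = vapply(pl_qtile, function(q) q[["75%"]], numeric(1))
  )

qt_wide
```

Keeping `pl_qtile` alongside the extracted columns is a choice made here for easy inspection; dropping it afterwards with `select(-pl_qtile)` would leave a plain table with one row per species.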