First crack at row-oriented sums or means

jennybc · May 12, 2018 · 13ec762 · 13ec762
1 parent cd78808
commit 13ec762
Show file tree

Hide file tree

Showing 3 changed files with 284 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -26,6 +26,7 @@ Not all are used in webinar
   * **Generate data from different distributions via `purrr::pmap()`.** [`ex06_runif-via-pmap`](ex06_runif-via-pmap.md) Use `purrr::pmap()` to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame.
   * **Are you SURE you need to iterate over groups?** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column.
   * **Group-and-nest.** [`ex08_nesting-is-good`](ex08_nesting-is-good.md) How to explicitly work on groups of rows via nesting (our recommendation) vs splitting.
+  * **Row-wise mean or sum.** [`ex09_row-summaries`](ex09_row-summaries.md) How to do `rowSums()`-y and `rowMeans()`-y work inside a data frame.
 
 ## More tips and links
 

diff --git a/ex09_row-summaries.R b/ex09_row-summaries.R
@@ -0,0 +1,109 @@
+#' ---
+#' title: "Row-wise Summaries"
+#' author: "Jenny Bryan"
+#' date: "`r format(Sys.Date())`"
+#' output: github_document
+#' ---
+
+#+ setup, include = FALSE, cache = FALSE
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  error = TRUE
+)
+options(tidyverse.quiet = TRUE)
+
+#' > For rowSums, mtcars %>% mutate(rowsum = pmap_dbl(., sum)) works but is
+#' > a tidy oneliner for mean or sd per row?
+#' > I'm looking for a tidy version of rowSums, rowMeans and similarly rowSDs...
+#'
+#' [Two](https://twitter.com/vrnijs/status/995129678284255233)
+#' [tweets](https://twitter.com/vrnijs/status/995193240864178177) from Vincent
+#' Nijs [github](https://github.com/vnijs),
+#' [twitter](https://twitter.com/vrnijs)
+#'
+
+#' Good question! This also came up when I was originally casting about for
+#' genuine row-wise operations, but I never worked it up. I will do so now!
+#'
+#+ body
+# ----
+library(tidyverse)
+
+df <- tribble(
+  ~ name, ~ t1, ~t2, ~t3,
+  "Abby",    1,   2,   3,
+  "Bess",    4,   5,   6,
+  "Carl",    7,   8,   9
+)
+
+#' Here is a one-liner, but my use of `purrr::lift_vd()` makes it a little
+#' astronaut-y..
+df %>%
+  mutate(t_avg = pmap_dbl(select(., -name), lift_vd(mean)))
+
+#' Interestingly, you don't need to change the domain for `sum()`:
+df %>%
+  mutate(t_sum = pmap_dbl(select(., -name), sum))
+
+#' Why is that? Because of the difference in signature of `sum()` and `mean()`:
+
+#+ eval = FALSE
+sum(..., na.rm = FALSE)
+mean(x, ...)
+
+#' `sum()` has a more favorable signature for the way `purrr::pmap()` presents
+#' the data from each row.
+#'
+#' Note that above I'm also showing the use of `select(., SOME EXPRESSION)` to
+#' take control over which variables are passed along to `.f` of `pmap()`.
+#'
+#' ## Joining summaries back in
+#'
+#' Data frames simply aren't a convenient storage format if you have a frequent
+#' need to compute summaries, row-wise, on a subset of columns. This might
+#' suggest that your data is in the wrong shape. In any case, the more
+#' transparent ways to do this are also more verbose.
+
+#' More verbose patterns for this involve using `group_by()` + `summarise()`
+#' and, therefore, obligate you to computing summaries separately and joining
+#' back in.
+(s1 <- df %>%
+    group_by(name) %>%
+    summarise(t_avg = mean(c(t1, t2, t3))))
+df %>%
+  left_join(s1)
+
+(s2 <- df %>%
+    gather("time", "val", starts_with("t")) %>%
+    group_by(name) %>%
+    summarize(t_avg = mean(val)))
+df %>%
+  left_join(s2)
+
+(s3 <- df %>%
+    column_to_rownames("name") %>%
+    rowMeans() %>%
+    enframe())
+df %>%
+  left_join(s3)
+
+#' ## Maybe you should use a matrix
+#'
+#' If you truly have data where each row is:
+#'
+#'   * Identifier for this observational unit
+#'   * Homogeneous vector of length n for the unit
+#'
+#' then you do want to use a matrix with rownames. I used to do this alot but
+#' found that practically none of my data analysis problems live in this simple
+#' world for more than a couple of hours. Eventually I always get back to a
+#' setting where a data frame is the most favorable receptacle, overall. YMMV.
+m <- matrix(
+  1:9,
+  byrow = TRUE, nrow = 3,
+  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
+)
+
+cbind(m, rowsum = rowSums(m))
+cbind(m, rowmean = rowMeans(m))
diff --git a/ex09_row-summaries.md b/ex09_row-summaries.md
@@ -0,0 +1,174 @@
+Row-wise Summaries
+================
+Jenny Bryan
+2018-05-12
+
+> For rowSums, mtcars %\>% mutate(rowsum = pmap\_dbl(., sum)) works but
+> is a tidy oneliner for mean or sd per row? I’m looking for a tidy
+> version of rowSums, rowMeans and similarly rowSDs…
+
+[Two](https://twitter.com/vrnijs/status/995129678284255233)
+[tweets](https://twitter.com/vrnijs/status/995193240864178177) from
+Vincent Nijs [github](https://github.com/vnijs),
+[twitter](https://twitter.com/vrnijs)
+
+Good question\! This also came up when I was originally casting about
+for genuine row-wise operations, but I never worked it up. I will do so
+now\!
+
+``` r
+library(tidyverse)
+
+df <- tribble(
+  ~ name, ~ t1, ~t2, ~t3,
+  "Abby",    1,   2,   3,
+  "Bess",    4,   5,   6,
+  "Carl",    7,   8,   9
+)
+```
+
+Here is a one-liner, but my use of `purrr::lift_vd()` makes it a little
+astronaut-y..
+
+``` r
+df %>%
+  mutate(t_avg = pmap_dbl(select(., -name), lift_vd(mean)))
+#> # A tibble: 3 x 5
+#>   name     t1    t2    t3 t_avg
+#>   <chr> <dbl> <dbl> <dbl> <dbl>
+#> 1 Abby      1     2     3     2
+#> 2 Bess      4     5     6     5
+#> 3 Carl      7     8     9     8
+```
+
+Interestingly, you don’t need to change the domain for `sum()`:
+
+``` r
+df %>%
+  mutate(t_sum = pmap_dbl(select(., -name), sum))
+#> # A tibble: 3 x 5
+#>   name     t1    t2    t3 t_sum
+#>   <chr> <dbl> <dbl> <dbl> <dbl>
+#> 1 Abby      1     2     3     6
+#> 2 Bess      4     5     6    15
+#> 3 Carl      7     8     9    24
+```
+
+Why is that? Because of the difference in signature of `sum()` and
+`mean()`:
+
+``` r
+sum(..., na.rm = FALSE)
+mean(x, ...)
+```
+
+`sum()` has a more favorable signature for the way `purrr::pmap()`
+presents the data from each row.
+
+Note that above I’m also showing the use of `select(., SOME EXPRESSION)`
+to take control over which variables are passed along to `.f` of
+`pmap()`.
+
+## Joining summaries back in
+
+Data frames simply aren’t a convenient storage format if you have a
+frequent need to compute summaries, row-wise, on a subset of columns.
+This might suggest that your data is in the wrong shape. In any case,
+the more transparent ways to do this are also more verbose. More verbose
+patterns for this involve using `group_by()` + `summarise()` and,
+therefore, obligate you to computing summaries separately and joining
+back in.
+
+``` r
+(s1 <- df %>%
+    group_by(name) %>%
+    summarise(t_avg = mean(c(t1, t2, t3))))
+#> # A tibble: 3 x 2
+#>   name  t_avg
+#>   <chr> <dbl>
+#> 1 Abby      2
+#> 2 Bess      5
+#> 3 Carl      8
+df %>%
+  left_join(s1)
+#> Joining, by = "name"
+#> # A tibble: 3 x 5
+#>   name     t1    t2    t3 t_avg
+#>   <chr> <dbl> <dbl> <dbl> <dbl>
+#> 1 Abby      1     2     3     2
+#> 2 Bess      4     5     6     5
+#> 3 Carl      7     8     9     8
+
+(s2 <- df %>%
+    gather("time", "val", starts_with("t")) %>%
+    group_by(name) %>%
+    summarize(t_avg = mean(val)))
+#> # A tibble: 3 x 2
+#>   name  t_avg
+#>   <chr> <dbl>
+#> 1 Abby      2
+#> 2 Bess      5
+#> 3 Carl      8
+df %>%
+  left_join(s2)
+#> Joining, by = "name"
+#> # A tibble: 3 x 5
+#>   name     t1    t2    t3 t_avg
+#>   <chr> <dbl> <dbl> <dbl> <dbl>
+#> 1 Abby      1     2     3     2
+#> 2 Bess      4     5     6     5
+#> 3 Carl      7     8     9     8
+
+(s3 <- df %>%
+    column_to_rownames("name") %>%
+    rowMeans() %>%
+    enframe())
+#> Warning: Setting row names on a tibble is deprecated.
+#> # A tibble: 3 x 2
+#>   name  value
+#>   <chr> <dbl>
+#> 1 Abby      2
+#> 2 Bess      5
+#> 3 Carl      8
+df %>%
+  left_join(s3)
+#> Joining, by = "name"
+#> # A tibble: 3 x 5
+#>   name     t1    t2    t3 value
+#>   <chr> <dbl> <dbl> <dbl> <dbl>
+#> 1 Abby      1     2     3     2
+#> 2 Bess      4     5     6     5
+#> 3 Carl      7     8     9     8
+```
+
+## Maybe you should use a matrix
+
+If you truly have data where each row is:
+
+  - Identifier for this observational unit
+  - Homogeneous vector of length n for the unit
+
+then you do want to use a matrix with rownames. I used to do this alot
+but found that practically none of my data analysis problems live in
+this simple world for more than a couple of hours. Eventually I always
+get back to a setting where a data frame is the most favorable
+receptacle, overall. YMMV.
+
+``` r
+m <- matrix(
+  1:9,
+  byrow = TRUE, nrow = 3,
+  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
+)
+
+cbind(m, rowsum = rowSums(m))
+#>      t1 t2 t3 rowsum
+#> Abby  1  2  3      6
+#> Bess  4  5  6     15
+#> Carl  7  8  9     24
+cbind(m, rowmean = rowMeans(m))
+#>      t1 t2 t3 rowmean
+#> Abby  1  2  3       2
+#> Bess  4  5  6       5
+#> Carl  7  8  9       8
+```