First crack at row-oriented sums or means
jennybc committed May 12, 2018
1 parent cd78808 commit 13ec762
Showing 3 changed files with 284 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -26,6 +26,7 @@ Not all are used in webinar
* **Generate data from different distributions via `purrr::pmap()`.** [`ex06_runif-via-pmap`](ex06_runif-via-pmap.md) Use `purrr::pmap()` to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame.
* **Are you SURE you need to iterate over groups?** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column.
* **Group-and-nest.** [`ex08_nesting-is-good`](ex08_nesting-is-good.md) How to explicitly work on groups of rows via nesting (our recommendation) vs splitting.
* **Row-wise mean or sum.** [`ex09_row-summaries`](ex09_row-summaries.md) How to do `rowSums()`-y and `rowMeans()`-y work inside a data frame.

## More tips and links

109 changes: 109 additions & 0 deletions ex09_row-summaries.R
@@ -0,0 +1,109 @@
#' ---
#' title: "Row-wise Summaries"
#' author: "Jenny Bryan"
#' date: "`r format(Sys.Date())`"
#' output: github_document
#' ---

#+ setup, include = FALSE, cache = FALSE
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
error = TRUE
)
options(tidyverse.quiet = TRUE)

#' > For rowSums, mtcars %>% mutate(rowsum = pmap_dbl(., sum)) works but is
#' > a tidy oneliner for mean or sd per row?
#' > I'm looking for a tidy version of rowSums, rowMeans and similarly rowSDs...
#'
#' [Two](https://twitter.com/vrnijs/status/995129678284255233)
#' [tweets](https://twitter.com/vrnijs/status/995193240864178177) from Vincent
#' Nijs [github](https://github.com/vnijs),
#' [twitter](https://twitter.com/vrnijs)
#'

#' Good question! This also came up when I was originally casting about for
#' genuine row-wise operations, but I never worked it up. I will do so now!
#'
#+ body
# ----
library(tidyverse)

df <- tribble(
  ~ name, ~ t1, ~t2, ~t3,
  "Abby",    1,   2,   3,
  "Bess",    4,   5,   6,
  "Carl",    7,   8,   9
)

#' Here is a one-liner, but my use of `purrr::lift_vd()` makes it a little
#' astronaut-y..
df %>%
  mutate(t_avg = pmap_dbl(select(., -name), lift_vd(mean)))

#' Interestingly, you don't need to change the domain for `sum()`:
df %>%
  mutate(t_sum = pmap_dbl(select(., -name), sum))

#' Why is that? Because of the difference in signature of `sum()` and `mean()`:

#+ eval = FALSE
sum(..., na.rm = FALSE)
mean(x, ...)

#' `sum()` has a more favorable signature for the way `purrr::pmap()` presents
#' the data from each row.
#'
#' Note that above I'm also showing the use of `select(., SOME EXPRESSION)` to
#' take control over which variables are passed along to `.f` of `pmap()`.
#'
#' ## Joining summaries back in
#'
#' Data frames simply aren't a convenient storage format if you have a frequent
#' need to compute summaries, row-wise, on a subset of columns. This might
#' suggest that your data is in the wrong shape. In any case, the more
#' transparent ways to do this are also more verbose.

#' These more verbose patterns use `group_by()` + `summarise()` and,
#' therefore, oblige you to compute the summaries separately and join them
#' back in.
(s1 <- df %>%
   group_by(name) %>%
   summarise(t_avg = mean(c(t1, t2, t3))))
df %>%
  left_join(s1)

(s2 <- df %>%
   gather("time", "val", starts_with("t")) %>%
   group_by(name) %>%
   summarize(t_avg = mean(val)))
df %>%
  left_join(s2)

(s3 <- df %>%
   column_to_rownames("name") %>%
   rowMeans() %>%
   enframe())
df %>%
  left_join(s3)

#' ## Maybe you should use a matrix
#'
#' If you truly have data where each row is:
#'
#' * Identifier for this observational unit
#' * Homogeneous vector of length n for the unit
#'
#' then you do want to use a matrix with rownames. I used to do this a lot but
#' found that practically none of my data analysis problems live in this simple
#' world for more than a couple of hours. Eventually I always get back to a
#' setting where a data frame is the most favorable receptacle, overall. YMMV.
m <- matrix(
  1:9,
  byrow = TRUE, nrow = 3,
  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
)

cbind(m, rowsum = rowSums(m))
cbind(m, rowmean = rowMeans(m))
174 changes: 174 additions & 0 deletions ex09_row-summaries.md
@@ -0,0 +1,174 @@
Row-wise Summaries
================
Jenny Bryan
2018-05-12

> For rowSums, mtcars %\>% mutate(rowsum = pmap\_dbl(., sum)) works but
> is a tidy oneliner for mean or sd per row? I’m looking for a tidy
> version of rowSums, rowMeans and similarly rowSDs…

[Two](https://twitter.com/vrnijs/status/995129678284255233)
[tweets](https://twitter.com/vrnijs/status/995193240864178177) from
Vincent Nijs [github](https://github.com/vnijs),
[twitter](https://twitter.com/vrnijs)

Good question\! This also came up when I was originally casting about
for genuine row-wise operations, but I never worked it up. I will do so
now\!

``` r
library(tidyverse)

df <- tribble(
  ~ name, ~ t1, ~t2, ~t3,
  "Abby",    1,   2,   3,
  "Bess",    4,   5,   6,
  "Carl",    7,   8,   9
)
```

Here is a one-liner, but my use of `purrr::lift_vd()` makes it a little
astronaut-y..

``` r
df %>%
  mutate(t_avg = pmap_dbl(select(., -name), lift_vd(mean)))
#> # A tibble: 3 x 5
#>   name     t1    t2    t3 t_avg
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abby      1     2     3     2
#> 2 Bess      4     5     6     5
#> 3 Carl      7     8     9     8
```

Interestingly, you don’t need to change the domain for `sum()`:

``` r
df %>%
  mutate(t_sum = pmap_dbl(select(., -name), sum))
#> # A tibble: 3 x 5
#>   name     t1    t2    t3 t_sum
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abby      1     2     3     6
#> 2 Bess      4     5     6    15
#> 3 Carl      7     8     9    24
```

Why is that? Because of the difference in signature of `sum()` and
`mean()`:

``` r
sum(..., na.rm = FALSE)
mean(x, ...)
```

`sum()` has a more favorable signature for the way `purrr::pmap()`
presents the data from each row.
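
To see the signature difference in isolation, here is a small base R sketch. It calls `sum()` and `mean()` with several unnamed arguments, which is roughly how `pmap()` spreads a row across the arguments of `.f` (in practice `pmap()` also passes the column names along). The helper `row_mean()` is hypothetical and not part of purrr; it simply gathers `...` into one vector before calling `mean()`.

``` r
# sum() takes its inputs through `...`, so every argument is added up.
sum(1, 2, 3)

# mean() only averages its first argument, `x`. The remaining values are
# matched to other arguments of the default method, not to the data.
mean(1, 2, 3)

# Gathering the values into one vector first gives the intended average.
mean(c(1, 2, 3))

# Hypothetical helper: collect `...` into a vector, then average it.
row_mean <- function(...) mean(c(...))
row_mean(t1 = 4, t2 = 5, t3 = 6)
```

Only the last two calls compute an average over all the values, which is why `mean()` needs some form of domain change before it can be used as `.f`.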

Note that above I’m also showing the use of `select(., SOME EXPRESSION)`
to take control over which variables are passed along to `.f` of
`pmap()`.
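
As a sketch of that column-selection idea, the example below swaps `-name` for the tidyselect helper `starts_with("t")`. It also replaces `lift_vd(mean)` with an anonymous function that collects `...` into a vector. That is an alternative offered here as an assumption, not the approach used above, and it may be worth knowing because the `lift_*()` family appears to have been deprecated in later purrr releases. The second call hands the same selection to base `rowMeans()`, which accepts a data frame of numeric columns. The names `res_pmap` and `res_base` are illustrative only.

``` r
library(tidyverse)

df <- tribble(
  ~ name, ~ t1, ~t2, ~t3,
  "Abby",    1,   2,   3,
  "Bess",    4,   5,   6,
  "Carl",    7,   8,   9
)

# Select the columns by name pattern, then average each row's values.
# The anonymous function gathers the per-row arguments into one vector.
res_pmap <- df %>%
  mutate(t_avg = pmap_dbl(select(., starts_with("t")), function(...) mean(c(...))))

# Same selection, handed to base R's rowMeans() instead of pmap_dbl().
res_base <- df %>%
  mutate(t_avg = rowMeans(select(., starts_with("t"))))
```

Both results should carry the same `t_avg` column; the `rowMeans()` version only works because every selected column is numeric.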

## Joining summaries back in

Data frames simply aren’t a convenient storage format if you have a
frequent need to compute summaries, row-wise, on a subset of columns.
This might suggest that your data is in the wrong shape. In any case,
the more transparent ways to do this are also more verbose. These
patterns use `group_by()` + `summarise()` and therefore oblige you to
compute the summaries separately and join them back in.

``` r
(s1 <- df %>%
   group_by(name) %>%
   summarise(t_avg = mean(c(t1, t2, t3))))
#> # A tibble: 3 x 2
#>   name  t_avg
#>   <chr> <dbl>
#> 1 Abby      2
#> 2 Bess      5
#> 3 Carl      8
df %>%
  left_join(s1)
#> Joining, by = "name"
#> # A tibble: 3 x 5
#>   name     t1    t2    t3 t_avg
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abby      1     2     3     2
#> 2 Bess      4     5     6     5
#> 3 Carl      7     8     9     8

(s2 <- df %>%
   gather("time", "val", starts_with("t")) %>%
   group_by(name) %>%
   summarize(t_avg = mean(val)))
#> # A tibble: 3 x 2
#>   name  t_avg
#>   <chr> <dbl>
#> 1 Abby      2
#> 2 Bess      5
#> 3 Carl      8
df %>%
  left_join(s2)
#> Joining, by = "name"
#> # A tibble: 3 x 5
#>   name     t1    t2    t3 t_avg
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abby      1     2     3     2
#> 2 Bess      4     5     6     5
#> 3 Carl      7     8     9     8

(s3 <- df %>%
   column_to_rownames("name") %>%
   rowMeans() %>%
   enframe())
#> Warning: Setting row names on a tibble is deprecated.
#> # A tibble: 3 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Abby      2
#> 2 Bess      5
#> 3 Carl      8
df %>%
  left_join(s3)
#> Joining, by = "name"
#> # A tibble: 3 x 5
#>   name     t1    t2    t3 value
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abby      1     2     3     2
#> 2 Bess      4     5     6     5
#> 3 Carl      7     8     9     8
```
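
The second pattern above uses `gather()`, which tidyr has since superseded with `pivot_longer()`. Below is a sketch of the same reshape, summarise and join workflow under that assumption (tidyr 1.0.0 or later). The join key is spelled out with `by = "name"`, so no "Joining, by" message is printed. The names `time`, `val` and `s2` are carried over from the example above; `res` is illustrative.

``` r
library(tidyverse)

df <- tribble(
  ~ name, ~ t1, ~t2, ~t3,
  "Abby",    1,   2,   3,
  "Bess",    4,   5,   6,
  "Carl",    7,   8,   9
)

# Reshape to one row per (name, time) pair, then summarise per name.
# names_to is set explicitly because its default, "name", would clash
# with the existing `name` column.
s2 <- df %>%
  pivot_longer(starts_with("t"), names_to = "time", values_to = "val") %>%
  group_by(name) %>%
  summarise(t_avg = mean(val))

# Join the summary back in, naming the key explicitly.
res <- df %>%
  left_join(s2, by = "name")
```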

## Maybe you should use a matrix

If you truly have data where each row is:

- Identifier for this observational unit
- Homogeneous vector of length n for the unit

then you do want to use a matrix with rownames. I used to do this a lot
but found that practically none of my data analysis problems live in
this simple world for more than a couple of hours. Eventually I always
get back to a setting where a data frame is the most favorable
receptacle, overall. YMMV.

``` r
m <- matrix(
  1:9,
  byrow = TRUE, nrow = 3,
  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
)

cbind(m, rowsum = rowSums(m))
#>      t1 t2 t3 rowsum
#> Abby  1  2  3      6
#> Bess  4  5  6     15
#> Carl  7  8  9     24
cbind(m, rowmean = rowMeans(m))
#>      t1 t2 t3 rowmean
#> Abby  1  2  3       2
#> Bess  4  5  6       5
#> Carl  7  8  9       8
```
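
The original question also asks about row-wise standard deviations. Base R does not ship a `rowSds()` counterpart to `rowMeans()`, but with a matrix a row-wise `apply()` covers it. A minimal sketch, rebuilding `m` as above (the names `row_sd` and `rowsd` are just illustrative):

``` r
m <- matrix(
  1:9,
  byrow = TRUE, nrow = 3,
  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
)

# MARGIN = 1 applies sd() to each row; the result keeps the row names.
row_sd <- apply(m, 1, sd)

cbind(m, rowsd = row_sd)
```

The same `apply(m, 1, f)` shape works for any summary function `f` that takes a numeric vector and returns a single value.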
