#' ---
#' title: "Row-wise Summaries"
#' author: "Jenny Bryan"
#' date: "`r format(Sys.Date())`"
#' output: github_document
#' ---
#+ setup, include = FALSE, cache = FALSE
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = TRUE
)
options(tidyverse.quiet = TRUE)
#' > For rowSums, mtcars %>% mutate(rowsum = pmap_dbl(., sum)) works but is
#' > a tidy oneliner for mean or sd per row?
#' > I'm looking for a tidy version of rowSums, rowMeans and similarly rowSDs...
#'
#' [Two](https://twitter.com/vrnijs/status/995129678284255233)
#' [tweets](https://twitter.com/vrnijs/status/995193240864178177) from Vincent
#' Nijs [github](https://github.com/vnijs),
#' [twitter](https://twitter.com/vrnijs)
#'
#' Good question! This also came up when I was originally casting about for
#' genuine row-wise operations, but I never worked it up. I will do so now!
#' First I set up my example.
#'
#+ body
# ----
library(tidyverse)
df <- tribble(
  ~name,  ~t1, ~t2, ~t3,
  "Abby",   1,   2,   3,
  "Bess",   4,   5,   6,
  "Carl",   7,   8,   9
)
#' ## Use `rowSums()` and `rowMeans()` inside `dplyr::mutate()`
#'
#' One "tidy version" of `rowSums()` is to ... just stick `rowSums()` inside a
#' tidyverse pipeline. You can use `rowSums()` and `rowMeans()` inside
#' `mutate()`, because they accept a data frame whose columns are all numeric:
df %>%
  mutate(t_sum = rowSums(select_if(., is.numeric)))
df %>%
  mutate(t_avg = rowMeans(select(., -name)))
#' Above I also demonstrate the use of `select(., SOME_EXPRESSION)` to express
#' which variables should be computed on. This comes up a lot in row-wise work
#' with a data frame, because, almost by definition, your variables are of mixed
#' type. These are just a few examples of the different ways to say "use `t1`,
#' `t2`, and `t3`", so we don't try to sum or average `name`. I'll continue to
#' mix these in as we go. They are equally useful when expressing which
#' variables should be forwarded to `.f` inside `pmap_*()`.
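#'
#' If your dplyr is 1.0.0 or later (an assumption about your setup), `across()`
#' is another way to say which columns to compute on, without the `.` pronoun.
#' A minimal sketch, not the only way to do it; `scores` is a hypothetical
#' stand-in for `df` so that this chunk runs on its own.
library(dplyr)
# Hypothetical copy of the example data
scores <- tibble(
  name = c("Abby", "Bess", "Carl"),
  t1 = c(1, 4, 7),
  t2 = c(2, 5, 8),
  t3 = c(3, 6, 9)
)
scores_summary <- scores %>%
  mutate(
    # all numeric columns present at this point: t1, t2, t3
    t_sum = rowSums(across(where(is.numeric))),
    # an explicit range keeps this selection independent of the new t_sum
    t_avg = rowMeans(across(t1:t3))
  )
scores_summary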
#'
#' ## Devil's Advocate: can't you just use `rowMeans()` and `rowSums()` alone?
#'
#' This is a great point [raised by Diogo
#' Camacho](https://twitter.com/DiogoMCamacho/status/996178967647412224). If
#' `rowSums()` and `rowMeans()` get the job done, why put yourself through the
#' pain of using `pmap()`, especially inside `mutate()`?
#'
#' There are a few reasons:
#'
#' * You might want to take the median or standard deviation instead of a mean
#' or a sum. You can't assume that base R or an add-on package offers a row-wise
#' `data.frame` method for every function you might need.
#' * You might have several variables besides `name` that need to be retained,
#' but that should not be forwarded to `rowSums()` or `rowMeans()`. A
#' matrix-with-row-names grants you a reprieve for exactly one variable, and that
#' variable had best not be integer, factor, date, or datetime, because you must
#' store it as character. It's not a general solution.
#' * Correctness. If you extract the numeric columns or the variables whose
#' names start with `"t"`, compute `rowMeans()` on them, and then column-bind
#' the result back to the data, you are responsible for making sure that the two
#' objects are absolutely, positively row-aligned.
#'
#' I think it's important to have a general strategy for row-wise computation on
#' a subset of the columns in a data frame.
#'
#' ## How to use an arbitrary function inside `pmap()`
#'
#' What if you need to apply `foo()` to rows and the universe has not provided a
#' special-purpose `rowFoos()` function? Now you do need to use `pmap()` or a
#' type-stable variant, with `foo()` playing the role of `.f`.
#'
#' This works especially well with `sum()`.
df %>%
  mutate(t_sum = pmap_dbl(list(t1, t2, t3), sum))
df %>%
  mutate(t_sum = pmap_dbl(select(., starts_with("t")), sum))
#' But the original question was about means and standard deviations! Why is
#' that any different? Look at the signature of `sum()` versus a few other
#' numerical summaries:
#'
#+ eval = FALSE
sum(..., na.rm = FALSE)
mean(x, trim = 0, na.rm = FALSE, ...)
median(x, na.rm = FALSE, ...)
var(x, y = NULL, na.rm = FALSE, use)
#' `sum()` is especially `pmap()`-friendly because it takes `...` as its primary
#' argument. In contrast, `mean()` takes a vector `x` as primary argument, which
#' makes it harder to just drop into `pmap()`. This is something you might never
#' think about if you're used to using special-purpose helpers like
#' `rowMeans()`.
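#'
#' A quick base R illustration of the mismatch, with variable names made up for
#' this sketch. When `mean()` receives each value as a separate argument, only
#' the first one lands in `x`; the others are matched to `trim` and `na.rm`, so
#' you silently get the first value back instead of the mean.
# Values passed as separate arguments: only the first is treated as data
mean_of_dots <- mean(1, 2, 3)
# Values packaged into one vector: the intended mean
mean_of_vector <- mean(c(1, 2, 3))
# sum() takes `...`, so both spellings agree
sum_of_dots <- sum(1, 2, 3)
sum_of_vector <- sum(c(1, 2, 3))
c(mean_of_dots, mean_of_vector, sum_of_dots, sum_of_vector)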
#'
#' purrr has a family of `lift_*()` functions that help you convert between
#' these forms. Here I apply `purrr::lift_vd()` to `mean()`, so I can use it
#' inside `pmap()`. The "vd" says I want to convert a function that takes a
#' "**v**ector" into one that takes "**d**ots".
df %>%
  mutate(t_avg = pmap_dbl(list(t1, t2, t3), lift_vd(mean)))
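#' Recent purrr documentation marks the `lift_*()` family as deprecated (as of
#' purrr 1.0.0), so here is a sketch of an alternative that needs no lifting:
#' wrap the summary in a small function that collects its dots with `c()`.
#' `scores` and `row_stats` are hypothetical names used so the chunk is
#' self-contained.
library(dplyr)
library(purrr)
# Hypothetical copy of the example data
scores <- tibble(
  name = c("Abby", "Bess", "Carl"),
  t1 = c(1, 4, 7),
  t2 = c(2, 5, 8),
  t3 = c(3, 6, 9)
)
row_stats <- scores %>%
  mutate(
    # c(...) packs one row's values into the vector that mean() and sd() expect
    t_avg = pmap_dbl(list(t1, t2, t3), function(...) mean(c(...))),
    t_sd = pmap_dbl(list(t1, t2, t3), function(...) sd(c(...)))
  )
row_stats
#' The same wrapper works for any function whose primary argument is a vector.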
#' ## Strategies that use reshaping and joins
#'
#' Data frames simply aren't a convenient storage format if you frequently need
#' to compute row-wise summaries on a subset of columns. That need strongly
#' suggests that your data is in the wrong shape, i.e. it's not tidy. Here we
#' explore some approaches that rely on reshaping and/or joining. They are more
#' transparent than using `lift_*()` with `pmap()` inside `mutate()` and,
#' consequently, more verbose.
#'
#' They all rely on forming row-wise summaries, then joining back to the data.
#'
#' ### Gather, group, summarize
(s <- df %>%
    gather("time", "val", starts_with("t")) %>%
    group_by(name) %>%
    summarize(t_avg = mean(val), t_sum = sum(val)))
df %>%
  left_join(s)
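#' The tidyr documentation describes `gather()` as superseded by
#' `pivot_longer()`. Assuming tidyr 1.0.0 or later, the same summary might be
#' sketched as follows; `scores`, `long_summary`, and `scores_joined` are
#' hypothetical names, and `by = "name"` makes the join key explicit.
library(dplyr)
library(tidyr)
# Hypothetical copy of the example data
scores <- tibble(
  name = c("Abby", "Bess", "Carl"),
  t1 = c(1, 4, 7),
  t2 = c(2, 5, 8),
  t3 = c(3, 6, 9)
)
# Reshape to one row per (name, time) pair, then summarise within name
long_summary <- scores %>%
  pivot_longer(starts_with("t"), names_to = "time", values_to = "val") %>%
  group_by(name) %>%
  summarise(t_avg = mean(val), t_sum = sum(val))
# Join the per-name summaries back onto the original rows
scores_joined <- scores %>%
  left_join(long_summary, by = "name")
scores_joined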
#' ### Group then summarise, with explicit `c()`
(s <- df %>%
    group_by(name) %>%
    summarise(t_avg = mean(c(t1, t2, t3))))
df %>%
  left_join(s)
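#' Grouping by `name` only isolates individual rows when `name` is unique. A
#' sketch of a variant that does not depend on that, assuming dplyr 1.0.0 or
#' later: `rowwise()` treats every row as its own group, and `c_across()`
#' collects the selected columns into one vector per row. `scores` and
#' `rowwise_summary` are hypothetical names.
library(dplyr)
# Hypothetical copy of the example data
scores <- tibble(
  name = c("Abby", "Bess", "Carl"),
  t1 = c(1, 4, 7),
  t2 = c(2, 5, 8),
  t3 = c(3, 6, 9)
)
rowwise_summary <- scores %>%
  rowwise() %>%
  mutate(
    t_avg = mean(c_across(t1:t3)),
    t_med = median(c_across(t1:t3)),
    t_sd = sd(c_across(t1:t3))
  ) %>%
  ungroup() # drop the row-wise grouping so later verbs behave normally
rowwise_summary
#' Because the summaries are added with `mutate()`, no join is needed.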
#' ### Nesting
#'
#' Let's revisit a pattern from
#' [`ex08_nesting-is-good`](ex08_nesting-is-good.md). This is another way to
#' "package" up the values of `t1`, `t2`, and `t3` in a way that makes both
#' `mean()` and `sum()` happy. *thanks @krlmlr*
(s <- df %>%
    gather("key", "value", -name) %>%
    nest(-name) %>%
    mutate(
      sum = map(data, "value") %>% map_dbl(sum),
      mean = map(data, "value") %>% map_dbl(mean)
    ) %>%
    select(-data))
df %>%
  left_join(s)
#' ### Yet another way to use `rowMeans()`
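# enframe() returns columns called `name` and `value`, so the join below
# matches on `name` and the row means arrive in a column called `value`
(s <- df %>%
    column_to_rownames("name") %>%
    rowMeans() %>%
    enframe())
df %>%
  left_join(s)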
#' ## Maybe you should use a matrix
#'
#' If you truly have data where each row is:
#'
#' * Identifier for this observational unit
#' * Homogeneous vector of length n for the unit
#'
#' then you do want to use a matrix with rownames. I used to do this a lot but
#' found that practically none of my data analysis problems live in this simple
#' world for more than a couple of hours. Eventually I always get back to a
#' setting where a data frame is the most favorable receptacle, overall. YMMV.
m <- matrix(
  1:9,
  byrow = TRUE, nrow = 3,
  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
)
cbind(m, rowsum = rowSums(m))
cbind(m, rowmean = rowMeans(m))
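#' For a summary that has no dedicated `row*()` helper, base R's `apply()` with
#' `MARGIN = 1` applies a function to each row of a matrix. A small sketch;
#' `scores_matrix`, `row_sd`, and `row_median` are hypothetical names, and the
#' matrix is rebuilt here so the chunk runs on its own.
scores_matrix <- matrix(
  1:9,
  byrow = TRUE, nrow = 3,
  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
)
# MARGIN = 1 means "one call per row"; the result keeps the row names
row_sd <- apply(scores_matrix, 1, sd)
row_median <- apply(scores_matrix, 1, median)
cbind(scores_matrix, rowsd = row_sd, rowmedian = row_median)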