Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: function to add "totals" as an indexed group for summarizing #3463

Closed
sfirke opened this issue Mar 26, 2018 · 9 comments
Closed
Labels
feature a feature request or enhancement

Comments

@sfirke
Copy link
Contributor

sfirke commented Mar 26, 2018

(this is a generalization and extension of #1253)

This function would add a group to a grouped_df, called "Total" (or maybe a user-specified name), that has all row numbers in its index. Once this group is added to the tibble, summarise functions would treat it like any other group.

Hypothetical example:

mtcars %>%
  group_by(cyl) %>%
  add_totals_group() %>%
  summarise(avg_mpg = mean(mpg),
            n = n())

# A tibble: 4 x 3
  cyl   avg_mpg     n
  <chr>   <dbl> <int>
1 4        26.7    11
2 6        19.7     7
3 8        15.1    14
4 Total    20.1    32

The actual code to produce this hypothetical result is:

rbind(
  mtcars %>%
    group_by(cyl) %>%
    summarise(avg_mpg = mean(mpg),
              n = n()),
  mtcars %>%
    summarise(avg_mpg = mean(mpg),
              n = n()) %>%
    mutate(cyl = "Total")
)

I started working on it but ran into internal dplyr validation checks that don't like my "corrupt" grouped_df. Here's my non-working attempt to tinker with the group indices:

add_totals_group <- function(dat){
  cur_indices <- attr(dat, "indices")
  attr(dat, "indices")[[length(cur_indices) + 1]] <- seq(nrow(dat))
  attr(dat, "group_sizes") <- c(attr(dat, "group_sizes"), nrow(dat))
  attr(dat, "biggest_group_size") <- nrow(dat)
  attr(dat, "labels") <- rbind(rep("Total", ncol(dat)))
  dat
}
@romainfrancois
Copy link
Member

That's the kind of things tidy evaluation can help you build.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

yada <- function(data, ... ){
  groups <- quos(...)

  structure(
    list(
      with = group_by(data, !!!groups),
      without = data
    ),
    class = "grouped_and_not_grouped"
  )

}

summarise.grouped_and_not_grouped <- function(data, ...){
  exprs <- quos(...)

  with    <- summarise( data$with, !!!exprs )
  without <- summarise( data$without, !!!exprs )

  bind_rows( with, without )

}

mtcars %>%
  yada(cyl) %>%
  summarise( avg_mpg = mean(mpg), n = n())
#> # A tibble: 4 x 3
#>     cyl avg_mpg     n
#>   <dbl>   <dbl> <int>
#> 1    4.    26.7    11
#> 2    6.    19.7     7
#> 3    8.    15.1    14
#> 4   NA     20.1    32

Might need some extra help with the NA, but I don't think this should be in dplyr.

@sfirke
Copy link
Contributor Author

sfirke commented Mar 28, 2018

I appreciate the example with tidyeval. But compared to using the grouping internals of dplyr, this feels heavy and brittle. Adding an index group would allow for a versatile, natural user experience, compared to another package taking this approach. For instance this fails:

mtcars %>%
  yada(cyl) %>%
  filter(mpg < 30) %>% 
  summarise( avg_mpg = mean(mpg), n = n())

Because we're now relying on another class. But would work if done through the grouping indices. And something simple like this would work:

mtcars %>%
  group_by(cyl) %>%
  add_total_grp() %>%
  tally()

Without needing to define a new class in a new package and write S3 methods for all the dplyr verbs. Not to mention other tidyverse functions that take advantage of grouping.

As an outsider, the approach of using the grouping indices is really appealing for the user experience it would provide. But I am ignorant of how feasible that is on the back end. How difficult would it be to accommodate a total group? If the answer is that it would require a massive rewrite, or it's impossible, or "tricky and simply not a priority", I can accept that.

@romainfrancois
Copy link
Member

I am also concerned about the change of class in your original example where the cyl column became a character to accommodate for Total.

@sfirke
Copy link
Contributor Author

sfirke commented Mar 28, 2018

Yep. To me that seems inevitable with introducing an all group. It can't be NA because there could be cyl values of NA already. But maybe someone else has an idea for how to address that.

@romainfrancois
Copy link
Member

About your initial add_totals_group, mind that dplyr uses 0 based indices, so I guess you'd need to use something like seq(nrow(dat)) - 1L

add_totals_group <- function(dat){
  cur_indices <- attr(dat, "indices")
  attr(dat, "indices")[[length(cur_indices) + 1]] <- seq(nrow(dat)) - 1L
  attr(dat, "group_sizes") <- c(attr(dat, "group_sizes"), nrow(dat))
  attr(dat, "biggest_group_size") <- nrow(dat)
  attr(dat, "labels") <- rbind(rep("Total", ncol(dat)))
  dat
}

not sure that's enough to fool it.

@krlmlr
Copy link
Member

krlmlr commented Apr 1, 2018

I'd like to treat attributes, in particular for grouped df-s, as private to dplyr.

As Romain has shown, it is now easy to build a custom class that will forward the same operation to both a grouped and an ungrouped version of the same table, and produce a nice output. I agree with Romain's and Hadley's original comment in #1253 that this feels out of scope for dplyr.

@sfirke
Copy link
Contributor Author

sfirke commented Apr 2, 2018

👍 Thanks Roman and Kirill for considering and Romain for the proof of concept example to get something started. To anyone who sees this issue in the future: if you draft a package with these totals functions I'll user-test it 😀

@krlmlr krlmlr added feature a feature request or enhancement data frame labels Apr 9, 2018
@krlmlr
Copy link
Member

krlmlr commented Apr 9, 2018

Thanks!

@krlmlr krlmlr closed this as completed Apr 9, 2018
@lock
Copy link

lock bot commented Oct 7, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants