Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add keepby= to do what by= does now. #1880

Open
mattdowle opened this issue Oct 14, 2016 · 12 comments
Open

Add keepby= to do what by= does now. #1880

mattdowle opened this issue Oct 14, 2016 · 12 comments

Comments

@mattdowle
Copy link
Member

mattdowle commented Oct 14, 2016

For discussion ...

Currently by= returns the groups in the order each group first appeared. I think most often most people (including myself now) actually want the groups returned in group order by default; i.e. what keyby= does. Worse, some people think the groups are returned in random order and that that is for speed reasons; i.e. speed comes first. In fact, it actually takes data.table longer to return the groups in original order than it does to order the groups (it finds the order of the groups to find the groups in the first place rather than using hash tables). It has to do an extra step to get back to first appearance order. Of the users that have been left with the impression that data.table returns groups in random order for speed reasons, some of those then start to fear that the row order within the groups are not retained for speed reasons either. Nothing could be further from the truth.

NB1: We're not talking about the order of the rows within each group here at all. The order of the rows within each group is always retained. Always has been always will be. Set in stone. We're only talking about the order that the groups appear in the result returned. This FR doesn't apply when a := is present either; e.g. recall that the groups don't even have to be contiguous with a := by group.

NB2: When we say 'first appeared in the data' or 'order of the rows within each group' we mean after i has been applied, if present. Since DT[i, j, by] is the same as DT[i][,j,by].

Keeping the groups in first appearance order comes up for some/many users and is really important for them. In fact that's what I needed when I first created by= which is why I made that the default. But it can appear strange to others that the order that the groups first appear in the data should be relied on. Therefore, this FR is to create keepby= to do what by= does now. That way readers of the code in future will know that this query is expecting to keep first appearance order of groups retained. Retain the important ability, but make it clearer that the query uses it. I haven't seen this ability in other software so another but smaller motivating factor is to more easily explain that "data.table has keepby=".

Alternative keyword: batchby=. But to me that conveys batches defined by contiguous groups, like this answer where the same group values occur later but the user wishes them to be a separate group. For that, the new rleid() on the right hand side of by= makes most sense. That way it can be applied on a column basis.

For example, if this goes ahead, I'll change this answer to use keepby= to make it clearer to readers of that solution. That question requires something to be done depending on the previous group. See also #606.

Then over 2 years in the usual way, two options :

  1. Slowly deprecate both by= and unnamed 3rd argument inside [...]. The grouping argument would need to be explicitly named either keyby= or keepby= going forward.
    or
  2. Add option to change by= to do what keyby= does now, default FALSE. Then change default to TRUE. Then remove option. Then deprecate and remove keyby=. It would be by= or keepby= going forward. This avoids the need to explain that keyby= is a "by (which keeps the appearance order) followed by a setkey". That would be easier for new users but possibly harder migration path for existing. The default by= would be faster and more convenient in most cases. My guess is that very little existing usage relies on by= keeping group appearance order; mostly it's followed by an [order(...)] or something that doesn't depend on group appearance order. Anywhere that does rely on group appearance order should at least be changed to explicitly use keepby= to make that clearer, I'm thinking.
@franknarf1
Copy link
Contributor

franknarf1 commented Oct 14, 2016

I like the idea and would've never guessed speed consideration went the opposite way.

To me, keepby sounds like "keep the by columns in the result (as opposed to only showing the result of j)". This is similar to the meaning of keep.by in split.data.table and sounds complementary to the FR #1269 for returning only the by while excluding j.

So maybe a different arg name can be thought up...?

@mattdowle
Copy link
Member Author

mattdowle commented Oct 15, 2016

I think people don't raise it because they think it's fundamental to data.table's speed. @franknarf1's comment above is one example which shows one person didn't realize by= actually takes longer than keyby=. I don't have a benchmark. I could make one. It needs to have a lot of groups that are each together (so the random access to collect the groups together doesn't dominate the timing) but the group values are out of order, and types make a difference.
One example of an issue is the answer I pointed to is not very clear that it depends on knowing that by= keeps the group order. See the 2nd comment which doesn't realize that the solution already retains the appearance order of the groups within Patient: "Now, can you extend it so that: (a) it will work even if Treatment ID is not always in ascending order"? If it had been keepby= (or better name as Frank commented above) then it would have already been clearer that it already does retain the group order.

@mattdowle mattdowle added this to the v1.9.10 milestone Oct 15, 2016
@MichaelChirico
Copy link
Member

I'll read more in detail later; for now, count me among long-time DT users
who assumed keyby was slower than by.

On Oct 14, 2016 10:08 PM, "Matt Dowle" [email protected] wrote:

I think people don't raise it because they think it's fundamental to
data.table's speed. @franknarf1 https://github.com/franknarf1's comment
above is one example which shows one person didn't realize by= actually
takes longer than keyby=. I don't have a benchmark. I could make one. It
needs to have a lot of groups that are each together but the group values
are out of order, and types make a difference. One example of an issue is
the answer http://stackoverflow.com/a/8833899/403310 I pointed to is
not very clear that it depends on knowing that by= keeps the group order.
See the 2nd comment
http://stackoverflow.com/questions/8826652/sort-and-output-records-with-sas-and-r/8833899#comment11038681_8833899
from Josh who well versed with data.table and even he doesn't realize or
had forgotten that the solution already retains the appearance order of the
groups within Patient. Otherwise why did he write "Now, can you extend it
so that: (a) it will work even if Treatment ID is not always in ascending
order"?


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#1880 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AHQQdbeHaucENjs0b2MY9vQZaiVgmLsAks5q0DWLgaJpZM4KXgJC
.

@mattdowle
Copy link
Member Author

Ok so keyby= isn't faster than by=. But that's because keyby= is doing too much work. Here's a benchmark with the more detailed verbose messages I just added. Laptop sized data is enough to see it with lots of small groups.

require(data.table)
N=2e8
DT = data.table(id=sample(1e8,N,replace=TRUE), v=sample(100,N,replace=TRUE))

system.time(ans1 <- DT[,.(sum(v),.N),by=id,verbose=TRUE])
Detected that j uses these columns: v
Finding groups using forderv ... 5.046 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.076 sec
Getting back original order ... 8.449 sec
lapply optimization is on, j unchanged as 'list(sum(v), .N)'
GForce optimized j to 'list(gsum(v), .N)'
Making each group and running j (GForce TRUE) ... 7.596 secs
user system elapsed
20.644 0.592 21.236

system.time(ans2 <- DT[,.(sum(v),.N),keyby=id,verbose=TRUE])
Detected that j uses these columns: v
Finding groups using forderv ... 5.199 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.074 sec
Getting back original order ... 8.628 sec
lapply optimization is on, j unchanged as 'list(sum(v), .N)'
GForce optimized j to 'list(gsum(v), .N)'
Making each group and running j (GForce TRUE) ... 7.165 secs
setkey() afterwards ... 4.683 secs
user system elapsed
25.148 0.672 25.819

identical(setkey(ans1,id), ans2)
[1] TRUE

I don't see why keyby= does those two lines in bold. If we just remove them in the correct way it should be the same result, and twice as fast (25s down to 12s in this small example). Then keyby= would be faster than by= as I thought it was already.

Just as an aside really.

@eantonya
Copy link
Contributor

I use keyby about as frequently as I use by, but even when I don't I rarely need to keep the original order, so would likely benefit from this change.

That said, I always thought that keeping the order is a really cool feature of data.table, and this change is weirding me out a little. And, as Arun pointed out, how does one reconcile sorting with by=.EACHI, which imo clearly shouldn't be sorted?

@ben519
Copy link

ben519 commented Oct 15, 2016

Add me to that list of users who assumed keyby was slower than by (although, I guess it was true, but shouldn't have been true.)

I'd also be a little sad to see the behavior of by change. (I'm pretty sure it's going to silently break some code I wrote for my old company.)

I think I'd rather keep by as is and add an argument like keeporder set to TRUE by default to determine whether or not to retain group order.

@mattdowle
Copy link
Member Author

mattdowle commented Oct 19, 2016

With that change just now to correct keyby= performance (for this release 1.9.8), new result is as expected.

require(data.table)
N=2e8
DT = data.table(id=sample(1e8,N,replace=TRUE), v=sample(100,N,replace=TRUE))

system.time(ans1 <- DT[,.(sum(v),.N),by=id,verbose=TRUE])
Detected that j uses these columns: v
Finding groups using forderv ... 5.256 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.071 sec
Getting back original order ... 6.602 sec
lapply optimization is on, j unchanged as 'list(sum(v), .N)'
GForce optimized j to 'list(gsum(v), .N)'
Making each group and running j (GForce TRUE) ... 7.449 secs
user system elapsed
18.724 0.712 19.437

system.time(ans2 <- DT[,.(sum(v),.N),keyby=id,verbose=TRUE])
Detected that j uses these columns: v
Finding groups using forderv ... 5.216 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.073 sec
lapply optimization is on, j unchanged as 'list(sum(v), .N)'
GForce optimized j to 'list(gsum(v), .N)'
Making each group and running j (GForce TRUE) ... 7.682 secs
user system elapsed
12.600 0.424 13.024

identical(setkey(ans1,id), ans2)
[1] TRUE

That's just as an aside to this issue which is still up for discussion for distant future if at all.

@ghost
Copy link

ghost commented Nov 30, 2016

Never knew that keyby was faster, but in general I find the output of the sorted keyby more logical in nearly all cases. So I am for option 2:

Add option to change by= to do what keyby= does now, default FALSE. Then change default to TRUE. Then remove option.

because that saves typing - keyby is what by should have been, so to speak.

@mattdowle mattdowle modified the milestones: v1.10.4, v1.10.2 Dec 9, 2016
@MichaelChirico
Copy link
Member

IIUC this is no longer true but still in ?data.table:

keyby : Same as by, but with an additional setkey() run on the by columns of the result, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted.

@jangorecki
Copy link
Member

jangorecki commented Oct 30, 2018

Presently with the recent changes. Note that those timings are likely to change in near future (before next stable release). Posting to present as is at the moment.

library(data.table)
file = "G1_1e8_1e2.csv" # G1_1e9_1e2.csv
X = fread(file)
X[1L,]
system.time(ans<-X[, .(v1=sum(v1)), keyby=id1])[["elapsed"]]
system.time(ans<-X[, .(v1=sum(v1)), keyby=.(id1, id2)])[["elapsed"]]
system.time(ans<-X[, .(v1=sum(v1), v3=mean(v3)), keyby=id3])[["elapsed"]]
system.time(ans<-X[, lapply(.SD, mean), keyby=id4, .SDcols=v1:v3])[["elapsed"]]
system.time(ans<-X[, lapply(.SD, sum), keyby=id6, .SDcols=v1:v3])[["elapsed"]]
library(data.table)
file = "G1_1e8_1e2.csv" # G1_1e9_1e2.csv
X = fread(file)
X[1L,]
system.time(ans<-X[, .(v1=sum(v1)), by=id1])[["elapsed"]]
system.time(ans<-X[, .(v1=sum(v1)), by=.(id1, id2)])[["elapsed"]]
system.time(ans<-X[, .(v1=sum(v1), v3=mean(v3)), by=id3])[["elapsed"]]
system.time(ans<-X[, lapply(.SD, mean), by=id4, .SDcols=v1:v3])[["elapsed"]]
system.time(ans<-X[, lapply(.SD, sum), by=id6, .SDcols=v1:v3])[["elapsed"]]
data.table::fread("
in_rows,scenario,question,elapsed_sec
1e8,keyby,q1,2.533
1e8,keyby,q2,4.362
1e8,keyby,q3,12.114
1e8,keyby,q4,2.596
1e8,keyby,q5,8.217
1e8,by,q1,2.275
1e8,by,q2,3.356
1e8,by,q3,8.714
1e8,by,q4,3.077
1e8,by,q5,9.317
1e9,keyby,q1,19.018
1e9,keyby,q2,28.200
1e9,keyby,q3,106.692
1e9,keyby,q4,27.114
1e9,keyby,q5,83.962
1e9,by,q1,18.377
1e9,by,q2,26.407
1e9,by,q3,79.512
1e9,by,q4,25.027
1e9,by,q5,80.496
") -> d
data.table::dcast(
  d, question + in_rows ~ scenario, value.var="elapsed_sec"
  )[, keyby_by:=(keyby-by)/by][, knitr::kable(.SD)]
|question | in_rows|     by|   keyby|   keyby_by|
|:--------|-------:|------:|-------:|----------:|
|q1       |   1e+08|  2.275|   2.533|  0.1134066|
|q1       |   1e+09| 18.377|  19.018|  0.0348806|
|q2       |   1e+08|  3.356|   4.362|  0.2997616|
|q2       |   1e+09| 26.407|  28.200|  0.0678987|
|q3       |   1e+08|  8.714|  12.114|  0.3901767|
|q3       |   1e+09| 79.512| 106.692|  0.3418352|
|q4       |   1e+08|  3.077|   2.596| -0.1563211|
|q4       |   1e+09| 25.027|  27.114|  0.0833899|
|q5       |   1e+08|  9.317|   8.217| -0.1180638|
|q5       |   1e+09| 80.496|  83.962|  0.0430580|

As we can see speed depends on query type, and datasize, but most of the time by is faster, and scales better with number of groups.

@MichaelChirico
Copy link
Member

Why not instead add an orderby argument which can take a keyword like orderby = 'KEEP' (or similar) to accomplish what keepby would?

orderby also has many other use cases -- I often find myself in exploratory mode doing stuff like

DT[ , .N, by = grp][order(N)]

Could become

DT[ , .N, by = grp, orderby = 'N']

@jangorecki
Copy link
Member

I think retaining original order is a nice feature. I like the fact that we do it by default.
Being backward compatible is another argument to not change default behavior of that.
In 2018 by was faster than keyby, now again keyby seems to be faster than by. The difference is not really big, I would say if one is not interested in original order of data then it probably better to have data sorted for any further operation to benefit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants