Add keepby= to do what by= does now. #1880
Comments
I like the idea and would've never guessed speed consideration went the opposite way. To me, So maybe a different arg name can be thought up...? |
I think people don't raise it because they think it's fundamental to data.table's speed. @franknarf1's comment above is one example which shows one person didn't realize |
I'll read more in detail later; for now, count me among long-time DT users On Oct 14, 2016 10:08 PM, "Matt Dowle" [email protected] wrote:
|
Ok so
I don't see why Just as an aside really. |
I use That said, I always thought that keeping the order is a really cool feature of |
Add me to that list of users who assumed I'd also be a little sad to see the behavior of I think I'd rather keep |
With that change just now to correct
That's just as an aside to this issue which is still up for discussion for distant future if at all. |
Never knew that keyby was faster, but in general I find the output of the sorted
because that saves typing - keyby is what by should have been, so to speak. |
IIUC this is no longer true but still in
|
Presently, with the recent changes. Note that these timings are likely to change in the near future (before the next stable release). Posting them as they stand at the moment.

# Scenario: keyby=
library(data.table)
file = "G1_1e8_1e2.csv" # G1_1e9_1e2.csv
X = fread(file)
X[1L,]
system.time(ans<-X[, .(v1=sum(v1)), keyby=id1])[["elapsed"]]
system.time(ans<-X[, .(v1=sum(v1)), keyby=.(id1, id2)])[["elapsed"]]
system.time(ans<-X[, .(v1=sum(v1), v3=mean(v3)), keyby=id3])[["elapsed"]]
system.time(ans<-X[, lapply(.SD, mean), keyby=id4, .SDcols=v1:v3])[["elapsed"]]
system.time(ans<-X[, lapply(.SD, sum), keyby=id6, .SDcols=v1:v3])[["elapsed"]]

# Scenario: by=
library(data.table)
file = "G1_1e8_1e2.csv" # G1_1e9_1e2.csv
X = fread(file)
X[1L,]
system.time(ans<-X[, .(v1=sum(v1)), by=id1])[["elapsed"]]
system.time(ans<-X[, .(v1=sum(v1)), by=.(id1, id2)])[["elapsed"]]
system.time(ans<-X[, .(v1=sum(v1), v3=mean(v3)), by=id3])[["elapsed"]]
system.time(ans<-X[, lapply(.SD, mean), by=id4, .SDcols=v1:v3])[["elapsed"]]
system.time(ans<-X[, lapply(.SD, sum), by=id6, .SDcols=v1:v3])[["elapsed"]]

# Timings collected from the runs above, with the relative difference (keyby-by)/by
data.table::fread("
in_rows,scenario,question,elapsed_sec
1e8,keyby,q1,2.533
1e8,keyby,q2,4.362
1e8,keyby,q3,12.114
1e8,keyby,q4,2.596
1e8,keyby,q5,8.217
1e8,by,q1,2.275
1e8,by,q2,3.356
1e8,by,q3,8.714
1e8,by,q4,3.077
1e8,by,q5,9.317
1e9,keyby,q1,19.018
1e9,keyby,q2,28.200
1e9,keyby,q3,106.692
1e9,keyby,q4,27.114
1e9,keyby,q5,83.962
1e9,by,q1,18.377
1e9,by,q2,26.407
1e9,by,q3,79.512
1e9,by,q4,25.027
1e9,by,q5,80.496
") -> d
data.table::dcast(
d, question + in_rows ~ scenario, value.var="elapsed_sec"
)[, keyby_by:=(keyby-by)/by][, knitr::kable(.SD)]
As we can see, speed depends on query type and data size, but most of the time |
Why not instead add an
Could become
|
I think retaining original order is a nice feature. I like the fact that we do it by default. |
For discussion ...
Currently `by=` returns the groups in the order each group first appeared. I think most often most people (including myself now) actually want the groups returned in group order by default; i.e. what `keyby=` does. Worse, some people think the groups are returned in random order and that that is for speed reasons; i.e. speed comes first. In fact, it actually takes data.table longer to return the groups in original order than it does to order the groups (it finds the order of the groups to find the groups in the first place rather than using hash tables). It has to do an extra step to get back to first appearance order. Of the users that have been left with the impression that data.table returns groups in random order for speed reasons, some of those then start to fear that the row order within the groups is not retained for speed reasons either. Nothing could be further from the truth.

NB1: We're not talking about the order of the rows within each group here at all. The order of the rows within each group is always retained. Always has been, always will be. Set in stone. We're only talking about the order that the groups appear in the result returned. This FR doesn't apply when a `:=` is present either; e.g. recall that the groups don't even have to be contiguous with a `:=` by group.

NB2: When we say 'first appeared in the data' or 'order of the rows within each group' we mean after `i` has been applied, if present. Since `DT[i, j, by]` is the same as `DT[i][, j, by]`.

Keeping the groups in first appearance order comes up for some/many users and is really important for them. In fact that's what I needed when I first created `by=`, which is why I made that the default. But it can appear strange to others that the order that the groups first appear in the data should be relied on. Therefore, this FR is to create `keepby=` to do what `by=` does now. That way readers of the code in future will know that this query is expecting first appearance order of groups to be retained. Retain the important ability, but make it clearer that the query uses it. I haven't seen this ability in other software, so another but smaller motivating factor is to more easily explain that "data.table has `keepby=`".

Alternative keyword: `batchby=`. But to me that conveys batches defined by contiguous groups, like this answer where the same group values occur later but the user wishes them to be a separate group. For that, the new `rleid()` on the right hand side of `by=` makes most sense. That way it can be applied on a column basis.

For example, if this goes ahead, I'll change this answer to use `keepby=` to make it clearer to readers of that solution. That question requires something to be done depending on the previous group. See also #606.

Then over 2 years in the usual way, two options :
1. Deprecate and remove `by=` and the unnamed 3rd argument inside `[...]`. The grouping argument would need to be explicitly named either `keyby=` or `keepby=` going forward.

or

2. Add an option for `by=` to do what `keyby=` does now, default FALSE. Then change default to TRUE. Then remove option. Then deprecate and remove `keyby=`. It would be `by=` or `keepby=` going forward. This avoids the need to explain that `keyby=` is a "by (which keeps the appearance order) followed by a setkey". That would be easier for new users but possibly a harder migration path for existing users. The default `by=` would be faster and more convenient in most cases. My guess is that very little existing usage relies on `by=` keeping group appearance order; mostly it's followed by an `[order(...)]` or something that doesn't depend on group appearance order. Anywhere that does rely on group appearance order should at least be changed to explicitly use `keepby=` to make that clearer, I'm thinking.
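To make the distinctions above concrete, here is a minimal sketch on toy data. It uses only arguments that exist today (`by=`, `keyby=`, and `rleid()`); the proposed `keepby=` is not implemented and is not used. The table `DT` and its columns `g` and `v` are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical): group "b" appears first, then "a", then "b" again.
DT <- data.table(
  g = c("b", "b", "a", "a", "b"),
  v = 1:5
)

# by= returns groups in order of first appearance: "b" (8), then "a" (7).
res_by <- DT[, .(total = sum(v)), by = g]

# keyby= returns groups sorted and sets the key on the result: "a" (7), then "b" (8).
res_keyby <- DT[, .(total = sum(v)), keyby = g]

# NB1: the row order within each group is retained either way.
# Group "b" sees its rows as 1, 2, 5 and group "a" as 3, 4.
rows_by <- DT[, .(vs = paste(v, collapse = ",")), by = g]

# NB2: grouping happens after i has been applied, so these two give the same result.
filtered_once  <- DT[v > 1L, .(total = sum(v)), by = g]
filtered_twice <- DT[v > 1L][, .(total = sum(v)), by = g]

# rleid() on the right hand side of by= treats each contiguous run as its own
# group, so the later "b" row forms a separate batch (totals 3, 7, 5).
run_groups <- DT[, .(g = g[1L], total = sum(v)), by = .(run = rleid(g))]

print(res_by)
print(res_keyby)
print(run_groups)
```

Under the proposal, the `res_by` query is the one that would be written with `keepby=` to signal that first appearance order is being relied on.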