let j = c(prefix = lapply(.SD, f)) work when optimized #2311
Comments
Because @franknarf1 posted this comment on an SO question that I believe is related to my own SO question ("Do we need to convert single elements of j to a list when the overall result of j is a list anyway?"), I am linking my corresponding issue here: "Possible inconsistencies in the autonaming and renaming of .N".
This issue also applies to concatenating a tagged list to a lapply(.SD, FUN) call inside of c(), but only if there's a by statement:
library(data.table)
mtcarsdt <- as.data.table(mtcars)
#without by and with lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(MPG=list(mpg), lapply(.SD,sum)), .SDcols=c("drat","wt")])
#> [1] "MPG" "drat" "wt"
#with by but no lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(MPG=list(mpg)), by="cyl",.SDcols=c("drat","wt")])
#> Warning in `[.data.table`(mtcarsdt, , c(MPG = list(mpg)), by = "cyl", .SDcols
#> = c("drat", : This j doesn't use .SD but .SDcols has been supplied.
#> Ignoring .SDcols. See ?data.table.
#> [1] "cyl" "MPG"
#with by and lapply(SD,FUN):
names(mtcarsdt[, c(MPG=list(mpg), lapply(.SD,sum)), by="cyl",.SDcols=c("drat","wt")])
#> [1] "cyl" "V1" "drat" "wt"
#without by and with lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec), lapply(.SD,sum)), .SDcols=c("drat","wt")])
#> [1] "mean.mpg" "mean2" "mean.qsec" "drat" "wt"
#with by but no lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec)), by="cyl", .SDcols=c("drat","wt")])
#> Warning in `[.data.table`(mtcarsdt, , c(mean = list(mpg = mpg, wt,
#> qsec = qsec)), : This j doesn't use .SD but .SDcols has been supplied.
#> Ignoring .SDcols. See ?data.table.
#> [1] "cyl" "mean.mpg" "mean2" "mean.qsec"
#with by and lapply(SD,FUN):
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec), lapply(.SD,sum)), by="cyl",.SDcols=c("drat","wt")])
#> [1] "cyl" "mpg" "V2" "qsec" "drat" "wt"
Created on 2021-01-31 by the reprex package (v0.3.0)
More weird behavior: when
library(data.table)
M <- as.data.table(mtcars)
names(M[, c(list(mpg),lapply(.SD, mean)), by="cyl"])
#> [1] "cyl" "V1" "mpg" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
#> [11] "gear" "carb"
names(M[, c(list(mpg),lapply(.SD, mean))])
#> [1] "V1" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
#> [11] "gear" "carb"
old = options(datatable.optimize = 0)
names(M[, c(list(mpg),lapply(.SD, mean)), by="cyl"])
#> [1] "cyl" "" "mpg" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
#> [11] "gear" "carb"
names(M[, c(list(mpg),lapply(.SD, mean))])
#> [1] "V1" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
#> [11] "gear" "carb"
Created on 2021-02-01 by the reprex package (v0.3.0)
One more bit of weirdness which is fixed in the PR:
In the PR, unnamed arguments of c() will now always get a corresponding return column named "V<position>".
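To make the "V<position>" rule concrete, here is a minimal base R sketch. The helper `fill_v_names()` is hypothetical and only illustrates the naming rule described above. It is not the PR's implementation and does not touch data.table internals.

```r
# Hypothetical helper: give every unnamed element the name "V<position>".
fill_v_names <- function(x) {
  nm <- names(x)
  if (is.null(nm)) nm <- rep("", length(x))  # fully unnamed list
  empty <- is.na(nm) | !nzchar(nm)           # elements lacking a name
  nm[empty] <- paste0("V", which(empty))     # position within the result
  names(x) <- nm
  x
}

# Stand-in for j = c(list(mpg), lapply(.SD, mean)): first element unnamed.
j_result <- c(list(1:3), list(mpg = 20.1, disp = 230.7))

names(j_result)                # "" "mpg" "disp"
names(fill_v_names(j_result))  # "V1" "mpg" "disp"
```

The empty name in the first call matches what the unoptimized grouped query above reports, and the second call shows the name the rule would assign.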
@myoung3 could you please add a unit test for that if it is not yet there?
@jangorecki oops, replied on the PR but yeah there's a unit test for this now.
For what it's worth I consider this a bugfix not an enhancement. The examples I've provided here, combined with the original motivating example, together demonstrate that the creation of column names with c() is inconsistent to the point of being broken.
It's also worth noting that fixing this will make it easier to apply multiple functions to all columns in .SD: Currently the following code has the problem that the resulting column names are indistinguishable (ie, .SD column names get repeated for mean and sum) because the "mean" and "sum" tags don't get prepended:
With the bugfix implemented, you'll actually be able to use the above code because the column names will be distinguishable and created in a predictable manner (ie consistent with base R).
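For reference, the base R behavior that "consistent with base R" points to can be checked without data.table. This is only a sketch: `mean_part` and `sum_part` are made-up stand-ins for the results of lapply(.SD, mean) and lapply(.SD, sum), with illustrative values.

```r
# Base R only: how c() builds names when its arguments are tagged lists.
mean_part <- list(mpg = 20.1, wt = 3.2)    # stand-in for lapply(.SD, mean)
sum_part  <- list(mpg = 642.9, wt = 103)   # stand-in for lapply(.SD, sum)

# Without tags the element names collide.
untagged <- c(mean_part, sum_part)
names(untagged)   # "mpg" "wt" "mpg" "wt"

# With tags, c() prepends "<tag>." to each element name.
tagged <- c(mean = mean_part, sum = sum_part)
names(tagged)     # "mean.mpg" "mean.wt" "sum.mpg" "sum.wt"

# An unnamed element inside a tagged list gets "<tag><position>".
mixed <- c(mean = list(mpg = 1, 2, qsec = 3))
names(mixed)      # "mean.mpg" "mean2" "mean.qsec"
```

The last result matches the names shown in the ungrouped examples earlier in the thread.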
Adding prefix support for
cube(data.table(mtcars),
j = c(lapply(.SD, mean), lapply(.SD, sum)),
by = c("cyl"),
.SDcols = c("mpg", "wt"))
# Error: There exists duplicated column names in the results, ensure the column passed/evaluated in `j` and those in `by` are not overlapping.
Modifying the above to
cube(data.table(mtcars),
j = c(mean = lapply(.SD, mean)),
by = "cyl",
.SDcols = c("mpg", "wt"))
#>      cyl      mpg       wt mean.mpg mean.wt
#>    <num>    <num>    <num>    <num>   <num>
#> 1:     6 19.74286 3.117143       NA      NA
#> 2:     4 26.66364 2.285727       NA      NA
#> 3:     8 15.10000 3.999214       NA      NA
#> 4:    NA       NA       NA 20.09062 3.21725
With
The prefix "sq" disappears in the latter case. Judging by the verbose=TRUE output, this is due to "lapply optimization", which takes names strictly from .SD.
(Sorry if I filed this issue already before this; I'm sure it's bothered me for a while. A recent case: I wanted to write IDDT = dat[order(-t), c(max = .SD[1]), by=ID], but no dice.)
Related: #1604