Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

let j = c(prefix = lapply(.SD, f)) work when optimized #2311

Closed
franknarf1 opened this issue Aug 18, 2017 · 9 comments · Fixed by #4883
Closed

let j = c(prefix = lapply(.SD, f)) work when optimized #2311

franknarf1 opened this issue Aug 18, 2017 · 9 comments · Fixed by #4883
Labels
enhancement top request One of our most-requested issues

Comments

@franknarf1
Copy link
Contributor

With

library(data.table)
DT = data.table(a = 1, b = 2)

DT[, c(sq = lapply(.SD, "^", 2))]
#    sq.a sq.b
# 1:    1    4
DT[, c(sq = lapply(.SD, "^", 2)), by=1]
#    NA a b
# 1:  1 1 4

The prefix "sq" disappears in the latter case. Judging by the verbose=TRUE output, this is due to "lapply optimization" which takes names strictly from .SD.

(Sorry if I filed this issue already before this; I'm sure it's bothered me for a while. A recent case: I wanted to write IDDT = dat[order(-t), c(max = .SD[1]), by=ID], but no dice.)

Related: #1604

@Henrik-P
Copy link

Henrik-P commented Sep 4, 2017

Because @franknarf1 posted this comment on a SO question which I believe is related to my SO question: Do we need to convert single elements of j to a list when the overall result of j is a list anyway?, I link to my corresponding issue: Possible inconsistencies in the autonaming and renaming of .N

@mattdowle mattdowle added this to the Candidate milestone Sep 4, 2017
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@jangorecki jangorecki added this to the 1.13.0 milestone Sep 21, 2019
@jangorecki jangorecki changed the title [Request] let j = c(prefix = lapply(.SD, f)) work when optimized let j = c(prefix = lapply(.SD, f)) work when optimized Sep 21, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@jangorecki jangorecki removed the High label Jun 8, 2020
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@myoung3
Copy link
Contributor

myoung3 commented Jan 31, 2021

This issue also applies to concatenating a tagged list to a lapply(..SD, FUN) call inside of c(), but only if there's a by statement:

library(data.table)
mtcarsdt <- as.data.table(mtcars)

#without by and with lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(MPG=list(mpg), lapply(.SD,sum)), .SDcols=c("drat","wt")])
#> [1] "MPG"  "drat" "wt"

#with by but no lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(MPG=list(mpg)), by="cyl",.SDcols=c("drat","wt")])
#> Warning in `[.data.table`(mtcarsdt, , c(MPG = list(mpg)), by = "cyl", .SDcols
#> = c("drat", : This j doesn't use .SD but .SDcols has been supplied.
#> Ignoring .SDcols. See ?data.table.
#> [1] "cyl" "MPG"

#with by and lapply(SD,FUN):
names(mtcarsdt[, c(MPG=list(mpg), lapply(.SD,sum)), by="cyl",.SDcols=c("drat","wt")])
#> [1] "cyl"  "V1"   "drat" "wt"


#without by and with lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec), lapply(.SD,sum)), .SDcols=c("drat","wt")])
#> [1] "mean.mpg"  "mean2"     "mean.qsec" "drat"      "wt"

#with by but no lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec)), by="cyl", .SDcols=c("drat","wt")])
#> Warning in `[.data.table`(mtcarsdt, , c(mean = list(mpg = mpg, wt,
#> qsec = qsec)), : This j doesn't use .SD but .SDcols has been supplied.
#> Ignoring .SDcols. See ?data.table.
#> [1] "cyl"       "mean.mpg"  "mean2"     "mean.qsec"

#with by and lapply(SD,FUN):
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec), lapply(.SD,sum)), by="cyl",.SDcols=c("drat","wt")])
#> [1] "cyl"  "mpg"  "V2"   "qsec" "drat" "wt"

Created on 2021-01-31 by the reprex package (v0.3.0)

@myoung3
Copy link
Contributor

myoung3 commented Feb 2, 2021

More weird behavior: when optimize=0, an unnamed list (e.g. c(list(mpg)) ) doesn't get colum name V1 and instead gets "", but only when there's a by variable

library(data.table)
  M <- as.data.table(mtcars)
  names(M[, c(list(mpg),lapply(.SD, mean)), by="cyl"])
#>  [1] "cyl"  "V1"   "mpg"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
#> [11] "gear" "carb"
  names(M[, c(list(mpg),lapply(.SD, mean))])
#>  [1] "V1"   "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
#> [11] "gear" "carb"
  
  old = options(datatable.optimize = 0)
  names(M[, c(list(mpg),lapply(.SD, mean)), by="cyl"])
#>  [1] "cyl"  ""     "mpg"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
#> [11] "gear" "carb"
  names(M[, c(list(mpg),lapply(.SD, mean))])
#>  [1] "V1"   "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
#> [11] "gear" "carb"

Created on 2021-02-01 by the reprex package (v0.3.0)

@myoung3
Copy link
Contributor

myoung3 commented Feb 20, 2021

One more bit of weirdness which is fixed in the PR:

> mtcars[, c(wt[1], mean=list(a=mean(hp), b=mean(wt)),lapply(.SD,sum)),by="cyl"]
   cyl         mean.a   mean.b   mpg   disp   hp  drat     wt   qsec vs am gear carb
1:   6 2.62 122.28571 3.117143 138.2 1283.2  856 25.10 21.820 125.84  4  3   27   24
2:   4 2.32  82.63636 2.285727 293.3 1156.5  909 44.78 25.143 210.51 10  8   45   17
3:   8 3.44 209.21429 3.999214 211.4 4943.4 2929 45.21 55.989 234.81  0  2   46   49
> mtcars[, c(wt[1], mean=list(a=mean(hp), b=mean(wt)),lapply(.SD,sum))]
     V1   mean.a  mean.b   mpg cyl   disp   hp   drat      wt   qsec vs am gear carb
1: 2.62 146.6875 3.21725 642.9 198 7383.1 4694 115.09 102.952 571.16 14 13  118   90

In the PR, unnamed arguments of c() will now always get a corresponding return column named "V<position>".

@jangorecki
Copy link
Member

jangorecki commented Feb 20, 2021

@myoung3 could you please add unit test for that if it is not yet there?

@myoung3
Copy link
Contributor

myoung3 commented Feb 20, 2021

@jangorecki oops, replied on the PR but yeah there's a unit test for this now.

@myoung3
Copy link
Contributor

myoung3 commented Feb 23, 2021

For what it's worth I consider this a bugfix not an enhancement. The examples I've provided here, combined with the original motivating example, together demonstrate that the creation of column names with c() is inconsistent to the point of being broken.

@myoung3
Copy link
Contributor

myoung3 commented Feb 23, 2021

It's also worth noting that fixing this will make it easier to apply multiple functions to all columns in .SD:

Currently the following code has the problem that the resulting column names are indistinguishable (ie, .SD column names get repeated for mean and sum) because the "mean" and "sum" tags don't get prepended:

x[, c(mean=lapply(.SD,mean),sum=lapply(.SD,sum)),by="z"]

With the bugfix implemented, you'll actually be able to use the above code because the column names will be distinguishable and created in a predictable manner (ie consistent with base R).

@grantmcdermott
Copy link
Contributor

Adding prefix support for lapply(.SD,...) would also be handy for groupingset operations. Unlike the regular [ operations above, which (unfortunately) permit duplicate column names, the following doesn't work because of the name clash:

cube(data.table(mtcars), 
     j = c(lapply(.SD, mean), lapply(.SD, sum)), 
     by = c("cyl"), 
     .SDcols = c("mpg", "wt"))
# Error: There exists duplicated column names in the results, ensure the column passed/evaluated in `j` and those in `by` are not overlapping.

Modifying the above to j = c(mean = lapply(.SD, mean), sum = lapply(.SD, sum)) triggers the same error, since naming appears to be ignored for all but the full sample. This is easier to see if we just perform a single lapply call.

cube(data.table(mtcars), 
     j = c(mean = lapply(.SD, mean)), 
     by = "cyl", 
     .SDcols = c("mpg", "wt"))
#>      cyl      mpg       wt mean.mpg mean.wt
#>    <num>    <num>    <num>    <num>   <num>
#> 1:     6 19.74286 3.117143       NA      NA
#> 2:     4 26.66364 2.285727       NA      NA
#> 3:     8 15.10000 3.999214       NA      NA
#> 4:    NA       NA       NA 20.09062 3.21725

@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement top request One of our most-requested issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants