let j = c(prefix = lapply(.SD, f)) work when optimized #2311

franknarf1 · 2017-08-18T16:41:27Z

With

library(data.table)
DT = data.table(a = 1, b = 2)

DT[, c(sq = lapply(.SD, "^", 2))]
#    sq.a sq.b
# 1:    1    4
DT[, c(sq = lapply(.SD, "^", 2)), by=1]
#    NA a b
# 1:  1 1 4

The prefix "sq" disappears in the latter case. Judging by the verbose=TRUE output, this is due to "lapply optimization" which takes names strictly from .SD.

(Sorry if I filed this issue already before this; I'm sure it's bothered me for a while. A recent case: I wanted to write IDDT = dat[order(-t), c(max = .SD[1]), by=ID], but no dice.)

Related: #1604

The text was updated successfully, but these errors were encountered:

Henrik-P · 2017-09-04T20:20:19Z

Because @franknarf1 posted this comment on a SO question which I believe is related to my SO question: Do we need to convert single elements of j to a list when the overall result of j is a list anyway?, I link to my corresponding issue: Possible inconsistencies in the autonaming and renaming of .N

myoung3 · 2021-01-31T20:11:48Z

This issue also applies to concatenating a tagged list to a lapply(..SD, FUN) call inside of c(), but only if there's a by statement:

library(data.table)
mtcarsdt <- as.data.table(mtcars)

#without by and with lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(MPG=list(mpg), lapply(.SD,sum)), .SDcols=c("drat","wt")])
#> [1] "MPG"  "drat" "wt"

#with by but no lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(MPG=list(mpg)), by="cyl",.SDcols=c("drat","wt")])
#> Warning in `[.data.table`(mtcarsdt, , c(MPG = list(mpg)), by = "cyl", .SDcols
#> = c("drat", : This j doesn't use .SD but .SDcols has been supplied.
#> Ignoring .SDcols. See ?data.table.
#> [1] "cyl" "MPG"

#with by and lapply(SD,FUN):
names(mtcarsdt[, c(MPG=list(mpg), lapply(.SD,sum)), by="cyl",.SDcols=c("drat","wt")])
#> [1] "cyl"  "V1"   "drat" "wt"


#without by and with lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec), lapply(.SD,sum)), .SDcols=c("drat","wt")])
#> [1] "mean.mpg"  "mean2"     "mean.qsec" "drat"      "wt"

#with by but no lapply(SD,FUN), correct behavior:
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec)), by="cyl", .SDcols=c("drat","wt")])
#> Warning in `[.data.table`(mtcarsdt, , c(mean = list(mpg = mpg, wt,
#> qsec = qsec)), : This j doesn't use .SD but .SDcols has been supplied.
#> Ignoring .SDcols. See ?data.table.
#> [1] "cyl"       "mean.mpg"  "mean2"     "mean.qsec"

#with by and lapply(SD,FUN):
names(mtcarsdt[, c(mean=list(mpg=mpg,wt, qsec=qsec), lapply(.SD,sum)), by="cyl",.SDcols=c("drat","wt")])
#> [1] "cyl"  "mpg"  "V2"   "qsec" "drat" "wt"

^{Created on 2021-01-31 by the reprex package (v0.3.0)}

myoung3 · 2021-02-02T08:01:15Z

More weird behavior: when optimize=0, an unnamed list (e.g. c(list(mpg)) ) doesn't get colum name V1 and instead gets "", but only when there's a by variable

library(data.table)
  M <- as.data.table(mtcars)
  names(M[, c(list(mpg),lapply(.SD, mean)), by="cyl"])
#>  [1] "cyl"  "V1"   "mpg"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
#> [11] "gear" "carb"
  names(M[, c(list(mpg),lapply(.SD, mean))])
#>  [1] "V1"   "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
#> [11] "gear" "carb"
  
  old = options(datatable.optimize = 0)
  names(M[, c(list(mpg),lapply(.SD, mean)), by="cyl"])
#>  [1] "cyl"  ""     "mpg"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
#> [11] "gear" "carb"
  names(M[, c(list(mpg),lapply(.SD, mean))])
#>  [1] "V1"   "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
#> [11] "gear" "carb"

^{Created on 2021-02-01 by the reprex package (v0.3.0)}

myoung3 · 2021-02-20T06:45:16Z

One more bit of weirdness which is fixed in the PR:

> mtcars[, c(wt[1], mean=list(a=mean(hp), b=mean(wt)),lapply(.SD,sum)),by="cyl"]
   cyl         mean.a   mean.b   mpg   disp   hp  drat     wt   qsec vs am gear carb
1:   6 2.62 122.28571 3.117143 138.2 1283.2  856 25.10 21.820 125.84  4  3   27   24
2:   4 2.32  82.63636 2.285727 293.3 1156.5  909 44.78 25.143 210.51 10  8   45   17
3:   8 3.44 209.21429 3.999214 211.4 4943.4 2929 45.21 55.989 234.81  0  2   46   49
> mtcars[, c(wt[1], mean=list(a=mean(hp), b=mean(wt)),lapply(.SD,sum))]
     V1   mean.a  mean.b   mpg cyl   disp   hp   drat      wt   qsec vs am gear carb
1: 2.62 146.6875 3.21725 642.9 198 7383.1 4694 115.09 102.952 571.16 14 13  118   90

In the PR, unnamed arguments of c() will now always get a corresponding return column named "V<position>".

jangorecki · 2021-02-20T11:22:46Z

@myoung3 could you please add unit test for that if it is not yet there?

myoung3 · 2021-02-20T17:43:53Z

@jangorecki oops, replied on the PR but yeah there's a unit test for this now.

myoung3 · 2021-02-23T21:09:02Z

For what it's worth I consider this a bugfix not an enhancement. The examples I've provided here, combined with the original motivating example, together demonstrate that the creation of column names with c() is inconsistent to the point of being broken.

myoung3 · 2021-02-23T21:13:55Z

It's also worth noting that fixing this will make it easier to apply multiple functions to all columns in .SD:

Currently the following code has the problem that the resulting column names are indistinguishable (ie, .SD column names get repeated for mean and sum) because the "mean" and "sum" tags don't get prepended:

x[, c(mean=lapply(.SD,mean),sum=lapply(.SD,sum)),by="z"]

With the bugfix implemented, you'll actually be able to use the above code because the column names will be distinguishable and created in a predictable manner (ie consistent with base R).

grantmcdermott · 2022-03-07T21:11:32Z

Adding prefix support for lapply(.SD,...) would also be handy for groupingset operations. Unlike the regular [ operations above, which (unfortunately) permit duplicate column names, the following doesn't work because of the name clash:

cube(data.table(mtcars), 
     j = c(lapply(.SD, mean), lapply(.SD, sum)), 
     by = c("cyl"), 
     .SDcols = c("mpg", "wt"))
# Error: There exists duplicated column names in the results, ensure the column passed/evaluated in `j` and those in `by` are not overlapping.

Modifying the above to j = c(mean = lapply(.SD, mean), sum = lapply(.SD, sum)) triggers the same error, since naming appears to be ignored for all but the full sample. This is easier to see if we just perform a single lapply call.

cube(data.table(mtcars), 
     j = c(mean = lapply(.SD, mean)), 
     by = "cyl", 
     .SDcols = c("mpg", "wt"))
#>      cyl      mpg       wt mean.mpg mean.wt
#>    <num>    <num>    <num>    <num>   <num>
#> 1:     6 19.74286 3.117143       NA      NA
#> 2:     4 26.66364 2.285727       NA      NA
#> 3:     8 15.10000 3.999214       NA      NA
#> 4:    NA       NA       NA 20.09062 3.21725

mattdowle added this to the Candidate milestone Sep 4, 2017

mattdowle removed this from the Candidate milestone May 10, 2018

Henrik-P mentioned this issue Dec 23, 2018

Document when exactly the column names from j are preserved #3244

Closed

jangorecki mentioned this issue Sep 21, 2019

FR: lapply + other columns: improving automatic handling of column names #1604

Closed

jangorecki added this to the 1.13.0 milestone Sep 21, 2019

jangorecki added the High label Sep 21, 2019

jangorecki changed the title ~~[Request] let j = c(prefix = lapply(.SD, f)) work when optimized~~ let j = c(prefix = lapply(.SD, f)) work when optimized Sep 21, 2019

jangorecki added the enhancement label Sep 21, 2019

mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019

jangorecki removed the High label Jun 8, 2020

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

franknarf1 mentioned this issue Jan 31, 2021

Set up rowwise() and colwise() functions for use in .SD #1063

Open

6 tasks

myoung3 mentioned this issue Jan 31, 2021

Fix naming in j=c() under by= queries with lapply() optimization #4883

Merged

myoung3 mentioned this issue Jul 1, 2021

Implement DT[, across(.SD, fun1, fun2, fun3), by=group] #4970

Open

mattdowle removed this from the 1.14.1 milestone Aug 28, 2021

MichaelChirico mentioned this issue Mar 16, 2024

Master list of most-requested issues #3189

Open

75 tasks

MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024

MichaelChirico closed this as completed in #4883 Sep 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

let j = c(prefix = lapply(.SD, f)) work when optimized #2311

let j = c(prefix = lapply(.SD, f)) work when optimized #2311

franknarf1 commented Aug 18, 2017

Henrik-P commented Sep 4, 2017

myoung3 commented Jan 31, 2021

myoung3 commented Feb 2, 2021 •

edited

Loading

myoung3 commented Feb 20, 2021 •

edited

Loading

jangorecki commented Feb 20, 2021 •

edited

Loading

myoung3 commented Feb 20, 2021

myoung3 commented Feb 23, 2021 •

edited

Loading

myoung3 commented Feb 23, 2021 •

edited

Loading

grantmcdermott commented Mar 7, 2022

let j = c(prefix = lapply(.SD, f)) work when optimized #2311

let j = c(prefix = lapply(.SD, f)) work when optimized #2311

Comments

franknarf1 commented Aug 18, 2017

Henrik-P commented Sep 4, 2017

myoung3 commented Jan 31, 2021

myoung3 commented Feb 2, 2021 • edited Loading

myoung3 commented Feb 20, 2021 • edited Loading

jangorecki commented Feb 20, 2021 • edited Loading

myoung3 commented Feb 20, 2021

myoung3 commented Feb 23, 2021 • edited Loading

myoung3 commented Feb 23, 2021 • edited Loading

grantmcdermott commented Mar 7, 2022

myoung3 commented Feb 2, 2021 •

edited

Loading

myoung3 commented Feb 20, 2021 •

edited

Loading

jangorecki commented Feb 20, 2021 •

edited

Loading

myoung3 commented Feb 23, 2021 •

edited

Loading

myoung3 commented Feb 23, 2021 •

edited

Loading