[R-Forge #1705] Change CJ internally not to expand, for efficiency #652

arunsrinivasan · 2014-06-08T13:17:09Z

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

CJ() could return an irregular column length data.table (its inputs unchanged), marked somehow so that binarysearch knows to %% index through the CJ columns, rather than allocating long vectors and populating them as CJ currently does. Purely for speed, with no changes for users unless perhaps CJ() has been used outside [.data.table.

Or, more generally, data.table columns could be marked so that indexing to them knows they should be recycled, without actually allocating and recycling the columns. Discuss on datatable-help needed as under-the-hood code e.g. DT$foo[903] might not work if it was internally a short vector. In reality it's probably only useful for CJ() as regular data.table's only have recycled data perhaps to start with creating them with say an NA column but then are quickly populated. If a large table has a column with the same value (say NA or 0) for every row, perhaps the column is redundant. Can't imagine that recycling >1 items is often useful (other than in CJ()).

jan-glx · 2015-09-10T16:10:23Z

Yes please. Maybe some non standard evaluation hacking in [.data.table is enough.
I guess you should decide how far you want to go with optimizing i (I am thinking of | and &)..If not very far very soon, I would really want to have a way to x[y, on=c("a"=b)] retaining the key of x( see #1268), such that the concerned user can achieve high performance easily in an imperative way.
best,Jan

see for example this >50x speedup:

library(data.table)
options(datatable.verbose = F)
dt = data.table(x = sample(1E5, 1E7, replace=T), 
                y = sample(1E5, 1E7, replace=T),
                key = c("x", "y"))
goodX = sample(1E5, 1E4, replace=F)
goodY = sample(1E5, 1E4, replace=F)

fCJ <- function() dt[CJ(goodX, goodY), nomatch=0]
fDJ <- function() setkey(setattr(setattr(dt[data.table("x"=goodX, key="x"), nomatch=0],
                                         "sorted",c("x","y")
                                         )[data.table("y"=goodY, key="y"), nomatch=0, on="y"],
                                 "sorted",c("y","x"))
                         ,x, y)[]
system.time(rCJ <- fCJ())
##    user  system elapsed 
##    2.33    0.56    2.89
system.time(rDJ <- fDJ())
##    user  system elapsed 
##    0.02    0.03    0.05
identical(rCJ, rDJ)
## [1] TRUE

EDIT: Ah sry, before posting I did some further "optimization" and didn't notice it produced wrong results with it. My point now holds hopefully:)

eantonya · 2015-09-10T16:21:47Z

@jan-glx I'm confused - what exactly are you trying to demonstrate? I didn't run your code, but you show that the result of your two functions is not identical, then you talk about a speedup...

Fwiw I would want to see a different function if this idea gets implemented, and not a drastic change to what CJ does - as I use it the other way around all the time - CJ(...)[...].

jan-glx · 2015-09-10T17:33:58Z

Here is the example with the "optimization" on again. And another direct implementation that might make clearer when this is useful.

library(data.table)
options(datatable.verbose = F)
dt = data.table(x = sample(1E5, 1E7, replace=T), 
                y = sample(1E5, 1E7, replace=T),
                key = c("x", "y"))
goodX = sample(1E5, 1E4, replace=T)
goodY = sample(1E5, 1E4, replace=T)

fNJ <- function() dt[x %in% goodX & y %in% goodY, nomatch=0]
fCJ <- function() dt[CJ(goodX, goodY, unique = TRUE), nomatch=0]
fDJ <- function() setkey(setattr(setattr(dt[unique(data.table("x"=goodX, key="x")), nomatch=0],
                                         "sorted",c("x","y")
                                         )[unique(data.table("y"=goodY, key="y")), nomatch=0, on="y"],
                                 "sorted",c("y","x"))
                         ,x, y)[]

system.time(rNJ <- fNJ())
##    user  system elapsed 
##    1.17    0.03    1.21
system.time(rCJ <- fCJ())
##    user  system elapsed 
##    2.06    0.59    2.66
system.time(rDJ <- fDJ())
##    user  system elapsed 
##    0.05    0.00    0.05
identical(rNJ, rCJ)
## [1] TRUE
identical(rCJ, rDJ)
## [1] TRUE

franknarf1 · 2015-09-10T19:06:41Z

Hm, are you just asking that autoindexing be extended? Seems like the clearest demonstration is just

r <- dt[x %in% sort(unique(goodX))][y %in% sort(unique(goodY))] # just as fast as rDJ
identical(r,rDJ) # TRUE

(All of your setattr sort of cloud what's going on to me, not to mention the nomatch=0 which should have no effect in fNJ.) Anyway, I agree that further optimization for fNJ would be nice, as would some convenient shortcut, like DT[list(x=goodX, y=goodY, z=goodZ), on=.ALLIN]

jan-glx · 2015-09-11T09:49:52Z

Kinda, yes. I would prefer to to describe what I want instead of heaving to explicitly say want R to do. However current optimization only works for some (through the most important) cases. And I am afraid it is supporting the use of data.table in a not data.table way leading to problems as soon as optimization does not kick in. E.g. see:

> library(data.table)
> dt = data.table(x = sample(1E5, 1E8, replace=T))
> setkey(dt,x,verbose=T)
forder took 2.47 sec
reorder took 2.16 sec
> system.time(dt[.(42),])
   user  system elapsed 
      0       0       0 
> system.time(dt[x==42,])
   user  system elapsed 
      0       0       0 
> system.time(dt[42==x,])
   user  system elapsed 
   0.72    0.38    1.09 
> system.time(dt[x %in% 42,])
   user  system elapsed 
      0       0       0 
> system.time(dt[x %in% 42 & TRUE,])
   user  system elapsed 
   4.38    0.54    4.92

But I am not sure what I want or what the best solution would be...an how much of it concerns this issue...
Expressing the WHERE ... AND ... as CJ felt fine for me (apart from the fact that it expands and is terrible slow (for a dt command))... A real query language for i would be cool (one that warns you if it cannot be optimized to joins)
Alternatively if one could do joins that preserve the keys or if data.table would automatically create and keep if easily possible indices like in what @jangorecki seems to be implementing That would be awesome as well.

PS: In your %in% query: Why is the key preserved (only) if both good arrays are sorted?

jangorecki · 2015-09-11T10:24:55Z

Regarding multiple indexes you've linked, it quite heavy implementation as it stores not just integer order sequence for each index but all the column values included in the index for each index, see idxv.R#L10. I'm not sure if this what you discuss here fits to the first post which is clearly focused on single change for CJ.

MichaelChirico · 2018-10-11T16:06:07Z

where does this stand now that R has ALTREP?

jan-glx · 2018-10-11T17:35:44Z

where does this stand now that R has ALTREP?

If the following version includes ALTREP then the room for improvement is still present.

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.4    RevoUtils_11.0.0     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0

jangorecki · 2018-10-12T21:00:17Z

I wouldn't go that way unless there will be officially documented (at least on r-devel mailing) ALTREP api to represent cross product of data.frames, otherwise such API is likely to change without any notice.

MichaelChirico · 2024-04-14T05:58:04Z

such API is likely to change without any notice.

I think it could be our own ALTREP class, though I still don't find good documentation on how to use ALTREP, exactly :)

nkurz mentioned this issue Aug 3, 2015

dt[TRUE, B := B * 2] allocates 28 * nrow(dt) bytes of data #1249

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R-Forge #1705] Change CJ internally not to expand, for efficiency #652

[R-Forge #1705] Change CJ internally not to expand, for efficiency #652

arunsrinivasan commented Jun 8, 2014

jan-glx commented Sep 10, 2015

eantonya commented Sep 10, 2015

jan-glx commented Sep 10, 2015

franknarf1 commented Sep 10, 2015

jan-glx commented Sep 11, 2015

jangorecki commented Sep 11, 2015

MichaelChirico commented Oct 11, 2018

jan-glx commented Oct 11, 2018

jangorecki commented Oct 12, 2018

MichaelChirico commented Apr 14, 2024

[R-Forge #1705] Change CJ internally not to expand, for efficiency #652

[R-Forge #1705] Change CJ internally not to expand, for efficiency #652

Comments

arunsrinivasan commented Jun 8, 2014

jan-glx commented Sep 10, 2015

eantonya commented Sep 10, 2015

jan-glx commented Sep 10, 2015

franknarf1 commented Sep 10, 2015

jan-glx commented Sep 11, 2015

jangorecki commented Sep 11, 2015

MichaelChirico commented Oct 11, 2018

jan-glx commented Oct 11, 2018

jangorecki commented Oct 12, 2018

MichaelChirico commented Apr 14, 2024