dt[TRUE, B := B * 2] allocates 28 * nrow(dt) bytes of data #1249

Closed
nkurz opened this issue Aug 3, 2015 · 0 comments

nkurz commented Aug 3, 2015

In #1248, dt[TRUE, ...] is suggested as a workaround for dt[, ...] not updating in place. When I profiled the memory operations, I found that while it does correctly update in place, the combination of R 3.1.2 and the current 'master' allocates (and presumably copies) much more data than the in-place update saves. Specifically, it churns through 28 bytes per row of the data table even though the final update is made in place.

library(data.table)
N = 1000000           # or any large number of rows
dt = data.table(A=1:N, B=rnorm(N)) # something with N rows
dt[TRUE, B := B * 2] # stabilize with initial dummy update
print(paste("originalAddress =", address(dt$B)))
# NOTE: Rprofmem requires R compiled with --enable-memory-profiling
Rprofmem("mem.txt")  # save allocation to file
dt[TRUE, B := B * 2] # or some in-place update
Rprofmem(NULL)       # close allocation file
print(paste("finishedAddress =", address(dt$B)))

nate@ubuntu:~/R/data.table$ Rscript dt_TRUE.R
[1] "originalAddress = 0x7f2ecccb0010"
[1] "finishedAddress = 0x7f2ecccb0010"

This shows that the address of the B column remains constant. If your R is compiled with --enable-memory-profiling, the script also produces the output file "mem.txt":

4000040 :"[.data.table" "["
4000040 :"[.data.table" "["
4000040 :"[.data.table" "["
272 :"new.env" "[.data.table" "["
8000040 :"[.data.table" "["
8000040 :"eval" "eval" "[.data.table" "["

While indicative of a problem, this doesn't really tell us how to fix it. So I patched Rprofmem() to include line numbers, and then recompiled data.table with its DESCRIPTION changed to "KeepSource: TRUE" and "ByteCompile: FALSE". Only the KeepSource change is required, but turning off ByteCompile gave better information. My hope is that since I'm profiling memory allocations rather than execution time, disabling byte compilation doesn't change the results.

I also bumped the version number to confirm that I'm using my altered build, and installed it in a separate directory with "R CMD INSTALL data.table -l /R/lib". To use the altered version, I changed the first command in the script to "library(data.table, lib='/R/lib')". When I do this, I see (approximately) the lines at which the allocations take place:

4000040 :"[.data.table" #652 "["
4000040 :"[.data.table" #652 "["
4000040 :"[.data.table" #652 "["
272 :"new.env" "[.data.table" #1110 "["
8000040 :".Call" "[.data.table" #1151 "["
8000040 :"eval" "eval" "[.data.table" #1196 "["

I say 'approximately' because sometimes the srcrefs don't quite match up to the exact line. In this case, it looks like the three allocations of 4*N bytes are all happening in the 'else' clause at line 654:

654 : else irows=seq_len(nrow(x))[i]  # e.g. recycling DT[c(TRUE,FALSE),which=TRUE], for completeness 

This just seems to be the nature of R. First it creates the full sequence, then it copies it for the [] function, then it copies it on assignment. The solution would be to find some other way to do this, but I'm not familiar enough with R to know what it would be.
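
To illustrate the mechanism outside data.table, here is a minimal standalone sketch (N, i, and irows are stand-ins, not data.table's actual variables):

N = 1000000L
i = TRUE
irows = seq_len(N)[i]   # seq_len() allocates the full 4*N-byte sequence, and
                        # the length-1 logical i is recycled across all N
                        # elements, so the subset allocates another full-size
                        # result rather than recognizing "keep everything" as
                        # a no-op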

The first allocation of 8*N bytes is here:

1151: ans[[target]] = .Call(CsubsetVector,x[[source]],irows)   # i.e. x[[source]][irows], but guaranteed new memory even for singleton logicals from R 3.1.0
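
The comment on that line says the subset is guaranteed new memory, and that is easy to confirm in isolation with data.table's address() (a small sketch; the addresses themselves will differ on every run):

x = rnorm(5)
address(x)                # address of the original vector
address(x[seq_along(x)])  # a different address: subsetting by an index vector
                          # allocates a fresh copy even when the result is
                          # identical to the input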

The second allocation of 8*N bytes is at the same location that causes the copy for dt[, ...]:

1196: jval = eval(jsub, SDenv, parent.frame())
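
This is the allocation for the result of the j expression itself: evaluating B * 2 has to materialize a full new double vector before it can be assigned into the column. A standalone sketch (again assuming a memory-profiling build of R):

B = rnorm(1000000)
Rprofmem("jval.txt")
jval = B * 2         # one 8000040-byte allocation for the arithmetic result,
Rprofmem(NULL)       # matching the "eval" line in mem.txt above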

I don't understand R's rules well enough to know how to prevent either of these, but hopefully this provides enough information that someone knowledgeable can take a shot at improving it. In-place updates can be extremely helpful in letting R work with vectors that occupy substantial amounts of memory, but the benefit is mostly negated if multiple additional full-size allocations are required.

I don't think it matters here, but here's how I have R configured for these tests:

CC=icc CXX=icpc AR=xiar LD=xild CFLAGS="-g -O3 -xHost" CXXFLAGS="-g -O3 -xHost" ./configure --with-blas="-lmkl_rt -lpthread" --with-lapack --enable-memory-profiling --enable-R-shlib

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.10
