In #1248, dt[TRUE, ...] is suggested as a workaround for dt[, ...] not updating in place. When I profiled the memory operations here, I found that while it does correctly update in place, the combination of R 3.1.2 and the current 'master' allocates (and presumably copies) much more data than is saved by updating in place. Specifically, it churns through 28 bytes for each row in the data.table, even though the final update is made in place.
library(data.table)
N = 1000000                        # or any large number of rows
dt = data.table(A=1:N, B=rnorm(N)) # something with N rows
dt[TRUE, B:=B*2]                   # stabilize with initial dummy update
print(paste("originalAddress =", address(dt$B)))
# NOTE: Rprofmem requires R compiled with --enable-memory-profiling
Rprofmem("mem.txt")                # save allocations to file
dt[TRUE, B:=B*2]                   # or some in-place update
Rprofmem(NULL)                     # close allocation file
print(paste("finishedAddress =", address(dt$B)))
This shows that the address of the B column remains constant and, if you have a version of R compiled with --enable-memory-profiling, it also produces the output file "mem.txt":
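As a quick sanity check on the per-row figure, here's a small helper sketch (mine, not part of the original run) to total the logged bytes; it assumes the stock Rprofmem line format of a byte count followed by " :" and the call stack:

mem <- readLines("mem.txt")
alloc <- mem[grepl("^[0-9]", mem)]           # skip "new page:" lines
bytes <- as.numeric(sub(" :.*$", "", alloc)) # byte count at the start of each line
sum(bytes) / N                               # bytes allocated per row of dt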
While indicative of a problem, this doesn't really tell us how to fix it. So I patched Rprofmem() to include line numbers, and then recompiled data.table with DESCRIPTION changed to "KeepSource: TRUE" and "ByteCompile: FALSE". Only the change to KeepSource is required, but turning off ByteCompile gave better information. My hope is that, since I'm profiling memory allocations rather than run time, disabling byte compilation won't change the results.
I also updated the version number to confirm that I'm using my altered version, and installed it in a separate directory with "R CMD INSTALL data.table -l /R/lib". To use my altered version, I change the first command in the file to "library(data.table, lib='/R/lib')". When I do this, I see (approximately) the lines at which the allocations are taking place:
I say 'approximately' because sometimes the srcrefs don't quite match up to the exact line. In this case, it looks like the 3 allocations of 4*N bytes are all happening in the 'else' clause at line 654:
654 : else irows=seq_len(nrow(x))[i] # e.g. recycling DT[c(TRUE,FALSE),which=TRUE], for completeness
This just seems to be the nature of R. First it creates the full sequence, then it copies it for the [] function, then it copies it on assignment. The solution would be to find some other way to do this, but I'm not familiar enough with R to know what it would be.
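For what it's worth, here is a standalone sketch (mine, not from data.table) that runs the equivalent of that line under Rprofmem outside of [.data.table; the exact number of 4*N allocations reported may vary by R version:

# Equivalent of line 654 with a single logical TRUE as the subscript
N <- 1e6
i <- TRUE
Rprofmem("seq.txt")
irows <- seq_len(N)[i]   # full integer sequence plus whatever `[` allocates internally
Rprofmem(NULL)
readLines("seq.txt")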
The first allocation of 8*N bytes is here:
1151: ans[[target]] = .Call(CsubsetVector,x[[source]],irows) # i.e. x[[source]][irows], but guaranteed new memory even for singleton logicals from R 3.1.0
The second allocation of 8*N bytes is at the same location that causes the copy for dt[, ...]:
1196: jval = eval(jsub, SDenv, parent.frame())
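Putting those together, the allocations identified above account for the per-row figure quoted at the top (my arithmetic, assuming 4-byte integers and 8-byte doubles):

3 * 4 + 2 * 8   # three 4*N integer vectors + two 8*N double vectors = 28 bytes per row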
I don't understand R's rules well enough to know how to prevent either of these, but hopefully this provides enough information that someone knowledgeable can take a shot at improving this. In-place updates can be extremely helpful in allowing R to work with vectors that take up substantial amounts of memory, but the benefit of this approach is mostly negated if multiple additional allocations are required.
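Not part of the original report, but for comparison: a sketch of the same update via set(), which performs the by-reference assignment without going through [.data.table, so I would expect it to skip at least the irows allocations (the right-hand side dt$B*2 still allocates one 8*N vector):

Rprofmem("mem_set.txt")
set(dt, j = "B", value = dt$B * 2)   # by-reference column update
Rprofmem(NULL)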
I don't think it matters here, but here's how I have R configured for these tests: