Skip sorting already sorted #4501

ColeMiller1 · 2020-05-27T01:44:51Z

library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3

x <- as.data.table(as.character(rnorm(200000,1,0.5)))
setkey(x,V1)

##before fix
system.time(a1 <- x[,.(V1)])
#>    user  system elapsed 
#>    0.94    0.50    1.48
system.time(a2 <- x[, c('V1')])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, 1])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, .SD])
#>    user  system elapsed 
#>    0.72    0.52    1.41
system.time(x[, c(.SD, .(V1))])
#>    user  system elapsed 
#>    0.75    0.61    1.53

##after fix

system.time(a1 <- x[,.(V1)])
#>    user  system elapsed 
#>    0.01    0.00    0.01
system.time(a2 <- x[, c('V1')])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, 1])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, .SD])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, c(.SD, .(V1))]) ##still slow!
#>    user  system elapsed 
#>    0.75    0.61    1.53

codecov · 2020-05-27T01:54:25Z

Codecov Report

Merging #4501 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #4501   +/-   ##
=======================================
  Coverage   99.60%   99.61%           
=======================================
  Files          73       73           
  Lines       14102    14116   +14     
=======================================
+ Hits        14047    14061   +14     
  Misses         55       55

Impacted Files	Coverage Δ
R/data.table.R	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 88bd1eb...739abe5. Read the comment docs.

MichaelChirico · 2020-05-27T01:59:36Z

R/data.table.R

-      if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x))))  # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)
+      if (haskey(x) && 
+          all(key(x) %chin% names(jval)) &&
+          ((is.null(irows) && jsub %iscall% "list" && all(vapply_1b(as.list(jsub)[-1L], is.name))) || ##fix for #4498


this misses something like key=V1 and j=.(V1, f(V2)), as well as like j=c(list(V1=V1), .SD).

the latter is messy, maybe best to let is.sorted happen there.

I was thinking to add this check much earlier, around when j is first analyzed, there must be some similar loop done over j where we can tack this on.

FWIW both logics miss something like key=V1, j=.(newV = V1), but again, not sure how worth pursuing

Agree, we need more reliable way to figure that out. Better to be more conservative and not retain key when, we cannot be sure. Instead of recalculating order.

MichaelChirico · 2020-05-27T04:30:11Z

R/data.table.R

+      if (is.null(irows) && !is.null(shared_keys)) {
+        setattr(jval, 'sorted', shared_keys)
+        # potentially inefficient backup -- check if jval is sorted by key(x)
+      } else if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)


I would have dropped this branch but there are a few tests relying on it, e.g. when lapply(.SD, as.numeric) is used to integer column, it can still be sorted by key, or when irows is there but it's an ordered subset.

Do we check anywhere else whether irows is an ordered subset? maybe an opportunity to leverage ALTREP to quickly determine 1:n subsets are ordered.

@ColeMiller1 please have a look, I think the logic is improved but maybe could be further.

I see this above:

orderedirows = .Call(CisOrderedSubset, irows, nrow(x)) # TRUE when irows is NULL (i.e. no i clause).

but only in the !missingby region

I like your new fx. We may want to check for is.name(.SD) so we can pull out those SD vars as well.

I think the is.sorted() could be avoided if:

~~This is a join~~ There is a join and both X[Y] have keys in common. or

!length(forder(irows)) edit: or as you point out, we could do .Call(CisOrderedSubset, irows, nrow(x)) here or outside of just the !missingby region

I can work on it some this evening.

MichaelChirico · 2020-05-27T04:31:47Z

inst/tests/tests.Rraw

@@ -3641,7 +3641,7 @@ test(1118, dt[, lapply(.SD, function(y) weighted.mean(y, b2, na.rm=TRUE)), by=x]

 # a(nother) test of #295
 DT <- data.table(x=5:1, y=1:5, key="y")
-test(1119, is.null(key(DT[, list(z = y, y = 1/y)])))
+test(1119, key(DT[, list(z = y, y = 1/y)]), 'z')


before (as in, the reason for this test), I think [ was getting confused by the redefinition of y, but now we re-map the key to z, correctly

jangorecki · 2020-05-27T09:13:10Z

Possible alternative to the approach made so far in this PR is to have a flag to be used all the way [.data.table and set it whenever order can be changed. Current approach tries to figure that out at one place, which is prone to error.

In the first place, we should not try to retain key in every case we do that now, because it wouldn't be possible if we want to reduce the overhead.

If we agree to be more conservative and not call is.sorted at all, then this will be breaking change. I would postpone that for next release.
If we want to retain key always, then we can only optimize "easy cases" to not call is.sorted, which IMO is error prone. As this is not a regression in current devel I would postpone that to next release anyway.

Ultimately I would go for 1. as there is no point to retain key at the cost of extra sort, because it is unknown if that is desired, so imposing overhead in each case like this in unjustified.

MichaelChirico · 2020-05-27T10:16:14Z

Agree w the above. We can leave the simple static case like added here, which should cover most use cases, and plan to deprecate the is.sorted part for later releases -- it feels quite fragile & like it can be a huge gotcha on some queries unexpectedly.

Eventually we could make the logic more sophisticated -- I think it makes more sense to start in limited logic, and add new cases, rather than start in broad logic & narrow it down. I would file a follow-up for that.

jangorecki · 2020-05-27T14:24:00Z

NEWS.md

@@ -109,6 +109,8 @@ unit = "s")

 13. A relatively rare case of segfault when combining non-equi joins with `by=.EACHI` is now fixed, closes [#4388](https://github.com/Rdatatable/data.table/issues/4388).

+14. Selecting keyed list columns will retain key without a performance penalty, closes [#4498](https://github.com/Rdatatable/data.table/issues/4498). Thanks to @user9439449 on StackOverflow for the report.


no need to link user via @ because it could eventually point to a different user here on github

I pinged them on SO to add their name if possible. will leave this for now

mattdowle · 2020-05-27T18:48:52Z

Thanks, @ColeMiller1 for having a good go at this one! Really bad performance here that you fixed, high priority. But yes I agree with the others that the direction in this PR is getting a bit fragile.
Root cause, iiuc, is that is.sorted is calling forder at all. I was surprised to see that. I seem to remember I might have added that -- sorry all. The comment on the line that calls is.sorted contains # TO DO: ... tidy up and make efficient. What that meant was make is.sorted efficient inside it by not calling forderv. Currently Cfsorted works efficiently on a single column. Like R's is.unsorted it just loops through the vector and stops on the first >. That needs to be extended to multi-column, and made to support passing a subset of columns, too. Then is.sorted would be (finally) implemented as originally intended (multi-column and single-sweep) and we can use it without a performance penalty.
There's also a comment in Cfsorted that it can be parallelized. If n blocks are each sorted, and the boundaries of the blocks are in order too, then it's sorted. But just getting rid of the awful forderv call inside is.sorted should solve 99% of this bad performance. I'll take a look at expanding Cfsorted.

jangorecki · 2020-05-27T19:23:31Z

@mattdowle is.sorted even when improved, as you described, should be taken out from there IMO. In the worse case scenario, when input is already sorted, then is.sorted will need to loop over all elements anyway. Case reported in this issue is a single column, so should use faster is.sorted already, but because it still needs to iterate over all elements, it is slow. User might not need a key, so we should not impose overhead by default.

mattdowle · 2020-05-27T19:42:58Z

@jangorecki I think you're saying that is.sorted is dispatching to Cfsorted at the moment. If so, I don't think that's correct. is.sorted is being passed a list (a data.table) so the first branch is happening which calls forderv. The warning is happening and being suppressed by the caller. What's taking the huge amount of time in this one-column example is sorting the large amount of unique strings.

jangorecki · 2020-05-27T19:45:18Z

Uh, yes, you are right, it is single column, but wrapped in a list. But even then, when using faster is.sorted, we need to loop over whole input, in case it was sorted.

mattdowle · 2020-05-27T19:52:51Z

Yes but looping contiguously over the input (not jumping around it) is usually insignificant. In this case there are huge number of unique strings, so, when the CHARSXP pointers are not equal, the strings have to compared, which may not be insignificant in this case, but let's see.
If it isn't sorted, it'll return quickly.
If it is sorted, then even the insignificant contiguous-scan is probably worth it, for the onward benefit of retaining the key. But let's see what the timing is for this worst-case very-many-unique-string example, before deciding.

MichaelChirico · 2020-05-28T03:38:23Z

R/data.table.R

      if (is.null(irows) && !is.null(shared_keys)) {
        setattr(jval, 'sorted', shared_keys)
        # potentially inefficient backup -- check if jval is sorted by key(x)
-      } else if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)
+      } else if (haskey(x) && all(key(x) %chin% names(jval)) && (is.null(irows) || (!roll && .Call(CisOrderedSubset, irows, nrow(x))) || suppressWarnings(is.sorted(jval, by=key(x))))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)


hmm isn't this condition wrong though? Since all we've checked is that key(x) shows up in names(jval), a query like this would pass right?

DT = data.table(a=1, b=2, c=3, key='a,b,c') DT[ , .(a = 10:1, b=1:10, c=10:1)]

(1) DT is keyed
(2) a,b,c all show up as names in jval
(3) is.null(irows)

We still need to run the is.sorted in this case right?

^ confirmed on this example

@ColeMiller1 I guess we want this new condition inside the earlier branch right?

if ((is.null(irows) || (!roll && .Call(CisOrderedSubset, irows, nrow(x)))) && !is.null(shared_keys)) {

It is a fragile approach. I wanted to capture the example of c(.SD, .(newvar = 3+3)) but based on Matt's work on is.sorted, I do not think we need NSE to try to capture it. Will remove now.

Edit: Note, it was too fragile and would have been changed regardless of Matt's work.

MichaelChirico · 2020-05-28T04:36:54Z

Redirecting @mattdowle 's comment from #4508 since that discussion should be on this issue:

Are we comfortable it is safe; i.e. won't result in an invalid key. I was going to look at it next. This speed issue was very bad in this edge case, but at least the result was correct.

That's the rub, yes... with that in mind I'm tempted to roll back @ColeMiller1 's latest commit: 21e68c0

I'm confident get_shared_keys is working as intended & it's pretty low-risk -- when x is keyed by K1, ..., KN we re-apply existing keys K1, ..., Kn (n <= N) when:

i is empty (is.null(irows))
j is a call to list
K1, ..., Kn show up in j as names directly, like .(a = K1) or .(K2)

K(n+1) is the first key that doesn't show up as a name.

That seems safe to me.

21e68c0 adds a few more cases that pass the smell test but should at least be vetted more thoroughly:

j is .SD and the keys are available from .SDcols
i may be non-NULL, but it's an ordered subset

MichaelChirico · 2024-07-11T14:21:59Z

@ColeMiller1 want to have a go at adding a performance regression test here? :)

github-actions · 2024-07-11T14:37:33Z

Generated via commit b594be7

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 11 minutes and 58 seconds

Time taken to run atime::atime_pkg on the tests: 5 minutes and 7 seconds

ColeMiller1 · 2024-07-12T03:08:35Z

I can no longer reproduce such large differences. I have been doing more with WSL and that might be part of it. With 2 million rows, the patch is super fast (0.001s) while the parent is not the worst (0.1s) at performing x[, .SD].

I tried to put together a regression and while atime seems great, I don't know the best route to test on my local. Nor do I know if this is best practices for the tests. For example, the smallest N I use is 5 because I am not sure we get much information with results much lower than 10,000 rows for this particular PR / issue.

  # Issue reported in https://github.com/Rdatatable/data.table/issues/4498
  # To be fixed in: https://github.com/Rdatatable/data.table/pull/4501
  "Skip sorting keyed values" = atime::atime_test(
    N = 10^seq(5, 7),
    setup = {
        L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
        setkey(L, V1)
    },
    expr = {
        x[, .SD]
    },
    Before = "3ca83738d70d5597d9e168077f3768e32569c790" # Parent of the first commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/commit/3ca83738d70d5597d9e168077f3768e32569c790)
    After = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/pull/4501/commits)
  )

tdhock · 2024-07-12T11:35:31Z

usually better to start with a very small N. (10 = 10^1)
When in doubt, you can leave N blank, because the default N is usually pretty good, 2^seq(1, 20)
You can test on your local machine by just using the same arguments with atime::atime_versions,

vinfo <- atime::atime_versions(
   pkg.edit.fun = pkg.edit.fun, 
    N = 10^seq(1, 7, by=0.5),
    setup = {
        L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
        setkey(L, V1)
    },
    expr = {
        data.table:::`[.data.table`(L, , .SD)
    },
    Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
    Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
)
plot(vinfo)
refs <- atime::references_best(vinfo)
plot(refs)
pred <- predict(refs)
plot(pred)

Also you should use data.table:::`[.data.table`(x,i,j) instead of x[i,j] because the different versions work by substituting data.table:: with data.table.someSHAversion:: (a different package name is created and installed for each version, and to have that package work you need to provide pkg.edit.fun https://github.com/Rdatatable/data.table/blob/master/.ci/atime/tests.R#L43)
Also is there a historical commit that caused a regression? If so please add that commit with name "Regression"
Otherwise if there is no regression (code was always slow in the past) please change Before to Slow and After to Fast.
I have updated the wiki page https://github.com/Rdatatable/data.table/wiki/Performance-testing to document.
In particular please read https://github.com/Rdatatable/data.table/wiki/Performance-testing#documentation-of-historical-commits which explains how to document the historical commits.
It looks to me like your version and documentation comment for the Before commit were incorrect. Here is my reasoning

The PR commits tab https://github.com/Rdatatable/data.table/pull/4501/commits shows that the first commit in this PR is 7463633
The commit page for that SHA1 7463633 shows that cacdc92 is its parent

ColeMiller1 · 2024-07-13T01:47:53Z

Very helpful, Toby. Thanks!

I do not believe there was a specific commit that caused it. More, Michael identified that there was room for making the logic better so we could re-use an existing key in certain circumstances. Around the same time, Matt found a way to fix the poor performance of ordering so that the is.sorted() call was less expensive. That meant that this PR was less important.

I am still learning about git. I assumed when Michael merged master into this branch, it would mean that the master commit at the time of the merge would serve as the "parent". Two parents were listed on that commit - one parent was the last commit on this branch and the other parent was the last commit on master. Maybe I'm over thinking it.
353dc7a

I actually had changed my sample from Slow to Before and Fast to After so I guess I'm still getting the hang of this. For general regression testing, I'm not sure we would need additional commits, right? In this case, base=master and HEAD=fix_sorting_on_sorted is what we would need to review if this commit is working. If I understand the intention of the graphs, adding Slow and Fast might clutter up future PRs. Alternatively, it might be nice if on only the branch the test was made for that it showed the additional information. E.g., on this PR we might see something like Skip sorting keyed values with Slow/Fast commits on the graph but on other PR branches, we wouldn't.

Anyway, I couldn't get atime to work on my local machine. It wants a pkg.path. I looked at the documentation and I think I know what to do. But... should I continue with a performance test for this PR?

tdhock · 2024-07-13T02:12:53Z

Hi Cole sorry about forgetting the pkg.path in the code above, it should be the path to your clone of the data.table git repo.
I hear that you don't think this PR is a huge performance improvement after all, but the atime result I computed still looks like a pretty big performance improvement,

pkg.edit.fun = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
vinfo <- atime::atime_versions(
  pkg.path="~/R/data.table",
  N=2^seq(1,20),
  pkg.edit.fun = pkg.edit.fun, 
  setup = {
    L = as.data.table(as.character(rnorm(N, 1, 0.5)))
    setkey(L, V1)
  },
  expr = {
    data.table:::`[.data.table`(L, , .SD)
  },
  Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
  Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
)
plot(vinfo)

tdhock · 2024-07-13T02:23:04Z

About the commit ids/docs my issue was with this line

    Before = "3ca83738d70d5597d9e168077f3768e32569c790" # Parent of the first commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/commit/3ca83738d70d5597d9e168077f3768e32569c790)

which is problematic because 3ca8373 is not the parent of the first commit in this PR. It is a parent of the last commit in this PR. Both commits are interesting but make sure to document them correctly with the comments so we can verify their provenance. Here is the result I computed with them:

In agreement with your prediction that Matt's changes had already resulted in some speedups, we see that "master parent" is faster than "Slow parent of first." But we also see that "Fast last" and "second to last" are faster still, which means this PR is actually helpful, relative to current master (not just old master at the time of the first commit in this PR. We should probably include both as historical Slow references. Maybe "Slow old" and "Slow new" ?

vinfo <- atime::atime_versions(
  pkg.path="~/R/data.table",
  N=2^seq(1,20),
  pkg.edit.fun = pkg.edit.fun, 
  setup = {
    L = as.data.table(as.character(rnorm(N, 1, 0.5)))
    setkey(L, V1)
  },
  expr = {
    data.table:::`[.data.table`(L, , .SD)
  },
  "master\nparent"="3ca83738d70d5597d9e168077f3768e32569c790",
  "second\nto last"="739abe53d0722c2e71d59c14b414d6d56f213173",
  "Slow\nparent\nof first" = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
  "Fast\nLast" = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
)
plot(vinfo)

tdhock · 2024-07-13T02:25:41Z

I'm still a little confused about how the expr data.table:::`[.data.table`(L, , .SD) involves sorting, maybe you can clarify in the comments next to this performance test case?

tdhock · 2024-07-13T02:39:34Z

For general regression testing, I'm not sure we would need additional commits, right? In this case, base=master and HEAD=fix_sorting_on_sorted is what we would need to review if this commit is working. If I understand the intention of the graphs, adding Slow and Fast might clutter up future PRs.

It is true that Slow and Fast (and other historical references) are not strictly necessary to see future regressions, and they could add visual clutter, but they also essential to answer an important question: is the current performance as good/bad (or better than) the bad/goos performance we have seen in the past? (at the time the issue was created/fixed)

tdhock · 2024-07-13T02:51:30Z

Running predict method then plot method shows the data size N that the code is able to handle for a given time limit, 0.01 seconds is the default used here. We see that the PR (Fast Last) is more than 10x faster than current master (master parent) which is great!

ColeMiller1 · 2024-07-14T12:05:01Z

I'm still a little confused about how the expr data.table:::[.data.table(L, , .SD) involves sorting, maybe you can clarify in the comments next to this performance test case?

Yeah, I can add some comments. Previous behavior was to be very cautious about setting keys. If dt had a key, all names were in jsub, and it was verified to be sorted, a key would be retained. In this PR, the physical verification of sorted-ness is removed as well as retaining the key for a subset of key columns when there is no subset (i.e., irows is null). This is the old method which is still used

data.table/R/data.table.R

Line 1432 in bbe7563

if (haskey(x) && all(key(x) %chin% names(jval)) && is.sorted(jval, by=key(x)))

About the commit ids/docs my issue was with this line ... which is problematic because 3ca8373 is not the parent of the first commit in this PR. It is a parent of the last commit in this PR.

You are right - I should have updated the comment! I was too focused on what was the branch parent.

It is true that Slow and Fast (and other historical references) are not strictly necessary to see future regressions, and they could add visual clutter, but they also essential to answer an important question: is the current performance as good/bad (or better than) the bad/good performance we have seen in the past? (at the time the issue was created/fixed)

For me, the bad performance might still clutter things up in other PRs. After all, we fixed it! I can see your point about the good performance to showcase the best it has been although I would be OK without having the "good" performance as well.

tdhock · 2024-07-14T12:33:49Z

If dt had a key, all names were in jsub, and it was verified to be sorted, a key would be retained. In this PR, the physical verification of sorted-ness is removed as well as retaining the key for a subset of key columns when there is no subset ...

So for the performance test, j=.SD so all names are in jsub, so the old code does a linear scan of the data to verify the sound-ness of the key? Whereas the new code avoids that linear scan? For me it is surprising that the key is involved at all when there is no i nor by arg.

the bad performance might still clutter things up in other PRs. After all, we fixed it! ...

you are right about the clutter but the benefit of being able to easily/immediately see the historical references outweighs the drawback of clutter.

We want to be able to see if future changes undo the historical fix. most of the time all of the different versions in a PR will all be the same as the "Fast" or "Fixed" commit but if there is some difference, we want to be able to easily/immediately see if it is as bad as we saw when the fix PR was merged.

A great example is the setDT performance test, which currently shows the result below. Without the "Slow" reference we may wonder if CRAN is faster or slower than the historical bad performance we had seen in the past. But with that reference it is clear that CRAN performance is the same as the historical bad performance that we are trying to fix.

ColeMiller1 · 2024-07-14T13:22:12Z

So for the performance test, j=.SD so all names are in jsub, so the old code does a linear scan of the data to verify the sound-ness of the key? Whereas the new code avoids that linear scan? For me it is surprising that the key is involved at all when there is no i nor by arg.

The idea is more that the new DT can safely retain a key. I mostly understand why this was implemented - someone spent the time to create a key initially and it would be good to retain it if possible.

dt = data.table(a = 1:2, key = 'a')
new_dt = dt[, .SD]   # or dt[, .(a)]

new_dt

## Key: <a>
##         a
##     <int>
##  1:     1
##  2:     2

address(dt) == address(new_dt)
## [1] FALSE

A test case (1102.3) where maybe we went too far with trying to retain keys is this one. I would propose at this point that the user can decide to recreate the key.

dt[, lapply(.SD, function(x) as.integer(x * 1000))]

##  Key: <a>
##         a
##     <int>
##  1:  1000
##  2:  2000

tdhock · 2024-07-17T11:45:07Z

Hi, I added a performance test and resolved conflicts, please review. In particular the comment I wrote to explain the performance test expr arg is "New DT can safely retain key." Do you think that is sufficient?

Anirban166

LGTM for the test!

.ci/atime/tests.R

MichaelChirico · 2024-07-29T16:19:00Z

Is the atime test here ready to go? I see some activity but comments still unresolved. @tdhock / @Anirban166

tdhock

yes I approve

tdhock · 2024-07-29T19:05:15Z

restoring branch because it is required in an atime performance test, which errored recently: https://github.com/Rdatatable/data.table/actions/runs/10149565942/job/28064770069?pr=6288

Error in value[[3L]](cond) : 
  Error in revparse_single(object, branch): Error in 'git2r_revparse_single': Requested object could not be found
 when trying to checkout 353dc7a6b66563b61e44b2fa0d7b73a0f97ca461

restoring (un-deleting) this branch should fix that error.
please try to remember to not delete branches which have historical commits that are relevant to performance testing.

MichaelChirico · 2024-07-29T19:44:41Z

Ah, thanks. It's hard to remember :)

ColeMiller1 added 3 commits May 26, 2020 20:50

Update data.table.R

7463633

Update data.table.R

ac5561d

Update NEWS.md

934b1dc

MichaelChirico reviewed May 27, 2020

View reviewed changes

improved logic for catching for-sure-sorted jval

d1662b3

MichaelChirico reviewed May 27, 2020

View reviewed changes

MichaelChirico mentioned this pull request May 27, 2020

some sapply->vapply replacements #4502

Merged

jangorecki reviewed May 27, 2020

View reviewed changes

Allow .SD; allow original is.sorted portion to escape if irows is.null

21e68c0

mattdowle mentioned this pull request May 28, 2020

is.sorted now multi-column and does not use forderv #4508

Merged

MichaelChirico reviewed May 28, 2020

View reviewed changes

ColeMiller1 and others added 6 commits May 28, 2020 05:09

partial rollback

4b6ea43

Fixed logic of .SD selection

83ff6c1

Merge branch 'master' into fix_sorting_on_sorted

2d97ae0

typo

d4c3a6b

typo

739abe5

Merge branch 'master' into fix_sorting_on_sorted

353dc7a

DT[,.SD] improved in #4501

245a800

tdhock requested review from tdhock and Anirban166 as code owners July 17, 2024 11:36

tdhock added 2 commits July 17, 2024 07:40

Slow is a recognized test name with a specific color

23ec54f

merge

45125ad

tdhock added 2 commits July 17, 2024 09:00

fix lint

1b5f5a1

bigger N and set.seed

6bb0642

Anirban166 reviewed Jul 17, 2024

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

.ci/atime/tests.R Outdated Show resolved Hide resolved

nearly->close-to

92a0711

MichaelChirico added this to the 1.16.0 milestone Jul 28, 2024

tdhock approved these changes Jul 29, 2024

View reviewed changes

Merge branch 'master' into fix_sorting_on_sorted

b594be7

MichaelChirico merged commit 680b5e8 into master Jul 29, 2024
5 checks passed

MichaelChirico deleted the fix_sorting_on_sorted branch July 29, 2024 16:55

tdhock restored the fix_sorting_on_sorted branch July 29, 2024 19:02

MichaelChirico mentioned this pull request Jul 29, 2024

Use a better approach for atime branch clean-up #6325

Open

		@@ -109,6 +109,8 @@ unit = "s")

		13. A relatively rare case of segfault when combining non-equi joins with `by=.EACHI` is now fixed, closes [#4388](https://github.com/Rdatatable/data.table/issues/4388).

		14. Selecting keyed list columns will retain key without a performance penalty, closes [#4498](https://github.com/Rdatatable/data.table/issues/4498). Thanks to @user9439449 on StackOverflow for the report.

Skip sorting already sorted #4501

Skip sorting already sorted #4501

Conversation

ColeMiller1 commented May 27, 2020 • edited Loading

codecov bot commented May 27, 2020 • edited Loading

Codecov Report

Choose a reason for hiding this comment

jangorecki May 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ColeMiller1 May 27, 2020 • edited Loading

Choose a reason for hiding this comment

MichaelChirico May 27, 2020 • edited Loading

Choose a reason for hiding this comment

jangorecki commented May 27, 2020 • edited Loading

MichaelChirico commented May 27, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattdowle commented May 27, 2020

jangorecki commented May 27, 2020 • edited Loading

mattdowle commented May 27, 2020

jangorecki commented May 27, 2020

mattdowle commented May 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ColeMiller1 May 28, 2020 • edited Loading

Choose a reason for hiding this comment

MichaelChirico commented May 28, 2020

MichaelChirico commented Jul 11, 2024

github-actions bot commented Jul 11, 2024 • edited Loading

ColeMiller1 commented Jul 12, 2024

tdhock commented Jul 12, 2024 • edited Loading

ColeMiller1 commented Jul 13, 2024

tdhock commented Jul 13, 2024

tdhock commented Jul 13, 2024

tdhock commented Jul 13, 2024

tdhock commented Jul 13, 2024

tdhock commented Jul 13, 2024

ColeMiller1 commented Jul 14, 2024

tdhock commented Jul 14, 2024

ColeMiller1 commented Jul 14, 2024

tdhock commented Jul 17, 2024

Anirban166 left a comment

Choose a reason for hiding this comment

MichaelChirico commented Jul 29, 2024

tdhock left a comment

Choose a reason for hiding this comment

tdhock commented Jul 29, 2024

MichaelChirico commented Jul 29, 2024

ColeMiller1 commented May 27, 2020 •

edited

Loading

codecov bot commented May 27, 2020 •

edited

Loading

jangorecki May 27, 2020 •

edited

Loading

ColeMiller1 May 27, 2020 •

edited

Loading

MichaelChirico May 27, 2020 •

edited

Loading

jangorecki commented May 27, 2020 •

edited

Loading

jangorecki commented May 27, 2020 •

edited

Loading

mattdowle commented May 27, 2020 •

edited

Loading

ColeMiller1 May 28, 2020 •

edited

Loading

github-actions bot commented Jul 11, 2024 •

edited

Loading

tdhock commented Jul 12, 2024 •

edited

Loading