-
Notifications
You must be signed in to change notification settings - Fork 998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip sorting already sorted #4501
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4501 +/- ##
=======================================
Coverage 99.60% 99.61%
=======================================
Files 73 73
Lines 14102 14116 +14
=======================================
+ Hits 14047 14061 +14
Misses 55 55
Continue to review full report at Codecov.
|
R/data.table.R
Outdated
if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient) | ||
if (haskey(x) && | ||
all(key(x) %chin% names(jval)) && | ||
((is.null(irows) && jsub %iscall% "list" && all(vapply_1b(as.list(jsub)[-1L], is.name))) || ##fix for #4498 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this misses something like key=V1 and j=.(V1, f(V2)), as well as like j=c(list(V1=V1), .SD).
the latter is messy, maybe best to let is.sorted happen there.
I was thinking to add this check much earlier, around when j is first analyzed, there must be some similar loop done over j where we can tack this on.
FWIW both logics miss something like key=V1, j=.(newV = V1), but again, not sure how worth pursuing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, we need more reliable way to figure that out. Better to be more conservative and not retain key when, we cannot be sure. Instead of recalculating order.
R/data.table.R
Outdated
if (is.null(irows) && !is.null(shared_keys)) { | ||
setattr(jval, 'sorted', shared_keys) | ||
# potentially inefficient backup -- check if jval is sorted by key(x) | ||
} else if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have dropped this branch but there are a few tests relying on it, e.g. when lapply(.SD, as.numeric)
is used to integer column, it can still be sorted by key, or when irows
is there but it's an ordered subset.
Do we check anywhere else whether irows
is an ordered subset? maybe an opportunity to leverage ALTREP
to quickly determine 1:n
subsets are ordered.
@ColeMiller1 please have a look, I think the logic is improved but maybe could be further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this above:
orderedirows = .Call(CisOrderedSubset, irows, nrow(x)) # TRUE when irows is NULL (i.e. no i clause).
but only in the !missingby
region
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your new fx. We may want to check for is.name(.SD)
so we can pull out those SD vars as well.
I think the is.sorted()
could be avoided if:
This is a joinThere is a join and bothX[Y]
have keys in common. or!length(forder(irows))
edit: or as you point out, we could do.Call(CisOrderedSubset, irows, nrow(x))
here or outside of just the!missingby
region
I can work on it some this evening.
inst/tests/tests.Rraw
Outdated
@@ -3641,7 +3641,7 @@ test(1118, dt[, lapply(.SD, function(y) weighted.mean(y, b2, na.rm=TRUE)), by=x] | |||
|
|||
# a(nother) test of #295 | |||
DT <- data.table(x=5:1, y=1:5, key="y") | |||
test(1119, is.null(key(DT[, list(z = y, y = 1/y)]))) | |||
test(1119, key(DT[, list(z = y, y = 1/y)]), 'z') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before (as in, the reason for this test), I think [
was getting confused by the redefinition of y
, but now we re-map the key to z
, correctly
Possible alternative to the approach made so far in this PR is to have a flag to be used all the way In the first place, we should not try to retain key in every case we do that now, because it wouldn't be possible if we want to reduce the overhead.
Ultimately I would go for 1. as there is no point to retain key at the cost of extra sort, because it is unknown if that is desired, so imposing overhead in each case like this in unjustified. |
Agree w the above. We can leave the simple static case like added here, which should cover most use cases, and plan to deprecate the Eventually we could make the logic more sophisticated -- I think it makes more sense to start in limited logic, and add new cases, rather than start in broad logic & narrow it down. I would file a follow-up for that. |
NEWS.md
Outdated
@@ -109,6 +109,8 @@ unit = "s") | |||
|
|||
13. A relatively rare case of segfault when combining non-equi joins with `by=.EACHI` is now fixed, closes [#4388](https://github.com/Rdatatable/data.table/issues/4388). | |||
|
|||
14. Selecting keyed list columns will retain key without a performance penalty, closes [#4498](https://github.com/Rdatatable/data.table/issues/4498). Thanks to @user9439449 on StackOverflow for the report. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to link user via @
because it could eventually point to a different user here on github
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pinged them on SO to add their name if possible. will leave this for now
Thanks, @ColeMiller1 for having a good go at this one! Really bad performance here that you fixed, high priority. But yes I agree with the others that the direction in this PR is getting a bit fragile. |
@mattdowle |
@jangorecki I think you're saying that |
Uh, yes, you are right, it is single column, but wrapped in a list. But even then, when using faster is.sorted, we need to loop over whole input, in case it was sorted. |
Yes but looping contiguously over the input (not jumping around it) is usually insignificant. In this case there are huge number of unique strings, so, when the CHARSXP pointers are not equal, the strings have to compared, which may not be insignificant in this case, but let's see. |
R/data.table.R
Outdated
if (is.null(irows) && !is.null(shared_keys)) { | ||
setattr(jval, 'sorted', shared_keys) | ||
# potentially inefficient backup -- check if jval is sorted by key(x) | ||
} else if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient) | ||
} else if (haskey(x) && all(key(x) %chin% names(jval)) && (is.null(irows) || (!roll && .Call(CisOrderedSubset, irows, nrow(x))) || suppressWarnings(is.sorted(jval, by=key(x))))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm isn't this condition wrong though? Since all we've checked is that key(x)
shows up in names(jval)
, a query like this would pass right?
DT = data.table(a=1, b=2, c=3, key='a,b,c')
DT[ , .(a = 10:1, b=1:10, c=10:1)]
(1) DT
is keyed
(2) a,b,c
all show up as names in jval
(3) is.null(irows)
We still need to run the is.sorted
in this case right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ confirmed on this example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ColeMiller1 I guess we want this new condition inside the earlier branch right?
if ((is.null(irows) || (!roll && .Call(CisOrderedSubset, irows, nrow(x)))) && !is.null(shared_keys)) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a fragile approach. I wanted to capture the example of c(.SD, .(newvar = 3+3))
but based on Matt's work on is.sorted, I do not think we need NSE to try to capture it. Will remove now.
Edit: Note, it was too fragile and would have been changed regardless of Matt's work.
Redirecting @mattdowle 's comment from #4508 since that discussion should be on this issue:
That's the rub, yes... with that in mind I'm tempted to roll back @ColeMiller1 's latest commit: 21e68c0 I'm confident
That seems safe to me. 21e68c0 adds a few more cases that pass the smell test but should at least be vetted more thoroughly:
|
@ColeMiller1 want to have a go at adding a performance regression test here? :) |
Generated via commit b594be7 Download link for the artifact containing the test results: ↓ atime-results.zip Time taken to finish the standard R installation steps: 11 minutes and 58 seconds Time taken to run |
I can no longer reproduce such large differences. I have been doing more with WSL and that might be part of it. With 2 million rows, the patch is super fast (0.001s) while the parent is not the worst (0.1s) at performing I tried to put together a regression and while # Issue reported in https://github.com/Rdatatable/data.table/issues/4498
# To be fixed in: https://github.com/Rdatatable/data.table/pull/4501
"Skip sorting keyed values" = atime::atime_test(
N = 10^seq(5, 7),
setup = {
L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
setkey(L, V1)
},
expr = {
x[, .SD]
},
Before = "3ca83738d70d5597d9e168077f3768e32569c790" # Parent of the first commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/commit/3ca83738d70d5597d9e168077f3768e32569c790)
After = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/pull/4501/commits)
) |
usually better to start with a very small N. (10 = 10^1) vinfo <- atime::atime_versions(
pkg.edit.fun = pkg.edit.fun,
N = 10^seq(1, 7, by=0.5),
setup = {
L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
setkey(L, V1)
},
expr = {
data.table:::`[.data.table`(L, , .SD)
},
Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
)
plot(vinfo)
refs <- atime::references_best(vinfo)
plot(refs)
pred <- predict(refs)
plot(pred) Also you should use
|
Very helpful, Toby. Thanks! I do not believe there was a specific commit that caused it. More, Michael identified that there was room for making the logic better so we could re-use an existing key in certain circumstances. Around the same time, Matt found a way to fix the poor performance of ordering so that the I am still learning about git. I assumed when Michael merged master into this branch, it would mean that the master commit at the time of the merge would serve as the "parent". Two parents were listed on that commit - one parent was the last commit on this branch and the other parent was the last commit on master. Maybe I'm over thinking it. I actually had changed my sample from Slow to Before and Fast to After so I guess I'm still getting the hang of this. For general regression testing, I'm not sure we would need additional commits, right? In this case, Anyway, I couldn't get |
About the commit ids/docs my issue was with this line Before = "3ca83738d70d5597d9e168077f3768e32569c790" # Parent of the first commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/commit/3ca83738d70d5597d9e168077f3768e32569c790) which is problematic because 3ca8373 is not the parent of the first commit in this PR. It is a parent of the last commit in this PR. Both commits are interesting but make sure to document them correctly with the comments so we can verify their provenance. Here is the result I computed with them: vinfo <- atime::atime_versions(
pkg.path="~/R/data.table",
N=2^seq(1,20),
pkg.edit.fun = pkg.edit.fun,
setup = {
L = as.data.table(as.character(rnorm(N, 1, 0.5)))
setkey(L, V1)
},
expr = {
data.table:::`[.data.table`(L, , .SD)
},
"master\nparent"="3ca83738d70d5597d9e168077f3768e32569c790",
"second\nto last"="739abe53d0722c2e71d59c14b414d6d56f213173",
"Slow\nparent\nof first" = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
"Fast\nLast" = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
)
plot(vinfo) |
I'm still a little confused about how the expr |
It is true that Slow and Fast (and other historical references) are not strictly necessary to see future regressions, and they could add visual clutter, but they also essential to answer an important question: is the current performance as good/bad (or better than) the bad/goos performance we have seen in the past? (at the time the issue was created/fixed) |
Yeah, I can add some comments. Previous behavior was to be very cautious about setting keys. If Line 1432 in bbe7563
You are right - I should have updated the comment! I was too focused on what was the branch parent.
For me, the bad performance might still clutter things up in other PRs. After all, we fixed it! I can see your point about the good performance to showcase the best it has been although I would be OK without having the "good" performance as well. |
The idea is more that the new DT can safely retain a key. I mostly understand why this was implemented - someone spent the time to create a key initially and it would be good to retain it if possible. dt = data.table(a = 1:2, key = 'a')
new_dt = dt[, .SD] # or dt[, .(a)]
new_dt
## Key: <a>
## a
## <int>
## 1: 1
## 2: 2
address(dt) == address(new_dt)
## [1] FALSE A test case (1102.3) where maybe we went too far with trying to retain keys is this one. I would propose at this point that the user can decide to recreate the key. dt[, lapply(.SD, function(x) as.integer(x * 1000))]
## Key: <a>
## a
## <int>
## 1: 1000
## 2: 2000 |
Hi, I added a performance test and resolved conflicts, please review. In particular the comment I wrote to explain the performance test expr arg is "New DT can safely retain key." Do you think that is sufficient? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the test!
Is the atime test here ready to go? I see some activity but comments still unresolved. @tdhock / @Anirban166 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I approve
restoring branch because it is required in an atime performance test, which errored recently: https://github.com/Rdatatable/data.table/actions/runs/10149565942/job/28064770069?pr=6288
restoring (un-deleting) this branch should fix that error. |
Ah, thanks. It's hard to remember :) |
Closes #4498