Chunked generic slice eq #116422
Conversation
r? @scottmcm (rustbot has picked a reviewer for you, use r? to override)
Force-pushed from 41cfb68 to 9d3905f.
@bors try @rust-timer queue
…<try> Chunked generic slice eq

looks nice in a microbenchmark, let's see if perf agrees

```
OLD: slice::slice_cmp_generic 54.00ns/iter +/- 1.00ns
NEW: slice::slice_cmp_generic 20.00ns/iter +/- 2.00ns
```
☀️ Try build successful - checks-actions
Hmm, we used to have a 4-way unroll in all the short-circuiting slice iterator stuff too, because it looks great in simple benchmarks, but we ended up removing it because adding 4× the callsites for non-trivial element types was much worse. Might be worth checking the bench for things like …
library/core/src/slice/cmp.rs (outdated):

&& chunks_a
    .iter()
    .zip(chunks_b.iter())
    .all(|(a, b)| (a[0] == b[0]) & (a[1] == b[1]) & (a[2] == b[2]) & (a[3] == b[3]));
I guess this is `&` because of things like #105259 (comment)? It seems wrong to do this for a general, arbitrarily-expensive `==`. Is there a way it could be written to pre-load the values so that LLVM knows it's allowed to remove the short-circuit, but it's still up to the optimizer to decide whether it's worth doing?
Yeah, there are a bunch of other open issues about `&&` not doing the right thing too. I initially used `flag &= ...`, but that unrolled to a sequence of cmp + jump pairs, which was also faster but a lot more instructions. This version does a `vpxor` on AVX2.

They're all references anyway, so we can't really "load" them. We could special-case it for `T: Copy`, but that wouldn't make a difference for `&str` or similar.

I'm considering peeling the first chunk (in addition to the residual) and doing it the conventional way if the perf results are bad. That should catch a lot of non-equal results early while still running the unrolled variant most of the time.
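Not code from this PR, but a minimal standalone sketch of the shape being discussed: compare whole chunks with non-short-circuiting `&` so the loop body can vectorize, and fall back to a plain short-circuiting loop for the residual elements. The function name `chunked_eq` and the chunk size of 4 are illustrative assumptions; the peeled first chunk mentioned above is omitted for brevity.

```rust
/// Illustrative sketch only: slice equality compared in chunks of 4, using
/// non-short-circuiting `&` inside each chunk so LLVM is free to vectorize.
fn chunked_eq<T: PartialEq>(a: &[T], b: &[T]) -> bool {
    const CHUNK: usize = 4;
    if a.len() != b.len() {
        return false;
    }

    let chunks_a = a.chunks_exact(CHUNK);
    let chunks_b = b.chunks_exact(CHUNK);

    // Elements that don't fill a whole chunk are compared the conventional,
    // short-circuiting way.
    let residual_eq = chunks_a
        .remainder()
        .iter()
        .zip(chunks_b.remainder())
        .all(|(x, y)| x == y);

    residual_eq
        && chunks_a.zip(chunks_b).all(|(ca, cb)| {
            // `&` instead of `&&`: all four comparisons are evaluated, trading
            // per-element short-circuiting for a branch-free, vectorizable body.
            (ca[0] == cb[0]) & (ca[1] == cb[1]) & (ca[2] == cb[2]) & (ca[3] == cb[3])
        })
}
```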
I tried preloading all the values with `let _ = ManuallyDrop::new(ptr::read(self.get_unchecked(i)));` etc. and then using `&&`, but it didn't help.
That was just an unroll though, still short-circuiting for each item, right? This is meant to also enable vectorization for a subset of the cases.
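For the record, a rough standalone reconstruction of the preloading experiment described two comments up — not the PR's actual code; the function name, fixed offsets, and safety contract are assumptions. The idea is to make the loads unconditional (via `ptr::read` into `ManuallyDrop`, so nothing is dropped) while keeping the short-circuiting `&&` chain.

```rust
use core::mem::ManuallyDrop;
use core::ptr;

/// Illustrative sketch only.
///
/// Safety: callers must guarantee `i + 3 < a.len()` and `i + 3 < b.len()`.
unsafe fn preloaded_chunk_eq<T: PartialEq>(a: &[T], b: &[T], i: usize) -> bool {
    unsafe {
        // Shallow, never-dropped copies of the elements; the hope is that the
        // unconditional reads let LLVM hoist the loads even though the `==`
        // chain below still short-circuits.
        let _ = ManuallyDrop::new(ptr::read(a.get_unchecked(i)));
        let _ = ManuallyDrop::new(ptr::read(b.get_unchecked(i)));
        // ...and likewise for offsets i + 1 through i + 3...

        a.get_unchecked(i) == b.get_unchecked(i)
            && a.get_unchecked(i + 1) == b.get_unchecked(i + 1)
            && a.get_unchecked(i + 2) == b.get_unchecked(i + 2)
            && a.get_unchecked(i + 3) == b.get_unchecked(i + 3)
    }
}
```

As noted in the reply above, this only acts as an unroll: each element comparison still short-circuits, so it doesn't open up vectorization the way the chunked `&` version does.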
Finished benchmarking commit (dfe514e): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.

Instruction count is a highly reliable metric and was used to determine the overall result above. Max RSS (memory usage), cycles, and binary size are less reliable metrics that may be of interest but were not used to determine the overall result.

Bootstrap: 623.638s -> 622.479s (-0.19%)
Ok, the opt results have their codegen units perturbed, which is bad for comparisons. Maybe we should have some 1-CGU benchmarks... But the check results are informative. One of the issues is …
Force-pushed from 9d3905f to 82ee190.
@bors try @rust-timer queue
…<try> Chunked generic slice eq

looks nice in a microbenchmark, let's see if perf agrees

```
OLD: slice::slice_cmp_generic 54.00ns/iter +/- 1.00ns
NEW: slice::slice_cmp_generic 20.00ns/iter +/- 2.00ns
```
☀️ Try build successful - checks-actions
Finished benchmarking commit (f2bd50d): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.

Instruction count is a highly reliable metric and was used to determine the overall result above. Max RSS (memory usage), cycles, and binary size are less reliable metrics that may be of interest but were not used to determine the overall result.

Bootstrap: 622.556s -> 625.604s (0.49%)
…<try> Chunked generic slice eq

looks nice in a microbenchmark, let's see if perf agrees

```
OLD: slice::slice_cmp_generic 54.00ns/iter +/- 1.00ns
NEW: slice::slice_cmp_generic 20.00ns/iter +/- 2.00ns
```
☀️ Try build successful - checks-actions
Finished benchmarking commit (8ea9808): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.

Instruction count is a highly reliable metric and was used to determine the overall result above. Max RSS (memory usage), cycles, and binary size are less reliable metrics that may be of interest but were not used to determine the overall result.

Bootstrap: 623.37s -> 626.587s (0.52%)
@bors try
…<try> Chunked generic slice eq

looks nice in a microbenchmark, let's see if perf agrees

```
OLD: slice::slice_cmp_generic 54.00ns/iter +/- 1.00ns
NEW: slice::slice_cmp_generic 20.00ns/iter +/- 2.00ns
```
☀️ Try build successful - checks-actions
Finished benchmarking commit (10dcf57): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.

Instruction count is a highly reliable metric and was used to determine the overall result above. Max RSS (memory usage), cycles, and binary size are less reliable metrics that may be of interest but were not used to determine the overall result.

Bootstrap: 620.986s -> 623.686s (0.43%)
self.iter().zip(other.iter()).all(|(x, y)| x == y)

// ZSTs have no identity and slices don't guarantee which addresses-to-ZSTs they produce
// so we only need to compare them once to determine the behavior of the PartialEq impl
if const { mem::size_of::<A>() == 0 && mem::size_of::<B>() == 0 } {
nit:
- if const { mem::size_of::<A>() == 0 && mem::size_of::<B>() == 0 } {
+ if const { A::IS_ZST && B::IS_ZST } {
default fn equal(&self, other: &[B]) -> bool {
    if self.len() != other.len() {
        return false;
    }

    self.iter().zip(other.iter()).all(|(x, y)| x == y)
    // ZSTs have no identity and slices don't guarantee which addresses-to-ZSTs they produce
    // so we only need to compare them once to determine the behavior of the PartialEq impl
The `eq` could still have side-effects though, no? We have https://rust-lang.github.io/rfcs/1521-copy-clone-semantics.html to let us skip side-effects in `clone` sometimes, but I don't think we have a justification to skip side-effects in `PartialEq`.
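To make the side-effect concern concrete, a hypothetical example (not from the PR): a zero-sized type whose `eq` has an observable side effect. With the current per-element loop the counter is bumped once per element; a ZST fast path that compares only a single representative pair would bump it once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static EQ_CALLS: AtomicUsize = AtomicUsize::new(0);

// A zero-sized type whose equality check has an observable side effect.
struct Zst;

impl PartialEq for Zst {
    fn eq(&self, _other: &Zst) -> bool {
        EQ_CALLS.fetch_add(1, Ordering::Relaxed);
        true
    }
}

fn main() {
    let a = [Zst, Zst, Zst];
    let b = [Zst, Zst, Zst];
    // The per-element loop calls `eq` three times here; a ZST fast path that
    // compares a single pair would call it only once.
    assert!(a[..] == b[..]);
    println!("eq called {} times", EQ_CALLS.load(Ordering::Relaxed));
}
```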
Force-pushed from ef4600d to 5850f36.
This vectorizes nicely on AVX2 (got it down to 12ns/iteration), has a guaranteed preroll that isn't chunked in case comparisons are expensive, and produces less unoptimized LLVM IR than the current implementation. After optimizations it's more IR due to unrolling. If this doesn't make perf happy then I'm out of ideas.

@bors try
…<try> Chunked generic slice eq

looks nice in a microbenchmark, let's see if perf agrees

```
OLD: slice::slice_cmp_generic 54.00ns/iter +/- 1.00ns
NEW: slice::slice_cmp_generic 20.00ns/iter +/- 2.00ns
```
☀️ Try build successful - checks-actions
Finished benchmarking commit (1cf5d79): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.

Instruction count is a highly reliable metric and was used to determine the overall result above. Max RSS (memory usage), cycles, and binary size are less reliable metrics that may be of interest but were not used to determine the overall result.

Bootstrap: 622.45s -> 625.291s (0.46%)