Chunked generic slice eq #116422

Closed
wants to merge 3 commits into from


the8472
Member

@the8472 the8472 commented Oct 4, 2023

looks nice in a microbenchmark, let's see if perf agrees

OLD:
    slice::slice_cmp_generic 54.00ns/iter +/- 1.00ns
NEW:
    slice::slice_cmp_generic 20.00ns/iter +/- 2.00ns

@rustbot
Collaborator

rustbot commented Oct 4, 2023

r? @scottmcm

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 4, 2023
@the8472 the8472 force-pushed the chunked-generic-slice-eq branch from 41cfb68 to 9d3905f Compare October 4, 2023 14:26
@the8472
Member Author

the8472 commented Oct 4, 2023

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 4, 2023
@bors
Contributor

bors commented Oct 4, 2023

⌛ Trying commit 9d3905f with merge dfe514e...

bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 4, 2023
…<try>

Chunked generic slice eq

looks nice in a microbenchmark, let's see if perf agrees

```
OLD:
    slice::slice_cmp_generic 54.00ns/iter +/- 1.00ns
NEW:
    slice::slice_cmp_generic 20.00ns/iter +/- 2.00ns
```
@bors
Contributor

bors commented Oct 4, 2023

☀️ Try build successful - checks-actions
Build commit: dfe514e (dfe514e21c59daa28d30461acb33491f3aab90a0)

@scottmcm
Member

scottmcm commented Oct 4, 2023

Hmm, we used to have a 4-way unroll in all the short-circuiting slice iterator stuff too, because it looks great in simple stuff, but ended up removing it because adding 4× the callsites for non-trivial things was much worse.

Might be worth checking the bench for things like [String]?

    && chunks_a
        .iter()
        .zip(chunks_b.iter())
        .all(|(a, b)| (a[0] == b[0]) & (a[1] == b[1]) & (a[2] == b[2]) & (a[3] == b[3]));
Member

I guess this is & because of things like #105259 (comment)?

It seems wrong to do this for general arbitrarily-expensive ==. Is there a way it could be written to pre-load the values so that LLVM knows it's allowed to remove the short-circuit, but it's still up to the optimizer to decide if it's worth doing?
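
To make the concern concrete, here is a minimal sketch (not code from this PR) that counts `==` calls for a single 4-element chunk. `Counted`, `EQ_CALLS`, `count_eq_calls`, `chunk_eq_eager` and `chunk_eq_lazy` are hypothetical names made up for the illustration. With `&`, every comparison in the chunk runs even after a mismatch, while `&&` stops at the first one.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical instrumentation: number of `Counted::eq` calls so far.
    static EQ_CALLS: Cell<usize> = Cell::new(0);
}

// Stand-in for an element type with a non-trivial `==`.
struct Counted(u32);

impl PartialEq for Counted {
    fn eq(&self, other: &Self) -> bool {
        EQ_CALLS.with(|c| c.set(c.get() + 1));
        self.0 == other.0
    }
}

// Runs `f` and returns its result together with the number of `eq` calls it made.
fn count_eq_calls(f: impl FnOnce() -> bool) -> (bool, usize) {
    EQ_CALLS.with(|c| c.set(0));
    let result = f();
    (result, EQ_CALLS.with(|c| c.get()))
}

// Non-short-circuiting: `&` on `bool` always evaluates both operands.
fn chunk_eq_eager(a: &[Counted; 4], b: &[Counted; 4]) -> bool {
    (a[0] == b[0]) & (a[1] == b[1]) & (a[2] == b[2]) & (a[3] == b[3])
}

// Short-circuiting: `&&` stops at the first `false`.
fn chunk_eq_lazy(a: &[Counted; 4], b: &[Counted; 4]) -> bool {
    a[0] == b[0] && a[1] == b[1] && a[2] == b[2] && a[3] == b[3]
}

fn main() {
    // The chunks differ in the very first element.
    let a = [Counted(0), Counted(1), Counted(2), Counted(3)];
    let b = [Counted(9), Counted(1), Counted(2), Counted(3)];

    let (equal, calls) = count_eq_calls(|| chunk_eq_eager(&a, &b));
    println!("eager: equal={equal}, eq calls={calls}"); // equal=false, eq calls=4

    let (equal, calls) = count_eq_calls(|| chunk_eq_lazy(&a, &b));
    println!("lazy:  equal={equal}, eq calls={calls}"); // equal=false, eq calls=1
}
```

For cheap element types the three extra comparisons are what lets the optimizer vectorize; for something like `[String]` each extra call is real work.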

Member Author

@the8472 the8472 Oct 4, 2023

Yeah, there are a bunch of other open issues about && not doing the right thing too. I initially used flag &= ..., but that unrolled to a sequence of cmp + jump pairs, which was also faster but took a lot more instructions. This version does a vpxor on AVX2.

They're all references anyway, so we can't really "load" them. We could special-case it for T: Copy but that wouldn't make a difference for &str or similar.

I'm considering peeling the first chunk (in addition to the residual) and comparing it the conventional way if perf results are bad. That should hopefully catch a lot of the non-equal results early while still doing most of the work in the unrolled variant.
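
A rough sketch of that shape, assuming a chunk size of 4 and a hypothetical free function `chunked_eq` (the PR's actual code lives in the slice `PartialEq` specialization and is not reproduced here). The first chunk and the residual use the conventional short-circuiting comparison, and only the full chunks in between use `&`.

```rust
// Hypothetical helper for illustration; not the implementation from this PR.
fn chunked_eq<T: PartialEq>(a: &[T], b: &[T]) -> bool {
    const CHUNK: usize = 4; // assumed chunk size

    if a.len() != b.len() {
        return false;
    }

    // Peeled first chunk: conventional, short-circuiting comparison, so an
    // early mismatch costs at most a few `==` calls.
    let head = a.len().min(CHUNK);
    let (a_head, a_rest) = a.split_at(head);
    let (b_head, b_rest) = b.split_at(head);
    if !a_head.iter().zip(b_head).all(|(x, y)| x == y) {
        return false;
    }

    // Full chunks: `&` inside a chunk (no short-circuit), `all` between chunks.
    let a_chunks = a_rest.chunks_exact(CHUNK);
    let b_chunks = b_rest.chunks_exact(CHUNK);
    let (a_tail, b_tail) = (a_chunks.remainder(), b_chunks.remainder());

    a_chunks
        .zip(b_chunks)
        .all(|(x, y)| (x[0] == y[0]) & (x[1] == y[1]) & (x[2] == y[2]) & (x[3] == y[3]))
        // Residual: fewer than CHUNK elements, compared the conventional way.
        && a_tail.iter().zip(b_tail).all(|(x, y)| x == y)
}

fn main() {
    let a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    let mut b = a;
    println!("{}", chunked_eq(&a, &b)); // true
    b[7] = 0; // mismatch inside a full chunk
    println!("{}", chunked_eq(&a, &b)); // false
}
```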

Member Author

I tried preloading all the values with let _ = ManuallyDrop::new(ptr::read(self.get_unchecked(i))); etc. and then using && but it didn't help.

@the8472
Member Author

the8472 commented Oct 4, 2023

> Hmm, we used to have a 4-way unroll in all the short-circuiting slice iterator stuff too, because it looks great in simple stuff,

That was just an unroll though, still short-circuiting for each item, right? This is meant to also enable vectorization for a subset of the cases.

@rust-timer
Collaborator

Finished benchmarking commit (dfe514e): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.3% | [0.2%, 4.2%] | 17 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.7% | [-0.7%, -0.7%] | 2 |
| Improvements ✅ (secondary) | -0.6% | [-0.6%, -0.6%] | 2 |
| All ❌✅ (primary) | 1.1% | [-0.7%, 4.2%] | 19 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.3% | [0.3%, 2.3%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.8% | [-5.3%, -4.3%] | 2 |
| Improvements ✅ (secondary) | -2.5% | [-2.5%, -2.5%] | 1 |
| All ❌✅ (primary) | -1.8% | [-5.3%, 2.3%] | 4 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.5% | [1.9%, 3.3%] | 3 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.0% | [-2.0%, -2.0%] | 1 |
| All ❌✅ (primary) | 2.5% | [1.9%, 3.3%] | 3 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.7% | [0.2%, 4.8%] | 28 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.3% | [-0.5%, -0.2%] | 3 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.6% | [-0.5%, 4.8%] | 31 |

Bootstrap: 623.638s -> 622.479s (-0.19%)
Artifact size: 272.00 MiB -> 272.04 MiB (0.01%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Oct 4, 2023
@the8472
Member Author

the8472 commented Oct 4, 2023

Ok, the opt-results have their codegen units perturbed. Bad for comparisons. Maybe we should have some 1CGU benchmarks...

But the check-results are informative. One of the issues is ValTree, which can contain nested slices. So yeah, splitting into a short and a long case probably makes sense, plus peeling the first chunk.

@the8472 the8472 force-pushed the chunked-generic-slice-eq branch from 9d3905f to 82ee190 Compare October 4, 2023 18:07
@the8472
Member Author

the8472 commented Oct 4, 2023

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 4, 2023
@bors
Contributor

bors commented Oct 4, 2023

⌛ Trying commit 82ee190 with merge f2bd50d...

bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 4, 2023
…<try>

@bors
Contributor

bors commented Oct 4, 2023

☀️ Try build successful - checks-actions
Build commit: f2bd50d (f2bd50d077e49715c223093b18dc62ad82270170)

@rust-timer
Collaborator

Finished benchmarking commit (f2bd50d): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.4% | [0.3%, 9.0%] | 25 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.3% | [-0.3%, -0.3%] | 1 |
| All ❌✅ (primary) | 1.4% | [0.3%, 9.0%] | 25 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.0% | [0.1%, 3.2%] | 4 |
| Regressions ❌ (secondary) | 2.8% | [2.8%, 2.8%] | 1 |
| Improvements ✅ (primary) | -4.1% | [-7.0%, -2.9%] | 4 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -1.0% | [-7.0%, 3.2%] | 8 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.5% | [1.2%, 7.8%] | 6 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 3.5% | [1.2%, 7.8%] | 6 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.8% | [0.1%, 4.3%] | 36 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.1% | [-0.1%, -0.1%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.7% | [-0.1%, 4.3%] | 38 |

Bootstrap: 622.556s -> 625.604s (0.49%)
Artifact size: 272.00 MiB -> 272.14 MiB (0.05%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 4, 2023
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 4, 2023
@bors
Contributor

bors commented Oct 4, 2023

⌛ Trying commit 2f78bce with merge 8ea9808...

bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 4, 2023
…<try>

@bors
Contributor

bors commented Oct 5, 2023

☀️ Try build successful - checks-actions
Build commit: 8ea9808 (8ea980854189e3f078df96bb53682379b7508af4)

@rust-timer
Collaborator

Finished benchmarking commit (8ea9808): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.3% | [0.3%, 9.5%] | 30 |
| Regressions ❌ (secondary) | 0.5% | [0.2%, 0.7%] | 4 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 1.3% | [0.3%, 9.5%] | 30 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.6% | [0.1%, 4.9%] | 5 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -3.2% | [-4.1%, -2.3%] | 3 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.4% | [-4.1%, 4.9%] | 8 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.6% | [1.3%, 8.1%] | 6 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 3.6% | [1.3%, 8.1%] | 6 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.8% | [0.1%, 4.3%] | 36 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.2% | [-0.2%, -0.2%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.7% | [-0.2%, 4.3%] | 37 |

Bootstrap: 623.37s -> 626.587s (0.52%)
Artifact size: 272.00 MiB -> 272.14 MiB (0.05%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 5, 2023
@the8472
Member Author

the8472 commented Oct 5, 2023

@bors try
@rust-timer queue include=html5ever,webrender,bitmaps,helloworld,cargo,hyper,exa,unify-linearly

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 5, 2023
@bors
Contributor

bors commented Oct 5, 2023

⌛ Trying commit ef4600d with merge 10dcf57...

bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 5, 2023
…<try>

@bors
Contributor

bors commented Oct 5, 2023

☀️ Try build successful - checks-actions
Build commit: 10dcf57 (10dcf575ae94b07bcff5033f5969e8b0570bcffa)

@rust-timer
Collaborator

Finished benchmarking commit (10dcf57): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.2% | [0.7%, 3.4%] | 3 |
| Regressions ❌ (secondary) | 0.6% | [0.6%, 0.6%] | 1 |
| Improvements ✅ (primary) | -0.4% | [-0.5%, -0.2%] | 3 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.9% | [-0.5%, 3.4%] | 6 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.5% | [-10.4%, -0.8%] | 3 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -4.5% | [-10.4%, -0.8%] | 3 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.6% | [1.9%, 3.2%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 2.6% | [1.9%, 3.2%] | 2 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.3% | [0.3%, 0.4%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.6% | [-1.6%, -0.1%] | 11 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.5% | [-1.6%, 0.4%] | 13 |

Bootstrap: 620.986s -> 623.686s (0.43%)
Artifact size: 271.97 MiB -> 272.02 MiB (0.02%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 5, 2023
- self.iter().zip(other.iter()).all(|(x, y)| x == y)
+ // ZSTs have no identity and slices don't guarantee which addresses-to-ZSTs they produce
+ // so we only need to compare them once to determine the behavior of the PartialEq impl
+ if const { mem::size_of::<A>() == 0 && mem::size_of::<B>() == 0 } {
Member

nit:

Suggested change
- if const { mem::size_of::<A>() == 0 && mem::size_of::<B>() == 0 } {
+ if const { A::IS_ZST && B::IS_ZST } {

  default fn equal(&self, other: &[B]) -> bool {
      if self.len() != other.len() {
          return false;
      }

-     self.iter().zip(other.iter()).all(|(x, y)| x == y)
+     // ZSTs have no identity and slices don't guarantee which addresses-to-ZSTs they produce
+     // so we only need to compare them once to determine the behavior of the PartialEq impl
Member

The eq could still have side-effects though, no?

We have https://rust-lang.github.io/rfcs/1521-copy-clone-semantics.html to let us skip side-effects in clone sometimes, but I don't think we have a justification to skip side-effects in PartialEq.
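
As an illustration of the side-effect concern, the sketch below uses a hypothetical zero-sized type `Noisy` whose `eq` bumps a counter. `eq_once` is an assumed stand-in for a compare-once ZST shortcut, not the PR's implementation, and `eq_calls_for` is a made-up helper. The number of `eq` calls is observable from outside and differs between the two strategies.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical instrumentation: number of `Noisy::eq` calls so far.
    static EQ_CALLS: Cell<usize> = Cell::new(0);
}

// A zero-sized type whose `eq` has an observable side effect.
struct Noisy;

impl PartialEq for Noisy {
    fn eq(&self, _other: &Self) -> bool {
        EQ_CALLS.with(|c| c.set(c.get() + 1));
        true
    }
}

// Runs `f` and returns how many times `Noisy::eq` was called while it ran.
fn eq_calls_for(f: impl FnOnce() -> bool) -> usize {
    EQ_CALLS.with(|c| c.set(0));
    let _ = f();
    EQ_CALLS.with(|c| c.get())
}

// Element-by-element comparison: one `eq` call per pair.
fn eq_every_element(a: &[Noisy], b: &[Noisy]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x == y)
}

// Assumed shape of a ZST shortcut: compare a single pair and reuse the answer.
fn eq_once(a: &[Noisy], b: &[Noisy]) -> bool {
    a.len() == b.len() && (a.is_empty() || a[0] == b[0])
}

fn main() {
    assert_eq!(std::mem::size_of::<Noisy>(), 0);
    let a = [Noisy, Noisy, Noisy];
    let b = [Noisy, Noisy, Noisy];

    println!("{}", eq_calls_for(|| eq_every_element(&a, &b))); // 3
    println!("{}", eq_calls_for(|| eq_once(&a, &b))); // 1
}
```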

@the8472 the8472 force-pushed the chunked-generic-slice-eq branch from ef4600d to 5850f36 Compare October 7, 2023 19:19
@the8472
Member Author

the8472 commented Oct 7, 2023

This vectorizes nicely on AVX2 (got it down to 12ns/iteration), has a guaranteed non-chunked preroll in case comparisons are expensive, and produces less unoptimized LLVM IR than the current implementation. After optimizations it's more IR due to unrolling.

If this won't make perf happy then I'm out of ideas.

@bors try
@rust-timer queue include=html5ever,webrender,bitmaps,helloworld,cargo,hyper,exa,unify-linearly

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 7, 2023
@bors
Contributor

bors commented Oct 7, 2023

⌛ Trying commit 5850f36 with merge 1cf5d79...

bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 7, 2023
…<try>

@bors
Contributor

bors commented Oct 7, 2023

☀️ Try build successful - checks-actions
Build commit: 1cf5d79 (1cf5d79b8e9f5ab7ed26e766f08b6277c092f157)

@rust-timer
Collaborator

Finished benchmarking commit (1cf5d79): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.6% | [0.9%, 2.4%] | 9 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.5% | [-0.5%, -0.5%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 1.3% | [-0.5%, 2.4%] | 10 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.8% | [-0.8%, -0.7%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.8% | [-0.8%, -0.7%] | 2 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.5% | [1.0%, 2.1%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 1.5% | [1.0%, 2.1%] | 2 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.1%, 0.1%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.8% | [-1.9%, -0.0%] | 10 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.7% | [-1.9%, 0.1%] | 11 |

Bootstrap: 622.45s -> 625.291s (0.46%)
Artifact size: 270.69 MiB -> 270.67 MiB (-0.01%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 7, 2023
@the8472 the8472 closed this Oct 7, 2023
Labels
perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
5 participants