-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize reverse_copy() #804
Conversation
Fixed function signature and conditions for vectorization.
Line 11 in 88f8f44
Line 16 in 88f8f44
Lines 24 to 28 in 88f8f44
|
Is already defined in xutility
Rename __std_reverse_trivially_copyable_X to __std_reverse_copy_trivially_copyable_X
Any chance you would be willing to provide some performance numbers for before/after this change? Edit: And ideally the source used to acquire the numbers. |
Sure, I did some basic micro benchmarking of the implementation with google/benchmark on my x64 machine. A summary of the results can be found on google docs, the code is available on gist. I’d be happy to do more benchmarking if required. |
That google doc isn't public. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs testing exercising each of the newly added special paths.
Let us know if you need any help adding test coverage, or if there are any questions we can answer. 😺 |
@pawREP, just a friendly ping. Are you available to add testing for this? Otherwise it will have to wait until another contributor or maintainer adds testing. |
I think the benefits we saw for reverse are sufficient to motivate this. All this PR needs to land is some tests. |
Here's the benchmark I wrote way back when I added the vector reverse:
Here are results from my 3970X; first, the important ones like vector: by the time you get to 64 elements the wins are huge. There's no reason this wouldn't apply just as much to reverse_copy (although absolute wins might be lower because memory bandwidth consumption is higher for that algorithm)
The full list:
|
Here would be a good place to add the tests: https://github.com/microsoft/STL/blob/master/tests/std/tests/VSO_0000000_vector_algorithms/test.cpp |
Remove an unnecessary semicolon after the function definition. Use the modern ranges::reverse pattern. Remove unnecessary bool_constant. Add test coverage for varying containers.
I pushed a significant simplification. Originally, we had both Additionally, @CaseyCarter used a simpler technique that we developed for vectorizing I removed an unnecessary semicolon after the function definition. I also removed an unnecessary use of I added test coverage for what happens when the containers/iterators vary in strength. This was prompted by my concern that |
Also remove an unnecessary std:: in the test.
I noticed the comment banner said:
So I changed the separately compiled functions accordingly, synthesizing the return value in the header. Thanks to @BillyONeal for writing the original extremely detailed comment banner; otherwise this would have appeared as a regression in optimized code that would have taken forever to diagnose. |
The 1, 4, and 8-byte reverse_copy implementations all start by testing for 32 bytes of AVX2 work. This is exactly half of what the 1/2/4/8-byte reverse implementations test for; I believe they need 64 bytes because they swap the front and the back of the range (so, 32 bytes on each side). Uniquely, 2-byte reverse_copy said 64. I believe that this is a copy-paste error, and that it should say 32 like all other reverse_copy implementations. This couldn't have been caught by testing, because the effect was to disable the AVX2 optimization for 2-byte elements, where the source range was between [32, 64) bytes, resulting in reduced performance only.
The 1, 4, and 8-byte Uniquely, 2-byte This couldn't have been caught by testing, because the effect was to disable the AVX2 optimization for 2-byte elements, where the source range was between [32, 64) bytes, resulting in reduced performance only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to go! 🌔 🚀 🪐
I think I just copy pasta'd an email from Neeraj about it :) |
[microsoftGH-804 squashed and rebased]
Thanks for your contribution! I was getting tired of always forward copying my vector eyes. |
Co-authored-by: Billy Robert O'Neal III <[email protected]> Co-authored-by: Stephan T. Lavavej <[email protected]>
Resolves #181
The implementation closely follows the one for vectorized
std::reverse
.This currently includes the define of _USE_STD_VECTOR_ALGORITHMS in<algorithm>
, which presumably should be changed. What's the best solution here?