
Vectorize reverse_copy() #804

Merged
merged 15 commits into microsoft:master on Jul 27, 2020

Conversation

pawREP
Contributor

@pawREP pawREP commented May 6, 2020

Resolves #181

The implementation closely follows the one for vectorized std::reverse.

This currently includes the define of _USE_STD_VECTOR_ALGORITHMS in <algorithm>, which presumably should be changed. What's the best solution here?

@pawREP pawREP requested a review from a team as a code owner May 6, 2020 20:27
@msftclas

msftclas commented May 6, 2020

CLA assistant check
All CLA requirements met.

@pawREP pawREP marked this pull request as draft May 6, 2020 21:45
@CaseyCarter CaseyCarter added the performance Must go faster label May 7, 2020
Fixed function signature and conditions for vectorization.
@StephanTLavavej
Member

This currently includes the define of _USE_STD_VECTOR_ALGORITHMS in <algorithm>, which presumably should be changed. What's the best solution here?

<algorithm> includes <xmemory>, which includes <xutility>, which defines this macro, so you should be able to use it without doing anything special:

#include <xmemory>  // <algorithm> includes <xmemory>

#include <xutility> // <xmemory> includes <xutility>

STL/stl/inc/xutility, lines 24 to 28 in 88f8f44:

#if (defined(_M_IX86) || defined(_M_X64)) && !defined(_M_CEE_PURE) && !defined(_M_HYBRID)
#define _USE_STD_VECTOR_ALGORITHMS 1
#else
#define _USE_STD_VECTOR_ALGORITHMS 0
#endif
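Because the macro is already defined once <xutility> has been included, a header can simply guard its declarations with it. A minimal sketch (the declaration is illustrative only; the real separately compiled functions follow the __std_reverse_copy_trivially_copyable_N naming used in this PR):

#include <xutility>

#if _USE_STD_VECTOR_ALGORITHMS
extern "C" {
// illustrative declaration; the exact signatures live in the STL sources
void __cdecl __std_reverse_copy_trivially_copyable_4(
    const void* _First, const void* _Last, void* _Dest) noexcept;
} // extern "C"
#endif // _USE_STD_VECTOR_ALGORITHMS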

pawREP added 2 commits May 7, 2020 18:38
Is already defined in xutility
Rename __std_reverse_trivially_copyable_X to __std_reverse_copy_trivially_copyable_X
@pawREP pawREP marked this pull request as ready for review May 7, 2020 18:07
@cbezault
Contributor

cbezault commented May 8, 2020

Any chance you would be willing to provide some performance numbers for before/after this change?

Edit: And ideally the source used to acquire the numbers.

@StephanTLavavej StephanTLavavej added the info needed We need more info before working on this label May 8, 2020
@pawREP
Contributor Author

pawREP commented May 9, 2020

Any chance you would be willing to provide some performance numbers for before/after this change?

Edit: And ideally the source used to acquire the numbers.

Sure, I did some basic micro-benchmarking of the implementation with google/benchmark on my x64 machine. A summary of the results can be found on Google Docs, and the code is available as a gist. I'd be happy to do more benchmarking if required.
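The gist itself isn't reproduced here; as a rough illustration of the shape of such a microbenchmark (types and sizes are placeholders, not the actual code from the gist):

#include <algorithm>
#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

template <class T>
void BenchReverseCopy(benchmark::State &state) {
  std::vector<T> src(static_cast<size_t>(state.range(0)));
  std::vector<T> dst(src.size());
  std::iota(src.begin(), src.end(), T{1});
  for (auto _ : state) {
    (void)_;
    std::reverse_copy(src.begin(), src.end(), dst.begin());
    benchmark::DoNotOptimize(dst.data()); // keep the copy from being optimized away
  }
}

BENCHMARK_TEMPLATE(BenchReverseCopy, unsigned char)->Range(8, 100'000);
BENCHMARK_TEMPLATE(BenchReverseCopy, unsigned int)->Range(8, 100'000);
BENCHMARK_TEMPLATE(BenchReverseCopy, unsigned long long)->Range(8, 100'000);

BENCHMARK_MAIN();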

@cbezault
Contributor

That google doc isn't public.

Member

@BillyONeal BillyONeal left a comment


I think this needs testing exercising each of the newly added special paths.
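For example, a test along these lines would exercise each element size over lengths that straddle the vector-width thresholds (a sketch only, not the tests that eventually landed):

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

template <class T>
void test_reverse_copy() {
  // lengths from 0 upward cover the scalar tail, the SSE blocks, and the AVX2 blocks
  for (std::size_t n = 0; n < 200; ++n) {
    std::vector<T> src(n);
    std::iota(src.begin(), src.end(), static_cast<T>(1));
    std::vector<T> dst(n);
    std::reverse_copy(src.begin(), src.end(), dst.begin());
    for (std::size_t i = 0; i < n; ++i) {
      assert(dst[i] == src[n - 1 - i]);
    }
  }
}

int main() {
  test_reverse_copy<std::uint8_t>();  // 1-byte path
  test_reverse_copy<std::uint16_t>(); // 2-byte path
  test_reverse_copy<std::uint32_t>(); // 4-byte path
  test_reverse_copy<std::uint64_t>(); // 8-byte path
}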

stl/src/vector_algorithms.cpp (inline review comment, resolved)
@StephanTLavavej StephanTLavavej removed the info needed We need more info before working on this label May 20, 2020
@StephanTLavavej
Member

Let us know if you need any help adding test coverage, or if there are any questions we can answer. 😺

@cbezault
Contributor

cbezault commented Jun 4, 2020

@pawREP, just a friendly ping. Are you available to add testing for this? Otherwise it will have to wait until another contributor or maintainer adds testing.

@BillyONeal
Member

Any chance you would be willing to provide some performance numbers for before/after this change?

Edit: And ideally the source used to acquire the numbers.

I think the benefits we saw for reverse are sufficient to motivate this. All this PR needs to land is some tests.

@BillyONeal
Member

BillyONeal commented Jul 21, 2020

Here's the benchmark I wrote way back when I added the vector reverse:

#include <algorithm>
#include <benchmark/benchmark.h>
#include <deque>
#include <functional>
#include <list>
#include <numeric>
#include <stdlib.h>
#include <utility>
#include <vector>

using namespace std;

void verify(bool b) {
  if (!b) {
    exit(1);
  }
}

template <class _BidIt>
void plain_bidi_reverse(_BidIt _First, _BidIt _Last) {
  for (; _First != _Last && _First != --_Last; ++_First) {
    const auto _Temp = *_First;
    *_First = *_Last;
    *_Last = _Temp;
  }
}

template <class Container, class TestedFn>
inline void RunTest(benchmark::State &state, size_t dataSize, TestedFn fn) {
  Container data(dataSize);
  iota(data.begin(), data.end(),
       static_cast<typename Container::value_type>(1));
  fn(data);
  verify(is_sorted(data.begin(), data.end(), greater<>{}));
  fn(data);
  verify(is_sorted(data.begin(), data.end(), less<>{}));
  for (auto _ : state) {
    (void)_;
    fn(data);
  }
}

template <class Container> void BenchPlainBidiReverse(benchmark::State &state) {
  RunTest<Container>(state, static_cast<size_t>(state.range(0)),
                     [](auto &c) { plain_bidi_reverse(c.begin(), c.end()); });
}

template <class Container> void BenchStdReverse(benchmark::State &state) {
  RunTest<Container>(state, static_cast<size_t>(state.range(0)),
                     [](auto &c) { reverse(c.begin(), c.end()); });
}

BENCHMARK_TEMPLATE(BenchStdReverse, deque<unsigned int>)->Range(8, 100'000);
BENCHMARK_TEMPLATE(BenchStdReverse, list<unsigned int>)->Range(8, 100'000);

BENCHMARK_TEMPLATE(BenchPlainBidiReverse, vector<unsigned char>)->Range(8, 255);
BENCHMARK_TEMPLATE(BenchStdReverse, vector<unsigned char>)->Range(8, 255);
BENCHMARK_TEMPLATE(BenchPlainBidiReverse, vector<unsigned short>)->Range(8, 65535);
BENCHMARK_TEMPLATE(BenchStdReverse, vector<unsigned short>)->Range(8, 65535);
BENCHMARK_TEMPLATE(BenchPlainBidiReverse, vector<unsigned int>)->Range(8, 100'000);
BENCHMARK_TEMPLATE(BenchStdReverse, vector<unsigned int>)->Range(8, 100'000);

extern "C" extern long __isa_enabled;
constexpr long __ISA_AVAILABLE_SSE2 = 1;
constexpr long __ISA_AVAILABLE_AVX2 = 5;

#include <emmintrin.h>
#include <intrin.h>
#include <xmmintrin.h>

extern "C" void _cdecl __std_sse_reverse_trivially_copyable_4(
    unsigned int *_First, unsigned int *_Last) throw() {
  if (_Last - _First > 8
#ifndef _M_X64
      && _bittest(&__isa_enabled, __ISA_AVAILABLE_SSE2)
#endif /* _M_X64 */
          ) {
    // run floor(n / 8) vector iterations; each one reverses 4 elements from each end
    unsigned int *_Stop_at = _First + ((_Last - _First) >> 3 << 2);
    do {
      _Last -= 4;
      const __m128i _Left =
          _mm_loadu_si128(reinterpret_cast<__m128i *>(_First));
      const __m128i _Right =
          _mm_loadu_si128(reinterpret_cast<__m128i *>(_Last));
      // 27 == _MM_SHUFFLE(0, 1, 2, 3): reverse the four 32-bit elements
      const __m128i _Left_reversed = _mm_shuffle_epi32(_Left, 27);
      const __m128i _Right_reversed = _mm_shuffle_epi32(_Right, 27);
      _mm_storeu_si128(reinterpret_cast<__m128i *>(_First), _Right_reversed);
      _mm_storeu_si128(reinterpret_cast<__m128i *>(_Last), _Left_reversed);
      _First += 4;
    } while (_First != _Stop_at);
  }

  for (; _First != _Last && _First != --_Last; ++_First) {
    const unsigned int _Temp = *_First;
    *_First = *_Last;
    *_Last = _Temp;
  }
}

void BenchUnsignedIntSseReverse(benchmark::State &state) {
  RunTest<vector<unsigned int>>(state, static_cast<size_t>(state.range(0)), [](auto &c) {
    __std_sse_reverse_trivially_copyable_4(&*c.begin(), &*c.end());
  });
}

BENCHMARK(BenchUnsignedIntSseReverse)->Range(8, 100'000);

extern "C" void _cdecl __std_avx2_reverse_trivially_copyable_4(
    unsigned int *_First, unsigned int *_Last) throw() {
  if (_Last - _First > 16 && _bittest(&__isa_enabled, __ISA_AVAILABLE_AVX2)) {
    // run floor(n / 16) vector iterations; each one reverses 8 elements from each end
    unsigned int *_Stop_at = _First + ((_Last - _First) >> 4 << 3);
    do {
      _Last -= 8;
      const __m256i _Left =
          _mm256_loadu_si256(reinterpret_cast<__m256i *>(_First));
      const __m256i _Right =
          _mm256_loadu_si256(reinterpret_cast<__m256i *>(_Last));
      // 27 == _MM_SHUFFLE(0, 1, 2, 3): reverse the 32-bit elements within each 128-bit lane
      const __m256i _Left_lane_reversed = _mm256_shuffle_epi32(_Left, 27);
      const __m256i _Right_lane_reversed = _mm256_shuffle_epi32(_Right, 27);
      // 78 == _MM_SHUFFLE(1, 0, 3, 2): swap the two 128-bit lanes
      const __m256i _Left_reversed =
          _mm256_permute4x64_epi64(_Left_lane_reversed, 78);
      const __m256i _Right_reversed =
          _mm256_permute4x64_epi64(_Right_lane_reversed, 78);
      _mm256_storeu_si256(reinterpret_cast<__m256i *>(_First), _Right_reversed);
      _mm256_storeu_si256(reinterpret_cast<__m256i *>(_Last), _Left_reversed);
      _First += 8;
    } while (_First != _Stop_at);
  }

  for (; _First != _Last && _First != --_Last; ++_First) {
    const unsigned int _Temp = *_First;
    *_First = *_Last;
    *_Last = _Temp;
  }
}

void BenchAvx2UnsignedIntReverse(benchmark::State &state) {
  RunTest<vector<unsigned int>>(state, static_cast<size_t>(state.range(0)), [](auto &c) {
    __std_avx2_reverse_trivially_copyable_4(&*c.begin(), &*c.end());
  });
}

BENCHMARK(BenchAvx2UnsignedIntReverse)->Range(8, 100'000);

BENCHMARK_TEMPLATE(BenchPlainBidiReverse, vector<unsigned long long>)
    ->Range(8, 100'000);
BENCHMARK_TEMPLATE(BenchStdReverse, vector<unsigned long long>)
    ->Range(8, 100'000);

BENCHMARK_MAIN();

Here are results from my 3970X. First, the important ones, like vector: by the time you get to 64 elements, the wins are huge. There's no reason this wouldn't apply just as much to reverse_copy (although the absolute wins might be lower, because memory bandwidth consumption is higher for that algorithm).

---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
BenchPlainBidiReverse<vector<unsigned char>>/8                 2.14 ns         2.10 ns    320000000
BenchPlainBidiReverse<vector<unsigned char>>/64                23.6 ns         23.4 ns     21333333
BenchPlainBidiReverse<vector<unsigned char>>/255                119 ns          120 ns      5600000
BenchStdReverse<vector<unsigned char>>/8                       3.57 ns         3.61 ns    203636364
BenchStdReverse<vector<unsigned char>>/64                      3.85 ns         3.85 ns    194782609
BenchStdReverse<vector<unsigned char>>/255                     14.2 ns         14.4 ns     49777778
BenchPlainBidiReverse<vector<unsigned long long>>/8            1.92 ns         1.93 ns    373333333
BenchPlainBidiReverse<vector<unsigned long long>>/64           15.4 ns         15.7 ns     49777778
BenchPlainBidiReverse<vector<unsigned long long>>/512           132 ns          131 ns      5600000
BenchPlainBidiReverse<vector<unsigned long long>>/4096         1017 ns         1004 ns       746667
BenchPlainBidiReverse<vector<unsigned long long>>/32768        8397 ns         8371 ns        89600
BenchPlainBidiReverse<vector<unsigned long long>>/100000      25675 ns        25495 ns        26353
BenchStdReverse<vector<unsigned long long>>/8                  3.46 ns         3.52 ns    213333333
BenchStdReverse<vector<unsigned long long>>/64                 8.98 ns         9.00 ns     74666667
BenchStdReverse<vector<unsigned long long>>/512                62.0 ns         61.4 ns     11200000
BenchStdReverse<vector<unsigned long long>>/4096                493 ns          488 ns      1120000
BenchStdReverse<vector<unsigned long long>>/32768              3890 ns         3836 ns       179200
BenchStdReverse<vector<unsigned long long>>/100000            11978 ns        11998 ns        56000

The full list:

D:\vclib-benchmarks\windows.x64.release>.\reverse.exe
07/21/20 11:28:22
Running .\reverse.exe
Run on (64 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 512 KiB (x32)
  L3 Unified 16384 KiB (x8)
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
BenchStdReverse<deque<unsigned int>>/8                         7.41 ns         7.32 ns     89600000
BenchStdReverse<deque<unsigned int>>/64                        50.8 ns         51.6 ns     10000000
BenchStdReverse<deque<unsigned int>>/512                        407 ns          410 ns      1792000
BenchStdReverse<deque<unsigned int>>/4096                      4039 ns         4028 ns       213333
BenchStdReverse<deque<unsigned int>>/32768                    26808 ns        26367 ns        24889
BenchStdReverse<deque<unsigned int>>/100000                   92349 ns        92072 ns         7467
BenchStdReverse<list<unsigned int>>/8                          3.65 ns         3.61 ns    194782609
BenchStdReverse<list<unsigned int>>/64                         36.6 ns         36.8 ns     18666667
BenchStdReverse<list<unsigned int>>/512                         314 ns          314 ns      2240000
BenchStdReverse<list<unsigned int>>/4096                       3560 ns         3530 ns       194783
BenchStdReverse<list<unsigned int>>/32768                     43682 ns        43493 ns        15448
BenchStdReverse<list<unsigned int>>/100000                   123828 ns       122768 ns         5600
BenchPlainBidiReverse<vector<unsigned char>>/8                 2.14 ns         2.10 ns    320000000
BenchPlainBidiReverse<vector<unsigned char>>/64                23.6 ns         23.4 ns     21333333
BenchPlainBidiReverse<vector<unsigned char>>/255                119 ns          120 ns      5600000
BenchStdReverse<vector<unsigned char>>/8                       3.57 ns         3.61 ns    203636364
BenchStdReverse<vector<unsigned char>>/64                      3.85 ns         3.85 ns    194782609
BenchStdReverse<vector<unsigned char>>/255                     14.2 ns         14.4 ns     49777778
BenchPlainBidiReverse<vector<unsigned short>>/8                1.90 ns         1.88 ns    373333333
BenchPlainBidiReverse<vector<unsigned short>>/64               31.9 ns         32.1 ns     22400000
BenchPlainBidiReverse<vector<unsigned short>>/512               251 ns          251 ns      2800000
BenchPlainBidiReverse<vector<unsigned short>>/4096             2047 ns         2040 ns       344615
BenchPlainBidiReverse<vector<unsigned short>>/32768           16528 ns        16741 ns        44800
BenchPlainBidiReverse<vector<unsigned short>>/65535           33049 ns        32993 ns        20364
BenchStdReverse<vector<unsigned short>>/8                      2.91 ns         2.93 ns    224000000
BenchStdReverse<vector<unsigned short>>/64                     3.82 ns         3.85 ns    186666667
BenchStdReverse<vector<unsigned short>>/512                    19.9 ns         19.9 ns     34461538
BenchStdReverse<vector<unsigned short>>/4096                    139 ns          138 ns      4977778
BenchStdReverse<vector<unsigned short>>/32768                  1124 ns         1123 ns       640000
BenchStdReverse<vector<unsigned short>>/65535                  2486 ns         2455 ns       280000
BenchPlainBidiReverse<vector<unsigned int>>/8                  1.90 ns         1.88 ns    373333333
BenchPlainBidiReverse<vector<unsigned int>>/64                 15.3 ns         15.4 ns     49777778
BenchPlainBidiReverse<vector<unsigned int>>/512                 131 ns          129 ns      4977778
BenchPlainBidiReverse<vector<unsigned int>>/4096               1017 ns         1001 ns       640000
BenchPlainBidiReverse<vector<unsigned int>>/32768              8173 ns         8196 ns        89600
BenchPlainBidiReverse<vector<unsigned int>>/100000            25026 ns        25112 ns        28000
BenchStdReverse<vector<unsigned int>>/8                        2.17 ns         2.20 ns    320000000
BenchStdReverse<vector<unsigned int>>/64                       5.26 ns         5.16 ns    112000000
BenchStdReverse<vector<unsigned int>>/512                      35.1 ns         35.3 ns     20363636
BenchStdReverse<vector<unsigned int>>/4096                      277 ns          276 ns      2488889
BenchStdReverse<vector<unsigned int>>/32768                    2231 ns         2246 ns       320000
BenchStdReverse<vector<unsigned int>>/100000                   6814 ns         6801 ns        89600
BenchUnsignedIntSseReverse/8                                   2.01 ns         2.04 ns    344615385
BenchUnsignedIntSseReverse/64                                  4.81 ns         4.76 ns    144516129
BenchUnsignedIntSseReverse/512                                 32.2 ns         32.1 ns     22400000
BenchUnsignedIntSseReverse/4096                                 250 ns          251 ns      2800000
BenchUnsignedIntSseReverse/32768                               2140 ns         2131 ns       344615
BenchUnsignedIntSseReverse/100000                              6558 ns         6557 ns       112000
BenchAvx2UnsignedIntReverse/8                                  2.11 ns         2.13 ns    344615385
BenchAvx2UnsignedIntReverse/64                                 4.73 ns         4.81 ns    149333333
BenchAvx2UnsignedIntReverse/512                                30.2 ns         30.5 ns     23578947
BenchAvx2UnsignedIntReverse/4096                                242 ns          241 ns      2986667
BenchAvx2UnsignedIntReverse/32768                              1945 ns         1967 ns       373333
BenchAvx2UnsignedIntReverse/100000                             5955 ns         5999 ns       112000
BenchPlainBidiReverse<vector<unsigned long long>>/8            1.92 ns         1.93 ns    373333333
BenchPlainBidiReverse<vector<unsigned long long>>/64           15.4 ns         15.7 ns     49777778
BenchPlainBidiReverse<vector<unsigned long long>>/512           132 ns          131 ns      5600000
BenchPlainBidiReverse<vector<unsigned long long>>/4096         1017 ns         1004 ns       746667
BenchPlainBidiReverse<vector<unsigned long long>>/32768        8397 ns         8371 ns        89600
BenchPlainBidiReverse<vector<unsigned long long>>/100000      25675 ns        25495 ns        26353
BenchStdReverse<vector<unsigned long long>>/8                  3.46 ns         3.52 ns    213333333
BenchStdReverse<vector<unsigned long long>>/64                 8.98 ns         9.00 ns     74666667
BenchStdReverse<vector<unsigned long long>>/512                62.0 ns         61.4 ns     11200000
BenchStdReverse<vector<unsigned long long>>/4096                493 ns          488 ns      1120000
BenchStdReverse<vector<unsigned long long>>/32768              3890 ns         3836 ns       179200
BenchStdReverse<vector<unsigned long long>>/100000            11978 ns        11998 ns        56000

Remove an unnecessary semicolon after the function definition.

Use the modern ranges::reverse pattern.

Remove unnecessary bool_constant.

Add test coverage for varying containers.
@StephanTLavavej
Member

I pushed a significant simplification. Originally, we had both if constexpr and tag dispatch codepaths for vectorized reverse, which you carefully followed for reverse_copy (thanks!). Now, all of our supported compilers activate if constexpr except for the old version of CUDA we still test with. We think it's reasonable to restrict the vectorization codepaths to when if constexpr is available (which will temporarily affect CUDA, but not in a way that triggers compiler errors, or generates anything worse than what we did for years before implementing vectorization, and it will return after we can upgrade to CUDA 10.1 Update 2).

Additionally, @CaseyCarter used a simpler technique that we developed for vectorizing ranges::reverse, which avoids a lot of the repetition for the 1/2/4/8 byte cases; I have enhanced both std::reverse and std::reverse_copy to follow that.
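Roughly, that shared dispatch looks like the following (a sketch with placeholder names; the real code in <algorithm>/<xutility> calls the separately compiled __std_reverse_copy_trivially_copyable_N functions):

#include <cstddef>
#include <cstring>
#include <type_traits>

// placeholder for the vectorized, separately compiled routines
template <std::size_t _Size>
void _Sketch_reverse_copy_blocks(const void* _First, const void* _Last, void* _Dest) {
    auto _Src        = static_cast<const unsigned char*>(_Last);
    auto _Out        = static_cast<unsigned char*>(_Dest);
    const auto _Stop = static_cast<const unsigned char*>(_First);
    while (_Src != _Stop) {
        _Src -= _Size;
        std::memcpy(_Out, _Src, _Size);
        _Out += _Size;
    }
}

template <class _Ty>
_Ty* sketch_reverse_copy(const _Ty* _First, const _Ty* _Last, _Ty* _Dest) {
    if constexpr (std::is_trivially_copyable_v<_Ty>
        && (sizeof(_Ty) == 1 || sizeof(_Ty) == 2 || sizeof(_Ty) == 4 || sizeof(_Ty) == 8)) {
        // one if constexpr dispatch covers the 1/2/4/8-byte cases without
        // repeating the surrounding logic for each element size
        _Sketch_reverse_copy_blocks<sizeof(_Ty)>(_First, _Last, _Dest);
        return _Dest + (_Last - _First);
    } else {
        // fallback: plain element-by-element copy in reverse order
        for (; _First != _Last; ++_Dest) {
            *_Dest = *--_Last;
        }
        return _Dest;
    }
}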

I removed an unnecessary semicolon after the function definition.

I also removed an unnecessary use of bool_constant; conjunction_v directly consumes the type traits structs (and is more efficient doing so).

I added test coverage for what happens when the containers/iterators vary in strength. This was prompted by my concern that reverse_copy might mishandle copying a contiguous range to a non-contiguous range; this fear was unfounded but it's still good to have the coverage.

Also remove an unnecessary std:: in the test.
@StephanTLavavej
Member

I noticed the comment banner said:

The optimizer also assumes in that case that a pointer parameter is not returned to the caller via the return value, so functions using "noalias" must usually return void.

So I changed the separately compiled functions accordingly, synthesizing the return value in the header. Thanks to @BillyONeal for writing the original extremely detailed comment banner; otherwise this would have appeared as a regression in optimized code that would have taken forever to diagnose.
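In sketch form (names and signatures are illustrative, not the exact ones in the headers):

// A "noalias" function must not hand a pointer parameter back through its
// return value, so the separately compiled routine returns void...
extern "C" __declspec(noalias) void __cdecl __sketch_reverse_copy_vectorized_4(
    const void* _First, const void* _Last, void* _Dest) noexcept; // defined separately, e.g. in vector_algorithms.cpp

// ...and the header-side wrapper synthesizes reverse_copy's return value itself.
inline unsigned int* sketch_reverse_copy_4(
    const unsigned int* _First, const unsigned int* _Last, unsigned int* _Dest) noexcept {
    __sketch_reverse_copy_vectorized_4(_First, _Last, _Dest);
    return _Dest + (_Last - _First); // synthesized in the header, not returned by the callee
}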

@StephanTLavavej
Member

  • Cleanup: Use const void* _Stop_at, we don't write through this.
  • Cleanups: while loop, and _Byte_length can just take const void*.
  • Copy-paste fix! Change 64 to 32.

The 1, 4, and 8-byte reverse_copy implementations all start by testing for 32 bytes of AVX2 work. This is exactly half of what the 1/2/4/8-byte reverse implementations test for; I believe they need 64 bytes because they swap the front and the back of the range (so, 32 bytes on each side).

Uniquely, 2-byte reverse_copy said 64. I believe that this is a copy-paste error, and that it should say 32 like all other reverse_copy implementations.

This couldn't have been caught by testing, because the effect was to disable the AVX2 optimization for 2-byte elements, where the source range was between [32, 64) bytes, resulting in reduced performance only.
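The difference in thresholds can be sketched as follows (illustrative guards; _Byte_length here is a stand-in for the helper mentioned above):

#include <cstddef>

inline std::size_t _Byte_length(const void* _First, const void* _Last) {
    return static_cast<std::size_t>(
        static_cast<const char*>(_Last) - static_cast<const char*>(_First));
}

inline bool can_vectorize_reverse_avx2(const void* _First, const void* _Last) {
    // each AVX2 iteration of reverse swaps a 32-byte block at the front with a
    // 32-byte block at the back of the same range, so it needs 64 bytes of work
    return _Byte_length(_First, _Last) >= 64;
}

inline bool can_vectorize_reverse_copy_avx2(const void* _First, const void* _Last) {
    // each AVX2 iteration of reverse_copy reads one 32-byte source block and
    // writes one reversed 32-byte block to the destination, so 32 bytes suffice;
    // the 2-byte flavor mistakenly required 64 before this fix
    return _Byte_length(_First, _Last) >= 32;
}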

Member

@StephanTLavavej StephanTLavavej left a comment


I think this is good to go! 🌔 🚀 🪐

@BillyONeal
Member

So I changed the separately compiled functions accordingly, synthesizing the return value in the header. Thanks to @BillyONeal for writing the original extremely detailed comment banner

I think I just copy pasta'd an email from Neeraj about it :)

@CaseyCarter CaseyCarter self-assigned this Jul 27, 2020
CaseyCarter pushed a commit to CaseyCarter/STL that referenced this pull request Jul 27, 2020
@CaseyCarter CaseyCarter mentioned this pull request Jul 27, 2020
@CaseyCarter CaseyCarter merged commit 1555e0b into microsoft:master Jul 27, 2020
@CaseyCarter
Contributor

Thanks for your contribution! I was getting tired of always forward copying my vector eyes.

Labels: performance Must go faster
Successfully merging this pull request may close: <algorithm>: Vectorize reverse_copy() (#181)
6 participants