Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Add MemoryExtensions Reverse API to get parity with array #26381

Merged
merged 3 commits into from
Jan 20, 2018

Conversation

ahsonkhan
Copy link

@ahsonkhan ahsonkhan commented Jan 17, 2018

@ahsonkhan
Copy link
Author

@dotnet-bot test Windows x86 Release Build
@dotnet-bot test Windows x64 Debug Build
@dotnet-bot test UWP CoreCLR x64 Debug Build

Assert.Equal<byte>(expected, actual);
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having so many different tests is overkill given the implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would guess that these were copied from the Array.Reverse tests and modified to work with Span. If that is the case, then having test parity with Array.Reverse is worthwhile regardless of the simplicity of the implementation.

Copy link

@ghost ghost Jan 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At minimum, I'd remove the big memory tests - we're having trouble keeping the ones we have running as it is. Given that System.Memory seems be a dumping ground and has 37,000+ tests already, I'd favor careful monitoring of code coverage and fewer but better targeted tests.

(btw, if Array.Reverse is now using the simple implementation, I think the same argument can be made for those tests.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would guess that these were copied from the Array.Reverse tests

The original Array.Reverse had a pretty complex implementation because of it was half-reflection based, with multiple special fast paths implemented in C++ in the runtime.

This one has a very straightforward implementation, so the testing of many different variants is not warranted.

if Array.Reverse is now using the simple implementation

It does not quite - the non-generic one at least.

Copy link
Author

@ahsonkhan ahsonkhan Jan 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would guess that these were copied from the Array.Reverse tests and modified to work with Span.

This one has a very straightforward implementation, so the testing of many different variants is not warranted.

They were not taken from the Array.Reverse tests, but I will take a look at them now and try to simplify/reduce the test set. Are there other tests aside from the ones listed below:
https://github.com/dotnet/coreclr/blob/master/tests/src/CoreMangLib/cti/system/array/arrayreverse1.cs
https://github.com/dotnet/coreclr/blob/master/tests/src/CoreMangLib/cti/system/array/arrayreverse2.cs

Copy link
Member

@jkotas jkotas Jan 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahsonkhan
Copy link
Author

ahsonkhan commented Jan 17, 2018

We have some System.Data.SqlClient tests failing on Windows.10.Amd64.Open / Windows.81.Amd64.Open:
System.Data.SqlClient.Tests.SqlConnectionBasicTests/ConnectionTimeoutTestWithThread
https://mc.dot.net/#/user/ahsonkhan/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fcli~2F/d6790697435c78754677c23a57fa4471767aa80c/workItem/System.Data.SqlClient.Tests/analysis/xunit/System.Data.SqlClient.Tests.SqlConnectionBasicTests~2FConnectionTimeoutTestWithThread

Unhandled Exception of Type Xunit.Sdk.TrueException
Message :
Assert.True() Failure
Expected: True
Actual:   False
Stack Trace :
   at System.Data.SqlClient.Tests.SqlConnectionBasicTests.ConnectionTimeoutTestWithThread() in D:\j\workspace\windows-TGrou---f8ac6754\src\System.Data.SqlClient\tests\FunctionalTests\SqlConnectionBasicTests.cs:line 109

Any ideas why?

Here is the filed issue: https://github.com/dotnet/corefx/issues/26382

cc @saurabh500, @corivera

@ianhays
Copy link
Contributor

ianhays commented Jan 17, 2018

@ahsonkhan according to @MattGal here, they recently fixed a CI issue that was hiding some failures. My brotli PR is almost all red for the same reason.

@ahsonkhan
Copy link
Author

From @saurabh500, #26306 (comment)

I will disable the test. @geleems please investigate.

{
T temp = Unsafe.Add(ref p, i);
Unsafe.Add(ref p, i) = Unsafe.Add(ref p, j);
Unsafe.Add(ref p, j) = temp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this end up calculating each of the locations twice, or does the JIT see and remove the duplication?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JIT does remove one of them, but not both (see the lea instruction). Although, the calculations are done as part of the mov/lea instructions.

00007ff8`72c60940 488b01          mov     rax,qword ptr [rcx] ds:00000008`d577e178=0000018ccee62c88
00007ff8`72c60943 33d2            xor     edx,edx
00007ff8`72c60945 8b4908          mov     ecx,dword ptr [rcx+8]
00007ff8`72c60948 ffc9            dec     ecx
00007ff8`72c6094a 85c9            test    ecx,ecx
00007ff8`72c6094c 7e21            jle     ConsoleApp16!Program.Reverse[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)+0x2f (00007ff8`72c6096f)
00007ff8`72c6094e 4c63c2          movsxd  r8,edx
00007ff8`72c60951 468b0c80        mov     r9d,dword ptr [rax+r8*4]
00007ff8`72c60955 4e8d0480        lea     r8,[rax+r8*4]
00007ff8`72c60959 4c63d1          movsxd  r10,ecx
00007ff8`72c6095c 468b1c90        mov     r11d,dword ptr [rax+r10*4]
00007ff8`72c60960 458918          mov     dword ptr [r8],r11d
00007ff8`72c60963 46890c90        mov     dword ptr [rax+r10*4],r9d
00007ff8`72c60967 ffc2            inc     edx
00007ff8`72c60969 ffc9            dec     ecx
00007ff8`72c6096b 3bd1            cmp     edx,ecx
00007ff8`72c6096d 7cdf            jl      ConsoleApp16!Program.Reverse[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)+0xe (00007ff8`72c6094e)
00007ff8`72c6096f c3              ret

But, we could re-write it as follows to produce optimal disassembly (courtesy of Levi):

// Version A
public static unsafe void Reverse<T>(Span<T> span)
{
    if (span.Length > 1)
    {
        ref T p = ref MemoryMarshal.GetReference(span);
        IntPtr i = IntPtr.Zero;
        IntPtr j = (IntPtr)span.Length - 1;

        do
        {
            var temp1 = Unsafe.Add(ref p, i);
            var temp2 = Unsafe.Add(ref p, j);
            Unsafe.Add(ref p, i) = temp2;
            Unsafe.Add(ref p, j) = temp1;

            i += 1;
            j -= 1;
        } while (i.ToPointer() < j.ToPointer());
    }
}
00007fff`98502740 488b01          mov     rax,qword ptr [rcx] ds:0000003f`fa97dbb8=0000022fca27ae90
00007fff`98502743 8b5108          mov     edx,dword ptr [rcx+8]
00007fff`98502746 83fa01          cmp     edx,1
00007fff`98502749 7e25            jle     EndianReverse!EndianReverse.Program.Reverse[[System.Byte, System.Private.CoreLib]](System.Span`1<Byte>)+0x30 (00007fff`98502770)
00007fff`9850274b 33c9            xor     ecx,ecx
00007fff`9850274d 4863d2          movsxd  rdx,edx
00007fff`98502750 48ffca          dec     rdx
00007fff`98502753 440fb60408      movzx   r8d,byte ptr [rax+rcx]
00007fff`98502758 440fb60c10      movzx   r9d,byte ptr [rax+rdx]
00007fff`9850275d 44880c08        mov     byte ptr [rax+rcx],r9b
00007fff`98502761 44880410        mov     byte ptr [rax+rdx],r8b
00007fff`98502765 48ffc1          inc     rcx
00007fff`98502768 48ffca          dec     rdx
00007fff`9850276b 483bca          cmp     rcx,rdx
00007fff`9850276e 72e3            jb      EndianReverse!EndianReverse.Program.Reverse[[System.Byte, System.Private.CoreLib]](System.Span`1<Byte>)+0x13 (00007fff`98502753)
00007fff`98502770 c3              ret

However, for some reason, the following (sub-optimal) implementation ends up doing slightly better in the micro-benchmarks (especially for Guids):

// Version B
public static unsafe void Reverse<T>(this Span<T> span)
{
    if (span.Length > 1)
    {
        ref T p = ref MemoryMarshal.GetReference(span);
        IntPtr i = IntPtr.Zero;
        IntPtr j = (IntPtr)span.Length - 1;

        do
        {
            T temp = Unsafe.Add(ref p, i);
            Unsafe.Add(ref p, i) = Unsafe.Add(ref p, j);
            Unsafe.Add(ref p, j) = temp;

            i += 1;
            j -= 1;
        } while (i.ToPointer() < j.ToPointer());
    }
}
00007fff`98512740 488b01          mov     rax,qword ptr [rcx] ds:000000d5`db37dd08=00000218b916ae90
00007fff`98512743 8b5108          mov     edx,dword ptr [rcx+8]
00007fff`98512746 83fa01          cmp     edx,1
00007fff`98512749 7e28            jle     EndianReverse!EndianReverse.Program.Reverse[[System.Byte, System.Private.CoreLib]](System.Span`1<Byte>)+0x33 (00007fff`98512773)
00007fff`9851274b 33c9            xor     ecx,ecx
00007fff`9851274d 4863d2          movsxd  rdx,edx
00007fff`98512750 48ffca          dec     rdx
00007fff`98512753 440fb60408      movzx   r8d,byte ptr [rax+rcx]
00007fff`98512758 4c8d0c08        lea     r9,[rax+rcx]
00007fff`9851275c 440fb61410      movzx   r10d,byte ptr [rax+rdx]
00007fff`98512761 458811          mov     byte ptr [r9],r10b
00007fff`98512764 44880410        mov     byte ptr [rax+rdx],r8b
00007fff`98512768 48ffc1          inc     rcx
00007fff`9851276b 48ffca          dec     rdx
00007fff`9851276e 483bca          cmp     rcx,rdx
00007fff`98512771 72e0            jb      EndianReverse!EndianReverse.Program.Reverse[[System.Byte, System.Private.CoreLib]](System.Span`1<Byte>)+0x13 (00007fff`98512753)
00007fff`98512773 c3              ret

image

Should I change it to the version that produces the most optimal codegen? I can't really explain the micro-benchmark results that I am seeing though since they conflict with the disassembly.

cc @GrabYourPitchforks, @AndyAyersMS

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably the codegen for Guids looks differently different; not just that it has an extra lea =>r9 over the "A" version

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would stick with the original, especially if we've used that idiomatic sequence for swapping array elements elsewhere. Best not to have too many variant patterns for the same thing if we can help it.

If you think the codegen for that version can be improved, please open an issue over in CoreCLR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed an issue and keeping the implementation unchanged:
https://github.com/dotnet/coreclr/issues/15937

span = actual;
span.Slice(2, 0).Reverse();

Assert.Equal<byte>(expected, span.ToArray());
Copy link

@magol magol Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a test of a empty span

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is testing that reversing an empty slice of a span does not reverse anything in the backing array.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Jan 19, 2018

I propose changing the core loop implementation of Array.Reverse (and the proposal in Span.Reverse) as follows.

Original

            ref T p = ref Unsafe.As<byte, T>(ref array.GetRawSzArrayData());
            int i = index;
            int j = index + length - 1;
            while (i < j)
            {
                T temp = Unsafe.Add(ref p, i);
                Unsafe.Add(ref p, i) = Unsafe.Add(ref p, j);
                Unsafe.Add(ref p, j) = temp;
                i++;
                j--;
            }

(I've omitted the bounds checks at the beginning of the routine. This routine is compiled for T = Int32.)

00007ff8`c205542d 4883c110        add     rcx,10h
00007ff8`c2055431 8bc2            mov     eax,edx
00007ff8`c2055433 428d5400ff      lea     edx,[rax+r8-1]
00007ff8`c2055438 3bc2            cmp     eax,edx
00007ff8`c205543a 7d21            jge     00007ff8`c205545d ; break if i >= j
00007ff8`c205543c 4c63c0          movsxd  r8,eax ; start of loop
00007ff8`c205543f 468b0c81        mov     r9d,dword ptr [rcx+r8*4]
00007ff8`c2055443 4e8d0481        lea     r8,[rcx+r8*4]
00007ff8`c2055447 4c63d2          movsxd  r10,edx
00007ff8`c205544a 468b1c91        mov     r11d,dword ptr [rcx+r10*4]
00007ff8`c205544e 458918          mov     dword ptr [r8],r11d
00007ff8`c2055451 46890c91        mov     dword ptr [rcx+r10*4],r9d
00007ff8`c2055455 ffc0            inc     eax
00007ff8`c2055457 ffca            dec     edx
00007ff8`c2055459 3bc2            cmp     eax,edx
00007ff8`c205545b 7cdf            jl      00007ff8`c205543c ; end of loop

The main loop is 33 bytes of codegen consisting of 11 instructions.

New

            if (length >= 1)
            {
                ByReference<T> src = new ByReference<T>(ref Unsafe.Add(ref Unsafe.As<byte, T>(ref array.GetRawSzArrayData()), index));
                ByReference<T> dest = new ByReference<T>(ref Unsafe.Add(ref src.Value, (IntPtr)length - 1));
                do
                {
                    T temp = src.Value;
                    src.Value = dest.Value;
                    dest.Value = temp;
                    src = new ByReference<T>(ref Unsafe.Add(ref src.Value, 1));
                    dest = new ByReference<T>(ref Unsafe.Add(ref dest.Value, -1));
                } while (Unsafe.IsBefore(ref src.Value, ref dest.Value));
            }
00007ff8`c32615ed 4585c0          test    r8d,r8d
00007ff8`c32615f0 7e2c            jle     00007ff8`c326161e ; break if length <= 1
00007ff8`c32615f2 4883c110        add     rcx,10h
00007ff8`c32615f6 4863c2          movsxd  rax,edx
00007ff8`c32615f9 488d0c81        lea     rcx,[rcx+rax*4]
00007ff8`c32615fd 4963c0          movsxd  rax,r8d
00007ff8`c3261600 48ffc8          dec     rax
00007ff8`c3261603 488d0481        lea     rax,[rcx+rax*4]
00007ff8`c3261607 8b11            mov     edx,dword ptr [rcx] ; start of loop
00007ff8`c3261609 448b00          mov     r8d,dword ptr [rax]
00007ff8`c326160c 448901          mov     dword ptr [rcx],r8d
00007ff8`c326160f 8910            mov     dword ptr [rax],edx
00007ff8`c3261611 4883c104        add     rcx,4
00007ff8`c3261615 4883c0fc        add     rax,0FFFFFFFFFFFFFFFCh
00007ff8`c3261619 483bc8          cmp     rcx,rax
00007ff8`c326161c 72e9            jb      00007ff8`c3261607 ; end of loop

The main loop is 23 bytes of codegen consisting of 8 instructions. The "indirect reference computed from a base and an offset with a known stride length" behavior is replaced with a simple reference. This is accomplished by incrementing / decrementing the ref T local itself (not the data it references) by the stride length on each loop iteration.

IsBefore returns true iff the target of the source reference occurs before the target of the destination reference in the virtual address space of the process.

        [Intrinsic]
        [NonVersionable]
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static bool IsBefore<T>(ref T left, ref T right)
        {
            throw new PlatformNotSupportedException();

            // ldarg.0
            // ldarg.1
            // clt.un
            // ret
        }

This change results in an approx. 25 - 30% speedup on my machine where T's size is <= the native word length (including references). There are diminishing returns if T is a larger struct like DateTimeOffset or Guid; on my machine that showed only a 5 - 10% gain.

@jkotas
Copy link
Member

jkotas commented Jan 19, 2018

ByReference<T> is a special intrinsic that has been tested only as Span field. I would like to get sign-off from the CodeGen that it is going to work correctly in other places like this. It would be better to wait for https://github.com/dotnet/csharplang/blob/master/proposals/ref-local-reassignment.md with optimizations like these.

@jkotas
Copy link
Member

jkotas commented Jan 19, 2018

You may want create an API proposal issue to add Unsafe.IsBefore API (and maybe IsAbove as well) to prep for optimizations like these.

@GrabYourPitchforks
Copy link
Member

Perhaps the results above can be used as an additional argument to prioritize development of the ref reassignment feature given the large perf gains we can milk from it?

@jkotas
Copy link
Member

jkotas commented Jan 19, 2018

@agocke @jaredpar When can we expect "ref reassigment" feature to land? We can use it to get a pretty significant perf wins as demonstrated by @GrabYourPitchforks

@jkotas
Copy link
Member

jkotas commented Jan 19, 2018

API proposal for Unsafe.IsBelow/IsAbove: https://github.com/dotnet/corefx/issues/26451

@jaredpar
Copy link
Member

@jkotas

The plan now is to have this included in C# 7.3.

@ahsonkhan
Copy link
Author

The plan now is to have this included in C# 7.3.

I would stick with the original, especially if we've used that idiomatic sequence for swapping array elements elsewhere.

In that case, I will keep the original implementation, as is.

@ahsonkhan ahsonkhan merged commit ac70074 into dotnet:master Jan 20, 2018
@ahsonkhan ahsonkhan deleted the AddReverse branch January 20, 2018 02:45
@karelz karelz added this to the 2.1.0 milestone Jan 20, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…efx#26381)

* Add MemoryExtensions Reverse API to get parity with array

* Fix tests and add large byte array test

* Remove some redundant tests and the big memory test.


Commit migrated from dotnet/corefx@ac70074
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants