
#651, for 64 bit type on x86 __iso_volatile_store64 #694

Merged
merged 50 commits into microsoft:master from AlexGuteniev:atomic_x86_store on Jun 30, 2020

Conversation

AlexGuteniev
Contributor

#651, for 64 bit type on x86 __iso_volatile_store64

without compiler barrier for memory_order_relaxed,
with compiler barrier for memory_order_release,
other orders unchanged.

@BillyONeal BillyONeal self-requested a review April 10, 2020 03:34
@AlexGuteniev
Contributor Author

AlexGuteniev commented Apr 10, 2020

Here's a reduced ICE: DevCom-986061

@AlexGuteniev
Contributor Author

Related: I don't understand the use of __ldrexd versus __iso_volatile_load64 on ARM.

I mean, if __ldrexd is better, why wouldn't the compiler emit for __iso_volatile_load64 whatever it emits for __ldrexd?

@StephanTLavavej StephanTLavavej added the performance Must go faster label Apr 12, 2020
@cbezault
Contributor

Totally possible __iso_volatile_load64 wasn't available on ARM (or was broken) when this was written.

@BillyONeal
Member

Totally possible __iso_volatile_load64 wasn't available on ARM (or was broken) when this was written.

ARM and ARM64 have always had the full complement of __iso_volatiles. Only x86 has ever had them missing.

@AlexGuteniev
Contributor Author

There is a difference in codegen: __ldrexd produces an ldrexd instruction, while __iso_volatile_load64 produces ldrd.
I cannot explain it; I don't understand ARM well enough. Until it is explained, I think __ldrexd should stay.

This program:

#include <intrin.h>

volatile long long var = 0;

long long f1()
{
    return __iso_volatile_load64(&var);
}


long long f2()
{
    return __ldrexd(&var);
}

int main()
{
    f1();
    f2();
}

Compiled with VS 2019 Preview:

D:\Temp>"C:\Program Files (x86)\Microsoft Visual Studio\2019\Preview\VC\Auxiliary\Build\vcvarsamd64_arm.bat"
**********************************************************************
** Visual Studio 2019 Developer Command Prompt v16.6.0-pre.2.1
** Copyright (c) 2020 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64_arm'

D:\Temp>cl /O2 /FA repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28720.3 for ARM
Copyright (C) Microsoft Corporation.  All rights reserved.

repro.cpp
Microsoft (R) Incremental Linker Version 14.26.28720.3
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:repro.exe
/machine:arm
repro.obj

Produces the following asm:

; Listing generated by Microsoft (R) Optimizing Compiler Version 19.26.28720.3 

	TTL	D:\Temp\repro.cpp
	THUMB
	.drectve
	DCB	"-defaultlib:LIBCMT "
	DCB	"-defaultlib:OLDNAMES "

	EXPORT	|?var@@3_JC| [ DATA ]			; var
	.bss
|?var@@3_JC| %	0x8					; var
	EXPORT	|?f1@@YA_JXZ|				; f1
	EXPORT	|?f2@@YA_JXZ|				; f2
	EXPORT	|main|
; Function compile flags: /Ogtpy
;	COMDAT main
.text$mn	SEGMENT

|main|	PROC
; File D:\Temp\repro.cpp
; Line 18
	movs        r0,#0
|$M4|
	bx          lr

	ENDP  ; |main|

; Function compile flags: /Ogtpy
;	COMDAT ?f2@@YA_JXZ
.text$mn	SEGMENT

|?f2@@YA_JXZ| PROC					; f2
; File D:\Temp\repro.cpp
; Line 13
	movw        r3,|?var@@3_JC|
	movt        r3,|?var@@3_JC|
	ldrexd      r0,r1,[r3]
|$M4|
; Line 14
	bx          lr

	ENDP  ; |?f2@@YA_JXZ|, f2

; Function compile flags: /Ogtpy
;	COMDAT ?f1@@YA_JXZ
.text$mn	SEGMENT

|?f1@@YA_JXZ| PROC					; f1
; File D:\Temp\repro.cpp
; Line 7
	movw        r3,|?var@@3_JC|
	movt        r3,|?var@@3_JC|
	ldrd        r0,r1,[r3]
|$M4|
; Line 8
	bx          lr

	ENDP  ; |?f1@@YA_JXZ|, f1

	END

@AlexGuteniev
Contributor Author

Regarding everything except __ldrexd, what would be the next step? Wait for DevCom-986061 resolution?

@StephanTLavavej
Member

We talked about this in our weekly meeting, and we believe that waiting for the ICE to be fixed is preferable over using inline assembly (which has been problematic in the past). I'll mark this PR as blocked.

@StephanTLavavej StephanTLavavej added the blocked Something is preventing work on this label Apr 15, 2020

@cbezault cbezault left a comment


This looks good to me.
My suggestions are weak suggestions, not strong ones, feel free to resolve if you don't think it's worth the effort.

stl/inc/atomic Outdated

#if defined(_M_ARM) || defined(_M_ARM64)
_Memory_barrier();
#else // ^^^ ARM32/ARM64 hardware / x86/x64 hardware vvv
Contributor

Is the implementation for x86/x64 likely to be safe on any other hypothetical architecture? Assumptions about architecture are what got us into the mess of mediocre ARM support we're in now.

Contributor Author

Good question. It is both unsafe and possibly inefficient where it is safe.

Contributor Author

So I've addressed this place.

But there are some places like here:

STL/stl/inc/atomic

Lines 489 to 495 in 4deaf6c

#if defined(_M_ARM) || defined(_M_ARM64)
_Memory_barrier();
__iso_volatile_store16(_Mem, _As_bytes);
_Memory_barrier();
#else // ^^^ ARM32/ARM64 hardware / x86/x64 hardware vvv
(void) _InterlockedExchange16(_Mem, _As_bytes);
#endif // hardware

I think either implementation would work (apart from _Memory_barrier() being undefined on x86/x64, though it could be defined). Still, for performance reasons it makes no sense to use the wrong implementation on either platform. Should compilation be broken for unknown platforms here as well?

Contributor

I'm more okay with it in this case. _InterlockedExchange16 will likely be defined on future architectures. Also you're not directly touching this so no reason to hold up the PR on that.




@StephanTLavavej StephanTLavavej left a comment


(@BillyONeal has offered to commit/test these minor changes to the GitHub and MSVC PRs; thanks Billy!)

@BillyONeal
Member

@AlexGuteniev Thank you for your contribution!

@BillyONeal BillyONeal merged commit bc077c6 into microsoft:master Jun 30, 2020
@AlexGuteniev AlexGuteniev deleted the atomic_x86_store branch June 30, 2020 05:21
StephanTLavavej added a commit to AlexGuteniev/STL that referenced this pull request Jul 2, 2020
StephanTLavavej added a commit that referenced this pull request Jul 3, 2020
This reverts commit ad1a26a which was later modified by #694.

Fixes #971.

Co-authored-by: Stephan T. Lavavej <[email protected]>
Successfully merging this pull request may close these issues.

<atomic>: use __iso_volatile_store64 on x86 if it is available