
<atomic>: make compare_exchange_weak really weak on ARM64 #775

Open
AlexGuteniev opened this issue May 1, 2020 · 13 comments

Labels
ARM64 Related to the ARM64 architecture blocked Something is preventing work on this performance Must go faster

Comments

@AlexGuteniev
Contributor

From #694 (comment):

LL/SC machines are the motivation for compare_exchange_weak existing in the C++ standard. Other compilers like GCC and clang know the difference and emit non-looping code for CAS_weak.

Apparently this cannot be done without compiler support, that is, without more ARM intrinsics being exposed.

@StephanTLavavej StephanTLavavej added blocked Something is preventing work on this performance Must go faster labels May 1, 2020
@StephanTLavavej
Member

What intrinsics do we need? We should file issues on Developer Community for them.

@AlexGuteniev
Contributor Author

AlexGuteniev commented May 2, 2020

I see different approaches:

  1. Add __strexd complementary to __ldrexd, add smaller operand size versions (__ldrex / __strex, __ldrexh / __strexh, __ldrexb / __strexb), and add the same for ARM64. Together with barrier intrinsics, this would allow implementing all versions of compare_exchange_weak.

This has the advantage that, in addition, it can be used as LL/SC for ABA problem avoidance.
@pcordes mentions:

It's somewhat unfortunate that compilers don't expose LL/SC intrinsics that take a lambda for roll-your-own atomic RMW, instead usually making you write a CAS retry loop with a CAS_weak that's implemented as an LL/SC. LL/SC can avoid the ABA problem.

  2. Implement a weak family of _InterlockedCompareExchange*. That is, for _InterlockedCompareExchange, _InterlockedCompareExchange8_acq, _InterlockedCompareExchange8_rel, _InterlockedCompareExchange8_nf, add weak versions; ditto for other operand sizes.

This has the advantage that the implementation could potentially switch from LL/SC to a real CAS where that is more efficient.
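
As an illustration of option 2, here is a rough sketch of how <atomic> could lower compare_exchange_weak onto such an intrinsic. Everything named _InterlockedCompareExchange_weak below is hypothetical; note that, unlike _InterlockedCompareExchange, a weak flavor would probably have to report success separately from the old value, because a spurious LL/SC failure can leave the observed value equal to the comparand.

// Hypothetical sketch only: no such intrinsic exists today. Assumed shape:
//   unsigned char _InterlockedCompareExchange_weak(long volatile* Destination,
//                                                  long Exchange,
//                                                  long* Comparand);
// returning nonzero on success and writing the observed value back through
// Comparand on failure, so a single ldaxr/cmp/bne/stlxr attempt can report
// a spurious failure instead of looping.
inline bool cas_weak_seq_cst(long volatile* target, long& expected, long desired) {
    return _InterlockedCompareExchange_weak(target, desired, &expected) != 0;
}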


I see that intrinsics for real CAS are also exposed for ARM64; see the __cas* family in the ARM64 intrinsics.

Not sure if there should be runtime CPU detection in the _InterlockedCompareExchange family to use CAS instead of LL/SC, like for __builtin_popcount on x86, or whether it should be the caller's responsibility instead. This part is already reported as #488.

Apparently, you'll need an ARM/ARM64 expert even to create the Developer Community issues, and this issue should be considered in conjunction with #488.

@pcordes

pcordes commented May 2, 2020

I see different approaches:

  1. Add __strexd complementary to __ldrexd, add smaller operand size versions (__ldrex / __strex, __ldrexh / __strexh, __ldrexb / __strexb), and add the same for ARM64. Together with barrier intrinsics, this would allow implementing all versions of compare_exchange_weak.

IDK if that's viable. The amount of stuff you can do between ldrex and strex and still have the LL/SC commit successfully is very limited, like maybe no other loads or stores (I think?). So debug-builds of code using this couldn't work / would retry forever on strex failing.

That's why I suggested a plausible interface being a function that took a lambda as a parameter to represent the arbitrary ALU work to be done between the LL and the SC in an atomic RMW. (The compiler would have to support this directly as an intrinsic so it can optimize the lambda to only use registers. You couldn't do this with just library code written in plain C++, unless maybe there are optimize pragmas to temporarily force-enable optimization?)

That would let you roll your own atomic_fetch_or just as efficiently as the compiler can do, but for any combination of operations like x |= 2; x &= y;. (Not on top of CAS; on top of LL/SC directly.) e.g. https://godbolt.org/z/NvWAZV is MSVC and GCC code-gen for AArch64 fetch_or with CBNZ to make a retry loop. Both make the same loop around ldxr / stxr (for mo_relaxed). It would be a slight improvement to be able to get arbitrary code into the body of that loop, instead of having to do{ stuff }while(!x.compare_exchange_weak(old, new, mo_relaxed));

But that lambda idea is still limited and doesn't let you compare and branch to not even attempt the SC, like CAS does.
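
For concreteness, here is roughly what that portable retry-loop pattern looks like today for the x |= 2; x &= y; example (a sketch in ISO C++, not code from this issue):

#include <atomic>

// Combined RMW built on compare_exchange_weak: on an LL/SC machine without a
// weak intrinsic, this CAS loop ends up wrapped around an inner LL/SC loop.
int or_then_and(std::atomic<int>& x, int y) {
    int old = x.load(std::memory_order_relaxed);
    int desired;
    do {
        desired = (old | 2) & y;   // arbitrary ALU work between load and CAS
    } while (!x.compare_exchange_weak(old, desired, std::memory_order_relaxed));
    return old;                    // the value the RMW was applied to
}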

  2. Implement a weak family of _InterlockedCompareExchange*. That is, for _InterlockedCompareExchange, _InterlockedCompareExchange8_acq, _InterlockedCompareExchange8_rel, _InterlockedCompareExchange8_nf, add weak versions; ditto for other operand sizes.

This is totally separate from what I was suggesting. Yes it would be nice if MSVC for ARM / AArch64 could implement compare_exchange_weak as a single LL/SC without a retry loop, but it's still just implementing CAS.

I'm not sure what kind of interface would be needed to let real algorithms take advantage of LL/SC to avoid ABA problems. Usually that means you want to detect an ABA change between initially reading an old pointer, then updating some still-private data with it, then publishing a new pointer to other threads. If that means you need to do other stores between the LL and the SC, that might not be guaranteed to work.

As I said I'm not an ARM expert, and I haven't really thought about designing algorithms or lock-free data structures around LL/SC because ISO C++ doesn't expose any way to take advantage of it. Having cas_weak succeed on an LL/SC machine only proves that there was no ABA during the actual CAS attempt itself, nothing about whether there was an ABA between when you originally read the "expected" value and when you attempted the CAS.
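
To make the ABA point concrete, here is a minimal sketch (not from this issue) of the classic scenario, a simplified lock-free stack pop:

#include <atomic>

struct Node { Node* next; };
std::atomic<Node*> top;

Node* pop() {
    Node* old_top = top.load();
    while (old_top != nullptr) {
        Node* new_top = old_top->next;  // work done with the old pointer
        // If another thread popped old_top, pushed other nodes, then pushed
        // old_top back (A -> B -> A), this CAS still succeeds even though
        // new_top may be stale. The CAS only proves the pointer value matches
        // at this instant, not that it never changed since the load above.
        if (top.compare_exchange_weak(old_top, new_top)) {
            return old_top;
        }
    }
    return nullptr;
}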

I'm not going to try to solve that problem in comments here for a compiler I never use, but hopefully that's helpful...

I see that intrinsics for real CAS are exposed too for ARM64, see __cas* family in ARM64 intrinsics.

Those might be ARMv8.1 true atomic CAS instructions, which are apparently much better in high-contention situations than LL/SC; e.g. this review of the 64-core Graviton2 compares v8.0 LL/SC against v8.1 CAS for pairs of cores spamming CAS attempts: https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd/2

Not sure if there should be runtime CPU detection in the _InterlockedCompareExchange family to use CAS instead of LL/SC, like for __builtin_popcount on x86, or whether it should be the caller's responsibility instead. This part is already reported as #488.

Maybe, if you really want to make one binary that tries to run not badly everywhere.

You'd really rather do the dispatching at a higher level, like having 2 versions of a function that uses these operations, not dispatching for every CAS; i.e. multiversion a function that uses these intrinsics, if compiling for ARM. Maybe a compiler option to do that, to make code that can take advantage of ARMv8.1, and otherwise just use the LL/SC way.
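
A sketch of what that higher-level dispatch might look like in plain C++ (the function names and the cpu_has_lse check are hypothetical placeholders): resolve the choice once, then each call is a single indirect call rather than a per-CAS branch.

#include <atomic>

// Two builds of the same RMW, one per CPU level. The bodies are placeholders
// here; real versions would be compiled to use LL/SC and ARMv8.1 CAS.
static int fetch_or_llsc(std::atomic<int>& x, int bits) { return x.fetch_or(bits); }
static int fetch_or_casal(std::atomic<int>& x, int bits) { return x.fetch_or(bits); }

// Hypothetical runtime check for ARMv8.1 LSE atomics.
static bool cpu_has_lse() { return false; }

// Resolved once at startup, not on every atomic operation.
static int (*const fetch_or_impl)(std::atomic<int>&, int) =
    cpu_has_lse() ? fetch_or_casal : fetch_or_llsc;

int my_fetch_or(std::atomic<int>& x, int bits) {
    return fetch_or_impl(x, bits);
}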

GCC / clang don't do runtime dispatching for __builtin_popcount. If -mpopcnt isn't enabled at compile time, they just emit a call to a libgcc helper function. (Or clang will auto-vectorize it with SSSE3 or AVX2 pshufb if you popcount in a loop over an array, if either of those are specified as being baseline / assumed available for this build with -m or -march=whatever.)

Compile and single-step int main(int argc, char**argv){return __builtin_popcount(argc);} if you're curious; GCC on my Arch Linux system statically links libgcc so there's not even a chance for dynamic-link-time symbol resolver dispatching like glibc does for memcpy / memset / various other functions with multiple hand-written asm implementations for different CPU feature levels. Being non-inline is pretty bad because the compiler has to assume all call-clobbered registers are clobbered, and so on.

The GCC/clang design model is based around the assumption of static CPU target features, specifying CPU features at compile time. There is some support for automatic function multiversioning with gcc ifunc stuff, I think, but a lot of the design philosophy is opposite to the MSVC approach.

MSVC lets you use intrinsics for instruction-sets you haven't enabled, and it's supposed to be fine as long as execution never reaches that intrinsic on a machine that doesn't support it. IDK if this is why MSVC doesn't optimize intrinsics, e.g. not even doing constant propagation through most intrinsics, like _mm_slli_epi32 (https://godbolt.org/z/aydnaw).
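
For example, under that MSVC model something like the following is allowed even without any /arch option, as long as the guarded path never executes on a CPU lacking POPCNT (a sketch, not from this issue):

#include <intrin.h>

unsigned popcount32(unsigned v) {
    int regs[4];
    __cpuid(regs, 1);
    if (regs[2] & (1 << 23)) {   // CPUID.1:ECX bit 23 = POPCNT supported
        return __popcnt(v);      // fine to reach only on CPUs that have it
    }
    // Portable fallback: classic SWAR bit-counting.
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}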

GCC/clang require you to enable -mssse3 for files (or functions) using _mm_shuffle_epi8. This also lets the compiler use any instructions from that extension anywhere in that function. i.e. gcc will never emit an asm instruction for a non-enabled instruction set. (Except of course via inline asm; it doesn't "understand" asm templates, just copies them to the asm output with operands filled in.) There are per-function attributes you can use, but then GCC can't inline an SSE4 function into a function that doesn't have SSE4 enabled. It doesn't know how to keep track of which paths of execution can use what ISA extension while optimizing branches and loops.
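
The per-function attribute route looks roughly like this on GCC/clang (a sketch; SSSE3 is enabled only inside this one function, and it won't be inlined into callers that don't have SSSE3 enabled):

#include <immintrin.h>

__attribute__((target("ssse3")))
__m128i shuffle_bytes(__m128i v, __m128i idx) {
    return _mm_shuffle_epi8(v, idx);   // compiles to pshufb inside this function only
}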

Also note that __builtin_popcount isn't really an intrinsic - it's just an operator like ~ or ! that takes an integer input and gives a well-defined result. On x86 with -mpopcnt, the most efficient way to implement it is with a popcnt instruction for a single scalar operand. In the same way that * is multiplication, __builtin_popcount is another computable operation that some but not all computers can do in a single hardware instruction. It's just that the C and C++ languages unfortunately neglected to define any standard name for that operation the way it did for * or unary operators like ~ or -. As far as how a compiler "thinks about" it, it's not different from ~. As usual for GCC, when expanding it into asm instructions, it calls a helper function if the target machine can't do it in a single instruction. Same for wide multiply or divide, or mul / div at all on machines without hardware multiply instructions.

By contrast, Intel's _popcnt_u32 is an intrinsic for that specific instruction. It's only usable when it can compile to a popcnt. (Well, actually GCC still knows how to constant-propagate through it. And clang I think will still auto-vectorize it in a loop over an array.)

@pcordes

pcordes commented May 2, 2020

All that said, yes, it seems MSVC is bad at CAS_weak on AArch64: https://godbolt.org/z/7jWLoJ

#include <atomic>

int cas_weak(std::atomic<int> &var, int old, int x) {
    bool ok = var.compare_exchange_weak(old, x);
    return old;
}

x86-64 GCC is ideal:

cas_weak(std::atomic<int>&, int, int):
        mov     eax, esi
        lock cmpxchg    DWORD PTR [rdi], edx
        ret

x86-64 MSVC wastes a CMOV for no reason but is mostly fine:

int cas_weak(std::atomic<int> &,int,int) PROC       ; cas_weak, COMDAT
        mov     eax, edx
        lock cmpxchg DWORD PTR [rcx], r8d
        cmovne  edx, eax
        mov     eax, edx
        ret     0
int cas_weak(std::atomic<int> &,int,int) ENDP       ; cas_weak

AArch64 GCC is non-looping, just one attempt with the branch being the compare part of the CAS itself:

cas_weak(std::atomic<int>&, int, int):
        ldaxr   w3, [x0]              # lda = acquire
        cmp     w3, w1
        bne     .L2
        stlxr   w4, w2, [x0]         # stl = sequential-release
.L2:
        mov     w0, w3
        ret

But MSVC calls a _strong helper function instead of inlining:

|int cas_weak(std::atomic<int> & __ptr64,int,int)| PROC         ; cas_weak
|$LN12|
        str         lr,[sp,#-0x10]!
        sub         sp,sp,#0x10
        str         w1,[sp]
        mov         w4,#5
        mov         x1,sp
        mov         w3,#5
        bl          |int std::_Atomic_compare_exchange_strong_4(unsigned long volatile * __ptr64,unsigned long * __ptr64,unsigned long,std::memory_order,std::memory_order)|
        ldr         w0,[sp]
        add         sp,sp,#0x10
        ldr         lr,[sp],#0x10
        ret

        ENDP  ; |int cas_weak(std::atomic<int> & __ptr64,int,int)|, cas_weak

An obvious first step would be some kind of intrinsic that the compiler knows about and can compile into an inline ldaxr / cmp / bne / stlxr.

That's going to be much easier than designing an LL/SC API for C++, and will actually benefit lots of real-world portable ISO C++ code that uses compare_exchange_weak.

And yes that same intrinsic could compile to ARMv8.1 CAS depending on compiler options, if you take the same route as with AVX and have an option to make an ISA extension baseline for the compiler to use on its own, instead of only via special intrinsics.

I don't have a good suggestion for any kind of runtime dispatching; GCC's model where you get max performance by compiling with -O3 -march=native to target the compile host makes a lot of sense, and can work well for server stuff.

@AlexGuteniev
Contributor Author

I see. So if exposing LL/SC has to be done in an entirely novel way, and it may not work well for a perfect CAS implementation, I'd suggest that adding weak _InterlockedCompareExchange intrinsic flavors is the way to implement this improvement.

I don't know whether __builtin_popcount in MSVC will do compile-time or runtime detection; it certainly will not just emit popcnt, as there's already the __popcnt intrinsic for that, and it will also be constexpr. So I thought CAS feature detection might follow the __builtin_popcount approach.

@pcordes

pcordes commented May 2, 2020

I see. So if exposing LL/SC has to be done in an entirely novel way, and it may not work well for a perfect CAS implementation, I'd suggest that adding weak _InterlockedCompareExchange intrinsic flavors is the way to implement this improvement.

Yes, agreed. Having headers implement CAS out of LL/SC intrinsics doesn't sound like a good idea at all.

I don't know whether __builtin_popcount in MSVC will do compile-time or runtime detection; it certainly will not just emit popcnt, as there's already the __popcnt intrinsic for that, and it will also be constexpr. So I thought CAS feature detection might follow the __builtin_popcount approach.

MSVC can't compile __builtin_popcount() - it's a GNU C extension. https://godbolt.org/z/7GD64P
That's what prompted my whole discussion about how x86 compilers that support the GNU dialect of C/C++ (gcc, clang, ICC) handle it. And the fact that you only get static CPU dispatching according to compiler options for it. It's too small to be worth dynamic dispatching as your go-to high performance option, but that's ok for GCC/clang because their design philosophy for supporting ISA extensions is built around using -march= options to enable ISA extensions.

Having a dynamically-linked helper function use popcnt instead of the bithack is somewhat less bad, but still significantly worse than an inlined popcnt. There's really no way around that for something as lightweight as popcnt. The situation for atomic RMW is somewhat different: they aren't super fast even in the no-contention case, so an indirect function call might not be a total disaster.

@AlexGuteniev
Contributor Author

MSVC can't compile __builtin_popcount() - it's a GNU C extension.

It will. And the Preview version has it already (but without CPU feature detection yet).

@AlexGuteniev
Contributor Author

AlexGuteniev commented May 2, 2020

MSVC could do dynamic dispatching by rewriting code. That is not considered too bad in the Windows world. Remember that whatever Linux solves by PIC (position-independent code), Windows solves by relocating DLLs if needed.

It can also inline the popcnt instruction and have a call only for the fallback. We'll see shortly what it turns out to be.

@AlexGuteniev
Contributor Author

bad at CAS_weak on AArch64: https://godbolt.org/z/7jWLoJ

The ARM64 case there is not relevant, as it uses the old atomics library; now it compiles as follows:

|?cas_weak@@YAHAEAU?$atomic@H@std@@HH@Z| PROC		; cas_weak
; File D:\Temp\cas_weak.cpp
; Line 3
|$LN15@cas_weak|
; File C:\Program Files (x86)\Microsoft Visual Studio\2019\Preview\VC\Tools\MSVC\14.26.28801\include\atomic
; Line 621
	ldaxr       w8,[x0]
	cmp         w8,w1
	bne         |$LN16@cas_weak|
	stlxr       w9,w2,[x0]
	cbnz        w9,|$LN15@cas_weak|
|$LN16@cas_weak|
	dmb         ish
; Line 623
	cmp         w8,w1
	cseleq      w0,w1,w8
	ret

	ENDP  ; |?cas_weak@@YAHAEAU?$atomic@H@std@@HH@Z|, cas_weak

	END

Strong CAS instead of weak, but all inlined.

@AlexGuteniev
Contributor Author

Intrinsics requested, DevCom-1015501

@cbezault cbezault added ARM Related to the 32-bit ARM architecture ARM64 Related to the ARM64 architecture labels May 8, 2020
@AlexGuteniev
Contributor Author

See the discussion in #23, to implement these:

  • P0528R3 Atomic Compare-And-Exchange With Padding Bits
  • P1123R0 Atomic Compare-And-Exchange With Padding Bits For atomic_ref

Especially for the last one, I suggest intrinsics:
_InterlockedCompareExchange8Masked, _InterlockedCompareExchange8Masked_acq, etc.,
and their weak counterparts.

@AlexGuteniev
Contributor Author

@pcordes, I have some more thoughts on LL/SC that takes a lambda:

1. Usability for P0528R3 / P1123R0

WG21-P1123 is compare_exchange_* for atomic_ref that excludes non-value bits (unused bitfield bits, padding bytes, etc.).

For atomic, padding may be zeroed out in the constructor, store, and exchange.
For atomic_ref, this would require atomic zeroing in the constructor, which causes extra overhead.

The other way around is to handle it in the CAS itself. For platforms where CAS is a real CAS, it has to retry in case of a value-bit match but a padding-bit mismatch. I'm trying to implement this in #1029.

But with LL/SC it can be handled better: just do a masked compare after the LL.
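
For comparison, on a platform where CAS is a real CAS, the retry-on-padding-mismatch approach could look roughly like this (illustrative only, not the actual #1029 implementation):

#include <atomic>

// Masked strong compare-exchange on top of plain CAS: fail only when value
// bits differ; retry when the mismatch is confined to masked-out bits.
bool masked_compare_exchange(std::atomic<unsigned>& obj, unsigned& expected,
                             unsigned desired, unsigned value_mask) {
    unsigned observed = obj.load();
    for (;;) {
        if ((observed ^ expected) & value_mask) {
            expected = observed;   // genuine value-bit mismatch: fail
            return false;
        }
        // Value bits match; CAS against the exact observed representation so
        // padding bits can't make the comparison fail.
        if (obj.compare_exchange_weak(observed, desired)) {
            return true;
        }
        // CAS failed: observed now holds the fresh value; loop and re-check.
    }
}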

2. Syntax

MSVC's __try{...}__except(...){...} can be an example of something that is more than a lambda:

  • Limitations on what can be done inside the braces
  • Ability to return early

So a masked CAS could look like:

// Weak
auto succeed = __ll(int value, atomic_var) -> bool {

  if (((value ^ expected) & mask) != 0) {
    return false;   // value bits differ: fail without attempting the SC
  }

  return __sc(desired);   // single SC attempt; may fail spuriously
}

// Strong
auto succeed = __ll(int value, atomic_var) -> bool {

  if (((value ^ expected) & mask) != 0) {
    return false;   // value bits differ: genuine failure
  }

  if (!__sc(desired)) {
    continue;   // SC failed spuriously: retry from the LL
  }

  return true;
}

@StephanTLavavej StephanTLavavej changed the title <atomic>: make compare_exchange_weak really weak on ARM <atomic>: make compare_exchange_weak really weak on ARM64 Feb 21, 2024
@StephanTLavavej StephanTLavavej removed the ARM Related to the 32-bit ARM architecture label Feb 21, 2024
@StephanTLavavej
Member

Updating this issue to no longer mention ARM32; at this time we still need to keep it compiling and working, but we no longer care about optimizing for it. Only ARM64 performance matters.
