<atomic>: make compare_exchange_weak really weak on ARM64 #775
What intrinsics do we need? We should file issues on Developer Community for them.
I see different approaches:

- Expose the LL/SC intrinsics themselves. This has the advantage that, in addition, they can be used as LL/SC for ABA-problem avoidance.
- Expose a weak-CAS intrinsic. This has the advantage that the implementation could potentially switch from LL/SC to a real CAS where that is more efficient. I see that intrinsics for a real CAS are exposed for ARM64 too. Not sure if there should be runtime CPU detection.

Apparently, you'll need an ARM/ARM64 expert even to create the Developer Community issues, and this issue should be considered in conjunction with #488
IDK if that's viable. The amount of stuff you can do between ldrex and strex and still have the LL/SC commit successfully is very limited, like maybe no other loads or stores (I think?). So debug builds of code using this couldn't work / would retry forever on strex failing.

That's why I suggested a plausible interface being a function that takes a lambda as a parameter to represent the arbitrary ALU work to be done between the LL and the SC in an atomic RMW. (The compiler would have to support this directly as an intrinsic so it can optimize the lambda to only use registers. You couldn't do this with just library code written in plain C++, unless maybe there are optimize pragmas to temporarily force-enable optimization?) That would let you roll your own atomic RMW operations.

But that lambda idea is still limited, and doesn't let you compare and branch to not even attempt the SC, like CAS does.
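To make that concrete, here is a rough sketch (mine, not anything from MSVC) of the shape such an interface could take. `__ll_sc_rmw` is a hypothetical name; the body just models the semantics in portable C++ with a CAS retry loop, while the comments describe what the real intrinsic would emit.

```cpp
#include <atomic>

// Hypothetical intrinsic sketch: a real compiler would emit LDAXR, inline the
// lambda's register-only ALU work, then STLXR, re-running the lambda whenever
// the store-conditional fails. This portable stand-in models those semantics
// with a compare_exchange_weak retry loop.
template <class T, class F>
T __ll_sc_rmw(std::atomic<T>& obj, F compute) {   // hypothetical name
    T old = obj.load(std::memory_order_relaxed);  // stands in for the LL
    while (!obj.compare_exchange_weak(old, compute(old))) {
        // on failure, 'old' holds the freshly observed value; retry
    }
    return old;
}

// Usage: roll your own atomic RMW that std::atomic doesn't offer directly.
int fetch_multiply(std::atomic<int>& a, int m) {
    return __ll_sc_rmw(a, [m](int v) { return v * m; });
}
```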
This is totally separate from what I was suggesting. Yes, it would be nice if MSVC for ARM / AArch64 could implement compare_exchange_weak as a single LL/SC without a retry loop, but that's still just implementing CAS.

I'm not sure what kind of interface would be needed to let real algorithms take advantage of LL/SC to avoid ABA problems. Usually that means you want to detect an ABA change between initially reading an old pointer, then updating some still-private data with it, then publishing a new pointer to other threads. If that means you need to do other stores between the LL and the SC, that might not be guaranteed to work. As I said, I'm not an ARM expert, and I haven't really thought about designing algorithms or lock-free data structures around LL/SC, because ISO C++ doesn't expose any way to take advantage of it.

Having cas_weak succeed on an LL/SC machine only proves that there was no ABA during the actual CAS attempt itself, nothing about whether there was an ABA between when you originally read the "expected" value and when you attempted the CAS. I'm not going to try to solve that problem in comments here for a compiler I never use, but hopefully that's helpful...
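For readers following along, here is the textbook place that window bites, sketched in portable C++ (my example, not from the thread): a lock-free stack pop, where a successful CAS only proves the head pointer holds the same value again, not that nothing happened in between.

```cpp
#include <atomic>

struct Node { Node* next; };
std::atomic<Node*> head{nullptr};

Node* pop() {
    Node* old_top = head.load();             // (1) read the old pointer
    while (old_top) {
        Node* new_top = old_top->next;       // (2) still-private work with it
        // (3) publish: succeeds whenever head == old_top, even if another
        // thread popped old_top, reused its memory, and pushed the same
        // address back between (1) and (3) -- then new_top is stale: ABA.
        if (head.compare_exchange_weak(old_top, new_top))
            break;
    }
    return old_top;
}
```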
Those might be the ARMv8.1 true atomic CAS instructions, which are apparently much better in high-contention situations than LL/SC; e.g., this review of the 64-core Graviton2 compares v8.0 LL/SC against v8.1 CAS for pairs of cores spamming CAS attempts: https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd/2
Maybe, if you really want to make one binary that tries to run not-badly everywhere. You'd really rather do dispatching at a higher level, like 2 versions of a function that uses these operations, not dispatching for every CAS. i.e., multiversion a function that uses these intrinsics, if compiling for ARM. Maybe a compiler option to do that, to make code that can take advantage of ARMv8.1, otherwise just use the LL/SC way.

GCC / clang don't do runtime dispatching for this on their own. The GCC/clang design model is based around the assumption of static CPU target features, specified at compile time, although there is some support for automatic function multiversioning with gcc.

MSVC lets you use intrinsics for instruction sets you haven't enabled, and it's supposed to be fine as long as execution never reaches that intrinsic on a machine that doesn't support it. IDK if this is why MSVC doesn't optimize intrinsics, e.g. not even doing constant propagation through most intrinsics; GCC/clang, by contrast, require you to enable an instruction set before you can even use its intrinsics. Intel's compiler, also by contrast, supports automatic runtime CPU dispatching.
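As an illustration of "dispatching at a higher level", a self-contained sketch (everything here is made up: the detection function is a placeholder, and the two versions would really be separate translation units built with different target flags):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> counter{0};

// In reality these two would be compiled with different target options (one
// assuming ARMv8.1 LSE, so atomics become single CAS/LDADD instructions; one
// for the v8.0 LL/SC baseline). They share a body here only to keep the
// sketch compilable.
std::uint64_t bump_lse()  { return counter.fetch_add(1); }
std::uint64_t bump_llsc() { return counter.fetch_add(1); }

bool detect_lse() noexcept { return false; } // placeholder for a real CPU query

std::uint64_t bump() {
    // Resolved once on first call; later calls pay no per-CAS dispatch.
    static const auto impl = detect_lse() ? &bump_lse : &bump_llsc;
    return impl();
}
```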
All that said, yes, it seems MSVC is bad at CAS_weak on AArch64: https://godbolt.org/z/7jWLoJ
x86-64 GCC is ideal: a single lock cmpxchg, using the flags result directly.

x86-64 MSVC wastes a CMOV for no reason but is mostly fine.

AArch64 GCC is non-looping: just one LL/SC attempt, with the branch being the compare part of the CAS itself.

But AArch64 MSVC calls an out-of-line helper function instead.
An obvious first step would be some kind of intrinsic that the compiler knows about and can compile into an inline ldaxr / cmp / bne / stlxr. That's going to be much easier than designing an LL/SC API for C++, and will actually benefit lots of real-world portable ISO C++ code that uses compare_exchange_weak in a retry loop.

And yes, that same intrinsic could compile to ARMv8.1 CAS depending on compiler options, if you take the same route as with AVX and have an option to make an ISA extension baseline for the compiler to use on its own, instead of only via special intrinsics. I don't have a good suggestion for any kind of runtime dispatching; GCC's model is that you get max performance by compiling with the right target options for your hardware.
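For reference, the portable pattern such an intrinsic would speed up (my example): the caller's loop already retries, so a compare_exchange_weak that is allowed to fail spuriously needs no hidden inner retry loop of its own.

```cpp
#include <atomic>

// Saturating increment: an RMW operation std::atomic doesn't provide.
// A spurious failure from a truly weak CAS (one LL/SC attempt) costs at
// most one extra trip around this loop, which exists anyway.
int saturating_inc(std::atomic<int>& a, int limit) {
    int old = a.load(std::memory_order_relaxed);
    while (old < limit && !a.compare_exchange_weak(old, old + 1)) {
        // 'old' was refreshed by the failed CAS; just go around again
    }
    return old;
}
```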
I see. So exposing LL/SC would have to be done in an entirely novel way, and it may not work well for a perfect CAS implementation anyway; I'd suggest adding weak CAS intrinsics instead.
Yes, agreed. Having headers implement CAS out of LL/SC intrinsics doesn't sound like a good idea at all.
MSVC can't compile these to inline code today. Having a dynamically-linked helper function use the better instructions where available would at least be something.
It will. And the Preview version has it already (but without CPU feature detection yet).
MSVC could do dynamic dispatching by rewriting code; that is not considered too bad in the Windows world. Remember that whatever Linux solves by PIC (position-independent code), Windows solves by relocating DLLs if needed. It could also inline the popcnt instruction and have a call only for the fallback. We'll see shortly what it would look like.
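A rough C++ picture of that popcnt idea on the x86 side (the feature flag is a stand-in for a one-time CPUID check; in reality the rewriting would be done by the toolchain, not by a branch):

```cpp
#include <cstdint>
#include <nmmintrin.h> // _mm_popcnt_u64 (SSE4.2)

static const bool g_has_popcnt = false; // stand-in for a one-time CPUID check

static int popcount_fallback(std::uint64_t v) {
    int n = 0;
    for (; v != 0; v &= v - 1) ++n; // Kernighan's bit-clearing trick
    return n;
}

inline int popcount(std::uint64_t v) {
    if (g_has_popcnt)
        return static_cast<int>(_mm_popcnt_u64(v)); // inline POPCNT, hot path
    return popcount_fallback(v);                    // a call only as fallback
}
```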
The ARM64 case there is not relevant, as it uses the old atomics library. Now it compiles to a strong CAS instead of a weak one, but all inlined.
Intrinsics requested: DevCom-1015501
See the #23 discussion for the operations to implement. Especially for the last one, I suggest dedicated intrinsics.
@pcordes, I have some more thoughts on the LL/SC that takes a lambda:

1. Usability for P0528R3 / P1123R0 (WG21-P1123)

One way to deal with padding bits is to handle them in the CAS itself. For platforms where CAS is a real CAS, that CAS has to retry in the case where the value bits match but the padding bits mismatch; I'm trying to implement that in #1029. But for LL/SC it can be handled better: just do a masked compare after the LL.

2. Syntax

So a masked CAS could look like this:
```cpp
// Weak masked CAS: a single LL/SC attempt. (Pseudo-syntax for hypothetical
// __ll / __sc intrinsics; 'value' is the result of the load-linked.)
auto succeeded = __ll(atomic_var, [&](int value) -> bool {
    if (((value ^ expected) & mask) != 0) {
        return false;     // genuine mismatch in the compared bits
    }
    return __sc(desired); // one store-conditional; may fail spuriously
});

// Strong masked CAS: the intrinsic re-runs the lambda from the LL when the
// store-conditional fails spuriously.
auto succeeded = __ll(atomic_var, [&](int value) -> bool {
    if (((value ^ expected) & mask) != 0) {
        return false;
    }
    if (!__sc(desired)) {
        continue; // pseudo-syntax: retry from the LL
    }
    return true;
});
```
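For contrast, a sketch (mine, not the actual #1029 change) of what a strong masked CAS has to do on a platform where CAS is a real CAS: the hardware compares all bits, so a mismatch confined to the ignored (padding) bits must trigger a retry rather than a failure.

```cpp
#include <atomic>
#include <cstdint>

// Strong masked CAS on top of a full-width CAS: report failure only when the
// bits selected by 'mask' differ; loop when only ignored bits do.
bool masked_cas_strong(std::atomic<std::uint32_t>& obj,
                       std::uint32_t& expected, std::uint32_t desired,
                       std::uint32_t mask) {
    std::uint32_t raw = obj.load(std::memory_order_relaxed);
    for (;;) {
        if (((raw ^ expected) & mask) != 0) {
            expected = raw;   // genuine mismatch in the compared bits
            return false;
        }
        if (obj.compare_exchange_strong(raw, desired))
            return true;
        // CAS failed and refreshed 'raw': either the compared bits changed
        // (reported as failure next iteration) or only ignored bits differed
        // (the masked compare still passes, so we retry the CAS).
    }
}
```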
Updating this issue to no longer mention ARM32; at this time we still need to keep it compiling and working, but we no longer care about optimizing for it. Only ARM64 performance matters.
From #694 (comment):
Apparently it cannot be done without compiler support, in the form of more exposed ARM intrinsics.