Use SSE for 128-bit atomic load/store on AMD CPU with AVX #49

taiki-e · 2022-12-11T08:04:01Z

As mentioned in #10 (comment), AMD is also going to guarantee this.

Refs: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10

taiki-e · 2022-12-11T08:16:50Z

hmm...

https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c25

We do at least de-facto support atomics on UC memory because the ordering guarantees are a superset of cacheable memory, and 8-byte atomicity for aligned load/store is guaranteed even for non-cacheable memory types since P5 Pentium (and on AMD). (And lock cmpxchg16b is always atomic even on UC memory.)

But you're right that only Intel guarantees that 16-byte VMOVDQA loads/stores would be atomic on UC memory. So this change could break that very unwise corner-case on AMD which only guarantees that for cacheable loads/stores, and Zhaoxin only for WB.

But was anyone previously using 16-byte atomics on UC device memory? Do we actually care about supporting that? I'd guess no and no, so it's just a matter of documenting that somewhere.

Since GCC7 we've reported 16-byte atomics as being non-lock-free, so I hope people weren't using __atomic_store_n on device memory. The underlying implementation was never guaranteed.

Refs: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10
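For readers who want to see the shape of the idea discussed above, here is a minimal sketch, not the code from this PR. It assumes an x86_64 target and ordinary cacheable (WB) memory, which is the only case the quoted comment says AMD covers. The names `Atomic128`, `sse_is_atomic`, `load_sse`, `store_sse`, `load_cas`, and `store_cas` are hypothetical. The sketch uses a plain aligned `movdqa` when the CPU enumerates AVX and the vendor is Intel or AMD, and otherwise falls back to `lock cmpxchg16b` through the `core::arch::x86_64::cmpxchg16b` intrinsic.

```rust
use core::arch::asm;
use core::arch::x86_64::{__cpuid, __m128i, cmpxchg16b};
use core::cell::UnsafeCell;
use core::mem::transmute;
use core::sync::atomic::Ordering::SeqCst;

/// Hypothetical 16-byte-aligned cell (illustration only, not a type from the PR).
#[repr(C, align(16))]
pub struct Atomic128(UnsafeCell<u128>);

// SAFETY: every access below goes through a single atomic instruction or a CAS loop.
unsafe impl Sync for Atomic128 {}

/// Assumption of this sketch: trust aligned 16-byte SSE accesses only when the
/// CPU enumerates AVX and CPUID leaf 0 reports Intel or AMD as the vendor.
#[allow(unused_unsafe)] // `__cpuid` is a safe fn on newer toolchains
fn sse_is_atomic() -> bool {
    let r = unsafe { __cpuid(0) };
    // The vendor string is laid out in EBX, EDX, ECX order.
    let mut id = [0u8; 12];
    id[0..4].copy_from_slice(&r.ebx.to_le_bytes());
    id[4..8].copy_from_slice(&r.edx.to_le_bytes());
    id[8..12].copy_from_slice(&r.ecx.to_le_bytes());
    let vendor_ok = &id == b"GenuineIntel" || &id == b"AuthenticAMD";
    vendor_ok && std::is_x86_feature_detected!("avx")
}

/// One aligned 16-byte load. `ptr` must be valid and 16-byte aligned.
unsafe fn load_sse(ptr: *mut u128) -> u128 {
    let out: __m128i;
    unsafe {
        asm!(
            "movdqa {out}, xmmword ptr [{ptr}]",
            ptr = in(reg) ptr,
            out = out(xmm_reg) out,
            options(nostack, preserves_flags),
        );
        transmute(out)
    }
}

/// One aligned 16-byte store, followed by a full fence for SeqCst semantics.
unsafe fn store_sse(ptr: *mut u128, val: u128) {
    unsafe {
        let val: __m128i = transmute(val);
        asm!(
            "movdqa xmmword ptr [{ptr}], {val}",
            "mfence",
            ptr = in(reg) ptr,
            val = in(xmm_reg) val,
            options(nostack, preserves_flags),
        );
    }
}

/// Fallback load: a compare-exchange of 0 with 0 returns the current value
/// and never changes it.
#[target_feature(enable = "cmpxchg16b")]
unsafe fn load_cas(ptr: *mut u128) -> u128 {
    unsafe { cmpxchg16b(ptr, 0, 0, SeqCst, SeqCst) }
}

/// Fallback store: retry the compare-exchange until it succeeds.
#[target_feature(enable = "cmpxchg16b")]
unsafe fn store_cas(ptr: *mut u128, val: u128) {
    unsafe {
        let mut cur = cmpxchg16b(ptr, 0, 0, SeqCst, SeqCst);
        loop {
            let prev = cmpxchg16b(ptr, cur, val, SeqCst, SeqCst);
            if prev == cur {
                return;
            }
            cur = prev;
        }
    }
}

impl Atomic128 {
    pub const fn new(v: u128) -> Self {
        Self(UnsafeCell::new(v))
    }

    pub fn load(&self) -> u128 {
        let p = self.0.get();
        if sse_is_atomic() {
            unsafe { load_sse(p) }
        } else {
            assert!(std::is_x86_feature_detected!("cmpxchg16b"));
            unsafe { load_cas(p) }
        }
    }

    pub fn store(&self, v: u128) {
        let p = self.0.get();
        if sse_is_atomic() {
            unsafe { store_sse(p, v) }
        } else {
            assert!(std::is_x86_feature_detected!("cmpxchg16b"));
            unsafe { store_cas(p, v) }
        }
    }
}
```

The sketch runs the CPUID check on every call for brevity; a real implementation would presumably cache the result. It also does nothing about UC or other non-WB memory types, which per the discussion above would be a matter of documentation rather than code.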

taiki-e · 2022-12-14T15:59:08Z

bors r+

Given that GCC has already merged a similar patch (gcc-mirror/gcc@4a7a846), the subsequent discussion in that thread, and the scope of atomic operations in ARM and NVPTX, this should be mergeable as is.

https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c27

e.g. x86 memory-type stuff, and that ARM assumes all cores are in the same inner-shareable cache-coherency domain, thus barriers are dmb ish not dmb sy and so on.

bors · 2022-12-14T17:16:45Z

Build succeeded:

taiki-e mentioned this pull request Dec 11, 2022

Optimize AtomicU128::{load, store} #10

Closed

taiki-e added the O-x86 Target: x86/x64 processors label Dec 11, 2022

taiki-e force-pushed the vmovdqa3 branch from f8c3c5e to 9d55f0a Compare December 11, 2022 08:07

Use SSE for 128-bit atomic load/store on AMD CPU with AVX

d244152

Refs: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10

taiki-e force-pushed the vmovdqa3 branch from 9d55f0a to d244152 Compare December 11, 2022 11:48

bors bot merged commit b733d7c into main Dec 14, 2022

bors bot deleted the vmovdqa3 branch December 14, 2022 17:16
