Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SSE for 128-bit atomic load/store on AMD CPU with AVX #49

Merged
merged 1 commit into from
Dec 14, 2022
Merged

Conversation

taiki-e
Copy link
Owner

@taiki-e taiki-e commented Dec 11, 2022

As mentioned in #10 (comment), AMD is also going to guarantee this.

Refs: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10

@taiki-e taiki-e added the O-x86 Target: x86/x64 processors label Dec 11, 2022
@taiki-e
Copy link
Owner Author

taiki-e commented Dec 11, 2022

hmm...

https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c25

We do at least de-facto support atomics on UC memory because the ordering guarantees are a superset of cacheable memory, and 8-byte atomicity for aligned load/store is guaranteed even for non-cacheable memory types since P5 Pentium (and on AMD). (And lock cmpxchg16b is always atomic even on UC memory.)

But you're right that only Intel guarantees that 16-byte VMOVDQA loads/stores would be atomic on UC memory. So this change could break that very unwise corner-case on AMD which only guarantees that for cacheable loads/stores, and Zhaoxin only for WB.

But was anyone previously using 16-byte atomics on UC device memory? Do we actually care about supporting that? I'd guess no and no, so it's just a matter of documenting that somewhere.

Since GCC7 we've reported 16-byte atomics as being non-lock-free, so I hope people weren't using __atomic_store_n on device memory. The underlying implementation was never guaranteed.

@taiki-e
Copy link
Owner Author

taiki-e commented Dec 14, 2022

bors r+

Given that GCC has already merged a similar patch (gcc-mirror/gcc@4a7a846), the subsequent discussion in that thread, the, and the scope of atomic operations in ARM and NVPTX, this should be able to be merged as is.

https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c27

e.g. x86 memory-type stuff, and that ARM assumes all cores are in the same inner-shareable cache-coherency domain, thus barriers are dmb ish not dmb sy and so on.

@bors
Copy link
Contributor

bors bot commented Dec 14, 2022

@bors bors bot merged commit b733d7c into main Dec 14, 2022
@bors bors bot deleted the vmovdqa3 branch December 14, 2022 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-x86 Target: x86/x64 processors
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant