Optimize `AtomicU128::{load, store}` #10

ibraheemdev · 2022-04-26T13:47:51Z

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:

MOVAPD, MOVAPS, and MOVDQA.

VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.

VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

`AtomicU128::{load, store}` can take advantage of this instead of using the more expensive `cmpxchg16b` instruction. See this GCC issue/patch for details: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688.
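As a rough illustration of the idea, here is a minimal sketch of an AVX-gated fast path in Rust. It assumes x86_64, a 16-byte-aligned cell, and run-time detection through the standard `is_x86_feature_detected!` macro. The names `Aligned128`, `try_load`, and `try_store` are hypothetical, and the sketch does not check the CPU vendor, which a real implementation would have to do because the guarantee quoted above is vendor-specific.

```rust
use std::cell::UnsafeCell;

/// Hypothetical 16-byte-aligned cell (illustration only).
#[repr(C, align(16))]
pub struct Aligned128(UnsafeCell<u128>);

// Sketch-level claim: the value is only accessed through the helpers below.
unsafe impl Sync for Aligned128 {}

impl Aligned128 {
    pub const fn new(v: u128) -> Self {
        Self(UnsafeCell::new(v))
    }
}

#[cfg(target_arch = "x86_64")]
mod sse {
    use std::arch::asm;
    use std::arch::x86_64::__m128i;

    /// 16-byte load with VMOVDQA. The caller must ensure that AVX is
    /// available and that `src` is 16-byte aligned.
    #[target_feature(enable = "avx")]
    pub unsafe fn load(src: *mut u128) -> u128 {
        unsafe {
            let out: __m128i;
            asm!(
                "vmovdqa {out}, xmmword ptr [{src}]",
                src = in(reg) src,
                out = out(xmm_reg) out,
                options(nostack, preserves_flags),
            );
            std::mem::transmute::<__m128i, u128>(out)
        }
    }

    /// 16-byte store with VMOVDQA, followed by MFENCE so the store is not
    /// reordered with later loads (a conservative choice for this sketch).
    #[target_feature(enable = "avx")]
    pub unsafe fn store(dst: *mut u128, val: u128) {
        unsafe {
            let val = std::mem::transmute::<u128, __m128i>(val);
            asm!(
                "vmovdqa xmmword ptr [{dst}], {val}",
                "mfence",
                dst = in(reg) dst,
                val = in(xmm_reg) val,
                options(nostack, preserves_flags),
            );
        }
    }
}

/// Returns `Some(value)` if the AVX fast path is usable, `None` otherwise.
pub fn try_load(cell: &Aligned128) -> Option<u128> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx") {
            // SAFETY: AVX was detected and the cell is 16-byte aligned.
            return Some(unsafe { sse::load(cell.0.get()) });
        }
    }
    let _ = cell;
    None
}

/// Returns `true` if the value was stored through the AVX fast path.
pub fn try_store(cell: &Aligned128, val: u128) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx") {
            // SAFETY: AVX was detected and the cell is 16-byte aligned.
            unsafe { sse::store(cell.0.get(), val) };
            return true;
        }
    }
    let _ = (cell, val);
    false
}
```

When `try_load` or `try_store` reports that the fast path is unavailable, a caller would fall back to the existing `cmpxchg16b`-based implementation.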


ibraheemdev · 2022-04-30T06:40:26Z

As of Armv8.4, the LDP/STP instructions are guaranteed to be single-copy atomic for 16-byte accesses:

Changes to single-copy atomicity in Armv8.4

In addition to the single-copy atomicity requirements listed above:

Instructions that are introduced in FEAT_LRCPC are single-copy atomic when all of the following conditions are true:

All bytes being accessed are within the same 16-byte quantity aligned to 16 bytes.

Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, all loads and stores are single-copy atomic when all of the following conditions are true:

Accesses are unaligned to their data size but all bytes being accessed are within a 16-byte quantity that is aligned to 16 bytes.

Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy atomic when all of the following conditions are true:

The overall memory access is aligned to 16 bytes.

Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
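The address conditions in the quoted text can be written down as two small predicates. This is only a sketch: the function names are hypothetical, and it models the address rules alone, not the memory-type requirement (Inner Write-Back, Outer Write-Back Normal cacheable) or whether FEAT_LSE2 is actually implemented.

```rust
/// True if the bytes `[addr, addr + len)` all lie inside one 16-byte block
/// that is itself aligned to 16 bytes (the condition for ordinary loads and
/// stores under FEAT_LSE2).
pub fn within_aligned_16(addr: usize, len: usize) -> bool {
    if len == 0 || len > 16 {
        return false;
    }
    // Address of the last byte touched; reject accesses that would wrap.
    match addr.checked_add(len - 1) {
        Some(last) => addr / 16 == last / 16,
        None => false,
    }
}

/// True if a 16-byte LDP/STP access (two 64-bit registers) starting at
/// `addr` is aligned to 16 bytes overall.
pub fn pair_access_is_aligned(addr: usize) -> bool {
    addr % 16 == 0
}
```

For example, an 8-byte access at offset 8 within an aligned block qualifies, while an 8-byte access at offset 12 crosses into the next block and does not.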

See also the relevant LLVM patch.

taiki-e · 2022-06-05T04:56:43Z

Btw, I recently learned that powerpc64 (pwr8+) supports 128-bit atomics (see the LLVM patch, although QEMU doesn't seem to support some of them) and added support for them to another library. I plan to do the same with this library.

taiki-e · 2022-06-18T05:45:08Z

UPDATE: This table is outdated. See the atomic128 module's readme for the latest version.

Once #16 is merged, the targets that support 128-bit atomics and the instructions they use are as follows.

| target_arch | load | store | CAS | note |
| --- | --- | --- | --- | --- |
| x86_64 | cmpxchg16b or vmovdqa | cmpxchg16b or vmovdqa | cmpxchg16b | cmpxchg16b target feature required. vmovdqa requires Intel or AMD CPU with AVX. Both compile-time and run-time detection are supported for cmpxchg16b. vmovdqa is currently run-time detection only. Requires rustc 1.59+ when cmpxchg16b target feature is enabled at compile-time, otherwise requires nightly |
| aarch64 | ldxp/stxp or ldp | ldxp/stxp or stp | ldxp/stxp or casp | casp requires lse target feature, ldp/stp requires lse2 target feature. Both compile-time and run-time detection are supported for lse. lse2 is currently compile-time detection only. Requires rustc 1.59+ |
| powerpc64 | lq | stq | lqarx/stqcx. | Little endian or target CPU pwr8+. Requires nightly |
| s390x | lpq | stpq | cdsg | Requires nightly |

Note: ~~Run-time detections require outline-atomics optional feature of this crate~~ EDIT: since 0.3.19, run-time detections are enabled by default.
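To make the difference between compile-time and run-time detection concrete, here is a minimal sketch for `cmpxchg16b`. The function and constant names are hypothetical and this is not the crate's detection code. The standard `is_x86_feature_detected!` macro already caches its result internally, so the explicit cache below only illustrates the pattern.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Cached detection state (hypothetical encoding for this sketch).
const UNKNOWN: u8 = 0;
const ABSENT: u8 = 1;
const PRESENT: u8 = 2;

static CMPXCHG16B: AtomicU8 = AtomicU8::new(UNKNOWN);

/// Run-time detection, only meaningful on x86_64.
#[cfg(target_arch = "x86_64")]
fn detect_uncached() -> bool {
    std::is_x86_feature_detected!("cmpxchg16b")
}

/// Other architectures never have this feature.
#[cfg(not(target_arch = "x86_64"))]
fn detect_uncached() -> bool {
    false
}

/// Returns whether `cmpxchg16b` may be used.
pub fn has_cmpxchg16b() -> bool {
    // Compile-time detection: the feature was enabled for the whole build
    // (for example with `-C target-feature=+cmpxchg16b`), so no run-time
    // check is needed.
    if cfg!(target_feature = "cmpxchg16b") {
        return true;
    }
    // Run-time detection: query once, then reuse the cached answer.
    match CMPXCHG16B.load(Ordering::Relaxed) {
        PRESENT => true,
        ABSENT => false,
        _ => {
            let found = detect_uncached();
            CMPXCHG16B.store(if found { PRESENT } else { ABSENT }, Ordering::Relaxed);
            found
        }
    }
}
```

Relaxed ordering is enough here because every thread that races on the cache computes the same answer.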

16: Use SSE for 128-bit atomic load/store on Intel CPU with AVX r=taiki-e a=taiki-e

x86_64 part of #10

The following are the results of a simple microbenchmark:

```
bench_portable_atomic_arch/u128_load
    time:   [1.4598 ns 1.4671 ns 1.4753 ns]
    change: [-81.510% -81.210% -80.950%] (p = 0.00 < 0.05)
    Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

bench_portable_atomic_arch/u128_store
    time:   [1.3852 ns 1.3937 ns 1.4024 ns]
    change: [-82.318% -81.989% -81.621%] (p = 0.00 < 0.05)
    Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

bench_portable_atomic_arch/u128_concurrent_load
    time:   [56.422 us 56.767 us 57.204 us]
    change: [-70.807% -70.143% -69.443%] (p = 0.00 < 0.05)
    Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

bench_portable_atomic_arch/u128_concurrent_load_store
    time:   [136.53 us 139.96 us 145.39 us]
    change: [-82.570% -81.879% -80.820%] (p = 0.00 < 0.05)
    Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe

bench_portable_atomic_arch/u128_concurrent_store
    time:   [146.03 us 147.67 us 149.98 us]
    change: [-90.486% -90.124% -89.483%] (p = 0.00 < 0.05)
    Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe

bench_portable_atomic_arch/u128_concurrent_store_swap
    time:   [765.11 us 766.69 us 768.29 us]
    change: [-51.204% -50.967% -50.745%] (p = 0.00 < 0.05)
    Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
```

Closes #10

Co-authored-by: Taiki Endo <[email protected]>

taiki-e · 2022-07-27T16:58:51Z

Other optimizations:

(Apart from these, there have also been some minor optimizations regarding inline assembly since #16 was merged.)

taiki-e · 2022-12-11T07:44:35Z

AMD is also going to guarantee atomicity of 128-bit SSE: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10

We would update the AMD APM manuals in the next revision.

For all AMD architectures,

Processors that support AVX extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

which means all 128b instructions, even the *MOVDQU instructions, are atomic if they end up being naturally aligned.
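Since the guarantee is stated per vendor, the SSE path has to be gated on the CPU vendor as well as on AVX. The following is a minimal sketch of such a gate, assuming the vendor string is read from CPUID leaf 0. All names are hypothetical, and only the two vendors discussed in this thread are accepted.

```rust
/// CPUID vendor strings for the two vendors discussed in this thread.
pub const INTEL: [u8; 12] = *b"GenuineIntel";
pub const AMD: [u8; 12] = *b"AuthenticAMD";

/// Assembles the 12-byte vendor string from the CPUID leaf 0 registers.
/// The string is stored in EBX, EDX, ECX order, little-endian within each.
pub fn vendor_from_regs(ebx: u32, edx: u32, ecx: u32) -> [u8; 12] {
    let mut v = [0u8; 12];
    v[0..4].copy_from_slice(&ebx.to_le_bytes());
    v[4..8].copy_from_slice(&edx.to_le_bytes());
    v[8..12].copy_from_slice(&ecx.to_le_bytes());
    v
}

/// True if aligned 128-bit SSE loads/stores may be treated as atomic:
/// the CPU reports AVX and comes from a vendor that documents the guarantee.
pub fn sse_128_is_atomic(vendor: &[u8; 12], has_avx: bool) -> bool {
    has_avx && (*vendor == INTEL || *vendor == AMD)
}

/// Reads the vendor string of the current CPU (x86_64 only).
#[cfg(target_arch = "x86_64")]
pub fn current_vendor() -> [u8; 12] {
    // `__cpuid` is an `unsafe fn` on older toolchains; the block is kept so
    // the sketch builds there too, and the lint is silenced on newer ones.
    #[allow(unused_unsafe)]
    let r = unsafe { std::arch::x86_64::__cpuid(0) };
    vendor_from_regs(r.ebx, r.edx, r.ecx)
}
```

On x86_64, a caller could combine the two checks as `sse_128_is_atomic(&current_vendor(), std::is_x86_feature_detected!("avx"))`.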

UPDATE: filed #49

49: Use SSE for 128-bit atomic load/store on AMD CPU with AVX r=taiki-e a=taiki-e

As mentioned in #10 (comment), AMD is also going to guarantee this.

Refs: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10

Co-authored-by: Taiki Endo <[email protected]>

57: Enable outline-atomics by default and provide cfg to disable it r=taiki-e a=taiki-e

This enables the `outline-atomics` feature by default and provides the `portable_atomic_no_outline_atomics` cfg to disable it.

(outline-atomics enables several optimizations on x86_64 and aarch64. See the list in #10 (comment) for details.)

It has previously been pointed out that, due to the nature of cargo features, controlling this through a cargo feature does not work well. As of this release, the `outline-atomics` feature is a no-op, and outline-atomics is enabled by default.

Note: outline-atomics in portable-atomic currently applies to 128-bit atomics only. outline-atomics for atomics of other sizes is controlled by LLVM's `outline-atomics` target feature.

Closes #25

Co-authored-by: Taiki Endo <[email protected]>


taiki-e added the C-enhancement (Category: A new feature or an improvement for an existing one) and O-x86 (Target: x86/x64 processors) labels Apr 26, 2022

ibraheemdev changed the title ~~AtomicU128 can use SSE for loads on supported platforms~~ Optimize AtomicU128::{load, store} Apr 30, 2022

taiki-e added the O-arm (Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state) label Apr 30, 2022

taiki-e mentioned this issue Apr 30, 2022

aarch64: Use LDP/STP if FEAT_LSE2 is available #11

Merged

taiki-e mentioned this issue Jun 18, 2022

Use SSE for 128-bit atomic load/store on Intel CPU with AVX #16

Merged

bors bot closed this as completed in 10b561a Jun 19, 2022

taiki-e mentioned this issue Aug 7, 2022

Atomics must be mutable rust-lang/miri#2464

Merged

taiki-e mentioned this issue Dec 11, 2022

Use SSE for 128-bit atomic load/store on AMD CPU with AVX #49

Merged

taiki-e mentioned this issue Dec 25, 2022

Enable outline-atomics by default and provide cfg to disable it #57

Merged

taiki-e mentioned this issue Dec 25, 2022

x86_64: Add portable_atomic_vmovdqa_atomic cfg #59

Draft

taiki-e added a commit that referenced this issue Jan 12, 2023

Add 128-bit atomic instructions table from issue comment

c7b309d

From #10 (comment).

taiki-e added a commit that referenced this issue Jan 12, 2023

Add 128-bit atomic instructions table from issue comment

c3c9748

From #10 (comment).

taiki-e added a commit that referenced this issue Jan 12, 2023

Add 128-bit atomic instructions table from issue comment

fda1157

From #10 (comment).

taiki-e mentioned this issue Jan 30, 2023

aarch64: Support FEAT_LSE128 and FEAT_LRCPC3 #68

Merged

taiki-e mentioned this issue Oct 16, 2023

aarch64: Support run-time detection of FEAT_LSE2 #126

Merged

taiki-e added the O-aarch64 (Target: Armv8-A, Armv8-R, or later processors in AArch64 mode) label and removed the O-arm (Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state) label Sep 2, 2024
