Optimize AtomicU128::{load, store}
#10
Original title: `AtomicU128` can use SSE for loads on supported platforms
As of Armv8.4, the LDP/STP instructions are guaranteed to be single-copy atomic for 16-byte-aligned 16-byte accesses.
See also the relevant LLVM patch.
By the way, I recently learned that powerpc64 (pwr8+) supports 128-bit atomics (see the LLVM patch; QEMU does not seem to support some of the instructions), and I added support for it to another library. I plan to do the same for this library.
UPDATE: This table is outdated. See the …
Once #16 is merged, the list of targets that support 128-bit atomics and the instructions used is as follows.
16: Use SSE for 128-bit atomic load/store on Intel CPU with AVX r=taiki-e a=taiki-e

x86_64 part of #10

The following are the results of a simple microbenchmark:

```
bench_portable_atomic_arch/u128_load
                        time:   [1.4598 ns 1.4671 ns 1.4753 ns]
                        change: [-81.510% -81.210% -80.950%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_store
                        time:   [1.3852 ns 1.3937 ns 1.4024 ns]
                        change: [-82.318% -81.989% -81.621%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load
                        time:   [56.422 us 56.767 us 57.204 us]
                        change: [-70.807% -70.143% -69.443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load_store
                        time:   [136.53 us 139.96 us 145.39 us]
                        change: [-82.570% -81.879% -80.820%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store
                        time:   [146.03 us 147.67 us 149.98 us]
                        change: [-90.486% -90.124% -89.483%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [765.11 us 766.69 us 768.29 us]
                        change: [-51.204% -50.967% -50.745%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
```

Closes #10

Co-authored-by: Taiki Endo
Other optimizations:
(Apart from these, there have also been some minor optimizations regarding inline assembly since #16 was merged.)
AMD is also going to guarantee the atomicity of 128-bit SSE loads and stores: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10
UPDATE: filed #49
49: Use SSE for 128-bit atomic load/store on AMD CPU with AVX r=taiki-e a=taiki-e

As mentioned in #10 (comment), AMD is also going to guarantee this.

Refs: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=104688#c10

Co-authored-by: Taiki Endo
57: Enable outline-atomics by default and provide cfg to disable it r=taiki-e a=taiki-e

This enables the `outline-atomics` feature by default and provides the `portable_atomic_no_outline_atomics` cfg to disable it. (outline-atomics enables several optimizations on x86_64 and aarch64. See [this list](#10 (comment)) for details.)

It has previously been pointed out that, due to the nature of cargo features, controlling this with a cargo feature does not work well. As of this release, the `outline-atomics` feature is a no-op, and outline-atomics is enabled by default.

Note: outline-atomics in portable-atomic currently applies only to 128-bit atomics. outline-atomics for atomics of other sizes is controlled by LLVM's `outline-atomics` target feature.

Closes #25

Co-authored-by: Taiki Endo
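As a rough illustration of the idea behind outline-atomics, here is a minimal sketch of run-time CPU feature detection with a cached result and a cfg-based opt-out. It assumes that outline-atomics means "detect once at run time, then dispatch", which the commit message above does not spell out. `Path128`, `CACHED`, `detect`, and `selected_path` are hypothetical names for this sketch, not the crate's API; only the `portable_atomic_no_outline_atomics` cfg name comes from the thread.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical label for the 128-bit implementation chosen at run time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Path128 {
    Fallback = 1,
    Cmpxchg16b = 2,
    Vmovdqa = 3,
}

/// Cached detection result; 0 means "not detected yet".
static CACHED: AtomicU8 = AtomicU8::new(0);

/// Detects the best available path (assumption: AVX implies atomic 16-byte
/// aligned SSE load/store, as discussed for Intel and AMD in this thread).
#[allow(unexpected_cfgs)]
fn detect() -> Path128 {
    // Opt-out: with this cfg set, the sketch skips run-time detection.
    if cfg!(portable_atomic_no_outline_atomics) {
        return Path128::Fallback;
    }
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx") {
            return Path128::Vmovdqa;
        }
        if std::is_x86_feature_detected!("cmpxchg16b") {
            return Path128::Cmpxchg16b;
        }
    }
    Path128::Fallback
}

/// Returns the selected path, running detection only on the first call.
pub fn selected_path() -> Path128 {
    match CACHED.load(Ordering::Relaxed) {
        1 => Path128::Fallback,
        2 => Path128::Cmpxchg16b,
        3 => Path128::Vmovdqa,
        _ => {
            let path = detect();
            // A racing thread would store the same value, so Relaxed is enough.
            CACHED.store(path as u8, Ordering::Relaxed);
            path
        }
    }
}
```

Calling `selected_path()` repeatedly returns the same value, and only the first call pays for detection. In this sketch the opt-out would be passed to the compiler as `--cfg portable_atomic_no_outline_atomics`.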
On CPUs where 16-byte-aligned 128-bit SSE loads and stores are guaranteed to be atomic, `AtomicU128::{load, store}` can take advantage of this instead of using the more expensive cmpxchg instruction. See this GCC issue/patch for details: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688.
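The difference between the two load paths can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: `Cell128` and the two helper functions are hypothetical names, only the load side is shown, and the sketch assumes that a CPU reporting AVX guarantees atomic 16-byte-aligned `vmovdqa` accesses (as discussed for Intel and AMD in this thread). On targets other than x86_64 the sketch simply reports that no path is available.

```rust
use std::cell::UnsafeCell;

/// Hypothetical 16-byte-aligned cell used only for this sketch.
#[repr(C, align(16))]
pub struct Cell128(UnsafeCell<u128>);

// SAFETY (sketch): the value is only accessed through the atomic paths below.
unsafe impl Sync for Cell128 {}

#[cfg(target_arch = "x86_64")]
mod x86 {
    use core::arch::asm;
    use core::arch::x86_64::{__m128i, cmpxchg16b};
    use core::sync::atomic::Ordering;

    /// Load with a single aligned 128-bit SSE/AVX move.
    /// Caller must ensure AVX is available and `src` is 16-byte aligned.
    #[target_feature(enable = "avx")]
    pub unsafe fn load_vmovdqa(src: *mut u128) -> u128 {
        let out: __m128i;
        unsafe {
            asm!(
                "vmovdqa {out}, xmmword ptr [{src}]",
                src = in(reg) src,
                out = out(xmm_reg) out,
                options(nostack, preserves_flags),
            );
            core::mem::transmute(out)
        }
    }

    /// Load emulated with a compare-and-swap: CAS(0, 0) stores 0 only if the
    /// value is already 0, and returns the current value either way.
    /// Caller must ensure cmpxchg16b is available and `src` is writable.
    #[target_feature(enable = "cmpxchg16b")]
    pub unsafe fn load_cmpxchg16b(src: *mut u128) -> u128 {
        unsafe { cmpxchg16b(src, 0, 0, Ordering::SeqCst, Ordering::SeqCst) }
    }
}

impl Cell128 {
    pub const fn new(v: u128) -> Self {
        Self(UnsafeCell::new(v))
    }

    /// Returns the value and the name of the path used, or `None` if neither
    /// path is available on this CPU/target.
    pub fn load(&self) -> Option<(u128, &'static str)> {
        #[cfg(target_arch = "x86_64")]
        {
            let p = self.0.get();
            if std::is_x86_feature_detected!("avx") {
                // SAFETY: AVX was detected and `p` is 16-byte aligned.
                return Some((unsafe { x86::load_vmovdqa(p) }, "vmovdqa"));
            }
            if std::is_x86_feature_detected!("cmpxchg16b") {
                // SAFETY: cmpxchg16b was detected and `p` is valid and writable.
                return Some((unsafe { x86::load_cmpxchg16b(p) }, "cmpxchg16b"));
            }
        }
        None
    }
}
```

Note that the CAS-based path performs a locked read-modify-write even though it only needs to read, which is one reason a plain aligned vector load is cheaper when the hardware guarantees its atomicity.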