Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a_store operation in atomic.h #403

Merged
merged 2 commits into from
Mar 24, 2023
Merged

Conversation

wenyongh
Copy link
Contributor

The wasm opcodes generated by a_barrier mode are like below:

    atomic.fence
    local.get
    i32.const
    i32.store
    atomic.fence

Which use i32.store and are not atomic operations, and cause some
wasi-threads cases hang in WAMR:
bytecodealliance/wasm-micro-runtime#2024 (comment)

@@ -182,13 +182,26 @@ static inline void a_dec(volatile int *p)
#define a_store a_store
static inline void a_store(volatile int *p, int v)
{
#ifdef __wasilibc_unmodified_upstream
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better so not defined a_barrier in wasm32/atomic_arch.h if its not doing the right thing?

* i32.const
* i32.store
* atomic.fence
* which are not atomic operations, so we use a_swap instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly the *p = v operation above is non-atomic since p and v are just normal C types (not atomic types), so that fact that the above sequence contains non-atomic operations should be be surprise right? We would expect that on all platforms.

I guess atomic.fence doesn't seem to be doing what the musl developers think that a_barrier is supposed to do? Do you know why?

Again, maybe better to just not defined a_barrier if we can't make it do what musl expects here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly the *p = v operation above is non-atomic since p and v are just normal C types (not atomic types), so that fact that the above sequence contains non-atomic operations should be be surprise right? We would expect that on all platforms.

maybe it's assumed to be compiled to an atomic-enough instruction. like mov on x86.

or maybe a_store doesn't require to be that atomic.

I guess atomic.fence doesn't seem to be doing what the musl developers think that a_barrier is supposed to do? Do you know why?

Again, maybe better to just not defined a_barrier if we can't make it do what musl expects here?

are you aware of any doc which explains what musl expects?
i'm not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is not easy to just use atomic.fence pairs to make the a_store an atomic operation, especially for interpreter, since interpreter needs to load and dispatch the opcodes one by one and pop/push operands when handling the opcodes, there will be many extra load/store/jump operations inserted among these operations, so it should be not atomic. For JIT/AOT mode, maybe the machine code generated can be atomic after some optimizations are applied, e.g. the code generated may be just like memory barrier + store instruction + memory barrier. Maybe it is why we found the hang issue only in interpreter mode, and haven't found the issue in AOT mode.

And yes, it is better to not define a_barrier, seems it is just converted into atomic.fence wasm opcode, and is not what must expects to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since interpreter needs to load and dispatch the opcodes one by one and pop/push operands when handling the opcodes

instruction fetch doesn't matter because it's read-only.
operand stack doesn't matter as it's local to the thread.
i don't see any fundamental differences between the interpreter and jit/aot here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AOT code is generated dynamically according to the input bytecode, the machine codes generated may be like memory barrier + store instruction + memory barrier, which ensures the memory ordering of access.
The interpreter is to interpret wasm opcode one by one, note that the machine code compiled consists lots of code pieces of opcode handler, these code pieces are un-ordered, for example, there may be code pieces of I32.STORE and ATOMIC.FENCE:

handler_of_I32.STORE:
    ...
    store instruction
    fetch next opcode and jump to its handler
...
...
handler_of_ATOMIC.FENCE:
    memory barrier
    fetch next opcode and jump to its handler

Or:

handler_of_ATOMIC.FENCE:
    memory barrier
    fetch next opcode and jump to its handler
...
...
handler_of_I32.STORE:
    ...
    store instruction
    fetch next opcode and jump to its handler

Note there is only one memory barrier instruction generated and there may be many other instructions between store and memory barrier instruction. The memory order cannot be promised, interpreter may run memory barrier first, then jump forward/backward to handler of I32.store, and then jump forward/backward again to handler of ATOMIC.FENCE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, maybe better to just not defined a_barrier if we can't make it do what musl expects here?

Thanks @sbc100, I removed the a_barrier definition in wasm32/atomic_arch.h.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AOT code is generated dynamically according to the input bytecode, the machine codes generated may be like memory barrier + store instruction + memory barrier, which ensures the memory ordering of access. The interpreter is to interpret wasm opcode one by one, note that the machine code compiled consists lots of code pieces of opcode handler, these code pieces are un-ordered, for example, there may be code pieces of I32.STORE and ATOMIC.FENCE:

handler_of_I32.STORE:
    ...
    store instruction
    fetch next opcode and jump to its handler
...
...
handler_of_ATOMIC.FENCE:
    memory barrier
    fetch next opcode and jump to its handler

Or:

handler_of_ATOMIC.FENCE:
    memory barrier
    fetch next opcode and jump to its handler
...
...
handler_of_I32.STORE:
    ...
    store instruction
    fetch next opcode and jump to its handler

Note there is only one memory barrier instruction generated and there may be many other instructions between store and memory barrier instruction. The memory order cannot be promised, interpreter may run memory barrier first, then jump forward/backward to handler of I32.store, and then jump forward/backward again to handler of ATOMIC.FENCE.

i don't understand your concern.
as far as the interpreter executes handler_of_I32.STORE and handler_of_ATOMIC.FENCE in the order as in the original wasm bytecode, it should be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, isn't the memory ordering related to the place of barrier inside the compiled code? And will it prevent the compiler from generation some re-ordered instructions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory barrier is cpu instruction which takes effect when it's executed.

@TerrorJack
Copy link
Contributor

Out of curiosity, it seems a_swap isn't defined in https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/arch/wasm32/atomic_arch.h yet, but it seems to be an oversight? Since wasm atomic opcodes contain xchg which can be used to implement it. And there's also atomic store opcodes that can be used to directly implement a_store...

@sbc100
Copy link
Member

sbc100 commented Mar 22, 2023

Out of curiosity, it seems a_swap isn't defined in https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/arch/wasm32/atomic_arch.h yet, but it seems to be an oversight? Since wasm atomic opcodes contain xchg which can be used to implement it. And there's also atomic store opcodes that can be used to directly implement a_store...

As a point of reference, in emscripten the atomic_arch.h stuff is implemented in terms of __c11_atomic_ intrinsics which allows them to be fairly generic and call back on clang/llvm to do that right thing:

https://github.com/emscripten-core/emscripten/blob/592aeca6a44b20ddeea6948fd9b5793bb8fc36de/system/lib/libc/musl/arch/emscripten/atomic_arch.h#L23-L27

@TerrorJack
Copy link
Contributor

Yeah, I was thinking of the old gnu style __sync_swap and such, but C11 atomics definitely look better.

@yamt
Copy link
Contributor

yamt commented Mar 22, 2023

The wasm opcodes generated by a_barrier mode are like below:

    atomic.fence
    local.get
    i32.const
    i32.store
    atomic.fence

Which use i32.store and are not atomic operations, and cause some wasi-threads cases hang in WAMR: bytecodealliance/wasm-micro-runtime#2024 (comment)

i'm not sure how it can cause the hang.
do you mean that wamr's STORE_U32 happens to be compiled to a non-atomic store?

@yamt
Copy link
Contributor

yamt commented Mar 22, 2023

if my reading of the spec [1] is correct, it seems like a bug in wamr to me.

[1] WebAssembly/threads#197

@wenyongh
Copy link
Contributor Author

Which use i32.store and are not atomic operations, and cause some wasi-threads cases hang in WAMR: bytecodealliance/wasm-micro-runtime#2024 (comment)

i'm not sure how it can cause the hang. do you mean that wamr's STORE_U32 happens to be compiled to a non-atomic store?

It is really complex to explain the whole process of pthread_barrier_wait, somewhat like:
Thread A is doing a_swap to take a lock, and if the lock is taken, it will enter the critical code; if not, it will run into atomic.wait to wait again.
Thread B's non-atomic i32.store 0 to the lock is omitted by the thread A's a_swap, a_swap re-writes 1 to the lock and returns fail, so thread A waits again, but since the lock is written to 1, it waits indefinitely for other thread to notify it. But thread B already finishes its task and doesn't notify it again, it also waits indefinitely.

@wenyongh
Copy link
Contributor Author

if my reading of the spec [1] is correct, it seems like a bug in wamr to me.

[1] WebAssembly/threads#197

Wondering, why i32.store needs to take the lock too? It is non-atomic, and spec doesn't mention that, instead, spec mentions atomic store needs to be seq-cst order.

@yamt
Copy link
Contributor

yamt commented Mar 22, 2023

if my reading of the spec [1] is correct, it seems like a bug in wamr to me.
[1] WebAssembly/threads#197

Wondering, why i32.store needs to take the lock too? It is non-atomic, and spec doesn't mention that, instead, spec mentions atomic store needs to be seq-cst order.

as i stated in the linked issue, it's actually somehow stated "atomic" in the spec.
and these musl-internal locking mechanisms seem to rely on the property.
i agree it's a bit strange. that's why i opened the question issue in the spec repo.

@yamt
Copy link
Contributor

yamt commented Mar 22, 2023

Which use i32.store and are not atomic operations, and cause some wasi-threads cases hang in WAMR: bytecodealliance/wasm-micro-runtime#2024 (comment)

i'm not sure how it can cause the hang. do you mean that wamr's STORE_U32 happens to be compiled to a non-atomic store?

It is really complex to explain the whole process of pthread_barrier_wait, somewhat like: Thread A is doing a_swap to take a lock, and if the lock is taken, it will enter the critical code; if not, it will run into atomic.wait to wait again. Thread B's non-atomic i32.store 0 to the lock is omitted by the thread A's a_swap, a_swap re-writes 1 to the lock and returns fail, so thread A waits again, but since the lock is written to 1, it waits indefinitely for other thread to notify it. But thread B already finishes its task and doesn't notify it again, it also waits indefinitely.

i asked about STORE_U32 because i thought that as far as STORE_U32 is compiled into eg. x86 mov instruction it's still atomic enough.

but i noticed that it's actually more about the cmpxchg implementation than i32.store.
when cmpxchg is implemented with a lock as it is in wamr interpreter, i32.store can be executed in the middle of cmpxchg and effectively break the lock as you say.

while it can be fixed by using an atomic opcode instead of i32.store as in this PR,
i'd like to wait for clarification in the threads repo as the trick to implement unlock with non-atomic store
is somehow common. (note that x86 version of musl a_store is mov + memory barrier, which basically matches the current wasm version.)

@g0djan
Copy link

g0djan commented Mar 22, 2023

Seems like this PR should solve the same problem that I tried to investigate in #402

@yamt
Copy link
Contributor

yamt commented Mar 22, 2023

after reading clarifications by @conrad-watt in WebAssembly/threads#197 , i think this patch is fine.
(i feel it's more straightforward to define a_store in wasm atomic_arch.h though.)

@g0djan
Copy link

g0djan commented Mar 22, 2023

I checked this PR solely fixes #402 for me, but I still observe data race warnings between some LOAD/STORE operations.
@wenyongh thank you for the fix!

Copy link
Collaborator

@abrown abrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is consensus that this is the right fix. I'll merge it and upstream it to wasi-sdk next week (along the lines of WebAssembly/wasi-sdk#412) but I had a quick question: should we wrap this in #ifdef __wasilibc_unmodified_upstream? My sense would be "no," since these WebAssembly arch files won't be in upstream MUSL, but I just wanted to check.

Thanks, @wenyongh!

@wenyongh
Copy link
Contributor Author

Looks like there is consensus that this is the right fix. I'll merge it and upstream it to wasi-sdk next week (along the lines of WebAssembly/wasi-sdk#412) but I had a quick question: should we wrap this in #ifdef __wasilibc_unmodified_upstream? My sense would be "no," since these WebAssembly arch files won't be in upstream MUSL, but I just wanted to check.

Thanks, @wenyongh!

Welcome and thanks, @abrown!

From the README.md under libc-top-half, it mentions that Changes to upstream code are wrapped in preprocessor directives controlled by the macro __wasilibc_unmodified_upstream, my understanding is that the macro is only used when modifying the original musl source code, since the folder arch/wasm32 was newly added, it isn't the original source code of musl, we shouldn't need to add the macro wrapper:
https://github.com/bminor/musl/tree/master/arch

And instead, if it is needed, we should also add the wrapping for other macro definitions in arch/wasm/atomic_arch.h, like a_cas, a_clz_32 and so on.

@abrown abrown merged commit 1dfe5c3 into WebAssembly:main Mar 24, 2023
abrown added a commit to abrown/wasi-sdk that referenced this pull request Mar 27, 2023
WebAssembly/wasi-libc#403 fixed an issue with `a_barrier` that should be
included in the next release of wasi-sdk. This change updates wasi-libc
to the latest `HEAD` of `main` to include it.
abrown added a commit to WebAssembly/wasi-sdk that referenced this pull request Mar 28, 2023
WebAssembly/wasi-libc#403 fixed an issue with `a_barrier` that should be
included in the next release of wasi-sdk. This change updates wasi-libc
to the latest `HEAD` of `main` to include it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants