-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement sync primitives instead of using pthread ones #2028
Implement sync primitives instead of using pthread ones #2028
Conversation
e47659c
to
d8f1e88
Compare
After this PR I don't see the load/store race conditions anymore. I need to fix the failure on the specification test for atomics (that's breaking in the CI) and then do some more testing. |
6dd4b53
to
8ef3797
Compare
Me neither, no TSAN warnings |
@eloparco Wondering why we replace the BTW, I tried to debug the hang issue and found that the last thread didn't notify others after it entered |
This PR is part of the investigation to understand if the hang problems are coming from the usage of pthread primitives from
I need to check, but it may just be the consequence of the data race since they cause undefined behavior? |
Yes, no need to test both, keeping
I am not sure, I tried to understand the source code the ...
/* Otherwise we need a lock on the barrier object */
while (a_swap(&b->_b_lock, 1)) => wait and try to get the lock
__wait(&b->_b_lock, &b->_b_waiters, 1, 1);
inst = b->_b_inst; => lock is gotten, the below operations is locked
/* First thread to enter the barrier becomes the "instance owner" */
if (!inst) { => the first thread enters into this branch to setup the instance,
struct instance new_inst = { 0 }; seems always OK
int spins = 200;
b->_b_inst = inst = &new_inst;
a_store(&b->_b_lock, 0);
if (b->_b_waiters) __wake(&b->_b_lock, 1, 1);
while (spins-- && !inst->finished)
a_spin();
a_inc(&inst->finished);
while (inst->finished == 1)
#ifdef __wasilibc_unmodified_upstream
__syscall(SYS_futex,&inst->finished,FUTEX_WAIT|FUTEX_PRIVATE,1,0) != -ENOSYS
|| __syscall(SYS_futex,&inst->finished,FUTEX_WAIT,1,0);
#else
__futexwait(&inst->finished, 1, 0);
#endif
return PTHREAD_BARRIER_SERIAL_THREAD;
}
/* Last thread to enter the barrier wakes all non-instance-owners */
if (++inst->count == limit) { => the last(fourth) thread enters into this branch,
b->_b_inst = 0; Fails when hang: thread didn't enter into this branch
a_store(&b->_b_lock, 0);
if (b->_b_waiters) __wake(&b->_b_lock, 1, 1);
a_store(&inst->last, 1);
if (inst->waiters)
__wake(&inst->last, -1, 1);
} else { => the second and third thread enters into this branch,
a_store(&b->_b_lock, 0); seems always OK
if (b->_b_waiters) __wake(&b->_b_lock, 1, 1);
__wait(&inst->last, &inst->waiters, 0, 1);
} Not easy to know the behavior of last thread, since when I added more printf, it doesn't hang again.. |
We can check the mutex lock/unlock code, since that one gives TSAN warnings as well and it should be easier to debug. |
Didn't manage to build wasmtime with thread sanitizer but managed to run tests that hang for wamr on So it might mean that problem is not with |
Yes, I also found that it doesn't hang when running with WAMR AOT mode, so it should doesn't matter with The root cause might be the CPU runtime memory ordering: For example, the data written by thread A is in cache, but not actually written into memory yet, and the data read by thread B is from memory and may be invalid. We may need to add some memory barrier operations in interpreter like what AOT does: I tried to added some, and it didn't hang so often, I uploaded a rough patch, maybe you can have a further investigation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
#if USE_PTHREAD_SYNC_PRIMITIVES != 0 | ||
#include <pthread.h> | ||
#else | ||
#include "sync_primitives.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to use this primitives to easier find the other errors, but do we want to use it as default one though?
Pros:
- stable CI, if people break something they will be sure it's because of their changes
Cons:
- now
pthreads
spot the problems that need to be fixed anyway but we would hide it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, had better use the pthread_barrier_wait
after the hang issue is resolved. But note that there will be some normal load/store data races in wasi-libc implementation, e.g. a_swap:
https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/src/internal/atomic.h#L106-L115
int old;
do old = *p;
while (a_cas(p, old, v) != old);
return old;
Here old = *p
is not atomic, and a_cas(p, old, v) is atomic, if two threads operate on the same address p
, data races may be reported by sanitizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, had better use the pthread_barrier_wait after the hang issue is resolved.
So should we merge this one as it is then? Until we get it fixed in wasi-libc.
Here old = *p is not atomic, and a_cas(p, old, v) is atomic, if two threads operate on the same address p, data races may be reported by sanitizer.
Nice catch, that's probably what the thread sanitizer is complaining about for the load/store data races. It may be the same for other operations as well, I see the a_fetch_add
a few lines later doing something similar.
How much effort would be to fix wasi-libc? I personally would prefer to do it there rather than re-implementing those primitives; this is extra maintenance effort + WASI community won't benefit from your changes here as they're WAMR internal. |
Just to remove a line in |
The question can be raised in general: which things should be part of wasi and which should be part of wamr. Often this results in a chicken & egg problem: no one is implementing it as long as it is not standardized, but on the other hand no one pushes standardization if no one wants it to have/implement it. So I would propose to go forward and push it/make it visible at the community (with the risk of doing some waste). |
I don't think the question is which bits should be a part of wasi-libc and which should not be - those primitives are already part of wasi-libc, but they don't work (at least according to this PR). From what I understand, we re-implement those sync primitives to not use the (probably) broken ones from wasi-libc, but what I'm suggesting is that we should rather fix wasi-libc instead. |
@eloparco The PR for wasi-libc to fix the hang issue was merged, I think we may update CI to use that commit for both Ubuntu-20.04 and Ubuntu-22.04, and here use |
0392599
to
dae6923
Compare
…c commit that fixes atomic operation
dae6923
to
6b2999f
Compare
git fetch https://github.com/WebAssembly/wasi-libc \ | ||
8f5275796a82f8ecfd0833a4f3f444fa37ed4546 | ||
1dfe5c302d1c5ab621f7abf04620fae92700fd22 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had better also compile wasi-libc in Ubuntu-22.04 instead using the pre-release wasi-sdk-thread version since there is issue in it:
Merge L337 to L346, and remove L366 and
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of that, but it would mean that now we remove the usage wasi-sdk 20 pre-release from the ci, basically reverting all the changes previously added in #2021. When wasi-sdk 20 final release comes out we will have to add those changes again.
I was leaving the wasi-sdk 20 part as it is since in practice we didn't encounter problems in the main branch, the hanging issues, only reproducible by running each test many times, always passed unnoticed in the CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, so let's keep it.
git fetch https://github.com/WebAssembly/wasi-libc \ | ||
8f5275796a82f8ecfd0833a4f3f444fa37ed4546 | ||
1dfe5c302d1c5ab621f7abf04620fae92700fd22 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
…ytecodealliance#2028) Update wasi-libc version to resolve the hang issue when running wasi-threads cases. Implement custom sync primitives as a counterpart of `pthread_barrier_wait` to attempt to replace pthread sync primitives since they seem to cause data races when running with the thread sanitizer.
Attempt to replace pthread sync primitives since they seem to cause data races when running with the thread sanitizer. Those data races appear as concurrent accesses to load/store operations when running in classic interpreter mode.
In particular, this PR implements
mutex
andbarrier
to replacepthread_mutex_t
andpthread_barrier_t
in the tests with threads.