opal_fifo check hanging on aarch64 / PowerPC Big Endian #4563
Comments
@hppritcha I cannot observe the hang in my environment. Is there more information to reproduce it? My environment:
|
@hppritcha I can observe the hang in Jenkins. |
@kawashima-fj how many cpus does your aarch64 system have? |
Lock implementation can depend on compiler and uarch. |
Does it use gcc's AMO implementation or OMPI's? |
Hmm... If I take the patch from #4566 and configure with --disable-builtin-atomics, opal_fifo doesn't hang on my aarch64 system. gcc 4.8.5, cpuinfo:
I'm running opal_fifo in a loop, letting it go for hundreds of iterations. Using default configure options, the test typically hangs after a few iterations. |
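For anyone trying to reproduce this, the loop described above can be sketched as a small POSIX harness that reruns a test binary and treats a run that outlives a timeout as a hang. This is only an illustration: `run_with_timeout` and `first_bad_iteration` are hypothetical helper names, and the timeout and iteration count are assumptions, not values used in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Run argv[0] once. Returns the child's exit status, -1 on error or
 * abnormal termination, or -2 if the child was still running after
 * timeout_sec seconds and had to be killed (a suspected hang). */
static int run_with_timeout(char *const argv[], int timeout_sec)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    time_t deadline = time(NULL) + timeout_sec;
    for (;;) {
        int status;
        pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (done < 0)
            return -1;
        if (time(NULL) >= deadline) {
            kill(pid, SIGKILL);       /* give up on the hung child */
            waitpid(pid, &status, 0); /* reap it */
            return -2;
        }
        struct timespec pause = { 0, 10 * 1000 * 1000 }; /* 10 ms */
        nanosleep(&pause, NULL);
    }
}

/* Repeat the command up to max_iters times. Returns the 1-based
 * iteration that failed or timed out, or 0 if every run passed. */
static int first_bad_iteration(char *const argv[], int max_iters,
                               int timeout_sec)
{
    for (int i = 1; i <= max_iters; i++) {
        int rc = run_with_timeout(argv, timeout_sec);
        if (rc != 0) {
            fprintf(stderr, "iteration %d: %s\n", i,
                    rc == -2 ? "timed out (possible hang)" : "failed");
            return i;
        }
    }
    return 0;
}
```

Calling `first_bad_iteration` with an argv that points at the built opal_fifo test (the path depends on your build tree) and a few hundred iterations would report the first iteration that hangs, if any.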
@shamisp When using gcc builtins we end up using a bad compare-exchange 128 implementation (lock-based). I think we need to improve the configury and make ompi use the ompi atomic implementations for aarch64. |
What version of GCC? |
All recent versions (5.x, 6.x, 7.x) advertise 128-bit compare-exchange support. It gets past our current configure check. |
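One way to see whether a given compiler and target treat 16-byte atomics as lock-free, rather than falling back to a lock-based path, is to query the GCC/Clang builtin `__atomic_always_lock_free`. The sketch below is not Open MPI's configure check; `always_lock_free` is a hypothetical helper, and the result depends on the compiler, target, and flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the compiler guarantees that atomic objects of the
 * given size are always lock-free on this target. The builtin needs a
 * compile-time constant size, hence the switch. Sizes other than
 * 4, 8, and 16 bytes are reported as not lock-free by this sketch. */
static bool always_lock_free(size_t size)
{
    switch (size) {
    case 4:
        return __atomic_always_lock_free(4, 0);
    case 8:
        return __atomic_always_lock_free(8, 0);
    case 16:
        /* 128-bit case: false means the compiler may route the
         * operation through a library or lock-based fallback. */
        return __atomic_always_lock_free(16, 0);
    default:
        return false;
    }
}
```

If `always_lock_free(16)` is false while the `__atomic` 128-bit compare-exchange still compiles and links, that would be consistent with the situation described above, where the builtin is advertised but is not a true lock-free operation.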
Easy enough to hard-code the builtins off until we have a better way to deal with this. |
This problem exists with other LL/SC architectures. Let me finish my debug mode fix for the LL/SC implementation of the lifo and fifo and I will make sure we disable the builtins by default on those architectures. |
I don't see this 128-bit compare-exchange thing. Here's what's in my opal/include/opal_config.h:
when building on my cortexa57 box using either gcc 4.8.5 or gcc 7.3.0 |
Huh, odd. Ok. That is different than on power. I assumed it would be the same here. How about OPAL_HAVE_SYNC_BUILTIN_CSWAP_INT128? |
In opal/include/opal_config.h I see
|
Anyway, the Jenkins scripts allow setting an optional configure option, so I've defined that to add in
as that appears to be a reliable way to get opal_fifo to pass on this system. |
AFAIK gcc 7 and the latest clang are doing a pretty good job on AMOs, including support for Arm v8.1 atomics. Do we know exactly which operation is broken? |
As documented in open-mpi#4563 and open-mpi#3697, there is an issue on ARM and POWER platforms when the atomic fifo assembly isn't inlined, which manifests as a hang. Document the issue and the work-around until a proper fix is committed. Signed-off-by: Brian Barrett <[email protected]>
As documented in open-mpi#4563 and open-mpi#3697, there is an issue on ARM and POWER platforms when the atomic fifo assembly isn't inlined, which manifests as a hang. Document the issue and the work-around until a proper fix is committed. Signed-off-by: Brian Barrett <[email protected]> (cherry picked from commit 4658422)
Discussion in the room about Power: when we re-enable Power BE because this no longer happens, we should add a NEWS item saying that, despite fixing the error message, we still don't actually support Power BE. We should also remove the block now that we know it wasn't a silent data corruption problem. |
@hjelmn Have you had a chance to finish this yet, perchance? |
I note that there's a README bullet that will need to be updated once this issue is fixed:
|
Also note that ARM and POWER users may experience hangs (until open-mpi#4563 is fixed). Signed-off-by: Jeff Squyres <[email protected]>
Resolved several years ago. Closing. |
At least on master at b160cf6 opal_fifo appears to be regularly hanging on aarch64. I'm using gcc 4.8.5. This test had been passing regularly with jenkins CI PR testing until sometime in the last several days/week.
The test does not appear to hang when Open MPI is configured with --enable-debug. Well, actually it does sometimes. A bulletproof way to avoid the problem is to configure with
--disable-builtin-atomics.