Address failures on ride/white about BYSOCKET:OVERSUBSCRIBE failures #2398
Comments
@bartlettroscoe - the rhel7F nodes only have 16 cores. Might we just be able to set the |
Okay, I will try that. But I am confused because |
@bartlettroscoe - you may also want to add |
@bartlettroscoe - the processor layout is dual socket, each socket has 8 cores, each core has up to 8 threads, so 2 * 8 * 8 = 128 (hardware threads). The POWER8 can usually easily handle two threads per core without any trouble at all. |
With the new set of flags --map-by socket:PE=4 (no --oversubscribe), it is critical that you pass in -n 16 to run on the rhel7F nodes.
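The arithmetic behind the layout above and the switch from PE=8 to PE=4 can be sketched as follows. This is only an illustration: the function names are made up here, and it assumes that `PE=N` reserves N whole cores per MPI rank and that a rank does not span sockets, which is my reading of the thread rather than something confirmed in it.

```python
# Node layout as described in this thread for the rhel7F POWER8 nodes.
SOCKETS = 2
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 8


def hardware_threads(sockets, cores_per_socket, threads_per_core):
    """Total hardware threads on one node."""
    return sockets * cores_per_socket * threads_per_core


def max_ranks_without_oversubscribe(sockets, cores_per_socket, cores_per_rank):
    """Ranks that fit on one node if every rank gets `cores_per_rank` cores.

    Assumption (hypothetical model): a rank stays within a single socket,
    so leftover cores on a socket cannot be combined across sockets.
    """
    return sockets * (cores_per_socket // cores_per_rank)


print(hardware_threads(SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE))  # 128
print(max_ranks_without_oversubscribe(SOCKETS, CORES_PER_SOCKET, 8))  # 2
print(max_ranks_without_oversubscribe(SOCKETS, CORES_PER_SOCKET, 4))  # 4
```

Under these assumptions, a 4-rank test such as TeuchosComm_Comm_test_MPI_4 does not fit on a 16-core node at 8 cores per rank unless oversubscription is allowed, but it does fit at 4 cores per rank.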
Okay, when I changed the mpiexec command to run with:
and I run with:
I get:
(the failing test With |
Okay, I just pushed the commit 114ca53. We will see if these errors clear up tomorrow on 'ride'. |
@bartlettroscoe - if you are requesting 4 cores per MPI rank ( |
@nmhamster, setting
Leaving these unset gives:
Therefore, I will not commit this change just yet. Let's see what happens with automated testing tomorrow morning and then we can play with this more locally later. |
@bartlettroscoe I think that |
I can provide exact instructions to reproduce if anyone is interested to play with this. |
…1, TRIL-198, trilinos#2398) This was the suggestion by Si Hammond (see trilinos#2398)
…#2398) With the new set of flags --map-by socket:PE=4 (no --oversubscribe), it is critical that you pass in -n 16 to run on the rhel7F nodes.
Looks like that change to the mpiexec options in commit 114ca53 fixed all of the and Now the only failing Teuchos tests on ride/white are Looking over the other failing tests for that build, there are 8 test failures for Belos and 50 test failures for Anasazi, but those are all segfaults (which I think is the long-known problem reported in #1208 and #1191, which has never been resolved). Therefore, I think this issue is resolved, so I will close it. Thanks @nmhamster! |
For a Kokkos PR that may make oversubscription less bad, see kokkos/kokkos#1475 . |
The problem with your ctest -j 16 while setting the OMP_PROC_BIND etc. is that you bind 16 threads to the same core, since every one of the 16 tests you try to run simultaneously is using the same process mask. That means OpenMP thread 0 of each of those tests is binding to hardware thread 0 on the node, and OpenMP thread k is binding to hardware thread k. If ctest tries to run 16 tests at the same time, it will bind 16 threads to each hardware thread up to OMP_NUM_THREADS. If the latter is two, you are getting 16 OpenMP threads bound to hardware thread 0 and 16 OpenMP threads bound to hardware thread 1, while all the other hardware threads are idle. |
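To make the pile-up described above concrete, here is a toy model in Python. It is not how OpenMP or ctest actually place threads; it simply encodes the stated rule that, with a shared process mask, OpenMP thread k of every test lands on hardware thread k. The function name and the 128-thread node size (taken from the layout given earlier in the thread) are illustrative assumptions.

```python
from collections import Counter

NODE_HW_THREADS = 128  # 2 sockets * 8 cores * 8 threads, per the layout above


def bound_threads_per_hw_thread(num_concurrent_tests, omp_num_threads):
    """Count OpenMP threads bound to each hardware thread.

    Toy model: every test sees the same process mask, so OpenMP thread k
    of each test binds to hardware thread k.
    """
    load = Counter()
    for _test in range(num_concurrent_tests):
        for k in range(omp_num_threads):
            load[k] += 1
    return load


load = bound_threads_per_hw_thread(num_concurrent_tests=16, omp_num_threads=2)
print(dict(load))                   # {0: 16, 1: 16}
print(NODE_HW_THREADS - len(load))  # 126 hardware threads left idle
```

In this model, raising the ctest parallelism only stacks more threads onto the same few hardware threads; it never spreads work onto the idle ones.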
@crtrott, when not setting I am about to post a new Trilinos GitHub issue to try to address problems with threaded testing of Trilinos so we can discuss how to improve things. |
CC: @fryeguy52, @nmhamster
Next Action Status
Changed mpiexec options from "--map-by socket:PE=8 --oversubscribe" to "--map-by socket:PE=4". All of these errors went away on 'ride' builds on 3/17/2018
Description:
There are several tests on 'ride' and 'white' that are failing in our automated ATDM builds being posted to:
like the test TeuchosComm_Comm_test_MPI_4 run on ride in the build Trilinos-atdm-white-ride-gnu-opt-openmp shown at:
that shows the failure:
I looked at most of the failing Teuchos tests on 'ride' as shown at:
and many of them (but not all) show this same error message.
The mpiexec command on this system is run using:
This set of mpiexec options
-map-by;socket:PE=8;--oversubscribe
was taken from the EMPIRE configuration for Trilinos and it seems to work with the Panzer test suite as shown by that same build yesterday:
Is this set of options the cause of this problem, and how do we fix this?
Steps to Reproduce
Following the instructions at:
I tried to reproduce these failures on 'ride' using:
and I got:
The only failing test was:
(which I will create another GitHub issue for).
Therefore, I can't seem to reproduce this error locally, so we may have to guess at how to fix this and let the nightly builds run.