
Address BYSOCKET:OVERSUBSCRIBE failures on ride/white #2398

Closed
bartlettroscoe opened this issue Mar 16, 2018 · 14 comments
Labels
ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams client: ATDM Any issue primarily impacting the ATDM project

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 16, 2018

CC: @fryeguy52, @nmhamster

Next Action Status

Changed mpiexec options from "--map-by socket:PE=8 --oversubscribe" to "--map-by socket:PE=4". All of these errors went away on 'ride' builds on 3/17/2018

Description:

There are several tests on 'ride' and 'white' that are failing in our automated ATDM builds being posted to:

like the test TeuchosComm_Comm_test_MPI_4 run on ride in the build Trilinos-atdm-white-ride-gnu-opt-openmp shown at:

which shows the failure:

--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that has less cpus than
requested ones:

  #cpus-per-proc:  8
  number of cpus:  7
  map-by:          BYSOCKET:OVERSUBSCRIBE

Please specify a mapping level that has more cpus, or else let us
define a default mapping that will allow multiple cpus-per-proc.
--------------------------------------------------------------------------

I looked at most of the failing Teuchos tests on 'ride' as shown at:

and many of them (but not all) show this same error message.

The mpiexec command on this system is run using:

<full-path>/mpiexec "-np" "4" "-map-by" "socket:PE=8" "--oversubscribe" "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-gnu-opt-openmp/SRC_AND_BUILD/BUILD/packages/teuchos/comm/test/Comm/TeuchosComm_Comm_test.exe"

This set of mpiexec options -map-by;socket:PE=8;--oversubscribe was taken from the EMPIRE configuration for Trilinos and it seems to work with the Panzer test suite as shown by that same build yesterday:

Is this set of options the cause of this problem, and how do we fix it?
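
For reference, here is a minimal side-by-side sketch of the two option sets involved (the executable name is just a placeholder; the PE=4 mapping is the change tried later in this issue):

    # Flags inherited from the EMPIRE configuration, which hit the
    # BYSOCKET:OVERSUBSCRIBE error shown above on 'ride'/'white':
    mpiexec -np 4 --map-by socket:PE=8 --oversubscribe ./SomeTest.exe

    # Alternative mapping tried later in this issue: 4 cores per rank and no
    # oversubscription, which fits within a single 8-core POWER8 socket:
    mpiexec -np 4 --map-by socket:PE=4 ./SomeTest.exe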

Steps to Reproduce

Following the instructions at:

I tried to reproduce these failures on 'ride' using:

$ bsub -x -I -q rhel7F ./checkin-test-atdm.sh gnu-opt-openmp --enable-packages=Teuchos --local-do-all

and I got:

FAILED (NOT READY TO PUSH): Trilinos: ride12

Fri Mar 16 09:38:43 MDT 2018

Enabled Packages: Teuchos

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) gnu-opt-openmp => FAILED: passed=130,notpassed=1 => Not ready to push! (7.11 min)

The only failing test was:

  99% tests passed, 1 tests failed out of 131
  
  Subproject Time Summary:
  Teuchos    = 434.10 sec*proc (131 tests)
  
  Total Test time (real) =  31.92 sec
  
  The following tests FAILED:
         49 - TeuchosCore_TypeConversions_UnitTest_MPI_1 (Failed)
  Errors while running CTest

(which I will create another GitHub issue for).

Therefore, I can't seem to reproduce this error locally, so we may have to guess at how to fix this and let the nightly builds run.

@bartlettroscoe bartlettroscoe added the client: ATDM Any issue primarily impacting the ATDM project label Mar 16, 2018
@nmhamster
Contributor

nmhamster commented Mar 16, 2018

@bartlettroscoe - the rhel7F nodes only have 16 cores. Might we just be able to set PE=8 to PE=4 and then remove the oversubscribe?

@bartlettroscoe
Member Author

the rhel7F nodes only have 16 cores. Might we just be able to set PE=8 to PE=4 and then remove the oversubscribe?

Okay, I will try that.

But I am confused because numactl --hardware seems to report that the rhel7F nodes have a total of 128 cores. How can that be?

@nmhamster
Contributor

@bartlettroscoe - you may also want to add -n 16 to the bsub command; this creates a useful mask as the scheduler loads.

@nmhamster
Contributor

@bartlettroscoe - the processor layout is dual socket; each socket has 8 cores and each core has up to 8 hardware threads, so 2 * 8 * 8 = 128 hardware threads. The POWER8 can usually handle two threads per core without any trouble.
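
If it helps, one way to confirm that layout from a login shell (a sketch on my side, not output from these nodes) is:

    # Sockets, cores per socket, and SMT threads per core:
    lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
    # NUMA layout and the 128 hardware-thread IDs that get reported as "cpus":
    numactl --hardware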

bartlettroscoe added a commit that referenced this issue Mar 16, 2018
…1, TRIL-198, #2398)

This was the suggestion by Si Hammond (see #2398)
bartlettroscoe added a commit that referenced this issue Mar 16, 2018
With the new set of flags --map-by socket:PE=4 (no --oversubscribe), it is
critical that you pass in -n 16 to run on the rhel7F nodes.
@bartlettroscoe
Member Author

Okay, when I changed the mpiexec command to run with:

    <full-path>/mpiexec -n <NP> -map-by socket:PE=4 <exec-name>

and I run with:

$ set | grep ^OMP_
OMP_NUM_THREADS=2
$ bsub -x -I -n 16 -q rhel7F ctest -j16

I get:

99% tests passed, 1 tests failed out of 131

Subproject Time Summary:
Teuchos    = 255.73 sec*proc (131 tests)

Total Test time (real) =  19.74 sec

The following tests FAILED:
         49 - TeuchosCore_TypeConversions_UnitTest_MPI_1 (Failed)
Errors while running CTest

(the failing test TeuchosCore_TypeConversions_UnitTest_MPI_1 is a different issue).

With OMP_NUM_THREADS=2, what is a good value of <N> to give to ctest -j <N> with this setup?
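
For rough context, a back-of-the-envelope count based on the layout described above (my own arithmetic, not an answer from the thread):

    # Assumed layout from the comments above: 2 sockets x 8 cores x 8 threads.
    PE=4; NP=4; OMP_NUM_THREADS=2
    echo "cores reserved by one ${NP}-rank test:  $((NP * PE))"               # 16, i.e. all cores
    echo "OpenMP threads run by that test:       $((NP * OMP_NUM_THREADS))"   # 8
    echo "hardware threads on the node:          $((2 * 8 * 8))"              # 128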

@bartlettroscoe
Member Author

Okay, I just pushed the commit 114ca53. We will see if these errors clear up tomorrow on 'ride'.

@nmhamster
Contributor

@bartlettroscoe - if you are requesting 4 cores per MPI rank (PE=4) then you can go up to OMP_NUM_THREADS=32. The benchmarking we have done shows different levels of benefit when using the hyper-threads: in general 2 threads per core is good, 4 is useful for some kernels, and 8 is useful when the codes become heavily memory-operation bound. My take would be to use OMP_NUM_THREADS=16 (i.e. 4 threads for each of the 4 cores); if you use OMP_PROC_BIND=spread and OMP_PLACES=threads then you should get this behavior.
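
A sketch of those suggested settings spelled out as environment exports (the test executable is a placeholder):

    export OMP_NUM_THREADS=16     # 4 threads on each of the 4 cores given to a rank
    export OMP_PROC_BIND=spread   # spread the threads over the place list
    export OMP_PLACES=threads     # one place per hardware thread
    mpiexec -np 4 --map-by socket:PE=4 ./SomeTest.exe   # placeholder executable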

@bartlettroscoe
Member Author

@nmhamster, setting OMP_PROC_BIND=spread and OMP_PLACES=threads causes several KokkosCore tests to timeout:

$ bsub -x -I -n 16 -q rhel7F ctest -j16
...
21/26 Test #18: KokkosCore_UnitTest_DefaultInit_12_MPI_1 .........   Passed    1.42 sec
22/26 Test #26: KokkosAlgorithms_UnitTest_MPI_1 ..................***Timeout 300.12 sec
23/26 Test  #1: KokkosCore_UnitTest_Serial_MPI_1 .................***Timeout 300.26 sec
24/26 Test #24: KokkosContainers_UnitTest_Serial_MPI_1 ...........***Timeout 300.39 sec
25/26 Test #25: KokkosContainers_UnitTest_OpenMP_MPI_1 ...........***Timeout 300.49 sec
26/26 Test  #2: KokkosCore_UnitTest_OpenMP_MPI_1 .................***Timeout 300.58 sec

81% tests passed, 5 tests failed out of 26

Subproject Time Summary:
Kokkos    = 1553.23 sec*proc (26 tests)

Total Test time (real) = 300.71 sec

The following tests FAILED:
          1 - KokkosCore_UnitTest_Serial_MPI_1 (Timeout)
          2 - KokkosCore_UnitTest_OpenMP_MPI_1 (Timeout)
         24 - KokkosContainers_UnitTest_Serial_MPI_1 (Timeout)
         25 - KokkosContainers_UnitTest_OpenMP_MPI_1 (Timeout)
         26 - KokkosAlgorithms_UnitTest_MPI_1 (Timeout)
Errors while running CTest

Leaving these unset gives:

$ bsub -x -I -n 16 -q rhel7F ctest -j16
...
21/26 Test #20: KokkosCore_UnitTest_DefaultInit_14_MPI_1 .........   Passed    1.40 sec
22/26 Test  #2: KokkosCore_UnitTest_OpenMP_MPI_1 .................   Passed   53.67 sec
23/26 Test  #1: KokkosCore_UnitTest_Serial_MPI_1 .................   Passed   62.98 sec
24/26 Test #26: KokkosAlgorithms_UnitTest_MPI_1 ..................   Passed   89.67 sec
25/26 Test #25: KokkosContainers_UnitTest_OpenMP_MPI_1 ...........   Passed  102.24 sec
26/26 Test #24: KokkosContainers_UnitTest_Serial_MPI_1 ...........   Passed  185.06 sec

100% tests passed, 0 tests failed out of 26

Subproject Time Summary:
Kokkos    = 524.82 sec*proc (26 tests)

Total Test time (real) = 185.10 sec

Therefore, I will not commit this change just yet.

Let's see what happens with automated testing tomorrow morning and then we can play with this more locally later.

@ndellingwood
Contributor

@bartlettroscoe I think that OMP_PROC_BIND=spread and OMP_PLACES=threads should not be set for Kokkos unit tests, adding @dsunder to confirm.

@bartlettroscoe
Member Author

I think that OMP_PROC_BIND=spread and OMP_PLACES=threads should not be set for Kokkos unit tests, adding @dsunder to confirm.

I can provide exact instructions to reproduce this if anyone is interested in playing with it.

kyungjoo-kim pushed a commit to kyungjoo-kim/Trilinos that referenced this issue Mar 16, 2018
kyungjoo-kim pushed a commit to kyungjoo-kim/Trilinos that referenced this issue Mar 16, 2018
…#2398)

With the new set of flags --map-by socket:PE=4 (no --oversubscribe), it is
critical that you pass in -n 16 to run on the rhel7F nodes.
@bartlettroscoe
Member Author

Looks like that change to the mpiexec options in commit 114ca53 fixed all of the BYSOCKET:OVERSUBSCRIBE errors we saw with the Teuchos test suite, especially for the build Trilinos-atdm-white-ride-gnu-opt-openmp on 'ride', as can be seen in:

and

Now the only failing Teuchos tests on ride/white are TeuchosCore_TypeConversions_UnitTest_MPI_1 and TeuchosNumerics_LAPACK_test_MPI_1 which show different failures.

Looking over the other failing tests for that build Trilinos-atdm-white-ride-gnu-opt-openmp on ride this morning (which only completed through Teko tests before crashing) at:

there are 8 test failures for Belos and 50 test failures for Anasazi, but those are all segfaults (which I think is the long-known problem reported in #1208 and #1191 that has never been resolved).

Therefore, I think this issue is resolved so I will close it.

Thanks @nmhamster!

@mhoemmen
Contributor

mhoemmen commented Mar 19, 2018

For a Kokkos PR that may make oversubscription less bad, see kokkos/kokkos#1475.

@crtrott
Member

crtrott commented Mar 19, 2018

The problem with your ctest -j 16 while setting OMP_PROC_BIND etc. is that you bind 16 threads to the same core, since every one of the 16 tests you try to run simultaneously is using the same process mask. That means OpenMP thread 0 of each of those tests is binding to hardware thread 0 on the node, and OpenMP thread k is binding to hardware thread k.

If ctest tries to run 16 tests at the same time, it will bind 16 threads to each hardware thread up to OMP_NUM_THREADS. If the latter is two, you get 16 OpenMP threads bound to hardware thread 0 and 16 OpenMP threads bound to hardware thread 1, while all the other hardware threads are idle.
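
To illustrate the pile-up (a contrived sketch, not from the thread), two tests started concurrently from the same full-node mask resolve identical place lists:

    export OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=2
    export OMP_DISPLAY_ENV=true   # have each run print the OpenMP ICVs it resolved
    ./TestA.exe &   # resolves a fixed place list from the full-node mask
    ./TestB.exe &   # resolves the identical place list, so its threads stack on the same hardware threads
    wait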

@bartlettroscoe
Member Author

@crtrott, when not setting OMP_PROC_BIND at all, it does not seem to have this problem. I did a simple scaling study on 'shiller' that did not show this problem anyway. Is not setting OMP_PROC_BIND letting mpiexec spread the processes out across the cores better?

I am about to post a new Trilinos GitHub issue to try to address problems with threaded testing of Trilinos so we can discuss how to improve things.

@bartlettroscoe bartlettroscoe added the ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams label Nov 30, 2018