
Address BYSOCKET:OVERSUBSCRIBE failures on ride/white #2398

Closed
bartlettroscoe opened this issue Mar 16, 2018 · 14 comments
Labels
ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams client: ATDM Any issue primarily impacting the ATDM project

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 16, 2018

CC: @fryeguy52, @nmhamster

Next Action Status

Changed mpiexec options from "--map-by socket:PE=8 --oversubscribe" to "--map-by socket:PE=4". All of these errors went away on 'ride' builds on 3/17/2018

Description:

There are several tests on 'ride' and 'white' that are failing in our automated ATDM builds being posted to:

like the test TeuchosComm_Comm_test_MPI_4 run on ride in the build Trilinos-atdm-white-ride-gnu-opt-openmp shown at:

which shows the failure:

--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that has less cpus than
requested ones:

  #cpus-per-proc:  8
  number of cpus:  7
  map-by:          BYSOCKET:OVERSUBSCRIBE

Please specify a mapping level that has more cpus, or else let us
define a default mapping that will allow multiple cpus-per-proc.
--------------------------------------------------------------------------

I looked at most of the failing Teuchos tests on 'ride' as shown at:

and many of them (but not all) show this same error message.

The mpiexec command on this system is run using:

<full-path>/mpiexec "-np" "4" "-map-by" "socket:PE=8" "--oversubscribe" "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-gnu-opt-openmp/SRC_AND_BUILD/BUILD/packages/teuchos/comm/test/Comm/TeuchosComm_Comm_test.exe"

This set of mpiexec options -map-by;socket:PE=8;--oversubscribe was taken from the EMPIRE configuration for Trilinos and it seems to work with the Panzer test suite as shown by that same build yesterday:

Is this set of options the cause of this problem, and how do we fix it?
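
For reference, here is a minimal side-by-side sketch of the two option sets involved (the executable name is just a placeholder; the PE=4 mapping is the change tried later in this issue):

    # Flags inherited from the EMPIRE configuration, which hit the
    # BYSOCKET:OVERSUBSCRIBE error shown above on 'ride'/'white':
    mpiexec -np 4 --map-by socket:PE=8 --oversubscribe ./SomeTest.exe

    # Alternative mapping tried later in this issue: 4 cores per rank and no
    # oversubscription, which fits within a single 8-core POWER8 socket:
    mpiexec -np 4 --map-by socket:PE=4 ./SomeTest.exe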

Steps to Reproduce

Following the instructions at:

I tried to reproduce these failures on 'ride' using:

$ bsub -x -I -q rhel7F ./checkin-test-atdm.sh gnu-opt-openmp --enable-packages=Teuchos --local-do-all

and I got:

FAILED (NOT READY TO PUSH): Trilinos: ride12

Fri Mar 16 09:38:43 MDT 2018

Enabled Packages: Teuchos

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) gnu-opt-openmp => FAILED: passed=130,notpassed=1 => Not ready to push! (7.11 min)

The only failing test was:

  99% tests passed, 1 tests failed out of 131
  
  Subproject Time Summary:
  Teuchos    = 434.10 sec*proc (131 tests)
  
  Total Test time (real) =  31.92 sec
  
  The following tests FAILED:
         49 - TeuchosCore_TypeConversions_UnitTest_MPI_1 (Failed)
  Errors while running CTest

(which I will create another GitHub issue for).

Therefore, I can't seem to reproduce this error locally, so we may have to guess at how to fix this and let the nightly builds run.

@bartlettroscoe bartlettroscoe added the client: ATDM Any issue primarily impacting the ATDM project label Mar 16, 2018
@nmhamster
Contributor

nmhamster commented Mar 16, 2018

@bartlettroscoe - the rhel7F nodes only have 16 cores. Might we just be able to set PE=8 to PE=4 and then remove the oversubscribe?

@bartlettroscoe
Member Author

the rhel7F nodes only have 16 cores. Might we just be able to set PE=8 to PE=4 and then remove the oversubscribe?

Okay, I will try that.

But I am confused because numactl --hardware seems to report that the rhel7F nodes have a total of 128 cores. How can that be?

@nmhamster
Contributor

@bartlettroscoe - you may also want to add -n 16 to the bsub command; this creates a useful mask as the scheduler loads.

@nmhamster
Contributor

@bartlettroscoe - the processor layout is dual socket; each socket has 8 cores and each core has up to 8 hardware threads, so 2 * 8 * 8 = 128 hardware threads. The POWER8 can usually handle two threads per core without any trouble.
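
If it helps, one way to confirm that layout from a login shell (a sketch on my side, not output from these nodes) is:

    # Sockets, cores per socket, and SMT threads per core:
    lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
    # NUMA layout and the 128 hardware-thread IDs that get reported as "cpus":
    numactl --hardware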

bartlettroscoe added a commit that referenced this issue Mar 16, 2018
…1, TRIL-198, #2398)

This was the suggestion by Si Hammond (see #2398)
bartlettroscoe added a commit that referenced this issue Mar 16, 2018
With the new set of flags --map-by socket:PE=4 (no --oversubscribe), it is
critical that you pass in -n 16 to run on the rhel7F nodes.
@bartlettroscoe
Member Author

Okay, when I changed the mpiexec command to run with:

    <full-path>/mpiexec -n <NP> -map-by socket:PE=4 <exec-name>

and I run with:

$ set | grep ^OMP_
OMP_NUM_THREADS=2
$ bsub -x -I -n 16 -q rhel7F ctest -j16

I get:

99% tests passed, 1 tests failed out of 131

Subproject Time Summary:
Teuchos    = 255.73 sec*proc (131 tests)

Total Test time (real) =  19.74 sec

The following tests FAILED:
         49 - TeuchosCore_TypeConversions_UnitTest_MPI_1 (Failed)
Errors while running CTest

(the failing test TeuchosCore_TypeConversions_UnitTest_MPI_1 is a different issue).

With OMP_NUM_THREADS=2, what is a good value of <N> to give to ctest -j <N> with this setup?
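
For rough context, a back-of-the-envelope count based on the layout described above (my own arithmetic, not an answer from the thread):

    # Assumed layout from the comments above: 2 sockets x 8 cores x 8 threads.
    PE=4; NP=4; OMP_NUM_THREADS=2
    echo "cores reserved by one ${NP}-rank test:  $((NP * PE))"               # 16, i.e. all cores
    echo "OpenMP threads run by that test:       $((NP * OMP_NUM_THREADS))"   # 8
    echo "hardware threads on the node:          $((2 * 8 * 8))"              # 128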

@bartlettroscoe
Member Author

Okay, I just pushed the commit 114ca53. We will see if these errors clear up tomorrow on 'ride'.

@nmhamster
Contributor

@bartlettroscoe - if you are requesting 4 cores per MPI rank (PE=4) then you can go up to OMP_NUM_THREADS=32. The benchmarking we have done shows different levels of benefit when using the hyper-threads: in general 2 threads per core is good, 4 is useful for some kernels, and 8 is useful when the codes become heavily memory-operation bound. My take would be to use OMP_NUM_THREADS=16 (i.e. 4 threads for each of the 4 cores); if you use OMP_PROC_BIND=spread and OMP_PLACES=threads then you should get this behavior.
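
A sketch of those suggested settings spelled out as environment exports (the test executable is a placeholder):

    export OMP_NUM_THREADS=16     # 4 threads on each of the 4 cores given to a rank
    export OMP_PROC_BIND=spread   # spread the threads over the place list
    export OMP_PLACES=threads     # one place per hardware thread
    mpiexec -np 4 --map-by socket:PE=4 ./SomeTest.exe   # placeholder executable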

@bartlettroscoe
Member Author

@nmhamster, setting OMP_PROC_BIND=spread and OMP_PLACES=threads causes several KokkosCore tests to timeout:

$ bsub -x -I -n 16 -q rhel7F ctest -j16
...
21/26 Test #18: KokkosCore_UnitTest_DefaultInit_12_MPI_1 .........   Passed    1.42 sec
22/26 Test #26: KokkosAlgorithms_UnitTest_MPI_1 ..................***Timeout 300.12 sec
23/26 Test  #1: KokkosCore_UnitTest_Serial_MPI_1 .................***Timeout 300.26 sec
24/26 Test #24: KokkosContainers_UnitTest_Serial_MPI_1 ...........***Timeout 300.39 sec
25/26 Test #25: KokkosContainers_UnitTest_OpenMP_MPI_1 ...........***Timeout 300.49 sec
26/26 Test  #2: KokkosCore_UnitTest_OpenMP_MPI_1 .................***Timeout 300.58 sec

81% tests passed, 5 tests failed out of 26

Subproject Time Summary:
Kokkos    = 1553.23 sec*proc (26 tests)

Total Test time (real) = 300.71 sec

The following tests FAILED:
          1 - KokkosCore_UnitTest_Serial_MPI_1 (Timeout)
          2 - KokkosCore_UnitTest_OpenMP_MPI_1 (Timeout)
         24 - KokkosContainers_UnitTest_Serial_MPI_1 (Timeout)
         25 - KokkosContainers_UnitTest_OpenMP_MPI_1 (Timeout)
         26 - KokkosAlgorithms_UnitTest_MPI_1 (Timeout)
Errors while running CTest

Leaving these unset gives:

$ bsub -x -I -n 16 -q rhel7F ctest -j16
...
21/26 Test #20: KokkosCore_UnitTest_DefaultInit_14_MPI_1 .........   Passed    1.40 sec
22/26 Test  #2: KokkosCore_UnitTest_OpenMP_MPI_1 .................   Passed   53.67 sec
23/26 Test  #1: KokkosCore_UnitTest_Serial_MPI_1 .................   Passed   62.98 sec
24/26 Test #26: KokkosAlgorithms_UnitTest_MPI_1 ..................   Passed   89.67 sec
25/26 Test #25: KokkosContainers_UnitTest_OpenMP_MPI_1 ...........   Passed  102.24 sec
26/26 Test #24: KokkosContainers_UnitTest_Serial_MPI_1 ...........   Passed  185.06 sec

100% tests passed, 0 tests failed out of 26

Subproject Time Summary:
Kokkos    = 524.82 sec*proc (26 tests)

Total Test time (real) = 185.10 sec

Therefore, I will not commit this change just yet.

Let's see what happens with automated testing tomorrow morning and then we can play with this more locally later.

@ndellingwood
Contributor

@bartlettroscoe I think that OMP_PROC_BIND=spread and OMP_PLACES=threads should not be set for Kokkos unit tests, adding @dsunder to confirm.

@bartlettroscoe
Member Author

I think that OMP_PROC_BIND=spread and OMP_PLACES=threads should not be set for Kokkos unit tests, adding @dsunder to confirm.

I can provide exact instructions to reproduce this if anyone is interested in playing with it.

kyungjoo-kim pushed a commit to kyungjoo-kim/Trilinos that referenced this issue Mar 16, 2018
kyungjoo-kim pushed a commit to kyungjoo-kim/Trilinos that referenced this issue Mar 16, 2018
…#2398)

With the new set of flags --map-by socket:PE=4 (no --oversubscribe), it is
critical that you pass in -n 16 to run on the rhel7F nodes.
@bartlettroscoe
Member Author

Looks like that change to the mpiexec options in commit 114ca53 fixed all of the BYSOCKET:OVERSUBSCRIBE errors we saw with the Teuchos test suite, especially for the build Trilinos-atdm-white-ride-gnu-opt-openmp on 'ride', as can be seen in:

and

Now the only failing Teuchos tests on ride/white are TeuchosCore_TypeConversions_UnitTest_MPI_1 and TeuchosNumerics_LAPACK_test_MPI_1 which show different failures.

Looking over the other failing tests for that build Trilinos-atdm-white-ride-gnu-opt-openmp on ride this morning (which only completed through Teko tests before crashing) at:

there are 8 test failures for Belos and 50 test failures for Anasazi, but those are all segfaults (which I think is the long-known problem reported in #1208 and #1191 that has never been resolved).

Therefore, I think this issue is resolved so I will close it.

Thanks @nmhamster!

@mhoemmen
Contributor

mhoemmen commented Mar 19, 2018

For a Kokkos PR that may make oversubscription less bad, see kokkos/kokkos#1475.

@crtrott
Member

crtrott commented Mar 19, 2018

The problem with your ctest -j 16 while setting OMP_PROC_BIND etc. is that you bind 16 threads to the same core, since every one of the 16 tests you try to run simultaneously is using the same process mask. That means OpenMP thread 0 of each of those tests is binding to hardware thread 0 on the node, and OpenMP thread k is binding to hardware thread k.

If ctest tries to run 16 tests at the same time, it will bind 16 threads to each hardware thread up to OMP_NUM_THREADS. If the latter is two, you get 16 OpenMP threads bound to hardware thread 0 and 16 OpenMP threads bound to hardware thread 1, while all the other hardware threads are idle.
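
To illustrate the pile-up (a contrived sketch, not from the thread), two tests started concurrently from the same full-node mask resolve identical place lists:

    export OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=2
    export OMP_DISPLAY_ENV=true   # have each run print the OpenMP ICVs it resolved
    ./TestA.exe &   # resolves a fixed place list from the full-node mask
    ./TestB.exe &   # resolves the identical place list, so its threads stack on the same hardware threads
    wait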

@bartlettroscoe
Member Author

@crtrott, when not setting OMP_PROC_BIND at all, it does not seem to have this problem. I did a simple scaling study on 'shiller' that did not show this problem anyway. Is not setting OMP_PROC_BIND letting mpiexec spread the processes out across the cores better?

I am about to post a new Trilinos GitHub issue to try to address problems with threaded testing of Trilinos so we can discuss how to improve things.

@bartlettroscoe bartlettroscoe added the ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams label Nov 30, 2018