
WIP: PROCESSES proof of concept #5598

Closed
wants to merge 1 commit into develop from processes

Conversation

KyleFromKitware

@trilinos/

Description

Motivation and Context

How Has This Been Tested?

Checklist

  • My commit messages mention the appropriate GitHub issue numbers.
  • My code follows the code style of the affected package(s).
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the code contribution guidelines for this project.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • No new compiler warnings were introduced.
  • These changes break backwards compatibility.

@KyleFromKitware KyleFromKitware requested a review from a team as a code owner July 30, 2019 18:01
@KyleFromKitware
Author

Cc: @bartlettroscoe

@trilinos-autotester
Contributor

Status Flag 'Pre-Test Inspection': This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
WARNING: NO REVIEWERS HAVE BEEN REQUESTED FOR THIS PULL REQUEST!

@jhux2
Member

jhux2 commented Jul 30, 2019

@KyleFromKitware Can you provide some context for this PR?

@bartlettroscoe
Member

CC: @trilinos/kokkos

@jhux2 asked:

@KyleFromKitware Can you provide some context for this PR?

The context is the Kitware contract work to help address #2422, in this case for running our test suites faster and more robustly on GPUs. This would be a big deal for ATDM.

NOTE: I am hoping that we can add support to TriBITS to not require any changes to CMakeLists.txt files in any Trilinos package, so this is just a proof of concept, not a PR we want to actually merge.

@trilinos/kokkos,

This proof-of-concept PR contains some changes to the way that Kokkos allocates GPUs. Can one of the Kokkos developers attend a meeting with @KyleFromKitware and myself to discuss these changes, what the goals are, and where to go from here?

@KyleFromKitware
Author

When running the test suite on a multi-GPU machine, Kokkos tests currently only use GPU 0 while leaving all of the other GPUs untouched. This creates a major bottleneck which limits the number of tests that can be run at once while leaving large resources idle and wasted. There is a WIP CMake branch which makes it possible to describe to CTest what hardware is available and evenly distribute it among tests. This PR takes advantage of the new functionality to ensure that tests utilize all of the GPUs. When running with these new settings, we were able to cut the total testing time almost in half with a select set of tests.

This is still a proof of concept and is not ready to be merged. We need to come up with a way to teach TriBITS how to automatically apply this setting to all Kokkos-enabled tests, thus bringing this improvement to all of Trilinos. We also need to wait for the upstream CMake branch to land and be released.
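For illustration, here is a minimal sketch of how the new property could be attached to a single GPU test, using the PROCESSES syntax discussed later in this PR. The test name, rank count, and the 512 capacity value are placeholders, and the mechanism for describing the available GPUs to CTest lives in the WIP CMake branch and is not shown here:

# Hypothetical example: a 4-rank test where each process claims 512 "slots"
# of a GPU from the pool CTest knows about, so CTest will not oversubscribe
# the GPUs when it schedules tests in parallel.
add_test(NAME MyPackage_SomeGpuTest
         COMMAND mpiexec -np 4 $<TARGET_FILE:SomeGpuTest>)
set_tests_properties(MyPackage_SomeGpuTest PROPERTIES
  PROCESSES "4,gpus:512")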

@bartlettroscoe bartlettroscoe added the labels AT: WIP, ATDM DevOps, client: ATDM, and type: enhancement on Jul 30, 2019
@jjellio
Contributor

jjellio commented Jul 30, 2019

Depending on the machine, you will get different behavior for how GPUs are chosen. E.g., Sierra-type platforms are different from our Power8/9 testbeds (Waterman/White/Ride).

I can elaborate. But I'll need to do that later.

@mhoemmen mhoemmen changed the base branch from master to develop July 30, 2019 21:27
@mhoemmen
Contributor

@KyleFromKitware All Trilinos PRs must be made against develop, not against master. I changed the base to develop.

@mhoemmen mhoemmen left a comment

It seems a bit goofy that you have to add the same command-line argument to all tests. If the build enables GPUs, the default behavior should be to use the GPUs for tests. Why does Kokkos need another command-line argument? Kokkos knows if CUDA was enabled (KOKKOS_ENABLE_CUDA).

@@ -98,6 +98,7 @@
#include <Kokkos_CopyViews.hpp>
#include <functional>
#include <iosfwd>
#include <string>
Contributor

Trilinos periodically gets a snapshot of Kokkos:master. Changes to Kokkos need to go into Kokkos' repository. It's OK to make changes to Trilinos' snapshot, but they also need to make it into Kokkos around the same time, so that the next Kokkos snapshot into Trilinos won't clobber your changes.

Author

Understood. This is just a proof-of-concept, and will certainly be more fleshed out later on. I definitely don't intend for it to be merged as-is.

Author

We will need to get changes into TriBITS upstream as well.

Author

I'd prefer not to submit the PR to Kokkos upstream until we've fleshed out exactly what we want this to look like. Once we've done that, I will then proceed to submit PRs to all the upstream projects that I modify.

Contributor

Cool, just wanted to make sure :-)

@KyleFromKitware
Author

It seems a bit goofy that you have to add the same command-line argument to all tests. If the build enables GPUs, the default behavior should be to use the GPUs for tests. Why does Kokkos need another command-line argument? Kokkos knows if CUDA was enabled (KOKKOS_ENABLE_CUDA).

I am following the same convention as --kokkos-device and --kokkos-ndevices. Doing it this way also keeps us from having to hard-code the gpus resource name into Kokkos (if that matters at all.)

@bartlettroscoe
Member

FYI: I created kokkos/kokkos#2227 to try to see if a Kokkos developer can volunteer to meet with us to discuss the proposed Kokkos changes present in this PR.

@bartlettroscoe
Member

@mhoemmen said and asked:

It seems a bit goofy that you have to add the same command-line argument to all tests. If the build enables GPUs, the default behavior should be to use the GPUs for tests. Why does Kokkos need another command-line argument?

Not a problem to pass command-line arguments into every test that uses Kokkos. We can put an injection point in TriBITS to add whatever command-line arguments or set whatever env vars we want for any test in an SE package that depends on Kokkos. This is just one proposed approach. It is up to the SNL developers and the Kitware ctest developers to decide how to do this dance. TriBITS will be the glue to make it work. CTest is running the show in terms of deciding how many tests to run, knowing how many MPI ranks each test has, etc. We just need to get Kokkos to say "yes sir" when ctest tells it what GPUs to run on. Kokkos alone does not have the context to know all of this by itself (and I don't think you want to try to make Kokkos that smart).

But @KyleFromKitware, can't we just set the env var CUDA_VISIBLE_DEVICES as described here to get Kokkos to run on the right GPU? We can get TriBITS to generate whatever MPI command line we want at runtime (using a cmake -P script that ctest runs, which reads env vars and then runs mpiexec, or srun, or whatever). Why do we need any changes to Kokkos at all?
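As a rough sketch of that cmake -P wrapper idea (not part of this PR; the CTEST_PROCESS_0_GPUS_ID variable name and the script arguments are assumptions for illustration, only CTEST_PROCESS_COUNT appears elsewhere in this thread):

# run_gpu_test.cmake: invoked by ctest as
#   cmake -DTEST_EXE=<path> -DNUM_RANKS=<n> -P run_gpu_test.cmake
# Map the GPU that ctest assigned to this test onto CUDA_VISIBLE_DEVICES,
# then launch the real test under mpiexec.
if(DEFINED ENV{CTEST_PROCESS_0_GPUS_ID})  # hypothetical variable name
  set(ENV{CUDA_VISIBLE_DEVICES} "$ENV{CTEST_PROCESS_0_GPUS_ID}")
endif()
execute_process(COMMAND mpiexec -np ${NUM_RANKS} ${TEST_EXE}
                RESULT_VARIABLE rc)
if(NOT rc EQUAL 0)
  message(FATAL_ERROR "Test failed with exit code ${rc}")
endif()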

@KyleFromKitware
Author

But @KyleFromKitware, can't we just set the env var CUDA_VISIBLE_DEVICES as described here to get Kokkos to run on the right GPU? We can get TriBITS to generate whatever MPI command line we want at runtime (using a cmake -P script that ctest runs, which reads env vars and then runs mpiexec, or srun, or whatever). Why do we need any changes to Kokkos at all?

That is one way to do it, but I didn't particularly like the idea of wrapping the whole command in another script and adding another layer of indirection. I'm also not sure what the implications are of having four different MPI ranks in the same mpirun instance all using CUDA device 0 when that "device 0" maps to a different physical GPU in each rank. (FWIW, in my first pass at this, I did use CUDA_VISIBLE_DEVICES and saw a bunch of errors along the lines of "failed to remote allocate CUDA memory", which is what makes me wonder about this issue.) On the other hand, your idea does have the advantage of being agnostic to Kokkos.

@KyleFromKitware
Author

Also, Nvidia doesn't recommend using CUDA_VISIBLE_DEVICES for production, only for quick debugging and testing. From the Nvidia blog:

As Chris points out, robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide. But the CUDA_VISIBLE_DEVICES environment variable is handy for restricting execution to a specific device or set of devices for debugging and testing.

@jjellio
Contributor

jjellio commented Jul 31, 2019

On Sierra platforms (Sierra the machine), the LSF/IBM stack (bsub/jsrun) sets CUDA_VISIBLE_DEVICES for you when you 'jsrun' a binary.

The above is why I say you will get different behavior on different machines.

The catch with CUDA_VISIBLE_DEVICES is that your CUDA card needs to be used by a process on the same NUMA node (effectively the same socket). You can run across NUMA domains, but I believe your process mask needs to be set wide open so you can allocate memory on the 'remote' memory (that is, the memory closest to the other socket). This process binding + CUDA device interaction has caused a number of issues on Oak Ridge's Summit.

A more robust approach may be to rely on the job submission system to handle binding + card allocation for you.

On Vortex (SNL's Sierra testbed), you get some environment variables set for you; that is, the OMP* variables are set based on the allocation I requested. They also set OMPI_* variables because IBM's MPI is a derivative of OpenMPI.

JSM_NAMESPACE_RANK=0
OMPI_COMM_WORLD_RANK=0
JSM_NAMESPACE_SIZE=1
OMPI_COMM_WORLD_SIZE=1
JSM_NAMESPACE_LOCAL_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
JSM_NAMESPACE_LOCAL_SIZE=1
OMPI_COMM_WORLD_LOCAL_SIZE=1
JSM_SMPI_SHARP_ID=866648071
OMP_PROC_BIND=true
OMP_PLACES={0:4},{4:4},{8:4},{12:4}
OMP_NUM_THREADS=16
PMIX_NAMESPACE=7
PMIX_RANK=0
PMIX_SERVER_URI2=pmix-server.155718;tcp4://127.0.0.1:38231
PMIX_SERVER_URI21=pmix-server.155718;tcp4://127.0.0.1:38231
PMIX_SECURITY_MODE=native,none
PMIX_PTL_MODULE=tcp
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_DSTORE_21_BASE_PATH=/var/tmp/jsm.lassen17.30690/406440/pmix_dstor_ds21_155718
PMIX_DSTORE_ESH_BASE_PATH=/var/tmp/jsm.lassen17.30690/406440/pmix_dstor_ds12_155718
CUDA_VISIBLE_DEVICES=0
JSM_GPU_ASSIGNMENTS=0
165173-CPU_ASSIGNMENTS=0,1,2,3
JSM_SMT_ASSIGNMENTS=0-15

Next, I looked at how jsrun would behave if I put multiple 'jsrun' commands inside the same allocation concurrently. E.g.,

# 4 jobs launched at once
jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh &
jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh &
jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh &
jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh &

# since these are effectively independent things, I expect the MPI rank to be 0
# if Sierra's toolchain is smart, then it will give each job a different device... it does!

[jjellio@lassen17:lassen]$ Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 0
  CPU list:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
  Sockets:  0   NUMA list:  0 
jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh &
[2] 170019
[jjellio@lassen17:lassen]$ Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 1
  CPU list:  16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
  Sockets:  0   NUMA list:  0 
jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh &
[3] 170076
[jjellio@lassen17:lassen]$ Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 2
  CPU list:  88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 
  Sockets:  8   NUMA list:  8 
jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh &
[4] 170125
[jjellio@lassen17:lassen]$ Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 3
  CPU list:  104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 
  Sockets:  8   NUMA list:  8 

Next, I tried 10 jsrun commands in a loop... Imagine CMake effectively telling CTest to 'jsrun' every test. At least on Sierra machines, you don't need to worry about -j, because jsrun is not going to spawn more jobs than the resource set allows. E.g., above I use jsrun -r1 -a1 -c4 -g1 -brs

jsrun -r1 -a1 -c4 -g1 -brs
-r1 = 1 resource set
-a1 = 1 task per resource set (so with 1 resource set I get 1 task total)
-c4 = 4 cores per task
-g1 = 1 GPU per task
-brs = bind to resource set (so you get a process mask that isolates resource sets)

@jjellio
Contributor

jjellio commented Jul 31, 2019

TL;DR: a robust forward solution would be for CTest to support the idea of batch systems. If that interface is rich enough, then we could express different batch systems with it, e.g., SLURM, LSF, PBS. We then rely on the batch system to handle 'task' concurrency; they typically do this fairly well, since some of their customers use batch systems to spawn multiple binaries in parallel (think non-MPI use cases).

A challenge is how to use testbeds at SNL that provide a resource manager for getting nodes but may not have a batch system (e.g., White/Ride: you use bsub to get a node, but you execute mpirun directly). I think what you are doing is trying to solve the problem with direct mpirun usage, which is harder. But in doing that, perhaps seeing how IBM/SLURM do it may be a useful approach (that is, LSF is doing stuff with CUDA_VISIBLE_DEVICES and handing out CUDA device ids based on the CPU set).

@jjellio
Contributor

jjellio commented Jul 31, 2019

Oops, in my prior post, I didn't show how their jsrun handled a bunch of concurrent jobs.

I effectively put jsrun inside a loop of 20 tasks. Each task randomly sleeps between 5 and 30 seconds, then prints a message. This lets jobs start/finish at various times, but all jsrun commands were effectively issued at once. This would be a change for CTest, as CTest would need to 'run' a bunch of jobs and then wait for their completion, rather than have a synchronous launch -> finish model.

Here is the test run:

# starting 20 tasks at once
[jjellio@lassen17:lassen]$ for i in $(seq 1 20); do jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh & done
[1] 6511
[2] 6512
[3] 6513
[4] 6514
[5] 6515
[6] 6516
[7] 6517
[8] 6518
[9] 6519
[10] 6520
[11] 6521
[12] 6522
[13] 6523
[14] 6524
[15] 6525
[16] 6526
[17] 6527
[18] 6528
[19] 6529
[20] 6530
# output starts after a little while. Notice that the start times of the first 4 jobs are all the same
[jjellio@lassen17:lassen]$ Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 3
  CPU list:  104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 
  Sockets:  8   NUMA list:  8 
  Job will sleep 15 seconds to waste time
Job jsrun started at Wed Jul 31 08:12:58 PDT 2019
            ended at Wed Jul 31 08:13:13 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 0
  CPU list:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
  Sockets:  0   NUMA list:  0 
  Job will sleep 17 seconds to waste time
Job jsrun started at Wed Jul 31 08:12:58 PDT 2019
            ended at Wed Jul 31 08:13:15 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 1
  CPU list:  16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
  Sockets:  0   NUMA list:  0 
  Job will sleep 26 seconds to waste time
Job jsrun started at Wed Jul 31 08:12:58 PDT 2019
            ended at Wed Jul 31 08:13:24 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 2
  CPU list:  88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 
  Sockets:  8   NUMA list:  8 
  Job will sleep 27 seconds to waste time
Job jsrun started at Wed Jul 31 08:12:58 PDT 2019
            ended at Wed Jul 31 08:13:25 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 3
  CPU list:  104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 
  Sockets:  8   NUMA list:  8 
  Job will sleep 12 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:14 PDT 2019
            ended at Wed Jul 31 08:13:26 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 0
  CPU list:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
  Sockets:  0   NUMA list:  0 
  Job will sleep 25 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:15 PDT 2019
            ended at Wed Jul 31 08:13:40 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 3
  CPU list:  104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 
  Sockets:  8   NUMA list:  8 
  Job will sleep 15 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:26 PDT 2019
            ended at Wed Jul 31 08:13:41 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 1
  CPU list:  16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
  Sockets:  0   NUMA list:  0 
  Job will sleep 30 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:24 PDT 2019
            ended at Wed Jul 31 08:13:54 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 2
  CPU list:  88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 
  Sockets:  8   NUMA list:  8 
  Job will sleep 29 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:26 PDT 2019
            ended at Wed Jul 31 08:13:55 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 3
  CPU list:  104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 
  Sockets:  8   NUMA list:  8 
  Job will sleep 22 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:41 PDT 2019
            ended at Wed Jul 31 08:14:03 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 2
  CPU list:  88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 
  Sockets:  8   NUMA list:  8 
  Job will sleep 10 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:55 PDT 2019
            ended at Wed Jul 31 08:14:05 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 0
  CPU list:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
  Sockets:  0   NUMA list:  0 
  Job will sleep 25 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:41 PDT 2019
            ended at Wed Jul 31 08:14:06 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 1
  CPU list:  16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
  Sockets:  0   NUMA list:  0 
  Job will sleep 21 seconds to waste time
Job jsrun started at Wed Jul 31 08:13:55 PDT 2019
            ended at Wed Jul 31 08:14:16 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 3
  CPU list:  104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 
  Sockets:  8   NUMA list:  8 
  Job will sleep 14 seconds to waste time
Job jsrun started at Wed Jul 31 08:14:03 PDT 2019
            ended at Wed Jul 31 08:14:17 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 2
  CPU list:  88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 
  Sockets:  8   NUMA list:  8 
  Job will sleep 22 seconds to waste time
Job jsrun started at Wed Jul 31 08:14:05 PDT 2019
            ended at Wed Jul 31 08:14:27 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 1
  CPU list:  16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
  Sockets:  0   NUMA list:  0 
  Job will sleep 13 seconds to waste time
Job jsrun started at Wed Jul 31 08:14:16 PDT 2019
            ended at Wed Jul 31 08:14:29 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 0
  CPU list:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
  Sockets:  0   NUMA list:  0 
  Job will sleep 30 seconds to waste time
Job jsrun started at Wed Jul 31 08:14:06 PDT 2019
            ended at Wed Jul 31 08:14:36 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 3
  CPU list:  104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 
  Sockets:  8   NUMA list:  8 
  Job will sleep 23 seconds to waste time
Job jsrun started at Wed Jul 31 08:14:18 PDT 2019
            ended at Wed Jul 31 08:14:41 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 1
  CPU list:  16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
  Sockets:  0   NUMA list:  0 
  Job will sleep 12 seconds to waste time
Job jsrun started at Wed Jul 31 08:14:29 PDT 2019
            ended at Wed Jul 31 08:14:41 PDT 2019

Rank: 00 Local Rank: 0 lassen17 Cuda Devices: 2
  CPU list:  88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 
  Sockets:  8   NUMA list:  8 
  Job will sleep 24 seconds to waste time
Job jsrun started at Wed Jul 31 08:14:27 PDT 2019
            ended at Wed Jul 31 08:14:51 PDT 2019

# bash spam at the end
[1]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[2]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[3]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[4]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[5]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[6]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[7]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[8]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[9]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[10]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[11]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[12]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[13]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[14]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[15]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[16]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[17]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[18]   Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[19]-  Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh
[20]+  Done                    jsrun -r1 -a1 -c4 -g1 -brs ./env_dump.sh

# env dump
#!/bin/bash
# Report this process's CPU/socket/NUMA binding, visible CUDA devices, and MPI
# rank, then sleep a random interval so concurrently launched jobs overlap.

cpu_list=$(numactl -s | grep '^physcpubind:' | cut -d':' -f2)
sockets_list=$(numactl -s | grep '^cpubind:' | cut -d':' -f2)
numa_list=$(numactl -s | grep '^membind:' | cut -d':' -f2)

start_time=$(date)
glb_rank=$(printf '%02d' ${OMPI_COMM_WORLD_RANK})
lcl_rank=$(printf '%d' ${OMPI_COMM_WORLD_LOCAL_RANK})

sleep_time=$(shuf -i 10-30 -n 1)
sleep ${sleep_time}

end_time=$(date)


msg="Rank: $glb_rank Local Rank: $lcl_rank $(hostname) Cuda Devices: $CUDA_VISIBLE_DEVICES\n"
msg+="  CPU list: $cpu_list\n"
msg+="  Sockets: $sockets_list"
msg+="  NUMA list: $numa_list\n"
msg+="  Job will sleep ${sleep_time} seconds to waste time\n"
msg+="Job jsrun started at $start_time\n"
msg+="            ended at $end_time\n"
>&2 echo -e "$msg"

@bartlettroscoe
Member

@jjellio, ctest can already utilize batch systems. CTest has a way for a test to report its real runtime (and not wall-clock time); see TIMEOUT_AFTER_MATCH and some discussion of this in:

(If you don't have access to that, we can provide it.)

This is essentially how we use srun on 'mutrino'. Actually, we need to try:

again to see if that will now work after the recent 'mutrino' update.

@jjellio
Contributor

jjellio commented Jul 31, 2019

'srun' and 'jsrun' are not batch systems though.

They are used inside a batch system. Currently, we can define the 'run command' and some arbitrary arguments that go before 'np', but the piece that is missing and causing problems is that the 'run command' is used outside of an informed 'batch' environment.

I think what could allow this to work would be to annotate the test markup so that those 'arbitrary' arguments turn into something more generic that allows you to express what an 'sbatch + srun' test should look like. E.g.,

add_test(NAME <name>
         [CONFIGURATIONS [Debug|Release|...]]
         [WORKING_DIRECTORY dir]
         # command is just the binary + args (what goes in front of this is generated by a suitable module)
         COMMAND <command> [arg1 [arg2 ...]]
         # allow arbitrary properties to be attached to the test (e.g., could call set_properties)
         TEST_PROPERTIES num_nodes=2;cores_per_proc=4;devices=gpu;gpu
         MAX_EXECUTION_TIME hh:mm:ss
          # The following actually makes sense as a global configuration-time setting
         BATCH_SYSTEM LSF
)

Suppose CMake then looks for a module that can provide a suitable execution, e.g.:
cmake/batch_systems/lsf.cmake

lsf.cmake then supports various properties and is able to utilize the property list passed in (ignoring properties it doesn't understand, and using defaults for those not provided)

lsf.cmake can then test for and set a run command (most likely jsrun, but on our testbeds it would be MPI).

Unfortunately, running with something like 'mpirun' doesn't solve the problem of oversubscribing a node. A 'batch-aware' run tool does, and if those are used the process binding and utilization things become a non-problem.

With the above PROPERTIES and MAX_EXECUTION_TIME, the module can pack jobs (provided it understands num_nodes), create a batch script, submit the jobs, and generate a follow-up job that runs after all submitted jobs have returned to finish the testing output (e.g., gather data and report).

In the above, sbatch + srun is used, but clearly the markup would allow expression of other batch schemes. Testing then happens in a properly configured environment, and so the run command can actually be written to ensure the bindings you want.
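A hedged sketch of what such a batch-system module could look like, mirroring the jsrun invocation from the experiments above; the file layout, function name, and arguments are hypothetical, not an existing CTest or TriBITS feature:

# cmake/batch_systems/lsf.cmake (hypothetical)
# Translate generic per-test properties into the launcher prefix that goes
# in front of the test binary: one resource set with one task, the requested
# cores and GPUs per task, bound to the resource set.
function(lsf_test_launcher out_var cores_per_proc gpus_per_proc)
  set(${out_var}
      jsrun -r1 -a1 -c${cores_per_proc} -g${gpus_per_proc} -brs
      PARENT_SCOPE)
endfunction()

# Usage: lsf_test_launcher(launcher 4 1) yields "jsrun -r1 -a1 -c4 -g1 -brs".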

@bartlettroscoe
Member

@jjellio, you should look at:

It would be great to get your input on this and see if we can address this without requiring changes to ctest.

(Again, if you don't have access then Kitware can provide it to you.)

@jjellio
Contributor

jjellio commented Jul 31, 2019

No I do not have access.

@bartlettroscoe bartlettroscoe left a comment

@KyleFromKitware,

For the purposes of experimenting with this on this branch and on 'ride' and 'waterman', can we define another env var like KOKKOS_USE_GPU_CTEST_INFO and if set to 1, then read in these vars? That would avoid needing to pass --kokkos-device-ctest=gpus into every function and it would allow us to test this strategy on the entire Trilinos test suite (not just a few MueLu tests).

auto local_rank_str = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK"); //OpenMPI
if (!local_rank_str) local_rank_str = std::getenv("MV2_COMM_WORLD_LOCAL_RANK"); //MVAPICH2
if (!local_rank_str) local_rank_str = std::getenv("SLURM_LOCALID"); //SLURM
if (local_rank_str) {

auto ctest_process_count_str = std::getenv("CTEST_PROCESS_COUNT"); //CTest
Member

For the purposes of experimenting with this, can we define another env var like KOKKOS_USE_GPU_CTEST_INFO and if set to 1, then read in these vars? That would avoid needing to pass --kokkos-device-ctest=gpus into every function and it would allow us to test this strategy on the entire Trilinos test suite (not just a few MueLu tests).

Author

TriBITS will also need a way to set the `PROCESSES` property of each test to list the GPU requirements. (The value will also be different for each test, since they have different numbers of processes.) What is your recommended strategy for doing this?

@bartlettroscoe bartlettroscoe Aug 1, 2019

@KyleFromKitware,

Seems like they all look like:

TRIBITS_ADD_TEST( ...
    NUM_MPI_PROCS <n>
    PROCESSES "<n>,gpus:512"
    ...
    )

where <n> is just the number of MPI processes/ranks, right?

What is the significance of the 512 in gpus:512 in all of these? Was that an estimate of memory (which we have no idea what it actually is)?

How do we want CTest to actually limit the number of tests that can run using the GPUs? Trying to estimate memory requirements is a complete guessing game.

Author

Doing it by memory was just a quick example to show how it might work. I think doing it by threads would ultimately be better (no more than X threads on one GPU at a time... we will need to come up with a way to define how many threads one GPU can handle.)

As far as applying this to tests on a global level, I'm thinking we could set a variable like the following:

SET(TRIBITS_GPU_THREADS_PER_PROCESS 1)

and then TRIBITS_ADD_TEST() will set this property appropriately.
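For concreteness, a sketch of how TRIBITS_ADD_TEST() might combine that global variable with NUM_MPI_PROCS when it registers a test; the helper name is hypothetical, and only PROCESSES and TRIBITS_GPU_THREADS_PER_PROCESS come from this discussion:

# Hypothetical helper called from inside TRIBITS_ADD_TEST() after add_test():
# each of the <num_mpi_procs> processes claims this many GPU "threads".
function(tribits_private_set_gpu_processes_property test_name num_mpi_procs)
  if(DEFINED TRIBITS_GPU_THREADS_PER_PROCESS)
    set_tests_properties(${test_name} PROPERTIES
      PROCESSES "${num_mpi_procs},gpus:${TRIBITS_GPU_THREADS_PER_PROCESS}")
  endif()
endfunction()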

Member

@KyleFromKitware, I don't think you can control the number of threads run on the GPU, can you? I think it just uses as many as it wants, no?

Author

Memory has the same problem.

Either way, we need a way to define how much capacity a GPU has, and how much of that capacity is consumed by a single process.

If that number is the same (or roughly the same) for every test-process, then we can set a global TriBITS variable. If not, then every test has to have this property individually set.

Author

Or if the vast majority uses the same capacity, but there are a few outlier tests, then exceptions could be made for those individual tests.

Member

@KyleFromKitware, I think we need to know more about how a GPU manages multiple processes that run kernels on it. Do the requests from different processes to run kernels on a GPU run at the same time? Do they run in serial? What about the RAM used by each process that runs things on the GPU? We need to understand all of that before we can come up with a good strategy for having ctest limit what runs on the GPUs.

I think we need another meeting with some Kokkos developers to understand these constraints with how things run on a GPU.

Author

Agreed.

In the meantime, do we still want to proceed with modifying Kokkos for CTest awareness, or would you prefer the wrapper script strategy that we discussed yesterday?

Member

@KyleFromKitware, I think the easiest thing to try is:

  • Add an env var to Kokkos like KOKKOS_USE_GPU_CTEST_INFO that, if set to 1, has the same impact as the current command-line argument --kokkos-device-ctest=gpus.

  • Update TriBITS on this branch to set the env var `KOKKOS_USE_GPU_CTEST_INFO=1` for each test (using the ctest property `ENVIRONMENT`).

  • Update TriBITS on this branch to set the ctest property PROCESSES "<n>,gpus:512" where <n> is taken from the NUM_MPI_PROCS <n> argument for every test. (Don't need a PROCESSES property exposed in TRIBITS_ADD_TEST().)

If we do that, then I think this will work and we can run the entire Trilinos test suite with no modifications to any Trilinos CMakeLists.txt files.

NOTE: This is not quite what we want for production because ctest will limit running tests that may not even require a GPU but that may not be a big issue either.
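Put together, the second and third bullets above might reduce to something like the following inside the TriBITS test-creation code on this branch; this is a sketch only, and variable names other than the two ctest properties are placeholders:

set_tests_properties(${TEST_NAME} PROPERTIES
  # Tell Kokkos to pick its GPU from the information ctest provides.
  ENVIRONMENT "KOKKOS_USE_GPU_CTEST_INFO=1"
  # <n> MPI processes (from NUM_MPI_PROCS), each claiming 512 units of a GPU.
  PROCESSES   "${NUM_MPI_PROCS},gpus:512")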

@KyleFromKitware
Author

KyleFromKitware commented Sep 13, 2019

Where would be the best place to add tests for get_ctest_gpu()? I added a unit test to packages/kokkos/core/unit_test.

@KyleFromKitware KyleFromKitware force-pushed the processes branch 4 times, most recently from 58b639a to e9a2892 Compare September 18, 2019 18:21
@KyleFromKitware
Author

Closing in favor of #6840.
