Zoltan2: Add quotient algorithm and communication graph model #8576
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Failure: Timed out waiting for job Trilinos_pullrequest_intel_17.0.1 to start: Total Wait = 603
|
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Build Information:
- Test Name: Trilinos_pullrequest_gcc_8.3.0
- Test Name: Trilinos_pullrequest_gcc_7.2.0_serial
- Test Name: Trilinos_pullrequest_gcc_7.2.0_debug
- Test Name: Trilinos_pullrequest_intel_17.0.1
- Test Name: Trilinos_pullrequest_cuda_10.1.105
- Test Name: Trilinos_pullrequest_clang_10.0.0
- Test Name: Trilinos_pullrequest_python_3
Using Repos:
Pull Request Author: seheracer |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED
Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.
Pull Request Auto Testing has FAILED
Build Information (test name and build number):
- Trilinos_pullrequest_gcc_8.3.0 # 3249
- Trilinos_pullrequest_gcc_7.2.0_serial # 876
- Trilinos_pullrequest_gcc_7.2.0_debug # 1370
- Trilinos_pullrequest_intel_17.0.1 # 8723
- Trilinos_pullrequest_cuda_10.1.105 # 199
- Trilinos_pullrequest_clang_10.0.0 # 1565
- Trilinos_pullrequest_python_3 # 4293
|
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Build Information: the same seven builds as in the first test run (gcc_8.3.0, gcc_7.2.0_serial, gcc_7.2.0_debug, intel_17.0.1, cuda_10.1.105, clang_10.0.0, python_3).
Pull Request Author: seheracer |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED
Pull Request Auto Testing has PASSED
Build Information: all seven builds passed (gcc_8.3.0, gcc_7.2.0_serial, gcc_7.2.0_debug, intel_17.0.1, cuda_10.1.105, clang_10.0.0, python_3).
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
MueLu changes look good to me
The Zoltan2 code looks fine. (I did not check MueLu.)
As Seher mentioned, any graph partitioner can be used to partition the quotient graph. There is really no need to require ParMetis. A future extension is to allow PHG.
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ egboman cgcgcg ]! |
Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged |
Merge on Pull Request# 8576: IS A SUCCESS - Pull Request successfully merged |
Interface changes requested:
- Return partListView of quotient algorithm does not match that of other algorithms.
- Currently returning:
  - quotient: one part assignment per MPI rank; on each processor, Teuchos::ArrayView(1) with part(0) = single value in [0,nparts)
  - everything else: one part assignment per matrix row; Teuchos::ArrayView(nrows) with part(i) = some value in [0,nparts)
- We should return:
  - quotient: one part assignment per matrix row; all rows assigned to same part (until we implement chunking version); on each processor, Teuchos::ArrayView(nrows) with part(i) = const A in [0,nparts) for every row i (same A for every row)
  - everything else: one part assignment per matrix row; Teuchos::ArrayView(nrows) with part(i) = some value in [0,nparts)
- For quotient method, modifying partListView to return part assignment per matrix row allows
  - common user interface for all methods;
  - matrix redistribution in the partitioning1.cpp test will work for better testing; and
  - with quotient, can also check that part(i) = A for all i
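The requested return shape can be sketched as follows. This is only an illustration under stated assumptions: `std::vector<int>` stands in for `Teuchos::ArrayView`, and the helper names `expandRankPartToRows` and `allRowsInSamePart` are hypothetical, not part of Zoltan2.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expand the single part chosen for this rank into a
// per-row part list, the shape the other algorithms already return.
// std::vector<int> is a stand-in for Teuchos::ArrayView.
std::vector<int> expandRankPartToRows(int rankPart, std::size_t nrows) {
  assert(rankPart >= 0);  // part ids are expected in [0, nparts)
  return std::vector<int>(nrows, rankPart);
}

// Hypothetical check: every row carries the same part A, with A in [0, nparts).
bool allRowsInSamePart(const std::vector<int>& parts, int nparts) {
  if (parts.empty()) return true;  // a rank with no rows has nothing to check
  const int a = parts.front();
  if (a < 0 || a >= nparts) return false;
  return std::all_of(parts.begin(), parts.end(),
                     [a](int p) { return p == a; });
}
```

A test could call `allRowsInSamePart` on the quotient result and skip it for the other algorithms, where rows on one rank may legitimately land in different parts.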
Algorithm changes requested:
- To accomplish the change above, we might need to
  - have the quotient algorithm take an Adapter as input (as in AlgPuLP),
  - construct its CommGraphModel itself from the adapter,
  - and then use GIDs from the adapter to setParts with the result.
- Make threshold_ a parameter so that, for testing, we can set it to 2 for some of the PR tests (runs on 4, but migrates to 2).
- Remove the untested option to not migrate. A user who does not want to migrate can run on 4 and migrate to 4. Test this case.
- Remove the code that disables distributeInput for the quotient method.
Tests:
- test both distributeInput=false and true (--no-distribute, --distribute (default))
- set threshold to run on 4, migrate to 2
- set threshold to run on 4, migrate to 4
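The two threshold settings above follow from the n/threshold_ rule given in the PR description. A minimal sketch of that arithmetic, assuming integer division and a clamp to at least one active rank (both are assumptions, and `numActiveRanks` is a hypothetical name):

```cpp
#include <cassert>

// Number of active ranks, following the n / threshold_ rule from the PR
// description. Integer division and the clamp to at least one rank are
// assumptions made for this sketch.
int numActiveRanks(int nRanks, int threshold) {
  assert(nRanks > 0 && threshold > 0);
  const int n = nRanks / threshold;
  return n > 0 ? n : 1;
}
```

Under these assumptions, running on 4 ranks with a threshold of 2 migrates to 2 ranks, and a threshold of 1 migrates to 4, which is the "no effective migration" case the review asks to test.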
@trilinos/zoltan2
Motivation
This PR adds two major contributions in the following files:
model/Zoltan2_CommGraphModel.hpp
CommGraphModel (communication graph model) creates a graph representing the communication topology of the MPI ranks for a given XpetraCrsGraphAdapter object. If there are n MPI ranks in the given communicator, the model contains n vertices, one per MPI rank. If rank i sends a message to rank j (during the mat-vec on the matrix corresponding to the given adapter), there is a directed edge from vertex i to vertex j in the graph. The weight of the edge is the number of nonzeros that cause that message. The weight of vertex i is the number of nonzeros currently residing at rank i. Since the communication graph is small (only n vertices), we migrate it onto a subset of ranks, which we call activeRanks. nActiveRanks_ denotes the number of active ranks and is computed as n/threshold_. For now, this migration is mandatory, but we can make it optional (by setting a parameter and the migrated_ flag). The threshold_ value can also be made a parameter.
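To make the vertex and edge weights concrete, here is a serial sketch that builds such a graph from a list of nonzeros. It is not the CommGraphModel implementation: it assumes a 1D row-wise layout in which the rank owning row k also owns vector entry k, so that a nonzero A(row, col) makes the owner of col send to the owner of row. The types `Nonzero` and `CommGraph` and the function `buildCommGraph` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One nonzero A(row, col) of a sparse matrix (hypothetical type).
struct Nonzero { int row; int col; };

// Hypothetical container for the communication graph.
struct CommGraph {
  std::vector<long> vertexWeight;                  // nonzeros held by each rank
  std::map<std::pair<int, int>, long> edgeWeight;  // (sender, receiver) -> nonzeros
};

// owner[k] is the rank owning row k and vector entry k (assumed 1D layout).
CommGraph buildCommGraph(const std::vector<Nonzero>& nonzeros,
                         const std::vector<int>& owner, int nRanks) {
  CommGraph g;
  g.vertexWeight.assign(static_cast<std::size_t>(nRanks), 0L);
  for (const Nonzero& nz : nonzeros) {
    const int receiver = owner.at(nz.row);  // stores the nonzero, computes y(row)
    const int sender = owner.at(nz.col);    // holds x(col)
    assert(receiver >= 0 && receiver < nRanks);
    assert(sender >= 0 && sender < nRanks);
    ++g.vertexWeight[receiver];
    if (sender != receiver) {
      // Off-rank column: this nonzero contributes to the message sender -> receiver.
      ++g.edgeWeight[std::make_pair(sender, receiver)];
    }
  }
  return g;
}
```

In the real model each rank would only see its own rows, and the graph would then be migrated to the active ranks; the sketch gathers everything in one place only to show how the weights are counted.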
algorithm/partition/Zoltan2_AlgQuotient.hpp
This algorithm partitions the CommGraphModel and transfers the solution (which is only stored on active ranks of CommGraphModel) to all MPI ranks. For now, it uses ParMETIS to partition the graph. Support for other partitioners can be added if needed.
Stakeholder Feedback
Testing
All tests are passing on CUDA builds on Summit (with and without ParMETIS enabled).