-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dist1 coloring PPS improvements #578
Conversation
instead of magic numbers 0,1,2
PPS worklist construction (VB and EB) now happens in a single scan, without using a temporary array to store indices. Faster and saves memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @brian-kelley ! Please see comments below.
Kokkos::parallel_for ("KokkosGraph::GraphColoring::CreateNewWorkArray", | ||
my_exec_space(0, current_vertexListLength_), | ||
create_new_work_array<nnz_lno_temp_work_view_t>(current_vertexList_, next_iteration_recolorList_, pps_work_view)); | ||
ppsWorklistFunctorVB<nnz_lno_temp_work_view_t>(this->nv, current_vertexList_, next_iteration_recolorList_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the values in recolocList need to be reinitialized ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srajama1 It doesn't need to be initialized because the exact size of the next worklist is produced by a parallel_reduce of functorFindConflicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant reinitialized, from second iteration to third iteration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brian-kelley gave me a lesson on why we don't need the parallel prefix array.
Thanks @brian-kelley
Distance-1 coloring with parallel prefix sum worklist used to require 2 passes: a scan to get new worklist index view, and a for to populate the worklist. The original paper about it even says that the 2-pass approach is a downside of PPS compared to atomics.
This makes PPS use a single scan pass that scatters directly to the new worklist without an index view. This saves some time, one kernel launch per iteration and a significant amount of memory:
(num_edges / 2) * sizeof(nnz_lno_t)
for EB, andnum_rows * sizeof(nnz_lno_t)
for VB.Also, renamed
_conflictlist
member to_conflictlist_scheme
for clarity - it's not a list, it's an enum representing how the worklist is built. I also use those actual enum constants (COLORING_NOCONFLICT
,COLORING_ATOMIC
,COLORING_PPS
) instead of the raw integer values (0,1,2) throughout the code.Bowman checks:
#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=724 run_time=1027
intel-16.4.258-Pthread_Serial-release build_time=1034 run_time=1998
intel-16.4.258-Serial-release build_time=706 run_time=963
intel-17.2.174-OpenMP-release build_time=869 run_time=566
intel-17.2.174-OpenMP_Serial-release build_time=1225 run_time=1444
intel-17.2.174-Pthread-release build_time=814 run_time=899
intel-17.2.174-Pthread_Serial-release build_time=1131 run_time=1803
intel-17.2.174-Serial-release build_time=778 run_time=891
RIDE checks:
#######################################################
PASSED TESTS
#######################################################
cuda-9.2.88-Cuda_OpenMP-release build_time=517 run_time=418
cuda-9.2.88-Cuda_Serial-release build_time=507 run_time=527
gcc-6.4.0-OpenMP_Serial-release build_time=194 run_time=396
gcc-7.2.0-OpenMP-release build_time=120 run_time=128
gcc-7.2.0-OpenMP_Serial-release build_time=214 run_time=357
gcc-7.2.0-Serial-release build_time=112 run_time=226