Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dist1 coloring PPS improvements #578

Merged
merged 4 commits into from
Jan 29, 2020
Merged

Conversation

brian-kelley
Copy link
Contributor

Distance-1 coloring with parallel prefix sum worklist used to require 2 passes: a scan to get new worklist index view, and a for to populate the worklist. The original paper about it even says that the 2-pass approach is a downside of PPS compared to atomics.

This makes PPS use a single scan pass that scatters directly to the new worklist without an index view. This saves some time, one kernel launch per iteration and a significant amount of memory: (num_edges / 2) * sizeof(nnz_lno_t) for EB, and num_rows * sizeof(nnz_lno_t) for VB.

Also, renamed _conflictlist member to _conflictlist_scheme for clarity - it's not a list, it's an enum representing how the worklist is built. I also use those actual enum constants (COLORING_NOCONFLICT, COLORING_ATOMIC, COLORING_PPS) instead of the raw integer values (0,1,2) throughout the code.

Bowman checks:
#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=724 run_time=1027
intel-16.4.258-Pthread_Serial-release build_time=1034 run_time=1998
intel-16.4.258-Serial-release build_time=706 run_time=963
intel-17.2.174-OpenMP-release build_time=869 run_time=566
intel-17.2.174-OpenMP_Serial-release build_time=1225 run_time=1444
intel-17.2.174-Pthread-release build_time=814 run_time=899
intel-17.2.174-Pthread_Serial-release build_time=1131 run_time=1803
intel-17.2.174-Serial-release build_time=778 run_time=891

RIDE checks:
#######################################################
PASSED TESTS
#######################################################
cuda-9.2.88-Cuda_OpenMP-release build_time=517 run_time=418
cuda-9.2.88-Cuda_Serial-release build_time=507 run_time=527
gcc-6.4.0-OpenMP_Serial-release build_time=194 run_time=396
gcc-7.2.0-OpenMP-release build_time=120 run_time=128
gcc-7.2.0-OpenMP_Serial-release build_time=214 run_time=357
gcc-7.2.0-Serial-release build_time=112 run_time=226

instead of magic numbers 0,1,2
PPS worklist construction (VB and EB) now happens in a single scan,
without using a temporary array to store indices. Faster and saves
memory.
Copy link
Contributor

@srajama1 srajama1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brian-kelley ! Please see comments below.

Kokkos::parallel_for ("KokkosGraph::GraphColoring::CreateNewWorkArray",
my_exec_space(0, current_vertexListLength_),
create_new_work_array<nnz_lno_temp_work_view_t>(current_vertexList_, next_iteration_recolorList_, pps_work_view));
ppsWorklistFunctorVB<nnz_lno_temp_work_view_t>(this->nv, current_vertexList_, next_iteration_recolorList_));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the values in recolocList need to be reinitialized ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srajama1 It doesn't need to be initialized because the exact size of the next worklist is produced by a parallel_reduce of functorFindConflicts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant reinitialized, from second iteration to third iteration

Copy link
Contributor

@srajama1 srajama1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brian-kelley gave me a lesson on why we don't need the parallel prefix array.

Thanks @brian-kelley

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants