Libpnbc - persistent collectives for Open MPI #4515
Conversation
…t Open MPI master branch Signed-off-by: Dan Holmes <[email protected]>
Can one of the admins verify this patch? |
test this please |
First thing that needs to be done is all commits must be signed off. See https://github.com/open-mpi/ompi/wiki/Admistrative-rules#contributors-declaration It also looks like pineighbor_alltoallw_init.c is failing to compile on the Mellanox CI. Doesn't look specific to Mellanox setup. |
All Tests Passed! |
A few quick comments before looking thoughtfully at the code:
|
@dholmes-epcc-ed-ac-uk Also note that there are some commits from Thanks! |
How does this PR relate to Fujitsu's PR #2758? |
@jjhursey this is an alternative to #2758, which was discussed at SC'17 after the BOF. @kawashima-fj FYI, at first glance I noted the following differences
My first impression is that for the first two points, Fujitsu's PR is currently more "Open MPI ready". Just to be clear, the future standard will explicitly prohibit mixing non-blocking collectives and persistent collectives (just like mixing (blocking) collectives and non-blocking collectives is currently prohibited), right? |
Thanks @ggouaillardet, I (author of #2758) didn't notice this PR. About @ggouaillardet's second point. About @ggouaillardet's third point. About @ggouaillardet's fourth point. I'll take a look at this PR. |
@ggouaillardet Yes, the current persistent collectives draft explicitly prohibits mixing non-blocking collectives and persistent collectives. From the current draft:
|
I've updated my PR (#2758) to reflect the latest MPI Standard (addition of …). The libpnbc component in this PR seems to be based on libnbc. The nbpreq component in my PR calls the libnbc component. Therefore both are essentially the same regarding communication. As @ggouaillardet analyzed,
How about taking the following procedure?
But I have one concern. The existing libnbc component and this libpnbc component have similar code and we will have to maintain both. Fortunately the request creation part (code before calling |
The minimal maintenance route would appear to be to make each nonblocking call persistent:
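A minimal sketch of that route (PMPI_Bcast_init here follows the draft proposal and is an assumption for illustration, not an existing Open MPI symbol):

int MPI_Ibcast(void *buffer, int count, MPI_Datatype datatype, int root,
               MPI_Comm comm, MPI_Request *request)
{
    /* create a one-off persistent request... */
    int rc = PMPI_Bcast_init(buffer, count, datatype, root, comm, request);
    if (MPI_SUCCESS != rc) {
        return rc;
    }
    /* ...and start it immediately; the request would need to be marked
       "one-off" so that completion releases it, matching the semantics
       of a nonblocking request */
    return PMPI_Start(request);
}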
NB: the above pseudo-code uses the profiling interface for illustration only. Nonblocking functions (all of them) could become trivial to maintain. |
At this stage, I'd rather merge #2758 (since it is very ready to be merged), and see how to move forwards with the valuable bits from this PR. It seems
|
:bot:mellanox:retest |
@ggouaillardet Whilst #2758 is close in syntax to the persistent collectives proposal currently being considered by the MPI Forum, it contains at least one semantic difference because of its implementation choice. Layering PNBC on top of NBC (as in #2758) forces the collective ordering requirement to be applied to the "starts", whereas the MPI Forum proposal mandates that the "inits" must be collectively ordered but allows the "starts" to be in any order. NB: layering the other way around works semantically. Thus, option (1) "handle NBC as volatile/one-off/disposable PNBC" works almost trivially, but option (2) "generate PNBC from NBC" seems likely to be problematic/fragile. There is an option (3) "separate components for PNBC and NBC, with NBC delegating to PNBC in the manner described in #4515 (comment)". tl;dr: Single/duplicate implementation seems like an easy software engineering choice. Query: I would like to understand the comment "it seems |
@dholmes-epcc-ed-ac-uk thanks for the lengthy explanation, it will take me some time to fully understand some subtle points. Meanwhile, you can refer to https://github.com/open-mpi/ompi/commits/master/ompi/mca/coll/libnbc in order to get the history of the libnbc component. |
@dholmes-epcc-ed-ac-uk let me see if I got that right with a few examples.

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Start(&req[0]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

My understanding of the current draft is that the barrier is performed before the broadcast. If I am correct so far, what about:

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Waitall(&req[1], MPI_STATUS_IGNORE);

Should the program above hang (since the barrier was not started)? If the answer is yes, what about:

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Ibcast(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Waitall(&req[1], MPI_STATUS_IGNORE);

My understanding is that this program hangs if non-blocking collectives are (naively) implemented on top of persistent collectives. But is that the behavior mandated by the current draft? |
@dholmes-epcc-ed-ac-uk Now I understand the ordering issue of #2758. @ggouaillardet The current draft constrains only the ordering of the initialization calls. From 5.13 Persistent Collective Operations in the current draft:
Your first example is correct and the communication order is not defined. I think a problematic case is the following code. This code must run correctly but may not run correctly with my #2758.

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
if (rank == 0) {
MPI_Start(&req[1]);
MPI_Start(&req[0]);
} else {
MPI_Start(&req[0]);
MPI_Start(&req[1]);
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
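With the layering in #2758, each MPI_Start effectively issues the corresponding nonblocking collective, so the two processes would execute roughly the following (a hypothetical expansion, for illustration only):

if (rank == 0) {
    MPI_Ibcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD, &req[1]);  /* from MPI_Start(&req[1]) */
    MPI_Ibarrier(MPI_COMM_WORLD, &req[0]);                        /* from MPI_Start(&req[0]) */
} else {
    MPI_Ibarrier(MPI_COMM_WORLD, &req[0]);                        /* from MPI_Start(&req[0]) */
    MPI_Ibcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD, &req[1]);  /* from MPI_Start(&req[1]) */
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

Nonblocking collectives must be called in the same order on all processes, so this expansion is erroneous even though the persistent-collective code above is valid under the current draft. |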
@dholmes-epcc-ed-ac-uk I prefer your third option "separate components for PNBC and NBC, with NBC delegating to PNBC". By doing so, when another optimized PNBC component is added in the future, the delegating NBC component can call both the libpnbc component and the added PNBC component. I'll throw away my nbpreq component in #2758 once the libpnbc component is complete. |
@ggouaillardet my understanding is that all your examples are correct. First, the different types of collectives don't match, so the relative order of collectives from different classes is irrelevant. As has been pointed out above, what matters is the order in which the persistent collectives are declared (initialized), as this allows the underlying runtime to "name" them (so that once they are started it knows how to match them).
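A sketch of that idea (the names below are illustrative only, not the actual libpnbc code): reserve a matching "name", here a tag drawn from a per-communicator counter, at init time, so that matching is fixed by the init order even if the MPI_Start calls are later issued in different orders on each process.

#include <mpi.h>

struct pcoll_request {
    MPI_Comm comm;
    int      tag;                /* reserved at *_init time */
    /* ... schedule, buffers, etc. ... */
};

static int pcoll_next_tag = 0;   /* would be per-communicator in a real implementation */

static int sketch_bcast_init(MPI_Comm comm, struct pcoll_request *req)
{
    req->comm = comm;
    /* all processes call the *_init functions in the same (collectively
       ordered) sequence, so they agree on this tag without communication */
    req->tag = pcoll_next_tag++;
    return MPI_SUCCESS;
}
|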
@ggouaillardet I agree with the assessment from @kawashima-fj - excepting, of course, the deliberate syntax errors in your examples :) @bosilca whilst it is apparent to a human user that "different types of collectives don't match", this observation is not used in MPI, hence the collective ordering rule in the MPI Standard. See, for example, MPI-3.1 page 197 lines 40-45:
Specifically, example 5.30 on page 218:
The 4th example (from @kawashima-fj) is also correct and permissible - both operations should complete. Section 5.12 states:
That section (3.7.4) specifically shows an analogous example to the 4th example (from @kawashima-fj) but using point-to-point send and receive operations. For nonblocking collective operations, the example would be erroneous because the collective operations would be started in the wrong order. However, for persistent collective operations, we allow the operations to be started in any order - as long as they were initiated in the correct order.
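As a rough C illustration of that analogy (my own sketch, not the example from the standard text): two point-to-point operations posted in opposite orders on the two processes still both complete, because the tags keep them distinct.

MPI_Request req[2];
int a = 0, b = 0;
if (rank == 0) {
    MPI_Isend(&a, 1, MPI_INT, 1, 0 /* tag 0 */, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&b, 1, MPI_INT, 1, 1 /* tag 1 */, MPI_COMM_WORLD, &req[1]);
} else if (rank == 1) {
    /* posted in the opposite order, which is permitted for point-to-point */
    MPI_Irecv(&b, 1, MPI_INT, 0, 1 /* tag 1 */, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&a, 1, MPI_INT, 0, 0 /* tag 0 */, MPI_COMM_WORLD, &req[1]);
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
|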
@ggouaillardet Having investigated the provenance of our code, it seems that we took our original snapshot of … I agree with your preference to "do it again" rather than to "update to the latest …". |
@dholmes-epcc-ed-ac-uk it is my understanding that the sentence you cite (page 197 line 39) also states that "different types of collectives don't match", albeit it only refers to blocking and non-blocking. |
@bosilca the preceding sentence, which covers the point you are alluding to, reads:
The "do not match" statement is specifically related to blocking and nonblocking not matching each other, and does not generically refer to "types" of collective operation. The subsequent rationale and advice to users (p199) gives the reason and a workaround for this specific restriction. The "ordering rules for blocking collective operations in threaded environments" referred to in the chapter 5 text can be found in chapter 12, p486, lines 18-22:
This also does not mention "types" of collective calls. |
#4618 was merged. Close. |
Following the publication of our paper at EuroMPI/USA 2017, we would like to offer our reference implementation of MPI persistent collective operations to the Open MPI community.
We intend to pursue further work on this implementation to optimise these new operations.
Known issues: we have introduced a new (temporary) function MPIX_START that performs the start functionality for persistent collective requests, but this should be integrated into the existing MPI_START function, which is defined for point-to-point persistent requests. We believe that the community should discuss how to achieve this integration; in particular, should that function be relocated to a more general position in the directory structure?
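For illustration, the intended usage pattern is along the following lines (MPIX_Bcast_init is used as an example name here; the exact set of init functions and their signatures are those added by this pull request and may differ from this sketch):

MPI_Request req;

/* initialisation is collective and must be called in the same order on
   all processes of the communicator */
MPIX_Bcast_init(buf, count, MPI_INT, 0, MPI_COMM_WORLD, &req);

for (int i = 0; i < iterations; i++) {
    MPIX_Start(&req);                   /* start one instance of the collective */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* the request becomes inactive but is not freed */
}

MPI_Request_free(&req);                 /* release the persistent request */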
This reference implementation is not coded as an extension - it is intended to be merged only once the MPI Forum adds these operations to the MPI Standard.
Please contact the members of the Persistence working group (using [email protected]) with questions/suggestions/additional work needed to make this ready for acceptance into Open MPI.