
Libpnbc - persistent collectives for Open MPI #4515

Closed


@dholmes-epcc-ed-ac-uk (Author)

Following the publication of our paper at EuroMPI/USA 2017, we would like to offer our reference implementation of MPI persistent collective operations to the Open MPI community.

We intend to pursue further work on this implementation to optimise these new operations.

Known issues: we have introduced a new (temporary) function MPIX_START that performs the start functionality for persistent collective requests. This should eventually be integrated into the existing MPI_START function, which is currently defined only for point-to-point persistent requests. We believe the community should discuss how to achieve this integration; in particular, should that function be relocated to a more general position in the directory structure?
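
For illustration, the intended calling sequence looks something like this (a sketch only, assuming the prototypes used in this PR, including the MPI_Info argument; MPIX_Start is the temporary function described above):

MPI_Request req;
MPI_Barrier_init(MPI_COMM_WORLD, MPI_INFO_NULL, &req);  /* collective initialisation */
for (int i = 0; i < 10; i++) {
    MPIX_Start(&req);                    /* temporary stand-in for MPI_Start */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the request remains allocated for reuse */
}
MPI_Request_free(&req);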

This reference implementation is not coded as an extension - it is intended to be merged only once the MPI Forum adds these operations to the MPI Standard.

Please contact the members of the Persistence Working Group (using [email protected]) with questions, suggestions, or additional work needed to make this ready for acceptance into Open MPI.

@ompiteam-bot

Can one of the admins verify this patch?

@hppritcha (Member)

test this please

@hppritcha (Member)

The first thing that needs to be done: all commits must be signed off. See https://github.com/open-mpi/ompi/wiki/Admistrative-rules#contributors-declaration

It also looks like pineighbor_alltoallw_init.c is failing to compile in the Mellanox CI; the failure doesn't look specific to the Mellanox setup.

@ompiteam-bot

All Tests Passed!

@bosilca (Member) commented Nov 18, 2017

A few quick comments before looking thoughtfully at the code:

  • remove all unnecessary files (for example, .cproject)
  • all newly added functions should be prefixed with MPIX_, and the code should be converted to an OMPI extension (take a look in ompi/mpiext, or feel free to ping us on ompi-devel for more details)
  • I don't think it is necessary to add the entire copyright of all institutions to all files (this also applies to the license); please follow the examples in the other files

@jsquyres (Member)

@dholmes-epcc-ed-ac-uk Also note that there are some commits from [email protected], which don't provide any accountability. I think you (and/or @BradleyMorgan?) might need to clean up this PR a bit before we can evaluate it properly.

Thanks!

@jjhursey (Member)

How does this PR relate to Fujitsu's PR #2758?

@ggouaillardet (Contributor)

@jjhursey this is an alternative to #2758, which was discussed at SC'17 after the BOF.

@kawashima-fj FYI

At first glance, I noted the following differences:

  1. MPI_* vs MPIX_*: Fujitsu uses the MPIX_* names, so that PR could be merged now; this PR uses the MPI_* names (and, in all fairness, makes it clear that it should only be merged once persistent collectives are part of the standard).
  2. MPI_Start() vs MPIX_Start(): this PR requires the use of MPIX_Start(), whereas Fujitsu uses the standard MPI_Start(). I understand this is work in progress here, and the goal is to use the single MPI_Start().
  3. MPI_Info: this PR adds an MPI_Info argument to all subroutines, but Fujitsu does not. I guess this reflects the latest direction taken by the Persistence Working Group (see the sketch after this list).
  4. Default module: whereas Fujitsu provides a naive component that stores the parameters and relies on the (non-persistent) nonblocking collectives implementation to do the communication, the component in this PR looks self-sufficient.
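
To make point 3 concrete, the prototypes differ roughly as follows (a sketch from memory; treat the exact parameter order as an assumption rather than a quote from either PR):

/* Fujitsu #2758: no info argument */
int MPIX_Barrier_init(MPI_Comm comm, MPI_Request *request);

/* this PR: info argument added, following the Persistence Working Group direction */
int MPI_Barrier_init(MPI_Comm comm, MPI_Info info, MPI_Request *request);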

My first impression is that, on the first two points, Fujitsu's PR is currently more "Open MPI ready".
That being said (cf. my third point), it might have to be updated with respect to the MPI_Info parameter.
Finally (cf. my fourth point), it could make sense to keep the component from this PR if it brings some value
(e.g. the component is more efficient and/or its algorithms are a better fit than simply relying on (non-persistent) nonblocking collectives).

Just to be clear: the future standard will explicitly prohibit mixing nonblocking collectives and persistent collectives (just like mixing (blocking) collectives and nonblocking collectives is currently prohibited), right?

@kawashima-fj (Member)

Thanks @ggouaillardet; I (the author of #2758) hadn't noticed this PR.

About @ggouaillardet's second point.
My commit 0cbdbe3, which was merged into master in July, enabled non-PML persistent requests. So persistent collectives can now be started using MPI_START even without my PR.
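
In other words, something like the following should already work on master (a sketch assuming the MPIX_ prototypes of #2758 before the info update; the point is only that plain MPI_Start suffices):

MPI_Request req;
MPIX_Barrier_init(MPI_COMM_WORLD, &req);
MPI_Start(&req);                      /* no MPIX_Start needed */
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Request_free(&req);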

About @ggouaillardet's third point.
I am aware that the MPI_Info argument is lacking. I can update my PR next week.

About @ggouaillardet's fourth point.
The default component of my PR is not efficient. If @dholmes-epcc-ed-ac-uk's component is efficient, it should be used. A combination of my framework and @dholmes-epcc-ed-ac-uk's component may be possible.

I'll take a look at this PR.

@kawashima-fj (Member)

@ggouaillardet Yes, the current persistent collectives draft explicitly prohibits mixing nonblocking collectives and persistent collectives.

From the current draft:

Persistent collective calls cannot be matched with blocking or nonblocking collective calls.

@kawashima-fj (Member)

I've updated my PR (#2758) to reflect the latest MPI Standard (addition of info argument).

The libpnbc component in this PR seems to be based on libnbc. The nbpreq component in my PR calls the libnbc component. Therefore both are essentially the same regarding communication.

As @ggouaillardet analyzed:

  • My PR is ready to be merged regarding MPI_START(ALL) and the MPIX_ prefix.
  • This PR has a performance merit regarding request creation, but needs some work regarding MPI_START(ALL), sign-off, unnecessary files, etc.

How about the following procedure?

  1. merge my PR to master
  2. integrate the libpnbc component from this PR into master (and remove the nbpreq component from my PR)
  3. switch from the MPIX_ prefix to the MPI_ prefix (when it is standardized)

But I have one concern: the existing libnbc component and this libpnbc component have similar code, and we would have to maintain both. Fortunately, the request-creation part (the code before calling NBC_Start) and the communication part (NBC_Start and after it) can be split in the current libnbc component. Therefore, refactoring the libnbc component to handle both nonblocking and persistent operations would be better (see the sketch below).
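
Roughly, the split I have in mind looks like this (pseudocode with illustrative helper names, not the actual libnbc internals):

/* nonblocking today: build the schedule, then start it immediately */
int nbc_ibcast(..., MPI_Request *request) {
    build_bcast_schedule(..., &handle);    /* request-creation part */
    return NBC_Start(handle);              /* communication part */
}

/* persistent init: run only the request-creation part */
int pnbc_bcast_init(..., MPI_Request *request) {
    return build_bcast_schedule(..., &handle);
}
/* MPI_Start on such a request then invokes the communication part, NBC_Start(handle) */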

@dholmes-epcc-ed-ac-uk (Author)

The existing libnbc component and this libpnbc component have similar code and we will have to maintain both.

The minimal-maintenance route would appear to be to make each nonblocking call a one-shot persistent call:

MPI_I<thing>(params, request) {
  PMPI_<thing>_init(params, request);
  request.really_persistent = 0;        /* mark as one-shot */
  PMPI_Start(request);
}
MPI_Wait(request) {
  PMPI_Wait(request);
  if (!request.really_persistent) {
    PMPI_Request_free(request);         /* one-shot: free after completion */
  }
}
MPI_Test(request, flag) {
  PMPI_Test(request, flag);
  if (flag && !request.really_persistent) {
    PMPI_Request_free(request);         /* one-shot: free once complete */
  }
}

NB: the above pseudo-code uses the profiling interface for illustration only.

All nonblocking functions could then become trivial to maintain.

@ggouaillardet (Contributor) commented Nov 27, 2017

At this stage, I'd rather merge #2758 (since it is ready to be merged) and then see how to move forward with the valuable bits from this PR.

It seems coll/libpnbc is based on an outdated version of coll/libnbc, so I'd rather do it again than update it to the latest coll/libnbc. I can see two ways of moving forward:

  1. Use a single coll/libnbc component that implements both persistent and nonblocking collectives; as suggested by @dholmes-epcc-ed-ac-uk, nonblocking collectives can be handled as volatile persistent collectives.
  2. Create a dedicated coll/libpnbc component only for persistent collectives, but have most of it automatically generated (pre-processed?) from coll/libnbc in order to ease maintenance.

@ggouaillardet (Contributor)

:bot:mellanox:retest

@ggouaillardet (Contributor) commented Nov 27, 2017

@jsquyres @bosilca any thoughts on how to move forward with respect to the upcoming persistent collectives?

@dholmes-epcc-ed-ac-uk (Author)

@ggouaillardet Whilst #2758 is close in syntax to the persistent collectives proposal currently being considered by the MPI Forum, it contains at least one semantic difference because of its implementation choice. Layering PNBC on top of NBC (as in #2758) forces the collective ordering requirement to be applied to the "starts", whereas the MPI Forum proposal mandates that the "inits" must be collectively ordered but allows the "starts" to be in any order. NB: layering the other way around works semantically.

Thus, option (1), "handle NBC as volatile/one-off/disposable PNBC", works almost trivially, but option (2), "generate PNBC from NBC", seems likely to be problematic/fragile. There is also an option (3): "separate components for PNBC and NBC, with NBC delegating to PNBC in the manner described in #4515 (comment)".

tl;dr

Single/duplicate implementation seems like an easy software engineering choice.
Single/separate component is a coding style choice for the Open MPI community.
Layering/dependencies depends on semantics, which is the purview of the MPI Forum.

query

I would like to understand the comment "it seems coll/libpnbc is based on an outdated version of coll/libnbc" more deeply. What differences did you notice that led to this comment? I am not aware of any development work on coll/libnbc since we took our snapshot (but that may just be me not paying sufficient attention; my apologies if that is the case).

@ggouaillardet (Contributor)

@dholmes-epcc-ed-ac-uk Thanks for the lengthy explanation; it will take me some time to fully understand some of the subtle points.

Meanwhile, you can refer to https://github.com/open-mpi/ompi/commits/master/ompi/mca/coll/libnbc to see the history of the coll/libnbc component. I am pretty sure you took a snapshot from before Jun 20.

@ggouaillardet (Contributor)

@dholmes-epcc-ed-ac-uk Let me see if I got that right with a few examples:

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Start(&req[0]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

My understanding of the current draft is that the barrier is performed before the broadcast
(i.e. collectives are ordered by MPI_*_init() and not by MPI_Start()). Is that correct?

If I am correct so far, what about:

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Waitall(&req[1], MPI_STATUS_IGNORE);

Should the program above hang (since the barrier was not started)?

If the answer is yes, what about:

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Ibcast(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Waitall(&req[1], MPI_STATUS_IGNORE);

My understanding is that this program hangs if nonblocking collectives are (naively) implemented on top of persistent collectives. But is that the behavior mandated by the current draft?

@kawashima-fj (Member)

@dholmes-epcc-ed-ac-uk Now I understand the ordering issue of #2758.

@ggouaillardet The current draft specifies only the ordering of MPI_*_INIT calls; it does not specify the ordering of the communications started by MPI_START or MPI_STARTALL.

From 5.13 Persistent Collective Operations in the current draft:

Initialization calls for MPI persistent collective operations follow all the existing rules for nonblocking collective operations, in particular ordering; programs that do not conform to these restrictions are erroneous. Once initialized, persistent collective operations can be started in any order.

Your first example is correct and the communication order is not defined.
Your second example is correct and only the broadcast is performed.
Your third example is also correct and only the broadcast is performed.

I think a problematic case is the following code. This code must run correctly, but it may not with my #2758.

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
if (rank == 0) {
    MPI_Start(&req[1]);
    MPI_Start(&req[0]);
} else {
    MPI_Start(&req[0]);
    MPI_Start(&req[1]);
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

@kawashima-fj (Member)

@dholmes-epcc-ed-ac-uk I prefer your third option, "separate components for PNBC and NBC, with NBC delegating to PNBC". By doing so, when another optimized PNBC component is added in the future, the delegating NBC component can call both the libpnbc component and the newly added PNBC component.

I'll throw away my nbpreq component in #2758 once the libpnbc component is complete.

@bosilca (Member) commented Nov 28, 2017

@ggouaillardet my understanding is that all your examples are correct. First, the different types of collectives don't match, so the relative order of collectives from different classes is irrelevant. As has been pinpointed above, what matters is the order in which the persistent collectives are declared (initialized), as this allows the underlying runtime to "name" them (so that once they are started, it knows how to match them).
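
As a conceptual illustration of that "naming" (a sketch only; these types and fields are hypothetical, not Open MPI code), each init call on a communicator could draw a sequence number that matching uses instead of the start order:

/* hypothetical types and fields, for illustration only */
typedef struct {
    MPI_Comm comm;
    uint32_t match_id;                /* assigned at init time */
} pcoll_request;

static uint32_t next_match_id = 0;    /* per communicator in a real implementation */

void pcoll_init(MPI_Comm comm, pcoll_request *req) {
    req->comm = comm;
    req->match_id = next_match_id++;  /* "name" the operation now */
}

void pcoll_start(pcoll_request *req) {
    /* every message of this operation carries req->match_id, so the
     * MPI_Start calls may be issued in any order without breaking matching */
}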

@dholmes-epcc-ed-ac-uk (Author)

@ggouaillardet I agree with the assessment from @kawashima-fj - excepting, of course, the deliberate syntax errors in your examples :)

@bosilca whilst it is apparent to a human user that "different types of collectives don't match", this observation is not used in MPI, hence the collective ordering rule in the MPI Standard. See, for example, MPI-3.1 page 197 lines 40-45:

All processes must call collective operations (blocking and nonblocking) in the same order per communicator. In particular, once a process calls a collective operation, all other processes in the communicator must eventually call the same collective operation, and no other collective operation with the same communicator in between. This is consistent with the ordering rules for blocking collective operations in threaded environments.

Specifically, example 5.30 on page 218:

The starting order of collective operations on a particular communicator defines their matching. The following example shows an erroneous matching of different collective operations on the same communicator.

The 4th example (from @kawashima-fj) is also correct and permissible - both operations should complete. Section 5.12 states:

Progression rules for nonblocking collective operations are similar to progression of nonblocking point-to-point operations, refer to Section 3.7.4.

That section (3.7.4) shows an example analogous to the 4th example (from @kawashima-fj), but using point-to-point send and receive operations. For nonblocking collective operations, the example would be erroneous because the collective operations would be started in the wrong order. However, for persistent collective operations, we allow the operations to be started in any order - as long as they were initialized in the correct order.
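
For instance, a point-to-point analogue in that spirit would be (a sketch, not the standard's exact example; the tags let the two ranks issue their operations in opposite orders, yet everything completes):

int a = 0, b = 0;
MPI_Request req[2];
if (rank == 0) {
    MPI_Isend(&a, 1, MPI_INT, 1, /* tag */ 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&b, 1, MPI_INT, 1, /* tag */ 1, MPI_COMM_WORLD, &req[1]);
} else if (rank == 1) {
    MPI_Irecv(&b, 1, MPI_INT, 0, /* tag */ 1, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&a, 1, MPI_INT, 0, /* tag */ 0, MPI_COMM_WORLD, &req[1]);
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
/* For persistent collectives, the ordered *_init calls play the role of the
 * tags, so MPI_Start may likewise be issued in any order. */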

@dholmes-epcc-ed-ac-uk (Author)

@ggouaillardet Having investigated the provenance of our code, it seems we took our original snapshot of coll/libnbc approximately 1 year and 8 months ago, from commit d30911f, although there have been several rebases and merges since then.

I agree with your preference to "do it again" rather than to "update to the latest coll/libnbc".

@bosilca (Member) commented Nov 28, 2017

@dholmes-epcc-ed-ac-uk it is my understanding that the sentence you cite (page 197, line 39) also states that "different types of collectives don't match", albeit only referring to blocking and nonblocking.

@dholmes-epcc-ed-ac-uk (Author)

@bosilca the preceding sentence, which covers the point you are alluding to, reads:

Unlike point-to-point operations, nonblocking collective operations do not match with blocking collective operations, and collective operations do not have a tag argument.

The "do not match" statement is specifically related to blocking and nonblocking not matching each other, and does not generically refer to "types" of collective operation. The subsequent rationale and advice to users (p199) gives the reason and a workaround for this specific restriction.

The "ordering rules for blocking collective operations in threaded environments" referred to in the chapter 5 text can be found in chapter 12, p486, lines 18-22:

Matching of collective calls on a communicator, window, or file handle is done according to the order in which the calls are issued at each process. If concurrent threads issue such calls on the same communicator, window or file handle, it is up to the user to make sure the calls are correctly ordered, using interthread synchronization.

This also does not mention "types" of collective calls.

@kawashima-fj (Member)

#4618 was merged. Close.
