
Libpnbc - persistent collectives for Open MPI #4515

Closed


@dholmes-epcc-ed-ac-uk (Author)

Following the publication of our paper at EuroMPI/USA 2017, we would like to offer our reference implementation of MPI persistent collective operations to the Open MPI community.

We intend to pursue further work on this implementation to optimise these new operations.

Known issues: we have introduced a new (temporary) function MPIX_START that performs the start functionality for persistent collective requests. This should eventually be integrated into the existing MPI_START function, which is currently defined only for point-to-point persistent requests. We believe the community should discuss how to achieve this integration; in particular, should that function be relocated to a more general position in the directory structure?
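
For illustration, the intended calling sequence looks something like this (a sketch only, assuming the prototypes used in this PR, including the MPI_Info argument; MPIX_Start is the temporary function described above):

MPI_Request req;
MPI_Barrier_init(MPI_COMM_WORLD, MPI_INFO_NULL, &req);  /* collective initialisation */
for (int i = 0; i < 10; i++) {
    MPIX_Start(&req);                    /* temporary stand-in for MPI_Start */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the request remains allocated for reuse */
}
MPI_Request_free(&req);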

This reference implementation is not coded as an extension - it is intended to be merged only once the MPI Forum adds these operations to the MPI Standard.

Please contact the members of the Persistence Working Group (using [email protected]) with questions, suggestions, or additional work needed to make this ready for acceptance into Open MPI.

@ompiteam-bot

Can one of the admins verify this patch?

@hppritcha (Member)

test this please

@hppritcha (Member)

The first thing that needs to be done: all commits must be signed off. See https://github.com/open-mpi/ompi/wiki/Admistrative-rules#contributors-declaration

It also looks like pineighbor_alltoallw_init.c is failing to compile in the Mellanox CI; the failure doesn't look specific to the Mellanox setup.

@ompiteam-bot

All Tests Passed!

@bosilca (Member) commented Nov 18, 2017

A few quick comments before looking thoughtfully at the code:

  • remove all unnecessary files (for example, .cproject)
  • all newly added functions should be prefixed with MPIX_, and the code should be converted to an OMPI extension (take a look in ompi/mpiext, or feel free to ping us on ompi-devel for more details)
  • I don't think it is necessary to add the entire copyright of all institutions to all files (this also applies to the license); please follow the examples in the other files

@jsquyres (Member)

@dholmes-epcc-ed-ac-uk Also note that there are some commits from [email protected], which don't provide any accountability. I think you (and/or @BradleyMorgan?) might need to clean up this PR a bit before we can evaluate it properly.

Thanks!

@jjhursey (Member)

How does this PR relate to Fujitsu's PR #2758?

@ggouaillardet (Contributor)

@jjhursey this is an alternative to #2758, which was discussed at SC'17 after the BOF.

@kawashima-fj FYI

At first glance, I noted the following differences:

  1. MPI_* vs MPIX_*: Fujitsu uses the MPIX_* names, so that PR could be merged now; this PR uses the MPI_* names (and, in all fairness, makes it clear that it should only be merged once persistent collectives are part of the standard).
  2. MPI_Start() vs MPIX_Start(): this PR requires the use of MPIX_Start(), whereas Fujitsu uses the standard MPI_Start(). I understand this is work in progress here, and the goal is to use the single MPI_Start().
  3. MPI_Info: this PR adds an MPI_Info argument to all subroutines, but Fujitsu does not. I guess this reflects the latest direction taken by the Persistence Working Group (see the sketch after this list).
  4. Default module: whereas Fujitsu provides a naive component that stores the parameters and relies on the (non-persistent) nonblocking collectives implementation to do the communication, the component in this PR looks self-sufficient.
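
To make point 3 concrete, the prototypes differ roughly as follows (a sketch from memory; treat the exact parameter order as an assumption rather than a quote from either PR):

/* Fujitsu #2758: no info argument */
int MPIX_Barrier_init(MPI_Comm comm, MPI_Request *request);

/* this PR: info argument added, following the Persistence Working Group direction */
int MPI_Barrier_init(MPI_Comm comm, MPI_Info info, MPI_Request *request);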

My first impression is that, on the first two points, Fujitsu's PR is currently more "Open MPI ready".
That being said (cf. my third point), it might have to be updated with respect to the MPI_Info parameter.
Finally (cf. my fourth point), it could make sense to keep the component from this PR if it brings some value
(e.g. the component is more efficient and/or its algorithms are a better fit than simply relying on (non-persistent) nonblocking collectives).

Just to be clear: the future standard will explicitly prohibit mixing nonblocking collectives and persistent collectives (just like mixing (blocking) collectives and nonblocking collectives is currently prohibited), right?

@kawashima-fj (Member)

Thanks @ggouaillardet; I (the author of #2758) hadn't noticed this PR.

About @ggouaillardet's second point.
My commit 0cbdbe3, which was merged into master in July, enabled non-PML persistent requests. So persistent collectives can now be started using MPI_START even without my PR.
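
In other words, something like the following should already work on master (a sketch assuming the MPIX_ prototypes of #2758 before the info update; the point is only that plain MPI_Start suffices):

MPI_Request req;
MPIX_Barrier_init(MPI_COMM_WORLD, &req);
MPI_Start(&req);                      /* no MPIX_Start needed */
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Request_free(&req);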

About @ggouaillardet's third point.
I am aware that the MPI_Info argument is lacking. I can update my PR next week.

About @ggouaillardet's fourth point.
The default component of my PR is not efficient. If @dholmes-epcc-ed-ac-uk's component is efficient, it should be used. A combination of my framework and @dholmes-epcc-ed-ac-uk's component may be possible.

I'll take a look at this PR.

@kawashima-fj (Member)

@ggouaillardet Yes, the current persistent collectives draft explicitly prohibits mixing nonblocking collectives and persistent collectives.

From the current draft:

Persistent collective calls cannot be matched with blocking or nonblocking collective calls.

@kawashima-fj (Member)

I've updated my PR (#2758) to reflect the latest MPI Standard (addition of info argument).

The libpnbc component in this PR seems to be based on libnbc. The nbpreq component in my PR calls the libnbc component. Therefore both are essentially the same regarding communication.

As @ggouaillardet analyzed:

  • My PR is ready to be merged regarding MPI_START(ALL) and the MPIX_ prefix.
  • This PR has a performance merit regarding request creation, but needs some work regarding MPI_START(ALL), sign-off, unnecessary files, etc.

How about the following procedure?

  1. merge my PR to master
  2. integrate the libpnbc component from this PR into master (and remove the nbpreq component from my PR)
  3. switch from the MPIX_ prefix to the MPI_ prefix (when it is standardized)

But I have one concern: the existing libnbc component and this libpnbc component have similar code, and we would have to maintain both. Fortunately, the request-creation part (the code before calling NBC_Start) and the communication part (NBC_Start and after it) can be split in the current libnbc component. Therefore, refactoring the libnbc component to handle both nonblocking and persistent operations would be better (see the sketch below).
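
Roughly, the split I have in mind looks like this (pseudocode with illustrative helper names, not the actual libnbc internals):

/* nonblocking today: build the schedule, then start it immediately */
int nbc_ibcast(..., MPI_Request *request) {
    build_bcast_schedule(..., &handle);    /* request-creation part */
    return NBC_Start(handle);              /* communication part */
}

/* persistent init: run only the request-creation part */
int pnbc_bcast_init(..., MPI_Request *request) {
    return build_bcast_schedule(..., &handle);
}
/* MPI_Start on such a request then invokes the communication part, NBC_Start(handle) */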

@dholmes-epcc-ed-ac-uk (Author)

The existing libnbc component and this libpnbc component have similar code and we will have to maintain both.

The minimal-maintenance route would appear to be to make each nonblocking call a one-shot persistent call:

MPI_I<thing>(params, request) {
  PMPI_<thing>_init(params, request);
  request.really_persistent = 0;        /* mark as one-shot */
  PMPI_Start(request);
}
MPI_Wait(request) {
  PMPI_Wait(request);
  if (!request.really_persistent) {
    PMPI_Request_free(request);         /* one-shot: free after completion */
  }
}
MPI_Test(request, flag) {
  PMPI_Test(request, flag);
  if (flag && !request.really_persistent) {
    PMPI_Request_free(request);         /* one-shot: free once complete */
  }
}

NB: the above pseudo-code uses the profiling interface for illustration only.

All nonblocking functions could then become trivial to maintain.

@ggouaillardet (Contributor) commented Nov 27, 2017

At this stage, I'd rather merge #2758 (since it is ready to be merged) and then see how to move forward with the valuable bits from this PR.

It seems coll/libpnbc is based on an outdated version of coll/libnbc, so I'd rather do it again than update it to the latest coll/libnbc. I can see two ways of moving forward:

  1. Use a single coll/libnbc component that implements both persistent and nonblocking collectives; as suggested by @dholmes-epcc-ed-ac-uk, nonblocking collectives can be handled as volatile persistent collectives.
  2. Create a dedicated coll/libpnbc component only for persistent collectives, but have most of it automatically generated (pre-processed?) from coll/libnbc in order to ease maintenance.

@ggouaillardet (Contributor)

:bot:mellanox:retest

@ggouaillardet (Contributor) commented Nov 27, 2017

@jsquyres @bosilca any thoughts on how to move forward with respect to the upcoming persistent collectives?

@dholmes-epcc-ed-ac-uk (Author)

@ggouaillardet Whilst #2758 is close in syntax to the persistent collectives proposal currently being considered by the MPI Forum, it contains at least one semantic difference because of its implementation choice. Layering PNBC on top of NBC (as in #2758) forces the collective ordering requirement to be applied to the "starts", whereas the MPI Forum proposal mandates that the "inits" must be collectively ordered but allows the "starts" to be in any order. NB: layering the other way around works semantically.

Thus, option (1), "handle NBC as volatile/one-off/disposable PNBC", works almost trivially, but option (2), "generate PNBC from NBC", seems likely to be problematic/fragile. There is also an option (3): "separate components for PNBC and NBC, with NBC delegating to PNBC in the manner described in #4515 (comment)".

tl;dr

Single/duplicate implementation seems like an easy software engineering choice.
Single/separate component is a coding style choice for the Open MPI community.
Layering/dependencies depends on semantics, which is the purview of the MPI Forum.

query

I would like to understand the comment "it seems coll/libpnbc is based on an outdated version of coll/libnbc" more deeply. What differences did you notice that led to this comment? I am not aware of any development work on coll/libnbc since we took our snapshot (but that may just be me not paying sufficient attention; my apologies if that is the case).

@ggouaillardet (Contributor)

@dholmes-epcc-ed-ac-uk Thanks for the lengthy explanation; it will take me some time to fully understand some of the subtle points.

Meanwhile, you can refer to https://github.com/open-mpi/ompi/commits/master/ompi/mca/coll/libnbc to see the history of the coll/libnbc component. I am pretty sure you took a snapshot from before Jun 20.

@ggouaillardet (Contributor)

@dholmes-epcc-ed-ac-uk Let me see if I got that right with a few examples:

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Start(&req[0]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

My understanding of the current draft is that the barrier is performed before the broadcast
(i.e. collectives are ordered by MPI_*_init() and not by MPI_Start()). Is that correct?

If I am correct so far, what about:

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Waitall(&req[1], MPI_STATUS_IGNORE);

Should the program above hang (since the barrier was not started)?

If the answer is yes, what about:

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Ibcast(..., MPI_COMM_WORLD, &req[1]);
MPI_Start(&req[1]);
MPI_Waitall(&req[1], MPI_STATUS_IGNORE);

My understanding is that this program hangs if nonblocking collectives are (naively) implemented on top of persistent collectives. But is that the behavior mandated by the current draft?

@kawashima-fj (Member)

@dholmes-epcc-ed-ac-uk Now I understand the ordering issue of #2758.

@ggouaillardet The current draft specifies only the ordering of MPI_*_INIT calls; it does not specify the ordering of the communications started by MPI_START or MPI_STARTALL.

From 5.13 Persistent Collective Operations in the current draft:

Initialization calls for MPI persistent collective operations follow all the existing rules for nonblocking collective operations, in particular ordering; programs that do not conform to these restrictions are erroneous. Once initialized, persistent collective operations can be started in any order.

Your first example is correct and the communication order is not defined.
Your second example is correct and only the broadcast is performed.
Your third example is also correct and only the broadcast is performed.

I think a problematic case is the following code. This code must run correctly, but it may not with my #2758.

MPI_Barrier_init(MPI_COMM_WORLD, &req[0]);
MPI_Bcast_init(..., MPI_COMM_WORLD, &req[1]);
if (rank == 0) {
    MPI_Start(&req[1]);
    MPI_Start(&req[0]);
} else {
    MPI_Start(&req[0]);
    MPI_Start(&req[1]);
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

@kawashima-fj (Member)

@dholmes-epcc-ed-ac-uk I prefer your third option, "separate components for PNBC and NBC, with NBC delegating to PNBC". By doing so, when another optimized PNBC component is added in the future, the delegating NBC component can call both the libpnbc component and the newly added PNBC component.

I'll throw away my nbpreq component in #2758 once the libpnbc component is complete.

@bosilca (Member) commented Nov 28, 2017

@ggouaillardet my understanding is that all your examples are correct. First, the different types of collectives don't match, so the relative order of collectives from different classes is irrelevant. As has been pinpointed above, what matters is the order in which the persistent collectives are declared (initialized), as this allows the underlying runtime to "name" them (so that once they are started, it knows how to match them).
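
As a conceptual illustration of that "naming" (a sketch only; these types and fields are hypothetical, not Open MPI code), each init call on a communicator could draw a sequence number that matching uses instead of the start order:

/* hypothetical types and fields, for illustration only */
typedef struct {
    MPI_Comm comm;
    uint32_t match_id;                /* assigned at init time */
} pcoll_request;

static uint32_t next_match_id = 0;    /* per communicator in a real implementation */

void pcoll_init(MPI_Comm comm, pcoll_request *req) {
    req->comm = comm;
    req->match_id = next_match_id++;  /* "name" the operation now */
}

void pcoll_start(pcoll_request *req) {
    /* every message of this operation carries req->match_id, so the
     * MPI_Start calls may be issued in any order without breaking matching */
}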

@dholmes-epcc-ed-ac-uk (Author)

@ggouaillardet I agree with the assessment from @kawashima-fj - excepting, of course, the deliberate syntax errors in your examples :)

@bosilca whilst it is apparent to a human user that "different types of collectives don't match", this observation is not used in MPI, hence the collective ordering rule in the MPI Standard. See, for example, MPI-3.1 page 197 lines 40-45:

All processes must call collective operations (blocking and nonblocking) in the same order per communicator. In particular, once a process calls a collective operation, all other processes in the communicator must eventually call the same collective operation, and no other collective operation with the same communicator in between. This is consistent with the ordering rules for blocking collective operations in threaded environments.

Specifically, example 5.30 on page 218:

The starting order of collective operations on a particular communicator defines their matching. The following example shows an erroneous matching of different collective operations on the same communicator.

The 4th example (from @kawashima-fj) is also correct and permissible - both operations should complete. Section 5.12 states:

Progression rules for nonblocking collective operations are similar to progression of nonblocking point-to-point operations, refer to Section 3.7.4.

That section (3.7.4) shows an example analogous to the 4th example (from @kawashima-fj), but using point-to-point send and receive operations. For nonblocking collective operations, the example would be erroneous because the collective operations would be started in the wrong order. However, for persistent collective operations, we allow the operations to be started in any order - as long as they were initialized in the correct order.
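
For instance, a point-to-point analogue in that spirit would be (a sketch, not the standard's exact example; the tags let the two ranks issue their operations in opposite orders, yet everything completes):

int a = 0, b = 0;
MPI_Request req[2];
if (rank == 0) {
    MPI_Isend(&a, 1, MPI_INT, 1, /* tag */ 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&b, 1, MPI_INT, 1, /* tag */ 1, MPI_COMM_WORLD, &req[1]);
} else if (rank == 1) {
    MPI_Irecv(&b, 1, MPI_INT, 0, /* tag */ 1, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&a, 1, MPI_INT, 0, /* tag */ 0, MPI_COMM_WORLD, &req[1]);
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
/* For persistent collectives, the ordered *_init calls play the role of the
 * tags, so MPI_Start may likewise be issued in any order. */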

@dholmes-epcc-ed-ac-uk (Author)

@ggouaillardet Having investigated the provenance of our code, it seems we took our original snapshot of coll/libnbc approximately 1 year and 8 months ago, from commit d30911f, although there have been several rebases and merges since then.

I agree with your preference to "do it again" rather than to "update to the latest coll/libnbc".

@bosilca (Member) commented Nov 28, 2017

@dholmes-epcc-ed-ac-uk it is my understanding that the sentence you cite (page 197, line 39) also states that "different types of collectives don't match", albeit only referring to blocking and nonblocking.

@dholmes-epcc-ed-ac-uk (Author)

@bosilca the preceding sentence, which covers the point you are alluding to, reads:

Unlike point-to-point operations, nonblocking collective operations do not match with blocking collective operations, and collective operations do not have a tag argument.

The "do not match" statement is specifically related to blocking and nonblocking not matching each other, and does not generically refer to "types" of collective operation. The subsequent rationale and advice to users (p199) gives the reason and a workaround for this specific restriction.

The "ordering rules for blocking collective operations in threaded environments" referred to in the chapter 5 text can be found in chapter 12, p486, lines 18-22:

Matching of collective calls on a communicator, window, or file handle is done according to the order in which the calls are issued at each process. If concurrent threads issue such calls on the same communicator, window or file handle, it is up to the user to make sure the calls are correctly ordered, using interthread synchronization.

This also does not mention "types" of collective calls.

@kawashima-fj (Member)

#4618 was merged. Close.
