
ulfm: add MPIX_Comm_get_failed #5923

Merged
merged 3 commits into from Apr 4, 2022
Conversation

hzhou
Contributor

@hzhou hzhou commented Apr 2, 2022

Pull Request Description

Add MPIX_Comm_get_failed. This function is in the current ULFM working proposal.
Ref: mpi-forum/mpi-issues#20 (comment)

This is a discovery function and the basis of most ULFM applications. Let's support it.

Fixes: #5788

[skip warnings]
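For context, a minimal usage sketch of the discovery pattern (not code from this PR; assumes a ULFM-enabled MPICH build and elides error handling):

```c
#include <mpi.h>
#include <stdio.h>

/* Discover which processes on comm are known to have failed. */
static void report_failures(MPI_Comm comm)
{
    MPI_Group failed_group;
    int nfailed;

    MPIX_Comm_get_failed(comm, &failed_group);
    MPI_Group_size(failed_group, &nfailed);
    printf("%d process(es) known to have failed\n", nfailed);
    MPI_Group_free(&failed_group);
}
```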

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2204_get_failed branch 2 times, most recently from 8231dfd to ac5c20a on April 2, 2022 14:49
@hzhou
Contributor Author

hzhou commented Apr 2, 2022

A key point here: when a sending process is killed, a recv from that process should return MPIX_ERR_PROC_FAILED. That is not working with ch4 in my testing. I wonder how it worked originally.

EDIT: ch3 works! So whatever mechanism we designed does work. Just need to add it to ch4...

EDIT: I guess ch3's connection events go through the same progress loop, so it is able to abort the receive request on disconnect. That is specific to connection-based protocols -- true for TCP, not true for datagrams in general.

EDIT: I guess the fundamental difference is that in ch4 we do not always have a posted message queue. We maintain one with debugger support -- how expensive would it be if we always did that?
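For illustration, a sketch of the expected semantics (not code from this PR): with MPI_ERRORS_RETURN installed, a receive matching a dead peer should complete with error class MPIX_ERR_PROC_FAILED instead of hanging:

```c
#include <mpi.h>

/* Returns 1 if the receive failed because the peer process died. */
static int recv_detects_failure(MPI_Comm comm, int src)
{
    int buf, err, errclass;

    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    err = MPI_Recv(&buf, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
    if (err != MPI_SUCCESS) {
        MPI_Error_class(err, &errclass);
        if (errclass == MPIX_ERR_PROC_FAILED)
            return 1;   /* peer failure detected */
    }
    return 0;
}
```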

@hzhou
Contributor Author

hzhou commented Apr 2, 2022

test:mpich/ch3/most ✔️

@hzhou hzhou force-pushed the 2204_get_failed branch from ac5c20a to b53e4e0 on April 3, 2022 02:49
@hzhou
Contributor Author

hzhou commented Apr 3, 2022

test:mpich/ch4/most
test:mpich/ch3/most

@hzhou hzhou requested a review from raffenet April 4, 2022 19:20
hzhou added 3 commits April 4, 2022 15:16
This routine --
   int MPI_Comm_get_failed(MPI_Comm comm, MPI_Group *failed_group)
-- which returns a group of failed processes, is in the current ULFM
working proposal.

Gather ULFM-related functions together. These functions are related and
unstable; keeping them together reduces conflicts when we change them.

This is adapted from ft/failure_ack.c. The current ft tests are disabled
wholesale. Add it to impls to get it tested; MPIX functions need to be in
the impls folder anyway.

Unlike ft/failure_ack.c, this test only tests the functionality of
MPIX_Comm_get_failed. It does not test other fault-tolerance behaviors,
such as MPI_Recv and MPI_Finalize when some of the processes abruptly
exit/fail.
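A rough sketch of the shape of such a test (illustrative only; the actual test is adapted from ft/failure_ack.c in the MPICH test suite). One rank exits abruptly and a survivor queries the failed group; running this requires a launcher that tolerates process failure:

```c
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nfailed;
    MPI_Group failed_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)
        exit(1);    /* simulate an abrupt process failure */

    /* In a real test the survivors would first trigger failure
     * detection (e.g., by communicating with the dead rank). */
    MPIX_Comm_get_failed(MPI_COMM_WORLD, &failed_group);
    MPI_Group_size(failed_group, &nfailed);
    if (rank == 0)
        printf("failed group size: %d\n", nfailed);

    MPI_Group_free(&failed_group);
    MPI_Finalize();
    return 0;
}
```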
@hzhou hzhou force-pushed the 2204_get_failed branch from b53e4e0 to 635a7a9 on April 4, 2022 20:16
@hzhou hzhou merged commit 0b4cfd6 into pmodels:main Apr 4, 2022
@hzhou hzhou deleted the 2204_get_failed branch April 4, 2022 20:23