spawn: better error handling when connection time out #5816

Merged
merged 5 commits into pmodels:main from 2201_trecv_cancel
Jul 15, 2022

Conversation

Contributor

@hzhou hzhou commented Jan 30, 2022

Pull Request Description

When the user sets a timeout for MPI_Comm_connect or MPI_Comm_accept, we need
to cancel the pending operations and return the proper error code so that the
user can check for it and handle it appropriately.
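
A minimal sketch of the pattern described above, written as plain C so that it compiles without MPI: poll a pending operation up to a limit and, if it has not completed by then, cancel it and return a distinct timeout code. Every name in it (`pending_op_t`, `op_test`, `op_cancel`, `wait_with_timeout`, `SKETCH_ERR_TIMEOUT`) is hypothetical and stands in for MPICH internals that are not shown in this conversation. A poll count replaces a wall-clock deadline to keep the example deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for this sketch; MPICH's real values differ. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_TIMEOUT = 1 };

/* Hypothetical stand-in for a pending connect/accept handshake. */
typedef struct {
    int polls_until_done;   /* polls still needed before the peer answers */
    bool cancelled;         /* set once the operation has been cancelled */
} pending_op_t;

/* Poll once; returns true when the operation has completed. */
bool op_test(pending_op_t *op)
{
    if (op->polls_until_done > 0)
        op->polls_until_done--;
    return op->polls_until_done == 0;
}

/* Mark the operation as cancelled so it is no longer left pending. */
void op_cancel(pending_op_t *op)
{
    op->cancelled = true;
}

/* Poll at most max_polls times; on expiry, cancel and report a timeout. */
int wait_with_timeout(pending_op_t *op, int max_polls)
{
    assert(op != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (op_test(op))
            return SKETCH_SUCCESS;
    }
    op_cancel(op);
    return SKETCH_ERR_TIMEOUT;
}
```

The point of the sketch is the ordering: the cancel happens before the error is returned, so the caller never sees a timeout while the operation is still outstanding.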

Depends on #5819
Fixes #5815

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2201_trecv_cancel branch 5 times, most recently from 933ae45 to 9809762 Compare January 30, 2022 23:03
Contributor Author

hzhou commented Jan 30, 2022

test:mpich/ch3/tcp
test:mpich/ch4/ofi

@hzhou hzhou force-pushed the 2201_trecv_cancel branch from 9809762 to 262f2a9 Compare February 1, 2022 21:07
Contributor Author

hzhou commented Feb 1, 2022

test:mpich/ch3/tcp
test:mpich/ch4/ofi
✔️

@hzhou hzhou force-pushed the 2201_trecv_cancel branch from 262f2a9 to a658bdb Compare February 2, 2022 04:29
Contributor Author

hzhou commented Feb 2, 2022

test:mpich/warnings/auto

@hzhou hzhou requested a review from raffenet February 4, 2022 14:47
@hzhou hzhou force-pushed the 2201_trecv_cancel branch from a658bdb to 4054b46 Compare February 9, 2022 02:10
Contributor Author

hzhou commented Feb 9, 2022

Rebased. One commit (maint: remove unused MPIR_MAX_ERROR_CLASS_INDEX) was missed in #5819.

hzhou added 5 commits July 14, 2022 21:56
We give the user an option to set a timeout in MPI_Comm_connect and
MPI_Comm_accept. The user will need to check the error return when a timeout
occurs. Create MPIX_ERR_TIMEOUT for this purpose.

When the user sets a timeout for MPI_Comm_connect or MPI_Comm_accept, we need
to cancel the pending operations and return the proper error code so that the
user can check for it and handle it appropriately.

When MPI_Comm_connect times out, by convention we return the error class
MPI_ERR_PORT.

The code that checks for a valid error class was imprecise. Add the
is_valid_err_class macro and clean it up.
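
To illustrate what a precise error class check can look like, here is a small sketch in plain C. It assumes a layout in which standard classes and extension classes occupy two separate ranges with a gap between them. The constants, their values, and the macro name `sketch_is_valid_err_class` are all hypothetical; this is not the macro added by this PR, and it does not describe what was imprecise in the original code.

```c
#include <assert.h>

/* Hypothetical class layout: standard classes, a gap, then extension classes. */
#define SKETCH_ERR_LAST_STD    74
#define SKETCH_ERR_FIRST_EXT  100
#define SKETCH_ERR_TIMEOUT    (SKETCH_ERR_FIRST_EXT + 0)
#define SKETCH_ERR_LAST_EXT   SKETCH_ERR_TIMEOUT

/* The two ranges must not touch, otherwise the gap check below is moot. */
static_assert(SKETCH_ERR_LAST_STD < SKETCH_ERR_FIRST_EXT,
              "class ranges must not overlap");

/* True only for values inside one of the two ranges.
 * Note: the argument is evaluated more than once, so avoid side effects. */
#define sketch_is_valid_err_class(c) \
    (((c) >= 0 && (c) <= SKETCH_ERR_LAST_STD) || \
     ((c) >= SKETCH_ERR_FIRST_EXT && (c) <= SKETCH_ERR_LAST_EXT))
```

With this layout, a single upper-bound comparison against `SKETCH_ERR_LAST_EXT` would also accept values in the gap, such as 80, which is why the sketch tests both ranges explicitly.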
@hzhou hzhou force-pushed the 2201_trecv_cancel branch from 4054b46 to 70978d0 Compare July 15, 2022 03:07
Contributor Author

hzhou commented Jul 15, 2022

test:mpich/ch3/most
test:mpich/ch4/most

@hzhou hzhou requested a review from raffenet July 15, 2022 03:07
Contributor

@raffenet raffenet left a comment

Changes look OK to me. Is the segfault in ch3 configurations in Jenkins a known issue?

Contributor Author

hzhou commented Jul 15, 2022

> Changes look OK to me. Is the segfault in ch3 configurations in Jenkins a known issue?

I've seen that failure in unrelated PRs before, but it seems more frequent here. Let me re-run the test to confirm.

Contributor Author

hzhou commented Jul 15, 2022

test:mpich/ch3/most

All clean ✔️ .

@hzhou hzhou merged commit e8d9438 into pmodels:main Jul 15, 2022
@hzhou hzhou deleted the 2201_trecv_cancel branch July 15, 2022 17:21
Development

Successfully merging this pull request may close these issues.

ch4/ofi: better error handling in MPI_Comm_{accept,connect}
2 participants