
coll/libnbc: Fix error path on internal error #2245

Merged (1 commit) on Oct 21, 2016

Conversation

Opened by @jjhursey (Member)

bosilca commented Oct 18, 2016

Forgive me for my light understanding of the internal workings of libnbc. If I remember correctly, between two NBC scheduling barriers several operations might have been posted. As far as I can see, this patch does not even try to cancel or clean them up, focusing entirely on propagating the error code to the user. Since your request will eventually be removed, if any of the pending requests complete it will automatically lead to segfaults, basically prohibiting the use of any error handler other than the default ABORT one. Right?

@jjhursey

Yes, with the current version of the PR I'm not trying to clean up the other possibly outstanding internal requests. This version just makes sure that we propagate an error on the user request, so that the default ABORT error handler can be triggered at the MPI API level. That seems to be working now.

What we should do is cancel/complete the outstanding internal requests before marking the external request as complete.

The part I'm pondering at the moment is that we would need to call cancel on all of the other outstanding internal requests, then wait for them to complete. Since we are in a progress function, we can't wait at this location, so we would need to add a 'draining'/'canceling' state to the handle so that successive progress calls keep checking for completion until all of the requests are done (see the sketch below). Since not all PMLs support cancel (I believe they return OMPI_SUCCESS regardless), is it possible that we could hang here?
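
Roughly the shape I have in mind, as a sketch only (the handle fields, state names, and this helper are invented for illustration and are not the actual libnbc code):

```c
#include <mpi.h>

/* Sketch only: 'nbc_handle_t', its fields, and the DRAINING state are
 * invented names for illustration, not the actual libnbc structures. */
typedef enum { NBC_ACTIVE, NBC_DRAINING } nbc_state_t;

typedef struct {
    nbc_state_t  state;
    int          error;          /* first error observed internally */
    int          num_inflight;   /* internal point-to-point requests */
    MPI_Request *inflight;
} nbc_handle_t;

/* Called from the (non-blocking) progress path; must never wait. */
static int nbc_progress_on_error(nbc_handle_t *h, int first_error)
{
    if (NBC_ACTIVE == h->state && MPI_SUCCESS != first_error) {
        /* Error detected: try to cancel everything still in flight and
         * switch to a draining state instead of blocking here. */
        h->error = first_error;
        for (int i = 0; i < h->num_inflight; ++i) {
            if (MPI_REQUEST_NULL != h->inflight[i]) {
                MPI_Cancel(&h->inflight[i]);  /* may be a no-op in some PMLs */
            }
        }
        h->state = NBC_DRAINING;
    }

    if (NBC_DRAINING == h->state) {
        /* On every subsequent progress call, test (never wait) for the
         * cancelled requests; only once they are all gone do we complete
         * the user-visible request with the saved error code. */
        int all_done = 1;
        for (int i = 0; i < h->num_inflight; ++i) {
            if (MPI_REQUEST_NULL != h->inflight[i]) {
                int done = 0;
                MPI_Test(&h->inflight[i], &done, MPI_STATUS_IGNORE);
                if (!done) {
                    all_done = 0;  /* a send whose cancel was ignored and that
                                    * never matches would keep us here forever */
                }
            }
        }
        if (all_done) {
            /* complete the external request here with MPI_ERROR = h->error */
            return h->error;
        }
    }
    return MPI_SUCCESS;
}
```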

bosilca commented Oct 19, 2016

For the PMLs that do not support cancel, the request is not internally marked as cancelled (the _cancelled field in the request status), so this should provide a hint about what to expect from the request. However, if we assume that send cancel will be removed from the next version of the MPI standard, this makes the problem almost unsolvable through MPI concepts, as all the sends must now have a matching receive (and if they are not correctly matched they will match the next collective).
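
At the MPI level, the way to find out whether a cancel took effect is to complete the request and then ask the status (a sketch, not libnbc code):

```c
#include <mpi.h>

/* Sketch: after cancelling, complete the request and check whether the
 * cancel actually took effect. A PML that does not implement cancel will
 * simply not mark the status as cancelled. */
static int try_cancel(MPI_Request *req)
{
    MPI_Status status;
    int cancelled = 0;

    MPI_Cancel(req);
    /* Caveat: if the cancel is a no-op and the send never matches,
     * this wait is exactly where a drain could hang. */
    MPI_Wait(req, &status);

    /* cancelled == 0 means the message was (or still will be) delivered;
     * left unmatched, it can match the next collective on this communicator. */
    MPI_Test_cancelled(&status, &cancelled);
    return cancelled;
}
```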

jjhursey commented Oct 20, 2016

@bosilca I added a new commit that attempts to clean up the outstanding requests. It also improves the error message that is printed. What do you think of that?

It did improve the error message displayed as part of Issue #2256

MPI Error in MPI_Testall() (req 0 = 15)

Error code 15 is MPI_ERR_TRUNCATE, which is coming from the PML.
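
For reference, with a non-aborting error handler an application can see that same per-request error along these lines (just a sketch, not the reproducer from #2256):

```c
/* Sketch: how an application using MPI_ERRORS_RETURN could observe the
 * per-request error that libnbc now propagates (e.g. req 0 = MPI_ERR_TRUNCATE). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int buf[4] = {0};
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Ibcast(buf, 4, MPI_INT, 0, comm, &reqs[0]);
    MPI_Ibarrier(comm, &reqs[1]);

    int flag = 0;
    int rc = MPI_Testall(2, reqs, &flag, stats);
    while (MPI_SUCCESS == rc && !flag) {
        rc = MPI_Testall(2, reqs, &flag, stats);
    }
    if (MPI_ERR_IN_STATUS == rc) {
        for (int i = 0; i < 2; ++i) {
            if (MPI_SUCCESS != stats[i].MPI_ERROR) {
                fprintf(stderr, "MPI Error in MPI_Testall() (req %d = %d)\n",
                        i, stats[i].MPI_ERROR);
            }
        }
    }

    MPI_Finalize();
    return 0;
}
```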

bosilca commented Oct 21, 2016

The error message suggests that all non-blocking communications are unsafe, when in fact only the non-blocking operations on this communicator might be affected. Otherwise it looks good.

 * If an error is detected internal to libnbc (e.g., PML truncation error)
   this patch makes sure that the request is completed and the `MPI_ERROR`
   field is set appropriately.
 * Make an attempt to cleanup outstanding requests before returning.
   - This is a "best attempt" since not all PMLs support canceling requests.
@jjhursey force-pushed the topic/libnbc-error-path branch from 4595eaf to 8748e54 on October 21, 2016 15:42
@jjhursey

@bosilca Good point, I updated the error message to clarify that it is specific to that communicator.

The Travis CI failure and the Mellanox CI hang are a result of the GitHub DNS issues. This is ready to merge.
