-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coll/libnbc: Fix error path on internal error #2245
Conversation
Forgive me for my light understanding of the internal working of libnbc. If I correctly remember between 2 NBC scheduling barriers several operations might have been posted. As far as I see in this patch you are not even trying to cancel or clean them up, instead entierly focusing on propagating the error code to the user. As your request will be eventually removed, if any of the pending requests complete it will automatically lead to segfaults, basically prohibiting the usage of any error handled other than the default ABORT one. Right ? |
Yes, with the current version of the PR I'm not trying to cleanup the other possible outstanding internal requests. This version just makes sure that we propagate an error on the user request, so that the default ABORT error handler can be triggered at the MPI API level. That seems to be working now. What we should do is cancel/complete the outstanding internal requests before marking the external request as complete. The part I'm pondering at the moment is that we would need to call cancel on all of the other outstanding internal requests, then wait on them to complete. Since we are in a progress function, we can't wait at this location so we would need to add a 'draining'/'canceling' status to the handle. So that on successive calls we keep checking for completion until all of the requests are complete. Since not all PMLs support cancel (I believe that they return |
For the PML that do not support cancel, the request is not internally marked as cancelled (the field _cancelled in the request status), so this should provide a hint on what you should expect from the request. However, if we assume that the send cancel will be removed from the next version of the MPI standard, this makes the problem almost insolvable through the MPI concepts, as all the sends must now have a matching receive (and if they are not correctly matched they will match the next collective). |
0c5878d
to
4595eaf
Compare
@bosilca I added a new commit that attempts to cleanup the outstanding requests. It also improves the error message printed. What do you think of that? It did improve the error message displayed as part of Issue #2256 MPI Error in MPI_Testall() (req 0 = 15) Error code |
The error message suggest that all non-blocking communications are unsafe when in fact only the non-blocking on this communicator might be affected. Otherwise it looks good. |
* If an error is detected internal to libnbc (e.g., PML truncation error) this patch makes sure that the request is completed and the `MPI_ERROR` field is set approprately. * Make an attempt to cleanup outstanding requests before returning. - This is a "best attempt" since not all PMLs support canceling requests.
4595eaf
to
8748e54
Compare
@bosilca Good point, I updated the error message to clarify that it is specific to that communicator. Travis CI failure and Mellanox CI hang are a result of the github DNS issues. This is ready to merge. |
this patch makes sure that the request is completed and the
MPI_ERROR
field is set appropriately.