-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/verbs: Flush any pending send requests on EP destruction #7291
Conversation
Shrink the size of struct vrb_context by combining disjointly used fields into a union. Replace the use of generic OFI flags with a verbs specific enum. And remove setting unnecessary fields in call paths. The vrb_context structure will be expanded to track send operations in a follow on patch. These changes keep the structure size to castable with struct fi_context. Even though there is currently no requirement to restrict vrb_context to that size. Signed-off-by: Sean Hefty <[email protected]>
Add transmit operations to a software SQ when posting, and remove them when handling the completion. Generate flushed completions for any outstanding transmit operations when shutting down an endpoint. This update ensures that all posted sends will generate a completion to the application. This fixes an issue with DAOS where a send operation never completes in the case that the peer has crashed, resulting in the local application hanging. Signed-off-by: Sean Hefty <[email protected]>
ea2d7df
to
12c3e17
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of questions...
if (ctx->op_ctx == VRB_POST_SQ) { | ||
assert(ctx->ep); | ||
assert(!slist_empty(&ctx->ep->sq_list)); | ||
assert(ctx->ep->sq_list.head == &ctx->entry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there an order guarantee while polling? It might not be the first one right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, completion order is guaranteed by the HW. All operations posted to the send queue will complete in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in ofi manpage -
For transmit requests, completion ordering depends on the endpoint communication type. For unreliable communication, completion ordering applies to all data transfer requests submitted to an endpoint. For reliable communication, completion ordering only applies to requests that target a single destination endpoint. Completion ordering of requests that target different endpoints over a reliable transport is not defined.
And if user set FI_ORDER_NONE, then "No ordering is defined for completed operations. Requests submitted to the transmit context may complete in any order"
So in verbs provider it internally actually works in "FI_ORDER_STRICT" mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - verbs HW implements the equivalent of FI_ORDER_STRICT.
wc.wr_id = (uintptr_t) ctx->user_ctx; | ||
cq->credits++; | ||
ctx->ep->sq_credits++; | ||
ofi_buf_free(ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If ctx is freed at here, but the completion event is eventually returned by the later call of ibv_poll_cq(), then the ctx can be already reused and vrb_poll_cq() can check wrong thing correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's see sean's comment, in my understanding the ctx memory seems will not be used any more, here it just write an event (vrb_save_wc) and later call of ibv_poll_cq() need not do anything for that event any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, a problem is if it freed here, if it is possible (in any case), the same event being pop-up at path of vrb_poll_cq()->ibv_poll_cq()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ctx memory is not accessed beyond the free call. The saved work completion (wc.wr_id) references ctx->user_ctx, which is what gets returned.
The higher-level flow is this on the send path:
wr.wr_id = user_ctx
vrb_post_send(wr)
/* swap in our context immediately before posting */
ctx = alloc_ctx()
ctx->user_ctx = wr.wr_id
wr.wr_id = ctx
ibv_post_send(wr)
The completion handling does the opposite:
/* swap back to the user's context */
ctx = wc.wr_id
wc.wr_id = ctx->user_ctx
free(ctx)
report completion to user...
The reason for the context swapping is that verbs does not return certain information in the work completion (like if the operation was a send or receive) if the operation failed. The ctx structure used here stores that information, so that we have it in order to recover from errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry my question is, if flush_sq() has freed a ctx attached on ep->sq_list, and vrb_poll_cq() actually received the event of it after calling ibv_poll_cq(), then in vrb_poll_cq(): "ctx = (struct vrb_context *) (uintptr_t) wc->wr_id;" is accessing a freed ctx?
II know we cannot receive the completion and this is the reason we need this patch, but because we don't know the reason yet, so I'm worrying what is going to happen if we actually received the completion event in some corner cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chatted with Xuezhao, so there is no CQ anymore (no polling) after flush correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vrb_flush_sq() only being called inside vrb_ep_close() -
vrb_ep_close() {
...
vrb_cleanup_cq();
vrb_flush_sq();
vrb_close_free_ep();
}
So seems yes, after vrb_flush_sq() it will never pool the CQ any more.
This real problem looks like is a IB verbs bug, because the transmit OP submitted to IB QP (vrb_post_send -> ibv_post_send), but IB verbs never reports its completion by ibv_poll_cq().
This patch in OFI verbs provider workaround it by adding per EP tracking queue (sq_list), I think it is safer than workaround it in mercury.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, the CQ is still there, but the QP has been freed (shown above in the '...' portion of the code). After freeing the QP, we flush the CQ of all completions. Only then do we generate completions for any unmatched sends.
There should only be an issue if the verbs HW suddenly starts generating completions for QPs that no longer exist.
Fixes #7287 |
evaluating for DAOS reported issue