Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/verbs: Flush any pending send requests on EP destruction #7291

Merged
merged 2 commits into from
Dec 9, 2021

Conversation

shefty
Copy link
Member

@shefty shefty commented Dec 8, 2021

evaluating for DAOS reported issue

Shrink the size of struct vrb_context by combining disjointly
used fields into a union.  Replace the use of generic OFI flags
with a verbs specific enum.  And remove setting unnecessary
fields in call paths.

The vrb_context structure will be expanded to track send operations
in a follow on patch.  These changes keep the structure size to
castable with struct fi_context.  Even though there is currently no
requirement to restrict vrb_context to that size.

Signed-off-by: Sean Hefty <[email protected]>
Add transmit operations to a software SQ when posting, and
remove them when handling the completion.

Generate flushed completions for any outstanding transmit
operations when shutting down an endpoint.

This update ensures that all posted sends will generate a
completion to the application.  This fixes an issue with DAOS
where a send operation never completes in the case that the
peer has crashed, resulting in the local application hanging.

Signed-off-by: Sean Hefty <[email protected]>
Copy link

@gnailzenh gnailzenh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of questions...

if (ctx->op_ctx == VRB_POST_SQ) {
assert(ctx->ep);
assert(!slist_empty(&ctx->ep->sq_list));
assert(ctx->ep->sq_list.head == &ctx->entry);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an order guarantee while polling? It might not be the first one right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, completion order is guaranteed by the HW. All operations posted to the send queue will complete in order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in ofi manpage -
For transmit requests, completion ordering depends on the endpoint communication type. For unreliable communication, completion ordering applies to all data transfer requests submitted to an endpoint. For reliable communication, completion ordering only applies to requests that target a single destination endpoint. Completion ordering of requests that target different endpoints over a reliable transport is not defined.

And if user set FI_ORDER_NONE, then "No ordering is defined for completed operations. Requests submitted to the transmit context may complete in any order"

So in verbs provider it internally actually works in "FI_ORDER_STRICT" mode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - verbs HW implements the equivalent of FI_ORDER_STRICT.

wc.wr_id = (uintptr_t) ctx->user_ctx;
cq->credits++;
ctx->ep->sq_credits++;
ofi_buf_free(ctx);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ctx is freed at here, but the completion event is eventually returned by the later call of ibv_poll_cq(), then the ctx can be already reused and vrb_poll_cq() can check wrong thing correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's see sean's comment, in my understanding the ctx memory seems will not be used any more, here it just write an event (vrb_save_wc) and later call of ibv_poll_cq() need not do anything for that event any more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, a problem is if it freed here, if it is possible (in any case), the same event being pop-up at path of vrb_poll_cq()->ibv_poll_cq()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ctx memory is not accessed beyond the free call. The saved work completion (wc.wr_id) references ctx->user_ctx, which is what gets returned.

The higher-level flow is this on the send path:

wr.wr_id = user_ctx
vrb_post_send(wr)
/* swap in our context immediately before posting */
ctx = alloc_ctx()
ctx->user_ctx = wr.wr_id
wr.wr_id = ctx
ibv_post_send(wr)

The completion handling does the opposite:

/* swap back to the user's context */
ctx = wc.wr_id
wc.wr_id = ctx->user_ctx
free(ctx)
report completion to user...

The reason for the context swapping is that verbs does not return certain information in the work completion (like if the operation was a send or receive) if the operation failed. The ctx structure used here stores that information, so that we have it in order to recover from errors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry my question is, if flush_sq() has freed a ctx attached on ep->sq_list, and vrb_poll_cq() actually received the event of it after calling ibv_poll_cq(), then in vrb_poll_cq(): "ctx = (struct vrb_context *) (uintptr_t) wc->wr_id;" is accessing a freed ctx?

II know we cannot receive the completion and this is the reason we need this patch, but because we don't know the reason yet, so I'm worrying what is going to happen if we actually received the completion event in some corner cases.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chatted with Xuezhao, so there is no CQ anymore (no polling) after flush correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vrb_flush_sq() only being called inside vrb_ep_close() -
vrb_ep_close() {
...
vrb_cleanup_cq();
vrb_flush_sq();
vrb_close_free_ep();
}
So seems yes, after vrb_flush_sq() it will never pool the CQ any more.

This real problem looks like is a IB verbs bug, because the transmit OP submitted to IB QP (vrb_post_send -> ibv_post_send), but IB verbs never reports its completion by ibv_poll_cq().
This patch in OFI verbs provider workaround it by adding per EP tracking queue (sq_list), I think it is safer than workaround it in mercury.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, the CQ is still there, but the QP has been freed (shown above in the '...' portion of the code). After freeing the QP, we flush the CQ of all completions. Only then do we generate completions for any unmatched sends.

There should only be an issue if the verbs HW suddenly starts generating completions for QPs that no longer exist.

@shefty shefty removed the evaluating label Dec 9, 2021
@shefty
Copy link
Member Author

shefty commented Dec 9, 2021

Fixes #7287

@shefty shefty merged commit ee06e44 into ofiwg:main Dec 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants