-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/rxm: Rework CM #6778
prov/rxm: Rework CM #6778
Conversation
@ooststep @swelch - If you get a chance, please look at these changes, particularly the updated rxm_conn.c file. Nearly all of the code in that file was re-written. The code is mostly untested. I'm not worried about CI failures at the moment, as there are changes which are still needed. However, I wanted to post the changes early for review. |
Apparently adding "---" to a comment makes the text above and below it big and bold. The emphasis above isn't intentional |
0279662
to
17239e4
Compare
In tcpx_cm_recv_req, we duplicate the passive endpoint's fi_info structure. The dest_addr is then replaced without the original dup'ed address being freed. Free it first. Fixes asan memory leak report. Signed-off-by: Sean Hefty <[email protected]>
This can be used by higher level code to indicate if an insertion should be done at the head or tail of a list. Signed-off-by: Sean Hefty <[email protected]>
This patch greatly simplifies the management of msg endpoints by removing the cmap indirection. Unfortunately, there's no straightforward way to replace cmap and all of its complexities with set set of smaller, more manageable changes. The new scheme has the following behavior: The AV manages 2 sets of addresses. The first is an array of addresses which were inserted into the AV. This is the standard AV implementation. The second is an array of known peer addresses (peer array). That array includes addresses where were inserted into the AV, plus the addresses of peers to which there is an active connection. Along with these 2 arrays is an rbtree, which allows searching for addresses that are in the peer array. The peer array is implemented using the indexed buffer pool. RxM ep's will maintain an array of connections to each peer that it is communicating with. Those connections will be stored in an indexer. The index that each connection uses will match the index of the AV peer array corresponding to the remote address. So if address 10.1.2.3 is stored in the AV's peer array at index 8, then the rxm ep will store its connection to 10.1.2.3 in its indexer also at index 8. This structure makes lookups consistent for all rxm ep's and eliminiates the need for the complex cmap. In order to remove the cmap, the entire CM state machine needs to be rewritten. Signed-off-by: Sean Hefty <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about half way through the review. so far so good
96561e8
to
1e54309
Compare
These changes exposed a few bugs in other parts of the code. All identified have been fixed, with most of the fixes already merged. The code needs testing for verification, but the only left-over work that I can recall is to handle an app that uses an rxm endpoint to talk with itself. There's a FIXME or TODO label in the code for that. With all connections to the same peer address being store at the same index, the two sides of a loopback connection will want to go into the same index. We'll either need to drop support for this or somehow flag this as a special case. If we go the route of making this a special case, we'll need to do so in a way that doesn't add overhead for the 99.999999% of apps that don't use the network to talk to themselves. |
I believe some non-MPI applications that I debugged over the rxm/verbs provider were using a loopback connection when they used libfabric software emulated atomics; so I think this TODO will need to be done before it could be merged. |
We'll want a new fabtest for self-communication then, as I'm not aware of any other tests that do this. I have some ideas for handling this; trying to figure out what's the least messy. |
bot:aws:retest |
Leaving notes for implementing loopback connections:
|
Allow an rxm endpoint to talk with itself. This requires setting up a connection through the msg provider. Signed-off-by: Sean Hefty <[email protected]>
Updated with loopback support. But there's currently no way to verify that support actually works. |
This test communicates with itself. No need for client/server setup. This tests provider support for loopback communication. Signed-off-by: Sean Hefty <[email protected]>
Added new fi_loopback fabtest to verify if a provider can send a message to itself. |
Intel CI failure:
fi_unexpected_msg test hangs running over verbs. Server side is spinning trying to send initial set of messages. Client is blocking in ft_sync -- after posting initial send buffers. |
Added fix to fabtests oob socket initialization |
We need to continue to make progress on the main ep even when we're using the OOB socket. Call ft_sock_setup() to initialize the socket properly. This fixes a hang in fi_unexpected_msg test that shows up with rxm CM changes. However, the hang could happen with any test that uses OOB exchanges and blocks on recv(). Signed-off-by: Sean Hefty <[email protected]>
Fix to fabtests breaks all sort of other tests... sigh... removed here (returning to the regularly scheduled CI failures). Will work on fabtest fix in the separate PR. |
@shefty I'm confused on the state of the RXM rework, this ticket is closed, but the header says the PRs haven't been merged. |
The main patch here replaces the connection management component of rxm. The main objective is to make the CM code more readable and maintainable by removing the code complexity that results from the cmap implementation. Multi-threaded synchronization should be easier to understand, with the rxm ep's lock used to avoid race conditions. From the commit msg:
This patch greatly simplifies the management of msg
endpoints by removing the cmap indirection. Unfortunately,
there's no straightforward way to replace cmap and all of
its complexities with set of smaller, more manageable
changes.
The new scheme has the following behavior:
The AV manages 2 sets of addresses. The first is an
array of addresses which were inserted into the AV. This
is the standard AV implementation. The second is an array
of known peer addresses (peer array). That array includes
addresses where were inserted into the AV, plus the addresses of
peers to which there is an active connection. Along with
these 2 arrays is an rbtree, which allows searching for
addresses that are in the peer array.
The peer array is implemented using the indexed buffer pool.
RxM ep's will maintain an array of connections to each peer
that it is communicating with. Those connections will be
stored in an indexer. The index that each connection uses
will match the index of the AV peer array corresponding to
the remote address. So if address 10.1.2.3 is stored in the
AV's peer array at index 8, then the rxm ep will store its
connection to 10.1.2.3 in its indexer also at index 8.
This structure makes lookups consistent for all rxm ep's
and eliminates the need for the complex cmap.
In order to remove the cmap, the entire CM state machine
needs to be rewritten.
Only basic testing over tcp on a single system has been done on these changes. Testing at scale is needed. There are also several changes that are still left as to-do, which need to be addressed prior to merging. It may also be possible to look at the final patch and extract some portions out into separate commits, but such areas would need to be identified.
Unfortunately because of the scope of the changes, reviewing the patch directly is likely not the best way to review the code. An analysis of the updated code would be best.