Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/rxm: Rework CM #6778

Closed
wants to merge 6 commits into from
Closed

prov/rxm: Rework CM #6778

wants to merge 6 commits into from

Conversation

shefty
Copy link
Member

@shefty shefty commented May 24, 2021

The main patch here replaces the connection management component of rxm. The main objective is to make the CM code more readable and maintainable by removing the code complexity that results from the cmap implementation. Multi-threaded synchronization should be easier to understand, with the rxm ep's lock used to avoid race conditions. From the commit msg:

This patch greatly simplifies the management of msg
endpoints by removing the cmap indirection. Unfortunately,
there's no straightforward way to replace cmap and all of
its complexities with set of smaller, more manageable
changes.

The new scheme has the following behavior:

The AV manages 2 sets of addresses. The first is an
array of addresses which were inserted into the AV. This
is the standard AV implementation. The second is an array
of known peer addresses (peer array). That array includes
addresses where were inserted into the AV, plus the addresses of
peers to which there is an active connection. Along with
these 2 arrays is an rbtree, which allows searching for
addresses that are in the peer array.

The peer array is implemented using the indexed buffer pool.

RxM ep's will maintain an array of connections to each peer
that it is communicating with. Those connections will be
stored in an indexer. The index that each connection uses
will match the index of the AV peer array corresponding to
the remote address. So if address 10.1.2.3 is stored in the
AV's peer array at index 8, then the rxm ep will store its
connection to 10.1.2.3 in its indexer also at index 8.

This structure makes lookups consistent for all rxm ep's
and eliminates the need for the complex cmap.

In order to remove the cmap, the entire CM state machine
needs to be rewritten.

Only basic testing over tcp on a single system has been done on these changes. Testing at scale is needed. There are also several changes that are still left as to-do, which need to be addressed prior to merging. It may also be possible to look at the final patch and extract some portions out into separate commits, but such areas would need to be identified.

Unfortunately because of the scope of the changes, reviewing the patch directly is likely not the best way to review the code. An analysis of the updated code would be best.

@shefty
Copy link
Member Author

shefty commented May 24, 2021

@ooststep @swelch - If you get a chance, please look at these changes, particularly the updated rxm_conn.c file. Nearly all of the code in that file was re-written.

The code is mostly untested. I'm not worried about CI failures at the moment, as there are changes which are still needed. However, I wanted to post the changes early for review.

@shefty
Copy link
Member Author

shefty commented May 24, 2021

Apparently adding "---" to a comment makes the text above and below it big and bold. The emphasis above isn't intentional

@shefty shefty force-pushed the rxm branch 3 times, most recently from 0279662 to 17239e4 Compare May 27, 2021 04:15
shefty added a commit that referenced this pull request May 27, 2021
shefty added 3 commits May 27, 2021 15:46
In tcpx_cm_recv_req, we duplicate the passive endpoint's
fi_info structure.  The dest_addr is then replaced without
the original dup'ed address being freed.  Free it first.

Fixes asan memory leak report.

Signed-off-by: Sean Hefty <[email protected]>
This can be used by higher level code to indicate if an
insertion should be done at the head or tail of a list.

Signed-off-by: Sean Hefty <[email protected]>
This patch greatly simplifies the management of msg
endpoints by removing the cmap indirection.  Unfortunately,
there's no straightforward way to replace cmap and all of
its complexities with set set of smaller, more manageable
changes.

The new scheme has the following behavior:

The AV manages 2 sets of addresses.  The first is an
array of addresses which were inserted into the AV.  This
is the standard AV implementation.  The second is an array
of known peer addresses (peer array).  That array includes
addresses where were inserted into the AV, plus the addresses of
peers to which there is an active connection.  Along with
these 2 arrays is an rbtree, which allows searching for
addresses that are in the peer array.

The peer array is implemented using the indexed buffer pool.

RxM ep's will maintain an array of connections to each peer
that it is communicating with.  Those connections will be
stored in an indexer.  The index that each connection uses
will match the index of the AV peer array corresponding to
the remote address.  So if address 10.1.2.3 is stored in the
AV's peer array at index 8, then the rxm ep will store its
connection to 10.1.2.3 in its indexer also at index 8.

This structure makes lookups consistent for all rxm ep's
and eliminiates the need for the complex cmap.

In order to remove the cmap, the entire CM state machine
needs to be rewritten.

Signed-off-by: Sean Hefty <[email protected]>
Copy link
Contributor

@ooststep ooststep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about half way through the review. so far so good

@shefty shefty force-pushed the rxm branch 2 times, most recently from 96561e8 to 1e54309 Compare May 28, 2021 06:14
@shefty
Copy link
Member Author

shefty commented May 28, 2021

These changes exposed a few bugs in other parts of the code. All identified have been fixed, with most of the fixes already merged. The code needs testing for verification, but the only left-over work that I can recall is to handle an app that uses an rxm endpoint to talk with itself. There's a FIXME or TODO label in the code for that.

With all connections to the same peer address being store at the same index, the two sides of a loopback connection will want to go into the same index. We'll either need to drop support for this or somehow flag this as a special case. If we go the route of making this a special case, we'll need to do so in a way that doesn't add overhead for the 99.999999% of apps that don't use the network to talk to themselves.

@swelch
Copy link
Contributor

swelch commented May 28, 2021

The code needs testing for verification, but the only left-over work that I can recall is to handle an app that uses an rxm endpoint to talk with itself. There's a FIXME or TODO label in the code for that.

With all connections to the same peer address being store at the same index, the two sides of a loopback connection will want to go into the same index. We'll either need to drop support for this or somehow flag this as a special case. If we go the route of making this a special case, we'll need to do so in a way that doesn't add overhead for the 99.999999% of apps that don't use the network to talk to themselves.

I believe some non-MPI applications that I debugged over the rxm/verbs provider were using a loopback connection when they used libfabric software emulated atomics; so I think this TODO will need to be done before it could be merged.

@shefty
Copy link
Member Author

shefty commented May 28, 2021

We'll want a new fabtest for self-communication then, as I'm not aware of any other tests that do this. I have some ideas for handling this; trying to figure out what's the least messy.

@shefty
Copy link
Member Author

shefty commented Jun 5, 2021

bot:aws:retest

@shefty
Copy link
Member Author

shefty commented Jun 5, 2021

Leaving notes for implementing loopback connections:

  • add rx_loopback flag to rxm_conn, flag indicates conn is not in conn_idx_map
  • create alloc_conn call, separating allocation functionality from add_conn
  • add_conn calls alloc_conn and inserts into the conn_idx_map
  • create alloc_rx_loopback call, calls alloc_conn, sets rx_loopback flag
  • in process_connreq, CONNECTING path, call alloc_rx_loopback and continue with rx conn
  • add rx_loopback check to free_conn and skip conn_idx_map removal if set
  • discover all the issues the above flow missed

Allow an rxm endpoint to talk with itself.  This requires
setting up a connection through the msg provider.

Signed-off-by: Sean Hefty <[email protected]>
@shefty
Copy link
Member Author

shefty commented Jun 5, 2021

Updated with loopback support. But there's currently no way to verify that support actually works.

This test communicates with itself.  No need for client/server
setup.  This tests provider support for loopback communication.

Signed-off-by: Sean Hefty <[email protected]>
@shefty
Copy link
Member Author

shefty commented Jun 7, 2021

Added new fi_loopback fabtest to verify if a provider can send a message to itself.

@shefty
Copy link
Member Author

shefty commented Jun 8, 2021

Intel CI failure:

name:   fi_unexpected_msg -e rdm -i 10 -p "verbs;ofi_rxm"
18:06:18    result: Fail
18:06:18    time:   301
18:06:18    server_cmd: /home/build/ofi-Install/libfabric/ofi_libfabric/PR-6778/12/dbg/bin/fi_unexpected_msg -e rdm -i 10 -p "verbs;ofi_rxm"  -s n13-mlx5_0_1
18:06:18    server_stdout: |
18:06:18      Killed by signal 15.
18:06:18    client_cmd: /home/build/ofi-Install/libfabric/ofi_libfabric/PR-6778/12/dbg/bin/fi_unexpected_msg -e rdm -i 10 -p "verbs;ofi_rxm"  -s n14-mlx5_0_1 n13-mlx5_0_1
18:06:18    client_stdout: |
18:06:18      Killed by signal 15.

fi_unexpected_msg test hangs running over verbs. Server side is spinning trying to send initial set of messages. Client is blocking in ft_sync -- after posting initial send buffers.

@shefty
Copy link
Member Author

shefty commented Jun 8, 2021

Added fix to fabtests oob socket initialization

We need to continue to make progress on the main ep even when
we're using the OOB socket.  Call ft_sock_setup() to initialize
the socket properly.  This fixes a hang in fi_unexpected_msg
test that shows up with rxm CM changes.  However, the hang could
happen with any test that uses OOB exchanges and blocks on recv().

Signed-off-by: Sean Hefty <[email protected]>
@shefty
Copy link
Member Author

shefty commented Jun 8, 2021

Fix to fabtests breaks all sort of other tests... sigh... removed here (returning to the regularly scheduled CI failures). Will work on fabtest fix in the separate PR.

@krehm
Copy link

krehm commented Jul 7, 2021

@shefty I'm confused on the state of the RXM rework, this ticket is closed, but the header says the PRs haven't been merged.

@j-xiong
Copy link
Contributor

j-xiong commented Jul 7, 2021

@krehm This PR is replaced by #6833.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants