
ch4/ofi implementation incorrect for RMA #4468

Closed

pavanbalaji opened this issue Apr 26, 2020 · 17 comments

Comments

@pavanbalaji (Contributor) commented Apr 26, 2020

The ch4/ofi implementation seems to be functionally incorrect for RMA in a few cases, and is designed in a way that makes it impossible to extract performance in a few others. Here is a rough survey of some of the issues I found:

  1. In some cases, we are not respecting the max_msg_size attribute of OFI. We simply issue all of the data in a single message. This should probably be segmented into smaller messages so that each message fits within the specified limit (a standalone sketch of this kind of segmentation appears after this list).

  2. For noncontiguous communication, when it counts the number of IOVs needed, the netmod attempts to split the data into smaller chunks so that a proper subset of the IOV array can form a "max_msg_size" segment (so we do not have a case where a subset of an IOV segment has to be sent within a message). This is an extremely inefficient path and is not intended to be used this way. Typerep cannot optimize this case without parsing through the entire datatype, which is very expensive. The intent is to use the max_contig_blocks element stored inside the MPIR_Datatype structure, which would give the maximum number of contiguous segments within the datatype. If further segmentation is needed based on network-specific limits, those should be performed by the netmod.

  3. Ignoring the inefficiency mentioned above, the implementation is still incorrect because it assumes a highly structured datatype, where simply getting the number of IOV elements in the first part of the datatype and repeating it several times is sufficient. This is wrong because no offset is used to get the IOV segments of the later parts of the datatype. Clearly the MPICH testing is not catching this error, but the error still exists.

  4. For RMA, there is no pack/unpack-based path at all. Converting noncontiguous segments to IOVs is extremely expensive, especially for cases where the IOV uses more metadata than the actual data itself. Like ch3, we should add a path where, if the average contiguous length of the datatype is below a threshold, we simply fall back to the AM path.

  5. For RMA, in cases where we do not fall back to the AM path (i.e., each contiguous segment is large), the question remains whether we should prepare an IOV at all or simply issue each operation individually. I suspect that, in that message range, issuing each operation individually would simplify the code and have no performance impact in practice. Assuming we can do that, the entire state machine that the RMA code infrastructure seems to maintain can simply be deleted.
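Purely as an illustration of the segmentation mentioned in item 1 (not MPICH code): a minimal, self-contained C sketch that chops one contiguous transfer into pieces no larger than a provider's max_msg_size. The helper issue_rdma_write, the buffer size, and the 4096-byte limit are hypothetical stand-ins for the actual fi_write-style call and provider attribute.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the actual OFI write call (e.g. fi_write). */
static int issue_rdma_write(const void *buf, size_t len, uint64_t remote_addr)
{
    printf("RDMA write: %zu bytes to remote offset %llu\n",
           len, (unsigned long long) remote_addr);
    return 0;
}

/* Split a contiguous transfer into pieces no larger than max_msg_size,
 * instead of issuing the whole thing as a single (possibly oversized) message. */
static int segmented_put(const char *origin, size_t total_bytes,
                         uint64_t target_addr, size_t max_msg_size)
{
    size_t offset = 0;
    while (offset < total_bytes) {
        size_t chunk = total_bytes - offset;
        if (chunk > max_msg_size)
            chunk = max_msg_size;
        int rc = issue_rdma_write(origin + offset, chunk, target_addr + offset);
        if (rc != 0)
            return rc;
        offset += chunk;
    }
    return 0;
}

int main(void)
{
    char buf[10000] = {0};
    /* e.g. an assumed provider limit of 4096 bytes per message */
    return segmented_put(buf, sizeof buf, 0x1000, 4096);
}
```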

@raffenet (Contributor)

Is it worth keeping the current non-contig implementation around at all? I've been looking at it and wondering if we should just revert to AM for all non-contig cases to start. Then we can add ch4/ofi native carve-outs for cases where we think we can perform better. This way we can delete a lot of this overly complex stuff right off the bat and make changes more easily.

@hzhou (Contributor) commented Apr 28, 2020

> Is it worth keeping the current non-contig implementation around at all? I've been looking at it and wondering if we should just revert to AM for all non-contig cases to start. Then we can add ch4/ofi native carve-outs for cases where we think we can perform better. This way we can delete a lot of this overly complex stuff right off the bat and make changes more easily.

Of course, currently the OFI AM path handles the non-contig case using IOVs as well. I have a PR (#4423) to fix the pt2pt AM LMT path; it might be a good time to review that PR first. The non-contig RMA AM path currently also uses IOVs at the ch4-generic layer. We'll need some refactoring there first before addressing it at the OFI layer.

@pavanbalaji (Contributor, Author)

> Is it worth keeping the current non-contig implementation around at all? [...]

That makes sense to me. But I think the AM code does not handle the IOV path (i.e., it always packs), so we'd lose performance in some cases. Presumably you mean you'll add that back?

@raffenet (Contributor)

> That makes sense to me. But I think the AM code does not handle the IOV path (i.e., it always packs), so we'd lose performance in some cases. Presumably you mean you'll add that back?

Yes, in a more targeted fashion that's hopefully easier to grok.

@pavanbalaji (Contributor, Author)

To be clear, there are four cases that would need to be handled (a dispatch sketch based on datatype density follows the list):

  1. Both origin and target datatypes are low density (each individual contiguous chunk is small). In this case, I think the OFI netmod should simply use pack/unpack. Offloading this case to the AM layer might be OK as we cannot benefit from hardware RDMA anyway.

  2. Both origin and target datatypes are high density (each individual contiguous chunk is large). In this case, I think the OFI netmod can simply convert the message to a series of contiguous RDMA operations. We won't need IOVs here, but I think this case would need to be handled by the OFI netmod instead of AM so that it can actually use RDMA operations.

  3. Origin is high density and target is low density. We'd need unpacking on the target buffer, so I think this case is OK to offload to AM, which would need to send individual contiguous pieces from the origin buffer as independent messages, which could be unpacked on the target side.

  4. Origin is low density and target is high density. In this case, the origin data would need to be packed into a temporary buffer and then RDMA issued to the target buffer. I think this case would need to be handled by the OFI netmod in order to use hardware RDMA.
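A minimal sketch of what the dispatch over these four cases could look like, assuming the density measure (size / max_contig_blocks) discussed later in this thread; the dtype_info struct, the helper names, and DENSITY_THRESHOLD are illustrative stand-ins, not the actual MPICH structures or constants.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified view of the two datatype fields the discussion relies on. */
typedef struct {
    size_t size;              /* total bytes described by the datatype */
    size_t max_contig_blocks; /* number of contiguous segments */
} dtype_info;

/* density = average contiguous chunk length */
static size_t density(const dtype_info *dt)
{
    return dt->max_contig_blocks ? dt->size / dt->max_contig_blocks : dt->size;
}

enum rma_path {
    RMA_PATH_AM_PACK,        /* case 1: both sides low density  -> pack/unpack via AM */
    RMA_PATH_RDMA_PER_CHUNK, /* case 2: both sides high density -> contiguous RDMA per chunk */
    RMA_PATH_AM_UNPACK,      /* case 3: origin high, target low -> AM, unpack at target */
    RMA_PATH_PACK_THEN_RDMA, /* case 4: origin low, target high -> pack origin, then RDMA */
};

/* DENSITY_THRESHOLD is an assumed tuning knob, not an MPICH constant. */
#define DENSITY_THRESHOLD 1024

static enum rma_path choose_rma_path(const dtype_info *origin, const dtype_info *target)
{
    int origin_dense = density(origin) >= DENSITY_THRESHOLD;
    int target_dense = density(target) >= DENSITY_THRESHOLD;

    if (origin_dense && target_dense)
        return RMA_PATH_RDMA_PER_CHUNK;
    if (origin_dense)   /* target sparse */
        return RMA_PATH_AM_UNPACK;
    if (target_dense)   /* origin sparse */
        return RMA_PATH_PACK_THEN_RDMA;
    return RMA_PATH_AM_PACK;
}

int main(void)
{
    dtype_info origin = { .size = 1 << 20, .max_contig_blocks = 16 };    /* dense */
    dtype_info target = { .size = 1 << 20, .max_contig_blocks = 65536 }; /* sparse */
    printf("chosen path: %d\n", choose_rma_path(&origin, &target));
    return 0;
}
```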

@hzhou (Contributor) commented Apr 28, 2020

> But I think the AM code does not handle the IOV path (i.e., it always packs), so we'd lose performance in some cases. Presumably you mean you'll add that back?

For AM pt2pt, it always packs on the send side, but currently it always does IOV (RDMA read) on the receive side. For AM RMA, it always does IOV at the ch4 layer.

@pavanbalaji (Contributor, Author)

> For AM pt2pt, it always packs on the send side, but currently it always does IOV (RDMA read) on the receive side. For AM RMA, it always does IOV at the ch4 layer.

This issue is for RMA communication, although similar problems might exist in the send/recv path too.

@hzhou (Contributor) commented Apr 28, 2020

> This issue is for RMA communication, although similar problems might exist in the send/recv path too.

Right. Just for information, the AM RMA path always does IOV for non-contig datatypes and has the same issue dealing with large, sparsely segmented data.

raffenet added a commit to raffenet/mpich that referenced this issue Apr 28, 2020:
"This implementation complicates datatype optimization, and also has correctness issues. See pmodels#4468."
@raffenet (Contributor)

> To be clear, there are four cases that would need to be handled: [...]

As far as density goes, this is probably something we want to sort out at type creation time, right? I was thinking of putting some ch4/ofi-specific fields in the netmod private area of the datatype. If successful, we can generalize it for other transports.

@raffenet (Contributor)

Or is density just as simple as size / max_contig_blocks?

@hzhou (Contributor) commented Apr 29, 2020

> Or is density just as simple as size / max_contig_blocks?

I think this covers the majority of cases. It is rare to have non-uniform non-contig datatypes.

@pavanbalaji (Contributor, Author)

Density is just size / max_contig_blocks. It's very quick to calculate.
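For concreteness (numbers invented for illustration): a vector type of 1000 blocks of 8 contiguous bytes each has size 8000 and max_contig_blocks 1000, so its density is 8000 / 1000 = 8 bytes per contiguous chunk, and it would take the pack/AM path; the same 8000 bytes laid out as 4 blocks of 2000 bytes has a density of 2000 and would take the native RDMA path.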

raffenet and pavanbalaji referenced this issue in further commits between Apr 30 and May 4, 2020.
@raffenet (Contributor) commented May 4, 2020

> 4. Origin is low density and target is high density. In this case, the origin data would need to be packed into a temporary buffer and then RDMA issued to the target buffer. I think this case would need to be handled by the OFI netmod in order to use hardware RDMA.

In implementing case 4 for MPI_Put, I still use the MPIR_Typerep_to_iov function to get the size and offset of the contiguous chunks at the target. Is there any better way I am missing before I start on case 2?

@hzhou (Contributor) commented May 4, 2020

> In implementing case 4 for MPI_Put, I still use the MPIR_Typerep_to_iov function to get the size and offset of the contiguous chunks at the target. Is there any better way I am missing before I start on case 2?

In this case, we need to send the packed data plus the target IOV -- using MPIR_Typerep_to_iov seems fine; even if yaksa provides an alternative serialization later, it should be a drop-in replacement -- and the remote side needs to MPIR_Typerep_unpack (or do the equivalent memcpys). We can't do hardware RDMA because the origin side is low density, so the RDMA would be fragmented and inefficient.

EDIT: I am sorry, this is not the active-message path 😄. Yeah, RDMA should work. Local pack - MPIR_Typerep_to_iov - RDMA sounds good.
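A standalone sketch of the "local pack - target IOV - RDMA" flow for case 4 described above. The target_chunk list stands in for the output of MPIR_Typerep_to_iov, the packed buffer stands in for the result of the origin-side pack, and issue_rdma_write is a hypothetical placeholder for the netmod's contiguous RDMA write; none of these are the real MPICH/OFI interfaces.

```c
#include <stddef.h>

/* One contiguous piece of the target datatype, e.g. derived from
 * MPIR_Typerep_to_iov (hypothetical simplification). */
typedef struct {
    size_t disp;   /* displacement into the target window, in bytes */
    size_t len;    /* contiguous length in bytes */
} target_chunk;

/* Hypothetical stand-in for the netmod's contiguous RDMA write. */
static int issue_rdma_write(const void *buf, size_t len, size_t target_disp)
{
    (void) buf; (void) len; (void) target_disp;
    return 0;
}

/* Case 4: origin is low density, target is high density.
 * 1. The sparse origin data has already been packed into 'packed'.
 * 2. Walk the target chunk list in order.
 * 3. Issue one contiguous RDMA write per target chunk, consuming the
 *    packed buffer sequentially. */
static int put_packed_to_iov(const char *packed, size_t packed_bytes,
                             const target_chunk *tgt, size_t nchunks)
{
    size_t consumed = 0;
    for (size_t i = 0; i < nchunks && consumed < packed_bytes; i++) {
        size_t len = tgt[i].len;
        if (len > packed_bytes - consumed)
            len = packed_bytes - consumed;
        int rc = issue_rdma_write(packed + consumed, len, tgt[i].disp);
        if (rc != 0)
            return rc;
        consumed += len;
    }
    return 0;
}

int main(void)
{
    /* Pretend the origin data was packed into this buffer. */
    char packed[24] = "abcdefghijklmnopqrstuvw";
    /* Pretend these chunks came from expanding the target datatype. */
    target_chunk tgt[2] = { { .disp = 0, .len = 16 }, { .disp = 64, .len = 8 } };
    return put_packed_to_iov(packed, sizeof packed, tgt, 2);
}
```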

@raffenet (Contributor) commented May 4, 2020

The reason I ask is that scenario 2 could get complex with two high-density types. What I envision is creating two sets of iovecs and iterating over both, always issuing the minimum-size contiguous RDMA operation. Hopefully, in the common case the chunks are of roughly equal size and there isn't a lot of fragmentation.

raffenet added a commit to raffenet/mpich that referenced this issue May 4, 2020.
@pavanbalaji (Contributor, Author)

> The reason I ask is that scenario 2 could get complex with two high-density types. What I envision is creating two sets of iovecs and iterating over both, always issuing the minimum-size contiguous RDMA operation. Hopefully, in the common case the chunks are of roughly equal size and there isn't a lot of fragmentation.

Yes, you'd need to convert each high-density type (origin and/or target) to an IOV and then do the individual RDMA operations. You'd have some fragmentation (because density is the average of the contiguous lengths), but I don't think it'd be too much.
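A small, self-contained sketch of the two-sided iteration described above for case 2: walk the origin and target chunk lists in lock step and, at each step, issue a contiguous RDMA operation of the minimum remaining length on either side. The chunk struct and issue_rdma_write are illustrative placeholders, not the actual netmod interfaces.

```c
#include <stddef.h>

typedef struct {
    size_t disp;   /* displacement into the local buffer / target window */
    size_t len;    /* contiguous length in bytes */
} chunk;

/* Hypothetical stand-in for the netmod's contiguous RDMA write. */
static int issue_rdma_write(size_t origin_disp, size_t target_disp, size_t len)
{
    (void) origin_disp; (void) target_disp; (void) len;
    return 0;
}

/* Case 2: both sides high density. Walk both chunk lists in lock step,
 * issuing min(remaining origin chunk, remaining target chunk) each time. */
static int put_iov_to_iov(const chunk *org, size_t norg,
                          const chunk *tgt, size_t ntgt)
{
    size_t oi = 0, ti = 0;      /* current chunk index on each side */
    size_t ooff = 0, toff = 0;  /* bytes already consumed in the current chunk */

    while (oi < norg && ti < ntgt) {
        size_t orem = org[oi].len - ooff;
        size_t trem = tgt[ti].len - toff;
        size_t len = orem < trem ? orem : trem;

        int rc = issue_rdma_write(org[oi].disp + ooff, tgt[ti].disp + toff, len);
        if (rc != 0)
            return rc;

        ooff += len;
        toff += len;
        if (ooff == org[oi].len) { oi++; ooff = 0; }
        if (toff == tgt[ti].len) { ti++; toff = 0; }
    }
    return 0;
}

int main(void)
{
    /* Both sides describe 48 bytes, but with different chunk boundaries. */
    chunk org[2] = { { .disp = 0,   .len = 32 }, { .disp = 100, .len = 16 } };
    chunk tgt[3] = { { .disp = 0,   .len = 16 }, { .disp = 50,  .len = 16 },
                     { .disp = 200, .len = 16 } };
    return put_iov_to_iov(org, 2, tgt, 3);
}
```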

pavanbalaji and raffenet referenced this issue in further commits between May 4 and May 18, 2020.
@hzhou (Contributor) commented Jun 4, 2021

Multiple fixes have been merged, and I believe this ticket has served its discussion purpose.

hzhou closed this as completed Jun 4, 2021.