
ch4/ofi implementation incorrect for RMA #4468

Closed

pavanbalaji opened this issue Apr 26, 2020 · 17 comments

Comments

@pavanbalaji (Contributor) commented Apr 26, 2020

The ch4/ofi implementation seems to be functionally incorrect for RMA in a few cases, and is designed in a way that makes it impossible to extract performance in a few others. Here is a rough survey of some of the issues I found:

  1. In some cases, we are not respecting the max_msg_size attribute of OFI. We simply issue all of the data in a single message. This should probably be segmented into smaller messages so that each message fits within the specified limit (a standalone sketch of this kind of segmentation appears after this list).

  2. For noncontiguous communication, when it counts the number of IOVs needed, the netmod attempts to split the data into smaller chunks so that a proper subset of the IOV array can form a "max_msg_size" segment (so we do not have a case where a subset of an IOV segment has to be sent within a message). This is an extremely inefficient path and is not intended to be used this way. Typerep cannot optimize this case without parsing through the entire datatype, which is very expensive. The intent is to use the max_contig_blocks element stored inside the MPIR_Datatype structure, which would give the maximum number of contiguous segments within the datatype. If further segmentation is needed based on network-specific limits, those should be performed by the netmod.

  3. Ignoring the inefficiency mentioned above, the implementation is still incorrect because it assumes a highly structured datatype, where simply getting the number of IOV elements in the first part of the datatype and repeating it several times is sufficient. This is wrong because no offset is used to get the IOV segments of the later parts of the datatype. Clearly the MPICH testing is not catching this error, but the error still exists.

  4. For RMA, there is no pack/unpack-based path at all. Converting noncontiguous segments to IOVs is extremely expensive, especially for cases where the IOV uses more metadata than the actual data itself. Like ch3, we should add a path where, if the average contiguous length of the datatype is below a threshold, we simply fall back to the AM path.

  5. For RMA, in cases where we do not fall back to the AM path (i.e., each contiguous segment is large), the question remains whether we should prepare an IOV at all or simply issue each operation individually. I suspect that, in that message range, issuing each operation individually would simplify the code and have no performance impact in practice. Assuming we can do that, the entire state machine that the RMA code infrastructure seems to maintain can simply be deleted.
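Purely as an illustration of the segmentation mentioned in item 1 (not MPICH code): a minimal, self-contained C sketch that chops one contiguous transfer into pieces no larger than a provider's max_msg_size. The helper issue_rdma_write, the buffer size, and the 4096-byte limit are hypothetical stand-ins for the actual fi_write-style call and provider attribute.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the actual OFI write call (e.g. fi_write). */
static int issue_rdma_write(const void *buf, size_t len, uint64_t remote_addr)
{
    printf("RDMA write: %zu bytes to remote offset %llu\n",
           len, (unsigned long long) remote_addr);
    return 0;
}

/* Split a contiguous transfer into pieces no larger than max_msg_size,
 * instead of issuing the whole thing as a single (possibly oversized) message. */
static int segmented_put(const char *origin, size_t total_bytes,
                         uint64_t target_addr, size_t max_msg_size)
{
    size_t offset = 0;
    while (offset < total_bytes) {
        size_t chunk = total_bytes - offset;
        if (chunk > max_msg_size)
            chunk = max_msg_size;
        int rc = issue_rdma_write(origin + offset, chunk, target_addr + offset);
        if (rc != 0)
            return rc;
        offset += chunk;
    }
    return 0;
}

int main(void)
{
    char buf[10000] = {0};
    /* e.g. an assumed provider limit of 4096 bytes per message */
    return segmented_put(buf, sizeof buf, 0x1000, 4096);
}
```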

@raffenet (Contributor)

Is it worth keeping the current non-contig implementation around at all? I've been looking at it and wondering if we should just revert to AM for all non-contig cases to start. Then we can add ch4/ofi native carve-outs for cases where we think we can perform better. This way we can delete a lot of this overly complex stuff right off the bat and make changes more easily.

@hzhou (Contributor) commented Apr 28, 2020

> Is it worth keeping the current non-contig implementation around at all? I've been looking at it and wondering if we should just revert to AM for all non-contig cases to start. Then we can add ch4/ofi native carve-outs for cases where we think we can perform better. This way we can delete a lot of this overly complex stuff right off the bat and make changes more easily.

Of course, currently the OFI AM path handles the non-contig case using IOVs as well. I have a PR (#4423) to fix the pt2pt AM LMT path; it might be a good time to review that PR first. The non-contig RMA AM path currently also uses IOVs at the ch4-generic layer. We'll need some refactoring there first before addressing it at the OFI layer.

@pavanbalaji (Contributor, Author)

> Is it worth keeping the current non-contig implementation around at all? [...]

That makes sense to me. But I think the AM code does not handle the IOV path (i.e., it always packs), so we'd lose performance in some cases. Presumably you mean you'll add that back?

@raffenet (Contributor)

> That makes sense to me. But I think the AM code does not handle the IOV path (i.e., it always packs), so we'd lose performance in some cases. Presumably you mean you'll add that back?

Yes, in a more targeted fashion that's hopefully easier to grok.

@pavanbalaji (Contributor, Author)

To be clear, there are four cases that would need to be handled (a dispatch sketch based on datatype density follows the list):

  1. Both origin and target datatypes are low density (each individual contiguous chunk is small). In this case, I think the OFI netmod should simply use pack/unpack. Offloading this case to the AM layer might be OK as we cannot benefit from hardware RDMA anyway.

  2. Both origin and target datatypes are high density (each individual contiguous chunk is large). In this case, I think the OFI netmod can simply convert the message to a series of contiguous RDMA operations. We won't need IOVs here, but I think this case would need to be handled by the OFI netmod instead of AM so that it can actually use RDMA operations.

  3. Origin is high density and target is low density. We'd need unpacking on the target buffer, so I think this case is OK to offload to AM, which would need to send individual contiguous pieces from the origin buffer as independent messages, which could be unpacked on the target side.

  4. Origin is low density and target is high density. In this case, the origin data would need to be packed into a temporary buffer and then RDMA issued to the target buffer. I think this case would need to be handled by the OFI netmod in order to use hardware RDMA.
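A minimal sketch of what the dispatch over these four cases could look like, assuming the density measure (size / max_contig_blocks) discussed later in this thread; the dtype_info struct, the helper names, and DENSITY_THRESHOLD are illustrative stand-ins, not the actual MPICH structures or constants.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified view of the two datatype fields the discussion relies on. */
typedef struct {
    size_t size;              /* total bytes described by the datatype */
    size_t max_contig_blocks; /* number of contiguous segments */
} dtype_info;

/* density = average contiguous chunk length */
static size_t density(const dtype_info *dt)
{
    return dt->max_contig_blocks ? dt->size / dt->max_contig_blocks : dt->size;
}

enum rma_path {
    RMA_PATH_AM_PACK,        /* case 1: both sides low density  -> pack/unpack via AM */
    RMA_PATH_RDMA_PER_CHUNK, /* case 2: both sides high density -> contiguous RDMA per chunk */
    RMA_PATH_AM_UNPACK,      /* case 3: origin high, target low -> AM, unpack at target */
    RMA_PATH_PACK_THEN_RDMA, /* case 4: origin low, target high -> pack origin, then RDMA */
};

/* DENSITY_THRESHOLD is an assumed tuning knob, not an MPICH constant. */
#define DENSITY_THRESHOLD 1024

static enum rma_path choose_rma_path(const dtype_info *origin, const dtype_info *target)
{
    int origin_dense = density(origin) >= DENSITY_THRESHOLD;
    int target_dense = density(target) >= DENSITY_THRESHOLD;

    if (origin_dense && target_dense)
        return RMA_PATH_RDMA_PER_CHUNK;
    if (origin_dense)   /* target sparse */
        return RMA_PATH_AM_UNPACK;
    if (target_dense)   /* origin sparse */
        return RMA_PATH_PACK_THEN_RDMA;
    return RMA_PATH_AM_PACK;
}

int main(void)
{
    dtype_info origin = { .size = 1 << 20, .max_contig_blocks = 16 };    /* dense */
    dtype_info target = { .size = 1 << 20, .max_contig_blocks = 65536 }; /* sparse */
    printf("chosen path: %d\n", choose_rma_path(&origin, &target));
    return 0;
}
```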

@hzhou (Contributor) commented Apr 28, 2020

> But I think the AM code does not handle the IOV path (i.e., it always packs), so we'd lose performance in some cases. Presumably you mean you'll add that back?

For AM pt2pt, it always packs on the send side, but currently it always does IOV (RDMA read) on the receive side. For AM RMA, it always does IOV at the ch4 layer.

@pavanbalaji (Contributor, Author)

> For AM pt2pt, it always packs on the send side, but currently it always does IOV (RDMA read) on the receive side. For AM RMA, it always does IOV at the ch4 layer.

This issue is for RMA communication, although similar problems might exist in the send/recv path too.

@hzhou (Contributor) commented Apr 28, 2020

> This issue is for RMA communication, although similar problems might exist in the send/recv path too.

Right. Just for information, the AM RMA path always does IOV for non-contig datatypes and has the same issue dealing with large, sparsely segmented data.

raffenet added a commit to raffenet/mpich that referenced this issue Apr 28, 2020:
"This implementation complicates datatype optimization, and also has correctness issues. See pmodels#4468."
@raffenet (Contributor)

> To be clear, there are four cases that would need to be handled: [...]

As far as density goes, this is probably something we want to sort out at type creation time, right? I was thinking of putting some ch4/ofi-specific fields in the netmod private area of the datatype. If successful, we can generalize it for other transports.

@raffenet (Contributor)

Or is density just as simple as size / max_contig_blocks?

@hzhou (Contributor) commented Apr 29, 2020

> Or is density just as simple as size / max_contig_blocks?

I think this covers the majority of cases. It is rare to have non-uniform non-contig datatypes.

@pavanbalaji (Contributor, Author)

Density is just size / max_contig_blocks. It's very quick to calculate.
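For concreteness (numbers invented for illustration): a vector type of 1000 blocks of 8 contiguous bytes each has size 8000 and max_contig_blocks 1000, so its density is 8000 / 1000 = 8 bytes per contiguous chunk, and it would take the pack/AM path; the same 8000 bytes laid out as 4 blocks of 2000 bytes has a density of 2000 and would take the native RDMA path.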

raffenet and pavanbalaji referenced this issue in further commits between Apr 30 and May 4, 2020.
@raffenet (Contributor) commented May 4, 2020

> 4. Origin is low density and target is high density. In this case, the origin data would need to be packed into a temporary buffer and then RDMA issued to the target buffer. I think this case would need to be handled by the OFI netmod in order to use hardware RDMA.

In implementing case 4 for MPI_Put, I still use the MPIR_Typerep_to_iov function to get the size and offset of the contiguous chunks at the target. Is there any better way I am missing before I start on case 2?

@hzhou (Contributor) commented May 4, 2020

> In implementing case 4 for MPI_Put, I still use the MPIR_Typerep_to_iov function to get the size and offset of the contiguous chunks at the target. Is there any better way I am missing before I start on case 2?

In this case, we need to send the packed data plus the target IOV -- using MPIR_Typerep_to_iov seems fine; even if yaksa provides an alternative serialization later, it should be a drop-in replacement -- and the remote side needs to MPIR_Typerep_unpack (or do the equivalent memcpys). We can't do hardware RDMA because the origin side is low density, so the RDMA would be fragmented and inefficient.

EDIT: I am sorry, this is not the active-message path 😄. Yeah, RDMA should work. Local pack - MPIR_Typerep_to_iov - RDMA sounds good.
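A standalone sketch of the "local pack - target IOV - RDMA" flow for case 4 described above. The target_chunk list stands in for the output of MPIR_Typerep_to_iov, the packed buffer stands in for the result of the origin-side pack, and issue_rdma_write is a hypothetical placeholder for the netmod's contiguous RDMA write; none of these are the real MPICH/OFI interfaces.

```c
#include <stddef.h>

/* One contiguous piece of the target datatype, e.g. derived from
 * MPIR_Typerep_to_iov (hypothetical simplification). */
typedef struct {
    size_t disp;   /* displacement into the target window, in bytes */
    size_t len;    /* contiguous length in bytes */
} target_chunk;

/* Hypothetical stand-in for the netmod's contiguous RDMA write. */
static int issue_rdma_write(const void *buf, size_t len, size_t target_disp)
{
    (void) buf; (void) len; (void) target_disp;
    return 0;
}

/* Case 4: origin is low density, target is high density.
 * 1. The sparse origin data has already been packed into 'packed'.
 * 2. Walk the target chunk list in order.
 * 3. Issue one contiguous RDMA write per target chunk, consuming the
 *    packed buffer sequentially. */
static int put_packed_to_iov(const char *packed, size_t packed_bytes,
                             const target_chunk *tgt, size_t nchunks)
{
    size_t consumed = 0;
    for (size_t i = 0; i < nchunks && consumed < packed_bytes; i++) {
        size_t len = tgt[i].len;
        if (len > packed_bytes - consumed)
            len = packed_bytes - consumed;
        int rc = issue_rdma_write(packed + consumed, len, tgt[i].disp);
        if (rc != 0)
            return rc;
        consumed += len;
    }
    return 0;
}

int main(void)
{
    /* Pretend the origin data was packed into this buffer. */
    char packed[24] = "abcdefghijklmnopqrstuvw";
    /* Pretend these chunks came from expanding the target datatype. */
    target_chunk tgt[2] = { { .disp = 0, .len = 16 }, { .disp = 64, .len = 8 } };
    return put_packed_to_iov(packed, sizeof packed, tgt, 2);
}
```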

@raffenet (Contributor) commented May 4, 2020

The reason I ask is that scenario 2 could get complex with two high-density types. What I envision is creating two sets of iovecs and iterating over both, always issuing the minimum-size contiguous RDMA operation. Hopefully, in the common case the chunks are of roughly equal size and there isn't a lot of fragmentation.

raffenet added a commit to raffenet/mpich that referenced this issue May 4, 2020.
@pavanbalaji (Contributor, Author)

> The reason I ask is that scenario 2 could get complex with two high-density types. What I envision is creating two sets of iovecs and iterating over both, always issuing the minimum-size contiguous RDMA operation. Hopefully, in the common case the chunks are of roughly equal size and there isn't a lot of fragmentation.

Yes, you'd need to convert each high-density type (origin and/or target) to an IOV and then do the individual RDMA operations. You'd have some fragmentation (because density is the average of the contiguous lengths), but I don't think it'd be too much.
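A small, self-contained sketch of the two-sided iteration described above for case 2: walk the origin and target chunk lists in lock step and, at each step, issue a contiguous RDMA operation of the minimum remaining length on either side. The chunk struct and issue_rdma_write are illustrative placeholders, not the actual netmod interfaces.

```c
#include <stddef.h>

typedef struct {
    size_t disp;   /* displacement into the local buffer / target window */
    size_t len;    /* contiguous length in bytes */
} chunk;

/* Hypothetical stand-in for the netmod's contiguous RDMA write. */
static int issue_rdma_write(size_t origin_disp, size_t target_disp, size_t len)
{
    (void) origin_disp; (void) target_disp; (void) len;
    return 0;
}

/* Case 2: both sides high density. Walk both chunk lists in lock step,
 * issuing min(remaining origin chunk, remaining target chunk) each time. */
static int put_iov_to_iov(const chunk *org, size_t norg,
                          const chunk *tgt, size_t ntgt)
{
    size_t oi = 0, ti = 0;      /* current chunk index on each side */
    size_t ooff = 0, toff = 0;  /* bytes already consumed in the current chunk */

    while (oi < norg && ti < ntgt) {
        size_t orem = org[oi].len - ooff;
        size_t trem = tgt[ti].len - toff;
        size_t len = orem < trem ? orem : trem;

        int rc = issue_rdma_write(org[oi].disp + ooff, tgt[ti].disp + toff, len);
        if (rc != 0)
            return rc;

        ooff += len;
        toff += len;
        if (ooff == org[oi].len) { oi++; ooff = 0; }
        if (toff == tgt[ti].len) { ti++; toff = 0; }
    }
    return 0;
}

int main(void)
{
    /* Both sides describe 48 bytes, but with different chunk boundaries. */
    chunk org[2] = { { .disp = 0,   .len = 32 }, { .disp = 100, .len = 16 } };
    chunk tgt[3] = { { .disp = 0,   .len = 16 }, { .disp = 50,  .len = 16 },
                     { .disp = 200, .len = 16 } };
    return put_iov_to_iov(org, 2, tgt, 3);
}
```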

pavanbalaji and raffenet referenced this issue in further commits between May 4 and May 18, 2020.
@hzhou (Contributor) commented Jun 4, 2021

Multiple fixes have been merged, and I believe this ticket has served its discussion purpose.

hzhou closed this as completed Jun 4, 2021.