ch4/ofi implementation incorrect for RMA #4468
Comments
Is it worth keeping the current non-contig implementation around at all? I've been looking at it and wondering if we should just revert to AM for all non-contig cases to start. Then we can add ch4/ofi native carve-outs for cases where we think we perform better. This way we can delete a lot of this overly complex stuff right off the bat and make changes more easily.
Of course, currently the ofi am path handles the non-contig case using IOVs as well. I have a PR (#4423) to fix the pt2pt am lmt path. It might be a good time to review that PR first. The non-contig rma am path also currently uses IOVs at the ch4-generic layer. We'll need some refactoring there first before addressing it at the ofi layer.
That makes sense to me. But I think the AM code does not handle the IOV path (i.e., it always packs), so we'd lose performance in some cases. Presumably you mean you'll add that back?
Yes, in a more targeted fashion that's hopefully easier to grok.
To be clear, there are four cases that would need to be handled:
For am pt2pt, it always packs on the send side, but currently always does iov (rdma-read) on the recv side. For am rma, it always does iov at the ch4 layer.
This issue is for RMA communication, although similar problems might exist in the send/recv path too. |
Right. Just for information, the am RMA path always does iov for non-contig datatypes and has the same issue dealing with large, sparsely segmented data.
This implementation complicates datatype optimization, and also has correctness issues. See pmodels#4468.
As far as density goes, this is probably something we want to sort out at type-creation time, right? I was thinking of putting some ch4/ofi-specific fields in the netmod private area of the datatype. If successful, we can generalize it for other transports.
Or is density just as simple as …
I think this covers the majority of the cases. It is rare to have non-uniform non-contig datatypes.
Density is just …
In implementing 4 for MPI_Put, I still utilize …
In this case, we need to send the packed data plus the target IOV -- utilize … EDIT: I am sorry, this is not the active message path 😄. Yeah, RDMA should work. Local pack …
The reason I ask is that scenario 2 could get complex with two high-density types. What I envision is creating two sets of iovecs and iterating over both, always issuing the minimum-size contiguous RDMA operation. Hopefully, for the common case, the chunks are of ~equal size and therefore there isn't a lot of fragmentation.
Yes, you'd need to convert each high-density type (origin and/or target) to an IOV and then do the individual RDMA operations. You'd have some fragmentation (because density is the average of the contiguous lengths), but I don't think it'd be too much.
Multiple fixes have been merged, and I believe this ticket has served its discussion purpose.
The `ch4/ofi` implementation seems to be functionally incorrect for RMA in a few cases, and is designed in a way that makes it impossible to extract performance in a few other cases. Here is a rough survey of some of the issues that I found:

In some cases, we are not respecting the `max_msg_size` attribute of OFI. We are simply issuing all of the data in a single message. This should probably be segmented into smaller messages, so that each message fits within the specified limit.

For noncontiguous communication, when counting the number of IOVs needed, the netmod attempts to split the data into smaller chunks so that a proper subset of the IOV array can form a "max_msg_size" segment (so we do not have a case where a subset of an IOV segment has to be sent within a message). This is an extremely inefficient path and is not intended to be used this way. Typerep cannot optimize this case without parsing through the entire datatype, which is very expensive. The intent is to use the `max_contig_blocks` element stored inside the `MPIR_Datatype` structure, which gives the maximum number of contiguous segments within the datatype. If further segmentation is needed based on network-specific limits, it should be performed by the netmod.

Ignoring the inefficiency mentioned above, the implementation is still incorrect because it assumes a highly structured datatype, where simply getting the number of IOV elements in the first part of the datatype and repeating it several times is sufficient. This is wrong because no offset is used to get the IOV segments of the later parts of the datatype. Clearly the MPICH testing is not catching this error, but the error still exists.
For RMA, there is no pack/unpack-based path at all. Converting noncontiguous segments to IOVs is extremely expensive, especially in cases where the IOV uses more metadata than the actual data itself. Like ch3, we should add a path where, if the average contiguous length of the datatype is below a threshold, we simply fall back to the AM path.
For RMA, in cases where we do not fall back to the AM path (i.e., each contiguous segment is large), the question remains whether we should prepare an IOV at all, or simply issue each operation individually. I suspect that, in that message range, issuing each operation individually would simplify the code and have no performance impact in practice. Assuming we can do that, the entire state machine that the RMA code infrastructure seems to maintain can simply be deleted.