-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vader fails IMB-EXT Unidir_Get test #3821
Comments
We are also digging into this issue, but any insight/help would be appreciated. We have tracked it down to a handful of commits and are currently bisecting further to narrow it down to one. But it looks like the rcache / mpool rework might be the source. The current thinking is that it's memory corruption. Adding some additional context to the error reported in vader here shows the |
I'm using
Here is the full trace of a failed run: shell$ cd imb-4.1/src/
shell$ mpirun -np 2 -mca pml ob1 -mca btl vader,self ./IMB-EXT Unidir_get
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.1, MPI-2 part
#------------------------------------------------------------
# Date : Thu Jul 6 10:39:28 2017
# Machine : ppc64le
# System : Linux
# Release : 3.10.0-510.el7.ppc64le
# Version : #1 SMP Wed Sep 21 14:46:20 EDT 2016
# MPI Version : 3.1
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# ./IMB-EXT Unidir_get
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Unidir_Get
#---------------------------------------------------
# Benchmarking Unidir_Get
# #processes = 2
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.22 0.00
4 1000 3.25 1.17
8 1000 2.87 2.66
16 1000 2.79 5.46
32 1000 2.59 11.80
64 1000 2.55 23.92
128 1000 2.53 48.26
256 1000 2.58 94.66
512 1000 2.56 190.48
[c712f6n06:131799] Read -1, expected 18446744073709547536, errno = 22
[c712f6n06:131799] Read -1, expected 18446744073709547536, errno = 22
[c712f6n06:131799] Read -1, expected 18446744073709550864, errno = 22
1024 1000 1.82 535.80
2048 1000 2.03 963.12 |
we found Unidir_Put works well with different options. so how does the vader code handle get and put differently? |
@jjhursey, are the patches in question also on the v3.0.x branch? I assume yes, but hoping no... |
@bwbarrett yeah it impacts the v3.0.x branch. Here is the breakdown for the release branches (as of a build from yesterday):
The leading suspect commit is the mpool/rcache rework (which is not in the v2.0.x branch): Valgrind shows some warnings in the ==85568== Conditional jump or move depends on uninitialised value(s)
==85568== at 0x47AD480: mca_mpool_base_tree_node_compare (mpool_base_tree.c:62)
==85568== by 0x46BD513: btree_insert (opal_rb_tree.c:342)
==85568== by 0x46BCC8F: opal_rb_tree_insert (opal_rb_tree.c:137)
==85568== by 0x47AD9AF: mca_mpool_base_tree_insert (mpool_base_tree.c:110)
==85568== by 0x47AC76B: mca_mpool_base_alloc (mpool_base_alloc.c:87)
==85568== by 0x41599AB: PMPI_Alloc_mem (palloc_mem.c:85)
==85568== by 0x100055A7: IMB_set_buf (in /home/me/imb-4.1/src/IMB-EXT)
==85568== by 0x1000631B: IMB_init_buffers_iter (in /home/me/imb-4.1/src/IMB-EXT)
==85568== by 0x10002123: main (in /home/me/imb-4.1/src/IMB-EXT)
==85568==
#---------------------------------------------------
# Benchmarking Unidir_Get
# #processes = 2
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 9.22 0.00
==85568== Conditional jump or move depends on uninitialised value(s)
==85568== at 0x47AD468: mca_mpool_base_tree_node_compare (mpool_base_tree.c:58)
==85568== by 0x46BCF1B: opal_rb_tree_find_with (opal_rb_tree.c:191)
==85568== by 0x47AD41B: opal_rb_tree_find (opal_rb_tree.h:156)
==85568== by 0x47ADB17: mca_mpool_base_tree_find (mpool_base_tree.c:143)
==85568== by 0x47AC7DB: mca_mpool_base_free (mpool_base_alloc.c:110)
==85568== by 0x41812CB: PMPI_Free_mem (pfree_mem.c:53)
==85568== by 0x1000554B: IMB_set_buf (in /home/me/imb-4.1/src/IMB-EXT)
==85568== by 0x1000631B: IMB_init_buffers_iter (in /home/me/imb-4.1/src/IMB-EXT)
==85568== by 0x10002123: main (in /home/me/imb-4.1/src/IMB-EXT) |
Yeah, looks like the item key is not being set. |
Not sure if it is the cause of this bug though. |
Maybe try adding this @ mpool_base_alloc.c:78
|
@hjelmn No love on that change. Though it did fix the valgrind complaints. |
Ok, so that is a required fix but there is still something else going on. Is this with CMA, KNEM, or XPMEM? |
CMA seems to be the only one active:
|
Here is something. Adding |
Could be. Maybe a bug in the fragment allocator? |
Was able to reproduce the issue on my mac with both put and get. Digging into it now. |
I think IMB_ones_mget and IMB_ones_mput in IMB_ones_unidir.c triggered the issue. There is no MPI_Win_fence for each MPI_win_Get/MPI_win_Put calls. |
A breadcrumb... So
The next question is why are we getting an I have to stop for today. But @wlepera and @mathbird will continue with this tomorrow. @hjelmn let us know if you turn up anything. Thanks! |
my 2 cents this morning: The input parameter size in the function mca_btl_vader_get_cma is not right whenever the error happened. It should equal to the bytes size of the data, but it showed as following: --- i = 1024, iter = 983 My next question is where size is calculated before mca_btl_vader_get_cma is called? The following simple code can also catch the same error.
|
I think I've narrowed this down a bit more. In pml_ob1_recvreq.c, the value for frag->rdma_length is set to bytes_remaining, with a value of 8208. Next, a successful call to mca_pml_ob1_recv_request_get_frag(frag) is made. After this, frag->rdma_length has changed to 4096, which ultimately causes the error in the calculation referenced by @jjhursey. I've traced this as far as the call to mca_bml_base_get, which makes a call into btl->btl_get, passing a pointer to cbdata (which is the frag struct). I suspect the value is being changed in this function, though I have not been able to confirm this yet. I'm attaching the output of the debug run, which includes filenames, line numbers, and values of the bytes_remaining, frag->rdma_length variables. The failed get is logged starting at line 1763 in the output file. The value flips from 8208 to 4096 between lines 1776 and 1778 |
I think we have a conceptual flaw in the RGET implementation. Let's ignore for a minute the comment in the mca_pml_ob1_recv_request_progress_rget function that talks about fragmentation. The loop in mca_pml_ob1_recv_request_progress_rget assumes that the fragment itself is available to the PML for meddling with upon return from the mca_pml_ob1_recv_request_get_frag function. The solution is to completely get rid of the fragmentation in the mca_pml_ob1_recv_request_progress_rget function (fragmentation that cannot work anyway because there is no feedback mechanism from the BTL back into the PML about how much data has been retrieved). The current fragmentation implementation is a M.A.J.O.R flaw, with a drastic impact on the performance of Open MPI. As long as the fragment can be reused in another context (which is that case as long as we have any pending requests or fragments behind MCA_PML_OB1_PROGRESS_PENDING), the current implementation send way more data than needed (because it will see a rdma_length that is not related to the current operation but rather another ongoing operation). This can easily be seen using the following patch. When the assert trigger diff --git a/ompi/mca/pml/ob1/pml_ob1_recvreq.c b/ompi/mca/pml/ob1/pml_ob1_recvreq.c
index ddd60f263c..9aa7783f3c 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvreq.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvreq.c
@@ -718,7 +718,12 @@ void mca_pml_ob1_recv_request_progress_rget( mca_pml_ob1_recv_request_t* recvreq
* get fragmentation internally. This is a reasonable solution since some btls do not
* need any fragmentation (sm, vader, self, etc). Remove this loop if this ends up
* being the case. */
+ int ii = 0;
+ size_t saved_offset[10] = {0}, saved_remaining[10] = {0};
+ size_t saved_length_before[10] = {0}, saved_length_after[10] = {0};
while (bytes_remaining > 0) {
+ saved_offset[ii] = offset;
+ saved_remaining[ii] = bytes_remaining;
/* allocate/initialize a fragment */
MCA_PML_OB1_RDMA_FRAG_ALLOC(frag);
if (OPAL_UNLIKELY(NULL == frag)) {
@@ -752,16 +757,18 @@ void mca_pml_ob1_recv_request_progress_rget( mca_pml_ob1_recv_request_t* recvreq
} else {
frag->rdma_length = bytes_remaining;
}
-
+ saved_length_before[ii] = frag->rdma_length;
/* NTH: TODO -- handle error conditions gracefully */
rc = mca_pml_ob1_recv_request_get_frag(frag);
if (OMPI_SUCCESS != rc) {
break;
}
-
+ saved_length_after[ii] = frag->rdma_length;
prev_sent = frag->rdma_length;
+ assert(prev_sent <= bytes_remaining);
bytes_remaining -= prev_sent;
offset += prev_sent;
+ ii++;
}
} As a side note, unlike all the other supported protocols the current implementation of the RGET completely ignores any pipelining setting provided to the OB1 PML. |
Per discussion at Chicago July 2017 meeting, @hjelmn will look at this in the immediate future. |
@hjelmn In case it helps, Aboorva found that if we disable the |
Whats funny is this fails miserably for me on my mac :-/ |
No vader RDMA there. |
setting |
My thinking is this is either an pml/ob1 or osc/pt2pt bug. |
Though it does pass with tcp.... |
It works with SM. How to you get it to fail on your laptop ? |
This looks like an old ob1 issue. I can provide a quick workaround in vader and work on a bigger fix for the ob1 issue later. |
This commit fixes a bug that occurs when the btl callback happens before the rget returns. In this case the fragment has been returned and is no longer valid. This commit saves the size before calling rget. This is valid since the BTL is not allowed to change the read size. Fixes open-mpi#3821 Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a bug that occurs when the btl callback happens before the rget returns. In this case the fragment has been returned and is no longer valid. This commit saves the size before calling rget. This is valid since the BTL is not allowed to change the read size. Fixes #3821 Signed-off-by: Nathan Hjelm <[email protected]>
This should be backported to the v2.x branch too |
I don't know about 3.x but I backported the patch to 2.1.1 (only conflict is the copyright in header) and it's still broken:
|
I'm setting up some builds this morning to verify for the release branches. It should impact |
This commit fixes a bug that occurs when the btl callback happens before the rget returns. In this case the fragment has been returned and is no longer valid. This commit saves the size before calling rget. This is valid since the BTL is not allowed to change the read size. Fixes open-mpi#3821 Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit e73ab93) Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a bug that occurs when the btl callback happens before the rget returns. In this case the fragment has been returned and is no longer valid. This commit saves the size before calling rget. This is valid since the BTL is not allowed to change the read size. Fixes open-mpi#3821 Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit e73ab93) Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a bug that occurs when the btl callback happens before the rget returns. In this case the fragment has been returned and is no longer valid. This commit saves the size before calling rget. This is valid since the BTL is not allowed to change the read size. Fixes open-mpi#3821 Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit e73ab93) Signed-off-by: Nathan Hjelm <[email protected]>
@nmorey There is a comment here about the pair of commits: #3845 (comment) Nathan has filed PRs to push this fix to the release branches:
|
Hi, I am testing OMPI with Intel IMB test “IMB-EXT Unidir_Get” as following. With “vader”, it gave error. Without “vader” or adding openib as “-mca btl openib,vader,tcp,self”, the test passed well. Does “vader” need special configure option to build? Does it need other options to make it work?
Thanks,
Dahai
#bytes #repetitions t[usec] Mbytes/sec 0 1000 0.04 0.00 4 1000 0.72 5.31 8 1000 0.89 8.54 16 1000 0.88 17.39 32 1000 1.10 27.76 64 1000 1.15 53.29 128 1000 0.91 134.17 256 1000 0.94 260.82 Read -1, expected 18446744073709547536, errno = 22 Read -1, expected 18446744073709547536, errno = 22 Read -1, expected 18446744073709550864, errno = 22 512 1000 0.95 516.29 1024 1000 0.56 1731.87 2048 1000 0.68 2892.14
The text was updated successfully, but these errors were encountered: