Describe the bug
psm3 over verbs hangs in roughly one out of every five of our CI runs during the fi_multinode test (3 peers). I can reproduce it by hand, though not consistently. Most often the hang occurs during a barrier: two of the peers have sent all their messages, but the third gets stuck waiting to receive. Below is a backtrace from the stuck peer.
To Reproduce
fi_multinode -p psm3 -C msg -n 3 -s
^ The server runs on one node and the two clients (same command) run on a different node. I'm not sure the split across nodes is necessary to reproduce; it is just how our CI is set up.
Output
psm3_verbs_recvhdrq_progress (recvq=0x10addf8) at prov/psm3/psm3/hal_verbs/verbs_recvhdrq.c:189
189 PSMI_CACHEALIGN struct ips_recvhdrq_event rcv_ev = {
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-164.el8.x86_64 libibverbs-56mlnx40-1.56103.x86_64 libnl3-3.5.0-1.el8.x86_64 librdmacm-56mlnx40-1.56103.x86_64 libuuid-2.32.1-28.el8.x86_64 numactl-libs-2.0.12-13.el8.x86_64
(gdb) bt
#0  psm3_verbs_recvhdrq_progress (recvq=0x10addf8) at prov/psm3/psm3/hal_verbs/verbs_recvhdrq.c:189
#1  0x00007f7f31c928fa in psm3_verbs_ips_ptl_poll (ptl_gen=0x10a8300, _ignored=0) at prov/psm3/psm3/hal_verbs/verbs_ptl_ips.c:116
#2  0x00007f7f31c98669 in psm3_poll_internal (ep=0x10a7b40, poll_amsh=1) at prov/psm3/psm3/psm.c:1624
#3  0x00007f7f31cadac6 in psm3_mq_ipeek_dequeue_multi (mq=0x101f250, status_array=0x7ffe725b8b00, status_copy=0x7f7f31c5b65e <psmx3_mq_status_copy>, count=0x7ffe725b8ae4)
    at prov/psm3/psm3/psm_mq.c:1154
#4  0x00007f7f31c5d163 in psmx3_cq_poll_mq (cq=0x10244d0, trx_ctxt=0x1022910, event_in=0x7ffe725b8c60, count=0, src_addr=0x0) at prov/psm3/src/psmx3_cq.c:833
#5  0x00007f7f31c5d220 in psmx3_cq_readfrom (cq=0x10244d0, buf=0x7ffe725b8c60, count=1, src_addr=0x0) at prov/psm3/src/psmx3_cq.c:861
#6  0x00007f7f31c5d52a in psmx3_cq_read (cq=0x10244d0, buf=0x7ffe725b8c60, count=1) at prov/psm3/src/psmx3_cq.c:949
#7  0x0000000000404da9 in fi_cq_read (cq=0x10244d0, buf=0x7ffe725b8c60, count=1) at /home/aingerso/install/libfabric/include/rdma/fi_eq.h:394
#8  0x000000000040e533 in ft_spin_for_comp (cq=0x10244d0, cur=0x61be60 <rx_cq_cntr>, total=6, timeout=-1) at common/shared.c:2287
#9  0x000000000040e949 in ft_get_cq_comp (cq=0x10244d0, cur=0x61be60 <rx_cq_cntr>, total=6, timeout=-1) at common/shared.c:2378
#10 0x000000000040ec62 in ft_get_rx_comp (total=6) at common/shared.c:2458
#11 0x0000000000403b7f in send_recv_barrier (sync=0) at multinode/src/core.c:395
#12 0x0000000000403d69 in multi_run_test () at multinode/src/core.c:442
#13 0x00000000004040c3 in multinode_run_tests (argc=9, argv=0x7ffe725b8ee8) at multinode/src/core.c:505
#14 0x0000000000402770 in main (argc=9, argv=0x7ffe725b8ee8) at multinode/src/harness.c:371
Environment:
Linux
psm3 is showing transient failures with the multinode test.
Will re-enable once issue ofiwg#8090 is resolved.
Signed-off-by: Alexia Ingerson <[email protected]>