submodule: Update UCX to v1.12.0 #5812

Merged
raffenet merged 1 commit into pmodels:main from ucx-submodule on Feb 9, 2022

@raffenet raffenet commented Jan 28, 2022

Pull Request Description

A user reported a bug when using the embedded UCX with MPICH 4.0 that went away with an external UCX 1.12.0 install. Update the submodule to 1.12.0, then backport once merged.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@raffenet

test:mpich/ch4/ucx

@raffenet

test:mpich/ch4/ucx

@raffenet raffenet requested a review from hzhou January 31, 2022 21:27
@hzhou

hzhou commented Jan 31, 2022

> A user reported a bug when using the embedded UCX with MPICH 4.0 that went away with an external UCX 1.12.0 install. Update the submodule to 1.12.0, then backport once merged.

Did the user provide any details on the bug? MPICH is likely to be deployed with different external UCX versions, so it is good to have some record of what works and what does not.

@hzhou hzhou left a comment

LGTM

@raffenet

raffenet commented Feb 1, 2022

> A user reported a bug when using the embedded UCX with MPICH 4.0 that went away with an external UCX 1.12.0 install. Update the submodule to 1.12.0, then backport once merged.
>
> Did the user provide any details on the bug? MPICH is likely to be deployed with different external UCX versions, so it is good to have some record of what works and what does not.

I'm asking for more details. After chatting a bit more, I kind of suspect it was user error, but we will see.

@raffenet

raffenet commented Feb 9, 2022

> A user reported a bug when using the embedded UCX with MPICH 4.0 that went away with an external UCX 1.12.0 install. Update the submodule to 1.12.0, then backport once merged.
>
> Did the user provide any details on the bug? MPICH is likely to be deployed with different external UCX versions, so it is good to have some record of what works and what does not.
>
> I'm asking for more details. After chatting a bit more, I kind of suspect it was user error, but we will see.

Here is the error behavior seen on ThetaGPU. Apparently this goes away with the newer UCX 1.12.0. I don't have an allocation on ThetaGPU, but might be able to try on a JLSE machine instead.

thetagpu11:~/soft/petsc/packages$ mpiexec -n 1 -host localhost ./hello
[1644178926.844524] [thetagpu11:3110029:0]          parser.c:1885 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1644178926.861597] [thetagpu11:3110029:0]          wireup.c:1027 UCX  ERROR   old: am_lane 1 wireup_msg_lane 3 cm_lane <none> reachable_mds 0x1ffff ep_check_map 0x0
[1644178926.861633] [thetagpu11:3110029:0]          wireup.c:1048 UCX  ERROR   old: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] rkey_ptr
[1644178926.861638] [thetagpu11:3110029:0]          wireup.c:1048 UCX  ERROR   old: lane[1]:  2:self/memory0.0 md[2]          -> md[2]/self/sysdev[255] am am_bw#0
[1644178926.861640] [thetagpu11:3110029:0]          wireup.c:1048 UCX  ERROR   old: lane[2]:  8:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#0
[1644178926.861643] [thetagpu11:3110029:0]          wireup.c:1048 UCX  ERROR   old: lane[3]: 13:rc_mlx5/mlx5_1:1.0 md[5]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[1644178926.861645] [thetagpu11:3110029:0]          wireup.c:1027 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> reachable_mds 0x1ffff ep_check_map 0x0
[1644178926.861647] [thetagpu11:3110029:0]          wireup.c:1048 UCX  ERROR   new: lane[0]:  2:self/memory0.0 md[2]          -> md[2]/self/sysdev[255] am am_bw#0
[1644178926.861649] [thetagpu11:3110029:0]          wireup.c:1048 UCX  ERROR   new: lane[1]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255]
[1644178926.861651] [thetagpu11:3110029:0]          wireup.c:1048 UCX  ERROR   new: lane[2]:  8:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#0
[1644178926.861654] [thetagpu11:3110029:0]          wireup.c:1048 UCX  ERROR   new: lane[3]: 13:rc_mlx5/mlx5_1:1.0 md[5]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[thetagpu11:3110029:0:3110029]      wireup.c:1336 Fatal: endpoint reconfiguration not supported yet
==== backtrace (tid:3110029) ====
 0  /lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f781c123864]
 1  /lib/libucs.so.0(ucs_fatal_error_message+0xb8) [0x7f781c120498]
 2  /lib/libucs.so.0(ucs_fatal_error_format+0x11d) [0x7f781c1205bd]
 3  /lib/libucp.so.0(ucp_wireup_init_lanes+0x3c9) [0x7f781c1c8e59]
 4  /lib/libucp.so.0(+0x75d20) [0x7f781c1c6d20]
 5  /lib/libucp.so.0(+0x76559) [0x7f781c1c7559]
 6  /lib/libuct.so.0(uct_self_ep_am_bcopy+0x69) [0x7f781a10d1d9]
 7  /lib/libucp.so.0(ucp_wireup_msg_progress+0xc9) [0x7f781c1c5fa9]
 8  /lib/libucp.so.0(+0x752c2) [0x7f781c1c62c2]
 9  /lib/libucp.so.0(ucp_wireup_send_request+0x14e) [0x7f781c1c901e]
10  /lib/libucp.so.0(+0x2526d) [0x7f781c17626d]
11  /lib/libucp.so.0(ucp_ep_create+0xe8) [0x7f781c176488]
12  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0xbf22fa) [0x7f781e4932fa]
13  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0xbf2e5b) [0x7f781e493e5b]
14  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0xbab66d) [0x7f781e44c66d]
15  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0xae9a01) [0x7f781e38aa01]
16  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0xaea8fd) [0x7f781e38b8fd]
17  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0xae6081) [0x7f781e387081]
18  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0xb29528) [0x7f781e3ca528]
19  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0xb28e85) [0x7f781e3c9e85]
20  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(+0x8842a0) [0x7f781e1252a0]
21  /home/gbetrie/soft/mpich-4.0rc3/lib/libmpi.so.0(PMPI_Init+0x1d) [0x7f781e1253ad]
22  ./hello() [0x4011bc]
23  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f781c2380b3]
24  ./hello() [0x4010ce]
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3110029 RUNNING AT localhost
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
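
For the record of what differs between the two configurations in the dump above, here is a minimal Python sketch that parses the `old:` and `new:` wireup lines and lists the lanes that changed. The helper names (`parse_wireup`, `diff_lanes`) and the regular expressions are illustrative assumptions based only on the log format shown here; they are not part of UCX or MPICH, and the sketch says nothing about the root cause.

```python
import re

# Wireup dump copied from the log above, with the timestamp/host prefix dropped.
LOG = """\
old: am_lane 1 wireup_msg_lane 3 cm_lane <none> reachable_mds 0x1ffff ep_check_map 0x0
old: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] rkey_ptr
old: lane[1]:  2:self/memory0.0 md[2]          -> md[2]/self/sysdev[255] am am_bw#0
old: lane[2]:  8:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#0
old: lane[3]: 13:rc_mlx5/mlx5_1:1.0 md[5]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
new: am_lane 0 wireup_msg_lane 3 cm_lane <none> reachable_mds 0x1ffff ep_check_map 0x0
new: lane[0]:  2:self/memory0.0 md[2]          -> md[2]/self/sysdev[255] am am_bw#0
new: lane[1]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255]
new: lane[2]:  8:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#0
new: lane[3]: 13:rc_mlx5/mlx5_1:1.0 md[5]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
"""

# Hypothetical patterns, written against the line layout seen in this log only.
AM_RE = re.compile(r"(old|new): am_lane (\d+)")
LANE_RE = re.compile(
    r"(old|new): lane\[(\d+)\]:\s+\d+:(\S+)\s+md\[\d+\]\s+->\s+\S+(.*)$"
)


def parse_wireup(text):
    """Collect am_lane and per-lane (transport, usage flags) for old and new."""
    cfg = {
        "old": {"am_lane": None, "lanes": {}},
        "new": {"am_lane": None, "lanes": {}},
    }
    for line in text.splitlines():
        m = AM_RE.search(line)
        if m:
            cfg[m.group(1)]["am_lane"] = int(m.group(2))
            continue
        m = LANE_RE.search(line)
        if m:
            which, idx, transport, usage = m.groups()
            cfg[which]["lanes"][int(idx)] = (transport, tuple(usage.split()))
    return cfg


def diff_lanes(cfg):
    """Return (lane index, old entry, new entry) for every lane that differs."""
    indices = sorted(set(cfg["old"]["lanes"]) | set(cfg["new"]["lanes"]))
    changed = []
    for idx in indices:
        old = cfg["old"]["lanes"].get(idx)
        new = cfg["new"]["lanes"].get(idx)
        if old != new:
            changed.append((idx, old, new))
    return changed


cfg = parse_wireup(LOG)
print("am_lane:", cfg["old"]["am_lane"], "->", cfg["new"]["am_lane"])
for idx, old, new in diff_lanes(cfg):
    print(f"lane[{idx}]: {old} -> {new}")
```

Run against this dump, the sketch reports that only the two shared-memory/self lanes differ between the old and new configurations, while the two rc_mlx5 lanes are identical.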

@raffenet raffenet merged commit f53fe20 into pmodels:main Feb 9, 2022
@raffenet raffenet deleted the ucx-submodule branch February 9, 2022 18:33