Remove UCX environment variables from __init__, document in knownissues.md #370

Merged
simonbyrne merged 1 commit into master from sb/issues on Apr 23, 2020

Conversation

simonbyrne
Member

The underlying issue is now fixed in UCX 1.7.0 and later, so we can disable these hooks for now. I've added a known issues section to the docs, and included this there.
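For context, a minimal sketch of the kind of __init__ hook being removed here; the exact variable handling in MPI.jl may differ, so treat the names and condition as illustrative:

function __init__()
    # Export the UCX workaround before MPI_Init (and hence UCX) reads its
    # configuration, unless the user has already set it.
    if !haskey(ENV, "UCX_MEM_EVENTS")
        ENV["UCX_MEM_EVENTS"] = "no"
    end
end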

@simonbyrne
Member Author

Hmm, looks like it isn't fixed!

@simonbyrne
Member Author

Hmm, if I don't set UCX_MEM_EVENTS, then I start seeing ReadOnlyMemoryErrors

@simonbyrne
Member Author

Okay, the error on CI here appears to be different: JuliaGPU/GPUArrays.jl#266

@simonbyrne
Member Author

Okay, this is the simplest example I can come up with:

using MPI, CuArrays

MPI.Init()

send_mesg = CuArray{Float64}(1:3)        # device (GPU) send buffer
recv_mesg = CuArray{Float64}(undef, 3)   # device (GPU) receive buffer

# Self send/receive on rank 0, tag 32
rreq = MPI.Irecv!(recv_mesg, 0, 32, MPI.COMM_WORLD)
sreq = MPI.Isend(send_mesg, 0, 32, MPI.COMM_WORLD)

MPI.Waitall!([rreq, sreq])

If I run it without UCX_MEM_EVENTS=no, then I get

ERROR: LoadError: ReadOnlyMemoryError()
Stacktrace:
 [1] Isend(::MPI.Buffer{CuArray{Float64,1,Nothing}}, ::Int64, ::Int64, ::MPI.Comm) at /central/home/spjbyrne/src/MPI.jl/src/pointtopoint.jl:230
 [2] Isend(::CuArray{Float64,1,Nothing}, ::Int64, ::Int64, ::MPI.Comm) at /central/home/spjbyrne/src/MPI.jl/src/pointtopoint.jl:238
 [3] top-level scope at /central/home/spjbyrne/src/MPI.jl/test/xx.jl:9
 [4] include at ./boot.jl:328 [inlined]
 [5] include_relative(::Module, ::String) at ./loading.jl:1105
 [6] include(::Module, ::String) at ./Base.jl:31
 [7] exec_options(::Base.JLOptions) at ./client.jl:287
 [8] _start() at ./client.jl:460
in expression starting at /central/home/spjbyrne/src/MPI.jl/test/xx.jl:9
[1587440927.537380] [hpc-90-35:82320:0]          mpool.c:43   UCX  WARN  object 0x26f0a40 was not returned to mpool self_msg_desc
[1587440927.587980] [hpc-90-35:82320:0]          mpool.c:43   UCX  WARN  object 0x26ee640 was not returned to mpool ucp_requests
[1587440927.587992] [hpc-90-35:82320:0]          mpool.c:43   UCX  WARN  object 0x26ee800 was not returned to mpool ucp_requests
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[46063,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Looks like it could be related to openucx/ucx#4988?
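For reference, the workaround used while debugging is simply to export the variable before MPI initializes. Assuming UCX reads its UCX_* settings during MPI.Init, this can also be done from the reproduction script itself (illustrative sketch):

ENV["UCX_MEM_EVENTS"] = "no"   # must be set before MPI.Init() so UCX sees it

using MPI, CuArrays
MPI.Init()
# ... same Irecv!/Isend reproduction as above ...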

@simonbyrne
Member Author

simonbyrne commented Apr 21, 2020

Hmm, the gdb backtrace suggests UCX isn't correctly detecting that the buffer is a CuArray:

#0  0x00002aaaabbec436 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x00002aab3447eb6e in uct_am_short_fill_data (length=<optimized out>, payload=<optimized out>, header=<optimized out>,
    buffer=<optimized out>) at /home/naveed/imss_admin/build/ucx-1.8.0/src/uct/base/uct_iface.h:725
#2  uct_self_ep_am_short (tl_ep=<optimized out>, id=<optimized out>, header=<optimized out>, payload=<optimized out>, length=24)
    at sm/self/self.c:259
#3  0x00002aab34238cc8 in uct_ep_am_short (length=<optimized out>, payload=<optimized out>, header=<optimized out>, id=2 '\002',
    ep=<optimized out>) at /home/naveed/imss_admin/build/ucx-1.8.0/src/uct/api/uct.h:2424
#4  ucp_tag_eager_contig_short (self=0x28f5928) at tag/eager_snd.c:125
#5  0x00002aab34243268 in ucp_request_try_send (pending_flags=0, req_status=0x7fffffffb6d0, req=<optimized out>)
    at /home/naveed/imss_admin/build/ucx-1.8.0/src/ucp/core/ucp_request.inl:171
#6  ucp_request_send (pending_flags=0, req=<optimized out>)
    at /home/naveed/imss_admin/build/ucx-1.8.0/src/ucp/core/ucp_request.inl:206
#7  ucp_tag_send_req (enable_zcopy=1, proto=<optimized out>, cb=0x2aab34006400 <mca_pml_ucx_send_completion>,
    rndv_am_thresh=<optimized out>, rndv_rma_thresh=<optimized out>, msg_config=<optimized out>, dt_count=3, req=<optimized out>)
    at tag/tag_send.c:109
#8  ucp_tag_send_nb (ep=0x2aab2ffb2040, buffer=<optimized out>, count=3, datatype=<optimized out>, tag=<optimized out>,
    cb=0x2aab34006400 <mca_pml_ucx_send_completion>) at tag/tag_send.c:224
#9  0x00002aab34005046 in mca_pml_ucx_isend () from /central/software/OpenMPI/4.0.3_cuda-10.0/lib/openmpi/mca_pml_ucx.so
#10 0x00002aaad7ca6887 in PMPI_Isend () from /central/software/OpenMPI/4.0.3_cuda-10.0//lib/libmpi.so
#11 0x00002aaad785abcf in ?? ()
#12 0x00002aaadaef2db0 in ?? ()
#13 0x0000000000000011 in ?? ()
#14 0x0000000000000002 in ?? ()
#15 0x00007fffffffb850 in ?? ()

While the RPATH issue is now fixed in UCX 1.7.0 and later, it appears that the memory cache still does not work correctly. Switching to `UCX_MEMTYPE_CACHE=no` produces fewer warning messages.

@simonbyrne
Member Author

I've switched to using UCX_MEMTYPE_CACHE=no as that gives fewer warnings (and fixes both the RPATH and segfault issues).
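A minimal sketch of the setting described above, assuming it is exported before MPI.Init() so that UCX picks it up (where exactly it is set is documented in the knownissues docs added by this PR):

# Disable UCX's memory-type cache rather than its memory event hooks.
ENV["UCX_MEMTYPE_CACHE"] = "no"   # set before MPI.Init()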

simonbyrne merged commit 5ae9eb8 into master on Apr 23, 2020
simonbyrne deleted the sb/issues branch on April 23, 2020 at 21:25