MPI_Transpose Worker-to-Worker communication failing #278

Closed · rohanmclure opened this issue Jun 27, 2019 · 5 comments

Comments

rohanmclure (Contributor) commented Jun 27, 2019

Use of MPI.MPI_TRANSPORT_ALL under Julia v1.0.1 on an HPC cluster reaches line 285 of cman.jl, which calls get with a single integer argument and therefore crashes:

return start_send_event_loop(mgr, get(config.connect_at))

┌ Error: Error on 4 while connecting to peer 3, exiting
│   exception =
│    MethodError: no method matching get(::Int64)
│    Closest candidates are:
│      get(!Matched::Base.EnvDict, !Matched::AbstractString, !Matched::Any) at env.jl:77
│      get(!Matched::Base.TTY, !Matched::Symbol, !Matched::Any) at stream.jl:415
│      get(!Matched::REPL.Terminals.TTYTerminal, !Matched::Any, !Matched::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/Terminals.jl:176
│      ...
│    Stacktrace:
│     [1] connect(::MPIManager, ::Int64, ::WorkerConfig) at ~/.julia/packages/MPI/wu7um/src/cman.jl:288
│     [2] connect_to_peer(::MPIManager, ::Int64, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:338
│     [3] (::getfield(Distributed, Symbol("##123#125")){Int64,WorkerConfig})() at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:322
│     [4] exec_conn_func(::Distributed.Worker) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:134
│     [5] (::getfield(Distributed, Symbol("##25#28")){Distributed.Worker})() at ./task.jl:259
│     [5] (::getfield(Distributed, Symbol("##25#28")){Distributed.Worker})() at ./task.jl:259
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:344
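
For context: on Julia 0.6 the WorkerConfig fields were Nullable, so get(config.connect_at) unwrapped the stored value. Nullable was removed in Julia 0.7, and connect_at now holds the peer's rank (an Int64) directly, which is why the one-argument get above has no matching method. A minimal sketch of the kind of change needed (my assumption, not necessarily what the eventual fix does):

# cman.jl line 285: connect_at now stores the rank itself, so the
# Nullable-era unwrap can presumably be dropped (hypothetical fix):
return start_send_event_loop(mgr, config.connect_at)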
simonbyrne (Member) commented:

Wow, that's been around since 2015:
885c395#diff-0c8bfec18ea987d73d1f57b0c64511f3R290

@amitmurthy any idea what that should be?

simonbyrne (Member) commented:

I wonder why this wasn't caught in tests. @rohanmclure what did you do to call it?

rohanmclure (Contributor, Author) commented Jun 28, 2019

I believe this was invoked when worker processes attempted to remotecall one another.
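
For example, something of this shape should hit the failing path (a sketch, assuming the MPI_TRANSPORT_ALL manager is running with at least three processes):

# The first message between two workers makes Distributed open a lazy
# worker-to-worker connection, which lands in MPIManager's connect():
remotecall_fetch(2) do
    remotecall_fetch(myid, 3)  # worker 2 calls into worker 3
end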

rohanmclure (Contributor, Author) commented Jun 29, 2019

Essentially, the issue arises whenever worker processes message one another. I should note that I would have written this minimal example with RemoteChannels, but the commented-out code below appears to result in infinite recursion.

using Test
using MPI, Distributed

# Rank 0 becomes the Julia master and gets the manager back; the
# remaining MPI ranks serve as Distributed workers (pids 2, 3, ...).
mgr = MPI.start_main_loop(MPI.MPI_TRANSPORT_ALL)

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)

# Generating RemoteChannels on other workers also results in a crash 
# c2, c3 = RemoteChannel(() -> Channel{Int}(0), 2), RemoteChannel(() -> Channel{Int}(0), 3)

@assert nprocs() >= 3
# Unbuffered channels, created as globals on workers 2 and 3.
@fetchfrom 2 global c2 = Channel{Int}(0)
@fetchfrom 3 global c3 = Channel{Int}(0)

# Worker 2 pushes 2 into worker 3's channel, then waits for 3 in return.
b1 = remotecall(2) do
    l = 2
    @sync @spawnat 3 begin
        put!(c3, l)
    end
    return take!(c2) == 3
end
# Worker 3 takes from its own channel, then replies to worker 2.
b2 = remotecall(3) do
    correct = take!(c3) == 2
    l = 3
    @sync @spawnat 2 begin
        put!(c2, l)
    end
    return correct
end

@test fetch(b1)
@test fetch(b2)

MPI.stop_main_loop(mgr)

To run this script, use mpirun -np 3 julia <name of script>.jl
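
For reference, the RemoteChannel form alluded to above would look roughly like the sketch below; as noted, it also crashes, so it is included only to clarify what was attempted rather than as a workaround:

# Hypothetical RemoteChannel variant of the same exchange, with the
# channels hosted on workers 2 and 3 instead of declared as globals:
c2 = RemoteChannel(() -> Channel{Int}(0), 2)
c3 = RemoteChannel(() -> Channel{Int}(0), 3)

b1 = remotecall(2) do
    put!(c3, 2)            # blocks until worker 3 takes
    take!(c2) == 3
end
b2 = remotecall(3) do
    correct = take!(c3) == 2
    put!(c2, 3)
    correct
end
@test fetch(b1) && fetch(b2)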

@rohanmclure rohanmclure changed the title MPI_Transport_All Error on <pid_1> while connecting to peer <pid_2>, exiting MPI_Transpose Worker-to-Worker communication failing Jul 23, 2019
simonbyrne (Member) commented:

Fixed by #293
