
SLURM 10 nodes good, 16 nodes error #178

Closed
Lightup1 opened this issue Oct 17, 2021 · 3 comments
Labels
manager: SLURM The Slurm Workload Manager

Comments


Lightup1 commented Oct 17, 2021

I'm using an HPC cluster with Slurm. Every node has 24 CPUs, and I'm permitted to use up to 16 nodes simultaneously.
To test my code, I wrote a .sh file:

#!/bin/bash
#SBATCH -n 384 -N 16
#SBATCH --ntasks-per-node 24
#SBATCH --cpus-per-task=1
#SBATCH -J test
#SBATCH -p work
#SBATCH -t 00:15:00
julia 1.2\ th2testp.jl

and a "1.2 th2testp.jl" file:

using Distributed
using JLD
using ClusterManagers
addprocs(SlurmManager(384),N=16,t="00:15:00")
N_t=@distributed (+) for i in workers()
         i
end
println(N_t)

Then I get an error:

WARNING: failed to select UTF-8 encoding, using ASCII
ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1082
     [2] worker_from_id
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1079 [inlined]
     [3] #remote_do#154
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [4] remote_do
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [5] kill
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:593
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:532
     [2] connect
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:639
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:566
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:589
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [7] (::Distributed.var"#41#44"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

...and 311 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:369
 [2] macro expansion
   @ ./task.jl:388 [inlined]
 [3] addprocs_locked(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:N,), Tuple{Int64}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:480
 [4] addprocs(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:N,), Tuple{Int64}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [5] top-level scope
   @ ~/Yuby/SidebandCooling/1.2 th2testp.jl:4
in expression starting at /WORK/hust_jmcai_1/Yuby/SidebandCooling/1.2 th2testp.jl:4
connecting to worker 1 out of 384
connecting to worker 2 out of 384
connecting to worker 3 out of 384
...
connecting to worker 383 out of 384
connecting to worker 384 out of 384
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

But when I change to 10 nodes with 240 CPUs, the error disappears and I get the right answer.

What causes this?
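For reference, a minimal sketch of the same test that sizes the worker pool from Slurm's SLURM_NTASKS / SLURM_NNODES environment variables instead of hardcoding 384 (these are standard Slurm batch variables; this should not change the behaviour reported above, it just makes node-count experiments easier):

# Sketch only: derive the pool size from the allocation so the same script
# works for any -N/-n combination passed to sbatch.
using Distributed
using ClusterManagers

ntasks = parse(Int, get(ENV, "SLURM_NTASKS", "1"))   # set by sbatch from -n
nnodes = parse(Int, get(ENV, "SLURM_NNODES", "1"))   # set by sbatch from -N
addprocs(SlurmManager(ntasks), N=nnodes, t="00:15:00")

total = @distributed (+) for i in workers()
    i
end
println(total)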

kescobo added the bug and manager: SLURM (The Slurm Workload Manager) labels on Oct 18, 2021
kescobo (Collaborator) commented Oct 18, 2021

This sounds to me like something specific to your cluster. Did you try any numbers other than 10 and 16? I'm most curious about 15...
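For instance, a quick way to bisect it (a sketch; assumes the submission script is saved as test.sh and that the Julia side sizes addprocs from the environment as in the sketch above, so only the sbatch flags change):

# Hypothetical sweep: submit the same test once per node count to find
# where the failure first appears (sbatch's command-line -N/-n override the #SBATCH lines).
for n in (10, 11, 12, 13, 14, 15, 16)
    run(`sbatch -N $n -n $(24n) --ntasks-per-node=24 test.sh`)
end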

Lightup1 (Author) commented Oct 19, 2021

15 nodes: similar error

$cat slurm-17614857.out
WARNING: failed to select UTF-8 encoding, using ASCII
ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1082
     [2] worker_from_id
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1079 [inlined]
     [3] #remote_do#154
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [4] remote_do
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [5] kill
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:593
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:532
     [2] connect
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:639
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:566
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:589
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [7] (::Distributed.var"#41#44"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

...and 143 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:369
 [2] macro expansion
   @ ./task.jl:388 [inlined]
 [3] addprocs_locked(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:N, :t), Tuple{Int64, String}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:480
 [4] addprocs(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:N, :t), Tuple{Int64, String}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [5] top-level scope
   @ ~/Yuby/SidebandCooling/1.2th2testp.jl:4
in expression starting at /WORK/hust_jmcai_1/Yuby/SidebandCooling/1.2th2testp.jl:4
connecting to worker 1 out of 360
connecting to worker 2 out of 360
connecting to worker 3 out of 360
...
connecting to worker 359 out of 360
connecting to worker 360 out of 360
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: cn11761: task 250: Exited with exit code 143
srun: error: Timed out waiting for job step to complete

11 nodes: works fine.

$cat slurm-17614860.out
WARNING: failed to select UTF-8 encoding, using ASCII
connecting to worker 1 out of 264
connecting to worker 2 out of 264
connecting to worker 3 out of 264
...
connecting to worker 263 out of 264
connecting to worker 264 out of 264
35244

13 nodes: cancelled due to the wall time I set (5 or 15 minutes, I can't be sure which), but I think that is enough for a simple task like this.

WARNING: failed to select UTF-8 encoding, using ASCII
srun: error: cn9801: tasks 4,8-11,21: Exited with exit code 1
slurmd[cn9801]: *** JOB 17614903 CANCELLED AT 2021-10-17T17:48:09 ***

signal (15): Terminated
in expression starting at /WORK/hust_jmcai_1/Yuby/SidebandCooling/1.2th2testp.jl:4
__xstat64 at /lib64/libc.so.6 (unknown line)
uv__fs_stat at /workspace/srcdir/libuv/src/unix/fs.c:1531
uv__fs_work at /workspace/srcdir/libuv/src/unix/fs.c:1678
uv_fs_stat at /workspace/srcdir/libuv/src/unix/fs.c:2073
jl_stat at /buildworker/worker/package_linux64/build/src/sys.c:128
stat at ./stat.jl:67
isfile at ./stat.jl:311 [inlined]
launch at /WORK/hust_jmcai_1/.julia/packages/ClusterManagers/PVaRG/src/slurm.jl:64
#39 at ./task.jl:411
unknown function (ip: 0x2b934cff101c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:833
unknown function (ip: (nil))
Allocations: 142034229 (Pool: 142032131; Big: 2098); GC: 774
connecting to worker 1 out of 312

Lightup1 (Author) commented
I checked with the manager of our HPC: a common user can only use 10 nodes at the same time. I will close the issue.
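For anyone hitting the same wall, a small guard along these lines keeps the request within the site limit (a sketch; MAX_NODES = 10 reflects this cluster's policy and would differ elsewhere):

using Distributed
using ClusterManagers

const MAX_NODES = 10                 # per-user node limit reported by the HPC admins
tasks_per_node = 24
nnodes = min(parse(Int, get(ENV, "SLURM_NNODES", "1")), MAX_NODES)
addprocs(SlurmManager(nnodes * tasks_per_node), N=nnodes, t="00:15:00")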

kescobo removed the bug label on Apr 13, 2022