
SLURM 10 nodes good, 16 nodes error #178

Closed
Lightup1 opened this issue Oct 17, 2021 · 3 comments
Labels
manager: SLURM The Slurm Workload Manager

Comments


Lightup1 commented Oct 17, 2021

I'm using an HPC cluster with Slurm. Every node has 24 CPUs, and I'm permitted to use up to 16 nodes simultaneously.
To test my code, I wrote a .sh file:

#!/bin/bash
#SBATCH -n 384 -N 16
#SBATCH --ntasks-per-node 24
#SBATCH --cpus-per-task=1
#SBATCH -J test
#SBATCH -p work
#SBATCH -t 00:15:00
julia 1.2\ th2testp.jl

and a "1.2 th2testp.jl" file:

using Distributed
using JLD
using ClusterManagers
addprocs(SlurmManager(384),N=16,t="00:15:00")
N_t=@distributed (+) for i in workers()
         i
end
println(N_t)

Then I get an error:

WARNING: failed to select UTF-8 encoding, using ASCII
ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1082
     [2] worker_from_id
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1079 [inlined]
     [3] #remote_do#154
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [4] remote_do
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [5] kill
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:593
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:532
     [2] connect
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:639
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:566
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:589
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [7] (::Distributed.var"#41#44"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

...and 311 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:369
 [2] macro expansion
   @ ./task.jl:388 [inlined]
 [3] addprocs_locked(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:N,), Tuple{Int64}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:480
 [4] addprocs(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:N,), Tuple{Int64}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [5] top-level scope
   @ ~/Yuby/SidebandCooling/1.2 th2testp.jl:4
in expression starting at /WORK/hust_jmcai_1/Yuby/SidebandCooling/1.2 th2testp.jl:4
connecting to worker 1 out of 384
connecting to worker 2 out of 384
connecting to worker 3 out of 384
...
connecting to worker 383 out of 384
connecting to worker 384 out of 384
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

But when I change to 10 nodes with 240 CPUs, the error disappears and I get the right answer.

What causes this?
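For reference, a minimal sketch of the same test that sizes the worker pool from Slurm's SLURM_NTASKS / SLURM_NNODES environment variables instead of hardcoding 384 (these are standard Slurm batch variables; this should not change the behaviour reported above, it just makes node-count experiments easier):

# Sketch only: derive the pool size from the allocation so the same script
# works for any -N/-n combination passed to sbatch.
using Distributed
using ClusterManagers

ntasks = parse(Int, get(ENV, "SLURM_NTASKS", "1"))   # set by sbatch from -n
nnodes = parse(Int, get(ENV, "SLURM_NNODES", "1"))   # set by sbatch from -N
addprocs(SlurmManager(ntasks), N=nnodes, t="00:15:00")

total = @distributed (+) for i in workers()
    i
end
println(total)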

kescobo added the bug and manager: SLURM (The Slurm Workload Manager) labels on Oct 18, 2021
kescobo (Collaborator) commented Oct 18, 2021

This sounds to me like something specific to your cluster. Did you try any numbers other than 10 and 16? I'm most curious about 15...
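For instance, a quick way to bisect it (a sketch; assumes the submission script is saved as test.sh and that the Julia side sizes addprocs from the environment as in the sketch above, so only the sbatch flags change):

# Hypothetical sweep: submit the same test once per node count to find
# where the failure first appears (sbatch's command-line -N/-n override the #SBATCH lines).
for n in (10, 11, 12, 13, 14, 15, 16)
    run(`sbatch -N $n -n $(24n) --ntasks-per-node=24 test.sh`)
end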

Lightup1 (Author) commented Oct 19, 2021

15 nodes: similar error

$cat slurm-17614857.out
WARNING: failed to select UTF-8 encoding, using ASCII
ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1082
     [2] worker_from_id
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1079 [inlined]
     [3] #remote_do#154
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [4] remote_do
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [5] kill
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:593
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:532
     [2] connect
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:639
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:566
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:589
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [7] (::Distributed.var"#41#44"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

...and 143 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:369
 [2] macro expansion
   @ ./task.jl:388 [inlined]
 [3] addprocs_locked(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:N, :t), Tuple{Int64, String}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:480
 [4] addprocs(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:N, :t), Tuple{Int64, String}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [5] top-level scope
   @ ~/Yuby/SidebandCooling/1.2th2testp.jl:4
in expression starting at /WORK/hust_jmcai_1/Yuby/SidebandCooling/1.2th2testp.jl:4
connecting to worker 1 out of 360
connecting to worker 2 out of 360
connecting to worker 3 out of 360
...
connecting to worker 359 out of 360
connecting to worker 360 out of 360
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: cn11761: task 250: Exited with exit code 143
srun: error: Timed out waiting for job step to complete

11 nodes: works fine.

$cat slurm-17614860.out
WARNING: failed to select UTF-8 encoding, using ASCII
connecting to worker 1 out of 264
connecting to worker 2 out of 264
connecting to worker 3 out of 264
...
connecting to worker 263 out of 264
connecting to worker 264 out of 264
35244

13 nodes: cancelled due to the wall time I set (5 or 15 minutes, I can't be sure which), but I think that is enough for a simple task like this.

WARNING: failed to select UTF-8 encoding, using ASCII
srun: error: cn9801: tasks 4,8-11,21: Exited with exit code 1
slurmd[cn9801]: *** JOB 17614903 CANCELLED AT 2021-10-17T17:48:09 ***

signal (15): Terminated
in expression starting at /WORK/hust_jmcai_1/Yuby/SidebandCooling/1.2th2testp.jl:4
__xstat64 at /lib64/libc.so.6 (unknown line)
uv__fs_stat at /workspace/srcdir/libuv/src/unix/fs.c:1531
uv__fs_work at /workspace/srcdir/libuv/src/unix/fs.c:1678
uv_fs_stat at /workspace/srcdir/libuv/src/unix/fs.c:2073
jl_stat at /buildworker/worker/package_linux64/build/src/sys.c:128
stat at ./stat.jl:67
isfile at ./stat.jl:311 [inlined]
launch at /WORK/hust_jmcai_1/.julia/packages/ClusterManagers/PVaRG/src/slurm.jl:64
#39 at ./task.jl:411
unknown function (ip: 0x2b934cff101c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:833
unknown function (ip: (nil))
Allocations: 142034229 (Pool: 142032131; Big: 2098); GC: 774
connecting to worker 1 out of 312

Lightup1 (Author) commented
I checked with the manager of our HPC: a common user can only use 10 nodes at the same time. I will close the issue.
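For anyone hitting the same wall, a small guard along these lines keeps the request within the site limit (a sketch; MAX_NODES = 10 reflects this cluster's policy and would differ elsewhere):

using Distributed
using ClusterManagers

const MAX_NODES = 10                 # per-user node limit reported by the HPC admins
tasks_per_node = 24
nnodes = min(parse(Int, get(ENV, "SLURM_NNODES", "1")), MAX_NODES)
addprocs(SlurmManager(nnodes * tasks_per_node), N=nnodes, t="00:15:00")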

kescobo removed the bug label on Apr 13, 2022