workers launched with htcondor cluster manager cannot connect back with master? #107

rgavazzi · 2018-11-19T16:22:13Z

I get the following error on my local cluster with the HTCondor scheduler (Julia version 1.1.0-DEV).

julia> addprocs_htc(4)
Error launching condor
MethodError(iterate, (Process(`condor_submit /raid/gavazzi/.julia-htc/julia-1195449.sub`, ProcessExited(0)),), 0x00000000000061f6)
0-element Array{Int64,1}

The created condor script file seems OK:

executable = /bin/bash
arguments = ./julia-1195449.sh
universe = vanilla
should_transfer_files = yes
transfer_input_files = /home/dir/.julia-htc/julia-1195449.sh
Notification = Error
output = /home/dir/.julia-htc/julia-1195449-1.o
error= /home/dir/.julia-htc/julia-1195449-1.e
queue
output = /home/dir/.julia-htc/julia-1195449-2.o
error= /home/dir/.julia-htc/julia-1195449-2.e
queue
output = /home/dir/.julia-htc/julia-1195449-3.o
error= /home/dir/.julia-htc/julia-1195449-3.e
queue
output = /home/dir/.julia-htc/julia-1195449-4.o
error= /home/dir/.julia-htc/julia-1195449-4.e
queue

The temporary shell script file /home/dir/.julia-htc/julia-1195449.sh seems OK:

#!/bin/sh
cd /tmp
/usr/bin/julia --worker=o7tjjc9VsZGKA8qn | /usr/bin/telnet  machinenode.from_which_I_ran.julia 8848

All output *.o files look like:
Trying 192.168.1.3...

All output *.e files look like:
telnet: connect to address 192.168.1.3: Connection refused

(machinenode.from_which_I_ran.julia has the local IP address 192.168.1.3.)

Another issue: the method "addprocs_htc(np::Integer) = addprocs(HTCManager(np))" does not seem to allow specifying a different working directory. In many cases, HTCondor places julia-1195449.sh and its associated files in a temporary scratch working directory, where one may want to stay for the lifetime of the worker. Couldn't we address that with a

(dir!=nothing) && println(scriptf, "cd $(Base.shell_escape(dir))")

and
addprocs_htc(np::Integer ; dir=nothing ) = addprocs(HTCManager(np) , dir=dir)

change in condor.jl?
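
To make the proposal concrete, here is a minimal sketch of what the generated worker script could look like once a working directory can be passed in. The function name `emit_worker_script`, the path `/scratch/myjob`, and the `COOKIE`, `MASTER_HOST` and `PORT` placeholders are all hypothetical. The thread does not settle whether the new `cd` line would replace or follow the existing `cd /tmp`; this sketch assumes it replaces it.

```shell
# Hypothetical generator mirroring the proposed `dir` keyword.
# $1: working directory for the worker (empty keeps the current default, /tmp).
emit_worker_script() {
    workdir="${1:-/tmp}"
    echo '#!/bin/sh'
    # Naive quoting: breaks if the path itself contains a single quote.
    printf "cd '%s'\n" "$workdir"
    # COOKIE, MASTER_HOST and PORT are placeholders for the real values.
    echo '/usr/bin/julia --worker=COOKIE | /usr/bin/telnet MASTER_HOST PORT'
}

# Print the script a worker would run with a custom working directory.
emit_worker_script /scratch/myjob
```

Called without an argument, the function falls back to `cd '/tmp'`, which matches the script shown earlier in this report.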


vchuravy · 2018-11-19T16:25:33Z

Condor might need a similar fix to JuliaParallel/MPI.jl#222

juliohm · 2020-10-06T19:47:55Z

Too old to reproduce. Please retry with the current stable release and reopen the issue if needed.

rgavazzi · 2020-11-18T02:36:26Z

As far as I can tell, the problem is still present! I still fail to launch workers with HTCondor, and the symptom is the same.
telnet keeps complaining:

telnet: connect to address 192.168.1.3: Connection refused

If I run "nc -l 8200" directly on a machine mmm in the cluster and then run "telnet mmm 8200", the telnet connection succeeds!
It seems to me that the equivalent of the nc -l command is the listen(portnum) call at line 45 of the condor.jl script...

Anyhow, I'd be interested to hear from anyone using ClusterManagers with an HTCondor scheduler, whether or not they face the same issue!
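
The manual nc/telnet test can be repeated from inside a Condor job with a small reachability check. This is only a sketch: `check_port` is a hypothetical helper, not part of ClusterManagers, and it assumes `bash` (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available on the worker node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to open a TCP connection to host:port
# and report whether it succeeded.
check_port() {
    # $1 = host, $2 = port; the attempt is abandoned after 5 seconds.
    if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
        return 1
    fi
}

# Example: the host and port that appear in the generated worker script.
# check_port machinenode.from_which_I_ran.julia 8848
```

If the check prints "open" on the submit host but "closed" on a worker node while the master is waiting for workers, that would point to a firewall or to the master listening on a different interface, rather than to the generated script itself.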

juliohm closed this as completed Oct 6, 2020

rgavazzi mentioned this issue Nov 18, 2020

HTCondor: failure when listening to a telnet commu #150

Open

aminnj mentioned this issue Dec 17, 2020

Fix up and add flexibility to HTCManager #157

Merged
