workers launched with htcondor cluster manager cannot connect back with master? #107

Closed
rgavazzi opened this issue Nov 19, 2018 · 3 comments

Comments

@rgavazzi

rgavazzi commented Nov 19, 2018

I get the following error on my local cluster with the HTCondor scheduler (Julia version 1.1.0-DEV):

julia> addprocs_htc(4)
Error launching condor
MethodError(iterate, (Process(`condor_submit /raid/gavazzi/.julia-htc/julia-1195449.sub`, ProcessExited(0)),), 0x00000000000061f6)
0-element Array{Int64,1}

The created condor script file seems OK:

executable = /bin/bash
arguments = ./julia-1195449.sh
universe = vanilla
should_transfer_files = yes
transfer_input_files = /home/dir/.julia-htc/julia-1195449.sh
Notification = Error
output = /home/dir/.julia-htc/julia-1195449-1.o
error= /home/dir/.julia-htc/julia-1195449-1.e
queue
output = /home/dir/.julia-htc/julia-1195449-2.o
error= /home/dir/.julia-htc/julia-1195449-2.e
queue
output = /home/dir/.julia-htc/julia-1195449-3.o
error= /home/dir/.julia-htc/julia-1195449-3.e
queue
output = /home/dir/.julia-htc/julia-1195449-4.o
error= /home/dir/.julia-htc/julia-1195449-4.e
queue

The temporary shell script file /home/dir/.julia-htc/julia-1195449.sh seems OK:

#!/bin/sh
cd /tmp
/usr/bin/julia --worker=o7tjjc9VsZGKA8qn | /usr/bin/telnet  machinenode.from_which_I_ran.julia 8848

All output *.o files look like:
Trying 192.168.1.3...

All output *.e files look like:
telnet: connect to address 192.168.1.3: Connection refused

(machinenode.from_which_I_ran.julia has the local IP address 192.168.1.3.)

Other issue: the method "addprocs_htc(np::Integer) = addprocs(HTCManager(np))" does not seem to allow specifying a different working directory. In many cases, HTCondor places julia-1195449.sh and the associated files into a temporary scratch working directory, and one may want the worker to stay there for its whole lifetime instead of changing directory. Couldn't we allow that with a

(dir!=nothing) && println(scriptf, "cd $(Base.shell_escape(dir))")

and
addprocs_htc(np::Integer ; dir=nothing ) = addprocs(HTCManager(np) , dir=dir)

change in condor.jl?
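
The proposal above can be sketched as follows. This is a minimal illustration, not the actual condor.jl code: `write_worker_script` and its arguments are hypothetical names, and the script layout simply mirrors the generated julia-1195449.sh shown earlier. The `cd` line is written only when a directory is given, so by default the worker stays wherever HTCondor starts it.

```julia
# Hypothetical helper, for illustration only (not part of ClusterManagers.jl).
# Writes a worker launch script to `io`; `dir` is optional.
function write_worker_script(io::IO, exename::AbstractString, cookie::AbstractString,
                             host::AbstractString, port::Integer; dir=nothing)
    println(io, "#!/bin/sh")
    # Emit a `cd` only when the caller asked for a specific directory;
    # otherwise the job stays in the directory HTCondor started it in.
    dir !== nothing && println(io, "cd ", Base.shell_escape(dir))
    println(io, Base.shell_escape(exename), " --worker=", cookie,
            " | /usr/bin/telnet ", host, " ", port)
end

# Values copied from the script in this issue.
args = ("/usr/bin/julia", "o7tjjc9VsZGKA8qn", "machinenode.from_which_I_ran.julia", 8848)

without_dir = sprint(io -> write_worker_script(io, args...))
with_dir    = sprint(io -> write_worker_script(io, args...; dir="/scratch/job"))

print(without_dir)
print(with_dir)
```

With `dir=nothing` the script has no `cd` line at all, which is the behaviour requested here; passing `dir` restores an explicit working directory.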

@vchuravy
Member

Condor might need a similar fix to JuliaParallel/MPI.jl#222

@juliohm
Collaborator

juliohm commented Oct 6, 2020

Too old to reproduce. Please retry with the current stable release and reopen the issue if needed.

@juliohm juliohm closed this as completed Oct 6, 2020
@rgavazzi
Author

As far as I can tell, the problem is still present! I keep failing to launch workers with HTCondor, and the symptom is unchanged.
telnet keeps complaining:

telnet: connect to address 192.168.1.3: Connection refused

If I run "nc -l 8200" directly on a machine mmm in the cluster and then "telnet mmm 8200", the telnet connection succeeds!
It seems to me that the equivalent of the nc -l command is the listen(portnum) call at line 45 of the condor.jl script...
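
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: in Julia's Sockets standard library, `listen(port)` and `listenany(port)` called without an address bind to localhost only, whereas `nc -l` typically listens on all interfaces. A worker on another node connecting to 192.168.1.3 would then get exactly this "Connection refused". The sketch below only shows the difference in bind address; whether condor.jl is affected depends on the installed version, which was not checked here. Port 8848 is used as a hint because it appears in the generated script.

```julia
using Sockets

# Without a host argument, the server is bound to localhost (127.0.0.1),
# so only processes on the same machine can connect.
port_lo, srv_lo = listenany(8848)
addr_lo, _ = getsockname(srv_lo)

# Passing IPv4(0) binds to 0.0.0.0, i.e. all interfaces, which is what
# remote workers need in order to connect back.
port_all, srv_all = listenany(IPv4(0), 8848)
addr_all, _ = getsockname(srv_all)

println("default bind:       ", addr_lo, ":", port_lo)
println("all-interface bind: ", addr_all, ":", port_all)

close(srv_lo)
close(srv_all)
```

On the submitting machine, `ss -ltn` (or `netstat -ltn`) while `addprocs_htc` is waiting would show whether the port is bound to 127.0.0.1 or to 0.0.0.0.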

Anyhow, I'd be interested to hear from anyone using ClusterManagers with an HTCondor scheduler, whether or not they are facing the same issue!
