drmaa-python wait not playing nice with Condor #21

Closed
leipzig opened this issue Jun 10, 2015 · 6 comments · Fixed by #47

leipzig commented Jun 10, 2015

condor_version 
$CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
$CondorPlatform: x86_64_rhap_6.3 $

using the example3 (example1.py, example1.1.py, example2.py, and example2.1.py all work fine)

./example3.py
Creating job template
DEBUG: Join_files is set
DEBUG: drmaa_join_files: y
DEBUG: drmaa_v_argv: ?:i
DEBUG: drmaa_remote_command: /home/leipzigj/drmaa-python/examples/sleeper.sh
Your job has been submitted with id variome.chop.edu.108921.0
DEBUG: -> wait_job(variome.chop.edu.108921.0)
DEBUG: Sleeping for a momentDEBUG: Sleeping for a momentDEBUG: Sleeping for a momentDEBUG: Sleeping for a momentDEBUG: Resulting stat value is 200
DEBUG: RUsage data: submission_time=1433953498, start_time=1433953499, end_time=1433953502
DEBUG: Unreferencing job variome.chop.edu.108921.0
DEBUG: Not removing job variome.chop.edu.108921.0 yet (ref_count: 1 -> 0)
DEBUG: Marking job variome.chop.edu.108921.0 as DISPOSED
DEBUG: Removing job info for variome.chop.edu.108921.0 (0x26e58c0, 0x26e58c0, (nil), 1)
DEBUG: <- wait_job(variome.chop.edu.108921.0)
Traceback (most recent call last):
  File "./example3.py", line 29, in <module>
    main()
  File "./example3.py", line 21, in main
    retval = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
  File "/nas/is1/bin/variome-env/lib/python3.3/site-packages/drmaa/session.py", line 480, in wait
    c(drmaa_wcoredump, byref(coredumped), stat)
  File "/nas/is1/bin/variome-env/lib/python3.3/site-packages/drmaa/helpers.py", line 299, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/nas/is1/bin/variome-env/lib/python3.3/site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: Invalid argument
@pidupuis

Have you found a solution?

leipzig commented Feb 24, 2017

no sorry haven't thought about this for years. switched to snakemake.

pidupuis commented Feb 24, 2017

I'm using snakemake + HTCondor too but still the same error :/
How did you figure it out?

leipzig commented Feb 24, 2017

well we were just running condor on top of SGE. i think snakemake DRMAA will work fine with SGE. The problem must be condor.

stverhae commented Mar 8, 2017

Same issue here.
The problem is that the plain condor_wait command also returns an error (1), saying "no jobs found", when you point it at a log file. Possibly this is more deeply rooted than python libdrmaa ...

stverhae commented Mar 8, 2017

Alright, found the issue: when doing a wait, drmaa-python asks for info on both coredump and termsignal, even when the job exited normally without a signal.
The drmaa.h file describes this: "Returns the name of the signal that terminated the process if the drmaa_wifsignaled() returned non-zero."
Fixed by actually checking whether there was a signal/abort before calling these functions.

On the SGE and Slurm DRMAA libs these calls would not complain, but the DRMAA spec says we need to check, so this is the correct way to go.

I'll be making a PR shortly ...
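
For readers following along, here is a minimal sketch of the guarded decoding described above. It is not the code from the eventual PR. `StrictFakeLib` is a hypothetical stand-in that rejects out-of-order queries the way the Condor library appears to in the traceback, and `decode_wait_status` and the `WaitStatus` field names are illustrative assumptions, not the drmaa-python API.

```python
from collections import namedtuple

# Illustrative result type; field names are assumptions for this sketch.
WaitStatus = namedtuple(
    "WaitStatus",
    "has_exited exit_status has_signal terminated_signal has_core_dump was_aborted",
)


class StrictFakeLib:
    """Hypothetical stand-in for a strict DRMAA library (not the real bindings).

    It raises when signal details are requested for a job that was not
    terminated by a signal, mimicking the 'Invalid argument' error above.
    """

    def __init__(self, exited, exit_status=0, signal_name=None, core=False, aborted=False):
        self._exited = exited
        self._exit_status = exit_status
        self._signal = signal_name
        self._core = core
        self._aborted = aborted

    def wifaborted(self, stat):
        return self._aborted

    def wifexited(self, stat):
        return self._exited

    def wifsignaled(self, stat):
        return self._signal is not None

    def wexitstatus(self, stat):
        if not self._exited:
            raise ValueError("code 4: Invalid argument")
        return self._exit_status

    def wtermsig(self, stat):
        if self._signal is None:
            raise ValueError("code 4: Invalid argument")
        return self._signal

    def wcoredump(self, stat):
        if self._signal is None:
            raise ValueError("code 4: Invalid argument")
        return self._core


def decode_wait_status(lib, stat):
    """Query the predicates first, then ask only for details that apply."""
    aborted = lib.wifaborted(stat)
    exited = lib.wifexited(stat)
    signaled = lib.wifsignaled(stat)

    # Exit status is only meaningful if the job exited normally.
    exit_status = lib.wexitstatus(stat) if exited else 0
    # Signal name and core dump flag are only meaningful after a signal.
    term_signal = lib.wtermsig(stat) if signaled else ""
    core_dumped = lib.wcoredump(stat) if signaled else False

    return WaitStatus(exited, exit_status, signaled, term_signal, core_dumped, aborted)
```

With the guards in place, a normally exited job never triggers the signal queries, so a strict library has nothing to reject; an unguarded call to `wcoredump` on the same job would raise.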
