PBS: problem with truncated job identifier #4051
Comments
Since we are giving |
Is it possible for the numerical part to be non-unique? |
I've tried reading about peer scheduling in the PBS reference guide (https://www.altair.com/pbs-works-documentation/). As far as I can see the numerical part is always unique. PBS appears to specifically support the shortened form of the server name in the job id (although I can't find it mentioned in the documentation). In my case above, all the following commands work and return the same result:
As far as I can tell, the qstat output always uses the shortened server name. One option to fix this is to always shorten the job id reported by the qsub command. This has the advantage that we then only display the shortened id in the GUI. This can be done with a one-line change in
to
(needs checking!) |
Support for JSON output was added to qstat in 2017: https://openpbs.atlassian.net/browse/PP-484 |
Note that JSON output only works with full output, which returns a lot more fields and may put a heavier load on the PBS server, so its use may be discouraged. |
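For illustration, here is a minimal sketch of pulling job ids out of the JSON form of the full qstat output (typically requested with `qstat -f -F json`). The sample text is hand-written, not captured from a real server, and the assumption that jobs appear as keys of a top-level "Jobs" object should be checked against the PBS version in use. The function name and host name are hypothetical.

```python
import json

# Hand-written sample, NOT real server output. Assumption: jobs are keyed
# by job id under a top-level "Jobs" object.
SAMPLE_QSTAT_JSON = """
{
  "Jobs": {
    "1234567.master": {"Job_Name": "job1", "job_state": "R"},
    "1234568.master": {"Job_Name": "job2", "job_state": "Q"}
  }
}
"""


def job_ids_from_qstat_json(text):
    """Return the sorted job ids found in JSON-formatted qstat output.

    Hypothetical helper: an output with no "Jobs" key (for example an
    empty queue) yields an empty list.
    """
    data = json.loads(text)
    return sorted(data.get("Jobs", {}))


if __name__ == "__main__":
    print(job_ids_from_qstat_json(SAMPLE_QSTAT_JSON))
```

Parsing structured output avoids guessing column widths, but as noted above it requires the full listing, so the extra server load is the trade-off.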
I've hit a different, but related, problem with job identifiers and qstat on another system.
in pbs.py fixes the problem. |
See https://stackoverflow.com/a/22901816 |
Hi, this thread is quite old, but I encountered this problem with cylc 8.1.2. I apologize for the incomplete bug report, but the symptoms were simple: cylc was marking as failed any task that was polled with |
I suggest we change the code to just match the numerical part of the job id. As far as we can tell it should work in all cases and will avoid the need for another configuration option. |
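A rough sketch of what matching on the numerical part could look like, assuming a job id has the form "<digits>.<server name>". The function names and host names below are hypothetical and are not taken from the cylc code base; array-job ids or other formats would need extra handling.

```python
def numeric_part(job_id):
    """Return the digits before the first dot of a job id, or None.

    Assumption: ids look like "<digits>.<server name>".
    """
    head = job_id.strip().split(".", 1)[0]
    return head if head.isdigit() else None


def is_job_in_poll_output(submitted_id, polled_ids):
    """True if a polled id shares its numerical part with submitted_id."""
    wanted = numeric_part(submitted_id)
    if wanted is None:
        return False
    return wanted in {numeric_part(polled) for polled in polled_ids}


if __name__ == "__main__":
    # qsub reports the long form, qstat the shortened one (hypothetical names).
    print(is_job_in_poll_output("1234567.master.example.com", ["1234567.master"]))
```

With this comparison the length of the server name no longer matters, which is the property the suggestion relies on.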
I think this diff should be sufficient:

diff --git a/cylc/flow/job_runner_handlers/pbs.py b/cylc/flow/job_runner_handlers/pbs.py
index c28ce3899..d050d0636 100644
--- a/cylc/flow/job_runner_handlers/pbs.py
+++ b/cylc/flow/job_runner_handlers/pbs.py
@@ -83,6 +83,7 @@ class PBSHandler:
     POLL_CMD = "qstat"
     POLL_CANT_CONNECT_ERR = "Connection refused"
     REC_ID_FROM_SUBMIT_OUT = re.compile(r"""\A\s*(?P<id>\S+)\s*\Z""")
+    REC_ID_FROM_POLL_OUT = re.compile(r'(\d{4,})\.\w+')
     SUBMIT_CMD_TMPL = "qsub '%(job)s'"

     def format_directives(self, job_conf):
@@ -123,5 +124,9 @@ class PBSHandler:
             lines.append(self.DIRECTIVE_PREFIX + key)
         return lines

+    @classmethod
+    def filter_poll_many_output(cls, out):
+        return cls.REC_ID_FROM_POLL_OUT.findall(out)
+

 JOB_RUNNER_HANDLER = PBSHandler()

The old code had avoided having to specify the job ID pattern for polling. To strip the text part we will have to use regexes, so we will need to know the full range of possible job ID formats to proceed. |
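To get a feel for which formats the proposed pattern covers, it can be run on its own against some sample text. This is only a sketch: the regex is copied from the diff, but the qstat table, the host name and the array-job id form ("1234567[].master") are assumptions made up for the test, not real output.

```python
import re

# Pattern copied from the proposed diff: at least four digits, a dot,
# then word characters. findall() returns only the captured digits.
REC_ID_FROM_POLL_OUT = re.compile(r'(\d{4,})\.\w+')

# Hand-written, hypothetical qstat-style listing.
SAMPLE_QSTAT_OUT = """\
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1234567.master    job1             user              00:00:01 R workq
1234568.master    job2             user              00:00:00 Q workq
"""


def poll_ids(out):
    """Return the numerical parts of all job ids matched in `out`."""
    return REC_ID_FROM_POLL_OUT.findall(out)


if __name__ == "__main__":
    print(poll_ids(SAMPLE_QSTAT_OUT))
    # Formats the pattern does not match (assumed id forms):
    print(poll_ids("123.master"))        # fewer than four digits
    print(poll_ids("1234567[].master"))  # brackets before the dot
```

Note that the pattern returns only the digits, so whatever it is compared against needs its server name stripped too. It would also pick up any other "digits.word" token in the output, such as a job name like "run2023.test".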
I've hit a problem trying to get Cylc 7.8.7 working on Isambard (https://gw4-isambard.github.io/docs/).
PBS (version 19.2) is reporting very long job identifiers, which are then shortened in the output from qstat. Polling fails to find the job in the queue, so any jobs which haven't already completed enter the submit-failed state when polled.
I can't find this PBS behaviour documented anywhere.