Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PBS: problem with truncated job identifier #4051

Closed
dpmatthews opened this issue Jan 27, 2021 · 10 comments · Fixed by #5616
Closed

PBS: problem with truncated job identifier #4051

dpmatthews opened this issue Jan 27, 2021 · 10 comments · Fixed by #5616
Labels
bug Something is wrong :(
Milestone

Comments

@dpmatthews
Copy link
Contributor

I've hit a problem trying to get Cylc 7.8.7 working on Isambard (https://gw4-isambard.github.io/docs/).
PBS (version 19.2) is reporting very long job identifiers which are then shortening in the output from qstat:

$ qsub job 
38438.master.gw4.metoffice.gov.uk
$ qstat 38438.master.gw4.metoffice.gov.uk
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
38438.master      test-pbs         xxxxxxx                  0 Q romeq

Polling fails to find the job in the queue so any jobs which haven't already completed enter the submit-failed state when polled.

I can't find this PBS behaviour documented anywhere.

@dpmatthews dpmatthews added the bug Something is wrong :( label Jan 27, 2021
@dpmatthews dpmatthews added this to the cylc-7.8.x milestone Jan 27, 2021
@dpmatthews
Copy link
Contributor Author

Since we are giving qstat a list of job id's I'd have thought we only really need to worry about matching the number and can ignore anything after the first "."?

@hjoliver
Copy link
Member

Is it possible for the numerical part to be non-unique?

@dpmatthews
Copy link
Contributor Author

I've tried reading about peer scheduling in the PBS reference guide (https://www.altair.com/pbs-works-documentation/). As far as I can see the numerical part is always unique.

PBS appears to specifically support the shortened form of the server name in the job id (although I can't find it mentioned in the documentation). In my case above, all the following commands work and return the same result:

qstat 38438.master.gw4.metoffice.gov.uk
qstat 38438.master
qstat 38438

As far as I can tell the qstat output always uses the shortened server name.

One option to fix this is to always shorten the job id reported by the qsub command. This has the advantage that we then only display the shortened id in the GUI. This can be done by changing 1 line change in pbs.py from

    REC_ID_FROM_SUBMIT_OUT = re.compile(r"""\A\s*(?P<id>\S+)\s*\Z""")

to

    REC_ID_FROM_SUBMIT_OUT = re.compile(r"""\A\s*(?P<id>\d+(?:\.[^\.\s]+)?)(?:\.[^\.\s]+)*\s*\Z""")

(needs checking!)
This fix appears to be working OK on Isambard.

@dpmatthews
Copy link
Contributor Author

Support for JSON output was added to qstat in 2017: https://openpbs.atlassian.net/browse/PP-484
I'm not sure what version this corresponds to but I've found it's available in PBS 18.
Switching to using JSON output would probably be a much better solution.

@dpmatthews
Copy link
Contributor Author

Note that JSON output only works with full output which returns a lot more fields and may put a heavier load on the PBS server so may be discouraged.

@dpmatthews
Copy link
Contributor Author

I've hit a different, but related, problem with job identifiers and qstat on another system.
In this case it's using PBS 19.4.
Job id's are of the form XXXX.pbs-service-XXX.
This is too long to fit into the normal job id field returned by qstat so it gets truncated to XXXX.pbs-servi* which breaks the polling.
Fortunately you can get qstat to use a wide format using qstat -w so setting

POLL_CMD = "qstat -w"

in pbs.py fixes the problem.
However, it appears you can't use the -w option on it's own on the other PBS systems I have access to (versions 19.2 & 18.2).
The best solution I can see is to make POLL_CMD configurable (or use the JSON output).

@hjoliver hjoliver modified the milestones: cylc-7.8.x, cylc-8.x Aug 4, 2021
@dpmatthews
Copy link
Contributor Author

See https://stackoverflow.com/a/22901816
I think this implies we should be safe to just use the numerical part of the jobid.

@ejhyer
Copy link

ejhyer commented Mar 15, 2023

Hi this thread is quite old, but I encountered this problem with cylc 8.1.2. I apologize for the incomplete bug report, but the symptoms were simple: cylc was marking any task as failed that was polled with qstat. When I set POLL_CMD = "qstat -w" as described above by @dpmatthews the problem went away. I would like a solution to this problem that does not involve hand-hacking the cylc code. I could implement a "pbswide" custom job runner handler, but perhaps there's an easier way?

@dpmatthews
Copy link
Contributor Author

I suggest we change the code to just match the numerical part of the job id. As far as we can tell it should work in all cases and will avoid the need for another configuration option.

@oliver-sanders oliver-sanders modified the milestones: cylc-8.x, cylc-8.2.0 Mar 16, 2023
@oliver-sanders
Copy link
Member

I think this diff should be sufficient:

diff --git a/cylc/flow/job_runner_handlers/pbs.py b/cylc/flow/job_runner_handlers/pbs.py
index c28ce3899..d050d0636 100644
--- a/cylc/flow/job_runner_handlers/pbs.py
+++ b/cylc/flow/job_runner_handlers/pbs.py
@@ -83,6 +83,7 @@ class PBSHandler:
     POLL_CMD = "qstat"
     POLL_CANT_CONNECT_ERR = "Connection refused"
     REC_ID_FROM_SUBMIT_OUT = re.compile(r"""\A\s*(?P<id>\S+)\s*\Z""")
+    REC_ID_FROM_POLL_OUT = re.compile(r'(\d{4,})\.\w+')
     SUBMIT_CMD_TMPL = "qsub '%(job)s'"
 
     def format_directives(self, job_conf):
@@ -123,5 +124,9 @@ class PBSHandler:
                 lines.append(self.DIRECTIVE_PREFIX + key)
         return lines
 
+    @classmethod
+    def filter_poll_many_output(cls, out):
+        return cls.REC_ID_FROM_POLL_OUT.findall(out)
+
 
 JOB_RUNNER_HANDLER = PBSHandler()

The old code had avoided having to specify the job ID pattern for polling, to strip the text part we will have to use regexes so will need to know the full range of possible job ID formats to proceed.

oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jul 4, 2023
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants