Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run_sys_tests: add more clarity about whether it successfully launched the test jobs #1508

Open
billsacks opened this issue Oct 1, 2021 · 7 comments
Labels
enhancement new capability or improved behavior of existing capability

Comments

@billsacks
Copy link
Member

There are two related enhancements that we should make to run_sys_tests:

(1) If the jobs were launched successfully, we should print a clear message to that effect. New users of this script find it confusing that there is a bunch of output then no clear message about whether things got launched successfully or not: it is unclear whether the normal output is expected or is signaling some issue. (Alternatively, we could make run_sys_tests less verbose by default, changing the prints in run_sys_tests to logger.info calls, and requiring you to run with -v if you want all of the current output. My feeling is that the current output is something that I always want to see, and assumed the same would be true for others, so I have treated this output as part of the normal output of the script and put it in print statements so it always appears. But I am open to hiding the output by default and always specifying -v when I run it myself if people prefer that. Then it would be following the Unix standard operation of "silence indicates success", and there would be less of a need for a line saying "success" at the end.)

(2) @glemieux just ran into a situation where the create_test job failed to launch because the cheyenne account was overspent. run_sys_tests did not make this clear unless you ran with the -v option. run_sys_tests should probably abort with an error message in this case. I think this would involve changing this:

if err:
logger.info('qsub ERROR:\n%s', err)

to abort with an error message if an error occurs from the qsub job (and similar for the other job launchers).

@billsacks billsacks added the enhancement new capability or improved behavior of existing capability label Oct 1, 2021
@billsacks billsacks added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Apr 19, 2023
@billsacks
Copy link
Member Author

@johnpaulalex was also misled by the output neglecting to mention that the jobs were successfully launched (point (1) above). I'd like to hear thoughts on whether we should change this to be more verbose (explicitly stating that the test jobs have been successfully launched) or less verbose (because I think that part of the issue here is that there is a lot of output, but then nothing that states success). I personally lean towards more verbosity here, but am interested in other people's thoughts.

@billsacks billsacks removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Apr 19, 2023
@ekluzek
Copy link
Collaborator

ekluzek commented Apr 19, 2023

I like having something that explicitly says it was successful in launching the job. So that's the critical thing for me. I do think there should be a noticeable difference between running with "-v" and NOT "-v", with the later being fairly free, but only totally free if "--silent" is used.

@billsacks
Copy link
Member Author

Thanks for the comment @ekluzek . Do you feel like the current level of output is appropriate or too much for non-verbose? I personally find all of the current output useful, so if we hid it behind -v I would probably always run with -v, but I'm interested in how others feel.

@johnpaulalex
Copy link
Contributor

Yeah I was totally fooled by its silence-on-success :).

Looking at the code I don't see any -v or --silent flag. How about adding the explicit 'success' message now, and if people really want a silent option that could be added later (including a decision about whether silent or chatty should be the default)?

@billsacks
Copy link
Member Author

Thanks @johnpaulalex - I support this plan. I will note that this script does support -v/--verbose and --silent via calls to the appropriate routines in https://github.com/escomp/CTSM/blob/master/python/ctsm/ctsm_logging.py#L1 (which is a standard thing we try to do in our CTSM python scripts), but I still support your plan. If you want to fold this into one of the branches you're working on, that would be great - thank you!

@johnpaulalex
Copy link
Contributor

Thanks Bill. I took a look and it looks like there are two code paths (for running test suites or 'create_test') and both call to a launcher.run_command() method, any idea whether that dies on failure? If it does I could just add a print at the bottom of this method.

I can poke around more in the launchers if you don't happen to magically know the answer :). There’s this Launcher factory, here’s one launcher implementation, and with the subprocesses and the wait calls I couldn't quickly tell what might happen error-wise. I could do a deeper dive and maybe document the error-handling protocol for each launcher implementation (and then push that info up to the launcher factory)...on other hand, I just want to add one print statement somewhere. :) and shrug.

@billsacks
Copy link
Member Author

Good question. In our recent discussions, I had been thinking of points (1) and (2) in the original comment as independent, but your comment points to the fact that a robust solution to (1) depends on solving (2) - and once (2) is solved then I agree that (1) is probably as simple as adding a single print statement somewhere.

Currently, if there is a problem launching the job, this problem is not detected. Here is a diff that you can put in place to illustrate the problem:

diff --git a/python/ctsm/joblauncher/job_launcher_qsub.py b/python/ctsm/joblauncher/job_launcher_qsub.py
index 779911400..8532d4c83 100644
--- a/python/ctsm/joblauncher/job_launcher_qsub.py
+++ b/python/ctsm/joblauncher/job_launcher_qsub.py
@@ -49,6 +49,7 @@ class JobLauncherQsub(JobLauncherBase):
         """Returns the qsub command"""
         qsub_cmd = [
             "qsub",
+            "--bad-arg",
             "-q",
             self._queue,
             "-l",

With this in place on cheyenne, running ./run_sys_tests --skip-compare --skip-generate -s clm_short looks successful even though it is not; if you add the -v argument you can see the problem, but currently that doesn't propagate up as an error.

My original comment suggested checking the err string from the qsub command, but I'm not sure if that's a totally robust solution or if we should check the error return value from the subprocess.Popen command and/or the qsub_process.communicate.

Given the increasing usage of run_sys_tests, I do feel like this is worth digging into and fixing, but if this is growing to be larger than you want to take on right now and you'd prefer to work on other things, I totally understand!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement new capability or improved behavior of existing capability
Projects
None yet
Development

No branches or pull requests

3 participants