run_sys_tests: add more clarity about whether it successfully launched the test jobs #1508

billsacks · 2021-10-01T21:36:11Z

There are two related enhancements that we should make to run_sys_tests:

(1) If the jobs were launched successfully, we should print a clear message to that effect. New users of this script find it confusing that there is a bunch of output then no clear message about whether things got launched successfully or not: it is unclear whether the normal output is expected or is signaling some issue. (Alternatively, we could make run_sys_tests less verbose by default, changing the prints in run_sys_tests to logger.info calls, and requiring you to run with -v if you want all of the current output. My feeling is that the current output is something that I always want to see, and assumed the same would be true for others, so I have treated this output as part of the normal output of the script and put it in print statements so it always appears. But I am open to hiding the output by default and always specifying -v when I run it myself if people prefer that. Then it would be following the Unix standard operation of "silence indicates success", and there would be less of a need for a line saying "success" at the end.)

(2) @glemieux just ran into a situation where the create_test job failed to launch because the cheyenne account was overspent. run_sys_tests did not make this clear unless you ran with the -v option. run_sys_tests should probably abort with an error message in this case. I think this would involve changing this:

CTSM/python/ctsm/joblauncher/job_launcher_qsub.py

Lines 31 to 32 in 3dcbc74

    
           if err: 
        
               logger.info('qsub ERROR:\n%s', err)

to abort with an error message if an error occurs from the qsub job (and similar for the other job launchers).

The text was updated successfully, but these errors were encountered:

billsacks · 2023-04-19T18:40:52Z

@johnpaulalex was also misled by the output neglecting to mention that the jobs were successfully launched (point (1) above). I'd like to hear thoughts on whether we should change this to be more verbose (explicitly stating that the test jobs have been successfully launched) or less verbose (because I think that part of the issue here is that there is a lot of output, but then nothing that states success). I personally lean towards more verbosity here, but am interested in other people's thoughts.

ekluzek · 2023-04-19T22:38:07Z

I like having something that explicitly says it was successful in launching the job. So that's the critical thing for me. I do think there should be a noticeable difference between running with "-v" and NOT "-v", with the later being fairly free, but only totally free if "--silent" is used.

billsacks · 2023-04-19T23:06:43Z

Thanks for the comment @ekluzek . Do you feel like the current level of output is appropriate or too much for non-verbose? I personally find all of the current output useful, so if we hid it behind -v I would probably always run with -v, but I'm interested in how others feel.

johnpaulalex · 2023-04-26T18:11:49Z

Yeah I was totally fooled by its silence-on-success :).

Looking at the code I don't see any -v or --silent flag. How about adding the explicit 'success' message now, and if people really want a silent option that could be added later (including a decision about whether silent or chatty should be the default)?

billsacks · 2023-04-27T00:18:47Z

Thanks @johnpaulalex - I support this plan. I will note that this script does support -v/--verbose and --silent via calls to the appropriate routines in https://github.com/escomp/CTSM/blob/master/python/ctsm/ctsm_logging.py#L1 (which is a standard thing we try to do in our CTSM python scripts), but I still support your plan. If you want to fold this into one of the branches you're working on, that would be great - thank you!

johnpaulalex · 2023-05-03T18:50:59Z

Thanks Bill. I took a look and it looks like there are two code paths (for running test suites or 'create_test') and both call to a launcher.run_command() method, any idea whether that dies on failure? If it does I could just add a print at the bottom of this method.

I can poke around more in the launchers if you don't happen to magically know the answer :). There’s this Launcher factory, here’s one launcher implementation, and with the subprocesses and the wait calls I couldn't quickly tell what might happen error-wise. I could do a deeper dive and maybe document the error-handling protocol for each launcher implementation (and then push that info up to the launcher factory)...on other hand, I just want to add one print statement somewhere. :) and shrug.

billsacks · 2023-05-03T19:15:07Z

Good question. In our recent discussions, I had been thinking of points (1) and (2) in the original comment as independent, but your comment points to the fact that a robust solution to (1) depends on solving (2) - and once (2) is solved then I agree that (1) is probably as simple as adding a single print statement somewhere.

Currently, if there is a problem launching the job, this problem is not detected. Here is a diff that you can put in place to illustrate the problem:

diff --git a/python/ctsm/joblauncher/job_launcher_qsub.py b/python/ctsm/joblauncher/job_launcher_qsub.py
index 779911400..8532d4c83 100644
--- a/python/ctsm/joblauncher/job_launcher_qsub.py
+++ b/python/ctsm/joblauncher/job_launcher_qsub.py
@@ -49,6 +49,7 @@ class JobLauncherQsub(JobLauncherBase):
         """Returns the qsub command"""
         qsub_cmd = [
             "qsub",
+            "--bad-arg",
             "-q",
             self._queue,
             "-l",

With this in place on cheyenne, running ./run_sys_tests --skip-compare --skip-generate -s clm_short looks successful even though it is not; if you add the -v argument you can see the problem, but currently that doesn't propagate up as an error.

My original comment suggested checking the err string from the qsub command, but I'm not sure if that's a totally robust solution or if we should check the error return value from the subprocess.Popen command and/or the qsub_process.communicate.

Given the increasing usage of run_sys_tests, I do feel like this is worth digging into and fixing, but if this is growing to be larger than you want to take on right now and you'd prefer to work on other things, I totally understand!

billsacks added the enhancement new capability or improved behavior of existing capability label Oct 1, 2021

billsacks added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Apr 19, 2023

billsacks removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Apr 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

run_sys_tests: add more clarity about whether it successfully launched the test jobs #1508

run_sys_tests: add more clarity about whether it successfully launched the test jobs #1508

billsacks commented Oct 1, 2021

billsacks commented Apr 19, 2023

ekluzek commented Apr 19, 2023

billsacks commented Apr 19, 2023

johnpaulalex commented Apr 26, 2023

billsacks commented Apr 27, 2023

johnpaulalex commented May 3, 2023

billsacks commented May 3, 2023

run_sys_tests: add more clarity about whether it successfully launched the test jobs #1508

run_sys_tests: add more clarity about whether it successfully launched the test jobs #1508

Comments

billsacks commented Oct 1, 2021

billsacks commented Apr 19, 2023

ekluzek commented Apr 19, 2023

billsacks commented Apr 19, 2023

johnpaulalex commented Apr 26, 2023

billsacks commented Apr 27, 2023

johnpaulalex commented May 3, 2023

billsacks commented May 3, 2023