-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
run_sys_tests: add more clarity about whether it successfully launched the test jobs #1508
Comments
@johnpaulalex was also misled by the output neglecting to mention that the jobs were successfully launched (point (1) above). I'd like to hear thoughts on whether we should change this to be more verbose (explicitly stating that the test jobs have been successfully launched) or less verbose (because I think that part of the issue here is that there is a lot of output, but then nothing that states success). I personally lean towards more verbosity here, but am interested in other people's thoughts. |
I like having something that explicitly says it was successful in launching the job. So that's the critical thing for me. I do think there should be a noticeable difference between running with "-v" and NOT "-v", with the later being fairly free, but only totally free if "--silent" is used. |
Thanks for the comment @ekluzek . Do you feel like the current level of output is appropriate or too much for non-verbose? I personally find all of the current output useful, so if we hid it behind -v I would probably always run with -v, but I'm interested in how others feel. |
Yeah I was totally fooled by its silence-on-success :). Looking at the code I don't see any -v or --silent flag. How about adding the explicit 'success' message now, and if people really want a silent option that could be added later (including a decision about whether silent or chatty should be the default)? |
Thanks @johnpaulalex - I support this plan. I will note that this script does support |
Thanks Bill. I took a look and it looks like there are two code paths (for running test suites or 'create_test') and both call to a launcher.run_command() method, any idea whether that dies on failure? If it does I could just add a print at the bottom of this method. I can poke around more in the launchers if you don't happen to magically know the answer :). There’s this Launcher factory, here’s one launcher implementation, and with the subprocesses and the wait calls I couldn't quickly tell what might happen error-wise. I could do a deeper dive and maybe document the error-handling protocol for each launcher implementation (and then push that info up to the launcher factory)...on other hand, I just want to add one print statement somewhere. :) and shrug. |
Good question. In our recent discussions, I had been thinking of points (1) and (2) in the original comment as independent, but your comment points to the fact that a robust solution to (1) depends on solving (2) - and once (2) is solved then I agree that (1) is probably as simple as adding a single print statement somewhere. Currently, if there is a problem launching the job, this problem is not detected. Here is a diff that you can put in place to illustrate the problem: diff --git a/python/ctsm/joblauncher/job_launcher_qsub.py b/python/ctsm/joblauncher/job_launcher_qsub.py
index 779911400..8532d4c83 100644
--- a/python/ctsm/joblauncher/job_launcher_qsub.py
+++ b/python/ctsm/joblauncher/job_launcher_qsub.py
@@ -49,6 +49,7 @@ class JobLauncherQsub(JobLauncherBase):
"""Returns the qsub command"""
qsub_cmd = [
"qsub",
+ "--bad-arg",
"-q",
self._queue,
"-l", With this in place on cheyenne, running My original comment suggested checking the err string from the qsub command, but I'm not sure if that's a totally robust solution or if we should check the error return value from the subprocess.Popen command and/or the qsub_process.communicate. Given the increasing usage of run_sys_tests, I do feel like this is worth digging into and fixing, but if this is growing to be larger than you want to take on right now and you'd prefer to work on other things, I totally understand! |
There are two related enhancements that we should make to run_sys_tests:
(1) If the jobs were launched successfully, we should print a clear message to that effect. New users of this script find it confusing that there is a bunch of output then no clear message about whether things got launched successfully or not: it is unclear whether the normal output is expected or is signaling some issue. (Alternatively, we could make run_sys_tests less verbose by default, changing the prints in run_sys_tests to logger.info calls, and requiring you to run with
-v
if you want all of the current output. My feeling is that the current output is something that I always want to see, and assumed the same would be true for others, so I have treated this output as part of the normal output of the script and put it in print statements so it always appears. But I am open to hiding the output by default and always specifying-v
when I run it myself if people prefer that. Then it would be following the Unix standard operation of "silence indicates success", and there would be less of a need for a line saying "success" at the end.)(2) @glemieux just ran into a situation where the create_test job failed to launch because the cheyenne account was overspent. run_sys_tests did not make this clear unless you ran with the
-v
option. run_sys_tests should probably abort with an error message in this case. I think this would involve changing this:CTSM/python/ctsm/joblauncher/job_launcher_qsub.py
Lines 31 to 32 in 3dcbc74
to abort with an error message if an error occurs from the qsub job (and similar for the other job launchers).
The text was updated successfully, but these errors were encountered: