newly flaky test hold-release/13-ready-restart.t #2610

hjoliver · 2018-03-25T01:35:43Z

This test overrides the job submit command to sleep 10 for task foo-1 and then does a quick shutdown --now --now to stop the server after foo-1 is submitted and is in the "ready" state (due to the sleep). Then on restart --hold the ready task should reload to held and run after release with the job submit command restored.

As of recently(the change to the process pool maybe?) foo-1 sometimes, but not always, goes to "submit-failed" immediately after the shutdown command (but well before the sleep ends). This breaks the test because the task then restarts as "submit-failed" instead of ready.

The text was updated successfully, but these errors were encountered:

hjoliver · 2018-03-25T01:43:14Z

Here's the suite log for a failed case:

$ cat test.log | grep -v 'foo\.1' | grep -v 'bar\.1'
2018-03-25T11:03:35+13 INFO - Suite starting: server=vagrant.vm:43025 pid=5973
2018-03-25T11:03:35+13 INFO - Cylc version: 7.6.0-85-gf5ad64
2018-03-25T11:03:35+13 INFO - Run mode: live
2018-03-25T11:03:35+13 INFO - Initial point: 1
2018-03-25T11:03:35+13 INFO - Final point: 1
2018-03-25T11:03:35+13 INFO - Cold Start 1
2018-03-25T11:03:37+13 INFO - [foo-1.1] -submit-num=1, owner@host=localhost
2018-03-25T11:03:38+13 INFO - [client-command] stop_now [email protected]:cylc-stop 94c0acf5-93c9-4c16-b5f4-652feecc532b
2018-03-25T11:03:38+13 INFO - Command succeeded: stop_now(terminate=True)
2018-03-25T11:03:38+13 INFO - Processing 1 queued command(s)
        +       stop_now(terminate=True)
2018-03-25T11:03:38+13 INFO - Suite shutting down - REQUEST(NOW-NOW)
2018-03-25T11:03:38+13 ERROR - [jobs-submit cmd] cylc jobs-submit -- /home/vagrant/cylc-run/cylctb-20180324T220333Z/hold-release/13-ready-restart/log/job 1/foo-1/01
        [jobs-submit ret_code] -9
2018-03-25T11:03:38+13 ERROR - [jobs-submit cmd] cylc jobs-submit -- /home/vagrant/cylc-run/cylctb-20180324T220333Z/hold-release/13-ready-restart/log/job 1/foo-1/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2018-03-25T11:03:38+13|1/foo-1/01|1
2018-03-25T11:03:38+13 INFO - [foo-1.1] -(current:ready) submission failed at 2018-03-25T11:03:38+13
2018-03-25T11:03:38+13 ERROR - [foo-1.1] -submission failed
2018-03-25T11:03:44+13 INFO - DONE

Note the job submit failure is recognized only a second after job submission, and just after the shutdown command is actioned.

Does suite shutdown kill the job submit process, and the resulting submit failure may be recognized or not depending on how quickly the shutdown occurs?

Also, odd that there are two submit fail messages with different return codes??

matthewrmshin · 2018-03-25T06:47:30Z

Interesting. May help if we change shutdown option to a single --now?

hjoliver · 2018-03-25T09:28:04Z

Single --now does not stop the test failing (although it leaves no processes running after the test exits, whereas double now does for a while ... as expected, I think).

hjoliver · 2018-03-25T10:15:07Z

💩 this seems to be caused or at least exacerbated by my network config screw-up 😡 of #2611 ... however I won't close as invalid quite yet, because this issue was motivated by repeated failures of the same test on Travis CI for #2503 (and that platform is presumably not suffering from my networking incompetence).

hjoliver · 2018-03-25T19:44:46Z

Confirmed this test does still fail intermittently, on master as well as #2503 (it does have an error, fixed in 2503, that leaves an immortal stalled suite on test failure, but that's not the primary problem).

hjoliver · 2018-03-25T19:53:54Z

With my network fixed, and shutdown with single --now, the test fails for a different reason: the suite waits before shutdown until the fake "sleep 10" job submit command exits, and detects submit-failed then (correctly, but not what the test is trying to do).

So the thing to understand is: why does shutdown --now --now sometimes detect task submit-failed before shutdown and sometimes not. Maybe the test is just poorly conceived (but still this does seem to flag an annoying inconsistency that could affect real suites on shutdown and restart).

matthewrmshin · 2018-03-26T11:18:03Z

This is most likely a #2590 issue. (Sorry!)

Before this change, on stop --now --now, it will call multiprocessing.Pool.terminate, which I believe would attempt to terminate any remaining child processes in the process pool. I am not sure if it is able to kill a subprocess that is being run by the pool child process.

Now that we are managing our own sub-processes (each being its own process group leader), on stop --now --now, it will terminate each process group using os.killpg, hence you are more likely to see submit-failed. (I have not seen this yet, but I'll have a go to repeat it in my environment.)

matthewrmshin · 2018-03-26T15:28:00Z

The test uses a polling logic to poll for the existence of the job file of the affected task before issuing a stop --now --now. In most of my attempts to replicate the problem, the suite would stop just after the job file is written, but before the cylc jobs-submit command runs, so the task did not get into submit-failed.

The new behaviour is likely to be the most desirable - it terminates its own child processes correctly to allow the suite to fully shut down. However, the behaviour that this test attempts to test is not going to be reliable as it requires a state that is more transient than before.

hjoliver · 2018-03-26T19:40:59Z

Ah, that explains it!

This test tests that a restart does the right thing if it starts up with a task in in the ready state. The problems above show it is still possible - although not reliably! - for a clean shutdown to leave tasks as ready. (I agree that the new behaviour is more desirable though - it just makes the test trickier). However, it is certainly possible if the suite gets killed rather than shut down. I'll put up a quick PR that modifies the test to do a kill ...

hjoliver added the bug? Not sure if this is a bug or not label Mar 25, 2018

hjoliver added this to the soon milestone Mar 25, 2018

hjoliver mentioned this issue Mar 26, 2018

cat-log: remote cylc sub-command #2503

Merged

hjoliver mentioned this issue Mar 26, 2018

Fix test 13-ready-restart. #2612

Merged

matthewrmshin closed this as completed in #2612 Mar 27, 2018

matthewrmshin assigned hjoliver Mar 28, 2018

matthewrmshin modified the milestones: soon, next release Mar 28, 2018

matthewrmshin added bug Something is wrong :( and removed bug? Not sure if this is a bug or not labels Apr 26, 2018

MetRonnie mentioned this issue Jan 28, 2021

tests/k/hold-release/13-ready-restart.t not working #4052

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

newly flaky test hold-release/13-ready-restart.t #2610

newly flaky test hold-release/13-ready-restart.t #2610

hjoliver commented Mar 25, 2018

hjoliver commented Mar 25, 2018 •

edited

Loading

matthewrmshin commented Mar 25, 2018

hjoliver commented Mar 25, 2018

hjoliver commented Mar 25, 2018 •

edited

Loading

hjoliver commented Mar 25, 2018

hjoliver commented Mar 25, 2018

matthewrmshin commented Mar 26, 2018

matthewrmshin commented Mar 26, 2018

hjoliver commented Mar 26, 2018 •

edited

Loading

newly flaky test hold-release/13-ready-restart.t #2610

newly flaky test hold-release/13-ready-restart.t #2610

Comments

hjoliver commented Mar 25, 2018

hjoliver commented Mar 25, 2018 • edited Loading

matthewrmshin commented Mar 25, 2018

hjoliver commented Mar 25, 2018

hjoliver commented Mar 25, 2018 • edited Loading

hjoliver commented Mar 25, 2018

hjoliver commented Mar 25, 2018

matthewrmshin commented Mar 26, 2018

matthewrmshin commented Mar 26, 2018

hjoliver commented Mar 26, 2018 • edited Loading

hjoliver commented Mar 25, 2018 •

edited

Loading

hjoliver commented Mar 25, 2018 •

edited

Loading

hjoliver commented Mar 26, 2018 •

edited

Loading