Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

newly flaky test hold-release/13-ready-restart.t #2610

Closed
hjoliver opened this issue Mar 25, 2018 · 9 comments
Closed

newly flaky test hold-release/13-ready-restart.t #2610

hjoliver opened this issue Mar 25, 2018 · 9 comments
Assignees
Labels
bug Something is wrong :(
Milestone

Comments

@hjoliver
Copy link
Member

This test overrides the job submit command to sleep 10 for task foo-1 and then does a quick shutdown --now --now to stop the server after foo-1 is submitted and is in the "ready" state (due to the sleep). Then on restart --hold the ready task should reload to held and run after release with the job submit command restored.

As of recently(the change to the process pool maybe?) foo-1 sometimes, but not always, goes to "submit-failed" immediately after the shutdown command (but well before the sleep ends). This breaks the test because the task then restarts as "submit-failed" instead of ready.

@hjoliver hjoliver added the bug? Not sure if this is a bug or not label Mar 25, 2018
@hjoliver hjoliver added this to the soon milestone Mar 25, 2018
@hjoliver
Copy link
Member Author

hjoliver commented Mar 25, 2018

Here's the suite log for a failed case:

$ cat test.log | grep -v 'foo\.1' | grep -v 'bar\.1'
2018-03-25T11:03:35+13 INFO - Suite starting: server=vagrant.vm:43025 pid=5973
2018-03-25T11:03:35+13 INFO - Cylc version: 7.6.0-85-gf5ad64
2018-03-25T11:03:35+13 INFO - Run mode: live
2018-03-25T11:03:35+13 INFO - Initial point: 1
2018-03-25T11:03:35+13 INFO - Final point: 1
2018-03-25T11:03:35+13 INFO - Cold Start 1
2018-03-25T11:03:37+13 INFO - [foo-1.1] -submit-num=1, owner@host=localhost
2018-03-25T11:03:38+13 INFO - [client-command] stop_now [email protected]:cylc-stop 94c0acf5-93c9-4c16-b5f4-652feecc532b
2018-03-25T11:03:38+13 INFO - Command succeeded: stop_now(terminate=True)
2018-03-25T11:03:38+13 INFO - Processing 1 queued command(s)
        +       stop_now(terminate=True)
2018-03-25T11:03:38+13 INFO - Suite shutting down - REQUEST(NOW-NOW)
2018-03-25T11:03:38+13 ERROR - [jobs-submit cmd] cylc jobs-submit -- /home/vagrant/cylc-run/cylctb-20180324T220333Z/hold-release/13-ready-restart/log/job 1/foo-1/01
        [jobs-submit ret_code] -9
2018-03-25T11:03:38+13 ERROR - [jobs-submit cmd] cylc jobs-submit -- /home/vagrant/cylc-run/cylctb-20180324T220333Z/hold-release/13-ready-restart/log/job 1/foo-1/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2018-03-25T11:03:38+13|1/foo-1/01|1
2018-03-25T11:03:38+13 INFO - [foo-1.1] -(current:ready) submission failed at 2018-03-25T11:03:38+13
2018-03-25T11:03:38+13 ERROR - [foo-1.1] -submission failed
2018-03-25T11:03:44+13 INFO - DONE

Note the job submit failure is recognized only a second after job submission, and just after the shutdown command is actioned.

Does suite shutdown kill the job submit process, and the resulting submit failure may be recognized or not depending on how quickly the shutdown occurs?

Also, odd that there are two submit fail messages with different return codes??

@matthewrmshin
Copy link
Contributor

Interesting. May help if we change shutdown option to a single --now?

@hjoliver
Copy link
Member Author

Single --now does not stop the test failing (although it leaves no processes running after the test exits, whereas double now does for a while ... as expected, I think).

@hjoliver
Copy link
Member Author

hjoliver commented Mar 25, 2018

💩 this seems to be caused or at least exacerbated by my network config screw-up 😡 of #2611 ... however I won't close as invalid quite yet, because this issue was motivated by repeated failures of the same test on Travis CI for #2503 (and that platform is presumably not suffering from my networking incompetence).

@hjoliver
Copy link
Member Author

Confirmed this test does still fail intermittently, on master as well as #2503 (it does have an error, fixed in 2503, that leaves an immortal stalled suite on test failure, but that's not the primary problem).

@hjoliver
Copy link
Member Author

With my network fixed, and shutdown with single --now, the test fails for a different reason: the suite waits before shutdown until the fake "sleep 10" job submit command exits, and detects submit-failed then (correctly, but not what the test is trying to do).

So the thing to understand is: why does shutdown --now --now sometimes detect task submit-failed before shutdown and sometimes not. Maybe the test is just poorly conceived (but still this does seem to flag an annoying inconsistency that could affect real suites on shutdown and restart).

@matthewrmshin
Copy link
Contributor

This is most likely a #2590 issue. (Sorry!)

Before this change, on stop --now --now, it will call multiprocessing.Pool.terminate, which I believe would attempt to terminate any remaining child processes in the process pool. I am not sure if it is able to kill a subprocess that is being run by the pool child process.

Now that we are managing our own sub-processes (each being its own process group leader), on stop --now --now, it will terminate each process group using os.killpg, hence you are more likely to see submit-failed. (I have not seen this yet, but I'll have a go to repeat it in my environment.)

@matthewrmshin
Copy link
Contributor

The test uses a polling logic to poll for the existence of the job file of the affected task before issuing a stop --now --now. In most of my attempts to replicate the problem, the suite would stop just after the job file is written, but before the cylc jobs-submit command runs, so the task did not get into submit-failed.

The new behaviour is likely to be the most desirable - it terminates its own child processes correctly to allow the suite to fully shut down. However, the behaviour that this test attempts to test is not going to be reliable as it requires a state that is more transient than before.

@hjoliver
Copy link
Member Author

hjoliver commented Mar 26, 2018

Ah, that explains it!

This test tests that a restart does the right thing if it starts up with a task in in the ready state. The problems above show it is still possible - although not reliably! - for a clean shutdown to leave tasks as ready. (I agree that the new behaviour is more desirable though - it just makes the test trickier). However, it is certainly possible if the suite gets killed rather than shut down. I'll put up a quick PR that modifies the test to do a kill ...

@matthewrmshin matthewrmshin modified the milestones: soon, next release Mar 28, 2018
@matthewrmshin matthewrmshin added bug Something is wrong :( and removed bug? Not sure if this is a bug or not labels Apr 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

No branches or pull requests

2 participants