newly flaky test hold-release/13-ready-restart.t #2610
Comments
Here's the suite log for a failed case:
Note that the job submit failure is recognized only a second after job submission, and just after the shutdown command is actioned. Does suite shutdown kill the job submit process, so that the resulting submit failure may or may not be recognized depending on how quickly the shutdown occurs? Also, it is odd that there are two submit-fail messages with different return codes.
Interesting. May help if we change shutdown option to a single `--now`?
Single
💩 this seems to be caused or at least exacerbated by my network config screw-up 😡 of #2611 ... however I won't close as invalid quite yet, because this issue was motivated by repeated failures of the same test on Travis CI for #2503 (and that platform is presumably not suffering from my networking incompetence).
Confirmed: this test does still fail intermittently, on master as well as on #2503. (The test does have an error, fixed in #2503, that leaves an immortal stalled suite on test failure, but that's not the primary problem.)
With my network fixed, and shutdown with single So the thing to understand is: why does shutdown
This is most likely a #2590 issue. (Sorry!) Before this change, on Now that we are managing our own sub-processes (each being its own process group leader), on
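To illustrate the general mechanism described here (not Cylc's actual process pool code, which is not shown in this thread), the following is a minimal Python sketch. It assumes the server starts the job submit command as the leader of its own process group and signals that group on shutdown. The choice of `SIGTERM` and the use of `start_new_session` are assumptions made for the example.

```python
import os
import signal
import subprocess

# Stand-in for the overridden job submit command ("sleep 10" in this test).
# start_new_session=True makes the child call setsid(), so it becomes the
# leader of a new process group (an assumption about how the pool works).
proc = subprocess.Popen(["sleep", "10"], start_new_session=True)

# The child leads its own group, so the group ID equals the child's PID.
pgid = os.getpgid(proc.pid)

# Hypothetical shutdown path: signal the whole group instead of waiting
# for the submit command to finish.
os.killpg(pgid, signal.SIGTERM)

# The submit command now exits long before its 10 seconds are up, with a
# negative return code naming the signal. A caller that sees this return
# code would treat the submission as failed.
returncode = proc.wait()
print(returncode)  # -15 on platforms where SIGTERM is signal 15
```

Under these assumptions, whether the failed submission is recorded before the server exits depends on timing, which would match the intermittent "submit-failed" state seen in the test.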
The test uses a polling logic to poll for the existence of the job file of the affected task before issuing a The new behaviour is likely to be the most desirable - it terminates its own child processes correctly to allow the suite to fully shut down. However, the behaviour that this test attempts to test is not going to be reliable as it requires a state that is more transient than before.
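The polling pattern mentioned above can be sketched as follows. The real test is a shell script, so this Python version only illustrates the idea; the function name `wait_for_file`, its parameters, and the example path are hypothetical.

```python
import os
import time


def wait_for_file(path, timeout=10.0, interval=0.2):
    """Poll until `path` exists or `timeout` seconds pass.

    Returns True if the file was seen, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    # One last check so a file created during the final sleep is not missed.
    return os.path.exists(path)


if __name__ == "__main__":
    # Hypothetical job file path, for illustration only.
    print(wait_for_file("/tmp/does-not-exist/job", timeout=0.5))
```

A poll like this only establishes that the job file existed at some moment. It says nothing about the state the task is in by the time the next command is actioned, which is why a test built on it becomes unreliable once that state is short-lived.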
Ah, that explains it! This test tests that a restart does the right thing if it starts up with a task in the ready state. The problems above show it is still possible - although not reliably! - for a clean shutdown to leave tasks as ready. (I agree that the new behaviour is more desirable though - it just makes the test trickier). However, it is certainly possible if the suite gets killed rather than shut down. I'll put up a quick PR that modifies the test to do a kill ...
This test overrides the job submit command to `sleep 10` for task `foo-1` and then does a quick `shutdown --now --now` to stop the server after `foo-1` is submitted and is in the "ready" state (due to the sleep). Then on `restart --hold` the ready task should reload to `held` and run after release with the job submit command restored.

As of recently (the change to the process pool maybe?) `foo-1` sometimes, but not always, goes to "submit-failed" immediately after the shutdown command (but well before the sleep ends). This breaks the test because the task then restarts as "submit-failed" instead of ready.