Nomad server panics on evaluation of higher-priority system job #5777
Comments
Thanks @cheeseprocedure for reporting this issue; this is a serious regression. We believe it was addressed incidentally in GH-5545. The fix will be released in 0.9.2 (to be released shortly), and you can test it against the 0.9.2-rc1 binaries at https://releases.hashicorp.com/nomad/0.9.2-rc1/ . Please let us know your findings.
Thanks very much @notnoop - we'll try it with 0.9.2-rc1 shortly!
@notnoop looks good - I was unable to reproduce this with the Nomad 0.9.2 release. Thanks again!
Nomad version
Operating system and Environment details
Ubuntu 16.04 LTS amd64
Issue
When attempting to run a new system job, the Nomad cluster leader panics if a lower-priority system job is already running that occupies nearly all Nomad client resources. The panic then appears to recur on each newly elected leader, and so on, effectively rendering the cluster unmanageable.
Running `nomad plan` against the new system job definition causes a similar panic. In our case, Nomad is respawned by systemd and the cluster is otherwise healthy, but the job is never scheduled.
Reproduction steps
Cluster setup: 3x Nomad servers, 5x Nomad clients with 6192 MHz of CPU available for tasks. (Scale `resources` in the job definitions below accordingly.)

Execute `nomad run systemjob1.hcl`, then execute `nomad run systemjob2.hcl` or `nomad plan systemjob2.hcl`.
.Job file (if appropriate)
systemjob1.hcl
systemjob2.hcl
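As a rough illustration only (these are not the reporter's actual files), two jobs matching the description might look like the following minimal sketch. The datacenter name, task driver, command, and exact resource values are assumptions; the essential points are the `system` type, the priority difference, and a CPU reservation in systemjob1 large enough that systemjob2 no longer fits.

```hcl
# Hypothetical illustration only -- not the original job files from this report.

# systemjob1.hcl: lower-priority system job sized to occupy nearly all of the
# 6192 MHz of schedulable CPU on each client.
job "systemjob1" {
  datacenters = ["dc1"]   # assumed datacenter name
  type        = "system"
  priority    = 50        # Nomad's default priority

  group "group1" {
    task "task1" {
      driver = "raw_exec"  # assumed driver; must be enabled on the clients

      config {
        command = "/bin/sleep"
        args    = ["86400"]
      }

      resources {
        cpu    = 6000   # leaves only ~192 MHz free per client
        memory = 64
      }
    }
  }
}
```

```hcl
# systemjob2.hcl: higher-priority system job that no longer fits on any client.
job "systemjob2" {
  datacenters = ["dc1"]
  type        = "system"
  priority    = 70        # higher than systemjob1

  group "group1" {
    task "task1" {
      driver = "raw_exec"

      config {
        command = "/bin/sleep"
        args    = ["86400"]
      }

      resources {
        cpu    = 500    # more than the ~192 MHz left free by systemjob1
        memory = 64
      }
    }
  }
}
```

With definitions along these lines, `nomad run systemjob1.hcl` fills almost all client CPU, so subsequently planning or running the higher-priority systemjob2 reproduces the condition described in this issue.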
Nomad Client logs (if appropriate)
n/a
Nomad Server logs (if appropriate)
After executing `nomad plan systemjob2.hcl` and receiving `Error during plan: Unexpected response code: 500 (rpc error: EOF)`, the Nomad cluster leader logs the following:

`nomad run systemjob2.hcl` hangs at `Evaluation triggered by job "systemjob2"` while a similar trace appears in the leader's logs:

The outcome here is much worse than with `nomad plan`, as panics continue on the current cluster leader: