Nomad should differentiate between crashed/rebooted and disconnected nomad client for jobs with max_client_disconnect
#15144
Comments
@yaroslav-007 can you clarify the expected behavior here? Nomad servers can't know whether a Nomad client is netsplit vs crashed, because either way the client is no longer communicating with the server. And it doesn't actually matter in terms of the intended behavior of max_client_disconnect.
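As a minimal sketch of that point (not Nomad source code; nodeView and disconnected are hypothetical names used only for illustration), the only signal the server has is the age of a node's last heartbeat, and a crashed client and a netsplit client look exactly the same by that measure:

```go
package main

import (
	"fmt"
	"time"
)

// nodeView is a simplified, hypothetical stand-in for what a server tracks
// about a client node: essentially just when it last heartbeated.
type nodeView struct {
	lastHeartbeat time.Time
}

// disconnected reports whether the node has missed its heartbeat window.
// A crashed client and a netsplit client are indistinguishable here:
// in both cases the heartbeats simply stop arriving.
func (n nodeView) disconnected(now time.Time, window time.Duration) bool {
	return now.Sub(n.lastHeartbeat) > window
}

func main() {
	now := time.Now()
	crashed := nodeView{lastHeartbeat: now.Add(-2 * time.Minute)}
	netsplit := nodeView{lastHeartbeat: now.Add(-2 * time.Minute)}

	fmt.Println(crashed.disconnected(now, 30*time.Second))  // true
	fmt.Println(netsplit.disconnected(now, 30*time.Second)) // true: same view
}
```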
Hello @tgross,
More specifically, this could have been caused by #12680 where a new allocation was being placed unnecessarily. Hopefully this was fixed in 1.4.3 😄
Hello, I have checked with the latest version (1.4.3) and it seems Nomad continues to reschedule the jobs.
Awesome, thanks for following up!
Apologies, @yaroslav-007 followed up with me out-of-band and clarified that 1.4.3 has not fixed the bug. I'm going to reopen this and make sure it gets marked for roadmapping.
I asked @yaroslav-007 to retest this since we've had a lot of work on this recently. Copy/pasting his Slack reply here:
Thanks @yaroslav-007
After speaking with @yaroslav-007 we came to the conclusion that what the customer wants is more control over which allocation is kept once the disconnected node starts responding again:
I spent a bunch of time in the client code here.

Something to keep in mind is that you can restart a client agent and all the allocations will continue to run. There's no way for a client to detect a "system crash". The only thing the client can know on restart is that an allocation is no longer running, because it fails to restore the allocrunner. Which is fine, because the client could be gracefully stopped and the allocation could crash, and we'd have the same situation as a "system crash" as far as the server is concerned. So when the client restarts, it does the following:
It looks to me that the problem is that we mark the node as reconnecting when we get a heartbeat, so there's a race between when the first heartbeat is sent and when the stopped allocation is marked as failed. The trivial fix would be to do the restore before we start registration and heartbeat, but I'm not sure what side effects that'll have. I do know that we've often discussed trying to move the restore until after the …

At first I thought maybe the solution is to not mark the allocation as reconnecting until we get an update from the client on the allocation status? But if the client was disconnected because of a network split and not crashed, it would not have any allocation status updates to send, just a heartbeat! We'd be able to fix the crashed case but would introduce a bug in the non-crashed case.

Again, I don't have a solution, but hopefully this context helps 😀
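A minimal sketch of the startup ordering race described in the last two comments (this is not taken from the Nomad codebase; registerAndHeartbeat and restoreAllocs are hypothetical stand-ins for the client startup steps discussed above):

```go
package main

import (
	"fmt"
	"time"
)

// registerAndHeartbeat stands in for the client registering with the server
// and sending its first heartbeat. As described above, the server marks the
// node as reconnecting as soon as this arrives.
func registerAndHeartbeat() {
	fmt.Println("heartbeat sent: server now treats the node as reconnecting")
}

// restoreAllocs stands in for restoring allocrunners from local state. A
// failed restore is the only way the client learns that an allocation died
// while the agent was down, and only then can it report the alloc as failed.
func restoreAllocs() {
	time.Sleep(50 * time.Millisecond) // restore takes time after a crash
	fmt.Println("restore failed: allocation marked failed, update sent to server")
}

func main() {
	// Current ordering per the discussion above: heartbeat first, restore
	// second. In the window between these two steps the server only knows
	// the node reconnected, not that its allocation is gone; that window is
	// the race being described.
	registerAndHeartbeat()
	restoreAllocs()

	// The "trivial fix" mentioned above would swap these two calls; the
	// open question is what side effects that reordering would have.
}
```

Under this framing, gating the reconnect on an allocation status update would fix the crashed case (restoreAllocs eventually reports a failure) but would stall the netsplit case, where there is nothing to report beyond the heartbeat.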
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad should differentiate between crashed/rebooted and disconnected nomad client for jobs with max_client_disconnect for the relocation of allocation

Nomad version
nomad version: 1.4.2

Issue
Nomad reschedules allocations of jobs with max_client_disconnect on a crashed system where the allocation was stopped.

Reproduction steps
Vagrant project attached. Follow the README file in the attachment.

Expected Result
Nomad should differentiate between crashed and disconnected nomad client for jobs with max_client_disconnect. It should be able to see that there is no running allocation on the reconnected client and thus not take any further action.

Actual Result
Nomad will reschedule a job (with max_client_disconnect) on the reconnect of a crashed system.

Attachment
Attachment