New behavior when max_client_disconnect is used and the worker node moves from disconnected to ready #15483
Comments
Hi, I will also rephrase the problem in a shorter form:
We have a bug in point number 4: the allocation shouldn't stop.
BR,
Hey @rikislaw, thanks for the report. We haven't been able to start on this fix yet due to some unrelated issues popping up this last week, but it is pretty high in our queue. Should be picked up soon. Apologies for the regression.
Hi everyone, thanks for the report and detailed information. I'm still investigating the issue, but so far I believe that the prestart task does not affect the problem, as I was able to reproduce it with a job that only has a single task. This may also not be a regression, as I've seen it happen in 1.4.2 after a few tries.
My current guess is that there's a race condition when the node reconnects and, depending on the order of events, the scheduler will make the wrong decision about the state of the allocation. The changes made in 1.4.3 (more specifically in #15068) introduced stronger ordering for events, which may have caused this specific race condition to happen more frequently.
I will keep investigating the problem and will post more updates as I figure things out.
I understand what the problem is now, and have confirmed that:
The root cause of the problem is that Nomad clients make two different RPC calls to update their state with the Nomad servers: one for the client status and one for the status of its allocations.
These calls are made in parallel, and so can reach the server in any order. This can cause a series of problems for the scheduler because it needs both pieces of information (client status and alloc status) in order to make proper scheduling decisions, but if a scheduling operation happens between the two calls the scheduler has incomplete data to work with.
The changes in #15068 fixed the problem described in #12680 where, upon reconnecting, a client's two updates could be processed in an order that led the scheduler to the wrong decision. The issue described here is a similar problem, but with the updates reaching the server in the opposite order.
The order of events for this problem to happen is as follows:
I have prototyped a fix and should have a PR available soon.
Hi, reproduction steps (I used the same job spec):
The job with max_client_disconnect and a constraint stanza:
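The original job spec is not reproduced above, so the following is only a minimal sketch of that shape of job, assuming a group-level max_client_disconnect plus a constraint that pins the group to one client; the node name, image, and other values are hypothetical:

```hcl
job "example2-sketch" {
  datacenters = ["dc1"]

  group "app" {
    # Keep allocs on a disconnected client in the "unknown" state for up to
    # 15 minutes instead of replacing them immediately.
    max_client_disconnect = "15m"

    # Hard-pin the group to a single client (hypothetical node name).
    constraint {
      attribute = "${node.unique.name}"
      value     = "desired-node-01"
    }

    task "app" {
      driver = "docker"

      config {
        image = "redis:7"
      }
    }
  }
}
```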
Alloc Status (nomad client disconnected)
Alloc Status (before the nomad service is started again)
Alloc Status (nomad client is ready again)
Alloc Status (nomad client has been ready for 15 minutes)
Debug bundles taken while testing (1.3.10 & 1.5.0) have been uploaded to:
I was finally able to reproduce the error. Some key things that are needed:
The root cause seems to be faulty logic in the scheduler reconciler that doesn't stop failed allocations when an allocation reconnects, leaving the cluster in a state where two allocs exist for the same task group.
The reconciler then runs again and notices that it needs to pick one allocation to stop. Since the allocation has already reconnected, it doesn't have any preference for which one to keep and it will stop either of them.
One key problem is the way the job file is written. It has an implicit constraint that makes it so it can only run on a specific client. This is not a good practice, as it prevents Nomad from being able to perform proper scheduling. Situations like this should be handled with scheduling preferences rather than logic that hard-pins the job to a single node (an illustrative sketch follows below).
I have a custom build just to validate this assumption. @ron-savoia or @margiran, would either of you be able to validate whether this custom build fixes the problem?
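As an illustration only (my sketch, not one of this issue's job files; the node name and image are hypothetical), an affinity expresses a preference for a particular node while still allowing Nomad to place the group elsewhere when that node is unavailable:

```hcl
job "example-affinity" {
  datacenters = ["dc1"]

  group "app" {
    # Prefer the desired node, but allow placement on other nodes if needed.
    affinity {
      attribute = "${node.unique.name}"
      value     = "desired-node-01"
      weight    = 100
    }

    task "app" {
      driver = "docker"

      config {
        image = "redis:7"
      }
    }
  }
}
```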
Thank you @lgfa29,
Thank you very much for the confirmation @margiran! I'm working on a proper fix and will open a PR as soon as it's ready.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
1.3.8/1.4.3
Issue
A new behavior is seen in 1.3.8 & 1.4.3 (verified) where, for jobs that have max_client_disconnect defined and use a prestart task to ensure the task only runs on the desired node, the alloc changes state from unknown, to running, and then to complete once the node is back in a ready state. Eventually a new alloc will be placed on the desired node, after it has cycled through other non-desired nodes; however, this is not the same behavior seen in prior versions tested (e.g. 1.3.3/1.4.2).

Behavior seen when testing with 1.3.8/1.4.3, after the alloc is placed on the desired node and the nomad service is then stopped on that node:

1. The node status changes to disconnected and the placed alloc status is unknown, as expected.
2. The alloc status stays unknown for the duration of the node being disconnected.
3. Once the node is back to ready, the alloc status changes to running.
4. Within ~25 seconds the alloc status changes to complete and the job is moved to pending.
5. The job stays pending until a new alloc is placed on the preferred node.

This behavior was only seen when a job uses a prestart task (raw_exec in this example) which ensures the job is placed on the specific/desired node before the main task starts.
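The 91244_mod-2.nomad file itself is not included here; as a rough sketch only (the hostname check, node name, and image are hypothetical), the pattern described above — a prestart raw_exec task that only succeeds on the desired node, combined with max_client_disconnect — could look like this:

```hcl
job "pinned-via-prestart" {
  datacenters = ["dc1"]

  group "app" {
    # Allow allocs on a disconnected client to be resumed when it reconnects.
    max_client_disconnect = "15m"

    # Prestart task that fails anywhere but the desired node, so the main task
    # effectively only runs there (hypothetical check).
    task "node-check" {
      driver = "raw_exec"

      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      config {
        command = "/bin/sh"
        args    = ["-c", "test \"$(hostname)\" = \"desired-node-01\""]
      }
    }

    task "main" {
      driver = "docker"

      config {
        image = "redis:7"
      }
    }
  }
}
```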
Reproduction steps
The same steps, times and job files were used while testing against 1.3.x (1.3.8 & 1.3.3) and 1.4.x (1.4.3 & 1.4.2).
Jobs Used for Testing:
- example1.nomad: max_client_disconnect is not set. Affinity is used so the preferred node will be used for the initial placement and the alloc can be migrated to other nodes if needed.
- example2.nomad: max_client_disconnect is set. A constraint is used so the alloc will be pinned to the preferred node.
- 91244_mod-2.nomad: max_client_disconnect is set and a prestart task is used to ensure that the job is placed on the desired node before the main task starts.

High Level Steps:
Ran nomad node status and nomad job status -verbose <JOB_ID> at intervals. I used 5 minute intervals, after the nomad service was stopped, for the baseline status. At the 15 minute mark, I started the nomad service on the specific/desired node and again ran nomad node status and nomad job status -verbose <JOB_ID> for example2.nomad and 91244_mod-2.nomad.
In versions prior to 1.3.8/1.4.3, the alloc status behavior of 91244_mod-2.nomad is:
- Alloc on the preferred node: running, then unknown while the node is disconnected.
- Allocs placed on non-preferred nodes, after the nomad service is stopped on the preferred node: failed.
- Alloc on the preferred node once the node is ready again: running.
5 Minute - Alloc Status
10 Minute - Alloc Status
15 Minute - Alloc Status (after nomad service started on the worker node)
Actual Result
In versions 1.3.8/1.4.3, the alloc status behavior of 91244_mod-2.nomad is:
- Alloc on the preferred node: running, then unknown while the node is disconnected.
- Allocs placed on non-preferred nodes, after the nomad service is stopped on the preferred node: failed.
- Alloc on the preferred node once the node is ready again: running, then moves to complete.
5 Minute - Alloc Status
10 Minute - Alloc Status
15 Minute - Alloc Status (after nomad service started on the worker node)
Job file (if appropriate)
example1.nomad
example2.nomad
91244_mod-2.nomad
Nomad Server logs (if appropriate)
Debug bundles taken while testing (1.3.8/1.3.3 & 1.4.3/1.4.2) have been uploaded to: https://drive.google.com/drive/folders/1Ds83JQBQlPQukPELj3Ia_iCUnaaE73Jp?usp=sharing
Nomad Client logs (if appropriate)