Add ReconnectModifyIndex to handle reconnect lifecycle #14948
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1220,6 +1220,7 @@ func (n *Node) UpdateAlloc(args *structs.AllocUpdateRequest, reply *structs.Gene | |
if evalTriggerBy != structs.EvalTriggerJobDeregister && | ||
alloc.ClientStatus == structs.AllocClientStatusUnknown { | ||
evalTriggerBy = structs.EvalTriggerReconnect | ||
+ alloc.ReconnectModifyIndex = allocToUpdate.ReconnectModifyIndex | ||
This assignment corrupts the state store because |
||
} | ||
|
||
// If we weren't able to determine one of our expected eval triggers, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9730,6 +9730,9 @@ type Allocation struct { | |
// lets the client pull only the allocs updated by the server. | ||
AllocModifyIndex uint64 | ||
|
||
+ // ReconnectModifyIndex is used to determine if the server has processed the node reconnect. | ||
+ ReconnectModifyIndex uint64 | ||
Comment on lines +9733 to +9734: We should make sure this gets onto the |
||
|
||
// CreateTime is the time the allocation has finished scheduling and been | ||
// verified by the plan applier. | ||
CreateTime int64 | ||
|
@@ -10393,9 +10396,14 @@ func (a *Allocation) LastUnknown() time.Time { | |
return lastUnknown.UTC() | ||
} | ||
|
||
- // Reconnected determines whether a reconnect event has occurred for any task | ||
- // and whether that event occurred within the allowable duration specified by MaxClientDisconnect. | ||
- func (a *Allocation) Reconnected() (bool, bool) { | ||
+ // IsReconnecting determines whether a reconnect event has occurred for a task, | ||
+ // whether that event occurred within the allowable duration specified by MaxClientDisconnect, | ||
+ // and whether the reconnect has been processed. | ||
+ func (a *Allocation) IsReconnecting() (isReconnecting bool, expired bool) { // isReconnecting, expired | ||
+ if a.ReconnectModifyIndex != 0 && a.AllocModifyIndex > a.ReconnectModifyIndex { | ||
+ isReconnecting = true | ||
+ } | ||
|
||
var lastReconnect time.Time | ||
for _, taskState := range a.TaskStates { | ||
for _, taskEvent := range taskState.Events { | ||
|
@@ -10413,7 +10421,7 @@ func (a *Allocation) Reconnected() (bool, bool) { | |
return false, false | ||
} | ||
|
||
- return true, a.Expired(lastReconnect) | ||
+ return isReconnecting, a.Expired(lastReconnect) | ||
} | ||
|
||
func (a *Allocation) ToIdentityClaims(job *Job) *IdentityClaims { | ||
|
This value comes from the server:
But that made me remember there are two code paths in the state store for updating allocations: one for upserting allocs from the server and one for updating allocs from the client. In any case, neither of them handles the `ReconnectModifyIndex` field, because for existing allocations (which is what we care about here) we copy the existing Allocation and then merge the needed fields over before inserting. So we're not actually updating this field in Nomad's state.
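To make the copy-then-merge concern concrete, here is a minimal sketch of how a field can be silently dropped on update. The `Alloc` and `Store` types and the `updateFromClient` method are hypothetical stand-ins written for illustration; they are not Nomad's actual state store code, and the set of merged fields is an assumption.

```go
package main

import "fmt"

// Alloc is a hypothetical, trimmed-down stand-in for structs.Allocation.
type Alloc struct {
	ID                   string
	ClientStatus         string
	ModifyIndex          uint64
	ReconnectModifyIndex uint64
}

// Store is a hypothetical in-memory stand-in for the state store.
type Store struct {
	allocs map[string]*Alloc
}

// updateFromClient mimics the copy-then-merge pattern: it copies the
// existing record and merges over only the fields it knows about.
func (s *Store) updateFromClient(index uint64, update *Alloc) {
	existing, ok := s.allocs[update.ID]
	if !ok {
		return
	}
	copied := *existing // start from the stored record, not the update
	copied.ClientStatus = update.ClientStatus
	copied.ModifyIndex = index
	// ReconnectModifyIndex is never merged here (an assumption of this
	// sketch), so whatever the client sent for it is discarded.
	s.allocs[update.ID] = &copied
}

func main() {
	s := &Store{allocs: map[string]*Alloc{
		"a1": {ID: "a1", ClientStatus: "unknown", ModifyIndex: 10},
	}}
	s.updateFromClient(11, &Alloc{
		ID:                   "a1",
		ClientStatus:         "running",
		ReconnectModifyIndex: 11,
	})
	got := s.allocs["a1"]
	fmt.Println(got.ClientStatus, got.ReconnectModifyIndex) // prints "running 0"
}
```

In this sketch the status change is persisted but the reconnect index stays at zero, which is the behavior the comment describes for any field that is not explicitly merged.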
So in client.go line 2033 it gets added to the stripped alloc during allocSync. That then gets sent to Node Update, which updates state and triggers an eval. When the eval fires, the index is set. We then have to unset it when applying the plan. Have you tried it out? I had logging in here during development showing it all flowed through.
That's the bit where I don't see how it's happening. Any update of an existing object takes a copy first (ref state_store.go#L3474) and then modifies the copy before inserting it. So if we haven't pulled in information that the client is authoritative on, the state isn't getting updated for the transaction.
I haven't had a chance to test it out thoroughly (still trying to get 1.4.2 out! 😁) but I suspect the reason it's "working" right now is because of the state store corruption on line 1223. That'll appear correct under some circumstances but won't have gone thru raft correctly.
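The "appears correct but never went through raft" concern can be illustrated with a small sketch. It assumes a store that hands out pointers to its own records and treats a `log` slice as a stand-in for the replicated log; `Alloc`, `Store`, `Get`, and `Apply` are hypothetical names for illustration, not Nomad or go-memdb APIs.

```go
package main

import "fmt"

// Alloc is a hypothetical stand-in for structs.Allocation.
type Alloc struct {
	ID                   string
	ReconnectModifyIndex uint64
}

// Store is a hypothetical in-memory store. The log slice stands in for
// the replicated log that every real write is expected to pass through.
type Store struct {
	allocs map[string]*Alloc
	log    []string
}

// Get returns the stored pointer itself, so callers share the record.
func (s *Store) Get(id string) *Alloc { return s.allocs[id] }

// Apply is the sanctioned write path: record the change, then store a copy.
func (s *Store) Apply(updated *Alloc) {
	s.log = append(s.log, "update "+updated.ID)
	c := *updated
	s.allocs[updated.ID] = &c
}

func main() {
	s := &Store{allocs: map[string]*Alloc{
		"a1": {ID: "a1"},
		"a2": {ID: "a2"},
	}}

	// In-place mutation: local readers see the new value, but nothing
	// was logged, so no other replica would ever learn about it.
	a := s.Get("a1")
	a.ReconnectModifyIndex = 42
	fmt.Println(s.Get("a1").ReconnectModifyIndex, len(s.log)) // prints "42 0"

	// Copy, modify, then write through the sanctioned path.
	c := *s.Get("a2")
	c.ReconnectModifyIndex = 42
	s.Apply(&c)
	fmt.Println(s.Get("a2").ReconnectModifyIndex, len(s.log)) // prints "42 1"
}
```

Both reads return 42 locally, which is why the first pattern can look like it works; only the second leaves a record that replication could act on.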