-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allocrunner: refactor task coordinator #14009
Conversation
The current implementation for the task coordinator unblocks tasks by performing destructive operations over its internal state (like closing channels and deleting maps from keys). This presents a problem in situations where we would like to revert the state of a task, such as when restarting an allocation with tasks that have already exited. With this new implementation the task coordinator behaves more like a finite state machine where task may be blocked/unblocked multiple times by performing a state transition. This initial part of the work only refactors the task coordinator and is functionally equivalent to the previous implementation. Future work will build upon this to provide bug fixes and enhancements.
default: | ||
return true | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is just a refactoring of existing code, I tried to keep the behaviour the same, including the tests, but the file rename created a new diff. Here's the old vs. new test diff: https://gist.github.com/lgfa29/6c7d7b6590a9ea377ad12e886c3bfd4c
@lgfa29 any chance you've run this under the Go race detector? |
I have not, but that's a great idea. I have a janky random task state generator that I used to find some subtle bugs, I will run it with the race detector activated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgfa29 this is a great PR and will make a huge improvement in comprehensibility of this tricky area of the client. I've left some comments for discussion.
// Start and wait for all tasks. | ||
for _, task := range ar.tasks { | ||
go task.Run() | ||
} | ||
|
||
// Block on all tasks except poststop tasks | ||
for _, task := range ar.tasks { | ||
if !task.IsPoststopTask() { | ||
<-task.WaitCh() | ||
} | ||
} | ||
|
||
// Signal poststop tasks to proceed to main runtime | ||
ar.taskHookCoordinator.StartPoststopTasks() | ||
|
||
// Wait for poststop tasks to finish before proceeding | ||
for _, task := range ar.tasks { | ||
if task.IsPoststopTask() { | ||
<-task.WaitCh() | ||
} | ||
<-task.WaitCh() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section immediately shows the value of this approach 👍
|
||
// taskStateUpdated notifies that a task state has changed. This may cause the | ||
// taskCoordinator FSM to progresses to another state. | ||
func (c *taskCoordinator) taskStateUpdated(states map[string]*structs.TaskState) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A note to any other reviewers: these *structs.TaskState
pointers are to copies of the task state created in allocrunner
The initial implementation wasn't very clear as to how all pieces were connected together. This was the result of bad naming and encapsulation. This commits fixes this by moving the new structs into their own package since they are isolated from the rest of the allocrunner. Moving them into their own package allows for simpler names that don't have to repeat the task* prefix all the time and for better interfaces, where methods that are expected to be called by external components are now public and internal methods remain private. The package also has a doc.go file with more extensive documentation. The taskCoordinatorController struct was also poorly named (controller is too generic) so it was renamed to simply Gate as it better reflects its controls and behaviour.
A poststart task can have sidecar set to true or false. This difference is usually not relevant when coordinating task start order: even with a restart command, tasks only run once. But when a taskRunner is recovered after a Nomad agent restarts, the Coordinator must block postart non-sidecar tasks from running again.
@shoenig @tgross @schmichael I'm done fiddling with this PR now 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! The improvement in readability is 💯
One last thought - would it be worth holding off on backporting until it has time to bake in the 1.4 beta? We could still backport it to 1.3.x on the 1.4 release; this just seems a bit much to go into a bugfix backport directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work here @lgfa29!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing
package comment is the only blocker I think. Great work!
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting maps from keys).
This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.
With this new implementation the task coordinator behaves more like a
finite state machine where task may be blocked/unblocked multiple times
by performing a state transition.
This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.
Note to reviewers: the implementation ended up a little different from the internal RFC description. During tests I noticed that there were more state transitions than I first thought about and the mechanism of closing channels became brittle.
It was very easy to accidentally close a channel twice, which meant that we had to be very careful during state transitions. This resulted in an implicit dependency between a state and the path taken to reach it (e.g., if coming from the
X
state don't close channelY
).So I opted for an earlier idea which is to have channels with a producer that can be disabled on demand. This makes the operation idempotent, so we don't have to worry about previous states during state transition, but it adds a bit more code and could have some other problems I may have missed.