scheduler: `RescheduleTracker` dropped if follow-up fails placements #12319
While working on #20462 #12319 I found that some of our scheduler tests around down nodes or disconnected clients were enforcing invariants that were unclear. This changeset pulls out some minor refactorings so that the bug fix PR is easier to review. This includes:

* Migrating a few tests from `testify` to `shoenig/test` that I'm going to touch in #12319 anyways.
* Adding test names to the node down test.
* Update the disconnected client test so that we always re-process the pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the test setup rather than implied by whether we re-process the pending/blocked eval.

Ref: #20462
Ref: #12319
When an allocation fails it triggers an evaluation. The evaluation is processed and the scheduler sees it needs to reschedule, which triggers a follow-up eval. The follow-up eval creates a plan to `(stop 1) (place 1)`. The replacement alloc has a `RescheduleTracker` (or gets its `RescheduleTracker` updated).

But in the case where the follow-up eval can't place all allocs (there aren't enough resources), it can create a partial plan to `(stop 1) (place 0)`. It then creates a blocked eval. The plan applier stops the failed alloc. Then when the blocked eval is processed, the job is missing an allocation, so the scheduler creates a new allocation. This allocation is _not_ a replacement from the perspective of the scheduler, so it's not handed off a `RescheduleTracker`.

This changeset fixes this by annotating the reschedule tracker whenever the scheduler can't place a replacement allocation. We check this annotation for allocations that have the `stop` desired status when filtering out allocations to pass to the reschedule tracker. I've also included tests that cover this case and expand coverage of the relevant area of the code.

Fixes: #12147
Fixes: #17072
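The description above can be illustrated with a toy model. This is only a sketch, not the scheduler code from this changeset: the types are simplified stand-ins, and `AwaitingReplacement`, `processFollowUp`, `processBlocked`, and the `fixed` flag are hypothetical names made up for the example. It walks the `(stop 1) (place 0)` path and shows how the reschedule history is lost without the annotation and carried over with it.

```go
package main

import "fmt"

// RescheduleEvent and RescheduleTracker are simplified stand-ins for the
// real structs; only the fields needed for the illustration are modeled.
type RescheduleEvent struct {
	PrevAllocID string
}

type RescheduleTracker struct {
	Events []RescheduleEvent
	// AwaitingReplacement is a hypothetical annotation: the follow-up eval
	// stopped this alloc but could not place its replacement.
	AwaitingReplacement bool
}

type Alloc struct {
	ID            string
	DesiredStatus string // "run" or "stop"
	Tracker       *RescheduleTracker
}

// nextTracker copies the previous alloc's history and records one more event.
func nextTracker(prev *Alloc) *RescheduleTracker {
	t := &RescheduleTracker{}
	if prev.Tracker != nil {
		t.Events = append(t.Events, prev.Tracker.Events...)
	}
	t.Events = append(t.Events, RescheduleEvent{PrevAllocID: prev.ID})
	return t
}

// processFollowUp models the follow-up eval: it always stops the failed
// alloc, and places a replacement only when there is capacity.
func processFollowUp(failed *Alloc, hasCapacity, fixed bool) *Alloc {
	failed.DesiredStatus = "stop"
	if hasCapacity {
		// (stop 1) (place 1): the replacement inherits the history.
		return &Alloc{ID: failed.ID + "-replacement", DesiredStatus: "run", Tracker: nextTracker(failed)}
	}
	// (stop 1) (place 0): with the fix, annotate the tracker so a later
	// eval can still recognize this alloc as needing a replacement.
	if fixed {
		if failed.Tracker == nil {
			failed.Tracker = &RescheduleTracker{}
		}
		failed.Tracker.AwaitingReplacement = true
	}
	return nil
}

// processBlocked models the blocked eval: the job is missing an alloc, so a
// new one is created. Only the fixed path checks the annotation on the
// stopped alloc and hands off the history.
func processBlocked(stopped *Alloc, fixed bool) *Alloc {
	placed := &Alloc{ID: stopped.ID + "-new", DesiredStatus: "run"}
	if fixed && stopped.DesiredStatus == "stop" &&
		stopped.Tracker != nil && stopped.Tracker.AwaitingReplacement {
		placed.Tracker = nextTracker(stopped)
	}
	return placed
}

// rescheduleCount reports how many reschedule events an alloc carries.
func rescheduleCount(a *Alloc) int {
	if a.Tracker == nil {
		return 0
	}
	return len(a.Tracker.Events)
}

func main() {
	for _, fixed := range []bool{false, true} {
		failed := &Alloc{ID: "a1", DesiredStatus: "run"}
		processFollowUp(failed, false, fixed) // no capacity: nothing placed
		placed := processBlocked(failed, fixed)
		fmt.Printf("fixed=%v reschedule events=%d\n", fixed, rescheduleCount(placed))
	}
}
```

Running the sketch prints `fixed=false reschedule events=0` followed by `fixed=true reschedule events=1`: without the annotation the new allocation starts with no history, so any reschedule limit would effectively reset.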
I haven't fully absorbed all the implications of this, but here are my comments so far
lgtm!
LGTM!
Waiting on some comments Juanita told me she was working on before merging this.
Great work in this meta bug :)
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Fixes: https://hashicorp.atlassian.net/browse/NET-9551
Workflow of the typical happy path:
Workflow of the buggy path: