Keep evicted workloads in admitted while the associated jobs are still active #692
Conversation
Hi @trasc. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/milestone v0.4
force-pushed from a5bcbb5 to 01e5c35
/retest
force-pushed from 01e5c35 to 8bae1a5
force-pushed from 8bae1a5 to bcf19dd
force-pushed from bcf19dd to 5f84965
/retest
force-pushed from 5f84965 to ca6f725
force-pushed from ca6f725 to 4595be9
force-pushed from 4595be9 to e6488d7
force-pushed from e6488d7 to e51b20a
@@ -169,7 +169,8 @@ func (s *Scheduler) schedule(ctx context.Context) {
 		log.Error(err, "Failed to preempt workloads")
 	}
 	if preempted != 0 {
-		e.inadmissibleMsg += fmt.Sprintf(". Preempted %d workload(s)", preempted)
+		e.inadmissibleMsg += fmt.Sprintf(". Pending the preemption of %d workload(s)", preempted)
+		e.requeueReason = queue.RequeueReasonPendingPreemption
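To make the intent of the new reason concrete, here is a minimal, self-contained sketch. It is not the actual Kueue source: the clusterQueue type, its fields, and the requeue helper are simplified stand-ins. The point is that a workload requeued with RequeueReasonPendingPreemption goes back into the active heap of a BestEffortFIFO queue instead of being parked as inadmissible.

// Sketch only; simplified stand-ins for Kueue's queue internals.
package main

import "fmt"

type RequeueReason string

const (
	RequeueReasonGeneric               RequeueReason = ""
	RequeueReasonFailedAfterNomination RequeueReason = "FailedAfterNomination"
	RequeueReasonPendingPreemption     RequeueReason = "PendingPreemption"
)

type workload struct{ name string }

type clusterQueue struct {
	heap         []workload // candidates for the next scheduling cycle
	inadmissible []workload // parked until a relevant cluster event occurs
}

// requeue routes an unadmitted workload based on why it is being requeued.
func (cq *clusterQueue) requeue(w workload, reason RequeueReason) {
	switch reason {
	case RequeueReasonFailedAfterNomination, RequeueReasonPendingPreemption:
		cq.heap = append(cq.heap, w) // stays eligible for immediate retry
	default:
		cq.inadmissible = append(cq.inadmissible, w)
	}
}

func main() {
	cq := &clusterQueue{}
	cq.requeue(workload{"wl3"}, RequeueReasonPendingPreemption)
	cq.requeue(workload{"wl4"}, RequeueReasonGeneric)
	fmt.Printf("heap=%v inadmissible=%v\n", cq.heap, cq.inadmissible)
}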
why do we need this reason?
We shouldn't requeue immediately if there is preemption, because we actually need to wait for the running workloads to terminate.
We want to keep the workload that triggered the preemption at the head of the queue; otherwise we trigger the preemption for wlN and maybe admit wlN+1.
That is ok. The users chose BestEffortFIFO, so they should be ok with that. If they want stronger guarantees, they can use StrictFIFO.
I see it more as a special case of RequeueReasonFailedAfterNomination, and I'd like to keep it as is; however, we could drop it if you insist.
Preempted workloads could take an arbitrary amount of time to fully terminate.
If we put the preemptor back into the head of the queue, this means that we won't admit any other workload until preemptions finish. This is not the behavior that people want when they use BestEffortFIFO. This mode tries to keep throughput up.
The good thing is that when a workload actually terminates, the preemptor workload should get back into the head of the queue.
So I don't think we should specialize preemption when requeueing.
I did try this approach, but everything gets complicated when multiple workloads should be preempted. Say we have:
queue (of 3 cpu)
wl1 (low prio, 1 cpu, admitted)
wl2 (low prio, 1 cpu, admitted)
create wl3 (high prio, 3 cpu)
scheduler step1 (head=wl3):
- wl1 -> evicted
- wl2 -> evicted
- wl3 -> inadmissible
wl1 finishes the eviction:
- wl1 -> pending
- wl2 -> evicted
- wl3 -> pending (inadmissible is removed)
scheduler step2 (head=wl3):
- wl1 -> pending (no change)
- wl2 -> evicted
- wl3 -> inadmissible (still needs the cpu used by wl2)
scheduler step3 (head=wl1):
- wl1 -> admitted
- wl2 -> evicted
- wl3 -> inadmissible
Sure, after this wl3 will request the preemption of wl1 again, but unless wl1 and wl2 finish at the same time, we will be locked in this loop.
So we are admitting the recently evicted wl too fast.
This reminds me of how we deal with this in kube-scheduler: we would mark wl3 as "nominated", meaning that it's likely to be admitted soon. Then wl1 wouldn't be able to fit. "Nominated" is somewhat similar to "assumed", except that if the scheduler later decides to put the workload somewhere else, the previous assumption is forgotten.
Could you explore this option? It still gives the chance for other workloads to be admitted, but without taking the space reserved for the preemptor.
Now, the problem you describe can arise even before your change, so we can probably tackle it in a separate PR.
For the purpose of this PR, maybe we can remove RequeueReasonPendingPreemption and change the integration test to only have to preempt 1 workload. We will change it back to 2 once we have implemented the "nominated" logic.
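To illustrate the "nominated" idea above, here is a sketch by rough analogy to kube-scheduler's nominated pods. Every name in it (flavorState, nominate, forget) is a hypothetical stand-in, not a Kueue API: capacity reserved for a nominated preemptor counts against the quota when checking whether other workloads fit, and the reservation can be forgotten if the scheduler later places the preemptor elsewhere.

// Sketch only; hypothetical single-resource model, not Kueue's quota code.
package main

import "fmt"

type quota struct{ cpuMilli int64 }

type flavorState struct {
	capacity  quota
	used      quota            // admitted (including still-terminating) usage
	nominated map[string]quota // preemptor name -> reserved quota
}

// fits reports whether a request fits once admitted usage and outstanding
// nominations are both accounted for.
func (f *flavorState) fits(req quota) bool {
	reserved := f.used.cpuMilli
	for _, q := range f.nominated {
		reserved += q.cpuMilli
	}
	return reserved+req.cpuMilli <= f.capacity.cpuMilli
}

// nominate reserves space for a preemptor; forget drops the reservation,
// e.g. when the scheduler later places the workload somewhere else.
func (f *flavorState) nominate(name string, q quota) { f.nominated[name] = q }
func (f *flavorState) forget(name string)           { delete(f.nominated, name) }

func main() {
	f := &flavorState{
		capacity:  quota{cpuMilli: 3000},
		used:      quota{cpuMilli: 1000}, // wl2, still terminating
		nominated: map[string]quota{},
	}
	f.nominate("wl3", quota{cpuMilli: 3000}) // wl3 reserves the whole queue
	fmt.Println("wl1 (1 cpu) fits:", f.fits(quota{cpuMilli: 1000})) // false
	f.forget("wl3")                                                 // e.g. wl3 placed elsewhere
	fmt.Println("wl1 (1 cpu) fits:", f.fits(quota{cpuMilli: 1000})) // true
}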
In our case, regardless of the queuing strategy, a workload that was nominated but did not make it to admission is re-queued without going to inadmissible. If a new workload, which by other means (priority, timestamp) should become the head, is added to the queue, it will become the queue head and be checked for admission. The same happens with the ones that are waiting for preemption.
The reservation could work, but it will be tricky to implement, since it is very likely that we will want to be able to cancel the reservation of lower priority workloads...
> Now, the problem you describe can arise even before your change, so we can probably tackle it in a separate PR.
Not really; currently the preemption happens on the spot, and the preemptor gets admitted at the Kueue level, even if the scheduler may not be able to accept it due to lack of resources.
> For the purpose of this PR, maybe we can remove RequeueReasonPendingPreemption and change the integration test to only have to preempt 1 workload. We will change it back to 2 once we have implemented the "nominated" logic.
If RequeueReasonPendingPreemption is not acceptable, as a temporary solution we could just use RequeueReasonFailedAfterNomination and drop the changes in the queues. Or hold the PR (I don't like the idea of intentionally breaking/disabling part of the preemption).
> The reservation could work, but it will be tricky to implement, since it is very likely that we will want to be able to cancel the reservation of lower priority workloads...
Right, but we would be able to admit another lower priority workload that could fit in a different flavor (assuming the preemptor doesn't fit there).
But sure, this is acceptable.
force-pushed from e51b20a to fbecba5
force-pushed from fbecba5 to 8e3dc3e
Also use this reason to keep the high priority workloads that trigger preemption "admissible" in the best effort FIFO.
force-pushed from 8e3dc3e to e8b5a37
Wait for the preempted job to finish before accepting the new workload.
force-pushed from e8b5a37 to 1d686d4
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Keep evicted workloads in admitted while the associated jobs are still active
Which issue(s) this PR fixes:
Fixes #510
Special notes for your reviewer: