
Keep evicted workloads in admitted while the associated jobs are still active #692

Conversation

trasc
Contributor

@trasc trasc commented Apr 12, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Keep evicted workloads in admitted while the associated jobs are still active

Which issue(s) this PR fixes:

Fixes #510

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 12, 2023
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 12, 2023
@k8s-ci-robot
Contributor

Hi @trasc. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@netlify

netlify bot commented Apr 12, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Latest commit: 1d686d4
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/645389ff6d54be0009e89b90

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 12, 2023
@alculquicondor
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Apr 12, 2023
@alculquicondor
Contributor

/milestone v0.4

@k8s-ci-robot k8s-ci-robot added this to the v0.4 milestone Apr 12, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 12, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from a5bcbb5 to 01e5c35 Compare April 13, 2023 08:20
@trasc
Contributor Author

trasc commented Apr 13, 2023

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 20, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 01e5c35 to 8bae1a5 Compare April 20, 2023 06:25
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 20, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 8bae1a5 to bcf19dd Compare April 21, 2023 10:18
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 21, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from bcf19dd to 5f84965 Compare April 24, 2023 06:06
@trasc
Contributor Author

trasc commented Apr 24, 2023

/retest

@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 5f84965 to ca6f725 Compare April 24, 2023 13:10
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from ca6f725 to 4595be9 Compare April 26, 2023 08:08
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 4595be9 to e6488d7 Compare April 27, 2023 15:27
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 27, 2023
@trasc trasc marked this pull request as ready for review April 27, 2023 15:27
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 27, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from e6488d7 to e51b20a Compare April 27, 2023 15:33
pkg/controller/jobframework/reconciler.go (3 review threads, outdated, resolved)
@@ -169,7 +169,8 @@ func (s *Scheduler) schedule(ctx context.Context) {
log.Error(err, "Failed to preempt workloads")
}
if preempted != 0 {
e.inadmissibleMsg += fmt.Sprintf(". Preempted %d workload(s)", preempted)
e.inadmissibleMsg += fmt.Sprintf(". Pending the preemption of %d workload(s)", preempted)
e.requeueReason = queue.RequeueReasonPendingPreemption
Contributor

why do we need this reason?

We shouldn't requeue immediately if there is preemption, because actually we need to wait for the running workloads to terminate.

Contributor Author

We want to keep the workload that triggered the preemption at the head of the queue; otherwise we trigger the preemption for wlN and maybe admit wlN+1.

Contributor

That is ok. The users chose BestEffortFIFO, so they should be ok with that. If they want stronger guarantees, they can use StrictFIFO.

Contributor Author

I see it more as a special case of RequeueReasonFailedAfterNomination and I'd like to keep it as is; however, we could drop it if you insist.

Contributor

preempted workloads could take an arbitrary amount of time to fully terminate.

If we put the preemptor back into the head of the queue, this means that we won't admit any other workload until preemptions finish. This is not the behavior that people want when they use BestEffortFIFO. This mode tries to keep throughput up.

The good thing is that when a workload actually terminates, the preemptor workload should get back into the head of the queue.

So I don't think we should specialize preemption when requeueing.

Contributor Author

I did try this approach, but everything gets complicated when multiple workloads should be preempted. Say we have:

queue (of 3 cpu)
wl1(low prio, 1 cpu, admitted)
wl2(low prio, 1 cpu, admitted)

create wl3(high prio, 3cpu) 

scheduler step1 (head=wl3):
- wl1 -> evicted
- wl2 -> evicted
- wl3 -> inadmissible

wl1 finishes the eviction:
- wl1 -> pending
- wl2 -> evicted
- wl3 -> pending (inadmissible is removed)

scheduler step2 (head=wl3):
- wl1  -> pending (no change)
- wl2 -> evicted
- wl3 -> inadmissible (still needs the cpu used by wl2)

scheduler step3 (head=wl1):
- wl1  -> admitted
- wl2 -> evicted
- wl3 -> inadmissible 

Sure, after this wl3 will request the preemption of wl1 again, but unless wl1 and wl2 finish at the same time, we will be locked in this loop.
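The loop above can be reproduced with a toy model: a queue with 3 CPUs, two admitted low-priority workloads, and a pending high-priority workload that needs the full capacity. All type names and the state machine below are simplified stand-ins for illustration, not Kueue's real implementation; the point is only that an evicted workload whose pods finish first gets re-admitted into the capacity the preemptor was waiting for.

```go
package main

import "fmt"

type state string

const (
	pending     state = "pending"
	admitted    state = "admitted"
	terminating state = "terminating" // evicted, but pods still running
)

type workload struct {
	name  string
	cpu   int
	state state
}

// usedCPU counts quota held by admitted workloads plus CPU still
// consumed by pods of evicted-but-not-yet-terminated workloads.
func usedCPU(wls []*workload) int {
	used := 0
	for _, w := range wls {
		if w.state == admitted || w.state == terminating {
			used += w.cpu
		}
	}
	return used
}

func main() {
	capacity := 3
	wl1 := &workload{"wl1", 1, admitted}
	wl2 := &workload{"wl2", 1, admitted}
	wl3 := &workload{"wl3", 3, pending}
	wls := []*workload{wl1, wl2, wl3}

	// Step 1: wl3 is at the head and does not fit, so wl1 and wl2 are
	// evicted; their pods take time to terminate.
	wl1.state, wl2.state = terminating, terminating

	// wl1's pods finish first; wl1 goes back to pending.
	wl1.state = pending

	// Step 2: wl3 still does not fit, because wl2's pods are running.
	fmt.Println("wl3 fits:", capacity-usedCPU(wls) >= wl3.cpu) // false

	// Step 3: wl1 fits in the freed capacity and is re-admitted,
	// taking back the CPU that wl3 was waiting for: the loop.
	if capacity-usedCPU(wls) >= wl1.cpu {
		wl1.state = admitted
	}
	fmt.Println("wl1 re-admitted:", wl1.state == admitted)     // true
	fmt.Println("wl3 fits:", capacity-usedCPU(wls) >= wl3.cpu) // false
}
```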

Contributor

so we are admitting the recently evicted wl too fast.

This reminds me how we deal with this in kube-scheduler: we would mark wl3 as "nominated", meaning that it's likely to be admitted soon. Then, wl1 wouldn't be able to fit. "nominated" is somewhat similar to "assumed", but with the exception that if the scheduler later takes the decision to put the workload somewhere else, the previous assumption is forgotten.

Could you explore this option? It still gives the chance for other workloads to be admitted, but without taking the space reserved for the preemptor.

Now, the problem you describe can arise even before your change, so we can probably tackle it in a separate PR.

For the purpose of this PR, maybe we can remove RequeueReasonPendingPreemption and change the integration test to only have to preempt 1 workload. We will change it back to 2 once we have implemented the "nominated" logic.

Contributor Author

In our case, regardless of the queuing strategy, a workload that was nominated but did not make it to admission is re-queued without going to inadmissible. If a new workload that by other means (priority, timestamp) should become the head is added to the queue, it will become the queue head and be checked for admission. The same happens with the ones that are waiting for preemption.

The reservation could work, but it will be tricky to implement, since it is very likely that we will want to be able to cancel the reservation for lower priority workloads.

> Now, the problem you describe can arise even before your change, so we can probably tackle it in a separate PR.

Not really; currently the preemption happens on the spot, and the preemptor gets admitted at the Kueue level even if the scheduler may not be able to accept it due to lack of resources.

> For the purpose of this PR, maybe we can remove RequeueReasonPendingPreemption and change the integration test to only have to preempt 1 workload. We will change it back to 2 once we have implemented the "nominated" logic.

If RequeueReasonPendingPreemption is not acceptable, as a temporary solution we could just use RequeueReasonFailedAfterNomination and drop the changes in the queues. Or hold the PR (I don't like the idea of intentionally breaking/disabling part of the preemption).

Contributor

> The reservation could work, but it will be tricky to implement, since it is very likely that we will want to be able to cancel the reservation for lower priority workloads.

Right, but we would be able to admit another lower priority workload that could fit in a different flavor (assuming the preemptor doesn't fit there).
But sure, this is acceptable.
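For illustration, the one-line scheduler change discussed in this thread can be reduced to a self-contained sketch. The `entry` struct and the `annotatePendingPreemption` helper are hypothetical stand-ins for the scheduler's internals; only the constant name `RequeueReasonPendingPreemption` and the message format mirror the PR's diff.

```go
package main

import "fmt"

// RequeueReason is a stand-in for Kueue's queue.RequeueReason type.
type RequeueReason string

// RequeueReasonPendingPreemption mirrors the reason introduced in this
// PR; the other reason values are omitted here.
const RequeueReasonPendingPreemption RequeueReason = "PendingPreemption"

// entry is a hypothetical, simplified scheduling entry with just the
// two fields touched by the diff.
type entry struct {
	inadmissibleMsg string
	requeueReason   RequeueReason
}

// annotatePendingPreemption applies the logic from the diff: when the
// scheduler issued preemptions, append why admission is still pending
// and record the requeue reason so the preemptor is kept at the head
// of the queue instead of going to inadmissible.
func annotatePendingPreemption(e *entry, preempted int) {
	if preempted != 0 {
		e.inadmissibleMsg += fmt.Sprintf(". Pending the preemption of %d workload(s)", preempted)
		e.requeueReason = RequeueReasonPendingPreemption
	}
}

func main() {
	e := &entry{inadmissibleMsg: "couldn't assign flavors"}
	annotatePendingPreemption(e, 2)
	fmt.Println(e.inadmissibleMsg)
	// couldn't assign flavors. Pending the preemption of 2 workload(s)
	fmt.Println(e.requeueReason)
	// PendingPreemption
}
```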

@trasc trasc changed the title Keep preempted workload in admitted while it's job still active Keep preempted workloads in admitted while the associated jobs are still active Apr 27, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from e51b20a to fbecba5 Compare April 28, 2023 06:38
@trasc trasc changed the title Keep preempted workloads in admitted while the associated jobs are still active Keep evicted workloads in admitted while the associated jobs are still active Apr 28, 2023
pkg/scheduler/preemption/preemption.go (2 review threads, resolved)
pkg/controller/jobframework/reconciler.go (review thread, outdated, resolved)
test/util/util.go (review thread, outdated, resolved)
test/integration/controller/job/job_controller_test.go (review thread, outdated, resolved)
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from fbecba5 to 8e3dc3e Compare May 3, 2023 07:05
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 3, 2023
Use also this reason to keep the high priority workloads that trigger
preemption "admissible" in the best effort fifo.
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 8e3dc3e to e8b5a37 Compare May 3, 2023 19:06
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 3, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from e8b5a37 to 1d686d4 Compare May 4, 2023 10:33
@alculquicondor
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit 3b2f370 into kubernetes-sigs:main May 4, 2023
@trasc trasc deleted the keep_preempted_in_admitted_while_active branch May 8, 2023 19:05
Labels
approved, cncf-cla: yes, kind/feature, lgtm, ok-to-test, size/L

Successfully merging this pull request may close these issues:
Account for terminating pods when doing preemption

4 participants