
Keep evicted workloads in admitted while the associated jobs are still active #692

Conversation

trasc
Contributor

@trasc trasc commented Apr 12, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Keep evicted workloads in admitted while the associated jobs are still active

Which issue(s) this PR fixes:

Fixes #510

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 12, 2023
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 12, 2023
@k8s-ci-robot
Contributor

Hi @trasc. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@netlify

netlify bot commented Apr 12, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Latest commit: 1d686d4
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/645389ff6d54be0009e89b90

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 12, 2023
@alculquicondor
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Apr 12, 2023
@alculquicondor
Contributor

/milestone v0.4

@k8s-ci-robot k8s-ci-robot added this to the v0.4 milestone Apr 12, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 12, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from a5bcbb5 to 01e5c35 Compare April 13, 2023 08:20
@trasc
Contributor Author

trasc commented Apr 13, 2023

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 20, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 01e5c35 to 8bae1a5 Compare April 20, 2023 06:25
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 20, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 8bae1a5 to bcf19dd Compare April 21, 2023 10:18
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 21, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from bcf19dd to 5f84965 Compare April 24, 2023 06:06
@trasc
Contributor Author

trasc commented Apr 24, 2023

/retest

@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 5f84965 to ca6f725 Compare April 24, 2023 13:10
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from ca6f725 to 4595be9 Compare April 26, 2023 08:08
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 4595be9 to e6488d7 Compare April 27, 2023 15:27
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 27, 2023
@trasc trasc marked this pull request as ready for review April 27, 2023 15:27
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 27, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from e6488d7 to e51b20a Compare April 27, 2023 15:33
pkg/controller/jobframework/reconciler.go (3 review threads, outdated, resolved)
@@ -169,7 +169,8 @@ func (s *Scheduler) schedule(ctx context.Context) {
log.Error(err, "Failed to preempt workloads")
}
if preempted != 0 {
e.inadmissibleMsg += fmt.Sprintf(". Preempted %d workload(s)", preempted)
e.inadmissibleMsg += fmt.Sprintf(". Pending the preemption of %d workload(s)", preempted)
e.requeueReason = queue.RequeueReasonPendingPreemption
Contributor

why do we need this reason?

We shouldn't requeue immediately if there is preemption, because actually we need to wait for the running workloads to terminate.

Contributor Author

We want to keep the workload that triggered the preemption at the head of the queue; otherwise we trigger the preemption for wlN and maybe admit wlN+1.

Contributor

That is ok. The users chose BestEffortFIFO, so they should be ok with that. If they want stronger guarantees, they can use StrictFIFO.

Contributor Author

I see it more as a special case of RequeueReasonFailedAfterNomination and I'd like to keep it as is; however, we could drop it if you insist.

Contributor

preempted workloads could take an arbitrary amount of time to fully terminate.

If we put the preemptor back into the head of the queue, this means that we won't admit any other workload until preemptions finish. This is not the behavior that people want when they use BestEffortFIFO. This mode tries to keep throughput up.

The good thing is that when a workload actually terminates, the preemptor workload should get back into the head of the queue.

So I don't think we should specialize preemption when requeueing.

Contributor Author

I did try this approach, but everything gets complicated when multiple workloads should be preempted. Say we have:

queue (of 3 cpu)
wl1(low prio, 1 cpu, admitted)
wl2(low prio, 1 cpu, admitted)

create wl3(high prio, 3cpu) 

scheduler step1 (head=wl3):
- wl1 -> evicted
- wl2 -> evicted
- wl3 -> inadmissible

wl1 finishes the eviction:
- wl1 -> pending
- wl2 -> evicted
- wl3 -> pending (inadmissible is removed)

scheduler step2 (head=wl3):
- wl1  -> pending (no change)
- wl2 -> evicted
- wl3 -> inadmissible (still needs the cpu used by wl2)

scheduler step3 (head=wl1):
- wl1  -> admitted
- wl2 -> evicted
- wl3 -> inadmissible 

Sure, after this wl3 will request the preemption of wl1 again, but unless wl1 and wl2 finish at the same time, we will be locked in this loop.
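The loop above can be reproduced with a toy model: a queue with 3 CPUs, two admitted low-priority workloads, and a pending high-priority workload that needs the full capacity. All type names and the state machine below are simplified stand-ins for illustration, not Kueue's real implementation; the point is only that an evicted workload whose pods finish first gets re-admitted into the capacity the preemptor was waiting for.

```go
package main

import "fmt"

type state string

const (
	pending     state = "pending"
	admitted    state = "admitted"
	terminating state = "terminating" // evicted, but pods still running
)

type workload struct {
	name  string
	cpu   int
	state state
}

// usedCPU counts quota held by admitted workloads plus CPU still
// consumed by pods of evicted-but-not-yet-terminated workloads.
func usedCPU(wls []*workload) int {
	used := 0
	for _, w := range wls {
		if w.state == admitted || w.state == terminating {
			used += w.cpu
		}
	}
	return used
}

func main() {
	capacity := 3
	wl1 := &workload{"wl1", 1, admitted}
	wl2 := &workload{"wl2", 1, admitted}
	wl3 := &workload{"wl3", 3, pending}
	wls := []*workload{wl1, wl2, wl3}

	// Step 1: wl3 is at the head and does not fit, so wl1 and wl2 are
	// evicted; their pods take time to terminate.
	wl1.state, wl2.state = terminating, terminating

	// wl1's pods finish first; wl1 goes back to pending.
	wl1.state = pending

	// Step 2: wl3 still does not fit, because wl2's pods are running.
	fmt.Println("wl3 fits:", capacity-usedCPU(wls) >= wl3.cpu) // false

	// Step 3: wl1 fits in the freed capacity and is re-admitted,
	// taking back the CPU that wl3 was waiting for: the loop.
	if capacity-usedCPU(wls) >= wl1.cpu {
		wl1.state = admitted
	}
	fmt.Println("wl1 re-admitted:", wl1.state == admitted)     // true
	fmt.Println("wl3 fits:", capacity-usedCPU(wls) >= wl3.cpu) // false
}
```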

Contributor

so we are admitting the recently evicted wl too fast.

This reminds me how we deal with this in kube-scheduler: we would mark wl3 as "nominated", meaning that it's likely to be admitted soon. Then, wl1 wouldn't be able to fit. "nominated" is somewhat similar to "assumed", but with the exception that if the scheduler later takes the decision to put the workload somewhere else, the previous assumption is forgotten.

Could you explore this option? It still gives the chance for other workloads to be admitted, but without taking the space reserved for the preemptor.

Now, the problem you describe can arise even before your change, so we can probably tackle it in a separate PR.

For the purpose of this PR, maybe we can remove RequeueReasonPendingPreemption and change the integration test to only have to preempt 1 workload. We will change it back to 2 once we have implemented the "nominated" logic.

Contributor Author

In our case, regardless of the queuing strategy, a workload that was nominated but did not make it to admission is re-queued without going to inadmissible. If a new workload that by other means (priority, timestamp) should become the head is added to the queue, it will become the queue head and be checked for admission. The same happens with the ones that are waiting for preemption.

The reservation could work, but it will be tricky to implement, since it is very likely that we will want to be able to cancel the reservation for lower priority workloads.

> Now, the problem you describe can arise even before your change, so we can probably tackle it in a separate PR.

Not really; currently the preemption happens on the spot, and the preemptor gets admitted at the Kueue level even if the scheduler may not be able to accept it due to lack of resources.

> For the purpose of this PR, maybe we can remove RequeueReasonPendingPreemption and change the integration test to only have to preempt 1 workload. We will change it back to 2 once we have implemented the "nominated" logic.

If RequeueReasonPendingPreemption is not acceptable, as a temporary solution we could just use RequeueReasonFailedAfterNomination and drop the changes in the queues. Or hold the PR (I don't like the idea of intentionally breaking/disabling part of the preemption).

Contributor

> The reservation could work, but it will be tricky to implement, since it is very likely that we will want to be able to cancel the reservation for lower priority workloads.

Right, but we would be able to admit another lower priority workload that could fit in a different flavor (assuming the preemptor doesn't fit there).
But sure, this is acceptable.
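For illustration, the one-line scheduler change discussed in this thread can be reduced to a self-contained sketch. The `entry` struct and the `annotatePendingPreemption` helper are hypothetical stand-ins for the scheduler's internals; only the constant name `RequeueReasonPendingPreemption` and the message format mirror the PR's diff.

```go
package main

import "fmt"

// RequeueReason is a stand-in for Kueue's queue.RequeueReason type.
type RequeueReason string

// RequeueReasonPendingPreemption mirrors the reason introduced in this
// PR; the other reason values are omitted here.
const RequeueReasonPendingPreemption RequeueReason = "PendingPreemption"

// entry is a hypothetical, simplified scheduling entry with just the
// two fields touched by the diff.
type entry struct {
	inadmissibleMsg string
	requeueReason   RequeueReason
}

// annotatePendingPreemption applies the logic from the diff: when the
// scheduler issued preemptions, append why admission is still pending
// and record the requeue reason so the preemptor is kept at the head
// of the queue instead of going to inadmissible.
func annotatePendingPreemption(e *entry, preempted int) {
	if preempted != 0 {
		e.inadmissibleMsg += fmt.Sprintf(". Pending the preemption of %d workload(s)", preempted)
		e.requeueReason = RequeueReasonPendingPreemption
	}
}

func main() {
	e := &entry{inadmissibleMsg: "couldn't assign flavors"}
	annotatePendingPreemption(e, 2)
	fmt.Println(e.inadmissibleMsg)
	// couldn't assign flavors. Pending the preemption of 2 workload(s)
	fmt.Println(e.requeueReason)
	// PendingPreemption
}
```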

@trasc trasc changed the title Keep preempted workload in admitted while it's job still active Keep preempted workloads in admitted while the associated jobs are still active Apr 27, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from e51b20a to fbecba5 Compare April 28, 2023 06:38
@trasc trasc changed the title Keep preempted workloads in admitted while the associated jobs are still active Keep evicted workloads in admitted while the associated jobs are still active Apr 28, 2023
pkg/scheduler/preemption/preemption.go (2 review threads, resolved)
pkg/controller/jobframework/reconciler.go (review thread, outdated, resolved)
test/util/util.go (review thread, outdated, resolved)
test/integration/controller/job/job_controller_test.go (review thread, outdated, resolved)
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from fbecba5 to 8e3dc3e Compare May 3, 2023 07:05
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 3, 2023
Use also this reason to keep the high priority workloads that trigger
preemption "admissible" in the best effort fifo.
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from 8e3dc3e to e8b5a37 Compare May 3, 2023 19:06
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 3, 2023
@trasc trasc force-pushed the keep_preempted_in_admitted_while_active branch from e8b5a37 to 1d686d4 Compare May 4, 2023 10:33
@alculquicondor
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit 3b2f370 into kubernetes-sigs:main May 4, 2023
@trasc trasc deleted the keep_preempted_in_admitted_while_active branch May 8, 2023 19:05
Labels
approved, cncf-cla: yes, kind/feature, lgtm, ok-to-test, size/L

Successfully merging this pull request may close these issues:
Account for terminating pods when doing preemption

4 participants