diff --git a/keps/prod-readiness/sig-scheduling/4247.yaml b/keps/prod-readiness/sig-scheduling/4247.yaml new file mode 100644 index 000000000000..affcd9fa242d --- /dev/null +++ b/keps/prod-readiness/sig-scheduling/4247.yaml @@ -0,0 +1,3 @@ +kep-number: 4247 +beta: + approver: "@wojtek-t" diff --git a/keps/sig-scheduling/4247-queueinghint/README.md b/keps/sig-scheduling/4247-queueinghint/README.md new file mode 100644 index 000000000000..e20529ea5616 --- /dev/null +++ b/keps/sig-scheduling/4247-queueinghint/README.md @@ -0,0 +1,1071 @@ + +# KEP-4247: Per-plugin callback functions for efficient enqueueing in the scheduling queue + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Overview](#overview) + - [When to skip/not skip backoff](#when-to-skipnot-skip-backoff) + - [Block a next scheduling retry](#block-a-next-scheduling-retry) + - [Track Pods being processed in the scheduling queue](#track-pods-being-processed-in-the-scheduling-queue) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Return QueueImmediately, QueueAfterBackoff, and QueueSkip from QueueingHintFn instead of introducing new status SuccessButReject](#return---and--from--instead-of-introducing-new-status-) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+Introduce a callback function named `QueueingHint` to `EventsToRegister` so that each plugin can control more precisely when the scheduling of Pods it rejected is retried.
+Also, give plugins the ability to skip backoff in appropriate cases, which improves the performance of scheduling Pods with some plugins (e.g., the DRA plugin among the in-tree plugins).
+
+## Motivation
+
+**Retry Pods only when the probability of getting scheduled is high**
+
+Currently, each plugin can only roughly define, via `EventsToRegister`, when to retry the scheduling of Pods it rejected.
+
+For example, NodeAffinity retries Pod scheduling when a Node is added or updated ([ref](https://github.com/kubernetes/kubernetes/blob/v1.27.6/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L86)) because the added/updated Node may have a label that matches the NodeAffinity on the Pod.
+But in reality, many Node update events happen in a cluster, and most of them cannot make a Pod rejected by NodeAffinity schedulable.
+By introducing a callback function that filters out events more precisely, the scheduler can retry the scheduling of only those Pods that are actually likely to be scheduled in the next scheduling cycle.
+
+**Skip the backoff**
+
+The DRA plugin sometimes needs to reject Pods in order to wait for an update from the device driver.
+So, by design, it naturally takes several scheduling cycles to finish scheduling a Pod.
+
+But going through backoff in such cases takes longer than actually waiting for the update from the device driver.
+https://github.com/kubernetes/kubernetes/pull/117561
+
+We want to improve the performance there by giving plugins the ability to skip backoff in appropriate cases.
+
+### Goals
+
+- Introduce `QueueingHint` to `EventsToRegister`, and make the scheduling queue requeue Pods based on the results from `QueueingHint`.
+- Improve how the scheduling queue tracks Pods being processed so that they are requeued to an appropriate queue when they are rejected and come back to the queue.
+
+### Non-Goals
+
+- Remove the backoff mechanism from the scheduling queue completely.
+- Change anything related to the `PreEnqueue` extension point.
+  - `QueueingHint` and `PreEnqueue` are both for the scheduling queue, but their responsibilities are completely different from each other.
+
+## Proposal
+
+### User Stories (Optional)
+
+#### Story 1
+
+Suppose we are developing the `NodeAffinity` plugin.
+
+When `NodeAffinity` rejects Pods, those Pods might become schedulable in the following cases:
+- when a new Node that matches the Pod's NodeAffinity is created.
+- when an existing Node's label is updated and now matches the Pod's NodeAffinity.
+
+For such events, the QueueingHint of the NodeAffinity plugin returns `Queue`; otherwise, it returns `QueueSkip`.
+
+### Notes/Constraints/Caveats (Optional)
+
+### Risks and Mitigations
+
+If a plugin's QueueingHint misses some events that can make Pods schedulable,
+Pods rejected by that plugin may be stuck in the unschedulable Pod pool.
+
+As a mitigation, the scheduling queue flushes the Pods in the unschedulable Pod pool periodically, and the flushing interval is configurable (5m by default).
+
+Although this periodic flushing is on the way to being removed, as described in the following issue,
+we will postpone its removal until all QueueingHints are implemented and we see no bug reports for a while.
+https://github.com/kubernetes/kubernetes/issues/87850
+
+## Design Details
+
+### Overview
+
+The return type of `EventsToRegister` is changed to `[]ClusterEventWithHint`:
+
+```go
+// EnqueueExtensions is an optional interface that plugins can implement to efficiently
+// move unschedulable Pods in internal scheduling queues. Plugins
+// that fail pod scheduling (e.g., Filter plugins) are expected to implement this interface.
+type EnqueueExtensions interface {
+	Plugin
+	// EventsToRegister returns a series of possible events that may make a Pod
+	// failed by this plugin schedulable. Each event has a callback function that
+	// filters out events to reduce useless retries of Pod scheduling.
+	// The events will be registered when instantiating the internal scheduling queue,
+	// and leveraged to build event handlers dynamically.
+	// Note: the returned list needs to be static (not depend on configuration parameters);
+	// otherwise it would lead to undefined behavior.
+	EventsToRegister() []ClusterEventWithHint
+}
+```
+
+Each `ClusterEventWithHint` has a `ClusterEvent` and a `QueueingHintFn`, which is executed when the event happens.
+Let's say the scheduling queue has a Pod that was rejected by pluginA and pluginB,
+and both pluginA and pluginB are interested only in NodeAdded events.
+
+The scheduling queue watches resources and keeps receiving events such as `NodeAdded`.
+For every event, the `QueueingHintFn`s of both plugins are executed,
+and if either of them returns `Queue`, the Pod is moved to activeQ/backoffQ.
+
+```go
+type ClusterEventWithHint struct {
+	Event ClusterEvent
+	// QueueingHintFn is executed for Pods rejected by this plugin when the above Event happens,
+	// and filters out events to reduce useless retries of Pod scheduling.
+	// It's an optional field. If not set,
+	// the scheduling of Pods will always be retried when this Event happens
+	// (the same as Queue).
+	QueueingHintFn QueueingHintFn
+}
+
+// QueueingHintFn returns a hint that signals whether the event can make a Pod,
+// which was rejected by this plugin in a past scheduling cycle, schedulable or not.
+// It's called before a Pod gets moved from unschedulableQ to backoffQ or activeQ.
+// If it returns an error, the caller takes the returned QueueingHint as `Queue`,
+// whatever is returned here, so that we can prevent the Pod from being stuck in
+// the unschedulable Pod pool.
+//
+// - `pod`: the Pod to be enqueued, which was rejected by this plugin in the past.
+// - `oldObj` `newObj`: the object involved in that event.
+//   - For example, if the given event is "Node deleted", `oldObj` will be that deleted Node.
+//   - `oldObj` is nil if the event is an add event.
+//   - `newObj` is nil if the event is a delete event.
+type QueueingHintFn func(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (QueueingHint, error)
+
+type QueueingHint int
+
+const (
+	// QueueSkip implies that the cluster event has no impact on
+	// scheduling of the pod.
+	QueueSkip QueueingHint = iota
+
+	// Queue implies that the Pod may be schedulable by the event.
+	Queue
+)
+```
+
+That's the basic idea of QueueingHint.
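+
+To make this concrete, below is a rough sketch (not the actual in-tree implementation) of how a plugin like NodeAffinity from Story 1 could implement `EventsToRegister` with a `QueueingHintFn`. The hint function name `isSchedulableAfterNodeChange` and its exact logic are illustrative assumptions:
+
+```go
+import (
+	"fmt"
+
+	v1 "k8s.io/api/core/v1"
+	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
+	"k8s.io/klog/v2"
+	"k8s.io/kubernetes/pkg/scheduler/framework"
+)
+
+// EventsToRegister registers Node add/update events together with a QueueingHintFn,
+// so that only Node changes that can match the Pod's NodeAffinity requeue the Pod.
+func (pl *NodeAffinity) EventsToRegister() []framework.ClusterEventWithHint {
+	return []framework.ClusterEventWithHint{
+		{
+			Event:          framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.Update},
+			QueueingHintFn: pl.isSchedulableAfterNodeChange,
+		},
+	}
+}
+
+// isSchedulableAfterNodeChange returns Queue when the added/updated Node matches
+// the Pod's required NodeAffinity, and QueueSkip otherwise, filtering out retries
+// that cannot succeed. A refinement could also inspect oldObj to skip updates
+// where the Node already matched before.
+func (pl *NodeAffinity) isSchedulableAfterNodeChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
+	node, ok := newObj.(*v1.Node)
+	if !ok {
+		// On an error, the caller treats the hint as Queue (see the comment on QueueingHintFn).
+		return framework.Queue, fmt.Errorf("unexpected object type %T", newObj)
+	}
+	matches, err := nodeaffinity.GetRequiredNodeAffinity(pod).Match(node)
+	if err != nil {
+		// Err on the side of retrying.
+		return framework.Queue, err
+	}
+	if matches {
+		return framework.Queue, nil
+	}
+	return framework.QueueSkip, nil
+}
+```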
+
+### When to skip/not skip backoff
+
+The backoffQ is a lightweight way of keeping throughput high, by preventing Pods that are "permanently unschedulable" from blocking the queue.
+
+The more scheduling cycles a Pod has wasted, the longer the Pod needs to wait in backoff.
+**We can regard the backoff as a penalty for wasting a scheduling cycle.**
+
+So when, for example, NodeAffinity rejects a Pod and later returns `Queue` in its `QueueingHintFn`,
+the Pod's scheduling is retried only after going through backoff,
+because the past scheduling cycle was wasted by that Pod.
+
+But some plugins need to go through failing scheduling cycles by design.
+The [DRA](https://github.com/kubernetes/kubernetes/tree/v1.27.6/pkg/scheduler/framework/plugins/dynamicresources) plugin is one example among the in-tree plugins - at the Reserve extension point, it tells the resource driver the scheduling result, and rejects the Pod once to wait for the response from the resource driver.
+For this kind of rejection, we cannot say the scheduling cycle is wasted: even though that particular scheduling cycle fails, its result is used to move the Pod's scheduling forward.
+So Pods rejected for such reasons don't need to suffer the penalty (backoff).
+
+In order to support such cases, we introduce a new status `SuccessButReject`.
+When the `DRA` plugin rejects a Pod with `SuccessButReject` and later returns `Queue` in its `QueueingHintFn`,
+the Pod skips the backoff and its scheduling is retried, as sketched below.
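+
+As a rough sketch under this proposal (the exact shape of `SuccessButReject` is part of this KEP, not existing code, and `publishSchedulingDecision` is a hypothetical helper standing in for the communication with the resource driver), the DRA plugin could use the new status like this:
+
+```go
+import (
+	"context"
+
+	v1 "k8s.io/api/core/v1"
+	"k8s.io/kubernetes/pkg/scheduler/framework"
+)
+
+// Reserve tells the resource driver which Node was selected, then rejects the
+// Pod once to wait for the driver's response in a later scheduling cycle.
+func (pl *dynamicResources) Reserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
+	// Tell the resource driver the scheduling result.
+	if err := pl.publishSchedulingDecision(ctx, pod, nodeName); err != nil {
+		return framework.AsStatus(err)
+	}
+	// Reject the Pod with the proposed SuccessButReject status: this scheduling
+	// cycle isn't wasted because its result moves the Pod's scheduling forward,
+	// so the Pod can skip backoff when the plugin's QueueingHintFn later
+	// returns Queue (e.g., on the driver's response).
+	return framework.NewStatus(framework.SuccessButReject, "waiting for the resource driver's response")
+}
+```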
+
+### Block a next scheduling retry
+
+For example, when a PVC for the Pod isn't found, the Pod cannot be scheduled, and the `VolumeBinding` plugin returns `UnschedulableAndUnresolvable` in this case.
+The point here is that this Pod will never be schedulable until an appropriate PVC is created for it.
+
+For such cases, we introduce a new supplemental status `Blocked`, which can be used like this (a sketch of the proposed API; the exact shape may differ):
+
+```go
+func (pl *VolumeBinding) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
+	if hasPVC, err := pl.podHasPVCs(pod); err != nil {
+		if apierrors.IsNotFound(err) {
+			// A PVC for this Pod isn't found.
+			// This rejection must be resolved before retrying this Pod's scheduling.
+			// Otherwise, the retry would just result in the same rejection from this plugin.
+			return nil, framework.NewStatus(framework.UnschedulableAndUnresolvable|framework.Blocked, err.Error())
+		}
+		//...
+	}
+	//...
+}
+```
+
+### Track Pods being processed in the scheduling queue
+
+By introducing QueueingHint, we can retry scheduling only when particular events happen.
+But what if such events happen during a Pod's scheduling?
+
+The scheduler takes a snapshot of the cluster and schedules Pods based on that snapshot. The snapshot is updated every time a scheduling cycle is started; in other words, the same snapshot is used within the same scheduling cycle.
+
+Consider a problematic scenario: a Pod is being scheduled, and it's going to be rejected by NodeAffinity because no Node matches the Pod's NodeAffinity. But during the scheduling, a new Node that matches the Pod's NodeAffinity is created.
+
+As mentioned, that new Node isn't included in the candidates during this scheduling cycle, so this Pod is rejected by NodeAffinity anyway.
+The problem here is that, if the scheduling queue put this Pod into the unschedulable Pod pool, the Pod would need to wait for yet another event, although there is already a Node matching the Pod's NodeAffinity.
+
+In order to prevent such Pods from missing events that happen during their scheduling, the scheduling queue remembers the events that happened during a Pod's scheduling and decides which queue to put the Pod back into based on those events and `QueueingHint`.
+
+### Test Plan
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+- `k8s.io/kubernetes/pkg/scheduler/internal/queue`: `10-01 20:28 JST` - `88.4%`
+
+##### Integration tests
+
+- [`k8s.io/kubernetes/test/integration/scheduler/rescheduling_test.go`](https://github.com/kubernetes/kubernetes/blob/v1.28.0/test/integration/scheduler/rescheduling_test.go#L117):
+  - https://storage.googleapis.com/k8s-triage/index.html?test=TestReScheduling
+
+##### e2e tests
+
+n/a
+
+--
+
+This feature doesn't introduce any new API endpoints and doesn't interact with other components.
+So, e2e tests wouldn't add extra value on top of the integration tests.
+
+### Graduation Criteria
+
+It was suggested that we create a KEP for QueueingHint after it had been implemented.
+Although this is a somewhat special case, we can regard the DRA KEP as the parent KEP from which this KEP stems.
+So the alpha version is set to v1.26, the same as the DRA KEP,
+and the beta version to v1.28, in which we actually implemented the feature and enabled it via the beta feature gate (enabled by default).
+
+Slack discussion: https://kubernetes.slack.com/archives/C5P3FE08M/p1695639140018139?thread_ts=1694167948.846139&cid=C5P3FE08M
+
+#### Alpha
+
+n/a
+
+#### Beta
+
+- The scheduling queue is changed to work with QueueingHint.
+- The feature gate is implemented. (enabled by default)
+- QueueingHint is implemented in a few plugins.
+
+#### GA
+
+- QueueingHint is implemented in all plugins.
+- No bug reports for a while.
+
+### Upgrade / Downgrade Strategy
+
+**Upgrade**
+
+Nothing needs to be done to opt in to this feature. (The feature gate is enabled by default.)
+
+**Downgrade**
+
+Users need to disable the feature gate, as shown below.
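+
+For example, assuming kube-scheduler is started directly with command-line flags (how feature gates are set depends on how the control plane is deployed), the gate can be turned off like this:
+
+```sh
+kube-scheduler --feature-gates=SchedulerQueueingHints=false ...
+```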
+
+### Version Skew Strategy
+
+n/a
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: `SchedulerQueueingHints`
+  - Components depending on the feature gate: kube-scheduler
+
+###### Does enabling the feature change any default behavior?
+
+No.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes.
+The feature can be disabled in both the Alpha and Beta versions
+by restarting kube-scheduler with the feature gate off.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+The scheduling queue again starts to work with `QueueingHint`.
+
+###### Are there any tests for feature enablement/disablement?
+
+Unit tests are added for the scheduling queue so that we can make sure it works fine with the feature gate both enabled and disabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+n/a, because the scheduler is the only component involved in rolling out this feature.
+
+###### What specific metrics should inform a rollback?
+
+If the `scheduler_pending_pods` metric with the `queue: unschedulable` label grows and stays high,
+something may be going wrong with a QueueingHint and Pods may be stuck in the queue.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+No. This feature is internal to the scheduler and has no impact outside of it.
+So a simple upgrade and the upgrade->downgrade->upgrade path behave the same.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No.
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+This feature is used during the scheduling of all Pods if the feature gate is enabled.
+
+###### How can someone using this feature know that it is working for their instance?
+
+n/a
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+n/a
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [x] Metrics
+  - Metric name: `scheduler_pending_pods` with the `queue: unschedulable` label
+  - Components exposing the metric: kube-scheduler
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+No.
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+No.
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+No.
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No.
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No.
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+No.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+No.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+Yes.
+The memory usage of kube-scheduler is expected to increase because the scheduling queue needs to keep the events that happened during scheduling. Thus, the busier the cluster is, the more memory is likely to be required.
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No.
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+n/a
+
+###### What are other known failure modes?
+
+n/a
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+n/a
+
+## Implementation History
+
+- Jun 26, 2023: QueueingHint is implemented and the `EnqueueExtensions` interface is changed.
+- Jul 15, 2023: The feature gate is implemented. (enabled by default)
+- Jul 18, 2023: The scheduling queue starts tracking Pods being processed so that they can be put back to an appropriate queue.
+- Oct 01, 2023: The initial KEP is submitted.
+
+## Drawbacks
+
+## Alternatives
+
+### Return `QueueImmediately`, `QueueAfterBackoff`, and `QueueSkip` from `QueueingHintFn` instead of introducing new status `SuccessButReject`
+
+Instead of requeueing Pods based on why they were rejected, we could achieve the same thing by introducing separate `QueueingHint`s for queueing - `QueueImmediately` and `QueueAfterBackoff`.
+
+But, as explained in [When to skip/not skip backoff](#when-to-skipnot-skip-backoff), the backoff is a penalty for wasting a scheduling cycle, and a few scenarios (e.g., DRA) don't waste the scheduling cycle even though they reject Pods in it.
+
+So whether to skip backoff or not is closely tied to why the Pod was rejected,
+and thus it is easier to decide when the Pod is rejected than when the Pod is actually requeued.
+
+## Infrastructure Needed (Optional)
diff --git a/keps/sig-scheduling/4247-queueinghint/kep.yaml b/keps/sig-scheduling/4247-queueinghint/kep.yaml
new file mode 100644
index 000000000000..fe2ec04a9b5b
--- /dev/null
+++ b/keps/sig-scheduling/4247-queueinghint/kep.yaml
@@ -0,0 +1,31 @@
+title: Per-plugin callback functions for efficient enqueueing in the scheduling queue
+kep-number: 4247
+authors:
+  - "@sanposhiho"
+owning-sig: sig-scheduling
+participating-sigs:
+  - sig-scheduling
+status: provisional
+creation-date: 2023-09-30
+reviewers:
+  - "@alculquicondor"
+approvers:
+  - "@alculquicondor"
+
+see-also:
+  - "/keps/sig-node/3063-dynamic-resource-allocation"
+
+stage: beta
+
+latest-milestone: "v1.29"
+
+milestone:
+  alpha: "v1.26" # This KEP stems from /keps/sig-node/3063-dynamic-resource-allocation.
+  beta: "v1.28"
+  stable: "v1.30"
+
+feature-gates:
+  - name: SchedulerQueueingHints
+    components:
+      - kube-scheduler
+disable-supported: true