add(KEP-4247): Per-plugin callback functions for efficient enqueueing in the scheduling queue #4256
Conversation
I started reviewing. Will continue tomorrow.
Introduce callback function named `QueueingHint` to `EventsToRegister` so that each plugin can control when to retry Pods scheduling more finely.
Try to make the summary more like a release note. Something end-users would understand.
This sentence is full of implementation details.
Done.
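For context on the mechanism summarized above, here is a minimal, self-contained sketch of the idea behind `EventsToRegister` with a `QueueingHint` callback. The types are simplified and illustrative, not the exact kube-scheduler framework API:

```go
// A minimal, self-contained model of the idea (not the exact kube-scheduler API;
// the concrete types in kubernetes/kubernetes differ).
package main

import "fmt"

type QueueingHint int

const (
	QueueSkip QueueingHint = iota // the event cannot make the Pod schedulable; keep it in the unschedulable pool
	Queue                         // the event may make the Pod schedulable; requeue it
)

// ClusterEvent identifies a cluster change a plugin is interested in, e.g. "Node/Add".
type ClusterEvent struct {
	Resource   string
	ActionType string
}

// QueueingHintFn is the per-plugin callback: given the rejected Pod and the
// old/new objects of the event, decide whether a retry is worthwhile.
type QueueingHintFn func(podName string, oldObj, newObj interface{}) QueueingHint

// ClusterEventWithHint pairs an event with the plugin's hint function for it.
type ClusterEventWithHint struct {
	Event ClusterEvent
	Hint  QueueingHintFn
}

// A hypothetical NodeResourceFit-like registration: only Node additions/updates
// can resolve an "insufficient resources" rejection.
func nodeResourceFitEvents() []ClusterEventWithHint {
	return []ClusterEventWithHint{
		{
			Event: ClusterEvent{Resource: "Node", ActionType: "Add|Update"},
			Hint: func(podName string, oldObj, newObj interface{}) QueueingHint {
				// Real logic would compare old/new Node allocatable against the Pod's requests.
				return Queue
			},
		},
	}
}

func main() {
	for _, eh := range nodeResourceFitEvents() {
		fmt.Printf("%s/%s -> hint registered\n", eh.Event.Resource, eh.Event.ActionType)
	}
}
```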
@@ -0,0 +1,3 @@
kep-number: 4247
beta:
  approver: "@wojtek-t"
/cc @wojtek-t
🙏
#prr-shadow
n/a because the scheduler is the only component that rolls out this feature.
It doesn't matter that we only have a single component affected by this change; it's still possible that this component will be affected and thus can leave the entire cluster unstable. Have a look at the above comment for ideas about what should be considered.
Done.
No. This feature is internal to the scheduler and has no impact outside of it.
So, a plain upgrade and an upgrade->downgrade->upgrade path behave the same.
What happens in the HA case, where you have multiple instances of the scheduler, and on some this feature is available while on others it is not?
Only one scheduler is active at a given moment, so that's fine.
[It may happen that on leader re-elections we will be flapping between FG enabled/disabled, but that also sounds fine.]
ack, I was explicitly alluding to that feature gate flapping on and off.
It should be fine. The memory is lost during restarts, so calculations start from the beginning.
Worth mentioning in the KEP.
n/a
For beta graduation it would be good to define a reasonable SLO based on the `scheduler_pending_pods` you've used in several questions.
+1
Well, it depends very much on the cluster.
Even if there are many unscheduled Pods, that might not be a bug (e.g., all Nodes are full, etc.).
So, I'm not sure we can provide any good number for that metric.
@alculquicondor What do you think?
How about a different set of metrics (it doesn't have to be exactly the one I've mentioned above)? How about looking at `pod_scheduler_duration` or `pod_scheduling_attempts`, or even a set of values that will allow you to say that the scheduler is stable? Other example: 99% of pods in the queue gets scheduled within x mins.
`pod_scheduling_attempts` / `schedule_attempts_total` depends on the cluster. If the cluster usually runs out of capacity, there could be many unschedulable results.

`scheduling_attempt_duration_seconds`: the more scheduling constraints the Pods in the cluster have, the longer one scheduling cycle can be.
(This enhancement has almost nothing to do with the scheduling cycle anyway.)

Other example: 99% of pods in the queue gets scheduled within x mins.

So, scheduling latency like `pod_scheduling_sli_duration_seconds` also depends on the cluster. If the cluster usually runs out of capacity, it could be longer.
Well... so, all metrics in the scheduler are here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/metrics/metrics.go
I don't see any metric that can provide a fixed number (not depending on the cluster) from which we can tell that the scheduling queue is working well. Also, so far, I haven't come up with any idea for a new metric we could add to the scheduler to have such SLOs.
I checked the existing KEPs, and the one that is also related to the scheduling queue actually didn't have an SLO here:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3521-pod-scheduling-readiness#what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement
We can still have an SLO based on a single scheduling attempt, `pod_scheduler_duration`, which shouldn't regress.
And then we can say: in a cluster with at least x pending pods in a given window, `schedule_attempts_total` shouldn't be below y for the window. In other words, we should support scheduling z pods/s. 100 pods/s should be our target for the load tests in k8s.
+1 to Aldo
For reference, you can use this graph https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Scheduler&metricname=LoadSchedulingThroughput&TestName=load
As you can see, we have a throughput of ~150 at the 99 percentile.
I put it based on the input from Aldo. PTAL.
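To make the agreed SLO concrete, here is a rough sketch of the check it implies, assuming we sample the `schedule_attempts_total` counter at the start and end of a measurement window in a cluster with enough pending pods; the numbers and helper function below are illustrative only:

```go
package main

import (
	"fmt"
	"time"
)

// throughputMeetsSLO models the proposed SLO: given two samples of the
// schedule_attempts_total counter taken at the start and end of a window
// (in a cluster with enough pending pods so the scheduler is not idle),
// the observed scheduling throughput should not fall below
// targetPodsPerSecond (e.g. 100 pods/s).
func throughputMeetsSLO(attemptsStart, attemptsEnd uint64, window time.Duration, targetPodsPerSecond float64) (float64, bool) {
	observed := float64(attemptsEnd-attemptsStart) / window.Seconds()
	return observed, observed >= targetPodsPerSecond
}

func main() {
	// Illustrative numbers only: 33,000 attempts over a 5-minute window ≈ 110 pods/s.
	observed, ok := throughputMeetsSLO(1_000_000, 1_033_000, 5*time.Minute, 100)
	fmt.Printf("observed=%.1f pods/s, SLO met=%v\n", observed, ok)
}
```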
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282

Unit tests are added in the scheduling queue so that we can make sure the scheduling queue works correctly with the feature gate both enabled and disabled.
The tests you have in mind are feature tests.
Enablement/disablement means that we effectively switch the FG in the middle of the test (to test what happens when enabling/disabling the feature).
Given it's an in-memory feature, let's just switch to something like:
"Given it's purely an in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag), having feature tests is enough."
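To illustrate the kind of test meant here, a self-contained sketch that fixes the gate value per test case rather than flipping it mid-test; the queue type and constructor below are hypothetical stand-ins, not the real kube-scheduler code:

```go
package queue

import "testing"

// Minimal stand-ins so the sketch is self-contained; the real scheduling queue
// and its constructor in kube-scheduler are much richer.
type fakeQueue struct{ queueingHintEnabled bool }

func newFakeQueue(queueingHintEnabled bool) *fakeQueue {
	return &fakeQueue{queueingHintEnabled: queueingHintEnabled}
}

// The feature gate value is fixed per test case (enablement/disablement requires
// a component restart, so there is no mid-test switching of the gate).
func TestQueueWithQueueingHintGate(t *testing.T) {
	for _, tc := range []struct {
		name    string
		enabled bool
	}{
		{name: "feature gate enabled", enabled: true},
		{name: "feature gate disabled", enabled: false},
	} {
		t.Run(tc.name, func(t *testing.T) {
			q := newFakeQueue(tc.enabled)
			if q.queueingHintEnabled != tc.enabled {
				t.Fatalf("expected queueingHintEnabled=%v", tc.enabled)
			}
			// In the real test: push an unschedulable pod, deliver a cluster event,
			// and verify it is (or is not) requeued according to the QueueingHint path.
		})
	}
}
```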
- Testing: Are there any tests for failure mode? If not, describe why.

n/a
I disagree.
You mention one potential issue yourself in the risks section:

"If a plugin has QueueingHint and it misses some events which can make Pods schedulable, Pods rejected by it may be stuck in the unschedulable Pod pool."

Please discuss it here according to the template and think through whether there are other potential failure modes.
Done.
@sanposhiho we are likely to miss the enhancements freeze tonight.
Please file an exception. When doing so, highlight that this feature is stemming from the DRA KEP. Maybe highlight that the Alternative listed in the KEP is what is currently implemented in 1.28.
#### GA

- QueueingHint is implemented in all plugins.
- No bug report for a while.
Also remove the backoff?
Even with great QueueingHints implemented in all plugins, the scheduler would still get scheduling failures, and the backoff delay would need to exist as a penalty for such failed Pods. That's my current take on backoffQ: we cannot completely remove backoffQ with this QueueingHint work.
Backoff is there to avoid the head-of-line blocking problem. If we have fine-grained control over requeueing, I think we can remove it for higher scheduling throughput, but we should prove this, and agree it is not a necessary part for GA.
So, whether to skip backoff or not is closely tied to why the Pod was rejected,
and thus it is easier to decide at the point the Pod is rejected than when the Pod is actually requeued.

### Implement `Blocked` status to block a next scheduling retry until the plugin returns `Queue`
On second thought, we don't need `Blocked` (at least for the current in-tree plugins), given the reason described here.
Will update kubernetes/kubernetes#119517 once this PR reaches an agreement.
An exception request for this KEP is approved. cc @wojtek-t @soltysh
@alculquicondor Updated for the latest comment from you.
the scheduling queue notifies them through QueueingHint.
If either of QueueingHintFn from `NodeResourceFit` or `NodeAffinity` returns `Queue`,
the Pod is moved to activeQ/backoffQ.
(For example, when `NodeAdded` event happens, the QueueingHint of `NodeResourceFit` return `Queue`
I still have a question about this case: what if `NodeAffinity` returns `QueueSkip` here, which means the new node doesn't fit the nodeAffinity plugin; then the pod is still unschedulable, am I right?
I still think we should go through all the unschedulable plugins, and only when all of them return `Queue` do we proceed with requeueing. Take these three nodes for example:
- node1 rejected the pod for the nodeResourceFit plugin
- node2 and node3 rejected the pod for the nodeAffinity plugin

Then:
- A `nodeAdded` event happens; we name the new node node4.
  - node4 doesn't match the nodeResourceFit hintFunc, then fail.
  - node4 doesn't match the nodeAffinity hintFunc, fail.
  - node4 matches both unschedulable plugins, succeeded, then requeue.
- A `nodeLabelUpdate` event happens for node2.
  - node2 matches the nodeResourceFit hintFunc but doesn't match the nodeAffinity hintFunc, then fail.
  - node2 matches the nodeResourceFit hintFunc and also the nodeAffinity hintFunc, succeeded, then requeue.

I know a pod may be rejected by several nodes for different plugins, but a pod may also be rejected by only one node for several different plugins.
Maybe I missed other possibilities and cases which could leave a schedulable pod blocked in requeueing following my logic; please correct me if there are any.
we should go through all the unschedulable plugins and until all of them return Queue, then we'll process with requeueing.

At least, your idea cannot simply be achieved. There are problematic scenarios like kubernetes/kubernetes#109437:
1. The Pod is rejected by NodeAffinity and NodeTaint.
2. A new Node is created, but unready (= has a taint). This Node matches the Pod's NodeAffinity.
3. The scheduling queue receives NodeCreated. NodeAffinity returns `Queue`, but NodeTaint returns `QueueSkip`. So, the Pod isn't moved to activeQ.
4. The Node becomes ready (= untainted).
5. The scheduling queue receives NodeUpdated, and NodeTaint returns `Queue`. But NodeAffinity returns `QueueSkip` because it isn't interested in the change from tainted to untainted. So, the Pod isn't moved to activeQ.

Note that each QueueingHint returns `Queue` only when the change could make the Pod schedulable.
kubernetes/kubernetes#119396 (comment)
So, the Node matches NodeAffinity, but NodeAffinity returns skip in (5).
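For readers skimming the thread, here is a rough sketch of the semantics being defended: the queue consults only the plugins that rejected the Pod and requeues if any of them returns `Queue` (OR), instead of requiring all of them to return `Queue` (AND). The types and functions below are illustrative, not the actual scheduler code:

```go
package main

import "fmt"

type QueueingHint int

const (
	QueueSkip QueueingHint = iota
	Queue
)

// hintFn decides, for one plugin, whether a given cluster event could make the Pod schedulable.
type hintFn func(event string) QueueingHint

// shouldRequeue models the current design: consult only the plugins that rejected
// the Pod in its last scheduling attempt, and requeue if ANY of them returns Queue.
// Requiring ALL of them to return Queue would miss scenarios like
// kubernetes/kubernetes#109437, where the rejections are resolved by different
// events at different times.
func shouldRequeue(rejectedPlugins map[string]hintFn, event string) bool {
	for name, fn := range rejectedPlugins {
		if fn(event) == Queue {
			fmt.Printf("plugin %s says %s may make the Pod schedulable\n", name, event)
			return true
		}
	}
	return false
}

func main() {
	// Illustrative hint functions for the scenario above.
	rejected := map[string]hintFn{
		"NodeAffinity": func(event string) QueueingHint {
			if event == "NodeCreated" {
				return Queue // a new Node matching the affinity appeared
			}
			return QueueSkip
		},
		"NodeTaint": func(event string) QueueingHint {
			if event == "NodeUpdated" {
				return Queue // the taint that rejected the Pod was removed
			}
			return QueueSkip
		},
	}
	fmt.Println(shouldRequeue(rejected, "NodeCreated")) // true: OR semantics requeues here
	fmt.Println(shouldRequeue(rejected, "NodeUpdated")) // true
}
```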
I'm not saying it's impossible to improve, in the future, how hints and unschedulable plugins are handled in the scheduling queue compared to the current design. Actually, I remember I mentioned a similar idea somewhere before (see the sketch after this comment), like:
1. When the Pod is rejected, the scheduling queue tracks which Node was rejected by which plugins. (map[NodeName][]PluginName)
2. When the scheduling queue receives an event, the scheduling queue executes QueueingHint. (same as now)
3. QueueingHint somehow returns which Node might become schedulable by this change. (like returning map[NodeName]QueueingHint?)
4. If one Node seems to have all of its rejections from plugins resolved, requeue this Pod.

Bear with me, it's a super rough idea. I know, for example, (3) is not simple to do with cross-Node constraints like PodTopologySpread.
But, anyway, that kind of enhancement could be a big enough change to be treated as another enhancement after the basis of QueueingHint, discussed in this KEP, is done.
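A very rough sketch of the bookkeeping that idea would imply; everything below is hypothetical and not part of this KEP:

```go
package main

import "fmt"

// Purely hypothetical data shapes for the rough per-Node tracking idea above.

type QueueingHint int

const (
	QueueSkip QueueingHint = iota
	Queue
)

// rejections records, for one unschedulable Pod, which plugins rejected which Node
// in the last scheduling attempt: map[NodeName][]PluginName.
type rejections map[string][]string

// perNodeHints is what a hint function might return in that design:
// for each Node, whether the event could make that Node schedulable.
type perNodeHints map[string]QueueingHint

// requeueIfSomeNodeResolved requeues the Pod only if, for at least one Node,
// every plugin that rejected that Node now returns Queue for it.
func requeueIfSomeNodeResolved(rej rejections, hintsByPlugin map[string]perNodeHints) bool {
	for node, plugins := range rej {
		resolved := true
		for _, plugin := range plugins {
			if hintsByPlugin[plugin][node] != Queue {
				resolved = false
				break
			}
		}
		if resolved {
			return true
		}
	}
	return false
}

func main() {
	rej := rejections{
		"node1": {"NodeResourceFit"},
		"node2": {"NodeAffinity"},
	}
	hints := map[string]perNodeHints{
		"NodeResourceFit": {"node1": QueueSkip, "node2": Queue},
		"NodeAffinity":    {"node1": Queue, "node2": Queue},
	}
	// node2's only rejecting plugin (NodeAffinity) returns Queue for node2, so requeue.
	fmt.Println(requeueIfSomeNodeResolved(rej, hints))
}
```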
What I thought is: since the pod is unschedulable, each node is rejected by at least one plugin; when requeueing, we'd run a hintFunc that only cares about what the pod looks like right now.
Take "NodeAffinity returns QueueSkip because it isn't interested in the change from tainted to untainted" for explanation:
- before: we return `Queue` only when the oldPod was rejected by the nodeAffinity plugin but the newPod is accepted
- now: we return `Queue` as long as the newPod is accepted by the nodeAffinity plugin

Does this solve the problem you described above? Any traps here? 🤯
for cross-node plugins, we may simply return Queue like we do today.

That means cross-node plugins would have to return `Queue` for any non-Node events they register. In my example above, we would need to make PodTopologySpread return `Queue` at (5) although that PodUpdated event has completely nothing to do with PodTopologySpread.
So, always just returning `Queue` in such difficult cases cannot be the answer here.
And no, we don't do this today, because we focus only on the transition of the event.

In the current implementation, you also can't know whether this will make podA schedulable

So, to be accurate, in the current design we don't need to know whether the event makes PodA schedulable "on a certain Node(s)". The current design is somewhat looser on this than yours: it focuses on the transition only and doesn't check further.
For example, in the current design, PodTopologySpread returns `Queue` when an assigned Pod changes from not matching to matching PodTopologySpread's selector. So we know some Nodes may become schedulable by this event, but we don't need to check further whether this Pod's change actually makes some Nodes schedulable or not.
On the other hand, in your design, we'd need to know how that event actually affects the possibility of scheduling the Pod onto some Nodes (otherwise we miss some possibilities, as my example above shows), and return `Queue` only when we're sure a certain Node(s) can become schedulable (from all unschedulable plugins' point of view) after the event, which is difficult to achieve for cross-node constraints (unless, as you said, we run the whole PreFilter/Filter).
So, basically, I don't think we need to move away from the current design, which focuses on the transition of the event.
(Also, I guess we can achieve similar behavior on top of the current design.)
we need to make PodTopologySpread return Queue at (5) although that PodUpdated event has completely nothing to do with PodTopologySpread
To clarify, they are related, because PodTopologySpread cares about the number of pods with the specified labels. What I mean by "always return Queue" still cares about the related events, but we can't know the exact result until we run PreFilter & Filter.
What I mean always return queue still cares about the related events

We'd have to return `Queue` for all non-Node events they register. Note that there could also be custom plugins which have Pod-related events but are interested in a part of the Pod other than labels. Let's say the scheduler has one (weird) custom plugin named WeirdPodAffinity 😅 which does PodAffinity but with the Pod's annotations instead of labels.
Here's the same scenario as this comment, but using WeirdPodAffinity instead of PodAffinity:
1. In PodA's scheduling: NodeA is rejected by PodTopologySpread. NodeB is rejected by WeirdPodAffinity while it gets OK from PodTopologySpread.
2. PodA is queued back into the unschedulable pod pool with {unschedPlugins: PodTopologySpread / WeirdPodAffinity}.
3. PodB, which is already scheduled on NodeB, gets an update on its annotation. (AssignedPodUpdated)
4. WeirdPodAffinity returns `Queue` because PodB's new annotation matches PodA's WeirdPodAffinity.
5. But PodTopologySpread returns `QueueSkip` because PodB's update has nothing to do with PodA's PodTopologySpread.

PodTopologySpread should have returned `Queue` at (5) because this PodUpdated event makes NodeB schedulable.
So, from PodTopologySpread's PoV, it cannot know how other plugins might return `Queue` for Pod-related events, and it would need to always return `Queue` when it receives PodUpdated.
Yes, cross-node cases are complex...
:nod:
That's too complex to be handled by the current QueueingHint specification.
Let's put it off for now until the current, simple QueueingHint is implemented in all plugins. Then we can start further improvements like what we discussed here. I'll create an issue so we remember this idea.
### Return `QueueImmediately`, `QueueAfterBackoff`, and `QueueSkip` from `QueueingHintFn` instead of introducing new status `Pending`
Personally, I prefer this one because it's simpler and more extensible: people can decide the requeueing policy by the hint only, decoupled from the framework status. E.g., when implementing custom plugins, they can still return an unschedulable status but return `QueueImmediately` in the hintFunc if necessary. Also, it's backwards-compatible.
If we want to use a `Pending` status, could `Wait` fit here? Then there's no need to introduce a new status.
they can still return unschedulable status but return QueueImmediately in hintFunc if necessary.

So, when would they want to return QueueImmediately but need to return unschedulable in the scheduling cycle?
More flexibility/extensibility is not always better. All plugins (including custom ones) gather in the scheduler and work together. So, we need a clear rule and guardrail to prevent incorrect usage of when to skip backoff and when not to. Otherwise, everyone has a different idea, as we saw in the NodeAffinity QueueingHint PR, and each plugin may be implemented with a different strategy.
In the current KEP, we define the backoff delay as a penalty for wasting a scheduling cycle, and deciding how to requeue Pods (through backoff or not) based on the returned status makes more sense, because it means that when to skip/not skip backoff is coupled with the past scheduling cycle, not with the cluster event.
If we want to go away from this design and keep using `QueueImmediately`, `QueueAfterBackoff`, and `QueueSkip`,
we need another rule and guardrail to prevent incorrect usage.
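To make the two alternatives concrete, here is a sketch of the type shapes only (illustrative names, not the final API): the alternative encodes the backoff decision in the hint itself, while the KEP's design keeps the hint binary and couples backoff-skipping to the status the plugin returned in the scheduling cycle.

```go
package main

import "fmt"

// Illustrative type sketches only; names and shapes are not the final API.

// Alternative discussed in this section: the hint itself chooses the requeueing policy.
type QueueingHintAlt int

const (
	QueueSkipAlt      QueueingHintAlt = iota // don't requeue for this event
	QueueAfterBackoff                        // requeue, but respect the backoff delay
	QueueImmediately                         // requeue and skip backoff
)

// KEP's current design: the hint is binary...
type QueueingHint int

const (
	QueueSkip QueueingHint = iota
	Queue
)

// ...and whether backoff is skipped depends on the status the plugin returned in
// the scheduling cycle (Unschedulable = a wasted cycle, pay the backoff penalty;
// Pending = the cycle wasn't wasted, e.g. the plugin is waiting for an external
// event, so no penalty).
type Status int

const (
	Unschedulable Status = iota
	Pending
)

func requeueTarget(hint QueueingHint, lastStatus Status) string {
	if hint == QueueSkip {
		return "stay in unschedulable pool"
	}
	if lastStatus == Pending {
		return "activeQ (skip backoff)"
	}
	return "backoffQ"
}

func main() {
	fmt.Println(requeueTarget(Queue, Pending))       // activeQ (skip backoff)
	fmt.Println(requeueTarget(Queue, Unschedulable)) // backoffQ
}
```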
I'm not against this idea; we just haven't had much feedback from the community so far, so starting with a simple way sounds more appealing to me. I'm OK with both paths.
And, will we consider `Wait` instead of `Pending`?
will we consider Wait instead of Pending

Yes. We're still seeking ideas on how to name this status. I had `SuccessButRejected` and just renamed it to `Pending`, but as you said, `Pending` sounds very similar to `Wait` (at least to me, a non-native English speaker).
So, reusing `Wait` there might actually be an option.
I guess one point is whether we want to minimize the number of statuses as much as possible, or create a new status in order not to pack several different meanings into one status.
Let's just keep it for now and continue the discussion in kubernetes/kubernetes#119517. We can easily fix the KEP later if we use `Wait` or a different name instead of `Pending`.
The hints functionality is already complex. Let's ensure it's stable and uses memory efficiently, before we consider more complex scenarios.
Just one nit from PRR perspective.
I'm approving to not block it, but please fix before merging.
/approve PRR
/hold
No.
This is only partially true.
No user-visible changes are expected, but re-queueing of unschedulable pods is different, so it may in theory happen that initially unschedulable pods will get scheduled later than before [in case of bugs].
Fixed.
/label tide/merge-method-squash
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold cancel
Other comments:
As described in #4247 (comment), this KEP is a follow-up to clarify the position of QueueingHint; it starts directly in the beta phase on the assumption that it stems from the DRA plugin. The discussion about how we treat this KEP is still ongoing, but we at least agreed on the idea of having a separate KEP for QueueingHint: https://kubernetes.slack.com/archives/C5P3FE08M/p1695639140018139?thread_ts=1694167948.846139&cid=C5P3FE08M
Also, this KEP contains what is proposed in kubernetes/kubernetes#119517.