KEP-3022: Min domains in PodTopologySpread #3030
Conversation
keps/sig-scheduling/3022-tuning-the-number-of-domains-in-pod-topology-spread/README.md
/assign
Force-pushed from 62acb98 to fee195e.
@sanposhiho thanks for driving this. Given that the consensus of #3022 so far is to start with `minDomains` ...
@Huang-Wei Sure. Thanks.
Updated KEP to focus on `minDomains`.
owning-sig: sig-scheduling | ||
status: provisional | ||
creation-date: 2021-10-28 | ||
reviewers: |
You don't need that many reviewers/approvers :)
Also, damemi is not a KEP approver (yet :))
you can leave me out of this one.
## Motivation | ||
|
||
Pod Topology Spread has [`maxSkew` parameter](https://github.com/kubernetes/enhancements/tree/11a976c74e1358efccf251d4c7611d05ce27feb3/keps/sig-scheduling/895-pod-topology-spread#maxskew), which control the degree to which Pods may be unevenly distributed. | ||
But, there isn't a way to control the number of domains over which we should spread. |
Explain why this is a problem.
The main reason is that, if a domain has 0 nodes, the scheduler would still schedule on existing domains, without giving the cluster autoscaler a chance to provide a node for it.
Okay, I'll try to be more specific.
|
||
Users can define a minimum number of domains with `minDomains` parameter. | ||
|
||
When the number of domains that have matching Pods is less than `minDomains`, |
This is not part of the Goal, this is part of the Design.
// - when `whenUnsatisfiable` equals `ScheduleAnyway`, scheduler prefers Nodes on the domains that don't have matching Pods. | ||
// i.e. scheduler gives a high score to Nodes on the domains that don't have matching Pods in the `Score` phase. | ||
// +optional | ||
MinDomains int32 |
*int32
...... | ||
// MinDomains describes the minimum number of domains. | ||
// When the number of domains that have matching Pods is less than `minDomains`, | ||
// - when `whenUnsatisfiable` equals `DoNotSchedule`, scheduler doesn't schedule a matching Pod to Nodes on the domains that have matching Pods. |
Uhm... rethinking a bit more about this, I think the semantics are off.
Let's say maxSkew=2 and there is only one node. I should be able to schedule 2 pods on that one node. The third pod should fail, to force a scale up.
So we have to consider how many domains have pods. If it's less than minDomains, the global minimum is 0 https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#spread-constraints-for-pods
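To make that arithmetic concrete, here is a minimal, hypothetical sketch (not the scheduler's actual code) of the check being described, assuming the global minimum is treated as 0 when fewer than `minDomains` domains exist:

```go
package main

import "fmt"

// fitsSkew is an illustrative stand-in for the Filter check: placing the pod
// would raise this domain's matching count by one, and the resulting skew
// against the global minimum must not exceed maxSkew.
func fitsSkew(matchingInDomain, globalMin, maxSkew int) bool {
	return matchingInDomain+1-globalMin <= maxSkew
}

func main() {
	const maxSkew = 2
	// Only one domain exists and minDomains > 1, so the global minimum is
	// treated as 0 under the proposed semantics.
	const globalMin = 0
	for existing := 0; existing < 3; existing++ {
		fmt.Printf("pod %d fits: %v\n", existing+1, fitsSkew(existing, globalMin, maxSkew))
	}
	// Prints: pod 1 fits: true, pod 2 fits: true, pod 3 fits: false.
	// The third pod is rejected, which is what gives the cluster autoscaler
	// a chance to provision a new node (a new domain).
}
```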
Aldo is right. Currently (without `minDomains`), the semantics of "global minimum" literally indicates the minimum number of matching pods among all topology domains.
While with `minDomains`, the semantics is sort of dynamic:

- if the number of domains >= `minDomains`, "global minimum" stays the same as before
- if the number of domains < `minDomains`, "global minimum" is treated as 0
Thanks for the clear explanation.
Okay, what I thought was: `maxSkew` does not affect the behavior of `minDomains`. `minDomains` tries to increase the number of domains as much as possible if the number of domains < `minDomains`; so, in that case, it always tries to bind matching Pods to the domains that don't have any matching Pods.
But I also think your opinion is better. I'll fix the description. 🙏
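To summarize the agreed semantics in one place, a tiny illustrative sketch (names are made up, not the real scheduler API) of the dynamic "global minimum" rule discussed above:

```go
package sketch

// effectiveGlobalMinimum captures the dynamic semantics described above:
// if fewer domains exist than minDomains, the global minimum is treated as 0;
// otherwise it is the observed minimum number of matching pods per domain.
func effectiveGlobalMinimum(numDomains, minDomains, observedMin int) int {
	if numDomains < minDomains {
		return 0
	}
	return observedMin
}
```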
```yaml | ||
spec: | ||
topologySpreadConstraints: | ||
- minDomains: 10 |
MaxSkew is mandatory
|
||
#### Alpha (v1.24): | ||
|
||
- [ ] Add new parameter `NinDomains` to `TopologySpreadConstraint` and future gating. |
typos: `MinDomains` and feature
#### Beta (v1.25): | ||
|
||
- [ ] This feature will be enabled by default as a Beta feature in v1.25. | ||
- [ ] Add necessary integration/end-to-end tests. |
We usually require integration tests for Alpha. It would be great to have E2E tests that include a cluster autoscaler, but I'm not sure if that's possible.
kep-number: 3022 | ||
authors: | ||
- "@sanposhiho" | ||
owning-sig: sig-scheduling |
Add `sig-autoscaling` as participating sig and @MaciekPytel as reviewer.
Sure.
### Feature Enablement and Rollback | ||
|
||
<!-- | ||
This section must be completed when targeting alpha to a release. |
I guess we can leave this for the time when the KEP moves to `implementable`, but you can work on it already if you want.
With [Pod Topology Spread](/keps/sig-scheduling/895-pod-topology-spread), users can define the rule to spread pods across your cluster among failure-domains. | ||
And, we propose to add the parameter `minDomains` to limit the minimum number of domains in Pod Topology Spread. |
The summary section basically describes a fact that XYZ is introduced to achieve a particular goal. So don't need to mention what PodTopologySpread is:
A new field `MinDomains` is introduced to `PodSpec.TopologySpreadConstraint[*]` to limit
the minimum number of topology domains. It functions in a mandatory or best-efforts manner,
depending on the type of `WhenUnsatisfiable`.
Thanks.
|
||
### Goals | ||
|
||
Users can define a minimum number of domains with `minDomains` parameter. |
The goals usually list a few bullets, each with a neat sentence without too many technical details.
I am using cluster autoscaler and I want to force spreading a deployment over at least 10 Nodes. | ||
|
||
## Design Details | ||
|
Please complete this section.
Sure, I'll move the current Goals (which contain technical details) to this part.
|
||
### API | ||
|
||
New parameter called `MinDomains` is introduced. |
is introduced to ...
i.e. scheduler filters Nodes on the domains that have matching Pods in the `Filter` phase. | ||
- when `whenUnsatisfiable` equals `ScheduleAnyway`, scheduler prefers Nodes on the domains that don't have matching Pods. | ||
i.e. scheduler gives a high score to Nodes on the domains that don't have matching Pods in the `Score` phase. | ||
|
Add a Non-Goals section here and mention that `maxDomains` is not considered.
Sure.
Users can set `MinDomains` and `whenUnsatisfiable: DoNotSchedule` to achieve it. | ||
|
||
```yaml | ||
spec: |
Can you give a full Deployment example to show the replicas?
Sure.
spec: | ||
topologySpreadConstraints: | ||
- minDomains: 10 | ||
topologyKey: node |
let's use standard label: kubernetes.io/hostname
@Huang-Wei @alculquicondor
owning-sig: sig-scheduling | ||
participating-sigs: | ||
- sig-autoscaling | ||
status: provisional |
I think you can aim for implementable already so that we don't have to ping the reviewers again.
|
||
Pod Topology Spread has [`maxSkew` parameter](https://github.com/kubernetes/enhancements/tree/11a976c74e1358efccf251d4c7611d05ce27feb3/keps/sig-scheduling/895-pod-topology-spread#maxskew), which control the degree to which Pods may be unevenly distributed. | ||
But, there isn't a way to control the number of domains over which we should spread. | ||
In some cases, users want to limit the number of domains for the cluster autoscaler and force spreading Pods over a minimum number of domains. |
In some cases, users want to force spreading Pods over a minimum number of domains and, if there aren't enough already present, make the cluster-autoscaler provision them.
When the number of domains that have matching Pods is less than `minDomains`, | ||
Pod Topology Spread treats "global minimum" as 0. | ||
|
||
As the result, |
As a result
foo: bar | ||
``` | ||
|
||
Until 10 Pods have been created, this flow is expected to repeats itself. |
This is assuming that the cluster initially only has 0 or 1 node.
True. I changed the explanation a bit. How about this?
MinDomains *int32 | ||
} | ||
``` | ||
|
Can you include some implementation details?
Please don't add code to the KEP, but you can have some thoughts on which files/functions need to change and if we expect to have any performance degradation.
Sure. I added an implementation detail section that explains how to implement this feature in a bit more detail.
###### How can this feature be enabled / disabled in a live cluster? | ||
|
||
- [ ] Feature gate (also fill in values in `kep.yaml`) | ||
- Feature gate name: |
please fill this up
// - when `whenUnsatisfiable` equals `ScheduleAnyway`, | ||
// scheduler prefers Nodes on the domains that don't have the same number of matching Pods as `maxSkew`. | ||
// +optional | ||
MinDomains *int32 |
What is the default value? Probably zero. Please specify.
Yes, it's 0.
Sure, I'll add the description.
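A hedged sketch of how an unset pointer could map to the stated default (the helper name is hypothetical, not the actual API machinery):

```go
package sketch

// minDomainsOrDefault treats a nil (unset) MinDomains as the default of 0
// discussed above, which keeps today's behaviour unchanged.
func minDomainsOrDefault(minDomains *int32) int32 {
	if minDomains == nil {
		return 0
	}
	return *minDomains
}
```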
|
||
###### What happens if we reenable the feature if it was previously rolled back? | ||
|
||
N/A. |
Please answer this question.
|
||
###### Are there any tests for feature enablement/disablement? | ||
|
||
No. |
There should be unit and integration tests.
|
||
### Scalability | ||
|
||
<!-- |
You can leave this for later, but consider working on it.
Sure, I added some answers.
@alculquicondor
keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md
- `global min matching num` denotes the minumun number of matching Pods. | ||
|
||
For `whenUnsatisfiable: DoNotSchedule`, Pod Topology Spread treats `global min matching num` as 0 | ||
when the number of domains that have matching Pods is less than `minDomains`. |
how do you calculate this? probably in PreFilter? Can it be done without increasing the complexity of the algorithm?
Ditto. Don't add the "that have matching Pods" clause.
when the number of domains that have matching Pods is less than `minDomains`. | |
when the number of domains is less than `minDomains`. |
how do you calculate this? probably in PreFilter? Can it be done without increasing the complexity of the algorithm?
The current implementation calculates global min matching num in PreFilter.
And we use this global min matching num in the filtering calculation here:
https://github.com/kubernetes/kubernetes/blob/108c284a330a82ce1a1f80238e4f54bf5e8b045a/pkg/scheduler/framework/plugins/podtopologyspread/filtering.go#L322
So, I think the implementation will be like this. It's simple and can be implemented without increasing the complexity.

```go
// https://github.com/kubernetes/kubernetes/blob/108c284a330a82ce1a1f80238e4f54bf5e8b045a/pkg/scheduler/framework/plugins/podtopologyspread/filtering.go#L317
minMatchNum := paths[0].MatchNum
if minMatchNum < int(*c.MinDomains) {
	minMatchNum = 0
}
```
Sorry if I was not clear. If I ask a question, it means that it's not clear in the KEP, and thus it needs to be updated.
Okay. I'll update it 🙇
|
||
And, in `preScoreState`, there is the `IgnoredNodes` field and Nodes in `preScoreState.IgnoredNodes` will literally be ignored and the score for those Nodes will be 0. | ||
|
||
For `whenUnsatisfiable: ScheduleAnyway`, Pod Topology Spread adds Nodes on the domains which have the same or more number of matching Pods as `maxSkew` to `preScoreState.IgnoredNodes` |
How do you calculate which nodes to ignore?
I found that I hadn't thought about it enough, so I revised the whole implementation detail of the score phase. 🙏
Here are some examples of implementation details:
Although, I think those changes are trivial in comparison to what you have to do for this KEP.
|
||
Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain. | ||
|
||
When the number of domains that have matching Pods is less than `minDomains`, |
It's not mandatory to have matching pods on a domain (you can have 0 pods on a domain), so
When the number of domains that have matching Pods is less than `minDomains`, | |
When the number of domains is less than `minDomains`, |
+1 you are right. Although would "the number of domains with matching nodes" be better?
I guess you meant "the number of domains with matching ~~nodes~~ topology keys"?
well, ultimately the domains are defined by node labels. So I'm happy with any of those wordings.
Understood. Thanks.
I'll change the wording to "the number of domains with matching topology keys".
Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain. | ||
|
||
When the number of domains that have matching Pods is less than `minDomains`, | ||
Pod Topology Spread treats "global minimum" as 0. |
Pod Topology Spread treats "global minimum" as 0. | |
Pod Topology Spread treats "global minimum" as 0; otherwise, "global minimum" | |
is equal to the number of matching pods on a domain. |
the minimum number of matching pods on a domain.
foo: bar | ||
``` | ||
|
||
Until 10 Pods have been created, scheduler tries to schedule maximum two Pods per Node. |
You'd better not mention how many pods (10 in this case) get scheduled, because it quite depends on how many domains are present, and you didn't mention that.
I'd like to revise this deployment example to be 10 replicas, and assume there are 3 nodes in the cluster. So the first 6 pods will be scheduled, and the remaining 4 pods can only be scheduled when 2 more nodes join the cluster (provisioned by CA, for instance).
Okay. fixed it. 🙇
|
||
- [ ] Feature gate (also fill in values in `kep.yaml`) | ||
- Feature gate name: `MinDomainsInPodTopologySpread` | ||
- Components depending on the feature gate: `kube-scheduler` |
API server should also be involved, like: if the feature gate is disabled in API server, the field won't be persisted.
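As a rough illustration of that point, kube-apiserver typically clears gated fields before persisting them. The sketch below is hypothetical (the type and function names are simplified stand-ins, not the real API machinery):

```go
package sketch

// TopologySpreadConstraint is a trimmed-down stand-in for the real API type.
type TopologySpreadConstraint struct {
	MinDomains *int32
}

// dropDisabledMinDomains clears the new field when the feature gate is off,
// so the value is never persisted by the API server.
func dropDisabledMinDomains(constraints []TopologySpreadConstraint, gateEnabled bool) {
	if gateEnabled {
		return
	}
	for i := range constraints {
		constraints[i].MinDomains = nil
	}
}
```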
Pod Topology Spread changes the evaluation way to this. | ||
|
||
``` | ||
('existing matching num' * 'topology weight' + 'maxSkew' - 1) * ('if existing matching num >= maxSkew (1 or 0)' + 1) |
Maybe we should have a discussion about the multiplier which is used if existing matching num >= maxSkew.
I'm doubling it now, but there is no strong reason. I just want the score to be big if existing matching num >= maxSkew.
If we make it too big, it will affect all the final scores calculated in NormalizeScore.
An alternative is to make all the nodes that have existing matching num >= `maxSkew` (when the number of domains is less than minDomains) have the max Score, so that the normalized score is potentially zero. Although, given normalization, getting a zero normalized score also depends on the minimum score.
To cap, you would do:
min(maxSkew, existingMatchingNum) * topologyWeight + maxSkew - 1
Circling back on what I said earlier, you should prefer not to add code to the KEP. But a not-so-simple formula that has conditionals can be more readable as a code snippet. As long as you are not providing a significant diff, some code is ok in a KEP.
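For illustration, a hedged sketch of the suggested capped formula (the function is made up for this example, not the plugin's actual code):

```go
package sketch

// cappedConstraintScore implements min(maxSkew, existingMatchingNum) *
// topologyWeight + maxSkew - 1, so domains that already hold maxSkew or more
// matching pods all receive the same (worst) per-constraint score.
func cappedConstraintScore(existingMatchingNum, maxSkew, topologyWeight int64) int64 {
	capped := existingMatchingNum
	if maxSkew < capped {
		capped = maxSkew
	}
	return capped*topologyWeight + maxSkew - 1
}
```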
/cc @x13n
|
||
The feature can be disabled in Alpha and Beta versions. | ||
In terms of Stable versions, users can choose to opt-out by not setting the | ||
`pod.spec.topologySpreadConstraints.maxDomains` field. |
nit: minDomains
Thanks.
/assign
|
||
Users can define a minimum number of domains with `minDomains` parameter. | ||
|
||
Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain. |
Add:
However, the global minimum is only calculated for the nodes that exist and match the node affinity. In other words, if a topology domain was scaled down to zero (for example, because of low utilization), this topology domain is unknown to the scheduler, thus it's not considered in the global minimum calculations.
The new `minDomains` field can help with this problem: [continues the existing text]
``` | ||
|
||
- `existing matching num` denotes the number of current existing matching Pods on the domain. | ||
- `if self-match` denotes if the labels of Pod is match with selector of the constraint. |
s/is match/matches
|
||
In Score, the score for each constraint is evaluated, and the sum of those scores will be a Node score. | ||
|
||
Basically, the score for each constraint is evaluated this way: |
Remove the word Basically
``` | ||
|
||
When the number of domains with matching topology keys is less than `minDomains`, | ||
Pod Topology Spread doubles that score (so that final score will be a low score) if this criteria is met: |
s/final/normalized
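For reference, a hedged sketch of the doubling rule quoted above (this score change was later dropped in favour of restricting `minDomains` to `DoNotSchedule`; all names here are illustrative):

```go
package sketch

// proposedScore doubles the per-constraint score when fewer domains than
// minDomains exist and the domain already holds maxSkew or more matching
// pods, so such nodes end up with a low normalized score.
func proposedScore(existingMatchingNum, maxSkew, topologyWeight, numDomains, minDomains int64) int64 {
	score := existingMatchingNum*topologyWeight + maxSkew - 1
	if numDomains < minDomains && existingMatchingNum >= maxSkew {
		score *= 2
	}
	return score
}
```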
Like preFilter, we need to calculate the number of domains with matching topology keys and the minimum number of matching Pods in preScore, | ||
so that Pod Topology Spread can determine the evaluation way with them. | ||
|
||
This extra calculation may affect the performance of the preScore, because the current preScore only see Nodes which have passed the Filter. |
Uhm.... we need to make sure to do it only if a constraint has minDomains set.
Is it really worth it, though? Since the original formula already gives a similar semantic, and because scoring is ignored by cluster-autoscaler, I now think we should leave scores unmodified.
@Huang-Wei, thoughts?
Well.. I totally agree with "the original formula already gives a similar semantic". But isn't it strange for users that there is no score change even after setting minDomains?
Or should we specify in the documentation that `minDomains` has an effect only when using `DoNotSchedule` and no actual effect with `ScheduleAnyway`?
we need to make sure to do it only if a constraint has minDomains set.
Agree, we should do that if we go the way of changing scores.
Yes, we can specify in the documentation that it only applies to `DoNotSchedule`.
But, isn't it strange for users that there is no score change even after setting minDomains?
The scores are not user-visible regardless. The difference is so subtle that I don't think they would notice a difference in most scenarios even if there is a change.
WDYT @Huang-Wei?
Yes, restricting it to `DoNotSchedule` adheres to our initial motivation, while investing effort on tuning a subtle scoring difference doesn't make a lot of sense.
Okay. Understood.
I changed the doc to set the goal to DoNotSchedule only.
If it's not too much effort, add a summary of the discussion about DoNotSchedule in the "Alternatives" section. Explain what the proposal was, and why ultimately we decided it was not worth pursuing.
BTW, while it might feel disappointing that we are discarding the change, it was absolutely crucial that we explored the options. So thank you a lot for your effort.
add a summary of the discussion about DoNotSchedule in the "Alternatives" section. Explain what the proposal was, and why ultimately we decided it was not worth pursuing.
Sure. I added it to `Alternatives`. Could you please take a look? 🙏
BTW.. happy to contribute to our best choice :)
reviewers: | ||
- "@alculquicondor" | ||
- "@Huang-Wei" | ||
- "@MaciekPytel" |
swap for @x13n
@alculquicondor
- Users can specify `minDomains` to limit the number of domains. | ||
- Users can use it as a mandatory requirement with `WhenUnsatisfiable=DoNotSchedule`. |
- Users can specify `minDomains` to limit the number of domains. | |
- Users can use it as a mandatory requirement with `WhenUnsatisfiable=DoNotSchedule`. | |
- Users can specify `minDomains` to limit the number of domains when using `WhenUnsatisfiable=DoNotSchedule`. |
...... | ||
// When the number of domains with matching topology keys is less than `minDomains`, | ||
// Pod Topology Spread treats "global minimum" as 0. | ||
// As a result, |
remove this line
// As a result, | ||
// As a result, when the number of domains is less than `minDomains`, | ||
// scheduler doesn't schedule a matching Pod to Nodes on the domains that have the same or more number of matching Pods as `maxSkew`. | ||
// Default value is 0. |
Add: when the value is different from 0, `WhenUnsatisfiable` must be `DoNotSchedule`.
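A hedged sketch of that validation rule (illustrative names, not the actual API validation code):

```go
package sketch

import "errors"

// validateMinDomains rejects a non-zero MinDomains unless whenUnsatisfiable
// is DoNotSchedule, per the comment above.
func validateMinDomains(minDomains *int32, whenUnsatisfiable string) error {
	if minDomains != nil && *minDomains != 0 && whenUnsatisfiable != "DoNotSchedule" {
		return errors.New("minDomains requires whenUnsatisfiable to be DoNotSchedule")
	}
	return nil
}
```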
#### Alpha (v1.24): | ||
|
||
- [ ] Add new parameter `MinDomains` to `TopologySpreadConstraint` and feature gating. | ||
- [ ] Score extension point implementation. |
remove this line
keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md
@alculquicondor Thanks. Updated per your suggestions.
/approve Please squash commits :) ping @wojtek-t for PRR
Squashed 👍
Just two nits - other than that it lgtm.
|
||
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? | ||
|
||
The feature can be disabled in Alpha and Beta versions. |
Add sth like:
"The feature can be disabled by restarting kube-apiserver and kube-scheduler with feature-gate off".
|
||
###### Are there any tests for feature enablement/disablement? | ||
|
||
There should be unit and integration tests. |
No - tests will be added.
@wojtek-t
It will need more work for Beta, but it's good for alpha from a PRR perspective. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, sanposhiho, wojtek-t
Hello team.
KEP-3022: Tuning the number of domains in Pod Topology Spread
Note: This is my first KEP. Please let me know if I'm missing something.