node label for policy and condition for CPUPressure #970
@@ -145,6 +145,10 @@ configuration. The three policies are **none**, **static** and **dynamic**.
The active CPU manager policy is set through a new Kubelet
configuration value `--cpu-manager-policy`. The default value is `none`.
This policy will be reported in the Node resource as a label with key
`alpha.kubernetes.io/cpu-policy`. This allows users with a policy
preference to use a node selector to ensure the pod lands on a node with a
particular CPU manager policy enabled.
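As an illustration of the node-selector flow described above, here is a minimal sketch of a pod that asks to land only on nodes labeled with the static policy. The pod name, image, and resource values are hypothetical; only the label key `alpha.kubernetes.io/cpu-policy` and the policy name `static` come from this proposal (and the label key itself is questioned in the review thread below).

```yaml
# Hypothetical pod manifest. The nodeSelector pins the pod to nodes whose
# proposed CPU-policy label is "static". Equal integer CPU requests and
# limits make this a Guaranteed pod, the class the static policy can grant
# exclusive cores to.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-app                 # hypothetical name
spec:
  nodeSelector:
    alpha.kubernetes.io/cpu-policy: static
  containers:
  - name: app
    image: example.com/app:1.0     # hypothetical image
    resources:
      requests:
        cpu: "2"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
```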
The number of CPUs that pods may run on can be implicitly controlled using the
existing node-allocatable configuration settings. See the [node allocatable
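For concreteness, the sketch below shows one way an admin might combine a CPU manager policy with node-allocatable CPU reservations. It uses the `KubeletConfiguration` file format accepted by later Kubelet releases, which is not part of this proposal; at the time of this PR the equivalent settings are command-line flags such as `--cpu-manager-policy` and `--kube-reserved`. The reservation values are arbitrary.

```yaml
# Sketch of a Kubelet configuration; values are illustrative only.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static    # corresponds to the --cpu-manager-policy flag above
kubeReserved:
  cpu: 500m                 # CPU held back for Kubernetes daemons
systemReserved:
  cpu: 500m                 # CPU held back for system daemons
# Allocatable CPU = capacity - kubeReserved - systemReserved, which
# implicitly bounds the CPUs available to pods on this node.
```

With such reservations in place, the node's allocatable CPU shrinks accordingly, which is the implicit control the paragraph above refers to.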
@@ -266,14 +270,22 @@ func (p *staticPolicy) UnregisterContainer(s State, containerID string) error {
cpuset for all containers running in the shared pool.
1. _The shared pool becomes empty._
    1. The CPU manager adds a node condition with effect NoSchedule,
       NoExecute that prevents BestEffort and Burstable QoS class pods from
       running on the node. BestEffort and Burstable QoS class pods are
       evicted from the node.
    1. The CPU manager sets a CPUPressure node condition to true that prevents
       BestEffort and Burstable QoS class pods with no CPU resource request
       from being scheduled on the node. Already running BestEffort and
       Burstable QoS class pods without a CPU resource request are evicted
       from the node.
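To make the proposed behavior concrete, the excerpt below sketches how the condition might appear in a node's status once the shared pool empties. The `CPUPressure` condition type is the one proposed in this change and does not exist today; the reason and message strings are invented for illustration.

```yaml
# Hypothetical node status excerpt while the shared pool is empty.
# BestEffort and Burstable pods with no CPU request would be kept off this
# node, and are allowed to schedule here again once the condition returns
# to "False".
status:
  conditions:
  - type: CPUPressure                # condition proposed in this change
    status: "True"
    reason: SharedCPUPoolExhausted   # hypothetical reason string
    message: all shared-pool CPUs are exclusively assigned
```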
NOTE: The decision to allow the shared pool to completely empty is
driven by a desire to keep the scheduler accounting simple. If a
--cpu-policy-static-min-shared-cpus flag were to exist, there is no simple way
to convey that information to the scheduler. The scheduler would then need to
know that some portion of the CPU Allocatable is "special", for use only by
non-exclusive containers.

Review comment: nit: add backticks to the flag name to avoid soft-wrapping.

Review comment: Possibly obvious, but consider also mentioning what would happen with a shared pool low-water-mark but without scheduler enhancements -- Kubelet would need to reject some G pods after scheduling, with high potential to confuse users.
1. _The shared pool becomes nonempty._
    1. The CPU manager removes the node condition with effect NoSchedule,
       NoExecute for BestEffort and Burstable QoS class pods.
    1. The CPU manager sets a CPUPressure node condition to false that allows
       BestEffort and Burstable QoS class pods with no CPU resource request to
       be scheduled on the node.

Review comment: As I mentioned in the PR, this node condition isn't necessary if we reserve a core by default, which I feel will be the user's expectation.

Review comment: xref #49186 (comment)
#### Policy 3: "dynamic" cpuset control
Review comment: there are efforts in progress (#911) to remove or restrict the ability of nodes to label themselves, since that allows them to steer workloads to themselves.
Review comment: Wow ok. Didn't know about that. I can either 1) remove this section completely or 2) s/will/can/ if a trusted node label mechanism configures it independently rather than the kubelet knowing its own config and reporting it through a label. What will happen to all the node labels in kubernetes.io that the node self-labels now?
Review comment: we'll have to whitelist them (which we may do for things like os/arch/instance-type) or silently drop them until we can remove them from the kubelet and get past the version skew window for kubelets that try to set them.
Review comment: using labels for this could still work, I would just expect them to be applied to nodes, rather than having nodes self-report them.
Review comment: I understand the spirit of the design, but it does mean that some labels the kubelet sets today based on its own introspection of the machine need to move to fields. Are we ok with the node controller then converting those fields to labels? Can CPU manager policy be included in that initial whitelist?
Review comment: @vishh - have you been tracking removing the ability for a node to self-label? Did you have other ideas for converting them to fields and then having the node controller assign them as labels/taints?
Review comment: We rely on Node labels heavily to address scheduling constraints. Fields in the node object cannot be used for scheduling purposes as of now.

Having users create labels that map to kubelet configuration is not a great user experience. I can see the security aspect of having nodes introduce arbitrary labels.

Perhaps a whitelist of label keys is a way to get around the security issue? Ideally such a whitelist would be configurable (ConfigMaps maybe?), which would give cluster admins the ability to choose what labels they'd like to expose and also extend Kubernetes easily.
Review comment: @derekwaynecarr #911 is news to me. I wonder why that proposal wasn't discussed in SIG-node yet.