storage capacity: object ownership #583
Conversation
/hold Let's merge this one last, once all other changes for storage capacity tracking are in.
/assign
Just curious: how does a label selector increase API server CPU load?
factoryForNamespace = informers.NewSharedInformerFactoryWithOptions(clientset,
	ctrl.ResyncPeriodOfCsiNodeInformer,
	informers.WithNamespace(namespace),
	informers.WithTweakListOptions(func(lo *metav1.ListOptions) {
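For readers following along, a tweak like this typically just sets a label selector so that the API server filters objects before sending them; a minimal sketch under assumed label keys and helper names (not necessarily what external-provisioner actually uses):

```go
package sketch

// A sketch, not the actual external-provisioner code: build an informer
// factory whose list/watch requests carry a label selector, so that the
// API server filters CSIStorageCapacity objects server-side.
import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

func newFilteredFactory(clientset kubernetes.Interface, namespace string, resync time.Duration, selectorLabels map[string]string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(clientset,
		resync,
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(lo *metav1.ListOptions) {
			// Only objects matching these labels are listed and watched,
			// e.g. {"csi.storage.k8s.io/drivername": "..."} (assumed key).
			lo.LabelSelector = labels.Set(selectorLabels).String()
		}),
	)
}
```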
I'm wondering how easy it is to detect if the informer doesn't fetch objects we expect, and whether we need to unit test this.
If informer logic is wrong, two things could happen:
- Missing items that are supposed to be included
  - May try to create an existing CSIStorageCapacity object. In the logic today this is fine, because nothing is done after object creation (besides logging / error reporting), and the creation error would give us a hint.
- Includes more things than it should
  - No failures, just more network traffic.
The latter case is hard to detect. So maybe a unit test is worthwhile (if it's not too much effort)?
Actually since this is just an options configuration, the logic is simple enough, so it's probably low risk. Totally up to you if you want to unit test this.
I'd prefer to not test this. I suspect that it wouldn't be easy to test comprehensively and in the end is more about correct implementation of the informer and apiserver filtering than it is about correct configuration.
I had exactly the same concern and asked on #sig-scalability. The response was that spending more CPU time on server-side filtering can be assumed to be a net win when it decreases the number of objects that need to be transferred. That costs both network bandwidth and CPU time for conversion.
When you have a moment, can you please address this merge conflict? We are planning to release this tomorrow to align with the Kubernetes 1.21 release.
Since I'm not familiar with API server code, I was actually wondering exactly how computationally heavy server-side filtering is. What does it do under the hood to perform this filtering?
Optionally, all CSIStorageCapacity objects created by an instance of
the external-provisioner can have an
[owner](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents).
This ensures that the objects get removed automatically when uninstalling |
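For illustration, attaching such an owner reference amounts to setting metadata on the CSIStorageCapacity object; a minimal sketch, with the StatefulSet owner kind and the helper name chosen here purely as assumptions:

```go
package sketch

import (
	storagev1alpha1 "k8s.io/api/storage/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// withOwner is an illustrative helper (not part of external-provisioner):
// it marks a CSIStorageCapacity object as controlled by some owner, so that
// Kubernetes garbage collection deletes it when that owner goes away.
func withOwner(capacity *storagev1alpha1.CSIStorageCapacity, ownerName string, ownerUID types.UID) {
	controller := true
	capacity.OwnerReferences = []metav1.OwnerReference{{
		APIVersion: "apps/v1",
		Kind:       "StatefulSet", // assumed owner kind for this example
		Name:       ownerName,
		UID:        ownerUID,
		Controller: &controller,
	}}
}
```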
Add: "Setting an owner reference is highly recommended whenever possible (i.e. in the most common case that drivers are run inside containers)."
What's the worst case scenario if orphaned objects aren't cleaned up?
For the case when all orphaned CSIStorageCapacity objects report not enough capacity for the volume to be provisioned, even though the driver is no longer there: the pod will fail to schedule and stay pending, whereas if objects are cleaned up properly, the pod is scheduled and the volume fails provisioning, also causing the pod to stay pending. Once the driver comes back, CSIStorageCapacity objects will reflect the actual state again. So I think we are good in this case.
When we video chatted, I think we talked about a case where orphaned objects could cause the scheduler to hang? But I totally forgot which case it was :(. Maybe it's actually OK?
Added.
As you said, orphaned and incorrect objects cause no permanent damage for pods with normal PVCs (which includes generic ephemeral volumes), so it's not too bad. It's worse for CSI ephemeral volumes because then a pod might get scheduled onto a node where the driver then cannot provide the volume and the pod is stuck there (no rescheduling). I think that was what we chatted about, not the scheduler itself hanging.
Ah yeah. Could you maybe mention this in the design doc somewhere (if it's not there already)?
It is covered by the generic ephemeral volume KEP.
README.md
determine with the `POD_NAME/NAMESPACE` environment variables and
the `--capacity-ownerref-level` parameter.

If ownership is disabled, then the following command can be used to clean |
Add: "If the ownership is disabled, the storage admin is responsible for removing orphaned CSIStorageCapacity objects, and the following..."
Added.
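As an aside, such a cleanup boils down to a delete-by-label-selector; a hypothetical client-go sketch (the selector value shown in the comment is an assumption, the command documented in the README is authoritative):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupOrphaned is a hypothetical helper: it deletes all CSIStorageCapacity
// objects in a namespace that match a label selector, e.g.
// "csi.storage.k8s.io/managed-by=external-provisioner" (assumed key/value).
func cleanupOrphaned(ctx context.Context, clientset kubernetes.Interface, namespace, selector string) error {
	return clientset.StorageV1alpha1().CSIStorageCapacities(namespace).DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: selector},
	)
}
```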
@@ -92,7 +92,7 @@ var (
 	enableCapacity = flag.Bool("enable-capacity", false, "This enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call.")
 	capacityImmediateBinding = flag.Bool("capacity-for-immediate-binding", false, "Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging.")
 	capacityPollInterval = flag.Duration("capacity-poll-interval", time.Minute, "How long the external-provisioner waits before checking for storage capacity changes.")
-	capacityOwnerrefLevel = flag.Int("capacity-ownerref-level", 1, "The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc.")
+	capacityOwnerrefLevel = flag.Int("capacity-ownerref-level", 1, "The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: -1 for no owner, 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc.")
1 for a StatefulSet or DaemonSet
added
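For context on the flag's semantics, resolving the owner at level N means walking N controlling-ownerReference hops up from the pod; a rough sketch of the idea under assumed helper names (not the actual implementation, and only levels 0 and 1 are spelled out):

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ownerAtLevel sketches the idea behind --capacity-ownerref-level: level 0
// returns a reference to the pod itself, level 1 the pod's controlling owner
// (e.g. a StatefulSet or DaemonSet). Higher levels repeat the "fetch object,
// follow its controlling ownerReference" step, which needs a dynamic client
// and is elided here. A negative level means "no owner".
func ownerAtLevel(ctx context.Context, clientset kubernetes.Interface, namespace, podName string, level int) (*metav1.OwnerReference, error) {
	if level < 0 {
		return nil, nil // ownership disabled
	}
	pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	if level == 0 {
		return &metav1.OwnerReference{APIVersion: "v1", Kind: "Pod", Name: pod.Name, UID: pod.UID}, nil
	}
	for i := range pod.OwnerReferences {
		ref := pod.OwnerReferences[i]
		if ref.Controller != nil && *ref.Controller {
			return &ref, nil // level 1: the pod's controlling owner
		}
	}
	return nil, fmt.Errorf("pod %s/%s has no controlling owner", namespace, podName)
}
```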
@@ -551,13 +561,17 @@ func (c *Controller) syncCapacity(ctx context.Context, item workItem) error {
 		// would race with receiving that object through the event handler. In the unlikely
 		// scenario that we end up creating two objects for the same work item, the second
 		// one will be recognized as duplicate and get deleted again once we receive it.
-	} else if capacity.Capacity.Value() == quantity.Value() {
+	} else if capacity.Capacity.Value() == quantity.Value() &&
+		(c.owner == nil || c.isOwnedByUs(capacity)) {
Should this be `isManaged()`? If a driver running in binary mode manages an object which is unchanged, it should still skip, right?
This really is an ownerref check. It covers the scenario where a driver was deployed without ownerref and then gets redeployed with ownerref. The existing objects then need to be updated even when the capacity is the same, which is achieved here by falling through to the final else clause.
I added "same capacity %v and correct owner" to the log message to make this more clear.
pkg/capacity/capacity.go
		if owner.Controller != nil && *owner.Controller && owner.UID == c.owner.UID {
			return true
		}
	}
	return false
}

// isManaged checks the labels to determine whether this capacity object is managed by
// the controller instance.
Since the informer is already filtered by labels in the next commit, and all callers of `isManaged()` fetch from the informer, is this mostly here as a safeguard? If so, nit: could you add a comment to clarify?
Correct. I added a comment.
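To make the safeguard concrete, a label-based check like this reduces to comparing a couple of labels on the object; a sketch with assumed label keys and function name, not taken from the actual code:

```go
package sketch

import storagev1alpha1 "k8s.io/api/storage/v1alpha1"

// Assumed label keys, for illustration only.
const (
	driverNameLabel = "csi.storage.k8s.io/drivername"
	managedByLabel  = "csi.storage.k8s.io/managed-by"
)

// isManagedSketch reports whether a CSIStorageCapacity object carries the
// labels that mark it as belonging to this controller instance. With the
// label-filtered informer in place this is mostly a safeguard, because the
// informer should only ever deliver matching objects.
func isManagedSketch(capacity *storagev1alpha1.CSIStorageCapacity, driverName, managedBy string) bool {
	return capacity.Labels[driverNameLabel] == driverName &&
		capacity.Labels[managedByLabel] == managedBy
}
```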
@@ -587,9 +601,8 @@ func (c *Controller) deleteCapacity(ctx context.Context, capacity *storagev1alph
 	// ensures that it gets deleted if no longer needed. Foreign objects
nit: update function name in comment
Fixed.
pkg/capacity/capacity_test.go
@@ -1217,7 +1264,8 @@ func fakeController(ctx context.Context, client *fakeclientset.Clientset, storag
 		driverName,
 		client,
 		queue,
-		defaultOwner,
+		&defaultOwner,
Should there be a test case for "managed by us but not owned by us"?
There is "reuse one capacity object, add owner" for that.
The inverse "reuse one capacity object, remove owner" doesn't work because the driver instance without owner configured doesn't know which owner references it is expected to remove. This shouldn't be needed in practice and the documentation mentions it.
I was actually referring to a test that captures the container-less driver case, where ownerref stays nil always. IIUC, all the tests so far assume that capacity objects are eventually managed by a driver running in containers (i.e. owner is always set to `defaultOwner`).
If you agree this test is necessary I would prefer that the test is added in this PR rather than delaying it after the release, because it is a use case we plan to support. Hopefully it's not a lot of work.
You are right, that is a supported use case and wasn't covered. I've added two test cases for it which cover the relevant difference (create objects without owner reference, keep existing references).
It has an in-memory representation of the object and then simply checks the labels in the metadata. In that sense it's simply brute force. But the in-memory representation is needed anyway, and the check is still faster than the rest of the processing that would need to happen when sending the object. At least that is my understanding...
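In code terms, that per-object check is essentially a selector match against the object's labels; a minimal illustration with apimachinery (not the apiserver's actual code path, and the selector string is an assumed example):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// The server evaluates the request's label selector against each object's
	// labels in memory; conceptually it is just this kind of match.
	selector, err := labels.Parse("csi.storage.k8s.io/drivername=hostpath.csi.k8s.io") // assumed selector
	if err != nil {
		panic(err)
	}
	objLabels := labels.Set{"csi.storage.k8s.io/drivername": "hostpath.csi.k8s.io"}
	fmt.Println(selector.Matches(objLabels)) // true: the object would be returned
}
```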
I pushed an updated version. As before, it is based on PR #586.
What about updating client-go in external-provisioner to 1.21.0? Kubernetes 1.21 must be released first, then I can update client-go, and only then can we release external-provisioner. Or we release external-provisioner with the current code. None of the differences seem relevant. I created the update for rc0 here: #599
When the provisioner has created or deleted a volume, storage capacity is likely to be different than before. We have some information (storage class, topology) which allows us to schedule a subset of the CSIStorageCapacity objects for an update sooner than the normal periodic update.
@pohly I had planned on updating to 1.21 when it was released. Thank you for taking care of this.
Yeah, that makes sense. Some sort of index could maybe speed up processing, but maybe that's overkill 🤷
One more potential change, everything else lgtm
-	} else if capacity.Capacity.Value() == quantity.Value() {
-		klog.V(5).Infof("Capacity Controller: no need to update %s for %+v, same capacity %v", capacity.Name, item, quantity)
+	} else if capacity.Capacity.Value() == quantity.Value() &&
+		(c.owner == nil || c.isOwnedByUs(capacity)) {
I see, and `c.owner == nil` covers the case that a driver stays owner-less. Sgtm!
/lgtm /assign @msau42
/approve
just some nits
README.md
[owner](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents).
This ensures that the objects get removed automatically when uninstalling
the CSI driver. The owner is
determine with the `POD_NAME/NAMESPACE` environment variables and
typo: determined
README.md
@@ -76,7 +76,7 @@ See the [storage capacity section](#capacity-support) below for details.
 
 * `--enable-capacity`: This enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. The default is to not produce CSIStorageCapacity objects.
 
-* `--capacity-ownerref-level <levels>`: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet).
+* `--capacity-ownerref-level <levels>`: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet). Ownership is optional and can be disabled with -1.
Update this to also include DaemonSet
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: msau42, pohly.
The mandatory ownership reference prevented deploying external-provisioner outside of the Kubernetes cluster and made it impossible to reuse objects when rolling out an updated CSI driver which used distributed provisioning. It's better to identify managed objects with labels. This also makes it possible to optimize the watch (to be implemented in a separate commit).
This reduces network traffic and is beneficial for:
- multiple CSI drivers in the same namespace (filtered by driver name)
- distributed provisioning (filtered by manager ID)

CPU load on the kube-apiserver will be higher, but the reduction in network traffic is expected to make this worthwhile.
When the controller runs without an owner configured, it
- must not add one (obviously...)
- must preserve existing owner references
New changes are detected. LGTM label has been removed.
Both fixed, please lgtm again. /hold cancel
What type of PR is this?
/kind api-change
What this PR does / why we need it:
This promotes "storage capacity tracking" to beta and changes its usage as agreed upon during the KEP review for beta (optional ownership, labels used to identify the objects managed by each instance).
Which issue(s) this PR fixes:
Fixes #465
Special notes for your reviewer:
Based on #579
Does this PR introduce a user-facing change?: