storage capacity: object ownership #583

Merged (4 commits) on Apr 9, 2021
64 changes: 47 additions & 17 deletions README.md
@@ -76,7 +76,7 @@ See the [storage capacity section](#capacity-support) below for details.

* `--enable-capacity`: This enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. The default is to not produce CSIStorageCapacity objects.

* `--capacity-ownerref-level <levels>`: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet).
* `--capacity-ownerref-level <levels>`: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet or DaemonSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet). Ownership is optional and can be disabled with -1.

* `--capacity-threads <num>`: Number of simultaneously running threads, handling CSIStorageCapacity objects. Defaults to `1`.
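
For illustration only, these flags might be combined in a provisioner container roughly as follows; the image and socket path are placeholders, not taken from this repository:

```yaml
# Sketch: central provisioning with capacity support enabled.
- name: csi-provisioner
  image: csi-provisioner:latest        # placeholder image
  args:
    - "--csi-address=$(ADDRESS)"
    - "--enable-capacity"
    - "--capacity-ownerref-level=2"    # owner is the Deployment
    - "--capacity-threads=1"
  env:
    - name: ADDRESS
      value: /csi/csi.sock             # placeholder socket path
```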

@@ -136,34 +136,59 @@ No | Irrelevant | No | No | `Requisite` and `Preferred` both nil

### Capacity support

> :warning: *Warning:* This is an alpha feature and only supported by
> Kubernetes >= 1.19 if the `CSIStorageCapacity` feature gate is
> enabled.

The external-provisioner can be used to create CSIStorageCapacity
objects that hold information about the storage capacity available
through the driver. The Kubernetes scheduler then [uses that
information](https://kubernetes.io/docs/concepts/storage/storage-capacity)
when selecting nodes for pods with unbound volumes that wait for the
first consumer.
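
This only matters for volumes that use delayed binding. A minimal StorageClass for that case might look like this (the driver name is a placeholder):

```yaml
# Delayed binding: the scheduler picks a node first, taking CSIStorageCapacity
# objects into account, then the volume gets provisioned for that node.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-sc
provisioner: my-csi.example.com        # placeholder CSI driver name
volumeBindingMode: WaitForFirstConsumer
```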

Currently, all CSIStorageCapacity objects created by an instance of
the external-provisioner must have the same
[owner](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents). That
owner is how external-provisioner distinguishes between objects that
it must manage and those that it must leave alone. The owner is
determined with the `POD_NAME/POD_NAMESPACE` environment variables and
the `--capacity-ownerref-level` parameter. Other solutions will be
added in the future.
All CSIStorageCapacity objects created by an instance of
the external-provisioner have certain labels:

* `csi.storage.k8s.io/drivername`: the CSI driver name
* `csi.storage.k8s.io/managed-by`: external-provisioner for central
provisioning, external-provisioner-<node name> for distributed
provisioning

They get created in the namespace identified with the `NAMESPACE`
environment variable.

Each external-provisioner instance manages exactly those objects with
the labels that correspond to the instance.
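
As a rough sketch, such an object might look like this for central provisioning (the storage.k8s.io API version differs by Kubernetes release; all names and values are placeholders):

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: csisc-abc123                   # placeholder, normally generated
  namespace: my-namespace              # the NAMESPACE of the provisioner
  labels:
    csi.storage.k8s.io/drivername: my-csi.example.com
    csi.storage.k8s.io/managed-by: external-provisioner
storageClassName: example-sc
nodeTopology:
  matchLabels:
    topology.my-csi.example.com/node: node-1   # placeholder topology segment
capacity: 100Gi
```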

Optionally, all CSIStorageCapacity objects created by an instance of
the external-provisioner can have an
[owner](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents).
This ensures that the objects get removed automatically when uninstalling
the CSI driver. The owner is
determined with the `POD_NAME/NAMESPACE` environment variables and
the `--capacity-ownerref-level` parameter. Setting an owner reference is highly
recommended whenever possible (i.e. in the most common case that drivers are
run inside containers).

Review comments on this paragraph:

Contributor: Add: "Setting an owner reference is highly recommended whenever possible (i.e. in the most common case that drivers are run inside containers)."

Contributor: What's the worst case scenario if orphaned objects aren't cleaned up?

For the case when all orphaned CSIStorageCapacity objects report not enough capacity for the volume to be provisioned, even though the driver is no longer there: the pod will fail to schedule and stay pending, whereas if objects are cleaned up properly, the pod is scheduled and the volume fails provisioning, also causing the pod to stay pending. Once the driver comes back, CSIStorageCapacity objects will reflect the actual state again. So I think we are good in this case.

When we video chatted, I think we talked about a case where orphaned objects could cause the scheduler to hang? But I totally forgot which case it was :(. Maybe it's actually OK?

Contributor (author): Added.

As you said, orphaned and incorrect objects cause no permanent damage for pods with normal PVCs (which includes generic ephemeral volumes), so it's not too bad. It's worse for CSI ephemeral volumes because then a pod might get scheduled onto a node where the driver then cannot provide the volume and the pod is stuck there (no rescheduling). I think that was what we chatted about, not the scheduler itself hanging.

Contributor: Ah yeah. Could you maybe mention this in the design doc somewhere (if it's not there already)?

Contributor (author): It is covered by the generic ephemeral volume KEP.
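
When ownership is enabled, for example with `--capacity-ownerref-level=2` in a central deployment, each object carries an owner reference pointing at the Deployment, roughly like this sketch (name and UID are placeholders):

```yaml
metadata:
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: csi-provisioner            # placeholder Deployment name
      uid: 11111111-2222-3333-4444-555555555555   # placeholder
```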

If ownership is disabled the storage admin is responsible for removing
orphaned CSIStorageCapacity objects, and the following command can be
used to clean up orphaned objects of a driver:

```
kubectl delete csistoragecapacities -l csi.storage.k8s.io/drivername=my-csi.example.com
```

When switching from a deployment without ownership to one with
ownership, managed objects get updated such that they have the
configured owner. When switching in the other direction, the owner
reference is not removed because the new deployment doesn't know
what the old owner was.
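
When in doubt which owner, if any, the existing objects currently have, they can be inspected, for example with:

```
kubectl get csistoragecapacities -o custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name
```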

To enable this feature in a driver deployment with a central controller (see also the
[`deploy/kubernetes/storage-capacity.yaml`](deploy/kubernetes/storage-capacity.yaml)
example):

- Set the `POD_NAME` and `POD_NAMESPACE` environment variables like this:
- Set the `POD_NAME` and `NAMESPACE` environment variables like this:
```yaml
env:
- name: POD_NAMESPACE
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
@@ -240,10 +265,15 @@ other driver.
The deployment with [distributed
provisioning](#distributed-provisioning) is almost the same as above,
with some minor changes:
- Use `--capacity-ownerref-level=0` and the `POD_NAMESPACE/POD_NAME`
variables to make the pod that contains the external-provisioner
- Use `--capacity-ownerref-level=1` and the `NAMESPACE/POD_NAME`
variables to make the DaemonSet that contains the external-provisioner
the owner of CSIStorageCapacity objects for the node.
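
A rough sketch of the corresponding container snippet inside the CSI driver's DaemonSet (everything except the flag and variable names is a placeholder):

```yaml
- name: csi-provisioner
  image: csi-provisioner:latest        # placeholder image
  args:
    - "--csi-address=$(ADDRESS)"
    - "--node-deployment"
    - "--enable-capacity"
    - "--capacity-ownerref-level=1"    # owner is the DaemonSet
  env:
    - name: ADDRESS
      value: /csi/csi.sock             # placeholder socket path
    - name: NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
```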

Deployments of external-provisioner outside of the Kubernetes cluster
are also possible, albeit only without an owner for the objects.
`NAMESPACE` still needs to be set to some existing namespace also
in this case.

### CSI error and timeout handling
The external-provisioner invokes all gRPC calls to the CSI driver with the timeout provided by the `--timeout` command line argument (15 seconds by default).
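
As a minimal sketch of that pattern (not the provisioner's actual code), each call runs under its own context deadline:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callWithTimeout illustrates the general pattern used for CSI calls:
// every gRPC call runs under a context whose deadline comes from --timeout.
func callWithTimeout(timeout time.Duration, call func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return call(ctx)
}

func main() {
	// The provisioner's default --timeout is 15 seconds; a short value is
	// used here only to demonstrate the deadline quickly.
	err := callWithTimeout(100*time.Millisecond, func(ctx context.Context) error {
		<-ctx.Done() // stand-in for a slow CSI call such as CreateVolume
		return ctx.Err()
	})
	fmt.Println(err) // context deadline exceeded
}
```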

88 changes: 56 additions & 32 deletions cmd/csi-provisioner/csi-provisioner.go
@@ -34,6 +34,7 @@ import (
v1 "k8s.io/api/core/v1"
storagev1 "k8s.io/api/storage/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
"k8s.io/apimachinery/pkg/runtime/schema"
utilfeature "k8s.io/apiserver/pkg/util/feature"
"k8s.io/client-go/informers"
@@ -92,7 +93,7 @@ var (
enableCapacity = flag.Bool("enable-capacity", false, "This enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call.")
capacityImmediateBinding = flag.Bool("capacity-for-immediate-binding", false, "Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging.")
capacityPollInterval = flag.Duration("capacity-poll-interval", time.Minute, "How long the external-provisioner waits before checking for storage capacity changes.")
capacityOwnerrefLevel = flag.Int("capacity-ownerref-level", 1, "The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc.")
capacityOwnerrefLevel = flag.Int("capacity-ownerref-level", 1, "The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: -1 for no owner, 0 for the pod itself, 1 for a StatefulSet or DaemonSet, 2 for a Deployment, etc.")

enableNodeDeployment = flag.Bool("node-deployment", false, "Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes.")
nodeDeploymentImmediateBinding = flag.Bool("node-deployment-immediate-binding", true, "Determines whether immediate binding is supported when deployed on each node.")
@@ -393,39 +394,30 @@ func main() {
nodeDeployment,
)

provisionController = controller.NewProvisionController(
clientset,
provisionerName,
csiProvisioner,
serverVersion.GitVersion,
provisionerOptions...,
)

csiClaimController := ctrl.NewCloningProtectionController(
clientset,
claimLister,
claimInformer,
claimQueue,
controllerCapabilities,
)

var capacityController *capacity.Controller
if *enableCapacity {
podName := os.Getenv("POD_NAME")
namespace := os.Getenv("POD_NAMESPACE")
if podName == "" || namespace == "" {
klog.Fatalf("need POD_NAMESPACE/POD_NAME env variables, have only POD_NAMESPACE=%q and POD_NAME=%q", namespace, podName)
namespace := os.Getenv("NAMESPACE")
if namespace == "" {
klog.Fatal("need NAMESPACE env variable for CSIStorageCapacity objects")
}
controller, err := owner.Lookup(config, namespace, podName,
schema.GroupVersionKind{
Group: "",
Version: "v1",
Kind: "Pod",
}, *capacityOwnerrefLevel)
if err != nil {
klog.Fatalf("look up owner(s) of pod: %v", err)
var controller *metav1.OwnerReference
if *capacityOwnerrefLevel >= 0 {
podName := os.Getenv("POD_NAME")
if podName == "" {
klog.Fatal("need POD_NAME env variable to determine CSIStorageCapacity owner")
}
var err error
controller, err = owner.Lookup(config, namespace, podName,
schema.GroupVersionKind{
Group: "",
Version: "v1",
Kind: "Pod",
}, *capacityOwnerrefLevel)
if err != nil {
klog.Fatalf("look up owner(s) of pod: %v", err)
}
klog.Infof("using %s/%s %s as owner of CSIStorageCapacity objects", controller.APIVersion, controller.Kind, controller.Name)
}
klog.Infof("using %s/%s %s as owner of CSIStorageCapacity objects", controller.APIVersion, controller.Kind, controller.Name)

var topologyInformer topology.Informer
if nodeDeployment == nil {
@@ -447,11 +439,23 @@ func main() {
topologyInformer = topology.NewFixedNodeTopology(&segment)
}

managedByID := "external-provisioner"
if *enableNodeDeployment {
managedByID += "-" + node
}

// We only need objects from our own namespace. The normal factory would give
// us an informer for the entire cluster.
// us an informer for the entire cluster. We can further restrict the
// watch to just those objects with the right labels.
factoryForNamespace = informers.NewSharedInformerFactoryWithOptions(clientset,
ctrl.ResyncPeriodOfCsiNodeInformer,
informers.WithNamespace(namespace),
informers.WithTweakListOptions(func(lo *metav1.ListOptions) {
lo.LabelSelector = labels.Set{
capacity.DriverNameLabel: provisionerName,
capacity.ManagedByLabel: managedByID,
}.AsSelector().String()
}),
)

Review comments on the label selector above:

Contributor: I'm wondering how easy it is to detect if the informer doesn't fetch objects we expect, and whether we need to unit test this.

If informer logic is wrong, two things could happen:

* Missing items that are supposed to be included
  * May try to create an existing CSIStorageCapacity object. In the logic today this is fine, because nothing is done after object creation (besides logging / error reporting), and the creation error would give us a hint.
* Includes more things than it should
  * No failures, just more network traffic.

The latter case is hard to detect. So maybe a unit test is worthwhile (if it's not too much effort)?

Contributor: Actually since this is just an options configuration, the logic is simple enough, so it's probably low risk. Totally up to you if you want to unit test this.

Contributor (author): I'd prefer to not test this. I suspect that it wouldn't be easy to test comprehensively and in the end is more about correct implementation of the informer and apiserver filtering than it is about correct configuration.

capacityController = capacity.NewCentralCapacityController(
@@ -460,7 +464,8 @@
clientset,
// Metrics for the queue is available in the default registry.
workqueue.NewNamedRateLimitingQueue(rateLimiter, "csistoragecapacity"),
*controller,
controller,
managedByID,
namespace,
topologyInformer,
factory.Storage().V1().StorageClasses(),
@@ -469,8 +474,27 @@
*capacityImmediateBinding,
)
legacyregistry.CustomMustRegister(capacityController)

// Wrap Provision and Delete to detect when it is time to refresh capacity.
csiProvisioner = capacity.NewProvisionWrapper(csiProvisioner, capacityController)
}

provisionController = controller.NewProvisionController(
clientset,
provisionerName,
csiProvisioner,
serverVersion.GitVersion,
provisionerOptions...,
)

csiClaimController := ctrl.NewCloningProtectionController(
clientset,
claimLister,
claimInformer,
claimQueue,
controllerCapabilities,
)

// Start HTTP server, regardless whether we are the leader or not.
if addr != "" {
// To collect metrics data from the metric handler itself, we
4 changes: 2 additions & 2 deletions deploy/kubernetes/storage-capacity.yaml
@@ -26,12 +26,12 @@ spec:
args:
- "--csi-address=$(ADDRESS)"
- "--leader-election"
- "--enable-capacity=central"
- "--enable-capacity"
- "--capacity-ownerref-level=2"
env:
- name: ADDRESS
value: /var/lib/csi/sockets/pluginproxy/mock.socket
- name: POD_NAMESPACE
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
1 change: 1 addition & 0 deletions go.mod
@@ -18,6 +18,7 @@ require (
github.com/prometheus/common v0.19.0 // indirect
github.com/prometheus/procfs v0.6.0 // indirect
github.com/spf13/pflag v1.0.5
github.com/stretchr/testify v1.7.0
golang.org/x/crypto v0.0.0-20210317152858-513c2a44f670 // indirect
golang.org/x/net v0.0.0-20210316092652-d523dce5a7f4 // indirect
golang.org/x/oauth2 v0.0.0-20210313182246-cd4f82c27b84 // indirect
3 changes: 2 additions & 1 deletion go.sum
@@ -685,8 +685,9 @@ github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXf
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4=
github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA=
github.com/stretchr/testify v1.6.1 h1:hDPOHmpOpP40lSULcqw7IrRb/u7w6RpDC9399XyoNd0=
github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.7.0 h1:nwc3DEeHmmLAfoZucVR881uASk0Mfjw8xYJ99tb5CcY=
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/subosito/gotenv v1.2.0/go.mod h1:N0PQaV/YGNqwC0u51sEeR/aUtSLEXKX9iv69rRypqCw=
github.com/syndtr/gocapability v0.0.0-20180916011248-d98352740cb2/go.mod h1:hkRG7XYTFWNJGYcbNJQlaLq0fg1yr4J4t/NcTQtrfww=
github.com/thecodeteam/goscaleio v0.1.0/go.mod h1:68sdkZAsK8bvEwBlbQnlLS+xU+hvLYM/iQ8KXej1AwM=