Return success in CreateVolume when disk is READY #527

saikat-royc · 2020-06-12T00:40:48Z

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
In the PD CSI driver's CreateVolume call, wait for the disk to reach READY status before returning success to the caller.

Which issue(s) this PR fixes:

Fixes #526

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

In the GCE PersistentDisk CSI driver's CreateVolume call, wait for the disk to reach READY status before returning success to the caller.

k8s-ci-robot · 2020-06-12T00:40:56Z

Hi @saikat-royc. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

msau42 · 2020-06-12T00:51:23Z

/ok-to-test
will let @mattcary review first :-)

mattcary

/lgtm

Thanks!

pkg/gce-cloud-provider/compute/cloud-disk.go

k8s-ci-robot · 2020-06-12T01:41:58Z

@mattcary: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Thanks!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

saikat-royc · 2020-06-12T23:18:13Z

The unit tests and sanity tests need to be looked into before final review.

msau42 · 2020-06-16T00:06:15Z

pkg/gce-pd-csi-driver/controller.go

@@ -187,6 +220,15 @@ func (gceCS *GCEControllerServer) CreateVolume(ctx context.Context, req *csi.Cre
 	default:
 		return nil, status.Error(codes.InvalidArgument, fmt.Sprintf("CreateVolume replication type '%s' is not supported", params.ReplicationType))
 	}
+
+	ready, err := isDiskReady(disk)


Can we potentially extend waitForOp instead, so that we don't immediately fail? I imagine this delay will happen frequently anytime someone restores from a snapshot.

Will look into it. I was following the snapshot model of ready check.

Ah, the snapshot API specifically says that it should return once the snapshot is cut and before it's ready, especially if it would take a significant amount of time to get ready. But the CreateVolume API doesn't have a requirement like that.

Looking into the code and test logs: the insert call returns an insert op, which the driver polls until completion. Here is the timeout behavior:
The timeout for polling the insert operation is 5 min (polled every 3 secs).
However, the caller (the csi-provisioner sidecar in this case) gives the context a deadline of 10 secs, which means the poll operation itself gets cancelled at the end of 10 secs.

The current implementation and placement of isDiskReady (in the CreateVolume call) looks right to me, as it's a common place for both regional and zonal disks.
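
For readers following along, a readiness check of this kind might look roughly like the sketch below. This is not the PR's implementation: the `Disk` type is a pared-down, hypothetical stand-in for the driver's cloud disk wrapper, and the status strings assume the values documented for the Compute Engine disk resource (CREATING, RESTORING, FAILED, READY, DELETING).

```go
package main

import "fmt"

// Disk is a hypothetical, pared-down stand-in for the driver's cloud disk wrapper.
type Disk struct {
	Name   string
	Status string
}

// isDiskReady reports whether the disk can be handed back to the caller.
// How each status maps to a result is an assumption made for this sketch.
func isDiskReady(d *Disk) (bool, error) {
	switch d.Status {
	case "READY":
		return true, nil
	case "FAILED":
		// A failed disk will not become ready on its own, so surface an error.
		return false, fmt.Errorf("disk %s is in FAILED state", d.Name)
	default:
		// CREATING, RESTORING, DELETING, or an unrecognized status: not ready.
		return false, nil
	}
}

func main() {
	for _, s := range []string{"CREATING", "RESTORING", "READY", "FAILED"} {
		ready, err := isDiskReady(&Disk{Name: "pd-1", Status: s})
		fmt.Println(s, ready, err)
	}
}
```

Because the check only inspects the disk object, it works the same way for zonal and regional disks, which is the point made above about placing it in CreateVolume.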

Possible changes in the external provisioner to reduce the number of errors:
We can look at changing the timeout to a larger value (on the order of minutes, say 1-2 mins). As an additional optimization, the larger timeout could be used only when a volume datasource is provided for provisioning (https://github.com/saikat-royc/external-provisioner/blob/master/pkg/controller/controller.go#L459).
To keep the external provisioner code generic, a new flag (say provisionFromDatasourceOperationTimeout, default 10 secs) can be exposed (https://github.com/saikat-royc/external-provisioner/blob/master/cmd/csi-provisioner/csi-provisioner.go#L65) and set in the PD CSI driver pod spec (https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/deploy/kubernetes/base/controller/controller.yaml#L34).

Let me know your thoughts, @msau42 @mattcary.

I think we can use the same timeout value for all operations that the provisioner calls. 10 seconds is probably too short.

+1 Increasing the timeout is straightforward.

Can we get stats from GKE clusters to see how long attach time typically is? (not for snapshots, just to get a sense of how often these errors are happening)

Saikat explained to me that the scenario we're seeing right now is that waitForOp times out after 10 seconds, and we then run into the "check if disk already exists" logic, which returns immediately. If we increase the timeout, then the waitForOp here should not return until the disk is ready. So given that, I think increasing the timeout is sufficient and we don't need to add an additional wait for ready (returning an error is fine).

pkg/gce-pd-csi-driver/controller.go

pkg/gce-pd-csi-driver/controller_test.go

msau42 · 2020-06-18T21:13:07Z

/lgtm
/approve

k8s-ci-robot · 2020-06-18T21:13:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, saikat-royc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [msau42]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from jingxu97 and msau42 June 12, 2020 00:40

k8s-ci-robot added the size/M label Jun 12, 2020

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Jun 12, 2020

mattcary reviewed Jun 12, 2020

pkg/gce-cloud-provider/compute/cloud-disk.go (outdated)

saikat-royc force-pushed the issue-526 branch 2 times, most recently from 224ea5e to c5960a5 Compare June 12, 2020 23:12

saikat-royc force-pushed the issue-526 branch from c5960a5 to 0bf6df4 Compare June 15, 2020 18:44

msau42 reviewed Jun 16, 2020

msau42 reviewed Jun 17, 2020

pkg/gce-pd-csi-driver/controller.go

saikat-royc mentioned this pull request Jun 17, 2020

Volume restored from snapshot is not ready for use after provisioning #482

Closed

saikat-royc force-pushed the issue-526 branch from 0bf6df4 to 8ce32f8 Compare June 17, 2020 23:51

k8s-ci-robot added the size/L label and removed the size/M label Jun 17, 2020

saikat-royc force-pushed the issue-526 branch 2 times, most recently from 0ce2aac to ca085e4 Compare June 18, 2020 04:33

msau42 reviewed Jun 18, 2020

pkg/gce-pd-csi-driver/controller_test.go (outdated)

Return success in CreateVolume when disk is READY

6962e03

saikat-royc force-pushed the issue-526 branch from ca085e4 to 6962e03 Compare June 18, 2020 20:26

k8s-ci-robot assigned msau42 Jun 18, 2020

k8s-ci-robot added the lgtm label Jun 18, 2020

k8s-ci-robot added the approved label Jun 18, 2020

k8s-ci-robot merged commit c16e7d1 into kubernetes-sigs:master Jun 18, 2020

msau42 mentioned this pull request Aug 5, 2020

Tests for idempotency #568

Open

arianitu mentioned this pull request Jan 14, 2021

Behaviour of controller.CreateVolume when a Snapshot is Not Ready? #694

Closed

saikat-royc commented Jun 12, 2020

k8s-ci-robot commented Jun 12, 2020

msau42 commented Jun 12, 2020

mattcary left a comment

k8s-ci-robot commented Jun 12, 2020

saikat-royc commented Jun 12, 2020

msau42 Jun 16, 2020

saikat-royc Jun 16, 2020

msau42 Jun 16, 2020

saikat-royc Jun 16, 2020

msau42 Jun 17, 2020

mattcary Jun 17, 2020

msau42 Jun 17, 2020

msau42 commented Jun 18, 2020

k8s-ci-robot commented Jun 18, 2020
