Return success in CreateVolume when disk is READY #527

Merged
merged 1 commit into from
Jun 18, 2020

Conversation

saikat-royc
Member

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
In the PD CSI driver's CreateVolume call, wait for the disk to reach READY status before returning success to the caller.
Which issue(s) this PR fixes:

Fixes #526

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

In the GCE PersistentDisk CSI driver's CreateVolume call, wait for the disk to reach READY status before returning success to the caller.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2020
@k8s-ci-robot
Contributor

Hi @saikat-royc. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested review from jingxu97 and msau42 June 12, 2020 00:40
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 12, 2020
@msau42
Contributor

msau42 commented Jun 12, 2020

/ok-to-test
will let @mattcary review first :-)

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2020
Contributor

@mattcary mattcary left a comment

/lgtm

Thanks!

@k8s-ci-robot
Contributor

@mattcary: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Thanks!

@saikat-royc saikat-royc force-pushed the issue-526 branch 2 times, most recently from 224ea5e to c5960a5 Compare June 12, 2020 23:12
@saikat-royc
Member Author

The unit tests and sanity tests need to be looked into before final review.

@@ -187,6 +220,15 @@ func (gceCS *GCEControllerServer) CreateVolume(ctx context.Context, req *csi.Cre
	default:
		return nil, status.Error(codes.InvalidArgument, fmt.Sprintf("CreateVolume replication type '%s' is not supported", params.ReplicationType))
	}

	ready, err := isDiskReady(disk)
Contributor

Can we potentially extend waitForOp instead, so that we don't immediately fail? I imagine this delay will happen frequently anytime someone restores from a snapshot.

Member Author

Will look into it. I was following the snapshot model of ready check.

Contributor

Ah, the snapshot API specifically says that it should return once the snapshot is cut and before it's ready, especially if it would take a significant amount of time to get ready. But the CreateVolume API doesn't have a requirement like that.

Member Author

Looking into the code and test logs, the insert call returns an insert operation that the driver polls until completion. Here is the timeout behavior:
The timeout for polling the insert operation is 5 minutes (polled every 3 seconds).
However, the caller (the csi-provisioner sidecar in this case) gives the context a deadline of 10 seconds, which means the poll itself is cancelled after 10 seconds.

The current implementation and placement of isDiskReady (in the CreateVolume call) looks to me like the right place, as it's common to both regional and zonal disks.

Possible changes in external provisioner to reduce number of errors:
We can look at changing the timeout to a larger value (on the order of minutes, say 1-2 minutes). As an additional optimization, the larger timeout could be used only when a volume data source is provided for provisioning (https://github.com/saikat-royc/external-provisioner/blob/master/pkg/controller/controller.go#L459).
To keep the external provisioner code generic, a new flag (say provisionFromDatasourceOperationTimeout, defaulting to 10 seconds) can be exposed (https://github.com/saikat-royc/external-provisioner/blob/master/cmd/csi-provisioner/csi-provisioner.go#L65) and set in the PD CSI driver pod spec (https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/deploy/kubernetes/base/controller/controller.yaml#L34).

Let me know your thoughts? @msau42 @mattcary

Contributor

I think we can use the same timeout value for all operations that the provisioner calls. 10 seconds is probably too short.

Contributor

+1 Increasing the timeout is straightforward.

Can we get stats from GKE clusters to see how long attach time typically is? (not for snapshots, just to get a sense of how often these errors are happening)

Contributor

Saikat explained to me that the scenario we're seeing right now is that waitForOp times out after 10 seconds, and then the retry runs into the "check if disk already exists" logic, which returns immediately. If we increase the timeout, then the waitForOp here should not return until the disk is ready. Given that, I think increasing the timeout is sufficient and we don't need to add an additional wait for ready (returning an error is fine).
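
The "return an error if the existing disk is not ready" behavior can be sketched roughly as below. This is an illustration under assumptions, not the merged code: the real isDiskReady takes a disk object, whereas this version takes a status string, and checkExistingDisk is a hypothetical helper. The status values READY, FAILED, CREATING and RESTORING are the ones the Compute Engine disk resource reports.

```go
package main

import "fmt"

// isDiskReady is a simplified stand-in for the check named in the diff.
// It reports whether a disk with the given status can be used.
func isDiskReady(status string) (bool, error) {
	switch status {
	case "READY":
		return true, nil
	case "FAILED":
		// A failed disk will never become ready; surface that as an error.
		return false, fmt.Errorf("disk is in FAILED state")
	default:
		// For example CREATING or RESTORING: not ready yet, but not an error.
		return false, nil
	}
}

// checkExistingDisk is a hypothetical helper for the "disk already exists"
// path: instead of returning success right away, it returns an error while
// the disk is not ready so that the caller retries later.
func checkExistingDisk(name, status string) error {
	ready, err := isDiskReady(status)
	if err != nil {
		return fmt.Errorf("disk %q: %w", name, err)
	}
	if !ready {
		return fmt.Errorf("disk %q exists but is not ready (status %s)", name, status)
	}
	return nil
}
```

In the real driver the error would presumably be wrapped in a gRPC status code; which code is appropriate is not something this sketch tries to settle.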

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 17, 2020
@saikat-royc saikat-royc force-pushed the issue-526 branch 2 times, most recently from 0ce2aac to ca085e4 Compare June 18, 2020 04:33
@msau42
Contributor

msau42 commented Jun 18, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, saikat-royc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2020
@k8s-ci-robot k8s-ci-robot merged commit c16e7d1 into kubernetes-sigs:master Jun 18, 2020
@msau42 msau42 mentioned this pull request Aug 5, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

PD CSI Driver CreateVolume should wait for Disk to reach READY status before returning success
4 participants