Return success in CreateVolume when disk is READY #527
Conversation
Hi @saikat-royc. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/lgtm
Thanks!
@mattcary: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 224ea5e to c5960a5.
UTs and the sanity test need to be looked into before the final review.
@@ -187,6 +220,15 @@ func (gceCS *GCEControllerServer) CreateVolume(ctx context.Context, req *csi.Cre
	default:
		return nil, status.Error(codes.InvalidArgument, fmt.Sprintf("CreateVolume replication type '%s' is not supported", params.ReplicationType))
	}

	ready, err := isDiskReady(disk)
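For context, a minimal sketch of what the readiness check above could look like, assuming the GCE compute v1 API's Disk type (the driver's actual helper takes its own disk wrapper, so the signature here is illustrative):

import (
	"fmt"

	computev1 "google.golang.org/api/compute/v1"
)

// isDiskReady reports whether the disk has finished provisioning. A GCE
// disk's Status is one of CREATING, RESTORING, FAILED, READY, or DELETING.
func isDiskReady(disk *computev1.Disk) (bool, error) {
	switch disk.Status {
	case "READY":
		return true, nil
	case "FAILED":
		return false, fmt.Errorf("disk %s is in FAILED state", disk.Name)
	default:
		// CREATING, RESTORING, or DELETING: not ready yet, but not a hard failure.
		return false, nil
	}
}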
Can we potentially extend waitForOp instead, so that we don't immediately fail? I imagine this delay will happen frequently anytime someone restores from a snapshot.
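A hedged sketch of that idea, reusing the isDiskReady sketch above: wrap the check in a poll bounded by the request context instead of failing on the first not-ready read (getDisk is a hypothetical stand-in for the driver's disk lookup):

import (
	"context"
	"time"
)

// waitForDiskReady re-checks the disk status every 3 seconds until it
// reports READY or the caller's context expires.
func waitForDiskReady(ctx context.Context, getDisk func() (*computev1.Disk, error)) error {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		disk, err := getDisk()
		if err != nil {
			return err
		}
		ready, err := isDiskReady(disk)
		if err != nil {
			return err
		}
		if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // the caller's deadline cancels the poll
		case <-ticker.C:
		}
	}
}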
Will look into it. I was following the snapshot model for the ready check.
Ah, the snapshot API specifically says that it should return once the snapshot is cut and before it's ready, especially if it would take a significant amount of time to become ready. But the CreateVolume API doesn't have a requirement like that.
Looking into the code and test logs: the insert call returns an insert operation, which the driver polls to completion. Here is the timeout behavior:
The timeout for the insert operation polling is 5 min (polled every 3 secs).
However, the context is given a deadline of 10 secs by the caller (the csi-provisioner sidecar in this case), which means the poll itself gets cancelled at the end of the 10 secs.
The current implementation and placement of isDiskReady (in the CreateVolume call) looks to me like the right place, as it is a common code path for both regional and zonal disks.
Possible changes in the external provisioner to reduce the number of errors:
We can look at changing the timeout to a larger value (on the order of minutes, say 1-2 mins). As an additional optimization, the large timeout could be used only when a volume data source is provided for provisioning (https://github.com/saikat-royc/external-provisioner/blob/master/pkg/controller/controller.go#L459).
To keep the external provisioner code generic, a new flag (say provisionFromDatasourceOperationTimeout, default 10 secs) could be exposed (https://github.com/saikat-royc/external-provisioner/blob/master/cmd/csi-provisioner/csi-provisioner.go#L65) and set in the PD CSI driver pod spec (https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/deploy/kubernetes/base/controller/controller.yaml#L34).
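A sketch of what that flag could look like on the external-provisioner side; the flag name, default, and selection logic are assumptions taken from this comment, not an existing option:

import (
	"flag"
	"time"
)

// Hypothetical flag per the proposal above; today the provisioner exposes a
// single --timeout that covers all CSI calls.
var provisionFromDatasourceTimeout = flag.Duration(
	"provision-from-datasource-timeout", 10*time.Second,
	"Timeout of CreateVolume calls that restore a snapshot or clone a volume")

// provisionTimeout uses the larger value only when the PVC carries a
// DataSource, leaving the common provisioning path unchanged.
func provisionTimeout(hasDataSource bool, defaultTimeout time.Duration) time.Duration {
	if hasDataSource {
		return *provisionFromDatasourceTimeout
	}
	return defaultTimeout
}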
I think we can use the same timeout value for all operations that the provisioner calls. 10 seconds is probably too short.
+1 Increasing the timeout is straightforward.
Can we get stats from GKE clusters to see how long attach time typically is? (not for snapshots, just to get a sense of how often these errors are happening)
Saikat explained to me that the scenario we're seeing right now is that waitForOp times out after 10 seconds, and then we run into the "check if disk already exists" logic, which returns immediately. If we increase the timeout, then waitForOp here should not return until the disk is ready. So given that, I think increasing the timeout is sufficient and we don't need to add an additional wait for ready (returning an error is fine).
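To make that failure mode concrete, here is a sketch of the retry path being described; getExistingDisk and generateCreateVolumeResponse are hypothetical stand-ins, not the driver's actual helpers (imports: google.golang.org/grpc/codes, google.golang.org/grpc/status):

// A prior CreateVolume timed out once its 10s context cancelled the op poll;
// the retry then lands in the "disk already exists" branch of CreateVolume.
disk, err := getExistingDisk(ctx, volKey) // hypothetical lookup helper
if err != nil {
	return nil, status.Error(codes.Internal, err.Error())
}
ready, err := isDiskReady(disk)
if err != nil {
	return nil, status.Error(codes.Internal, err.Error())
}
if !ready {
	// Previously the driver could return success here while the disk was
	// still RESTORING; failing instead makes the caller retry until READY.
	return nil, status.Errorf(codes.Internal, "disk %v is not ready yet", volKey)
}
return generateCreateVolumeResponse(disk), nil // hypothetical response builder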
Force-pushed from 0ce2aac to ca085e4.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: msau42, saikat-royc
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
What this PR does / why we need it:
In the PD CSI driver CreateVolume call, wait for the disk to reach READY status before returning success to the caller.
Which issue(s) this PR fixes:
Fixes #526
Special notes for your reviewer:
Does this PR introduce a user-facing change?: