Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix timeout on short CSI calls during attacher initialization #561

Conversation

Fricounet
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

The same context.WithTimeout was used in different places when calling the CSI drivers in the main function. With recent changes, the code execution started taking longer and the context would hit its timeout on the GetPluginCapabilities driver call. For instance with when running with the azuredisk-csi-driver (I've seen this on the aws-ebs and the gcp-pd drivers too):

❯ k logs -c csi-attacher csi-azuredisk-controller-69fdb787c6-k59fw
I0529 12:26:30.367859       1 main.go:109] "Version" version="v4.6.0-0-gc58f4277"
I0529 12:26:30.369489       1 connection.go:234] "Connecting" address="unix:///csi/csi.sock"
I0529 12:26:31.370120       1 common.go:145] "Probing CSI driver for readiness"
I0529 12:26:31.370148       1 connection.go:264] "GRPC call" method="/csi.v1.Identity/Probe" request="{}"
I0529 12:26:31.371908       1 connection.go:270] "GRPC response" response="{\"ready\":{\"value\":true}}" err=null
I0529 12:26:31.371934       1 connection.go:264] "GRPC call" method="/csi.v1.Identity/GetPluginInfo" request="{}"
I0529 12:26:31.372446       1 connection.go:270] "GRPC response" response="{\"name\":\"disk.csi.azure.com\",\"vendor_version\":\"v1.30.1\"}" err=null
I0529 12:26:31.372466       1 main.go:169] "CSI driver name" driver="disk.csi.azure.com"
I0529 12:26:31.373062       1 connection.go:234] "Connecting" address="unix:///csi/csi.sock"
I0529 12:26:32.374364       1 common.go:145] "Probing CSI driver for readiness"
I0529 12:26:32.374396       1 connection.go:264] "GRPC call" method="/csi.v1.Identity/Probe" request="{}"
I0529 12:26:32.374869       1 connection.go:270] "GRPC response" response="{\"ready\":{\"value\":true}}" err=null
I0529 12:26:32.374921       1 connection.go:264] "GRPC call" driver="disk.csi.azure.com" method="/csi.v1.Identity/GetPluginCapabilities" request="{}"
I0529 12:26:32.375015       1 main.go:196] "ServeMux listening" driver="disk.csi.azure.com" address=":22012" metricsPath="/metrics"
I0529 12:26:32.375048       1 connection.go:270] "GRPC response" driver="disk.csi.azure.com" response="{}" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0529 12:26:32.375069       1 main.go:207] "Failed to check if the CSI Driver supports the CONTROLLER_SERVICE" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" driver="disk.csi.azure.com"

This change uses a separate context.WithTimeout for each short CSI call using the default csiTimeout defined in main.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixed an issue where the attacher would see its context timeout on startup when calling the CSI driver.

The same context.WithTimeout was used in different places when calling the CSI drivers. With recent changes, the code execution started taking longer and the context would hit its timeout on the 2nd driver call. This change use a separate context.WithTimeout for each CSI call
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 29, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @Fricounet!

It looks like this is your first PR to kubernetes-csi/external-attacher 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-csi/external-attacher has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 29, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @Fricounet. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 29, 2024
@Fricounet Fricounet changed the title [main] Fix timeout on short CSI calls Fix timeout on short CSI calls during attacher initialization May 29, 2024
@huww98
Copy link

huww98 commented May 29, 2024

connection.go:234] "Connecting" address="unix:///csi/csi.sock" this seems taking exactly 1 second from your logs. Do you know why? It looks like the CSI response is very fast, and 1 second timeout should be enough for both RPC calls.

@xing-yang
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 29, 2024
@Fricounet
Copy link
Contributor Author

@huww98 good point. I had not paid attention that both CSI connection calls were taking 1s each. But I think this might just be the backoff for a failed call to csi-lib-utils Connect kicking in. So its possibly just the csi-driver that failed to respond on the first call and the next call succeeded after the backoff. I will investigate a bit more but I still feel that each csi call that are supposed to be 'short' should use their own timeouts instead of sharing the same one. Especially when there are calls in the middle that don't have those timeouts (like the one to connect to the driver).

@Fricounet
Copy link
Contributor Author

I've figured where this delay comes from and It is not an issue with the connection to the driver but with a change to the Probe function that is run just after. In the latest version of csi-lib-utils, it was changed to use a Ticker instead of the sleep it was using before between each probe. As a result, the first probe is only launched after 1s instead of immediately resulting in the delay we see

@Fricounet
Copy link
Contributor Author

I've submitted a PR in kubernetes-csi/csi-lib-utils#175 to restore the previous behavior of the Probe. But this PR stays relevant IMHO because the timeout should be used for short csi calls and having it shared for 2 calls separated by potentially long running calls (Connection and Probe for instance) feels wrong.

@jsafrane
Copy link
Contributor

/lgtm
/approve

You're right both about the ProbeForever being slower and that the timeout context should start just before corresponding call.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Fricounet, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2024
@k8s-ci-robot k8s-ci-robot merged commit 84b7c0c into kubernetes-csi:master May 30, 2024
6 checks passed
@jsafrane
Copy link
Contributor

BTW, thanks a ton for testing the latest sidecars and reporting and even fixing the issue!

@Fricounet
Copy link
Contributor Author

And thanks for all our work on CSI and storage!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants