When mount dies, it is not remounted #164
Pasted from Steven French: SMB3 has very cool features there, and many of them have been implemented in cifs.ko for a very long time; some specific features go beyond support for SMB3 'persistent handles'.
An easy way to think about this is that if the network connection goes down, the Linux SMB3 client reopens the files and reacquires byte-range locks, and since the Azure server supports persistent handles, there are more guarantees about reconnect surviving races with other clients. |
Unfortunately I don't control the server and I don't know what happened. I am running a 5.6.19 kernel though, which ought to be new enough for these features. The bug still happened: for three straight days, no pods could be created on the affected nodes. Unmounting all CIFS mounts by hand allowed the pods to run again. |
cc @smfrench |
It happened again. I get the following log lines (a lot of them):
Status code returned 0xc000006d STATUS_LOGON_FAILURE
CIFS VFS: \\XXX Send error in SessSetup = -13
I am pretty sure that it is induced by an unreliable server, but the problem is that csi-driver-smb does not recover. This is on 5.6.19-300.fc32.x86_64. |
In the past I have only seen that for the case where the userid or password are misconfigured (e.g. the password was changed on the server).
|
Well, it mounted without any problems after a manual umount. |
The New-SmbGlobalMapping -RemotePath command must include "-RequirePrivacy $true", otherwise the SMB channel will be reset after 15 minutes and you'll lose access:
New-SmbGlobalMapping -RemotePath '\\FQDN\share\Directory' -Credential $credential -LocalPath G: -RequirePrivacy $true -ErrorAction Stop |
Seems related to this issue: https://github.com/MicrosoftDocs/Virtualization-Documentation-Private/issues/1300 and moby/moby#37863. I will fix it in csi-proxy first, thanks! @marciogmorales |
BTW, the original issue is on a Linux node, and this one is on Windows, so they are two different issues. |
worked out a PR to fix in k/k first: kubernetes/kubernetes#99550 |
Yes, I am having these issues (regularly!) on a linux node. |
about the |
would be fixed by this PR: kubernetes/kubernetes#101305 |
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. |
@andyzhangx should this be reopened for follow-up on Linux nodes? |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. |
This is still an issue. @andyzhangx |
I am also seeing the
The biggest problem here is that this failure mode is completely silent; the PV/PVCs/drivers all report healthy, and the pod only crashes if it isn't robust enough to catch a filesystem issue when it tries to read/write the mount. The only fix seems to be to delete the PV/PVC, then delete the pod, wait for the PV to close, then recreate everything, which is really awful. Is there a way to force the CSI driver to recreate everything? Alternatively, a workaround might be to deploy a sidecar to the smb-node drivers and either force-remount the CIFS shares or at the very least change the health status to unhealthy in order to help detect this problem. @andyzhangx can you please reopen this? |
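A sidecar along the lines suggested above could be fairly small. The sketch below is only an illustration, not part of csi-driver-smb: it assumes the default kubelet layout (/var/lib/kubelet/pods/.../volumes/kubernetes.io~csi/.../mount) and simply stats each publish path, logging mounts that no longer respond so that the failure is at least no longer silent.

```go
package main

import (
	"log"
	"path/filepath"
	"time"

	"golang.org/x/sys/unix"
)

// kubeletDir is an assumption about the node layout; adjust it if kubelet
// uses a non-default root directory.
const kubeletDir = "/var/lib/kubelet/pods"

// mountHealthy forces a round trip to the filesystem. On a dead CIFS
// session this typically fails with EIO, ESTALE or EHOSTDOWN instead of
// returning cleanly (with hard mounts it may block instead, so a soft/timeo
// mount or an external timeout is advisable).
func mountHealthy(path string) error {
	var st unix.Statfs_t
	return unix.Statfs(path, &st)
}

func main() {
	for {
		// Glob pattern for CSI publish paths; treat it as an assumption.
		mounts, err := filepath.Glob(filepath.Join(kubeletDir, "*", "volumes", "kubernetes.io~csi", "*", "mount"))
		if err != nil {
			log.Printf("glob failed: %v", err)
		}
		for _, m := range mounts {
			if err := mountHealthy(m); err != nil {
				// Surface the problem: log it, export a metric, or
				// lazily unmount so the next mount attempt can recover.
				log.Printf("mount %s appears dead: %v", m, err)
			}
		}
		time.Sleep(time.Minute)
	}
}
```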
Could you share the Linux kernel version and k8s version when you hit this issue? |
Ubuntu 18.04.4 LTS. The network storage that we are using supports SMB version <= 2.0. |
The 4.15 kernel is more than four and a half years old; are you able to upgrade to a 5.x kernel? The CSI driver relies on the SMB kernel driver to do the reconnect. |
@andyzhangx unfortunately no, we have 20+ nodes running Ubuntu 18.04, and migrating to a different distro or kernel version is not currently feasible. Automatically reconnecting is not the biggest issue in my eyes; it's the fact that the failure is completely silent. Is there any process or section of the driver that periodically checks the mounts that could be improved? I'd be interested in opening a PR, but I'm not entirely sure where to start. Documenting the absolute minimum kernel version would be a good idea here, but it's still kind of lame that there isn't a way to make a request to the CSI driver to force-remount the volumes. I'll try to find the minimum kernel version tomorrow unless you know it off the top of your head. |
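One place a driver-side check could live is NodeGetVolumeStats: the CSI spec has an (alpha) VolumeCondition field intended for reporting an unhealthy mount back to kubelet. The fragment below is a hedged sketch using the container-storage-interface Go bindings; it is not how csi-driver-smb currently implements this, and the package and type names assume CSI spec v1.3 or newer.

```go
package smbnode // hypothetical package name for this sketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"golang.org/x/sys/unix"
)

type nodeServer struct{}

// NodeGetVolumeStats stats the published path and reports an abnormal
// VolumeCondition when the underlying CIFS session is gone, instead of
// silently returning healthy-looking (zero) usage numbers.
func (ns *nodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error) {
	condition := &csi.VolumeCondition{Abnormal: false, Message: "mount is healthy"}

	var st unix.Statfs_t
	if err := unix.Statfs(req.GetVolumePath(), &st); err != nil {
		// EIO/ESTALE/ENOTCONN here usually means the SMB session died.
		condition = &csi.VolumeCondition{Abnormal: true, Message: err.Error()}
	}

	return &csi.NodeGetVolumeStatsResponse{
		Usage: []*csi.VolumeUsage{{
			Unit:      csi.VolumeUsage_BYTES,
			Total:     int64(st.Blocks) * st.Bsize,
			Available: int64(st.Bavail) * st.Bsize,
			Used:      int64(st.Blocks-st.Bfree) * st.Bsize,
		}},
		VolumeCondition: condition,
	}, nil
}
```

Actually surfacing the condition also requires advertising the VOLUME_CONDITION node capability and enabling kubelet's volume health monitoring, which are assumptions about the deployment rather than defaults.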
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
This is still a problem, and is a massive pain anytime the csi drivers are redeployed or upgraded. We now have 20 nodes that can't be upgraded, and we currently have no alternative solutions. |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". |
Any update on this? I am encountering the same error on EKS. If I reboot the Windows EC2 instance that serves the shares, the pods are still reported healthy but they cannot access the files in the target mounted share. |
What happened:
At one point in time the mount died (most likely due to an unrelated server issue). Every pod using the smb-pv could not be started with an error like the following.
Unmounting any CIFS mount allowed new pods to be deployed.
What you expected to happen:
Detecting that the mount died and remounting without having to restart pods.
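The recovery expected here (detect the dead mount, unmount, mount again) can in principle be done node-side without restarting pods. The sketch below is purely illustrative and not something csi-driver-smb does today; the target path, share name and credentials file are placeholder values, and the original mount options would have to be reused.

```go
package main

import (
	"log"
	"os/exec"

	"golang.org/x/sys/unix"
)

// remountIfDead checks whether the mount still answers; if not, it lazily
// detaches it and mounts the share again with mount(8).
func remountIfDead(target, source, credentialsFile string) error {
	var st unix.Statfs_t
	if err := unix.Statfs(target, &st); err == nil {
		return nil // mount still answers, nothing to do
	}
	// MNT_DETACH performs a lazy unmount, which works even when the server
	// side of the dead session never comes back.
	if err := unix.Unmount(target, unix.MNT_DETACH); err != nil {
		return err
	}
	// Re-run the mount with the original options (placeholders here).
	cmd := exec.Command("mount", "-t", "cifs", source, target,
		"-o", "credentials="+credentialsFile+",vers=3.0")
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("remount failed: %v: %s", err, out)
	}
	return err
}

func main() {
	// Hypothetical example values.
	if err := remountIfDead("/mnt/share", "//server.example.com/share", "/etc/smb-credentials"); err != nil {
		log.Fatal(err)
	}
}
```

Whether a forced remount like this is safe depends on in-flight I/O, so a real driver would more likely surface the condition and let kubelet re-trigger the mount.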
How to reproduce it:
I did not try to reproduce yet.
Anything else we need to know?:
Environment:
- CSI driver: mcr.microsoft.com/k8s/csi/smb-csi:v0.4.0
- Kubernetes version (kubectl version):
- Kernel (uname -a): 5.6.19-300.fc32.x86_64