
When mount dies, it is not remounted #164

Closed
ibotty opened this issue Nov 23, 2020 · 34 comments
Labels
kind/support Categorizes issue or PR as a support question. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ibotty

ibotty commented Nov 23, 2020

What happened:

At one point in time the mount died (most likely due to an unrelated server issue). Every pod using the smb-pv could not be started with an error like the following.

MountVolume.MountDevice failed for volume "pvc-cc41658a-8d11-49e1-8536-d2f73cfe829a" : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cc41658a-8d11-49e1-8536-d2f73cfe829a/globalmount: host is down

Unmounting the CIFS mounts by hand allowed new pods to be deployed.

What you expected to happen:
The dead mount should be detected and remounted without having to restart pods.

How to reproduce it:
I did not try to reproduce yet.
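
For illustration (not part of the original report), a minimal Go sketch of the check that is failing above: stat the CSI globalmount path and look for the errnos a dead CIFS connection typically returns. The path and the exact errno list are assumptions.

package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// staleMount reports whether a stat on the mount path fails with one of the
// errnos a dead CIFS connection typically produces.
func staleMount(path string) bool {
	_, err := os.Stat(path)
	if err == nil {
		return false
	}
	return errors.Is(err, syscall.EHOSTDOWN) || // "host is down", as in the error above
		errors.Is(err, syscall.ENOTCONN) || // "transport endpoint is not connected"
		errors.Is(err, syscall.ESTALE) // "stale file handle"
}

func main() {
	// Hypothetical globalmount path; the real PV name will differ.
	path := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount"
	fmt.Println("stale:", staleMount(path))
}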

Anything else we need to know?:

Environment:

  • CSI Driver version: mcr.microsoft.com/k8s/csi/smb-csi:v0.4.0
  • Kubernetes version (use kubectl version):
Server Version: 4.5.0-0.okd-2020-10-15-235428
Kubernetes Version: v1.18.3
  • OS (e.g. from /etc/os-release):
NAME=Fedora
VERSION="32.20200629.3.0 (CoreOS)"
  • Kernel (e.g. uname -a): 5.6.19-300.fc32.x86_64
@andyzhangx
Member

andyzhangx commented Nov 24, 2020

pasted from Steven French:
Remount is going to be possible with changes Ronnie at Red Hat is working on (for the new mount API support for cifs), but remount should not be needed in the case where a server goes down.

SMB3 has very cool features there, and many of them have been implemented in cifs.ko for a very long time. Some specific features beyond support for SMB3 ‘persistent handles’:

  1. The 5.0 kernel added reconnect support for cases where the server IP address changed; 5.0 also added some important reconnect bug fixes relating to crediting.
  2. The 4.20 kernel added dynamic tracing for various events relating to reconnects and why they were triggered.
  3. SMB3 has a feature called ‘persistent handles’ that allows state (locks etc.) to be re-established more safely during reconnect; the 5.1 kernel made the persistent handle timeout configurable (new mount parameter “handletimeout=”).

An easy way to think about this: if the network connection goes down, the Linux SMB3 client reopens the files and reacquires byte-range locks, and since the Azure server supports persistent handles, there are stronger guarantees that the reconnect survives races with other clients.
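
For reference, a hedged sketch of passing the “handletimeout=” parameter mentioned above when mounting a CIFS share directly through mount(2) from Go. In practice mount.cifs or the CSI driver's mount options would be used instead; the server, credentials, and timeout value here are placeholders, and the option is only honored on 5.1+ kernels.

package main

import (
	"log"
	"syscall"
)

func main() {
	// Placeholder share, mount point, and credentials for illustration only.
	source := "//server.example.com/share"
	target := "/mnt/share"
	// vers=3.0 requests an SMB3 dialect so persistent handles can be used;
	// handletimeout= (milliseconds) is the 5.1+ persistent-handle timeout.
	opts := "vers=3.0,username=user,password=secret,handletimeout=60000"

	if err := syscall.Mount(source, target, "cifs", 0, opts); err != nil {
		log.Fatalf("mount cifs %s on %s: %v", source, target, err)
	}
	log.Printf("mounted %s on %s", source, target)
}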

@andyzhangx andyzhangx added the kind/support Categorizes issue or PR as a support question. label Nov 24, 2020
@ibotty
Author

ibotty commented Nov 24, 2020

Unfortunately I don't control the server and I don't know what happened.

I am running a 5.6.19 kernel though, which ought to be new enough for these features. The bug still happened: for three straight days, no pods could be created on the affected nodes. Unmounting all CIFS mounts by hand allowed the pods to run again.

@andyzhangx
Member

cc @smfrench

@ibotty
Author

ibotty commented Nov 30, 2020

It happened again. I get the following log lines (a lot of them):

Status code returned 0xc000006d STATUS_LOGON_FAILURE
CIFS VFS: \\XXX Send error in SessSetup = -13

I am pretty sure that it is induced by an unreliable server, but the problem is that csi-driver-smb does not recover. This is on 5.6.19-300.fc32.x86_64.

@smfrench

smfrench commented Nov 30, 2020 via email

@ibotty
Author

ibotty commented Nov 30, 2020

Well, it mounted without any problems after a manual umount.

@marciogmorales

The New-SmbGlobalMapping -RemotePath command must include "-RequirePrivacy $true"; otherwise the SMB channel will be reset after 15 minutes and you'll lose access.

New-SmbGlobalMapping -RemotePath '\\FQDN\share\Directory' -Credential $credential -LocalPath G: -RequirePrivacy $true -ErrorAction Stop

@andyzhangx
Member

andyzhangx commented Feb 28, 2021

Seems related to this issue: https://github.com/MicrosoftDocs/Virtualization-Documentation-Private/issues/1300 and moby/moby#37863. I will fix it in csi-proxy first, thanks! @marciogmorales

@andyzhangx
Member

BTW, the original issue is on a Linux node; this one is on Windows, so they are two different issues.

@andyzhangx
Member

Worked out a PR to fix this in k/k first: kubernetes/kubernetes#99550

@ibotty
Author

ibotty commented Feb 28, 2021

Yes, I am having these issues (regularly!) on a Linux node.

@andyzhangx
Member

About the "host is down" issue, which could leave a pod stuck in Terminating status forever: there is already a PR to address it: kubernetes/utils#203 (comment)
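
For context, the approach in that area is to treat errors such as "host is down" as a corrupted mount so the stale mount point can be cleaned up instead of blocking forever. A rough sketch of that pattern using k8s.io/mount-utils (the helper below is illustrative, not the exact patch):

package main

import (
	"log"
	"os"

	mount "k8s.io/mount-utils"
)

// cleanupIfCorrupted stats the mount path and, if the error indicates a
// corrupted mount (host is down, stale file handle, not connected, ...),
// unmounts it so the next mount attempt can start from a clean state.
func cleanupIfCorrupted(mounter mount.Interface, path string) error {
	_, err := os.Stat(path)
	if err == nil || !mount.IsCorruptedMnt(err) {
		return err
	}
	log.Printf("mount point %s looks corrupted (%v), unmounting", path, err)
	return mounter.Unmount(path)
}

func main() {
	mounter := mount.New("")
	// Hypothetical globalmount path for illustration.
	path := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount"
	if err := cleanupIfCorrupted(mounter, path); err != nil {
		log.Println(err)
	}
}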

@andyzhangx
Member

would be fixed by this PR: kubernetes/kubernetes#101305

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 19, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@faandg

faandg commented Jan 25, 2022

@andyzhangx should this be reopened for follow-up on Linux nodes?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@NotANormalNerd

NotANormalNerd commented Mar 20, 2022

This is still an issue. @andyzhangx

andyzhangx pushed a commit to andyzhangx/csi-driver-smb that referenced this issue May 1, 2022
@MiddleMan5

MiddleMan5 commented Jul 20, 2022

I am also seeing the "host is down" issue whenever one of the following occurs:

  • Upgrade csi-driver-smb deployment or restart of csi-smb-node pod
  • Network connection between NAS and cluster is interrupted temporarily

The biggest problem here is that this failure mode is completely silent; the PVs/PVCs/drivers all report healthy, and the pod only crashes if it tries to read/write the mount and isn't robust enough to catch a filesystem error.

The only fix seems to be to delete the PV/PVC, delete the pod, wait for the PV to be released, and then recreate everything, which is really awful. Is there a way to force the CSI driver to recreate everything?

Alternatively, a workaround might be to deploy a sidecar alongside the smb-node drivers that either force-remounts the CIFS shares or at the very least flips the health status to unhealthy so the problem can be detected.
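
A rough sketch of that sidecar idea, assuming a simple loop that stats a configured list of mount paths and lazily unmounts any that look dead so the next mount attempt can succeed; the paths, interval, and the choice of a detached (lazy) unmount are all assumptions, not existing driver behavior:

package main

import (
	"errors"
	"log"
	"os"
	"syscall"
	"time"
)

// looksDead reports whether the stat error matches a dead CIFS connection.
func looksDead(err error) bool {
	return errors.Is(err, syscall.EHOSTDOWN) ||
		errors.Is(err, syscall.ENOTCONN) ||
		errors.Is(err, syscall.ESTALE)
}

func main() {
	// Hypothetical mount paths to watch; a real sidecar would discover them
	// from /proc/mounts or take them as flags.
	paths := []string{"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount"}

	for {
		for _, p := range paths {
			if _, err := os.Stat(p); err != nil && looksDead(err) {
				log.Printf("%s is stale (%v); lazy-unmounting", p, err)
				// MNT_DETACH detaches the dead mount so a fresh mount can replace it.
				if uerr := syscall.Unmount(p, syscall.MNT_DETACH); uerr != nil {
					log.Printf("unmount %s: %v", p, uerr)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}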

@andyzhangx can you please reopen this?

@andyzhangx andyzhangx reopened this Jul 21, 2022
@andyzhangx andyzhangx removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 21, 2022
@andyzhangx
Member

andyzhangx commented Jul 21, 2022

Could you share the Linux kernel version and k8s version you were on when you hit "host is down"? There is auto-reconnect in the SMB kernel driver.

@MiddleMan5

MiddleMan5 commented Jul 21, 2022

Ubuntu 18.04.4 LTS
Linux version 4.15.0-189-generic
Kubernetes 1.23.0

The network storage that we are using supports SMB version <= 2.0.

@andyzhangx
Member

The 4.15 kernel is more than four and a half years old; are you able to upgrade to a 5.x kernel? The CSI driver relies on the SMB kernel driver to do the reconnect.

@MiddleMan5

@andyzhangx unfortunately no, we have 20+ nodes running Ubuntu 18.04, and migrating to a different distro or kernel version is not currently feasible.

Automatic reconnection is not the biggest issue in my eyes; it's the fact that the failure is completely silent.

Is there any process or section of the driver that periodically checks the mounts and could be improved? I'd be interested in opening a PR, but I'm not entirely sure where to start.

Documenting the absolute minimum kernel version would be a good idea here, but it's still kind of lame that there isn't a way to ask the CSI driver to force-remount the volumes.

I'll try to find the minimum kernel version tomorrow unless you know it off the top of your head.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 20, 2022
@MiddleMan5

This is still a problem, and it is a massive pain any time the CSI drivers are redeployed or upgraded.

We now have 20 nodes that can't be upgraded, and we currently have no alternative solutions.

@MiddleMan5

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 18, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 17, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@davgia

davgia commented Dec 10, 2024

Any update on this? I am encountering the same error on EKS. If I reboot the Windows EC2 instance that serves the shares, the pods still report healthy but they cannot access the files in the mounted share.
