Set READONLY flag in CSI PV based on PVC accessmode #469
Conversation
@msau42 this PR tries to address the volume read-only issue (the AI item mentioned in the SIG call). As mentioned in one of the SIG calls, I still have doubts about kubernetes/kubernetes#70505 (comment). I'd appreciate it if you can share your thoughts on this PR and the mentioned comment.
/release-note-none
/assign @xing-yang
pvReadOnly = true
	}
}
I am not sure the presence of the ROX access mode means the volume should be controller-published as read-only. To me, these can be orthogonal concerns. cc @bswartz
True.. but regardless of how controller publish behaves, the PV source could still set it RO.
Yes, but interpreting ROX as read-only in the PV source should be done with caution. It could make the PV unmountable in case the journal needs to be replayed, or something else.
Yes I was going to argue the same. Volumes should not be controller-published as read-only unless the PV is marked read-only. There are too many good reasons for publishing them writable to the node and then having the node re-mount them as read-only to the pod.
Journal-replays are one use case. Pending expand operations are another.
I interpret PVs with the readOnly: true flag to be PERMANENTLY read-only, which you almost never want in practice. We might want to make it easier to cause that PV flag to get set to true in some cases, but I think the decision for ControllerPublishVolume should look at that flag only and not look at access modes.
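To make that concrete, here is a minimal sketch (Go; names are illustrative, this is not the actual external-attacher code) of a publish request built the way I'm describing, where only the PV flag feeds the read-only decision:

```go
package example

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
	v1 "k8s.io/api/core/v1"
)

// buildPublishRequest is a hypothetical helper: the read-only decision for
// ControllerPublishVolume comes from the PV's CSI source readOnly flag alone;
// the PVC access modes (including ROX) are deliberately not consulted.
func buildPublishRequest(pv *v1.PersistentVolume, nodeID string) *csi.ControllerPublishVolumeRequest {
	src := pv.Spec.CSI // assumes a CSI volume; real code must check for nil
	return &csi.ControllerPublishVolumeRequest{
		VolumeId: src.VolumeHandle,
		NodeId:   nodeID,
		Readonly: src.ReadOnly, // the only input to the read-only decision
	}
}
```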
I agree the FS operations may need the volume RW at controller-publish time. But the SP driver is free to do those operations, or default to RW, while controller-publishing or node-staging. Isn't it?
I interpret PVs with the readOnly: true flag to be PERMANENTLY read-only, which you almost never want in practice.
I can think of a scenario where a user provisions a new volume as "RO" with dataSource set to a PVC or a Snapshot. Isn't that a valid use case where the user wants a PV to be READONLY?
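For illustration, a rough sketch of such a claim using the client-go types (every name and size here is made up):

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// readOnlyClone sketches a PVC that clones an existing volume and only ever
// asks for read access.
func readOnlyClone() *v1.PersistentVolumeClaim {
	return &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "ro-clone"},
		Spec: v1.PersistentVolumeClaimSpec{
			AccessModes: []v1.PersistentVolumeAccessMode{v1.ReadOnlyMany},
			DataSource: &v1.TypedLocalObjectReference{
				// APIGroup stays nil for a core-group PVC source; a snapshot
				// source would instead use Kind "VolumeSnapshot" and the
				// snapshot.storage.k8s.io group.
				Kind: "PersistentVolumeClaim",
				Name: "source-pvc",
			},
			Resources: v1.ResourceRequirements{
				Requests: v1.ResourceList{
					v1.ResourceStorage: resource.MustParse("10Gi"),
				},
			},
		},
	}
}
```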
@saad-ali @msau42 @jsafrane @xing-yang inviting your thoughts.
- the pod tries to come up, the mount fails because the kernel is unable to replay the journal on a readonly device
I'd like to question this part. I expect filesystems to be prepared for this case. Consider a normal device that degrades such that writes start to fail. The normal reaction of the kernel is to remount the filesystem as read-only. Being able to mount read-only after a system reboot to extract more data seems like a sensible feature to support - much better than refusing to mount completely.
Whether data and files are consistent is another question, but it's not different from protecting against a sudden power loss.
First of all, I want to be clear that I'm not trying to block this issue. I think it's a good change, and I'd like to see it merged.
I was just trying to answer the question of what potential downsides this change could have, because it does introduce this new corner case with journaled filesystems. We shouldn't be surprised to see support cases from people with problems mounting readonly volumes that were cloned from in-use filesystems or snapshots taken of in-use filesystems, if journaling is in use.
There are some workarounds we can either recommend to users, or possibly even build into our CSI drivers to automatically overcome this class of problems. The most obvious workaround is to create a volume that's writable and use that instead. Or, create a writable volume, attach that to a pod that does nothing (to let the journal replay), then clone that volume after the pod has terminated cleanly (so the journal is empty).
The specific kernel error I see when I reproduce this problem with ext4 is:
[587843.214656] EXT4-fs (nbd2): INFO: recovery required on readonly filesystem
[587843.214657] EXT4-fs (nbd2): write access unavailable, cannot proceed (try mounting with noload)
Possibly CSI drivers could use the -o noload option with ext4 filesystems when mounting readonly to avoid the problem.
When using XFS, the kernel errors look like this:
[588571.823845] XFS (nbd1): Mounting V5 Filesystem
[588571.881098] XFS (nbd1): recovery required on read-only device.
[588571.881100] XFS (nbd1): write access unavailable, cannot proceed.
[588571.881101] XFS (nbd1): log mount/recovery failed: error -30
[588571.881123] XFS (nbd1): log mount failed
Some searching turns up the -o norecovery option for XFS, which might also work.
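If a driver wanted to try this, the node-side mount path could add the option per fstype. A minimal sketch using k8s.io/mount-utils (whether these options are safe defaults is exactly the open question here):

```go
package example

import (
	mount "k8s.io/mount-utils"
)

// mountReadOnly sketches mounting a controller-published read-only volume,
// adding the per-filesystem option that skips journal/log recovery. This only
// illustrates the idea; drivers would need to validate behavior per kernel.
func mountReadOnly(devicePath, targetPath, fsType string) error {
	opts := []string{"ro"}
	switch fsType {
	case "ext4":
		opts = append(opts, "noload") // skip journal replay on ext4
	case "xfs":
		opts = append(opts, "norecovery") // skip log recovery on XFS
	}
	mounter := mount.New("" /* default mount binary lookup */)
	return mounter.Mount(devicePath, targetPath, fsType, opts)
}
```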
In summary, I like this change but I'd urge that CSI driver authors consider special handling for mounting of volumes published read-only at the controller level. And if users do get stuck with unmountable volumes, be ready to suggest the workaround of using a writable volume.
@bswartz @pohly thanks for the input.. first of all, good discussion about this area of concern.
Create a writable volume, with an ext4 (or other journaling) filesystem
Pod attaches to the volume, and begins writing data
Node containing that pod crashes, loses power, or otherwise dies suddenly
Take a snapshot of the volume before it is attached to another pod
....
Let me share some thoughts here: in general, a storage vendor quiesces in-flight I/Os and puts write barriers in place to make sure the data is persisted correctly while a snapshot is taken. That's one of the main requirements of consistent snapshots. So, in the above scenario, this would ideally be achieved by the SP; from the CSI layer, I assume it should not be of much concern. In other words, a snapshot taken of an in-use volume and a snapshot taken after a crash or node/pod loss come down to pretty much the same situation, and the SP, or whatever layer is responsible at the low storage level, should take care of it. When it comes to controller publish/node stage, again, the SP can decide to node-stage or controller-publish either respecting the PV spec or not.
Regardless, the mention of the possibility of an 'unclean journal' situation, and of a workaround/solution for it, is indeed a good point 👍. I am not sure about the different ext4 and XFS mount options mentioned; I have to look into the details.
About the direction of this PR and its completion, I feel it's good to wait for some more input from SIG Storage.
Probably the best thing to do before merging this fix would be to add documentation to the CSI spec around the PUBLISH_READONLY controller capability. Drivers that assert that capability will need to be prepared to mount unmodifiable volumes on the nodes, which means understanding how to deal with unclean journals for each of the fstypes the driver supports.
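For reference, the capability is already in the CSI Go bindings; a sketch (error handling trimmed) of how a sidecar might check whether a driver asserts it:

```go
package example

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// supportsPublishReadonly asks the controller service whether it advertises
// PUBLISH_READONLY, i.e. whether it accepts readonly=true in
// ControllerPublishVolume. Sketch only; real callers need timeouts and retries.
func supportsPublishReadonly(ctx context.Context, client csi.ControllerClient) (bool, error) {
	rsp, err := client.ControllerGetCapabilities(ctx, &csi.ControllerGetCapabilitiesRequest{})
	if err != nil {
		return false, err
	}
	for _, c := range rsp.GetCapabilities() {
		if c.GetRpc().GetType() == csi.ControllerServiceCapability_RPC_PUBLISH_READONLY {
			return true, nil
		}
	}
	return false, nil
}
```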
While we wait for others to share their thoughts, let me rebase the PR and keep it updated with the latest code.
@msau42 @xing-yang @saad-ali @jsafrane ptal 👍
@humblec In the proposal described in kubernetes/kubernetes#70505, there is also the following. I don't see that addressed by this PR?
@xing-yang I was under the impression that the SC parameter is an optional one. Or is it mandatory that we expose the SC parameter too?
@humblec based on kubernetes/kubernetes#70505, "Add new csiControllerPublishVolumeReadOnly StorageClass parameter to CSI external-provisioner" should be addressed as well. @saad-ali can you chime in, as you wrote the original proposal?
Somehow I was under the impression that it's either 1 or 2 in that proposal :), so I was trying to clarify. I will wait for @saad-ali's thoughts to double-confirm.
[Changes in the last refresh compared to previous version]
PTAL.. Thanks 👍
@@ -103,17 +103,13 @@ var (
	nodeDeploymentImmediateBinding = flag.Bool("node-deployment-immediate-binding", true, "Determines whether immediate binding is supported when deployed on each node.")
	nodeDeploymentBaseDelay = flag.Duration("node-deployment-base-delay", 20*time.Second, "Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding.")
	nodeDeploymentMaxDelay = flag.Duration("node-deployment-max-delay", 60*time.Second, "Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding.")
	publishROXVol = flag.Bool("publish-rox-vol", false, "This option enables PV to be marked as readonly at nodepublish call if PVC accessmode has been set to ROX.")
From the analysis at kubernetes/kubernetes#70505, this impacts ControllerPublish, not NodePublish
sure.. updated, thanks @msau42. 👍
A leaderelection interface has been defined which is not in use at the moment. This commit removes it. Signed-off-by: Humble Chirammal <[email protected]>
@humblec: The following test failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@@ -103,17 +103,13 @@ var (
	nodeDeploymentImmediateBinding = flag.Bool("node-deployment-immediate-binding", true, "Determines whether immediate binding is supported when deployed on each node.")
	nodeDeploymentBaseDelay = flag.Duration("node-deployment-base-delay", 20*time.Second, "Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding.")
	nodeDeploymentMaxDelay = flag.Duration("node-deployment-max-delay", 60*time.Second, "Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding.")
	publishROXVol = flag.Bool("publish-rox-vol", false, "This option enables PV to be marked as readonly at controllerpublish call if PVC accessmode has been set to ROX.")
What do you think of calling the flag something like "controller-publish-readonly"? Also, in the help text, use the full CSI call name "ControllerPublishVolume".
Can you also update the README?
@msau42 yeah, that's a good suggestion; considering we are acting at controller publish, controller-publish-readonly suits better. Let me update the flag and README.
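For the record, the rename could look something like this in the var block above (a sketch; help text paraphrased, not the final wording):

```go
controllerPublishReadOnly = flag.Bool("controller-publish-readonly", false, "This option enables the PV to be marked as read-only at the ControllerPublishVolume call if the PVC access mode has been set to ROX.")
```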
@msau42 done.. thanks
@@ -417,7 +417,7 @@ func TestCreateDriverReturnsInvalidCapacityDuringProvision(t *testing.T) {
	pluginCaps, controllerCaps := provisionCapabilities()
	csiProvisioner := NewCSIProvisioner(clientSet, 5*time.Second, "test-provisioner", "test",
		5, csiConn.conn, nil, driverName, pluginCaps, controllerCaps, "", false, true, csitrans.New(), nil, nil, nil, nil, nil, false, defaultfsType, nil)
		5, csiConn.conn, nil, driverName, pluginCaps, controllerCaps, "", false, true, csitrans.New(), nil, nil, nil, nil, nil, false, defaultfsType, nil, true)
We should investigate this separately, but I think the number of function arguments is getting to the point that we should probably have an options struct.
Completely agree here; the argument list has grown a lot, and even test case organization is difficult in general in the current state. I will revisit this part in a follow-up PR. Thanks.
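Sketching the idea (hypothetical names, not the actual follow-up):

```go
package example

import "time"

// ProvisionerOptions is a hypothetical options struct that could replace the
// long positional argument list of NewCSIProvisioner; callers would set only
// the fields they care about and rely on zero values elsewhere.
type ProvisionerOptions struct {
	Timeout                   time.Duration
	ProvisionerName           string
	DriverName                string
	DefaultFSType             string
	ControllerPublishReadOnly bool
	// ...remaining parameters become named fields here
}
```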
Based on the newly added "controller-publish-read-only" flag and the PVC access mode, the `ReadOnly` field in the CSI persistent volume source is set to `true`. Signed-off-by: Humble Chirammal <[email protected]>
@humblec, please update
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: humblec, msau42, saad-ali. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Removing now that PR is LGTM'd and approved. /hold cancel
/kind feature
Release tools update
Refer: kubernetes/kubernetes#70505
Signed-off-by: Humble Chirammal <[email protected]>