rootfs: switch ms_private remount of oldroot to ms_slave #1500
Conversation
Using MS_PRIVATE meant that there was a race between the mount(2) and the umount2(2) calls where runc inadvertently has a live reference to a mountpoint that existed on the host (which the host cannot kill implicitly through an unmount and peer sharing). In particular, this means that if we have a devicemapper mountpoint and the host is trying to delete the underlying device, the delete will fail because it is "in use" during the race. While the race is _very_ small (and libdm actually retries to avoid these sorts of cases) this appears to manifest in various cases.

Signed-off-by: Aleksa Sarai <[email protected]>
This is a proposed fix for moby/moby#33846, I'm just posting it here because I'm not actually sure why we're using MS_PRIVATE here in the first place.
@rhvgoyal can you take a look at this?
This patch makes sense to me. Making it slave will make sure that if the host unmounts it, it goes away from here too. What I am not sure about is whether mount propagation is synchronous. IOW, by the time umount() on the host has finished, is it guaranteed that the mount point has gone away in the other slave mount namespace or not? Or is it asynchronous, and the unmount will be propagated soon after? In the latter case, a small race window will still exist and will require retrying device removal in a loop. Which we already do, so it probably is fine. So as long as testing of this patch passes, it looks fine to me. LGTM
I will ask some of our kernel folks, as the VFS code has always been particularly hairy to read through (for me at least). I would hope it is synchronous, but I imagine there might be some races that force it to be asynchronous. |
I have a sneaking suspicion that this patch will actually also help fix the reproducer for moby/moby#34542. Note that it doesn't fix the /underlying/ problem. |
@cyphar If it is just a small race, then it should not matter, as the devicemapper graph driver tries device deactivation in a loop. If we can't deactivate the device on the first try, we will do it on the second try. So this is not a must, AFAIU.
@rhvgoyal We've been hitting kernel oopses with
@cyphar You probably know this, but we are using oci-umount to clean up these leaks as an OCI hook.
@cyphar Can you provide a backtrace of that oops? Looking at it might give some clues. If devicemapper oopses, then we should fix devicemapper, as that's the root cause of the issue. A mount being present in more than one mount namespace with private propagation should be just fine.
@rhatdan Yeah, I'm aware of that. @rhvgoyal It appears to be an XFS bug; there are a couple of SUSE kernel engineers already working on it. While all of that is well and good, the current scheme still has the race condition, which may confuse any number of
@cyphar So has it been verified that mount propagation is synchronous? If not, then this patch does not solve the race condition.
I'm not an approver here, but this LGTM. I think this also fixes a theoretical race under docker+overlay (per my comment/explanation here).
@cyphar Should we merge this, or do you still need to verify?
We can merge this IMO.
rootfs: switch ms_private remount of oldroot to ms_slave
LGTMs: @crosbymichael @hqhq
Closes #1500
/cc @cpuguy83 @kolyshkin |
Makes sense, I think we have a similar issue in our |