rootfs: switch ms_private remount of oldroot to ms_slave #1500
Conversation
Using MS_PRIVATE meant that there was a race between the mount(2) and the umount2(2) calls where runc inadvertently has a live reference to a mountpoint that existed on the host (which the host cannot kill implicitly through an unmount and peer sharing). In particular, this means that if we have a devicemapper mountpoint and the host is trying to delete the underlying device, the delete will fail because it is "in use" during the race. While the race is _very_ small (and libdm actually retries to avoid these sorts of cases) this appears to manifest in various cases.

Signed-off-by: Aleksa Sarai <[email protected]>
This is a proposed fix for moby/moby#33846, I'm just posting it here because I'm not actually sure why we're using MS_PRIVATE here in the first place.
@rhvgoyal can you take a look at this?
This patch makes sense to me. Making it slave will make sure that if the host unmounts it, it goes away from here too. What I am not sure about is whether mount propagation is synchronous. IOW, by the time umount() on the host has finished, is it guaranteed that the mount point has gone away in the other slave mount namespace or not? Or is it asynchronous, and the unmount will be propagated soon after? In the latter case, a small race window will still exist and will require retrying device removal in a loop. Which we already do, so it probably is fine. So as long as testing of this patch passes, it looks fine to me. LGTM
I will ask some of our kernel folks, as the VFS code has always been particularly hairy to read through (for me at least). I would hope it is synchronous, but I imagine there might be some races that force it to be asynchronous. |
I have a sneaking suspicion that this patch will actually also help fix the reproducer for moby/moby#34542. Note that it doesn't fix the /underlying/ problem. |
@cyphar If it is just a small race, then it should not matter, as the devicemapper graph driver tries device deactivation in a loop. If we can't deactivate the device on the first try, we will do it on the second try. So this is not a must, AFAIU.
@rhvgoyal We've been hitting kernel oopses with
@cyphar You probably know this, but we are using oci-umount to clean up these leaks as an OCI hook.
@cyphar Can you provide a backtrace of that oops? Looking at it might give some clues. If devicemapper oopses, then we should fix devicemapper, as that's the root cause of the issue. A mount being present in more than one mount namespace with private propagation should be just fine.
@rhatdan Yeah, I'm aware of that. @rhvgoyal It appears to be an XFS bug; there are a couple of SUSE kernel engineers already working on it. While all of that is well and good, the current scheme still has the race condition, which may confuse any number of
@cyphar So has it been verified that mount propagation is synchronous? If not, then this patch does not solve the race condition.
I'm not an approver here, but this LGTM. I think this also fixes a theoretical race under docker+overlay (per my comment/explanation here).
@cyphar Should we merge this, or do you still need to verify?
We can merge this IMO.
rootfs: switch ms_private remount of oldroot to ms_slave
LGTMs: @crosbymichael @hqhq
Closes #1500
/cc @cpuguy83 @kolyshkin |
Makes sense, I think we have a similar issue in our |