
rootfs: switch ms_private remount of oldroot to ms_slave #1500

Merged: 1 commit into opencontainers:master on Oct 13, 2017

Conversation

cyphar
Member

@cyphar cyphar commented Jun 28, 2017

Using MS_PRIVATE meant that there was a race between the mount(2) and
the umount2(2) calls where runc inadvertently held a live reference to a
mountpoint that existed on the host (one which the host could not kill
implicitly through an unmount propagated via peer sharing).

In particular, this means that if we have a devicemapper mountpoint and
the host is trying to delete the underlying device, the delete will fail
because the device is "in use" during the race. While the race window is
very small (and libdm actually retries to avoid such failures), this
does appear to manifest in practice.

Signed-off-by: Aleksa Sarai [email protected]

@cyphar
Member Author

cyphar commented Jun 28, 2017

This is a proposed fix for moby/moby#33846, I'm just posting it here because I'm not actually sure why we're using MS_PRIVATE.

@crosbymichael
Member

@rhvgoyal can you take a look at this?

@rhvgoyal
Contributor

This patch makes sense to me. Making the mount a slave ensures that if the host unmounts it, it goes away here as well.

What I am not sure about is whether mount propagation is synchronous. IOW, by the time umount() on the host has finished, is it guaranteed that the mount point has gone away in the other slave mount namespaces, or is propagation asynchronous, with the unmount arriving shortly afterwards?

In the latter case a small race window will still exist, and device removal will still need to be retried in a loop, which we already do. So it is probably fine.

So as long as testing of this patch passes, it looks fine to me.

LGTM

@cyphar
Member Author

cyphar commented Jun 29, 2017

@rhvgoyal

What I am not sure about is if mount propagation is synchronous.

I will ask some of our kernel folks, as the VFS code has always been particularly hairy to read through (for me at least). I would hope it is synchronous, but I imagine there might be some races that force it to be asynchronous.

@cyphar
Member Author

cyphar commented Aug 19, 2017

I have a sneaking suspicion that this patch will actually also help fix the reproducer for moby/moby#34542. Note that it doesn't fix the *underlying* problem.

@crosbymichael
Member

crosbymichael commented Aug 21, 2017

LGTM

oh devmapper

Approved with PullApprove

@rhvgoyal
Contributor

@cyphar If it is just a small race, then it should not matter, as the devicemapper graph driver tries device deactivation in a loop. If we can't deactivate the device on the first try, we will do it on the second. So this is not a must, AFAIU.

@cyphar
Member Author

cyphar commented Aug 22, 2017

@rhvgoyal We've been hitting kernel oopses with devmapper recently because it looks like the namespace in which the actual kernel-side cleanup happens might be more important than it first seems. It's not clear whether this patch fixes the issue (we still can't reproduce it effectively) but it seems like namespace leaking can cause more than just annoyances, it can actually trigger panics.

@rhatdan
Contributor

rhatdan commented Aug 22, 2017

@cyphar You probably know this, but we are using oci-umount as an OCI hook to clean up these leaks.

@rhvgoyal
Contributor

@cyphar Can you provide a backtrace of that oops? Looking at it might give some clues. If devicemapper oopses, then we should fix devicemapper, as that is the root cause of the issue. A mount being present in more than one mount namespace with private propagation should be just fine.

@cyphar
Member Author

cyphar commented Aug 23, 2017

@rhatdan Yeah, I'm aware of that.

@rhvgoyal It appears to be an XFS bug; a couple of SUSE kernel engineers are already working on it. While all of that is well and good, the current scheme still has the race condition, which may confuse any number of runc users (such as CRI-O and so on). I'll paste the backtrace in a minute, after I download the coredump we were sent.

@rhvgoyal
Contributor

@cyphar So has it been verified that mount propagation is synchronous? If not, then this patch does not solve the race condition.

@euank
Contributor

euank commented Sep 22, 2017

I'm not an approver here, but this LGTM.

I think this also fixes a theoretical race under docker+overlay (per my comment/explanation here).

@hqhq
Contributor

hqhq commented Sep 25, 2017

LGTM

Approved with PullApprove

@mrunalp
Contributor

mrunalp commented Oct 10, 2017

@cyphar Should we merge this or you still need to verify?

@cyphar
Member Author

cyphar commented Oct 13, 2017

We can merge this IMO.

@cyphar cyphar merged commit 117c927 into opencontainers:master Oct 13, 2017
cyphar added a commit that referenced this pull request Oct 13, 2017
  rootfs: switch ms_private remount of oldroot to ms_slave

LGTMs: @crosbymichael @hqhq
Closes #1500
@cyphar cyphar deleted the rootfs-use-ms_slave branch October 13, 2017 22:33
@thaJeztah
Member

/cc @cpuguy83 @kolyshkin

@cpuguy83
Contributor

Makes sense, I think we have a similar issue in our chrootarchive package.
