Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rootfs: make pivot_root not use a temporary directory #1125

Merged
merged 1 commit into from
Oct 24, 2016
Merged

rootfs: make pivot_root not use a temporary directory #1125

merged 1 commit into from
Oct 24, 2016

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Oct 18, 2016

Namely, use an undocumented feature of pivot_root(2) where
pivot_root(".", ".") is actually a feature and allows you to make the
old_root be tied to your /proc/self/cwd in a way that makes unmounting
easy. Thanks a lot to the LXC developers which came up with this idea
first.

This is the first step of many to allowing runC to work with a
completely read-only rootfs.

Fixes #1122.

Signed-off-by: Aleksa Sarai [email protected]

@mrunalp
Copy link
Contributor

mrunalp commented Oct 18, 2016

cool, testing on RHEL.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 18, 2016

@rhvgoyal could you also try this out?

@mrunalp
Copy link
Contributor

mrunalp commented Oct 18, 2016

okay, this messed up the mounts on my host while testing.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 18, 2016

After running the tests on the host, this is all that remained.

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=922032k,nr_inodes=230508,mode=755)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
/dev/vda1 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)

@cyphar
Copy link
Member Author

cyphar commented Oct 18, 2016

Ah, fun. Yeah, this is kinda what I was worried about. I'll need to play around with this more -- the code for sys_pivot_root is not very fun code. My guess is that there's something very wrong with how umount(.) is acting. However, I'm really worried the issue is that one thread does a Chdir but another thread then continues the rest of the mounting.

@cyphar
Copy link
Member Author

cyphar commented Oct 18, 2016

@mrunalp Does this patch help?

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 4f21416ce233..0014cb82e6e6 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -10,6 +10,7 @@ import (
        "os/exec"
        "path"
        "path/filepath"
+       "runtime"
        "strings"
        "syscall"
        "time"
@@ -566,6 +567,13 @@ func setupPtmx(config *configs.Config, console *linuxConsole) error {
 // pivotRoot will call pivot_root such that rootfs becomes the new root
 // filesystem, and everything else is cleaned up.
 func pivotRoot(rootfs string) error {
+       // We change the cwd during this function. Because Go is multithreaded and
+       // it doesn't appear to correctly implement POSIX thread semantics, we have
+       // to ensure we don't switch threads here. Otherwise we start napalming the
+       // host's filesystem.
+       runtime.LockOSThread()
+       defer runtime.UnlockOSThread()
+
        // While the documentation may claim otherwise, pivot_root(".", ".") is
        // actually valid. What this results in is / being the new root but
        // /proc/self/cwd being the old root. Since we can play around with the cwd

@mrunalp
Copy link
Contributor

mrunalp commented Oct 18, 2016

@cyphar Nope, that patch doesn't help :/

// Make pivotDir rprivate to make sure any of the unmounts don't
// propagate to parent.
if err := syscall.Mount("", pivotDir, "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
if err := syscall.Unmount(".", syscall.MNT_DETACH); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to mark all old mounts as private before umounting them
syscall.Mount("", ".", "", syscall.MS_PRIVATE|syscall.MS_REC, "")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avagin Yes, we were making the old mounts private before but it got removed in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, whoops. I'll fix this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've applied this fix. PTAL.

@avagin
Copy link
Contributor

avagin commented Oct 18, 2016

@cyphar @mrunalp mount(NULL, ".", NULL, MS_REC | MS_PRIVATE, NULL) before umount(".") should solve this problem

@rhvgoyal
Copy link
Contributor

This looks like a cool idea. Will try tomorrow morning.

return err
}
syscall.Close(oldroot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already have a defer syscall.Close(oldroot). I am wondering why this call is required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was to make it clear that we didn't need that fd after that point, but I've removed it.


// We cannot umount(".") because currently our . is newroot. So we need to
// switch back to oldroot before doing a MNT_DETACH (since we still have an
// open fd to it).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looks like you are changing cwd of the process and root of the process continues to be rootfs. IOW, . being new root is not the problem. Probably problem is that . is cwd of the process and it will keep mounts busy. Hence you are changing cwd to oldroot so that unmount succeeds.

If this is correct, it might be a good idea to modify comment a bit to reflect that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MNT_DETACH should allow us to unmount it without changing directories, from my reading of pivot_root it does a lot of tomfoolery that I believe might change the cwd. But I'll double check.

Copy link
Member Author

@cyphar cyphar Oct 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused -- I just removed this line and the code still works. So I think you're right (that . is still oldroot) but you're wrong that MNT_DETACH will fail if we unmount without changing directories.

@crosbymichael
Copy link
Member

code looks good.

Is it safe to test on my machine? i'm scared....

@rhvgoyal
Copy link
Contributor

I tested it on my machine and it worked for me.

Namely, use an undocumented feature of pivot_root(2) where
pivot_root(".", ".") is actually a feature and allows you to make the
old_root be tied to your /proc/self/cwd in a way that makes unmounting
easy. Thanks a lot to the LXC developers which came up with this idea
first.

This is the first step of many to allowing runC to work with a
completely read-only rootfs.

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar
Copy link
Member Author

cyphar commented Oct 20, 2016

@rhvgoyal While the fchdir(oldroot) isn't necessary (from what I can see) I decided to play it safe and keep it anyway (updating the comment to note that it isn't really necessary) -- since pivot_root doesn't really guarantee what your cwd will be after you call it.

@hqhq
Copy link
Contributor

hqhq commented Oct 22, 2016

I think it's safer to keep fchdir(oldroot) in case kernel pivot_root change behavior.

LGTM

Approved with PullApprove

if err := syscall.Chdir("/"); err != nil {
return fmt.Errorf("chdir / %s", err)

// Currently our "." is oldroot (according to the current kernel code).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what does this comment mean. I suspect you are saying that cwd of process has not changed and oldroot is continuing to be cwd of the process?

If yes, old root is "/" and that's was not necessarily cwd of the process at the time of call to pivot_root().

May be say something like. pivot_root() does not guarantee what will happen to cwd of the calling process. It is a possibility that pivot_root() changed cwd to new root and in that case following umount() will fail. So to be safe, change dir to oldroot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you're right (I thought about it some more). pivot_root(a, b) will make a the new root and mount the old root at b. What we're doing here is unmounting the old root, because the old root is also on b. Since b is . it matters whether or not pivot_root has changed your cwd. It turns out that this does change your cwd in the current Linux implementation, but I don't want to depend on that (the mount logic is very dodgy in the kernel).

@mrunalp
Copy link
Contributor

mrunalp commented Oct 24, 2016

LGTM

Approved with PullApprove

@cyphar cyphar merged commit f8e6b5a into opencontainers:master Oct 24, 2016
cyphar added a commit that referenced this pull request Oct 24, 2016
@cyphar cyphar deleted the pivot_root-without-tmpdir branch October 25, 2016 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants