Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsenter: guarantee correct user namespace ordering #977

Merged
merged 3 commits into from
Oct 26, 2016
Merged

nsenter: guarantee correct user namespace ordering #977

merged 3 commits into from
Oct 26, 2016

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Aug 8, 2016

Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should always be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right priviliges within them. This also is very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.

This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.

In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more than we can do.

TODO:

  • Tag each namespace we want to join, to avoid future bugs.
  • Guarantee correct namespace ordering.

Signed-off-by: Aleksa Sarai [email protected]

This is based on #950. Base has been merged.

@cyphar
Copy link
Member Author

cyphar commented Aug 17, 2016

Alright @opencontainers/runc-maintainers this is the next PR in the "rewriting nsenter" series of patches. PTAL, the changes here are quite a bit more intrusive than #950.


/* This *must* be called before we touch gid_map. */
static void update_setgroups(int pid, bool setgroup)
static void update_setgroups(int pid, enum policy_t setgroup)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think we ever set this to deny or noop when we go into here ?, seems like something like allow_setgroup(pid) is simpler.

Copy link
Member Author

@cyphar cyphar Aug 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For rootless containers, we need to set deny. This patch is based on the rootless containers patch, and I don't see why we have to restrict it (it'll be useful once I rebase the rootless containers PR on top of the nsenter cleanup -- something I'm not looking forward to). If we remove it now, I'll just have to re-add it later.

@cyphar
Copy link
Member Author

cyphar commented Aug 21, 2016

Rebase'd.

@crosbymichael
Copy link
Member

getting a ton of issues building this patch.

../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:200:4: note: in expansion of macro ‘bail’
    bail("failed to write '%s' to /proc/%d/setgroups", policy, pid);
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:137:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &ret, sizeof(ret));  \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:200:4: note: in expansion of macro ‘bail’
    bail("failed to write '%s' to /proc/%d/setgroups", policy, pid);
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c: In function ‘update_uidmap’:
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:136:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &s, sizeof(s));   \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:210:3: note: in expansion of macro ‘bail’
   bail("failed to update /proc/%d/uid_map", pid);
   ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:137:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &ret, sizeof(ret));  \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:210:3: note: in expansion of macro ‘bail’
   bail("failed to update /proc/%d/uid_map", pid);
   ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c: In function ‘update_gidmap’:
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:136:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &s, sizeof(s));   \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:219:3: note: in expansion of macro ‘bail’
   bail("failed to update /proc/%d/gid_map", pid);
   ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:137:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &ret, sizeof(ret));  \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:219:3: note: in expansion of macro ‘bail’
   bail("failed to update /proc/%d/gid_map", pid);

@cyphar
Copy link
Member Author

cyphar commented Aug 24, 2016

None of that is meant to happen. I'll fix those up today. :P

@cyphar
Copy link
Member Author

cyphar commented Aug 26, 2016

Alright, I've fixed them up. Really the warnings aren't an issue (they're happening in an error path and we can't do anything if the writes fail).

@hqhq
Copy link
Contributor

hqhq commented Aug 30, 2016

@cyphar Janky failed seems related.

@cyphar
Copy link
Member Author

cyphar commented Aug 30, 2016

I remember fixing the same issue a week ago. It might be a dodgy rebase. :/

@cyphar
Copy link
Member Author

cyphar commented Aug 31, 2016

@hqhq Alright, I fixed that issue. It was because I had misplaced brackets in the bail macro. Dammit. 😒

PTAL /cc @opencontainers/runc-maintainers

@@ -0,0 +1,49 @@
/*
* Copyright 2014, 2015 Docker Inc.
* Copyright 2015, 2016 The Linux Foundation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct, the original copyright in the license stays as it is, it doesn't just stop because the repo changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm also not a fan of starting to add these on the files, we have a overall license and it applies to all files in the repo, we don't need them per file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, it's a matter of taste I guess. I'll drop the headers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar thanks

@cyphar cyphar mentioned this pull request Sep 4, 2016
9 tasks
@cyphar cyphar mentioned this pull request Sep 12, 2016
46 tasks
@@ -87,7 +87,7 @@ func TestNsenterInvalidPaths(t *testing.T) {

namespaces := []string{
// join pid ns of the current process
fmt.Sprintf("/proc/%d/ns/pid", -1),
fmt.Sprintf("pid:/proc/%d/ns/pid", -1),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should try "pid:/proc/%d/ns/net" here to test the ns path validation?

Copy link
Member Author

@cyphar cyphar Sep 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea. I've added an extra test for this, so we can test both cases (invalid type specifier and invalid path).

cyphar added 2 commits October 4, 2016 16:17
This avoids us from running into cases where libcontainer thinks that a
particular namespace file is a different type, and makes it a fatal
error rather than causing broken functionality.

Signed-off-by: Aleksa Sarai <[email protected]>
Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should *always* be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right priviliges within them. This also is very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.

This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.

In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more than we can do.

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar
Copy link
Member Author

cyphar commented Oct 4, 2016

Pinging this since it's needed for #975 (which is blocking -rc3).

/cc @opencontainers/runc-maintainers

@crosbymichael
Copy link
Member

crosbymichael commented Oct 7, 2016

LGTM

Approved with PullApprove

@mrunalp
Copy link
Contributor

mrunalp commented Oct 7, 2016

I tested on RHEL 7.2. It is failing :/

 oci-runtime-tool generate --tty --output=config.json --uidmappings 1000:0:32000 --gidmappings 1000:0:32000

strace -f -o strace.log runc run 1234
unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET) = -1 EPERM (Operation not permitted)

@cyphar
Copy link
Member Author

cyphar commented Oct 8, 2016

@mrunalp Wasn't that fixed by #975? It was split out of this PR.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 8, 2016

Pretty sure I had run the tests with 975 on RHEL once earlier hence this surprised me. I will debug further.

On Oct 7, 2016, at 9:55 PM, Aleksa Sarai [email protected] wrote:

@mrunalp Wasn't that fixed by #975? It was split out of this PR.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@cyphar
Copy link
Member Author

cyphar commented Oct 12, 2016

@mrunalp Any update?

@mrunalp
Copy link
Contributor

mrunalp commented Oct 12, 2016

@cyphar I see the same issue with #975 as well. I have reached out to RHEL kernel team and will get back once I hear their recommendations.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 12, 2016

I played around a bit and I can get past this issue with this diff. I don't think this fixes #959. It will most likely need changes similar to what I posted in #959. I will test that and get back in a bit.

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index d3a50b0..a28c202 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -626,10 +626,9 @@ void nsexec(void)
                         * affect our ability to unshare other namespaces and are used as
                         * context for privilege checks.
                         */
+                       if (unshare(config.cloneflags) < 0)
+                               bail("failed to unshare namespaces");
                        if (config.cloneflags & CLONE_NEWUSER) {
-                               /* Create a new user namespace. */
-                               if (unshare(CLONE_NEWUSER) < 0)
-                                       bail("failed to unshare user namespace");

                                /*
                                 * We don't have the privileges to do any mapping here (see the
@@ -647,17 +646,8 @@ void nsexec(void)
                                if (s != SYNC_USERMAP_ACK)
                                        bail("failed to sync with parent: SYNC_USERMAP_ACK: got %u", s);

-                               config.cloneflags &= ~CLONE_NEWUSER;
                        }

-                       /*
-                        * Now we can unshare the rest of the namespaces. We can't be sure if the
-                        * current kernel supports clone(CLONE_PARENT | CLONE_NEWPID), so we'll
-                        * just do it the long way anyway.
-                        */
-                       if (unshare(config.cloneflags) < 0)
-                               bail("failed to unshare namespaces");
-
                        /* TODO: What about non-namespace clone flags that we're dropping here? */
                        child = clone_parent(&env, JUMP_INIT);
                        if (child < 0)

@mrunalp
Copy link
Contributor

mrunalp commented Oct 12, 2016

I confirm that #959 isn't fixed with my diff above. The changes have to be ported exactly from #959. If you want I can create a PR for that.

@cyphar
Copy link
Member Author

cyphar commented Oct 12, 2016

:/ I really don't like the code in #959, because it breaks up unsharing of IPC. Unfortunately I don't really have a system to test that on...

@cyphar
Copy link
Member Author

cyphar commented Oct 12, 2016

@mrunalp Since the SELinux issue still happens on master, is it possible for us to punt on fixing the IPC issue for now (there are other PRs that depend on this one)? I'll take some time to take a look at another way of implementing your delay IPC logic that doesn't require special-casing IPC. Is there a nice way of testing an SELinux setup without access to a RHEL machine (a Fedora VM or something?) -- I don't have a SUSE setup with SELinux atm.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 12, 2016

@cyphar I am fine with fixing #959 in a separate PR. I fear there is really no other way besides what I did :). You can use a Fedora 24 VM to reproduce the issue. Just need to compile runc with selinux build tag and use the config in #959.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 12, 2016

I will run other tests on RHEL7 and report back here to check that there are no regressions vs master.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 17, 2016

@cyphar I ran the tests on master vs this PR + my diff and they are the same (There are 4 failures) unrelated to this PR. If you can update the PR to include my suggested changes, we can merge this.

@cyphar
Copy link
Member Author

cyphar commented Oct 18, 2016

@mrunalp Can this issue be reproduced on a stock Fedora / CentOS VM? I'd like to play around with it a bit to see if I can get another solution to work. If you also know approximately where in the kernel the relevant checks are (presumably in the bowels of SELinux) that'd be great.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 18, 2016

I meant the diff I posted for combining the unshare of all namespaces without which we don't get test parity with master on RHEL 7. For 959 which we can handle separately you can use a Fedora VM to debug.

On Oct 17, 2016, at 5:14 PM, Aleksa Sarai [email protected] wrote:

@mrunalp Can this issue be reproduced on a stock Fedora / CentOS VM? I'd like to play around with it a bit to see if I can get another solution to work.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@cyphar
Copy link
Member Author

cyphar commented Oct 18, 2016

@mrunalp AFAICS that patch will break rootless containers (#774), but I'll apply it and I can look at what we can do to fix that issue later.

Without this patch applied, RHEL's SELinux policies cause container
creation to not really work. Unfortunately this might be an issue for
rootless containers (#774) but we'll cross that
bridge when we come to it.

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar
Copy link
Member Author

cyphar commented Oct 18, 2016

@mrunalp I've added that patch as a separate commit so that we can track the old way of doing it in the future. PTAL, it should be good to merge now.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 18, 2016

LGTM

Approved with PullApprove

@mrunalp
Copy link
Contributor

mrunalp commented Oct 19, 2016

@crosbymichael This needs LGTM again if you have no more comments.

/*
* Okay, so this is quite annoying.
*
* In order to make sure that deal with older kernels (when CLONE_NEWUSER
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What version of old kernels are we talking about here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mentioned later in the comment.

This was fixed in upstream commit 1f7f4dde5c945f41a7abc2285be43d918029ecc5, and was introduced by 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that's for CLONE_PATENT and CLONE_NEWPID thing, is that the same thing as CLONE_NEWUSER not be handled first?

I was thinking if it's not a too new version, maybe we can claim it as a limitation, making the logic much more complicated to handle such a legacy issue seems not a worthy tradeoff to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right. The actual reason for the split is the various issues described underneath the section. This split is the whole reason for this PR (so removing the split would make this PR kinda pointless), as there are several legitimate non-legacy issues fixed by this split such as #975.

I can fix the comment, but I don't really want to wait another two weeks for two LGTMs because people are quite busy. I can fix the comment in a later PR if it's not a huge issue for you right now.

@hqhq
Copy link
Contributor

hqhq commented Oct 26, 2016

Over all it looks good to me, nothing big worth to be a blocker, I found some issues which can all be fixed in following up PRs, it's been quite a while and block lots of other things. Let's move on.

LGTM

Approved with PullApprove

@hqhq hqhq merged commit 157a96a into opencontainers:master Oct 26, 2016
@cyphar cyphar deleted the nsenter-userns-ordering branch October 26, 2016 08:45
@cyphar
Copy link
Member Author

cyphar commented Oct 26, 2016

@hqhq ❤️ I can open a PR with some of your nits later.

wking added a commit to wking/opencontainer-runtime-spec that referenced this pull request Oct 27, 2016
Namespaces do not need repeated entries and the ordering is handled by
the runtime regardless of the spec ordering (e.g. in runC [1]).  Using
an object leans on the new wording from eeaccfa (glossary: Make
objects explicitly unordered and forbid duplicate names, 2016-09-27,
opencontainers#584) to make both of those points explicit.

[1]: opencontainers/runc#977
     Subject: nsenter: guarantee correct user namespace ordering

Signed-off-by: W. Trevor King <[email protected]>
wking added a commit to wking/opencontainer-runtime-spec that referenced this pull request Oct 27, 2016
Namespaces do not need repeated entries and the ordering is handled by
the runtime regardless of the spec ordering (e.g. in runC [1]).  Using
an object leans on the new wording from eeaccfa (glossary: Make
objects explicitly unordered and forbid duplicate names, 2016-09-27,
opencontainers#584) to make both of those points explicit.

[1]: opencontainers/runc#977
     Subject: nsenter: guarantee correct user namespace ordering

Signed-off-by: W. Trevor King <[email protected]>
wking added a commit to wking/opencontainer-runtime-spec that referenced this pull request Oct 27, 2016
Namespaces do not need repeated entries and the ordering is handled by
the runtime regardless of the spec ordering (e.g. in runC [1]).  Using
an object leans on the new wording from eeaccfa (glossary: Make
objects explicitly unordered and forbid duplicate names, 2016-09-27,
opencontainers#584) to make both of those points explicit.

[1]: opencontainers/runc#977
     Subject: nsenter: guarantee correct user namespace ordering

Signed-off-by: W. Trevor King <[email protected]>
* way anyway.
*/
if (unshare(config.cloneflags) < 0)
bail("failed to unshare namespaces");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar I didn't notice you append this commit before, does this mean we actually still can't guarantee correct user namespace ordering for old kernels which don't create user namespace first when it's specified along with other CLONE_NEW* flags?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was an SELinux issue that meant that splitting it would cause some issues. Ask @mrunalp for more information. So, while this currently doesn't fix the actual issue I listed in the PR description there are a bunch of plumbing changes that make rootless containers as well as #975 and #976 possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I see that after revisiting comments, so looks like we don't have any better ideas to fix the issue (user namespace joining ordering on old kernels) for now. Can you open a PR to update comments accordingly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I've opened #1165 accordingly. PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants