nsenter: guarantee correct user namespace ordering #977

cyphar · 2016-08-08T12:21:42Z

Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should always be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right privileges within them. This is also very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.
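
The ordering rule can be illustrated with a small sketch. This is not the code from this PR: `enum ns_kind`, `struct ns_entry`, and `userns_first` are hypothetical names made up for the example. It only shows the idea of moving the user namespace to the front of the list while leaving the relative order of the others alone.

```c
#include <stddef.h>

/* Hypothetical namespace kinds; nsexec.c works with CLONE_NEW* flags instead. */
enum ns_kind { NS_IPC, NS_MNT, NS_NET, NS_PID, NS_USER, NS_UTS };

/* Hypothetical record for one namespace to join. */
struct ns_entry {
	enum ns_kind kind;
	const char *path; /* e.g. /proc/<pid>/ns/user */
};

/*
 * Move the user namespace entry (if any) to the front of the list, keeping
 * the relative order of every other entry unchanged.
 */
static void userns_first(struct ns_entry *list, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (list[i].kind != NS_USER)
			continue;

		struct ns_entry user = list[i];

		/* Shift the entries before it one slot to the right. */
		for (size_t j = i; j > 0; j--)
			list[j] = list[j - 1];
		list[0] = user;
		return;
	}
}
```

A caller would run `userns_first` on the list before opening and joining each path in turn.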

This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.

In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more we can do.
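
A minimal sketch of the "child reports its own exit code" idea follows. It is an illustration under stated assumptions, not the nsexec.c implementation: it uses plain fork(2) and pipe(2) rather than clone(2) with CLONE_PARENT, and `run_and_collect` and `example_task` are hypothetical names.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical task, used only to demonstrate the helper below. */
static int example_task(void)
{
	return 42;
}

/*
 * Run fn() in a forked child and return the status the child reported over
 * a pipe, or -1 if the child went away without reporting anything.
 */
static int run_and_collect(int (*fn)(void))
{
	int fds[2];
	int status = -1;

	if (pipe(fds) < 0)
		return -1;

	pid_t pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: send the status explicitly before exiting. */
		close(fds[0]);
		int ret = fn();
		if (write(fds[1], &ret, sizeof(ret)) != (ssize_t)sizeof(ret))
			_exit(127);
		_exit(ret);
	}

	/* Parent: trust the pipe, not SIGCHLD or the wait status. */
	close(fds[1]);
	if (read(fds[0], &status, sizeof(status)) != (ssize_t)sizeof(status))
		status = -1;
	close(fds[0]);

	/* Reaping works here only because fork() made the process our child. */
	waitpid(pid, NULL, 0);
	return status;
}
```

With CLONE_PARENT the new process becomes a sibling rather than a child, so the final `waitpid` would not be available and the pipe would be the only channel.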

TODO:

Tag each namespace we want to join, to avoid future bugs.
Guarantee correct namespace ordering.

Signed-off-by: Aleksa Sarai [email protected]

~~This is based on #950.~~ Base has been merged.

cyphar · 2016-08-17T08:31:12Z

Alright @opencontainers/runc-maintainers this is the next PR in the "rewriting nsenter" series of patches. PTAL, the changes here are quite a bit more intrusive than #950.

dqminh · 2016-08-17T22:00:09Z

libcontainer/nsenter/nsexec.c


 /* This *must* be called before we touch gid_map. */
-static void update_setgroups(int pid, bool setgroup)
+static void update_setgroups(int pid, enum policy_t setgroup)


I don't think we ever set this to deny or noop when we get here. Something like allow_setgroup(pid) seems simpler.

For rootless containers, we need to set deny. This patch is based on the rootless containers patch, and I don't see why we have to restrict it (it'll be useful once I rebase the rootless containers PR on top of the nsenter cleanup -- something I'm not looking forward to). If we remove it now, I'll just have to re-add it later.
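
To make the allow/deny/noop distinction concrete, here is a hedged sketch of how such a policy could map onto /proc/<pid>/setgroups. The enum and function names are assumptions chosen for the example and may not match nsexec.c; the strings "allow" and "deny" are the values that file accepts.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy enum; the names used in nsexec.c may differ. */
enum setgroups_policy {
	SETGROUPS_NOOP,
	SETGROUPS_ALLOW,
	SETGROUPS_DENY,
};

/* Return the string to write to the setgroups file, or NULL for a no-op. */
static const char *setgroups_policy_string(enum setgroups_policy policy)
{
	switch (policy) {
	case SETGROUPS_ALLOW:
		return "allow";
	case SETGROUPS_DENY:
		return "deny";
	default:
		return NULL;
	}
}

/*
 * Write the policy to the given file (normally /proc/<pid>/setgroups).
 * Returns 0 on success or when there is nothing to do, -1 on failure.
 */
static int apply_setgroups(const char *path, enum setgroups_policy policy)
{
	const char *value = setgroups_policy_string(policy);
	if (value == NULL)
		return 0;

	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	size_t len = strlen(value);
	ssize_t written = write(fd, value, len);
	close(fd);

	return written == (ssize_t)len ? 0 : -1;
}
```

As the code comment in the diff notes, this has to happen before gid_map is touched.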

cyphar · 2016-08-21T15:26:51Z

Rebase'd.

crosbymichael · 2016-08-23T17:29:27Z

getting a ton of issues building this patch.

../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:200:4: note: in expansion of macro ‘bail’
    bail("failed to write '%s' to /proc/%d/setgroups", policy, pid);
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:137:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &ret, sizeof(ret));  \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:200:4: note: in expansion of macro ‘bail’
    bail("failed to write '%s' to /proc/%d/setgroups", policy, pid);
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c: In function ‘update_uidmap’:
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:136:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &s, sizeof(s));   \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:210:3: note: in expansion of macro ‘bail’
   bail("failed to update /proc/%d/uid_map", pid);
   ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:137:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &ret, sizeof(ret));  \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:210:3: note: in expansion of macro ‘bail’
   bail("failed to update /proc/%d/uid_map", pid);
   ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c: In function ‘update_gidmap’:
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:136:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &s, sizeof(s));   \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:219:3: note: in expansion of macro ‘bail’
   bail("failed to update /proc/%d/gid_map", pid);
   ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:137:4: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
    write(syncfd, &ret, sizeof(ret));  \
    ^
../development/gocode/src/github.com/opencontainers/runc/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:219:3: note: in expansion of macro ‘bail’
   bail("failed to update /proc/%d/gid_map", pid);

cyphar · 2016-08-24T08:33:34Z

None of that is meant to happen. I'll fix those up today. :P

cyphar · 2016-08-26T15:17:18Z

Alright, I've fixed them up. Really the warnings aren't an issue (they're happening in an error path and we can't do anything if the writes fail).

hqhq · 2016-08-30T09:14:41Z

@cyphar The Janky failure seems related.

cyphar · 2016-08-30T11:14:09Z

I remember fixing the same issue a week ago. It might be a dodgy rebase. :/

cyphar · 2016-08-31T12:35:36Z

@hqhq Alright, I fixed that issue. It was because I had misplaced brackets in the bail macro. Dammit. 😒

PTAL /cc @opencontainers/runc-maintainers
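
For readers following the build warnings and the bracket bug, the sketch below shows one common way to write an error-path macro of this kind. It is not the `bail` macro from nsexec.c: `send_error` and `bail_sketch` are hypothetical names, and `##__VA_ARGS__` is a GNU extension supported by gcc and clang. Wrapping the body in `do { ... } while (0)` keeps it a single statement, and consuming the result of write(2) avoids the `-Wunused-result` warning.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: best-effort write of an error code to a sync fd. */
static int send_error(int fd, int code)
{
	ssize_t n = write(fd, &code, sizeof(code));
	return n == (ssize_t)sizeof(code) ? 0 : -1;
}

/*
 * Print a message, try to notify the parent, then exit. The do/while(0)
 * wrapper makes the macro behave like one statement after an unbraced "if".
 */
#define bail_sketch(fd, code, fmt, ...)                                  \
	do {                                                                 \
		fprintf(stderr, "error: " fmt "\n", ##__VA_ARGS__);              \
		if (send_error((fd), (code)) < 0)                                \
			fprintf(stderr, "error: could not notify parent\n");         \
		exit(code);                                                      \
	} while (0)
```

A call such as `bail_sketch(syncfd, 1, "failed to update /proc/%d/uid_map", pid);` would then expand safely inside an unbraced `if`.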

crosbymichael · 2016-08-31T18:11:31Z

libcontainer/nsenter/namespace.h

@@ -0,0 +1,49 @@
+/*
+ * Copyright 2014, 2015 Docker Inc.
+ * Copyright 2015, 2016 The Linux Foundation


I don't think this is correct. The original copyright in the license stays as it is; it doesn't just stop because the repo changed.

I'm also not a fan of starting to add these to the files. We have an overall license that applies to all files in the repo, so we don't need them per file.

Fair enough, it's a matter of taste I guess. I'll drop the headers.

@cyphar thanks

x1022as · 2016-09-13T12:48:02Z

libcontainer/nsenter/nsenter_test.go

@@ -87,7 +87,7 @@ func TestNsenterInvalidPaths(t *testing.T) {

 	namespaces := []string{
 		// join pid ns of the current process
-		fmt.Sprintf("/proc/%d/ns/pid", -1),
+		fmt.Sprintf("pid:/proc/%d/ns/pid", -1),


should try "pid:/proc/%d/ns/net" here to test the ns path validation?

Yes, good idea. I've added an extra test for this, so we can test both cases (invalid type specifier and invalid path).
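
The two failure cases mentioned here (invalid type specifier and invalid path) can be sketched as follows. This is an illustration only: `struct ns_spec` and `parse_ns_spec` are hypothetical, and the list of known type names is an assumption based on the entries under /proc/<pid>/ns. The sketch covers only the specifier check; a path that points at a namespace of a different type is rejected by the kernel, because setns(2) fails with EINVAL when its nstype argument does not match the file descriptor.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parsed form of a "type:path" namespace specifier. */
struct ns_spec {
	char type[16];
	const char *path; /* points into the original string */
};

/* Assumed set of type names, mirroring the files under /proc/<pid>/ns. */
static const char *const known_types[] = {
	"ipc", "mnt", "net", "pid", "user", "uts",
};

/* Split "type:path" and reject unknown or missing types. 0 on success. */
static int parse_ns_spec(const char *spec, struct ns_spec *out)
{
	const char *colon = strchr(spec, ':');
	if (colon == NULL || colon == spec || colon[1] == '\0')
		return -1;

	size_t len = (size_t)(colon - spec);
	if (len >= sizeof(out->type))
		return -1;

	memcpy(out->type, spec, len);
	out->type[len] = '\0';
	out->path = colon + 1;

	for (size_t i = 0; i < sizeof(known_types) / sizeof(known_types[0]); i++) {
		if (strcmp(out->type, known_types[i]) == 0)
			return 0;
	}
	return -1;
}
```

With this helper, `"pid:/proc/1/ns/net"` parses successfully, which is exactly the case that only the typed setns(2) call can catch.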

This prevents us from running into cases where libcontainer thinks that a particular namespace file is a different type, and makes it a fatal error rather than causing broken functionality.

Signed-off-by: Aleksa Sarai <[email protected]>

cyphar · 2016-10-04T05:25:03Z

Pinging this since it's needed for #975 (which is blocking -rc3).

/cc @opencontainers/runc-maintainers

crosbymichael · 2016-10-07T22:20:29Z

LGTM

mrunalp · 2016-10-07T23:25:37Z

I tested on RHEL 7.2. It is failing :/

 oci-runtime-tool generate --tty --output=config.json --uidmappings 1000:0:32000 --gidmappings 1000:0:32000

strace -f -o strace.log runc run 1234
unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET) = -1 EPERM (Operation not permitted)

cyphar · 2016-10-08T04:55:12Z

@mrunalp Wasn't that fixed by #975? It was split out of this PR.

mrunalp · 2016-10-08T05:09:16Z

I'm pretty sure I had run the tests with #975 on RHEL once earlier, so this surprised me. I will debug further.

cyphar · 2016-10-12T09:23:12Z

@mrunalp Any update?

mrunalp · 2016-10-12T19:17:49Z

@cyphar I see the same issue with #975 as well. I have reached out to the RHEL kernel team and will get back once I hear their recommendations.

mrunalp · 2016-10-12T19:58:54Z

I played around a bit and I can get past this issue with this diff. I don't think this fixes #959. It will most likely need changes similar to what I posted in #959. I will test that and get back in a bit.

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index d3a50b0..a28c202 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -626,10 +626,9 @@ void nsexec(void)
                         * affect our ability to unshare other namespaces and are used as
                         * context for privilege checks.
                         */
+                       if (unshare(config.cloneflags) < 0)
+                               bail("failed to unshare namespaces");
                        if (config.cloneflags & CLONE_NEWUSER) {
-                               /* Create a new user namespace. */
-                               if (unshare(CLONE_NEWUSER) < 0)
-                                       bail("failed to unshare user namespace");

                                /*
                                 * We don't have the privileges to do any mapping here (see the
@@ -647,17 +646,8 @@ void nsexec(void)
                                if (s != SYNC_USERMAP_ACK)
                                        bail("failed to sync with parent: SYNC_USERMAP_ACK: got %u", s);

-                               config.cloneflags &= ~CLONE_NEWUSER;
                        }

-                       /*
-                        * Now we can unshare the rest of the namespaces. We can't be sure if the
-                        * current kernel supports clone(CLONE_PARENT | CLONE_NEWPID), so we'll
-                        * just do it the long way anyway.
-                        */
-                       if (unshare(config.cloneflags) < 0)
-                               bail("failed to unshare namespaces");
-
                        /* TODO: What about non-namespace clone flags that we're dropping here? */
                        child = clone_parent(&env, JUMP_INIT);
                        if (child < 0)

mrunalp · 2016-10-12T20:08:13Z

I confirm that #959 isn't fixed with my diff above. The changes have to be ported exactly from #959. If you want I can create a PR for that.

cyphar · 2016-10-12T20:56:01Z

:/ I really don't like the code in #959, because it breaks up unsharing of IPC. Unfortunately I don't really have a system to test that on...

cyphar · 2016-10-12T21:21:48Z

@mrunalp Since the SELinux issue still happens on master, is it possible for us to punt on fixing the IPC issue for now (there are other PRs that depend on this one)? I'll take some time to take a look at another way of implementing your delay IPC logic that doesn't require special-casing IPC. Is there a nice way of testing an SELinux setup without access to a RHEL machine (a Fedora VM or something?) -- I don't have a SUSE setup with SELinux atm.

mrunalp · 2016-10-12T22:59:27Z

@cyphar I am fine with fixing #959 in a separate PR. I fear there is really no other way besides what I did :). You can use a Fedora 24 VM to reproduce the issue. Just need to compile runc with selinux build tag and use the config in #959.

mrunalp · 2016-10-12T22:59:54Z

I will run other tests on RHEL7 and report back here to check that there are no regressions vs master.

mrunalp · 2016-10-17T21:55:11Z

@cyphar I ran the tests on master vs. this PR + my diff and the results are the same (there are 4 failures, unrelated to this PR). If you can update the PR to include my suggested changes, we can merge this.

cyphar · 2016-10-18T00:14:41Z

@mrunalp Can this issue be reproduced on a stock Fedora / CentOS VM? I'd like to play around with it a bit to see if I can get another solution to work. If you also know approximately where in the kernel the relevant checks are (presumably in the bowels of SELinux) that'd be great.

mrunalp · 2016-10-18T04:48:59Z

I meant the diff I posted for combining the unshare of all namespaces, without which we don't get test parity with master on RHEL 7. For #959, which we can handle separately, you can use a Fedora VM to debug.

cyphar · 2016-10-18T07:20:33Z

@mrunalp AFAICS that patch will break rootless containers (#774), but I'll apply it and I can look at what we can do to fix that issue later.

Without this patch applied, RHEL's SELinux policies cause container creation to not really work. Unfortunately this might be an issue for rootless containers (#774) but we'll cross that bridge when we come to it.

Signed-off-by: Aleksa Sarai <[email protected]>

cyphar · 2016-10-18T07:27:52Z

@mrunalp I've added that patch as a separate commit so that we can track the old way of doing it in the future. PTAL, it should be good to merge now.

mrunalp · 2016-10-18T17:31:38Z

LGTM

mrunalp · 2016-10-19T17:29:12Z

@crosbymichael This needs LGTM again if you have no more comments.

hqhq · 2016-10-26T02:12:22Z

libcontainer/nsenter/nsexec.c

+	/*
+	 * Okay, so this is quite annoying.
+	 *
+	 * In order to make sure that deal with older kernels (when CLONE_NEWUSER


What version of old kernels are we talking about here?

It's mentioned later in the comment.

This was fixed in upstream commit 1f7f4dde5c945f41a7abc2285be43d918029ecc5, and was introduced by 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e.

I thought that was for the CLONE_PARENT and CLONE_NEWPID thing. Is that the same thing as CLONE_NEWUSER not being handled first?

I was thinking that if the affected kernel version isn't too recent, maybe we can claim it as a limitation. Making the logic much more complicated to handle such a legacy issue doesn't seem like a worthwhile tradeoff to me.

Yeah, you're right. The actual reason for the split is the various issues described underneath the section. This split is the whole reason for this PR (so removing the split would make this PR kinda pointless), as there are several legitimate non-legacy issues fixed by this split such as #975.

I can fix the comment, but I don't really want to wait another two weeks for two LGTMs because people are quite busy. I can fix the comment in a later PR if it's not a huge issue for you right now.

hqhq · 2016-10-26T08:45:04Z

Overall it looks good to me; nothing is big enough to be a blocker. I found some issues that can all be fixed in follow-up PRs. This has been open for quite a while and blocks lots of other things. Let's move on.

LGTM

cyphar · 2016-10-26T08:46:08Z

@hqhq ❤️ I can open a PR with some of your nits later.

Namespaces do not need repeated entries and the ordering is handled by the runtime regardless of the spec ordering (e.g. in runC [1]). Using an object leans on the new wording from eeaccfa (glossary: Make objects explicitly unordered and forbid duplicate names, 2016-09-27, opencontainers#584) to make both of those points explicit.

[1]: opencontainers/runc#977
     Subject: nsenter: guarantee correct user namespace ordering

Signed-off-by: W. Trevor King <[email protected]>

hqhq · 2016-10-31T09:12:43Z

libcontainer/nsenter/nsexec.c

+			 * way anyway.
+			 */
+			if (unshare(config.cloneflags) < 0)
+				bail("failed to unshare namespaces");


@cyphar I didn't notice that you appended this commit before. Does this mean we still can't guarantee correct user namespace ordering on old kernels that don't create the user namespace first when it's specified along with other CLONE_NEW* flags?

There was an SELinux issue that meant that splitting it would cause some issues. Ask @mrunalp for more information. So, while this currently doesn't fix the actual issue I listed in the PR description there are a bunch of plumbing changes that make rootless containers as well as #975 and #976 possible.

Yeah, I see that after revisiting the comments. So it looks like we don't have any better ideas to fix the issue (user namespace join ordering on old kernels) for now. Can you open a PR to update the comments accordingly?

Yup, I've opened #1165 accordingly. PTAL.

GordonTheTurtle added the status/0-triage label Aug 8, 2016

This was referenced Aug 8, 2016

nsenter: major cleanups #950

Merged

nsenter: set {uid,gid} explicitly around namespace creation #975

Closed

nsenter: correctly handle pidns orphaning #976

Closed

cyphar added this to the 1.0.0 milestone Aug 9, 2016

dqminh reviewed Aug 17, 2016

crosbymichael reviewed Aug 31, 2016

cyphar mentioned this pull request Sep 4, 2016

Consoles, consoles, consoles. #1018

Merged

9 tasks

cyphar mentioned this pull request Sep 12, 2016

Rootless Containers #774

Merged

46 tasks

x1022as reviewed Sep 13, 2016

cyphar added 2 commits October 4, 2016 16:17

nsenter: specify namespace type in setns()

ed053a7

hqhq reviewed Oct 26, 2016

hqhq merged commit 157a96a into opencontainers:master Oct 26, 2016

cyphar deleted the nsenter-userns-ordering branch October 26, 2016 08:45

wking mentioned this pull request Oct 27, 2016

config-linux: Convert linux.namespaces from an array to an object opencontainers/runtime-spec#598

Closed

hqhq reviewed Oct 31, 2016

utam0k mentioned this pull request Aug 12, 2021

youki exec did not join the correct pid namespace youki-dev/youki#185

Closed
