A container can join pid namespace of another container. #609

dqminh · 2015-06-01T19:37:57Z

Fix #604

This supports setting configs.NEWPID to an existing PID namespace for a container's init process. It leverages the same C code infrastructure for execin to clone a new init process with the wanted PID namespace.

Edit: Because init and execin share some code, execin now also supports joining the userns of the init process. An init process can't join the mnt/userns of another init process yet.

mrunalp · 2015-06-01T21:33:50Z

nsenter/nsexec.c

@@ -60,6 +60,62 @@ static int clone_parent(jmp_buf * env)
 	return child;
 }

+void nsinit_pid()


Could we combine this logic into nsexec() itself?

dqminh · 2015-06-02

I tried that, but the end result wasn't as clean. Since spawning an init process with a known pid namespace and exec-ing a new process inside another container are two very different operations, I think that having two functions describes the intent better.

avagin · 2015-06-02T15:27:04Z

The same hack is required for mount and user namespaces, because a multithreaded process can't change them.
http://man7.org/linux/man-pages/man2/setns.2.html
'''
A process may not be reassociated with a new mount namespace if it is multithreaded.
'''

What do you think about creating a directory with namespaces in the container's root directory? We can bind-mount all required namespaces there before starting the container, and then bind-mount the container's namespace files there so they can be used for executing processes inside the CT. In this case we can use the same nsexec.c code for starting the first process and all others.

LK4D4 · 2015-06-02T16:00:25Z

I agree with @avagin that bind-mounting namespaces can be the right way to do this, because nsexec.c is supposed to be a very small layer only for setting namespaces, and expanding it can lead to us supporting a pretty big chunk of C code in the golang runtime, which will probably hit some limitations.
I'm not against merging this, though, if someone needs this feature right now :)

dqminh · 2015-06-03T03:32:07Z

@LK4D4 There are a few related issues in docker: moby/moby#13453 and moby/moby#10163. Overall I think it's better to do this properly (if this is not the best way), since 1.7 is already in RC now, so we have time :)

@avagin I don't understand the part about bind-mounting namespaces. Do you have an example of that? My current understanding is that we have a special directory containing namespace descriptors, which we can detect when starting the container and use to setns with some code from nsexec.c?

mrunalp · 2015-06-03T03:47:49Z

@dqminh The suggestion is to bind mount the /proc/self/ns/* files into some directory for each container (just like how ip netns add works: it bind mounts net namespace fds under /var/run/netns). After that is done, new processes could use those paths to join the container. This could also be used for joining a subset of another container's namespaces.

@avagin could correct me if I am wrong.
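A minimal Go sketch of the bind-mount idea described above, for readers following along. It is not code from this PR: `bindTarget`, `planNamespaceBinds`, `pinNamespaces` and the target directory layout are hypothetical names chosen for illustration. It assumes Linux and enough privileges (CAP_SYS_ADMIN) to call mount(2), and it only covers namespaces such as net, ipc and uts.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"syscall"
)

// bindTarget pairs a namespace file under /proc with the path it gets pinned to.
type bindTarget struct {
	Source string // e.g. /proc/1234/ns/net
	Target string // e.g. <dir>/net (hypothetical layout)
}

// planNamespaceBinds lists the bind mounts needed to pin the named
// namespaces of process pid under dir. It does not touch the filesystem.
func planNamespaceBinds(pid int, dir string, names []string) []bindTarget {
	plan := make([]bindTarget, 0, len(names))
	for _, name := range names {
		plan = append(plan, bindTarget{
			Source: filepath.Join("/proc", strconv.Itoa(pid), "ns", name),
			Target: filepath.Join(dir, name),
		})
	}
	return plan
}

// pinNamespaces performs the bind mounts in plan. Once a namespace file is
// bind-mounted, the namespace stays alive even if the original process exits.
func pinNamespaces(plan []bindTarget) error {
	for _, b := range plan {
		if err := os.MkdirAll(filepath.Dir(b.Target), 0755); err != nil {
			return err
		}
		// A bind mount needs an existing target, so create an empty file first.
		f, err := os.OpenFile(b.Target, os.O_CREATE|os.O_RDONLY, 0444)
		if err != nil {
			return err
		}
		f.Close()
		if err := syscall.Mount(b.Source, b.Target, "", syscall.MS_BIND, ""); err != nil {
			return fmt.Errorf("bind mount %s on %s: %w", b.Source, b.Target, err)
		}
	}
	return nil
}
```

A caller would run something like `pinNamespaces(planNamespaceBinds(pid, dir, []string{"net", "ipc"}))` once the container's init process is up, and later open the pinned paths to join those namespaces.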

mrunalp · 2015-06-03T03:51:57Z

@dqminh Example https://gist.github.com/mrunalp/5290c92b32fe709aadd3

LK4D4 · 2015-06-03T04:00:58Z

Also, you can destroy the original namespace owner, so bind-mounting is like "copying" the namespace.

mrunalp · 2015-06-03T04:02:26Z

Right, as long as the mounted namespace fds exist, the original container can just die without affecting those that joined those namespace fds.

mrunalp · 2015-06-03T04:06:15Z

Not sure how that will behave with pid namespaces though, since pid 1 going down brings down the other processes.

dqminh · 2015-06-03T05:15:44Z

> The suggestion is to bind mount /proc/self/ns/* files into some directory for each container (Just like how ip netns add works

Ah, I see. But that would not change the current implementation, right? All libcontainer needs to know is still the paths to the namespace descriptors.

> Not sure how that will behave with pid namespaces though since pid 1 going down brings down the other processes.

I think pid 1 going down will kill the rest of the processes, and you can't join using the open file descriptor anymore (http://man7.org/linux/man-pages/man7/pid_namespaces.7.html).

avagin · 2015-06-03T05:22:45Z

> Ah i see. But that would not change the current implementation right, since all libcontainer needs to know is still the paths to the namespace descriptors.

@dqminh I don't understand what you mean. Could you elaborate?

LK4D4 · 2015-06-03T05:30:58Z

@dqminh For the main process we use Cloneflags to create namespaces. The new idea is to create namespaces separately and then pass them in, even for the first process.

avagin · 2015-06-03T05:34:48Z

@LK4D4 @dqminh I'm not suggesting creating namespaces separately. For example, that's impossible for pidns. For the init process we would mount all existing namespaces there and set clone flags for the others.

LK4D4 · 2015-06-03T05:47:42Z

@avagin What's the point of bind-mounting then? Just to collect them in one directory?
I think we can then just pass a space-separated namespace list to nsexec.c, generated from config.Namespaces for init and from state.NamespacePaths for the others.

dqminh · 2015-06-03T06:20:41Z

Right, we can't really separate the creation of namespaces and processes, at least for the pid namespace. It always requires an anchor process.

The current way joining namespaces works for the init process is that the container has config.Namespaces, a key-value mapping from namespace type to descriptor path. The value can either be empty (in which case the key is used in CloneFlags) or point to a valid namespace descriptor (which we use to join the namespace in joinExistingNamespaces). So from libcontainer's side, the only requirement for init to join different namespaces is to set the key-value pair in config.Namespaces.

This works for net (the only current use case, afaik), but doesn't work for pid (and probably mount/userns, as @avagin pointed out). This patch adds another intermediate clone to put the init process in the correct pid namespace.

My understanding is that even if we bind-mount namespaces, we will still pass the mount destination as the value in config.Namespaces and perform setns before exec-ing the target process, so it will not be different from the current way.
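The config.Namespaces convention described above (empty value: create the namespace via clone flags; non-empty value: join the namespace at that path) can be sketched roughly as follows. This is an illustration only, not libcontainer's actual code: `NamespaceType`, `cloneFlags` and `splitNamespaces` are hypothetical names, and the flag values are the Linux CLONE_NEW* constants written out as literals.

```go
package main

import "fmt"

// NamespaceType is a hypothetical stand-in for the configs.NEW* keys
// discussed in this thread.
type NamespaceType string

const (
	NEWNS  NamespaceType = "NEWNS"
	NEWUTS NamespaceType = "NEWUTS"
	NEWIPC NamespaceType = "NEWIPC"
	NEWPID NamespaceType = "NEWPID"
	NEWNET NamespaceType = "NEWNET"
)

// cloneFlags holds the Linux CLONE_NEW* values from <linux/sched.h>,
// copied as literals so the sketch builds on any platform.
var cloneFlags = map[NamespaceType]uintptr{
	NEWNS:  0x00020000,
	NEWUTS: 0x04000000,
	NEWIPC: 0x08000000,
	NEWPID: 0x20000000,
	NEWNET: 0x40000000,
}

// splitNamespaces separates namespaces to be created from namespaces to be
// joined. An empty path means "create": the flag is OR-ed into the clone
// flags. A non-empty path means "join via setns on that descriptor path".
func splitNamespaces(ns map[NamespaceType]string) (uintptr, map[NamespaceType]string, error) {
	var flags uintptr
	join := make(map[NamespaceType]string)
	for kind, path := range ns {
		flag, ok := cloneFlags[kind]
		if !ok {
			return 0, nil, fmt.Errorf("unknown namespace type %q", kind)
		}
		if path == "" {
			flags |= flag
		} else {
			join[kind] = path
		}
	}
	return flags, join, nil
}
```

With `{NEWNET: "", NEWPID: "/proc/1/ns/pid"}` as input, the net namespace ends up in the clone flags and the pid namespace in the set of paths to join.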

> I think then we can pass just space-separated namespace list to nsexec.c where it will be generated from config.Namespaces for init and from state.NamespacePaths for other.

Hmm, yes, that is another way to do it, so we can reuse those blocks of code between init and execin. It also helps when we want to join other namespaces besides pid. joinExistingNamespaces can probably be removed in that case.

avagin · 2015-06-03T06:28:32Z

> @avagin What point of bind-mounting then? Just to collect them in one directory?

- collect in one directory
- be independent from the init process. If we have a container without pidns, we will be able to enter it even when the first process has died. For a container with pidns, this approach avoids a problem when the init process has died and another process is created with the same pid.

I'm not sure that we need to do this. It's just an idea for how it can be fixed.

> I think then we can pass just space-separated namespace list to nsexec.c where it will be generated from config.Namespaces for init and from state.NamespacePaths for other.

Yes, we can.

dqminh · 2015-06-10T10:58:51Z

@LK4D4 @avagin I added some code to pass a list of namespaces to be joined to both the init process and the execin process, so they now behave the same on the setns part.

All setns is now done in the C layer (joinExistingNamespaces is no longer necessary).

avagin · 2015-06-10T12:04:15Z

nsenter/nsexec.c

-		fd = openat(tfd, namespaces[i], O_RDONLY);
+	char *ns, *saveptr;
+	while ((ns = strtok_r(nspaths, ",", &saveptr))) {
+		fd = open(ns, O_RDONLY);


If you enter a mount namespace, you can't be sure that you will be able to enter the other namespaces, because nspaths may be inaccessible from this mount namespace.

dqminh · 2015-06-11

Right, I think the order of nspaths has to be set by the caller (which we are not doing now for init, but we do for exec). wdyt?

avagin · 2015-06-11

I agree that a caller can do this.

dqminh · 2015-06-15

Done. The order of namespaces is now enforced in orderNamespacePaths. I also tried to add NEWUSER there, but some tests failed. I added a note on NEWUSER and will try again in another patch only for NEWUSER.
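The caller-side ordering agreed on above can be sketched like this. It is a hedged illustration, not the orderNamespacePaths from the patch: `joinOrder` and `orderPaths` are hypothetical names, and the only ordering constraint taken from this thread is that the mount namespace comes last. The relative order of the other kinds is an assumption.

```go
package main

import "fmt"

// joinOrder is the order in which namespaces are entered. The mount
// namespace is last so that the other descriptor paths are still reachable
// when they are opened. The order of the remaining kinds is an assumption.
var joinOrder = []string{"ipc", "uts", "net", "pid", "mnt"}

// orderPaths takes a map from namespace kind to descriptor path and returns
// the paths in joinOrder. Kinds missing from joinOrder are rejected.
func orderPaths(paths map[string]string) ([]string, error) {
	ordered := make([]string, 0, len(paths))
	for _, kind := range joinOrder {
		if p, ok := paths[kind]; ok {
			ordered = append(ordered, p)
		}
	}
	if len(ordered) != len(paths) {
		return nil, fmt.Errorf("unsupported namespace kind in %v", paths)
	}
	return ordered, nil
}
```

Because the result is a slice rather than a map, the order survives being joined into a single string and split again on the C side.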

dqminh · 2015-06-15T09:15:42Z

@avagin I pushed a new patch that unifies the ordering of namespaces in init and exec, plus a few more changes. PTAL.

Also, I noticed that joining a mount namespace for the init process will not work right now because it still tries to setupRootfs. I can write a new PR to fix that (if we want an init process to join the mnt namespace of another container?).

avagin · 2015-06-21T20:38:09Z

container_linux.go

+	if len(nsMaps) > 0 {
+		nsPaths, err := orderNamespacePaths(nsMaps)
+		if err != nil {
+			return nil, newSystemError(err)


orderNamespacePaths() returns SystemError

avagin · 2015-06-23T21:44:55Z

container_linux.go

+		configs.NEWNS,
+	}
+	// For now, only join user namespace if this is an exec in process and the
+	// container supports user namespace


I would like to have an explanation of this restriction.

avagin · 2015-06-23T22:14:17Z

LGTM (except a few minor comments).

hqhq · 2015-07-08T01:04:19Z

@dqminh Can you backport this PR to runc? Thanks.

dqminh · 2015-07-08T11:02:28Z

@hqhq Thanks for reminding me! Things are a bit hectic on my side this week, but I should be able to get it done by the weekend (or sooner if possible).

An init process can join other namespaces (pidns, ipc etc.). This leverages C code defined in the nsenter package to spawn a process with the correct namespaces and clone if necessary. When a shared-pidns container exits, the original container keeps running.

This removes joinExistingNamespaces in the Go layer because all namespaces are now joined with setns in the C layer. This also changes the setns process to require a list of namespaces to be joined, rather than requiring only the pid of the init process and deducing the namespaces from it.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>

- check if the namespace is supported by the current kernel
- check that the pathname doesn't contain a comma `,`. This is the character we use to join paths together and pass them over as a single env variable, so we can split it out in the C layer.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
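The two checks in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the code from the commit: `namespaceSupported`, `encodeNamespacePaths` and the `_EXAMPLE_NSPATHS` variable name are all hypothetical. Probing /proc/self/ns is one common way to test for kernel support.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// nsPathsEnv is a hypothetical environment variable name used to hand the
// joined paths to the C layer.
const nsPathsEnv = "_EXAMPLE_NSPATHS"

// namespaceSupported reports whether the running kernel exposes the given
// namespace file (for example "pid" or "net") under /proc/self/ns.
func namespaceSupported(name string) bool {
	_, err := os.Stat(filepath.Join("/proc/self/ns", name))
	return err == nil
}

// encodeNamespacePaths joins paths with commas into a KEY=value string.
// A path that itself contains a comma is rejected, because the receiver
// splits on commas and could not tell the two cases apart.
func encodeNamespacePaths(paths []string) (string, error) {
	for _, p := range paths {
		if strings.Contains(p, ",") {
			return "", fmt.Errorf("namespace path %q contains a comma", p)
		}
	}
	return nsPathsEnv + "=" + strings.Join(paths, ","), nil
}
```

The resulting string can be appended to the `Env` of the `exec.Cmd` that starts the child, where the C code splits it back into individual paths.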

LK4D4 · 2015-07-14T01:36:33Z

ported to runc

GordonTheTurtle added the status/0-needs-triage label Jun 1, 2015

mrunalp reviewed Jun 1, 2015

dqminh force-pushed the pidns branch 2 times, most recently from 3d824b5 to 629eca4 Compare June 10, 2015 10:53

avagin reviewed Jun 10, 2015

dqminh force-pushed the pidns branch from 629eca4 to 85d76e7 Compare June 15, 2015 09:08

dqminh mentioned this pull request Jun 15, 2015

Propogate container's mountpoint to the host #632

Closed

dqminh force-pushed the pidns branch from 85d76e7 to 8119b4f Compare June 18, 2015 05:04

hqhq mentioned this pull request Jun 18, 2015

Allow PID namespace to be shared between container moby/moby#13453

Closed

avagin reviewed Jun 21, 2015

dqminh force-pushed the pidns branch from 8119b4f to 12e0e9d Compare June 22, 2015 04:25

avagin reviewed Jun 23, 2015

vishvananda mentioned this pull request Jul 8, 2015

Cannot run container in existing user namespace opencontainers/runc#101

Closed

dqminh added 6 commits July 9, 2015 11:39

use custom id as cgroups name for nsinit

510aae6

If nsinit specifies the `--id` option, use that id as the cgroup's name instead of the directory name. This is helpful when we want to start multiple containers from the same directory, for example.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>

open all ns descriptors before setns

67e420d

Doing this helps with the ordering of NEWNS, since we don't have to worry about setting the mount namespace too early and the old /proc becoming inaccessible.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>

add NEWUSER to supported namespaces to be joined

2290160

An execin process can join the user namespace of the init process.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>

join userns only for execin process

7860be1

An execin process can join the user namespace of the parent container only when the init process has joined a user namespace explicitly.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>

dqminh force-pushed the pidns branch from 7038929 to 7860be1 Compare July 9, 2015 11:39

dqminh mentioned this pull request Jul 9, 2015

A container can join namespaces of another container opencontainers/runc#105

Closed

LK4D4 closed this Jul 14, 2015
