-
Notifications
You must be signed in to change notification settings - Fork 316
A container can join pid namespace of another container. #609
Conversation
@@ -60,6 +60,62 @@ static int clone_parent(jmp_buf * env) | |||
return child; | |||
} | |||
|
|||
void nsinit_pid() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we combine this logic into nsexec() itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tried that but the end result wasnt as clean. Since spawning an init process with a known pid namespace and exec a new process inside another container are two very different operations , i think that having two functions describes the intent better.
The same hack is required for mount and user namespaces, because a multithreaded process can't change them. What do you think about creating a directory with namespaces in container's root directory. We can bind-mount all required namespaces there before starting the container and then bind-mount container's namespace files to use them for executing processes inside CT. In this case we can use the same code of nsexec.c for starting the first and other processes. |
I agree with @avagin that bind-mounting namespace can be right way to do such stuff, because |
@LK4D4 There's a few related issues in docker moby/moby#13453 and moby/moby#10163. Overall i think it's better to do this properly (if this is not the best way) since 1.7 is already in RC now so we have time :) @avagin I dont understand the part about bind-mounting namespaces. Do you have an example for that ? My current understanding of that is that we have a special directory that contains namespace descriptors that we can detect when starting the container and use them to setns with some code from nsexec.c ? |
@dqminh The suggestion is to bind mount /proc/self/ns/* files into some directory for each container (Just like how ip netns add works. It bind mounts net namespace fds under /var/run/netns). After that is done, new processes could use those paths to join the container. This could also be used for joining subset of other container's namespaces. @avagin could correct me if I am wrong. |
Also, you can destroy original namespace, so bind-mounting is like "copy" namespace. |
Right, as long the mounted namespace fds exist, the original container can just die without affecting those that joined those namespace fds. |
Not sure how that will behave will pid namespaces though since pid 1 going down brings down the other processes. |
Ah i see. But that would not change the current implementation right, since all libcontainer needs to know is still the paths to the namespace descriptors.
i think pid 1 going down will kill the rest of the processes, and you cant join using the open file descriptor anymore (http://man7.org/linux/man-pages/man7/pid_namespaces.7.html). |
@dqminh I don't understand what you mean. Could you elaborate? |
@dqminh for main process we use Cloneflags for creating namespaces. New idea is to create namespaces separately and then pass them even for first process. |
@avagin What point of bind-mounting then? Just to collect them in one directory? |
Right, we cant really seperate the creation of namespaces and processes at least for pid namespace. It always requires an anchor process. The current way joining namespaces works for init process is that container has This works for net ( the only current usecase afaik ), but doesnt work for pid ( and probably mount/userns as @avagin pointed out ). This patch adds another intermediate clone to put the init process in correct pid namespaces. My understanding is that even if we bind-mount namespaces, we will still pass the mount destination as value to
Hmm yes, that is another way to do it so we can reuse that blocks of code between init and execin. It also helps when we want to join another namespaces besides pid. Probably joinExistingNamespaces can be removed in that case. |
I'm not sure that we need to do this. It's just an idea how it can be fixed.
Yes, we can. |
3d824b5
to
629eca4
Compare
fd = openat(tfd, namespaces[i], O_RDONLY); | ||
char *ns, *saveptr; | ||
while ((ns = strtok_r(nspaths, ",", &saveptr))) { | ||
fd = open(ns, O_RDONLY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you enter into a mount namespace, you can't be sure that you will able to enter in other namespaces, because nspaths may be inaccessible from this mount namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, I think the order of nspaths has to be set from the caller ( which we are not doing now for init, but we do for exec ). wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that a caller can do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. The order of namespaces is now enforced in orderNamespacePaths
. I also tried to add NEWUSER there but some tests failed. Added a note on NEWUSER and will try again in another patch only for NEWUSER.
@avagin I pushed a new patch that unifies the ordering of namespaces in init and exec, and a few more changes. PTAL. Also i notice that joining mount namespace for init process will not work right now because it's still trying to setupRootfs. I can write a new PR to fix that ( if we want a init process to join mnt namespace of another container ? ) |
if len(nsMaps) > 0 { | ||
nsPaths, err := orderNamespacePaths(nsMaps) | ||
if err != nil { | ||
return nil, newSystemError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
orderNamepsapcePaths() returns SystemError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
configs.NEWNS, | ||
} | ||
// For now, only join user namespace if this is an exec in process and the | ||
// container supports user namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to have explanation about this restriction.
LGTM (except a few minor comments). |
@dqminh Can you backport this PR to runc? Thanks. |
@hqhq Thanks for reminding me !. Things are a bit hectic on my side this week, but i should be able get it done by the weekend (or sooner if possible) |
An init process can join other namespaces (pidns, ipc etc.). This leverages C code defined in nsenter package to spawn a process with correct namespaces and clone if necessary. When a shared-pidns container exits, the original container will keep running. This removes joinExistingNamespaces in Go layer because all namespaces will be setns in C layer. This also changes setns process to requires a list of namespaces to be joined rather than requires only the pid of init process and deduct the namespace from it. Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
if nsinit specifies `--id` option, use that id as cgroups' name instead of the directory name. It's helpful where we want to start multiple containers from the same directory for example. Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
- check if namespace is supported by the current kernel - check that the pathname doesnt contain comma `,`. This is the character we used to join paths together and pass it over as a single env variable so we can split it out in C layer. Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
Doing this helps with ordering of NEWNS, since we dont have to worry about set mount namespace too early leadings to old /proc becomes inaccessible. Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
execin process can join user namespace of the init process. Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
execin process can join user namespace of the parent container only when the init process has joined user namespace explicitly. Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
ported to runc |
Fix #604
This supports setting
configs.NEWPID
to an existing PID namespace for a container's init process. It leverages the same C code infrastructure forexecin
to clone a new init process with the wanted PID namespace.Edit: Because of some sharing in code between init and execin, execin also supports joining userns of the init process now. Init process cant join mnt/userns of another init process yet.