
Carry #781: Add support for cgroup namespace #1184

Closed
wants to merge 2 commits

Conversation

@yummypeng commented Nov 9, 2016

Carrying #781. Thanks for @cyphar's work. 😄

Cgroup namespace can be configured in config.json like other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],
```

Note that if you want to run a container that shares a cgroup ns with
another container, it's strongly recommended that you set a
proper CgroupsPath for both containers (the second container's cgroup
path must be a subdirectory of the first one's); otherwise there might be
unexpected results.

Signed-off-by: Yuanhong Peng [email protected]

@cyphar (Member) left a comment:

The reason why I didn't unshare(CLONE_NEWCGROUP) in nsexec.c is that unshare() makes all of your current cgroups your new cgroup root. If I unshare the child process into a new cgroup ns before the parent process puts it into the proper cgroup subdirectories (in p.manager.Apply(p.pid())), then we would see partial cgroup paths (not / as expected) inside the container.

The correct solution for this is to make sure that p.manager.Apply executes before we hit the unshare code. That's not really nice to deal with, but it can be done with a simple pipe that the nsexec.c code blocks on until we've set up cgroups.

But yeah, the solution is not to do Unshare in Go code. That's just not going to end well.

Aside from that the code looks okay (I'll review it again after you make the above fix). Thanks for carrying this for me 😸.


```go
// Use Unshare() to enter a new cgroup namespace before InitializeMountNamespace()
if l.config.Config.Namespaces.Contains(configs.NEWCGROUP) && l.config.Config.Namespaces.PathOf(configs.NEWCGROUP) == "" {
	if err := syscall.Unshare(configs.Syscall_CLONE_NEWCGROUP); err != nil {
```
@cyphar (Member):
This is seriously not safe. Do not ever call Unshare from inside Go code. The Unshare interfaces do not work in a multithreaded process (and Go programs are multithreaded). This is why nsexec.c exists in the first place.

@yummypeng (Author):

OK, I also considered using a pipe to block the nsexec.c code, but got stuck somewhere along the way, so I just sent this implementation for discussion. I'll try to work it out, though.

@cyphar (Member):

I'd be happy to give you a hand if you have any questions, I didn't mean to scare you off. 😸

@yummypeng (Author):

Haha, yeah, I know. I'm going to play around with it myself first. I would turn to you for help without hesitation if I have any questions. Thanks in advance 😜

```go
@@ -190,6 +195,17 @@ func CreateLibcontainerConfig(opts *CreateOpts) (*configs.Config, error) {
	if config.Namespaces.Contains(t) {
		return nil, fmt.Errorf("malformed spec file: duplicated ns %q", ns)
	}
	// if this container is going to have shared cgroup namespace with host,
	// we just skip the following .Add()
```
@cyphar (Member):

Why?

@yummypeng (Author):

I found that docker just uses delNamespace() to delete a namespace shared with the host. If a user runs a container without adding the cgroup ns in config.json, the container shares the cgroup ns with the host and its cgroups are bind-mounted. I think that's the same as the scenario where we set up the cgroup ns in config.json but set the cgroup path to /proc/1/ns/cgroup. So I chose not to add the cgroup ns to config.Namespaces; then its cgroups will be bind-mounted without adding more code.

@cyphar (Member):

My point is that you're adding more code to special-case something, and I don't understand why it needs special-casing. If you do a setns(...) for a namespace that you've already joined, it doesn't matter anyway.

But I also don't understand why you're comparing the link to /proc/1/ns/cgroup -- how do you know that runC is inside the same cgroup namespace as the init of its current PID namespace (that doesn't need to be true)? From what I can see, you're going to skip joining cgroups people explicitly asked you to join.

@yummypeng (Author):

Originally, I added this "skipping" logic because I thought a container's cgroups should be bind-mounted if it shares the cgroup ns with the host, but now it seems that I went down the wrong path. 😥

So, you mean that it makes no difference whether a container shares the cgroup ns with the host or not? And we also need to mount new cgroups (not bind-mount them) in a container that shares the cgroup ns with the host?

@yummypeng (Author):

@cyphar updated! 😉

@yummypeng (Author):

Oops, it seems that our janky server doesn't support cgroup namespaces yet.

@cyphar (Member) commented Nov 15, 2016:

@yummypeng Yeah, I would make the tests conditional (do t.Skip() if /proc/self/ns/cgroup doesn't exist). We do a similar thing for the PIDs cgroup controller.

@cyphar (Member) left a comment:

See my comments. My main gripes are just stylistic, but there is one thing about the whole "skipping" logic that I don't like.

```go
@@ -8,13 +8,17 @@ func (n *Namespace) Syscall() int {
	return namespaceInfo[n.Type]
}

// This is not yet in the Go stdlib.
const Syscall_CLONE_NEWCGROUP = (1 << 25)
```
@cyphar (Member):

This shouldn't be exported.

```go
defer remove(rootfs)

l, err := os.Readlink("/proc/1/ns/cgroup")
ok(t, err)
```
@cyphar (Member):

As I mentioned above, this should be a t.Skip if os.IsNotExist(err).

```c
@@ -38,6 +38,9 @@ enum sync_t {
	SYNC_ERR = 0xFF, /* Fatal error, no turning back. The error code follows. */
};

/* Synchronisation value for cgroup namespace setup */
#define SYNC_PRIVATE_CGNS 0x80
```
@cyphar (Member):

Since this synchronisation is different to the previous ones, we should consider naming it something distinctive. I actually didn't catch that this was a different type of synchronisation to the previous one on my first reading of this.

```c
@@ -46,6 +49,9 @@ enum sync_t {
/* JSON buffer. */
#define JSON_MAX 4096

/* buffer for synchronisation value */
#define SYNC_BUF_LEN 10
```
@cyphar (Member):

Since SYNC_PRIVATE_CGNS is a single byte, can you just follow the example of all of the other synchronisation code and not use a buffer (use a single-character local variable)?

```c
len = read(pipenum, buf, SYNC_BUF_LEN);
if (len < 0)
	bail("read synchronisation value failed");
value = strtol(buf, &endptr, 10);
```
@cyphar (Member):

Yeah, this shouldn't be necessary if you just write a single byte. Also, why is value not local to this block?

```go
@@ -19,6 +19,9 @@ import (
	"github.com/opencontainers/runc/libcontainer/utils"
)

// Synchronisation value for cgroup namespace setup.
const privateCgroupns int = (1 << 7)
```
@cyphar (Member):

I'm a bit mixed on this. On the one hand, I think this should be implemented with an #include using cgo, because I don't really want to duplicate this (especially since it's not really possible to grep for both usages of this constant). But then, cgo is bad to use for something like that. 😉

If we have to leave it here, then please name it something similar to SYNC_PRIVATE_CGNS (after you rename that as per my comments above). Then at least it will be possible to grep for this constant.

```c
if (*endptr != '\0')
	bail("unable to parse synchronisation value");
if (value == SYNC_PRIVATE_CGNS) {
	if (unshare(CLONE_NEWCGROUP) < 0)
```
@cyphar (Member):

I think that RHEL might have issues with this (WDYT @mrunalp?). Also, this is a bit ugly, but I'm guessing you can't combine this with the rest of the unshare code because we're currently sending the PID of JUMP_INIT, not JUMP_CHILD, to the .Apply code.

@yummypeng (Author):

Yeah, as far as I can see, unshare(CLONE_NEWCGROUP) must be done in the final child process. Or do you have any better suggestions? 😁

@cyphar (Member):

Well, you /could/ do it by passing two different PIDs to the parent. But that's just a bad idea.

This is fine, don't worry about my gripes.

```go
@@ -257,6 +260,13 @@ func (p *initProcess) start() error {
	if err := p.manager.Apply(p.pid()); err != nil {
		return newSystemErrorWithCause(err, "applying cgroup configuration for process")
	}
	// Now it's time to set up the cgroup namespace
	if p.config.Config.Namespaces.Contains(configs.NEWCGROUP) && p.config.Config.Namespaces.PathOf(configs.NEWCGROUP) == "" {
		if _, err := p.parentPipe.WriteString(strconv.Itoa(privateCgroupns)); err != nil {
```
@cyphar (Member):

As I mentioned this should be a single byte write.


@cyphar (Member) commented Nov 15, 2016:

```go
@@ -190,6 +195,17 @@ func CreateLibcontainerConfig(opts *CreateOpts) (*configs.Config, error) {
	if config.Namespaces.Contains(t) {
		return nil, fmt.Errorf("malformed spec file: duplicated ns %q", ns)
	}
	// if this container is going to have shared cgroup namespace with host,
	// we just skip the following .Add()
```

Originally, I added this "skipping" logic because I thought a
container's cgroups should be bind-mounted if it shares the cgroup ns with
the host, but now it seems that I went down the wrong path. 😥

So, you mean that it makes no difference whether a container shares the
cgroup ns with the host or not? And we also need to mount new cgroups (not
bind-mount them) in a container that shares the cgroup ns with the host?

When you do a setns(2) to a cgroup namespace, you're going to share
their namespace (meaning that the scoping and mounting rules will be
shared with you). In order to setns(2) to a namespace you need to have
certain privileges (ns_capable(CAP_SYS_ADMIN) in the user namespace of
the cgroup namespace) -- which is the same privilege requirement as
mounting the cgroupfs filesystem.

So if setns(2) doesn't fail then mount -t cgroup shouldn't fail either
AFAICS. Thus this logic isn't necessary.

So the only time we should be bind-mounting is when we haven't created a
cgroupns and we aren't joining one, which is already being
handled by the other code you wrote (.Contains is all you need).

I could actually check the kernel code if it made you feel better (the
above is from memory the last time I checked how the code worked).

P.S.: GitHub is horrible, I can't load your comment in the web-view so I
can't comment on it that way (and replying to notification emails in a
review doesn't add the email to the right discussion -- EUGH!).

Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

@yummypeng (Author) commented Nov 16, 2016:

So the only time we should be bindmounting is when we haven't created a
cgroupns and we aren't joining a cgroupns. Which is already being
handled by the other code you wrote (.Contains is all you need).

If I run a container with the cgroup ns path set to /proc/1/ns/cgroup (sharing the cgroup ns with the host) and mount new cgroups in it, then its /proc/self/cgroup file looks like this (as expected):

```
root@ubuntu:~/tools# runc run test
root@runc:/# cat /proc/self/cgroup
13:name=systemd:/user/0.user/9.session/test
12:pids:/test
11:hugetlb:/user/0.user/9.session/test
10:net_prio:/user/0.user/9.session/test
9:perf_event:/user/0.user/9.session/test
8:net_cls:/user/0.user/9.session/test
7:freezer:/user/0.user/9.session/test
6:devices:/user/0.user/9.session/test
5:memory:/user/0.user/9.session/test
4:blkio:/user/0.user/9.session/test
3:cpuacct:/user/0.user/9.session/test
2:cpu:/user/0.user/9.session/test
1:cpuset:/test
```

And since its cgroup scoping is shared with the host, its /sys/fs/cgroup directory is the same as the host's too.
For example, we can see that our /sys/fs/cgroup/pids directory contains these files and subdirectories:

```
root@runc:~# ls /sys/fs/cgroup/pids/
cgroup.clone_children  cgroup.procs  cgroup.sane_behavior  docker  notify_on_release  release_agent  tasks  test
```

Here, docker is a subdirectory owned by the host, but processes inside this container can see it as well. Could this cause any security problems?

@cyphar (Member) commented Nov 16, 2016:

@yummypeng

I feel like if someone intentionally decides to setns to the host's cgroup namespace then they sort of deserve what they get. Though the other @opencontainers/runc-maintainers might have a different opinion.

But even if we decide that we want to always make /sys/fs/cgroup the cgroup the container is in, then there's a bunch of issues with setns(...) that aren't really easy to solve -- it's not just an issue with joining PID 1's cgroup namespace. Your cgroup mount could be completely different from your actual cgroup association. And we can't move the process because cgroupPath might be set to something outside of the cgroup root for the namespace we're joining.

I feel like the cgroup namespace setns(...) semantics probably require a bit more thought overall. The decision has to be consistent between all of the different cases you can have when you're joining a namespace.

@yummypeng (Author):

@cyphar Once again updated! I just removed the whole "skipping" logic this time.

@hqhq (Contributor) commented Feb 10, 2017:

@yummypeng If it's ready for review, you can remove [WIP] prefix.

@hqhq hqhq mentioned this pull request Feb 10, 2017
1 task
@yummypeng yummypeng changed the title [WIP] Carry #781: Add support for cgroup namespace Carry #781: Add support for cgroup namespace Feb 10, 2017
@yummypeng (Author):

Sorry for the late update.

ping @opencontainers/runc-maintainers Any comments? Thank you : )

cyphar and others added 2 commits February 22, 2017 11:03
This is a very simple implementation because it doesn't require any
configuration unlike the other namespaces, and in its current state it
only masks paths.

This feature is available in Linux 4.6+ and is enabled by default for
kernels compiled with CONFIG_CGROUP=y.

Signed-off-by: Aleksa Sarai <[email protected]>
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container that shares a cgroup ns with
another container, it's strongly recommended that you set a
proper `CgroupsPath` for both containers (the second container's cgroup
path must be a subdirectory of the first one's); otherwise there might be
unexpected results.

Signed-off-by: Yuanhong Peng <[email protected]>
@mrunalp mrunalp added this to the 1.0.0 milestone Nov 1, 2017
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 22, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Dec 7, 2017
@cyphar
Member

cyphar commented Jan 28, 2018

Yikes, this has been around for a while. I can re-carry this PR for @yummypeng.

@@ -8,13 +8,17 @@ func (n *Namespace) Syscall() int {
 	return namespaceInfo[n.Type]
 }
 
+// This is not yet in the Go stdlib.
+const syscall_CLONE_NEWCGROUP = (1 << 25)

This has been added to the unix library.

@yummypeng
Author

Thanks @cyphar 😜

@cyphar cyphar self-assigned this Feb 4, 2018
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Feb 19, 2018
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Feb 20, 2018
@crosbymichael
Member

I'm rebasing and carrying this :)

@cyphar
Member

cyphar commented Oct 17, 2018

Heh, thanks. I started rebasing it and it was pretty ugly from memory.
