
Carry #781: Add support for cgroup namespace #1184

Closed
wants to merge 2 commits

Conversation

@yummypeng commented Nov 9, 2016

Carrying #781. Thanks for @cyphar's work. 😄

Cgroup namespace can be configured in config.json like other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],
```

Note that if you want to run a container that shares a cgroup ns with
another container, it's strongly recommended that you set a
proper CgroupsPath for both containers (the second container's cgroup
path must be a subdirectory of the first one's); otherwise there might be
unexpected results.

Signed-off-by: Yuanhong Peng [email protected]

@cyphar (Member) left a comment:

The reason why I didn't unshare(CLONE_NEWCGROUP) in nsexec.c is that unshare() makes all of your current cgroups your new cgroup root. If I unshare the child process into a new cgroup ns before the parent process puts it into the proper cgroup subdirectories (in p.manager.Apply(p.pid())), then we would see partial cgroup paths (not / as expected) inside the container.

The correct solution for this is to make sure that p.manager.Apply executes before we hit the unshare code. That's not really nice to deal with, but it can be done with a simple pipe that the nsexec.c code blocks on until we've set up cgroups.

But yeah, the solution is not to do Unshare in Go code. That's just not going to end well.

Aside from that the code looks okay (I'll review it again after you make the above fix). Thanks for carrying this for me 😸.


```go
// Use Unshare() to enter a new cgroup namespace before InitializeMountNamespace()
if l.config.Config.Namespaces.Contains(configs.NEWCGROUP) && l.config.Config.Namespaces.PathOf(configs.NEWCGROUP) == "" {
	if err := syscall.Unshare(configs.Syscall_CLONE_NEWCGROUP); err != nil {
```
@cyphar (Member):
This is seriously not safe. Do not ever call Unshare from inside Go code. The Unshare interfaces do not work in a multithreaded process (and Go programs are multithreaded). This is why nsexec.c exists in the first place.

@yummypeng (Author):

OK, I also considered using a pipe to block the nsexec.c code, but got stuck somewhere along the way, so I just sent this implementation for discussion. I'll try to work it out, though.

@cyphar (Member):

I'd be happy to give you a hand if you have any questions, I didn't mean to scare you off. 😸

@yummypeng (Author):

Haha, yeah, I know. I'm going to play around with it myself first. I would turn to you for help without hesitation if I have any questions. Thanks in advance 😜

```go
@@ -190,6 +195,17 @@ func CreateLibcontainerConfig(opts *CreateOpts) (*configs.Config, error) {
	if config.Namespaces.Contains(t) {
		return nil, fmt.Errorf("malformed spec file: duplicated ns %q", ns)
	}
	// if this container is going to have shared cgroup namespace with host,
	// we just skip the following .Add()
```
@cyphar (Member):

Why?

@yummypeng (Author):

I found that docker just uses delNamespace() to delete a namespace shared with the host. If a user runs a container without adding the cgroup ns in config.json, the container shares the cgroup ns with the host and its cgroups are bind-mounted. I think that's the same as the scenario where we set up the cgroup ns in config.json but set the cgroup path to /proc/1/ns/cgroup. So I chose not to add the cgroup ns to config.Namespaces; then its cgroups will be bind-mounted without adding more code.

@cyphar (Member):

My point is that you're adding more code to special-case something, and I don't understand why it needs special-casing. If you do a setns(...) for a namespace that you've already joined, it doesn't matter anyway.

But I also don't understand why you're comparing the link to /proc/1/ns/cgroup -- how do you know that runC is inside the same cgroup namespace as the init of its current PID namespace (that doesn't need to be true)? From what I can see, you're going to skip joining cgroups people explicitly asked you to join.

@yummypeng (Author):

Originally, I added this "skipping" logic because I thought a container's cgroups should be bind-mounted if it shares the cgroup ns with the host, but now it seems that I went down the wrong path. 😥

So, you mean that it makes no difference whether a container shares the cgroup ns with the host or not? And we also need to mount new cgroups (not bind-mount them) in a container that shares the cgroup ns with the host?

@yummypeng (Author):

@cyphar updated! 😉

@yummypeng (Author):

Oops, it seems that our janky server doesn't support cgroup namespaces yet.

@cyphar (Member) commented Nov 15, 2016:

@yummypeng Yeah, I would make the tests conditional (do t.Skip() if /proc/self/ns/cgroup doesn't exist). We do a similar thing for the PIDs cgroup controller.

@cyphar (Member) left a comment:

See my comments. My main gripes are just stylistic, but there is one thing about the whole "skipping" logic that I don't like.

```go
@@ -8,13 +8,17 @@ func (n *Namespace) Syscall() int {
	return namespaceInfo[n.Type]
}

// This is not yet in the Go stdlib.
const Syscall_CLONE_NEWCGROUP = (1 << 25)
```
@cyphar (Member):

This shouldn't be exported.

```go
defer remove(rootfs)

l, err := os.Readlink("/proc/1/ns/cgroup")
ok(t, err)
```
@cyphar (Member):

As I mentioned above, this should be a t.Skip if os.IsNotExist(err).

```c
@@ -38,6 +38,9 @@ enum sync_t {
	SYNC_ERR = 0xFF, /* Fatal error, no turning back. The error code follows. */
};

/* Synchronisation value for cgroup namespace setup */
#define SYNC_PRIVATE_CGNS 0x80
```
@cyphar (Member):

Since this synchronisation is different to the previous ones, we should consider naming it something distinctive. I actually didn't catch that this was a different type of synchronisation to the previous one on my first reading of this.

```c
@@ -46,6 +49,9 @@ enum sync_t {
/* JSON buffer. */
#define JSON_MAX 4096

/* buffer for synchronisation value */
#define SYNC_BUF_LEN 10
```
@cyphar (Member):

Since SYNC_PRIVATE_CGNS is a single byte, can you just follow the example of all of the other synchronisation code and not use a buffer (use a single-character local variable)?

```c
len = read(pipenum, buf, SYNC_BUF_LEN);
if (len < 0)
	bail("read synchronisation value failed");
value = strtol(buf, &endptr, 10);
```
@cyphar (Member):

Yeah, this shouldn't be necessary if you just write a single byte. Also, why is value not local to this block?

```go
@@ -19,6 +19,9 @@ import (
	"github.com/opencontainers/runc/libcontainer/utils"
)

// Synchronisation value for cgroup namespace setup.
const privateCgroupns int = (1 << 7)
```
@cyphar (Member):

I'm a bit mixed on this. On the one hand, I think this should be implemented with an #include using cgo, because I don't really want to duplicate this (especially since it's not really possible to grep for both usages of this constant). But then, cgo is bad to use for something like that. 😉

If we have to leave it here, then please name it something similar to SYNC_PRIVATE_CGNS (after you rename that as per my comments above). Then at least it will be possible to grep for this constant.

```c
if (*endptr != '\0')
	bail("unable to parse synchronisation value");
if (value == SYNC_PRIVATE_CGNS) {
	if (unshare(CLONE_NEWCGROUP) < 0)
```
@cyphar (Member):

I think that RHEL might have issues with this (WDYT @mrunalp?). Also, this is a bit ugly, but I'm guessing you can't combine this with the rest of the unshare code because we're currently sending the PID of JUMP_INIT, not JUMP_CHILD, to the .Apply code.

@yummypeng (Author):

Yeah, as far as I can see, unshare(CLONE_NEWCGROUP) must be done in the final child process. Or do you have any better suggestions? 😁

@cyphar (Member):

Well, you /could/ do it by passing two different PIDs to the parent. But that's just a bad idea.

This is fine, don't worry about my gripes.

```go
@@ -257,6 +260,13 @@ func (p *initProcess) start() error {
	if err := p.manager.Apply(p.pid()); err != nil {
		return newSystemErrorWithCause(err, "applying cgroup configuration for process")
	}
	// Now it's time to set up the cgroup namespace
	if p.config.Config.Namespaces.Contains(configs.NEWCGROUP) && p.config.Config.Namespaces.PathOf(configs.NEWCGROUP) == "" {
		if _, err := p.parentPipe.WriteString(strconv.Itoa(privateCgroupns)); err != nil {
```
@cyphar (Member):

As I mentioned this should be a single byte write.


@cyphar (Member) commented Nov 15, 2016:

```go
@@ -190,6 +195,17 @@ func CreateLibcontainerConfig(opts *CreateOpts) (*configs.Config, error) {
	if config.Namespaces.Contains(t) {
		return nil, fmt.Errorf("malformed spec file: duplicated ns %q", ns)
	}
	// if this container is going to have shared cgroup namespace with host,
	// we just skip the following .Add()
```

Originally, I added this "skipping" logic because I thought a
container's cgroups should be bind-mounted if it shares the cgroup ns with
the host, but now it seems that I went down the wrong path. 😥

So, you mean that it makes no difference whether a container shares the
cgroup ns with the host or not? And we also need to mount new cgroups (not
bind-mount them) in a container that shares the cgroup ns with the host?

When you do a setns(2) to a cgroup namespace, you're going to share
their namespace (meaning that the scoping and mounting rules will be
shared with you). In order to setns(2) to a namespace you need to have
certain privileges (ns_capable(CAP_SYS_ADMIN) in the user namespace of
the cgroup namespace) -- which is the same privilege requirement as
mounting the cgroupfs filesystem.

So if setns(2) doesn't fail then mount -t cgroup shouldn't fail either
AFAICS. Thus this logic isn't necessary.

So the only time we should be bind-mounting is when we haven't created a
cgroupns and we aren't joining one, which is already being
handled by the other code you wrote (.Contains is all you need).

I could actually check the kernel code if it made you feel better (the
above is from memory the last time I checked how the code worked).

P.S.: GitHub is horrible, I can't load your comment in the web-view so I
can't comment on it that way (and replying to notification emails in a
review doesn't add the email to the right discussion -- EUGH!).

Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

@yummypeng (Author) commented Nov 16, 2016:

So the only time we should be bindmounting is when we haven't created a
cgroupns and we aren't joining a cgroupns. Which is already being
handled by the other code you wrote (.Contains is all you need).

If I run a container with the cgroup ns path set to /proc/1/ns/cgroup (sharing the cgroup ns with the host) and mount new cgroups in it, then its /proc/self/cgroup file looks like this (as expected):

```
root@ubuntu:~/tools# runc run test
root@runc:/# cat /proc/self/cgroup
13:name=systemd:/user/0.user/9.session/test
12:pids:/test
11:hugetlb:/user/0.user/9.session/test
10:net_prio:/user/0.user/9.session/test
9:perf_event:/user/0.user/9.session/test
8:net_cls:/user/0.user/9.session/test
7:freezer:/user/0.user/9.session/test
6:devices:/user/0.user/9.session/test
5:memory:/user/0.user/9.session/test
4:blkio:/user/0.user/9.session/test
3:cpuacct:/user/0.user/9.session/test
2:cpu:/user/0.user/9.session/test
1:cpuset:/test
```

And since its cgroup scoping is shared with the host, its /sys/fs/cgroup directory is the same as the host's too.
For example, we can see that our /sys/fs/cgroup/pids directory contains these files and subdirectories:

```
root@runc:~# ls /sys/fs/cgroup/pids/
cgroup.clone_children  cgroup.procs  cgroup.sane_behavior  docker  notify_on_release  release_agent  tasks  test
```

Here, docker is a subdirectory owned by the host, but processes inside this container can see it as well. Could this cause any security problems?

@cyphar (Member) commented Nov 16, 2016:

@yummypeng

I feel like if someone intentionally decides to setns to the host's cgroup namespace then they sort of deserve what they get. Though the other @opencontainers/runc-maintainers might have a different opinion.

But even if we decide that we want to always make /sys/fs/cgroup the cgroup the container is in, then there's a bunch of issues with setns(...) that aren't really easy to solve -- it's not just an issue with joining PID 1's cgroup namespace. Your cgroup mount could be completely different from your actual cgroup association. And we can't move the process because cgroupPath might be set to something outside of the cgroup root for the namespace we're joining.

I feel like the cgroup namespace setns(...) semantics probably require a bit more thought overall. The decision has to be consistent between all of the different cases you can have when you're joining a namespace.

@yummypeng (Author):

@cyphar Once again updated! I just removed the whole "skipping" logic this time.

@hqhq (Contributor) commented Feb 10, 2017:

@yummypeng If it's ready for review, you can remove [WIP] prefix.

@hqhq hqhq mentioned this pull request Feb 10, 2017
1 task
@yummypeng yummypeng changed the title [WIP] Carry #781: Add support for cgroup namespace Carry #781: Add support for cgroup namespace Feb 10, 2017
@yummypeng (Author):

Sorry for the late update.

ping @opencontainers/runc-maintainers Any comments? Thank you : )

cyphar and others added 2 commits February 22, 2017 11:03
This is a very simple implementation because it doesn't require any
configuration unlike the other namespaces, and in its current state it
only masks paths.

This feature is available in Linux 4.6+ and is enabled by default for
kernels compiled with CONFIG_CGROUP=y.

Signed-off-by: Aleksa Sarai <[email protected]>
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container that shares a cgroup ns with
another container, it's strongly recommended that you set a
proper `CgroupsPath` for both containers (the second container's cgroup
path must be a subdirectory of the first one's); otherwise there might be
unexpected results.

Signed-off-by: Yuanhong Peng <[email protected]>
@mrunalp mrunalp added this to the 1.0.0 milestone Nov 1, 2017
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 22, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
Add support in port library for detecting if we are running in
docker container or not.
This is required for cgroup related APIs in port library,
such as omrsysinfo_cgroup_get_memlimit, to work properly.
This is because when running in a docker container,
the cgroups listed in /proc/self/cgroup do not match the cgroups
of the process in the container. /proc/self/cgroup actually
shows the cgroups as seen from the host.

Eg if a process is running in a docker container,
its /proc/PID/cgroup on docker will show something like:

$ cat /proc/8364/cgroup
11:cpuset:/docker/<container id>
10:perf_event:/docker/<container id>
9:memory:/dockera/<container id>
...

but the actual cgroup of the process would be "/".

Once we figure out that we are running in docker container,
then cgroup APIs in port library would use "/" as cgroup name
for reading resource limits.

Note that as and when docker containers start supporting
cgroup namespace, then above mentioned mismatch would also be fixed.
Following issues and PRs are related to adding support for
cgroup namespace in docker container:
opencontainers/runc#1184
opencontainers/runtime-spec#62
opencontainers/runtime-spec#66

There is no direct mechanism to detect if we are running in
docker container.
Most reliable mechanism that I have come across is
by checking the cgroup for PID 1 (by reading /proc/1/cgroup file)
for all subsystems. If all cgroups are "/", then it indicates the
process is running on host system.
Else we conclude it to be running in a docker container.
Note that this mechanism will break when docker containers start
supporting cgroup namespace.

- Also added couple of APIs to return container type and name.

Signed-off-by: Ashutosh Mehra <[email protected]>
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Nov 23, 2017
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Dec 7, 2017
@cyphar
Member

cyphar commented Jan 28, 2018

Yikes, this has been around for a while. I can re-carry this PR for @yummypeng.

@@ -8,13 +8,17 @@ func (n *Namespace) Syscall() int {
 	return namespaceInfo[n.Type]
 }
 
+// This is not yet in the Go stdlib.
+const syscall_CLONE_NEWCGROUP = (1 << 25)

This has been added to the unix library.

@yummypeng
Author

Thanks @cyphar 😜

@cyphar cyphar self-assigned this Feb 4, 2018
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Feb 19, 2018
ashu-mehra pushed a commit to ashu-mehra/omr that referenced this pull request Feb 20, 2018
@crosbymichael
Member

I'm rebasing and carrying this :)

@cyphar
Member

cyphar commented Oct 17, 2018

Heh, thanks. I started rebasing it and it was pretty ugly from memory.
