-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support cgroups with limits as rootless #1540
Support cgroups with limits as rootless #1540
Conversation
Makefile
Outdated
rootlessintegration: runcimage | ||
docker run -e TESTFLAGS -t --privileged --rm -v $(CURDIR):/go/src/$(PROJECT) --cap-drop=ALL -u rootless $(RUNC_IMAGE) make localintegration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have you removed --cap-drop=all
? That was necessary to ensure that rootless didn't depend on extra capabilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems --cap-drop=ALL
doesn't do anything useful when also using --privileged
. With or without it, once we switch to the rootless user our caps are:
CapInh: 0000003fffffffff
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
So, as the tests are right now in master, and in this PR, we don't have any effective caps. If we want to guard against picking up caps somehow from files, we could perhaps figure out a way to drop from the bounding/inheritable set as well but I'm not sure that really makes sense within the test context.
Makefile
Outdated
@@ -96,10 +96,14 @@ integration: runcimage | |||
localintegration: all | |||
bats -t tests/integration${TESTFLAGS} | |||
|
|||
rootlessintegrationstub: | |||
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/rootless_cgroup | |||
chown rootless:rootless -R /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/rootless_cgroup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is crucial that when we add new tests for opportunistic functionality in rootless that we make sure that the base functionality works. So if we want to add tests for this (and also newuidmap
in the future) we need to add them as effectively re-runs of the rootless tests under different conditions (with some extra tests enabled).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure we understand this, could you elaborate please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we have the require rootless_cgroup
class of tests, and the rest of the tests that don't have that requirement, I'm worried about not explicitly testing the case where this setup code is not run (in other words, the case were runc
has no permissions to touch any cgroups). I appreciate that code is still tested, but I just want to be cautious about testing "opportunistic" functionality like rootless cgroup handling or (in the future) newuidmap
handling.
I think the best solution to this would be to have something like the set of additional rootless features: (cgroups newuidmap etc)
and then test base + <the power set of the other features>
. So in the above case we would test base
, base+cgroups
, base+newuidmap
, base+cgroups+newuidmap
. But that's just an idea.
libcontainer/cgroups/fs/apply_raw.go
Outdated
return err | ||
} | ||
m.Paths[sys.Name()] = p |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC we moved this before Apply
to handle some race conditions, maybe @hqhq remembers this better than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if some subsystems Apply failed, we are not able to cleanup cgroup directories if we haven't recorded it in m.Paths
, see #1196
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@williammartin What if you keep the m.Paths[sys.Name()] = p
where it is, but do delete(m.Paths, sys.Name())
if there is a permission error that is ignore-able?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable, we can try it out
libcontainer/cgroups/fs/apply_raw.go
Outdated
@@ -198,6 +206,10 @@ func (m *Manager) Set(container *configs.Config) error { | |||
for _, sys := range subsystems { | |||
path := paths[sys.Name()] | |||
if err := sys.Set(path, container.Cgroups); err != nil { | |||
if path == "" { | |||
// cgroup never applied | |||
return fmt.Errorf("cannot set limits on the %s cgroup, as the container has not joined it. The cgroup controller may not be mounted or you may not have previously had permissions to create the cgroup subdirectory", sys.Name()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is way too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
cannot set limits on the %s cgroups due to container not joining it
or something like that seems enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, no problem.
@@ -83,8 +83,7 @@ func (p *setnsProcess) start() (err error) { | |||
if err = p.execSetns(); err != nil { | |||
return newSystemErrorWithCause(err, "executing setns process") | |||
} | |||
// We can't join cgroups if we're in a rootless container. | |||
if !p.config.Rootless && len(p.cgroupPaths) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is still correct in some cases, but I guess erroring out is acceptable if someone explicitly asked for an impossible cgroup configuration (now that we could in principle nest things). I would like to see a test for this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not clear under what circumstances the rootless check still makes sense? Can you give an example please?
p.cgroupPaths
is loaded from the state.json
which best we can tell has to be the same as the cgroup paths the init process is in, so barring people doing weird things, if you succeeded on create, you should succeed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not sure what I meant either, sorry about that. This change is fine, but I would like to see a runc exec
test with rootless cgroups in use, to make sure this works fine.
Just to be clear, I'm very impressed that there were only two changes necessary to make the I'll look over the limitations you've listed when I get a chance. |
Thanks so much guys for the PR ! |
libcontainer/cgroups/fs/apply_raw.go
Outdated
|
||
if err := sys.Apply(d); err != nil { | ||
if os.IsPermission(err) && m.Cgroups.Path == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing I just thought of is that we should only be permissive if we're running in rootless mode (though I imagine that such cases are unlikely, and passing config.Rootless
down here might be a bit ugly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We hadn't done this because we couldn't think of a way in which root could get a permission error. Maybe from a Mandatory Access Control?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think you're right. We can fix it later if we hit it some other time.
Thanks very much for the review. Hope we responded to all the comments. -- |
c99adee
to
107d26d
Compare
We've rebased and pushed a few changes, specifically:
Hopefully that addresses most of the comments. We can also squash the additional commits if that'd be preferred. |
First-pass this looks good, squashing into relevant commits would be preferred. I'm going to test this out over the next few days, and then LGTM if it all looks good. Thanks so much for working on this @williammartin, @jszroberto, and @teddyking! ❤️ |
@cyphar No problem, thanks for keeping on top of this and providing useful pointers! |
The issues with test extensibility and making sure opportunistic features are tested correctly are solved in #1529. I can help you rebase this PR's tests on the new |
b72c42b
to
7b24009
Compare
Awesome, sounds good to me. I've squashed the commits here into one so hopefully it shouldn't be too difficult to rebase once #1529 is merged in. |
b0c4933
to
484a915
Compare
We've rebased on master and taken a stab at updating this PR's tests to fit the new Here's a brief overview of the changes since last push:
|
484a915
to
c13bd2f
Compare
@teddyking Cool, thanks. I'm currently at OSS but I'll take a look at this next week. |
c13bd2f
to
7feeaab
Compare
@cyphar great! I pushed a fix for the failing travis build, but it's now failing on the go 1.8.x job only for a totally unrelated reason (error downloading busybox.tar.xz), so it probably just needs to be kicked off again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my comments. They are very minor, and I will do my final review when I get back from OSS (and I have a chance to play with it some more). Overall it looks pretty great, thanks!
tests/rootless.sh
Outdated
function enable_cgroup() { | ||
# Set up cgroups for use in rootless containers. | ||
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test | ||
chown rootless:rootless -R /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we did chown rootless:rootless -R /sys/fs/cgroup/*/runc-cgroups-integration-test
here.
tests/rootless.sh
Outdated
|
||
function disable_cgroup() { | ||
# Remove cgroups used in rootless containers. | ||
[ -d /sys/fs/cgroup/devices/runc-cgroups-integration-test ] && rmdir /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And rmdir /sys/fs/cgroup/*/runc-cgroups-integration-test
here.
} | ||
|
||
@test "runc create (rootless + no limits + cgrouppath + no permission) fails with permission error" { | ||
requires rootless |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will think about this a little more when I get back from OSS, but I am not sure that we'd want the "fails with permission error" tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't you want these? Test bloat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it ends up working in the future (which may happen if systemd decides to enable the cgroupv2 delegation code), then the test will break. On the other hand, we could just include these and deal with that later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can buy that. As long as getting a permission error isn't the desired behaviour right now (rather than say, some other error that might bubble up) and we don't want this test for regression, then I say kill the test and save test bloat/potential confusion later.
Would you mind pointing me to some info on the delegation code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a delegation section in the cgroupv2 documentation. Effectively you can set -o nsdelegate
on the root mount of cgroupv2
and then you get permissions to write to a sub-cgroup when you're in a cgroup namespace. I did a quick search for the actual patch but couldn't find it.
tests/integration/update.bats
Outdated
# XXX: Also, this test should be split into separate sections so that we | ||
# can skip kmem without skipping update tests overall. | ||
requires cgroups_kmem | ||
[[ "$ROOTLESS" -ne 0 ]] && requires rootless_cgroup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: swap the order of the checks to match the rest of the tests.
tests/rootless.sh
Outdated
|
||
function enable_cgroup() { | ||
# Set up cgroups for use in rootless containers. | ||
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to source this list from /proc/self/cgroups
, but I can do that in a follow-up if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will leave that up to you to decide.
Note to myself: We should add some documentation on how to use the lxcfs PAM module to allow a "normal" user to set up delegated cgroups on their host. |
7feeaab
to
dc12997
Compare
dc12997
to
3c10a6e
Compare
Okay, the one thing left is the notification scheme for OOM and memory pressure. Effectively I think that we should not allow someone to register for notifications if we did not join a cgroup (and inherited the original cgroups) -- so we'd have to save somewhere whether the cgroup manager (silently) failed to setup cgroups. I'll work on this after we merge this, since it's not a blocker IMO. |
Signed-off-by: Ed King <[email protected]> Signed-off-by: Gabriel Rosenhouse <[email protected]> Signed-off-by: Konstantinos Karampogias <[email protected]>
3c10a6e
to
ca4f427
Compare
2b72c96
to
2fba94d
Compare
This ensures that we don't hard-code the set of cgroups on the host, as well as making the permissions granted by rootless.sh much more restrictive (to improve the scope of testing). Signed-off-by: Aleksa Sarai <[email protected]>
2fba94d
to
23f4d31
Compare
Great stuff. Thanks @cyphar and all! |
Following on from the rootless cgroups discussion in #1457, this PR provides an initial implementation for rootless cgroups that fulfils Garden's needs. This does not enable all cgroup related functiionality, but we think it's a good starting point for other currently disabled features to work.
In more detail, this PR removes the current rootless cgroup implementation, and using cgroups in rootless mode should now work broadly the same as rootful, providing that runc has permissions on the cgroup path.
Differences with Rootful
Typically, without permissions on the cgroup path, we would expect an error when
applying
(entering a pid into a controller), however we now support a kind of "opportunistic" cgroup usage when no limits have been set and no cgroup path has been provided. Since child processes are entered into the same cgroup as their parent, this is effectively the same (regarding cgroup enforcement) as not providing a cgroup path and not setting limits. Attempting toset
(change a resource limit) will still result in a permission denied error either during creation or viarunc update
but hopefully with a slightly more informative error.The devices cgroup doesn't have all functionality available for setting limits. This is because there is a requirement on
CAP_SYS_ADMIN
for thedevices.allow
anddevices.deny
files. We're not sure of a way to provide different device white/blacklisting per container, but a possible solution for some use cases is to set a static list in a parent cgroup that is inherited. This works becauseapplying
does not requireCAP_SYS_ADMIN
.Disabled features
We haven't enabled OOM notification or Memory pressure notificiation but this is hopefully as simple as removing the rootless conditional:
runc/libcontainer/container_linux.go
Line 528 in e775f0f
We haven't enabled CRIU features because we aren't familiar with what is required.
Notes
The BATS added seem to cover the right features, however, we aren't super happy with the changes to the Makefile. Thoughts on how to solve the requirement that a cgroup exists, chowned to rootless in a nicer way would be much appreciated.
Signed-off-by: Ed King [email protected]