
Jail/chroot nginx process inside controller container #8337

Merged · 13 commits merged into kubernetes:main from the chroot-nginx-proc branch on Apr 9, 2022

Conversation

rikatz (Contributor) commented Mar 15, 2022

What this PR does / why we need it:

Isolate the nginx process into its own user namespace (userns) inside the main controller container.

Using this technique, we can provide a sandbox inside a sandbox (beautiful, huh?) so the nginx process won't have access to sensitive files from the main container.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Which issue/s this PR fixes

So many...

How Has This Been Tested?

  • Generated the chroot container image and ran it inside a Kubernetes cluster (simple approach)
  • The next step is to change the e2e tests and point them all at the chrooted container instead of the non-sandboxed container

TODO (Don't merge it before we pass through the TODO!!!)

  • Plan the beta for this, as it may be a breaking change
  • We need to make sure it's clear that:
    • We are adding a new security capability to the container, allowing it to unshare/create a new network namespace (netns) inside of it
    • If the admin adds/enables the seccomp feature with default profiles, ingress may stop working. This new approach NEEDS a seccomp profile that allows the "clone" syscalls (see the illustrative sketch below)
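For illustration only, the extra securityContext on the controller container could look roughly like this; the capability name and seccomp profile path below are placeholders, not the final values shipped by this PR:

  # Illustrative sketch; exact capability and seccomp settings depend on the final implementation.
  securityContext:
    capabilities:
      add:
        - NET_BIND_SERVICE   # already required by the controller today
        - SYS_CHROOT         # placeholder: capability assumed for the chroot/unshare step
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/ingress-nginx-chroot.json   # hypothetical custom profile; must allow the clone/unshare syscalls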

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 15, 2022
@k8s-ci-robot k8s-ci-robot added area/helm Issues or PRs related to helm charts area/lua Issues or PRs related to lua code approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 15, 2022
rikatz (Contributor, Author) commented Mar 15, 2022

/kind bug
/priority important-soon
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 15, 2022
@rikatz rikatz force-pushed the chroot-nginx-proc branch 2 times, most recently from fe42bc2 to b61ae2c Compare March 15, 2022 03:10
tao12345666333 (Member) commented:

wow! Great!

longwuyuan (Contributor) commented Mar 15, 2022 via email

# See the License for the specific language governing permissions and
# limitations under the License.

cat /etc/resolv.conf > /chroot/etc/resolv.conf

This comment was marked as resolved.

rikatz (Contributor, Author) replied:

Nope, because resolv.conf is generated at runtime (when the Pod is bootstrapped) and may change from cluster to cluster :D

tallclair (Member) commented:

Is there a reason the controller can't be separated out into a different container? Naively that seems like it would be a lot simpler than jailing it within the same container.

rikatz (Contributor, Author) commented Mar 15, 2022

Is there a reason the controller can't be separated out into a different container? Naively that seems like it would be a lot simpler than jailing it within the same container.

Hey @tallclair, let me try to clarify why we didn't follow this split, even though it was actually our first attempt:

  • One of the main problems is serviceAccount token extraction. Mounting the serviceAccount token is a per-pod, not a per-container, configuration (I guess; at least I couldn't find a way to say "don't mount this SA on this one container"). So even if we split the containers, the SA will still be there, usable by anyone who can add arbitrary configs to read the container FS. BTW, I can write a KEP on this to have automountServiceAccount per POD, although I promised myself to stay away from KEPs for the next 3 years ;P

  • Based on that, another approach was: "let's disable the SA automount and mount/project a specific SA token into a specific container" (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.22/#serviceaccounttokenprojection-v1-core). This works "partially". There are some weird workarounds that need to be made so client-go can use the token and behave as "inCluster" (see the sketch after this list). I'm not saying it's impossible, I could make it work, but there are so many changes required just to use this token that it was a huge effort.

  • Now, even assuming we could mount the SA on only one container and not the other (based on either of the approaches above), there are restrictions in the way ingress-nginx was built that would turn into a lot more code changes. Starting with process monitoring: the ingress-controller process monitors, via a channel opened when the nginx process starts, whether the process is respawning. More broadly, the ingress-controller process is responsible not only for that but for all control of the nginx process. It uses nginx -t to test whether a new template is valid. It reloads. It stops nginx when requested to stop. So it needs not only to share the same PID namespace, but also to have access to that same binary. One thing we started to discuss here was to move all this process logic into another Go program and receive the commands via a REST API or gRPC (i.e., the ingress controller calls an endpoint to start, stop, or restart nginx). This would take at least 15 days of implementation, as I'm not fully focused on ingress-nginx :)

  • We can also share the PID namespace between both containers without any problem. But there are some caveats in that restart/reload/stop/etc. flow that weren't that easy.
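For reference, the second bullet's idea boils down to something roughly like this at the pod level (container names, the image, and the volume name are illustrative, not code from this PR; client-go's in-cluster config expects the token under /var/run/secrets/kubernetes.io/serviceaccount, which is part of why the workarounds were needed):

  # Sketch of "disable automount for the pod, project the token into one container only".
  spec:
    automountServiceAccountToken: false        # pod-level switch, applies to every container
    volumes:
      - name: controller-token                 # hypothetical volume name
        projected:
          sources:
            - serviceAccountToken:
                path: token
            # ca.crt and the namespace would also need to be projected
            # (see the example later in this thread) for client-go to work unmodified
    containers:
      - name: controller                       # only this container gets the token
        image: ingress-nginx-controller:example   # placeholder image
        volumeMounts:
          - name: controller-token
            mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            readOnly: true
      - name: nginx                            # hypothetical split-out nginx container, no token mounted
        image: ingress-nginx-controller:example   # placeholder image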

Thinking along the same lines as kube-proxy v2 (aka kpng), we started to discuss this as an opportunity to split the control plane and data plane, using a single gRPC controller and multiple proxies that talk only to the controller and get the right information. This is, for us, the new ingress-nginx and the desired state of the art. But, again, I at least got stuck on the amount of time I have available to develop vs the amount of time required to make it work :)

So we didn't want to wait either for the promise of a new implementation or for a hard workaround; after all, this needed to be fixed fast. And this is why the chroot/jail solution came up in a weekend discussion and was implemented in about 4 hours :) I also spent some time researching the previous solutions, but all of them needed deep changes in the code and were not as easy as they seem (I thought it was going to be easy the first time as well!)

Makes sense? :)

@rikatz rikatz force-pushed the chroot-nginx-proc branch 4 times, most recently from 56ffbfa to d2557a3 Compare March 21, 2022 02:52
tallclair (Member) commented:

@rikatz thanks for the detailed response.

I can write a KEP on this to have automountServiceAccount per POD

I might be misunderstanding you, but there's already a pod-level AutomountServiceAccountToken field: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L3134-L3136

There are some weird workarounds that need to be made so client-go can use the token and behave as "inCluster".

Do you have any more details on what went wrong? I'm guessing you're missing some of the other projected data. The easiest thing is to just copy the projected volume from a pod with the SA auto-mounted:

  - name: kube-api-access-7g4tx
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

So it needs not only to share the same PID namespace, but also to have access to that same binary.

I'm sure I don't understand all the nuances here, but if the issue is just that the controller needs access to the nginx binary, then you could consider just putting everything you need in a single container image, and run both containers off the same image (just with a different entrypoint / command).
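Something roughly like the following is what I mean, combined with the shared PID namespace that comes up next (the image name and commands are assumptions, not the controller's actual entrypoints):

  # Illustrative only: both containers run the same image with different commands.
  spec:
    shareProcessNamespace: true                  # lets the controller see and signal the nginx process
    containers:
      - name: controller
        image: ingress-nginx-controller:example  # placeholder: one image containing both binaries
        command: ["/nginx-ingress-controller"]   # assumed controller entrypoint
      - name: nginx
        image: ingress-nginx-controller:example  # same image, different entrypoint
        command: ["nginx", "-g", "daemon off;"]  # assumed nginx invocation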

We can also share the PID namespace between both containers without any problem. But there are some caveats in that restart/reload/stop/etc. flow that weren't that easy

Yeah, that makes sense. I think stop should be straightforward? In place of restart, could you just have something that monitors the process (e.g. systemd, or just rely on the kubelet) and restarts it as soon as it's killed?
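For the "rely on the kubelet" option, the usual mechanism would be a liveness probe on the nginx container, e.g. something like this (the /healthz path and port 10254 are the controller's usual defaults, but treat them as assumptions here):

  livenessProbe:
    httpGet:
      path: /healthz        # assumed health endpoint
      port: 10254           # assumed controller status port
    periodSeconds: 10
    failureThreshold: 3     # kubelet restarts the container after repeated failures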

I'm just guessing here... I can see why you might need the gRPC interface instead.


On a separate topic, you might want to have a look at AppArmor hats, which are a way to change the AppArmor context, designed to change privileges on the fly. It might be a good fit for this use case (but only works on AppArmor systems).

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2022
@@ -74,7 +74,7 @@ spec:
containers:
- name: {{ .Values.controller.containerName }}
{{- with .Values.controller.image }}
- image: "{{- if .repository -}}{{ .repository }}{{ else }}{{ .registry }}/{{ .image }}{{- end -}}:{{ .tag }}{{- if (.digest) -}} @{{.digest}} {{- end -}}"
+ image: "{{- if .repository -}}{{ .repository }}{{ else }}{{ .registry }}/{{ include "ingress-nginx.image" . }}{{- end -}}:{{ .tag }}{{- if (.digest) -}} @{{.digest}} {{- end -}}"
Reviewer (Member) commented on this diff:

I understand that currently we can handle the image name conversion through helper functions, but once chroot is enabled, should we drop the digest?

The expected behavior is that a configuration item lets users switch images transparently, but digests don't work the same way. If the user configures a digest, the image download will fail.

rikatz (Contributor, Author) replied:

Nope, I think we may have a digest for chroot as well. This can be another argument, that's fine. Because we don't have a digest yet (no image published) I forgot about this, but we can fix it in a follow-up.

Reviewer (Member) replied:

OK, I agree.

@@ -61,6 +59,7 @@ import (
  "k8s.io/ingress-nginx/internal/nginx"
  "k8s.io/ingress-nginx/internal/task"
  "k8s.io/ingress-nginx/internal/watch"
+ "k8s.io/klog/v2"
Reviewer (Member) commented:

nit: keep the previous import order

rikatz (Contributor, Author) replied:

🤦 fixing

@rikatz rikatz force-pushed the chroot-nginx-proc branch from b0aec45 to 875a86f Compare April 9, 2022 03:38
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 9, 2022
@rikatz rikatz force-pushed the chroot-nginx-proc branch from 875a86f to 2718ca6 Compare April 9, 2022 03:39
rikatz (Contributor, Author) commented Apr 9, 2022

@tao12345666333 fixed the helm thing, please take a look at it

tao12345666333 (Member) left a review:

I think this is great! If you want to merge this PR at any time please remove the hold

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 9, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 9, 2022
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rikatz, tao12345666333

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rikatz (Contributor, Author) commented Apr 9, 2022

/hold cancel
So I can sleep fine and start working on another part tomorrow :)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 9, 2022
@k8s-ci-robot k8s-ci-robot merged commit 3def835 into kubernetes:main Apr 9, 2022
rchshld pushed a commit to joomcode/ingress-nginx that referenced this pull request May 19, 2023
* Initial work on chrooting nginx process

* More improvements in chroot

* Fix charts and some file locations

* Fix symlink on non chrooted container

* fix psp test

* Add e2e tests to chroot image

* Fix logger

* Add internal logger in controller

* Fix overlay for chrooted tests

* Fix tests

* fix boilerplates

* Fix unittest to point to the right pid

* Fix PR review
Labels
  • approved — Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/helm — Issues or PRs related to helm charts
  • area/lua — Issues or PRs related to lua code
  • cncf-cla: yes — Indicates the PR's author has signed the CNCF CLA.
  • kind/bug — Categorizes issue or PR as related to a bug.
  • lgtm — "Looks good to me", indicates that a PR is ready to be merged.
  • priority/important-soon — Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • size/XL — Denotes a PR that changes 500-999 lines, ignoring generated files.
  • tide/merge-method-squash — Denotes a PR that should be squashed by tide when it merges.
  • triage/accepted — Indicates an issue or PR is ready to be actively worked on.
6 participants