
I should be able to serve metrics over https #396

Closed
shawn-hurley opened this issue Aug 10, 2018 · 20 comments
Labels
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
triage/needs-information: Indicates an issue needs more information in order to work on it.
Milestone
v1.0.0

Comments

@shawn-hurley
Member

Currently ExposeMetricsPort() does not allow us to serve metrics over HTTPS.

I propose that we add an options struct. This would let a user specify the certs (maybe even have them generated for the user?); if nothing is set, the function behaves as it does today.

This would be a breaking change to the function. We could instead introduce a new function, but my vote would be to break the API and have it accept an options struct.
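A rough sketch of what that could look like, assuming Go's net/http and the Prometheus client library; the names MetricsOptions and ServeMetrics are hypothetical (not the SDK's actual API), and the plain-HTTP fallback preserves today's behavior.

// Hypothetical sketch only; not the SDK's real ExposeMetricsPort signature.
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// MetricsOptions configures how the metrics endpoint is exposed.
type MetricsOptions struct {
	Addr     string // listen address, e.g. ":8383"
	CertFile string // path to a TLS certificate; empty means plain HTTP
	KeyFile  string // path to the matching TLS private key
}

// ServeMetrics serves /metrics over HTTPS when a cert/key pair is provided,
// and falls back to plain HTTP otherwise (the current behavior).
func ServeMetrics(opts MetricsOptions) error {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	if opts.CertFile != "" && opts.KeyFile != "" {
		return http.ListenAndServeTLS(opts.Addr, opts.CertFile, opts.KeyFile, mux)
	}
	return http.ListenAndServe(opts.Addr, mux)
}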

@lilic lilic added the metrics label Nov 16, 2018
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2019
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 8, 2019
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hasbro17
Contributor

hasbro17 commented Jul 9, 2019

@lilic Do we need to reopen this? Don't think we have this yet.

@lilic
Member

lilic commented Jul 9, 2019

I am working on this right now.

/reopen

@openshift-ci-robot

@lilic: Reopened this issue.

In response to this:

I am working on this right now.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lilic
Member

lilic commented Jul 9, 2019

/remove-lifecycle rotten

@openshift-ci-robot openshift-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 9, 2019
@lilic
Member

lilic commented Jul 10, 2019

TLS metrics for operators deployed in OpenShift 4.x

The following is based on how kube-state-metrics secures its metrics endpoint with TLS in OpenShift. For Kubernetes we will add documentation describing how to secure the metrics endpoints, because there is no native solution for this in Kubernetes itself. kubebuilder suggests using kube-rbac-proxy and cert-manager for this; it does not enable them by default, so you have to opt in. We can provide docs and example manifests showing how to do this, and users can follow that.

TLS in OpenShift 4.x

We need to add the annotation to the Service object before the operator is deployed, as we reference the created Secret in the deployment itself, so all of these resources need to be created before the operator starts. The Service and ServiceMonitor objects will therefore be created from manifest files instead of programmatically as we do right now; this applies only to OpenShift for now. We need a new subcommand, operator-sdk generate openshift, which could also be used to generate other OpenShift-specific things later on. For now it would generate the deploy-openshift directory (a rough sketch of such a subcommand follows the directory listing below).

The content of the deploy-openshift directory:

crds/  operator.yaml  role.yaml  role_binding.yaml  service.yaml  service-monitor.yaml  service_account.yaml
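A minimal sketch, assuming spf13/cobra, of how the proposed operator-sdk generate openshift subcommand could scaffold that directory; the command wiring, flag name, and placeholder file contents are illustrative only, not an implemented SDK command.

// Hypothetical sketch of an `operator-sdk generate openshift` subcommand
// (assumes spf13/cobra; not the SDK's real command).
package generate

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/spf13/cobra"
)

// NewOpenShiftCmd scaffolds the deploy-openshift directory described above.
func NewOpenShiftCmd() *cobra.Command {
	var outputDir string
	cmd := &cobra.Command{
		Use:   "openshift",
		Short: "Generate OpenShift-specific manifests (Service, ServiceMonitor, kube-rbac-proxy sidecars)",
		RunE: func(cmd *cobra.Command, args []string) error {
			if err := os.MkdirAll(filepath.Join(outputDir, "crds"), 0o755); err != nil {
				return err
			}
			// A real implementation would render these from templates; here we
			// only lay out the files listed above as placeholders.
			files := []string{
				"operator.yaml", "service.yaml", "service-monitor.yaml",
				"role.yaml", "role_binding.yaml", "service_account.yaml",
			}
			for _, f := range files {
				path := filepath.Join(outputDir, f)
				if err := os.WriteFile(path, []byte("# TODO: rendered manifest\n"), 0o644); err != nil {
					return err
				}
				fmt.Println("wrote", path)
			}
			return nil
		},
	}
	cmd.Flags().StringVar(&outputDir, "output-dir", "deploy-openshift", "directory for the generated manifests")
	return cmd
}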

Content of operator.yaml - deployment manifest file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: memcached-operator
  labels:
    name: memcached-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: memcached-operator
  template:
    metadata:
      labels:
        name: memcached-operator
    spec:
      serviceAccountName: memcached-operator
      volumes:
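      # Serving-cert Secret created via the service.alpha.openshift.io/serving-cert-secret-name annotation on the Service (see service.yaml below)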
      - name: memcached-operator
        secret:
          secretName: memcached-operator
      - emptyDir: {}
        name: volume-directive-shadow
      containers:
        - name: memcached-operator
          image: myimage:latest
          command:
          - operator-metrics-tls
          imagePullPolicy: Always
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: "memcached-operator"
        - args:
          - --logtostderr
          - -v=8
          - --secure-listen-address=:9393
          - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
          - --upstream=http://127.0.0.1:8383/
          - --tls-cert-file=/etc/tls/private/tls.crt
          - --tls-private-key-file=/etc/tls/private/tls.key
          image: quay.io/coreos/kube-rbac-proxy:v0.4.1
          name: kube-rbac-proxy-main
          ports:
          - containerPort: 9393
            name: op-metrics
          resources:
            requests:
              cpu: 10m
              memory: 40Mi
          volumeMounts:
          - mountPath: /etc/tls/private
            name: memcached-operator
            readOnly: false
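        # Second kube-rbac-proxy sidecar for the custom-resource metrics port, proxying :9696 -> 127.0.0.1:8686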
        - args:
          - --logtostderr
          - -v=8
          - --secure-listen-address=:9696
          - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
          - --upstream=http://127.0.0.1:8686/
          - --tls-cert-file=/etc/tls/private/tls.crt
          - --tls-private-key-file=/etc/tls/private/tls.key
          image: quay.io/coreos/kube-rbac-proxy:v0.4.1
          name: kube-rbac-proxy-self
          ports:
          - containerPort: 9696
            name: cr-metrics
          resources:
            requests:
              cpu: 10m
              memory: 40Mi
          volumeMounts:
          - mountPath: /etc/tls/private
            name: memcached-operator
            readOnly: false

Content of the service.yaml file:

apiVersion: v1
kind: Service
metadata:
  annotations:
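    # Asks the OpenShift service CA operator to generate a serving cert/key pair into the named Secret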
    service.alpha.openshift.io/serving-cert-secret-name: memcached-operator
  labels:
    name: memcached-operator
  name: memcached-operator
  namespace: default
spec:
  ports:
  - name: cr-metrics
    port: 9696
    targetPort: cr-metrics
  - name: op-metrics
    port: 9393
    targetPort: op-metrics
  selector:
    name: memcached-operator
  type: ClusterIP

Content of the service-monitor.yaml file:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: memcached-operator
  name: memcached-operator
  namespace: default
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 2m
    port: cr-metrics
    scheme: https
    scrapeTimeout: 2m
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: memcached-operator.default.svc
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 2m
    port: op-metrics
    scheme: https
    scrapeTimeout: 2m
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: memcached-operator.default.svc
  jobLabel: memcached-operator
  selector:
    matchLabels:
        name: memcached-operator

Those are the only changes to the files we need to make.

Open questions

  1. Does this work the same way in 3.x versions of OpenShift?
  2. We could also do this work when a new operator is created, if the user knows it will be deployed only on OpenShift, via a flag on the new subcommand. The flag would be something like operator-sdk new --openshift, and it would generate the same resources as above.

@lilic
Member

lilic commented Jul 10, 2019

cc @estroz @shawn-hurley @hasbro17 @AlexNPavel @joelanford Can you let me know if you have any objections? Otherwise I will start implementing the above. Thanks!

@ironcladlou

@lilic

We need to add the annotation to the Service object before the operator is deployed, as we reference the created Secret in the deployment itself, so all of these resources need to be created before the operator starts.

A deployment can reference a non-existent secret in a volume mount, and if/when the secret does exist, the replica containers will successfully create. Given that, the secret and deployment creation order should be fundamentally irrelevant (modulo any peripheral concerns, e.g. backoff time re-creating the container if there are initial failures due to the absence of the secret).

It wasn't clear to me from your statement whether we agree on that assumption. I'm not sure if it would affect your design one way or another, but it's something that stood out to me.

@lilic
Member

lilic commented Jul 11, 2019

@ironcladlou

A deployment can reference a non-existent secret in a volume mount, and if/when the secret does exist, the replica containers will successfully create.

Yes, it can reference it, but currently we create the Service and ServiceMonitor resources programmatically in the operator itself, so we have to change to the above design and create these resources alongside the operator's deployment (regardless of order, of course). As you said, the container will sit in ContainerCreating until the Secret is mounted, but the Service and ServiceMonitor would never be created. Agreed that was not clear from my comment above, thanks for commenting!
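For context, here is a minimal sketch of the programmatic approach being replaced, assuming controller-runtime's client; the Service name and port are illustrative.

// Minimal sketch of the current programmatic approach referred to above: the
// operator itself creates the metrics Service at startup using the
// controller-runtime client. Names and ports are illustrative only.
package metrics

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func createMetricsService(ctx context.Context, c client.Client, namespace string) error {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "memcached-operator",
			Namespace: namespace,
			Labels:    map[string]string{"name": "memcached-operator"},
		},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"name": "memcached-operator"},
			Ports: []corev1.ServicePort{{
				Name:       "op-metrics",
				Port:       9393,
				TargetPort: intstr.FromString("op-metrics"),
			}},
		},
	}
	// If the Service is only created here, it never exists while the operator
	// pod cannot start (e.g. waiting on a missing Secret) -- hence the move to
	// static manifests in the proposal above.
	return c.Create(ctx, svc)
}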

@ironcladlou

@lilic I think your elaboration on the current behavior completed the picture for me. Thank you! Your proposal sounds good to me.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2019
@camilamacedo86 camilamacedo86 removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2019
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 7, 2020
@estroz estroz unassigned lilic Mar 2, 2020
@estroz estroz added the triage/needs-information Indicates an issue needs more information in order to work on it. label Mar 2, 2020
@estroz
Member

estroz commented Mar 2, 2020

/lifecycle frozen

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Mar 2, 2020
@estroz estroz added this to the v1.0.0 milestone Mar 2, 2020
@estroz estroz self-assigned this Mar 2, 2020
@estroz estroz added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Mar 2, 2020
@theishshah theishshah self-assigned this Mar 4, 2020
@estroz estroz removed their assignment Jun 15, 2020
@camilamacedo86
Contributor

camilamacedo86 commented Jul 10, 2020

@estroz
Member

estroz commented Jul 24, 2020

This will be done once the Ansible plugin is complete: #3488.

@camilamacedo86
Contributor

The Ansible plugin is merged 👍 Closing this one.
