Prometheus Integration #222

bradbeam · 2018-05-03T12:49:32Z

We should consider adding in some basic Prometheus support in the generated operator. Just including the client library and exposing the metrics path I think would be a good start, perhaps define/increase a counter in the handler function to set the pattern up.

spahl · 2018-05-04T18:22:51Z

Agreed. We have been thinking about similar things but have not gotten to those yet. (https://github.com/operator-framework/operator-sdk/blob/master/doc/design/long-term/simple-extensions.md)

We would welcome any proposal if you want to give it a shot.

etiennecoutaud · 2018-05-07T07:56:50Z

@spahl I'm working on adding basic Prometheus metrics in the operator, should we generate service and service monitor to expose metrics port ?

robszumski · 2018-05-11T21:26:37Z

I think that makes sense @etiennecoutaud. The main consideration is whether there will be a cluster-wide prometheus watching for all service monitors or one per namespace. We're leaning towards the per-namespace method for most use-cases, as this gives more control over metrics retention and access control to the data.

spahl · 2018-05-11T21:27:49Z

Sorry I missed that. It does make a lot of sense.

etiennecoutaud · 2018-05-14T07:50:54Z

@spahl @robszumski Who is working on prometheus integration ? I will add service and servicemonitor when kubernetes manifest are generating

spahl · 2018-05-14T16:52:41Z

No one at this point. It is in the top list of things we want to add though. Any help is welcome.

hasbro17 · 2018-05-18T19:41:08Z

We need to consider how the SDK should help an operator its metrics to Prometheus via a Service and ServiceMonitor. One way is to generate the manifests at build time and expect the user to configure and create them. This is the approach in #241

There are a couple of issues with this approach:

The user has to change the configuration in multiple places if the default port or labels selectors for the Service or ServiceMonitor need to be changed.
Generating the manifests at build time leaves the user one more thing to do to enable a feature that could work out of the box.

A preferred approach would be to have the operator expose the metrics on a desired port and create only the Service at runtime by default. The user can pass in some configuration to the operator to override the default values for the port and Service.

The following fields can be made configurable via flags passed to the operator:

--metrics-port: The port number to serve metrics. This can have a default e.g 9090 that can be configured if needed.
--service-selector: The Service selector label spec.selector that’s used to select the operator pod. The default can be the same label generated for the operator deployment manifest e.g name=memcached-operator. This will need to be changed if the user ever updates the default labels of the operator deployment.
--disable-metrics: By default this is false. Can be set to true so the operator will not expose the metrics or create the service.

With the above flags the operator can be configured to setup the prometheus integration in 3 ways:

Expose the port and create the Service with default values:

labels:
    name: memcached-operator
spec:
  containers:
   - name: memcached-operator
      command:
      - memcached-operator
      ports:
      - containerPort: 9090

Configure the port or service selector if needed:

labels:
   app: memcached-operator
spec:
  containers:
   - name: memcached-operator
      command:
      - memcached-operator
    - --metrics-port=8080
    - --service-selector="app=memcached-operator"
      ports:
      - containerPort: 8080

Disable the prometheus integration completely:

spec:
  containers:
   - name: memcached-operator
      command:
      - memcached-operator
    - --disable-metrics=true

Creating the ServiceMonitor:
Requiring the operator to create the ServiceMonitor at runtime by default can be a problem in the event where the ServiceMonitor CRD has not been registered with the API server. In this case the operator could simply log the error and skip creating the ServiceMonitor and move on. Or it could keep retrying. In either case it’s debatable how the operator should create the ServiceMonitor.

So for a start I propose we only have the operator expose the port and create a Service at runtime.

@etiennecoutaud Let me know what you think. I think it's important to decide whether we want to generate manifests vs let the operator create the resources, before we move forward with #241

chancez · 2018-05-18T21:38:19Z

Regarding creating the ServiceMonitor, you the operator could probe available CRDs and if it exists, it can attempt to create it. This auto-creation could be default on, with the ability to turn it off using a flag (since RBAC will probably be a factor too).

eparis · 2018-05-19T14:05:59Z

Might I pretty strongly suggest we not add flags until a real user asks? And only then only if they have, what we agree to be, an extremely compelling use case. Even if we think someone 'will want this' just do the right thing, always, for them. Don't let them customize. Pick a random container port that has an extremely low chance of conflict (something like 41783) and just always turn it on. Don't let it be turned off. Don't let it be configured. Just make all operators the same.

I'm good with the operator creating it's own service definition at run time instead of build time, but if the service it just static, it could come later...

I'm big on no, config options. Just do the right thing for people. Our users should follow the best practice because it is the only path we give them.

spahl · 2018-05-21T14:48:08Z

You are right that limiting flags is a wiser approach.

What we could do easily is to not have: --disable-metrics=true and same for --metrics-port=8080 by choosing a port like you described.

I don't know how to get rid of the service-selector. Maybe we could come up with a naming convention for all operators @hasbro17?

etiennecoutaud · 2018-05-22T08:29:25Z

I agree with @spahl and @eparis for limiting flags. Today we discuss about prometheus integration but tomorrow it will be something else and we will have ton of options with ton of possibilities.

We are generating code, we consider as normal that users add code, but is it normal that users remove some generating parts because they don't need it ?
In #241 approach prometheus server handler is added and a metrics port is exposed. If users don't want to use monitoring, they remove prometheus code and don't apply service monitor.

My opinion join @eparis, we generating operator in the state of the art way. We provide everything needed to run the operator in the best way. Free to users to remove parts and change the way we purpose. We cannot handle all the possibilities, we have to choose the most smart path and propose it

eparis · 2018-05-22T13:20:58Z

For serviceSelector how about generating a label on the pod which we know to be unique and use that?
watchedresource.operatorsdk.operatorframework = ${kind}.${api-version}
We should know the kind and api-version we plan to watch, and nothing else should be watching that...

hasbro17 · 2018-05-22T20:38:55Z

@eparis @etiennecoutaud Thanks for the responses. I agree that the number of flags could easily pile up if we add more and more configuration options so perhaps it's best to just pick one approach and stick with that.
So if we have consensus over not adding too much configuration then that means the service to expose the metrics is just static with a fixed port and label selector.

For serviceSelector how about generating a label on the pod which we know to be unique and use that?
watchedresource.operatorsdk.operatorframework = ${kind}.${api-version}

Right now we generate the label name=<operator-name> for the operator pods which should be unique enough to select them since you won't ever have two operators deployments in the same namespace. So we can use that as the label selector like in #241
https://github.com/operator-framework/operator-sdk/pull/241/files#diff-52e0a0ade3435bbd8d2db3ee74b4ede1R142

I'm good with the operator creating it's own service definition at run time instead of build time, but if the service it just static, it could come later

With the service being static should we still have the operator create it at runtime? The benefit of that is the service is always created by default now. If not then we simply go back to generating the manifests as in #241 and let the user create it as needed.

hasbro17 · 2018-05-29T23:42:09Z

@etiennecoutaud @eparis I'm hoping we can move forward based on the discussion so far.

If we're all agreed on not having configuration flags then for #241 can we at start with the following:

The operator deployment manifest is generated with an exposed static container port.
The operator creates the Service at runtime. The label selector can be fixed to name=<operator-name>. The operator name can be passed in as an environment variable to the pod just like the namespace.

So for e.g the generated deployment manifest would look like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: memcached-operator
  labels:
    name: memcached-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: memcached-operator
  template:
    metadata:
      labels:
        name: memcached-operator
    spec:
      containers:
        - name: memcached-operator
          image: quay.io/example/memcached-operator:v0.0.1
          ports:
          - containerPort: 9090
            name: metrics
          command:
          - memcached-operator
          imagePullPolicy: Always
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: OPERATOR_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name

And the Service created by the operator would in turn be:

apiVersion: v1
kind: Service
metadata:
  name: memcached-operator
  labels:
    name: memcached-operator
spec:
  selector:
    name: memcached-operator
  ports:
  - protocol: TCP
    targetPort: metrics
    port: 9090
    name: metrics

Let's leave out the ServiceMonitor from #241 for now.

etiennecoutaud · 2018-05-30T05:41:10Z

Sounds good to me, I will update #241 according to this change.

joelanford · 2018-12-11T20:13:39Z

@lilic just making you aware of this issue (if you aren't already). Can this be closed once your metrics work is merged?

etiennecoutaud mentioned this issue May 14, 2018

pkg/generator: Generate all manifests needed for Prometheus integration #241

Closed

spahl added kind/feature Categorizes issue or PR as related to a new feature. design labels Jun 5, 2018

etiennecoutaud mentioned this issue Jun 22, 2018

pkg/{sdk,generator} Expose Prometheus metrics port #323

Merged

lilic mentioned this issue Dec 12, 2018

Expose controller-runtime metrics #786

Merged

lilic closed this as completed in #786 Jan 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prometheus Integration #222

Prometheus Integration #222

bradbeam commented May 3, 2018

spahl commented May 4, 2018

etiennecoutaud commented May 7, 2018

robszumski commented May 11, 2018

spahl commented May 11, 2018

etiennecoutaud commented May 14, 2018

spahl commented May 14, 2018

hasbro17 commented May 18, 2018

chancez commented May 18, 2018

eparis commented May 19, 2018 •

edited

Loading

spahl commented May 21, 2018 •

edited

Loading

etiennecoutaud commented May 22, 2018

eparis commented May 22, 2018

hasbro17 commented May 22, 2018

hasbro17 commented May 29, 2018

etiennecoutaud commented May 30, 2018

joelanford commented Dec 11, 2018

Prometheus Integration #222

Prometheus Integration #222

Comments

bradbeam commented May 3, 2018

spahl commented May 4, 2018

etiennecoutaud commented May 7, 2018

robszumski commented May 11, 2018

spahl commented May 11, 2018

etiennecoutaud commented May 14, 2018

spahl commented May 14, 2018

hasbro17 commented May 18, 2018

chancez commented May 18, 2018

eparis commented May 19, 2018 • edited Loading

spahl commented May 21, 2018 • edited Loading

etiennecoutaud commented May 22, 2018

eparis commented May 22, 2018

hasbro17 commented May 22, 2018

hasbro17 commented May 29, 2018

etiennecoutaud commented May 30, 2018

joelanford commented Dec 11, 2018

eparis commented May 19, 2018 •

edited

Loading

spahl commented May 21, 2018 •

edited

Loading