Document Observability support for Knative components and Knative services #2931

skonto · 2020-10-14T09:04:44Z

Describe the change you'd like to see
In the related WG meeting we discussed that users may need guidance on which metrics to use when monitoring Knative services. This was raised before in #2070. The purpose of this feature is to capture the personas involved and their use cases, and from those derive the minimum information the docs should expose about observability as a whole, not just metrics.
The personas can be split into roughly two groups in terms of requirements:
a) developer/tester/service delivery
These people develop and deploy Knative services and would like to know what to monitor to make sure everything is running smoothly in staging or development environments. They want to understand key Knative features like Serving autoscaling and measure performance before shipping their service. If they use Eventing, they want to know how their processing graph behaves (sources, sinks, dispatchers, brokers, etc.) and whether their deployed artifacts are performing well. They might also want to understand networking performance and behavior when a service mesh is used.
These people assume that the Knative control plane and infrastructure are in place to run their services, and they debug those services using logs, metrics and traces.

b) SREs/Operations/support engineers
These people are responsible for operating, debugging and automating Knative services and the Knative components themselves in production. At a minimum they need support for the four golden signals (latency, traffic, errors, saturation) so they can set proper alarms. For example, how can we help them detect whether their setup is healthy?
They need to understand how Knative components and Knative services integrate with the existing infrastructure at all levels. From that perspective, they need to know what requirements Knative imposes, if any, when integrating with existing infrastructure in terms of logging, metrics acquisition and tracing.
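
To make the golden signals concrete, here is a minimal sketch of how they could be derived from raw request samples. It is illustrative only: the `RequestSample` type, the `golden_signals` function, and the `in_flight`/`capacity` inputs are hypothetical names and do not correspond to any Knative metric or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, not a Knative type)."""
    latency_ms: float
    status_code: int


def golden_signals(samples: List[RequestSample], window_seconds: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    """Summarise latency, traffic, errors and saturation for one time window."""
    if not samples or window_seconds <= 0 or capacity <= 0:
        raise ValueError("need samples, a positive window and a positive capacity")
    latencies = sorted(s.latency_ms for s in samples)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = max(1, (95 * len(latencies) + 99) // 100)
    server_errors = sum(1 for s in samples if s.status_code >= 500)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(samples) / window_seconds,
        "error_rate": server_errors / len(samples),
        "saturation": in_flight / capacity,
    }
```

An alerting rule would then compare each value against a threshold chosen per service. Which thresholds count as good or bad values is exactly what the docs proposed here would need to describe.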

Notes:

Part of the observability functionality will be provided by vendors building products on top of Knative, so they can also be seen as a separate persona.
It is important to help people (any persona) with their getting started UX.

The proposed minimum items to add to the docs are:

Which metrics are important for the different personas. A description should be available for each, along with an example of what is considered a good value and what is not. Which metrics are needed to debug common issues like latency, saturation, etc.
It would be nice to have a sample app that demonstrates all the above.
What logs/traces the different personas need to gather to monitor the different Knative components.

Note: Any change to the above should be done with a followup change in docs as part of the PR work upstream.
I believe metrics are not expected to change that often.

/cc @abrennan89 @evankanderson @dprotaso

upodroid · 2020-10-15T14:35:59Z

I like the route that Istio took, which is to provide the Grafana dashboards and ServiceMonitor resources for end users to deploy into their Prometheus installations. I'm happy to contribute them.

iancoffey · 2020-11-17T14:00:29Z

I am game to help develop this solution as it is related to solving the kfserving metrics situation. 👍 😄

skonto · 2020-11-18T15:30:56Z

@iancoffey does KFServing offer any tooling for setting up Prometheus and/or Grafana, at least for the getting started UX? Did it use to rely on Knative for that setup? Do you think this tooling could be part of a shared repo, and if so, how could it be maintained? Maintainability was an issue in the first place for Knative monitoring.
The idea for the docs is oriented more towards providing high-level info on what metrics tell you about your apps, some details about tracing, etc., and not copying the old Prometheus setup scripts or other scripts/yaml files in there /cc @abrennan89. Should we discuss this at the next docs WG meeting? @iancoffey feel free to join, it would be great to discuss this.
@evankanderson WDYTH?

iancoffey · 2020-11-18T15:56:48Z

Did it use to rely on Knative for that setup?

The kfserving docs relied on the observability metrics bits that were removed, and that's the genesis of my interest: trying to understand that gap better, and how to help solve it.

Do you think this tooling could be part of a shared repo, if so how this could be maintained?

It's an interesting idea, maybe as a small experimental project. It would be great to have something to plug the gap until there are longer-term options down the road. I wonder what kfserving folks would think of that.

skonto · 2020-11-18T16:47:02Z

I wonder what kfserving folks would think of that.

Pinging @evankanderson @mdemirhan @markusthoemmes for that matter as well.

iancoffey · 2020-12-07T20:14:33Z

I've assembled a simple yaml+config solution to get a Knative Grafana dashboard up and running here: https://github.com/iancoffey/knative-metrics. The intention is to unblock kfserving metrics for the time being. I can add all the relevant dashboards if there's interest in developing this stopgap further.

yuzisun · 2020-12-11T00:07:33Z

I've assembled a simple yaml+config solution to get a Knative Grafana dashboard up and running here: https://github.com/iancoffey/knative-metrics. The intention is to unblock kfserving metrics for the time being. I can add all the relevant dashboards if there's interest in developing this stopgap further.

I like the idea. Maybe this can temporarily live in KFServing as an additional monitoring plugin that users can install, until we figure out a long-term maintenance model with the Knative team.

iancoffey · 2020-12-11T17:35:50Z

OK I shall open this in a PR on kfserving repo next 👍

skonto · 2020-12-11T18:46:01Z

@iancoffey thanks for hosting this. I went through the yaml files and found some inconsistencies wrt the old removed scrape config, check this vs this line. Is there some way we can verify the configs?
I pinged Knative people on the monitoring channel; maybe we can store this under knative. Please join #monitoring in knative.slack.com to participate there as well.

github-actions · 2021-04-25T21:23:44Z

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

dprotaso · 2021-06-14T20:24:26Z

Following up: @iancoffey switched teams and isn't involved with KFServing anymore.

I'm happy to contribute them

@upodroid are you interested in maintaining these contributions across releases? Ideally we would like someone to be regularly involved, meaning fixing issues and writing some documentation on how to use these monitors. We just want to avoid these going stale again and being unmaintained.

/remove-lifecycle stale

upodroid · 2021-06-14T21:05:31Z

I'm up for it.

I have joined the google group.

This fits the Operations WG nicely.

skonto · 2021-06-15T15:52:53Z

I can also help @upodroid :) We should discuss where to host such artifacts, probably under knative-sandbox.

upodroid · 2021-06-15T16:04:28Z

great, i'll see you at the Operations WG today.

dprotaso · 2021-06-15T21:09:06Z

Following up from the operations WG meeting 2021-06-15

We're going to create a sandbox repo and revive parts of the monitoring bundle:
Issue: knative/community#654
PR: knative/community#653

After revival we'll drive improvements and potential publishing of dashboards to grafana.com

skonto changed the title ~~Document how users can monitor Knative components and Knative services~~ Document Observability support for Knative components and Knative services Oct 14, 2020

dprotaso mentioned this issue Oct 15, 2020

Knative Serving - Servicemonitor CRDS knative/serving#9596

Closed

iancoffey mentioned this issue Oct 28, 2020

Update references to Knative monitoring docs kserve/kserve#1162

Closed

abrennan89 mentioned this issue Nov 3, 2020

Show how to use Prometheus and/or Grafana in "accessing-metrics" docs #2070

Closed

abrennan89 added this to the Icebox milestone Jan 25, 2021

abrennan89 added the triage/needs-eng-input Engineering input is requested label Jan 25, 2021

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2021

knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 14, 2021

dprotaso mentioned this issue Jun 16, 2021

New Repo: knative-sandbox/monitoring knative/community#654

Closed

13 tasks

This was referenced Jun 28, 2021

New "Cluster maintenance" section #1895

Closed

Restructure "Eventing component" section #1923

Closed

RichardJJG added the priority/medium label Jul 28, 2021

upodroid mentioned this issue Aug 19, 2021

Update metrics documentation #4138

Merged

abrennan89 assigned upodroid Oct 13, 2021

knative-prow-robot closed this as completed in #4138 Oct 21, 2021

aslom mentioned this issue Mar 22, 2022

Improve Knative Eventing End-to-End Observability (GSOC) knative/eventing#6247

Closed
