Document Observability support for Knative components and Knative services #2931

Closed
skonto opened this issue Oct 14, 2020 · 15 comments · Fixed by #4138
Labels: priority/medium, triage/needs-eng-input (Engineering input is requested)

@skonto
Contributor

skonto commented Oct 14, 2020

Describe the change you'd like to see
In the related WG meeting we discussed that users may need guidance on which metrics to use when monitoring Knative services. This has been raised before in #2070. The purpose of this issue is to capture the personas involved, their use cases, and, as a result, the minimum information to expose in the docs about observability in general, not just metrics.
The personas can be split into roughly two groups in terms of requirements:
a) developer/tester/service delivery
These people develop and deploy Knative services and would like to know what to monitor to make sure everything is running smoothly in staging or development environments. They want to understand key Knative features such as Serving autoscaling and to measure performance before shipping their service. If they use Eventing, they want to know how their processing graph behaves (sources/sinks/dispatchers/brokers etc.) and whether their deployed artifacts are performing well. They might also want to understand networking performance and behavior when a service mesh is used.
These people assume that the Knative control plane and infrastructure are in place to run their services and to debug them using logs, metrics and traces.

b) SREs/Operations/support engineers
These people are responsible for operating, debugging and automating Knative services and the Knative components themselves in production. At a minimum they need support for the four golden signals (latency, traffic, errors, saturation) so they can set proper alarms etc. For example, how can we help them detect whether their setup is healthy?
They need to understand how Knative components and Knative services integrate with the existing infrastructure at all levels. From that perspective, they need to know what requirements Knative imposes, if any, when integrating with existing infrastructure for logging, metrics acquisition and tracing.

Notes:

  1. Part of the observability functionality will be provided by vendors building products on top of Knative, and thus they can also be seen as a separate persona.
  2. It is important to help people (any persona) with their getting started UX.

The proposed minimum items to add to the docs are:

  • Which metrics are important for the different personas. Each should come with a description and an example of what is considered a good value and what is not. Also, which metrics are needed to debug common issues such as latency, saturation etc.
    It would be nice to have a sample app that demonstrates all the above.
  • What logs/traces the different personas need to gather to monitor the different Knative components.
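
As a rough illustration of the first item, the sketch below shows one way "good value vs. bad value" guidance could be expressed for the four golden signals. It is not Knative code: the `GoldenSignals` fields, the threshold numbers, and the `unhealthy_signals` helper are all hypothetical, chosen only to show the shape such guidance might take.

```python
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    """One observation window for a service (all field names are hypothetical)."""
    p95_latency_ms: float        # latency
    requests_per_second: float   # traffic
    error_ratio: float           # errors, as a fraction of requests (0.0 to 1.0)
    cpu_utilization: float       # saturation, as a fraction of the limit (0.0 to 1.0)


# Illustrative thresholds only; real values depend on the service and its SLOs.
# Traffic is recorded for context but has no threshold in this sketch.
THRESHOLDS = {
    "p95_latency_ms": 500.0,
    "error_ratio": 0.01,
    "cpu_utilization": 0.8,
}


def unhealthy_signals(sample: GoldenSignals) -> list[str]:
    """Return the names of the signals whose value exceeds its threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if getattr(sample, name) > limit
    ]


good = GoldenSignals(p95_latency_ms=120.0, requests_per_second=50.0,
                     error_ratio=0.001, cpu_utilization=0.4)
bad = GoldenSignals(p95_latency_ms=900.0, requests_per_second=50.0,
                    error_ratio=0.05, cpu_utilization=0.4)

print(unhealthy_signals(good))  # []
print(unhealthy_signals(bad))   # ['p95_latency_ms', 'error_ratio']
```

In real docs the thresholds would presumably be replaced by per-metric guidance for each persona rather than fixed numbers.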

Note: Any change to the above should be accompanied by a follow-up change in the docs as part of the upstream PR work.
I believe metrics are not expected to change that often.

/cc @abrennan89 @evankanderson @dprotaso

@skonto skonto changed the title from "Document how users can monitor Knative components and Knative services" to "Document Observability support for Knative components and Knative services" Oct 14, 2020
@upodroid
Member

I like the route that Istio took, which is to provide Grafana dashboards and ServiceMonitor resources that end users can deploy into their Prometheus installations. I'm happy to contribute them.

@iancoffey

iancoffey commented Nov 17, 2020

I am game to help develop this solution as it is related to solving the kfserving metrics situation. 👍 😄

@skonto
Contributor Author

skonto commented Nov 18, 2020

@iancoffey does KFServing offer any tooling for setting up Prometheus and/or Grafana, at least for the getting started UX? Did it use to rely on Knative for that setup? Do you think this tooling could be part of a shared repo, and if so, how could it be maintained? Maintainability was an issue in the first place for Knative monitoring.
The idea for the docs is more about providing high-level info on what metrics tell you about your apps, some details about tracing etc., and not about copying the old Prometheus setup scripts or other scripts/yaml files in there /cc @abrennan89. Should we discuss this at the next docs WG meeting? @iancoffey feel free to join, it would be great to discuss this.
@evankanderson WDYT?

@iancoffey

Did it use to rely on Knative for that setup?

The kfserving docs relied on the observability metrics bits that were removed, and that's the genesis of my interest: trying to understand that gap better, and how to help solve it.

Do you think this tooling could be part of a shared repo, and if so, how could it be maintained?

It's an interesting idea, maybe as a small experimental project. It would be great to have something to plug the gap until there are longer-term options down the road. I wonder what kfserving folks would think of that.

@skonto
Contributor Author

skonto commented Nov 18, 2020

I wonder what kfserving folks would think of that.

Pinging @evankanderson @mdemirhan @markusthoemmes for that matter as well.

@iancoffey

I've assembled a simple yaml+config solution to get a Knative Grafana dashboard up and running here: https://github.com/iancoffey/knative-metrics. The intention is to unblock kfserving metrics for the time being. I can add all the relevant dashboards if there's interest in developing this stopgap further.

@yuzisun

yuzisun commented Dec 11, 2020

I've assembled a simple yaml+config solution to get a Knative Grafana dashboard up and running here: https://github.com/iancoffey/knative-metrics. The intention is to unblock kfserving metrics for the time being. I can add all the relevant dashboards if there's interest in developing this stopgap further.

I like the idea. Maybe this can temporarily live in KFServing as an additional monitoring plugin that users can install, until we figure out a long-term maintenance model with the Knative team.

@iancoffey

OK I shall open this in a PR on kfserving repo next 👍

@skonto
Contributor Author

skonto commented Dec 11, 2020

@iancoffey thanks for hosting this. I went through the yaml files and found some inconsistencies wrt the old removed scrape config, check this vs this line. Is there some way we can verify configs and stuff?
I pinged Knative people on the monitoring channel; maybe we can store this under knative. Please join #monitoring in knative.slack.com to participate there as well.

@abrennan89 abrennan89 added this to the Icebox milestone Jan 25, 2021
@abrennan89 abrennan89 added the triage/needs-eng-input Engineering input is requested label Jan 25, 2021
@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2021
@dprotaso
Member

Following up: @iancoffey switched teams and isn't involved with KFServing anymore.

I'm happy to contribute them

@upodroid are you interested in maintaining these contributions across releases? Ideally we would like someone to be regularly involved, meaning fixing issues and writing some documentation on how to use these monitors. We just want to avoid these going stale again and being unmaintained.

/remove-lifecycle stale

@knative-prow-robot knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 14, 2021
@upodroid
Member

I'm up for it.

I have joined the google group.

This fits the Operations WG nicely.

@skonto
Contributor Author

skonto commented Jun 15, 2021

I can also help @upodroid :) We should discuss where to host such artifacts, probably under knative-sandbox.

@upodroid
Member

Great, I'll see you at the Operations WG today.

@dprotaso
Member

dprotaso commented Jun 15, 2021

Following up from the operations WG meeting 2021-06-15

We're going to create a sandbox repo and revive parts of the monitoring bundle:
Issue: knative/community#654
PR: knative/community#653

After revival we'll drive improvements and potential publishing of dashboards to grafana.com

8 participants