Document Observability support for Knative components and Knative services #2931
Comments
I like the route that Istio took, which is to provide the Grafana dashboards and ServiceMonitor resources for end users to deploy into their Prometheus installations. I'm happy to contribute them.
I am game to help develop this solution, as it is related to solving the KFServing metrics situation. 👍 😄
@iancoffey does KFServing offer any tooling for setting up Prometheus and/or Grafana, at least for the getting-started UX? Did it use to rely on Knative for that setup? Do you think this tooling could be part of a shared repo and, if so, how could it be maintained? Maintainability was an issue in the first place for Knative monitoring.
The KFServing docs relied on the observability metrics bits that were removed, and that's the genesis of my interest: trying to understand that gap better, and how to help solve it.
It's an interesting idea, maybe a small experimental project. It would be great to have something to plug the gap until there are longer-term options down the road. I wonder what the KFServing folks would think of that.
Pinging @evankanderson @mdemirhan @markusthoemmes on this as well.
I've assembled a simple YAML+config solution to get a Knative Grafana dashboard up and running here: https://github.com/iancoffey/knative-metrics. The intention is to unblock KFServing metrics for the time being. I can add all the relevant dashboards if there's interest in developing this stopgap further.
I like the idea. Maybe this can temporarily live in KFServing as an additional monitoring plugin that users can install before we figure out a long-term maintenance model with the Knative team.
OK, I shall open this as a PR on the kfserving repo next 👍
@iancoffey thanks for hosting this. I went through the YAML files and found some inconsistencies with respect to the old, removed scrape config; check this vs this line. Is there some way we can verify the configs?
This issue is stale because it has been open for 90 days with no |
Following up: @iancoffey switched teams and isn't involved with KFServing anymore.
@upodroid are you interested in maintaining these contributions across releases? We would ideally like someone to be regularly involved, meaning fixing issues and writing some documentation on how to use these monitors. We just want to avoid these going stale again and being unmaintained. /remove-lifecycle stale
I'm up for it. I have joined the Google group. This fits the Operations WG nicely.
I can also help @upodroid :) We should discuss where to host such artifacts, probably under knative-sandbox.
Great, I'll see you at the Operations WG today.
Following up from the Operations WG meeting on 2021-06-15: we're going to create a sandbox repo and revive parts of the monitoring bundle. After the revival we'll drive improvements and potentially publish the dashboards to grafana.com.
Describe the change you'd like to see
In the related WG meeting we discussed that users may require guidance about which metrics to use when monitoring Knative services. This has been raised before in #2070. The purpose of this feature is to capture the personas involved and their use cases and, as a result, the minimum information to expose in the docs with respect to observability, not just metrics.
The personas can be split into roughly two groups in terms of requirements:
a) developer/tester/service delivery
These people develop and deploy Knative services, and they would like to know what to monitor to make sure
everything is running smoothly in staging or development environments. They want to understand key Knative features like
Serving autoscaling and to measure performance before shipping their service. If they use Eventing, they want to know how their processing graph behaves (sources, sinks, dispatchers, brokers, etc.) and whether their deployed artifacts are performing well. They might also want to understand networking performance and behavior when a service mesh is used.
These people assume that the Knative control plane and infrastructure are in place to run their services, and they debug those services using logs, metrics and traces.
b) SREs/Operations/support engineers
These people are responsible for operating, debugging and automating Knative services and the Knative components themselves in production. As a minimum they need support for the four golden signals so that they can set proper alarms. For example, how can we help them detect whether their setup is healthy?
They need to understand how Knative components and Knative services integrate with the existing infrastructure at all levels. From that perspective, they need to know what requirements Knative imposes, if any, when integrating with existing infrastructure in terms of logging, metrics acquisition and tracing.
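As a rough illustration of the four golden signals mentioned above (latency, traffic, errors, saturation), here is a minimal Python sketch that derives them from a window of request samples. The `Request` type, the `golden_signals` function and the choice to model saturation as in-flight requests divided by a concurrency limit are all assumptions made for this example; they are not Knative APIs or metric names.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One observed request (hypothetical record, not a Knative type)."""
    latency_ms: float
    status: int


def golden_signals(requests: List[Request], window_seconds: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    """Summarize latency, traffic, errors and saturation for one window."""
    if window_seconds <= 0 or capacity <= 0:
        raise ValueError("window_seconds and capacity must be positive")

    latencies = sorted(r.latency_ms for r in requests)
    total = len(latencies)
    # Count server-side failures (HTTP 5xx) as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    # Nearest-rank 95th percentile; 0.0 when the window is empty.
    if total:
        rank = max(1, math.ceil(0.95 * total))
        p95 = latencies[rank - 1]
    else:
        p95 = 0.0

    return {
        "latency_p95_ms": p95,
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        # Assumption: saturation = in-flight requests / concurrency limit.
        "saturation": in_flight / capacity,
    }
```

An alert could then compare each value against a threshold, for example firing when `error_rate` exceeds a chosen budget for several consecutive windows.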
Notes:
The proposed minimum items to add to the docs are:
It would be nice to have a sample app that demonstrates all the above.
Note: Any upstream change to the above should be accompanied by a follow-up change in the docs as part of the PR work.
I believe metrics are not expected to change that often.
/cc @abrennan89 @evankanderson @dprotaso