
[EPIC] Improve accuracy and reliability of Grafana Dashboards #2812

Open
2 of 5 tasks
balajialg opened this issue Sep 29, 2021 · 21 comments
Labels: automation (Manual things that shouldn't be), priority: high (High priority tasks)

Comments

@balajialg (Contributor) commented Sep 29, 2021

Summary

Grafana dashboards are a comprehensive and useful tool that serves the following purposes:

  1. Highlighting important metrics such as the number of active users
  2. Highlighting hub performance via indicators such as memory distribution, pod latency, etc.

We use Grafana extensively, which highlights the critical role the tool plays. However, the Grafana metrics can be confusing to interpret in certain cases. The purpose of this enhancement request is to revamp the reporting to make it easier to interpret.

The graph below is titled "Active users over 24 hours", yet it reports metrics for the entire month. In addition, we are not sure whether this is an accurate count of the unique users on Datahub during the past 24 hours.

image (3)

The graph below reports monthly active users as 9k. However, it is not clear whether this counts unique users, and if so, whether the reported numbers are accurate.

Capture

Some of the data is not being reported in the dashboard at all:

Screen Shot 2021-09-29 at 1 33 41 PM
Screen Shot 2021-09-29 at 1 31 12 PM
Screen Shot 2021-09-29 at 1 29 52 PM

Sometimes the dashboard allows saving changes, which can be confusing:

Screen Shot 2021-09-29 at 1 37 47 PM

In certain cases, graphs appear twice:

Screen Shot 2021-09-29 at 1 44 20 PM

The categorization of the dashboards is not intuitive. For example, what is the difference between JupyterHub, JupyterHub Dashboard, and JupyterHub Original Dashboard? Categorizing these clearly would be valuable.

image (2)

User Stories

  • As a team member, I want dashboard metrics that are reliable and accurate for troubleshooting and evangelizing purposes.

Acceptance criteria

  • Given an outage, I have all the information required in Grafana to debug the issue
  • Given a workshop or a meeting with leadership, I have accurate and reliable metrics that can be shared with stakeholders

Tasks to complete

@balajialg self-assigned this Sep 29, 2021
@balajialg added the automation label Sep 29, 2021
@yuvipanda (Contributor) commented Sep 30, 2021

Thanks a lot for opening this, @balajialg!

github.com/jupyterhub/jupyterhub-grafana deploys a set of common dashboards to a particular folder - in our case, https://grafana.datahub.berkeley.edu/dashboards/f/70E5EE84-1217-4021-A89E-1E3DE0566D93/jupyterhub-default-dashboards. I think this is the only one that's reliably consistent - the dashboards are version controlled, documented and somewhat understood by the broader community. I think every other dashboard is really one of us 'playing around', and I'm not sure how much I trust most of them.

Here's a suggestion on how to proceed.

  • Move all dashboards we are 'playing around with' to their own folder, so there is less confusion
  • Add descriptions to all dashboards and panels in https://github.com/jupyterhub/grafana-dashboards
  • Develop the 'usage metrics' dashboard some more - it can be very helpful for evangelism, but isn't in a usable state now.

How does this sound, @balajialg? If this sounds good, what kinds of metrics would be useful for evangelism?

@balajialg (Contributor, Author) commented Sep 30, 2021

@yuvipanda These are awesome next steps! Makes a lot of sense. Moving the dashboards we are playing around with to separate folders and adding descriptions to all the dashboards would make them easier to interpret.

In terms of metrics required for evangelism, I am looking at articulating our story in terms of our reach, impact, and the technical brilliance of the tool.

  1. REACH: What do our unique Daily Active Users (DAU) and Monthly Active Users (MAU) numbers look like? (Dissecting this data across hubs)
  2. IMPACT: How much time are our users spending cumulatively across all the hubs (daily/monthly/yearly)? Articulating that in terms of years would be a powerful metric for evangelizing the extensive usage we are observing.
  3. IMPACT: How many assignments are completed cumulatively on a daily/monthly/yearly basis?
  4. TECHNICAL BRILLIANCE: Metrics around the time it takes for us to auto-scale to thousands of users. Any other metric articulating the value proposition from a technology standpoint would be amazing.
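For the DAU/MAU numbers in (1), here is a minimal sketch of how unique-user counts could be derived from server-start log events. The event tuples and field layout are illustrative assumptions, not DataHub's real log schema:

```python
from collections import defaultdict
from datetime import date

# Hypothetical log records: (username, date of a server start).
# Duplicate starts by the same user on the same day must not inflate DAU.
events = [
    ("alice", date(2021, 9, 28)),
    ("alice", date(2021, 9, 28)),  # repeat start on the same day
    ("bob",   date(2021, 9, 28)),
    ("alice", date(2021, 9, 29)),
]

def daily_active_users(events):
    """Count unique users per day, deduplicating repeat logins."""
    per_day = defaultdict(set)
    for user, day in events:
        per_day[day].add(user)
    return {day: len(users) for day, users in per_day.items()}

def monthly_active_users(events):
    """Count unique users per (year, month)."""
    per_month = defaultdict(set)
    for user, day in events:
        per_month[(day.year, day.month)].add(user)
    return {month: len(users) for month, users in per_month.items()}

print(daily_active_users(events))    # 2 unique users on 9/28, 1 on 9/29
print(monthly_active_users(events))  # 2 unique users in Sep 2021
```

The same grouping could be run per hub by adding a hub field to the key, which would cover the "dissecting across hubs" part.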

@yuvipanda These are desirable metrics. Let me know how many of these requests are actually feasible. Thanks for taking the time to look into this. Appreciate it.

@yuvipanda (Contributor) commented:

I've organized the dashboards into folders now:

image

The 'production' dashboards are clearly labelled now.

@balajialg (Contributor, Author) commented Oct 15, 2021

@yuvipanda Thanks for this.

When you are back from vacation on Monday (10/18), can you prioritize the following request?

For the tech strategy meeting on 10/21, Jim Colliander, Erfan, and Bill Allison (CTO) are joining us. @ericvd-ucb suggested that we use this time to share an interim update on fall usage and make the case for additional resourcing. Would you be able to fill in some of the details required for this deck? Please let me know which data points are not feasible to fetch at this juncture.

@balajialg (Contributor, Author) commented:

@yuvipanda Bringing this back to your attention to get your perspective!

@yuvipanda (Contributor) commented:

@balajialg I'm looking at slides 7, 8 and 9. I can produce data for 7 and 9, but I don't understand what 8 refers to. Can you expand a little bit?

@yuvipanda (Contributor) commented:

@balajialg you should be able to get cost information from https://console.cloud.google.com/billing/013554-935B0A-B97AA1/reports;grouping=GROUP_BY_SKU;projects=ucb-datahub-2018?project=ucb-datahub-2018

@balajialg (Contributor, Author) commented:

@yuvipanda I don't have the required permission to view the billing information. Can you elevate the privileges for me?

image

@yuvipanda (Contributor) commented:

done

@balajialg (Contributor, Author) commented Oct 19, 2021

@yuvipanda There seems to be a minor discrepancy between the Grafana data and the raw data shared with me. Sharing snapshots for the last 30 days:

Grafana data from the past 30 days
image

Analysis based on the raw data shared for the past 30 days:

image

Link to R notebook where I did the above analysis.

As I mentioned in the chart, it would be amazing to calculate the total time users spend in the hub, from an evangelizing perspective. Given the mismatch between start and stop actions, it would be great to figure out a way to log users' stop actions (whenever the culler shuts down inactive servers). We have a huge discrepancy between start and stop actions in the data below:

start: 158703
stop: 2099
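To illustrate the session-length idea: a rough sketch that pairs each start with the next stop for the same user, and attributes a fallback duration (assumed here to be a one-hour culler inactivity timeout; DataHub's actual setting may differ) to the 150k+ starts that never got a stop logged. All names, timestamps, and the timeout are illustrative:

```python
from datetime import datetime, timedelta

# Assumed culler inactivity timeout (an illustrative guess, not DataHub's config).
CULLER_TIMEOUT = timedelta(hours=1)

# Hypothetical event stream: (user, action, timestamp). Most real users have
# a "start" but no matching "stop", because culled servers are not logged.
events = [
    ("alice", "start", datetime(2021, 10, 19, 9, 0)),
    ("alice", "stop",  datetime(2021, 10, 19, 10, 30)),
    ("bob",   "start", datetime(2021, 10, 19, 9, 15)),  # never stopped
]

def session_lengths(events, fallback=CULLER_TIMEOUT):
    """Pair each start with that user's next stop; if no stop was recorded,
    assume the session ran until the culler timeout elapsed."""
    open_starts = {}
    lengths = []
    for user, action, ts in sorted(events, key=lambda e: e[2]):
        if action == "start":
            open_starts[user] = ts
        elif action == "stop" and user in open_starts:
            lengths.append(ts - open_starts.pop(user))
    # Unmatched starts get the fallback duration attributed to them.
    lengths.extend(fallback for _ in open_starts)
    return lengths

total = sum(session_lengths(events), timedelta())
print(total)  # 1:30 for alice + 1:00 fallback for bob = 2:30:00
```

With 158703 starts against 2099 stops, almost every session would go through the fallback branch, so the estimate is only as good as the assumed timeout; logging actual culler shutdowns would make it exact.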

@yuvipanda (Contributor) commented:

We should completely discount the Grafana usage data - it was an experiment that hasn't been validated at all. The log data is definitely more accurate.

I agree on session lengths! I'll try to think of a way to properly measure that.

@balajialg (Contributor, Author) commented:

@yuvipanda That will be awesome!

@balajialg (Contributor, Author) commented Feb 18, 2022

@yuvipanda Highlighting some of our Grafana woes. Copy-pasting some of the Grafana results for CPU allocation today. We definitely need a way to solve this issue of Grafana not fetching data, to unblock @felder whenever he needs to analyze things.
image
Screen Shot 2022-02-18 at 12 53 19 PM

@balajialg (Contributor, Author) commented Feb 24, 2022

  • @felder to work with @yuvipanda to fix some of the issues with Grafana. It might involve bumping up Prometheus's limits!

@balajialg (Contributor, Author) commented Nov 8, 2022

@shaneknapp Just an FYI - Grafana improvements are something I wanted to discuss during our sprint planning meeting but missed adding to our monthly backlog. Some of the graphs in the dashboard often break when changing time intervals, resulting in empty responses. Here is an example from my exploration today:

image

Either the graph needs to be fixed, or it should provide a better error message from a user standpoint.

@felder (Contributor) commented Nov 8, 2022

@balajialg I believe the graphs are breaking because the responses are taking too long. If you click the red exclamation mark, you should be able to get an indication of why it's unhappy.

For example:

Screen Shot 2022-11-08 at 3 46 49 PM

Screen Shot 2022-11-08 at 3 46 58 PM

Basically, the queries are timing out, meaning Grafana is not getting the information as quickly as it expects. So it's not really a matter of fixing the graph, unless the query that generates it is grossly inefficient.

In the past, the solution was to allocate more RAM to Prometheus to speed it up. I'm not sure how sustainable that solution is: the more data is collected, the slower it gets.

@balajialg (Contributor, Author) commented Nov 8, 2022

@felder Got it. So, optimizing the queries is the way forward? How does your PR here relate to this objective?

@shaneknapp (Contributor) commented Nov 9, 2022

@balajialg optimizing might work... maybe?

you can get the same graphs/reports by running prometheus locally on your laptop and executing the queries there. perhaps it's time for me to remember to have @felder show me (and now you) how to do this. :)

@felder (Contributor) commented Nov 9, 2022

@balajialg That PR was about fixing the queries because they were returning incorrect information. It doesn't have any relation to improving the efficiency of those queries.

I also don't know whether optimizing the queries is the way forward, as I don't know whether they can be optimized further. An investigation (along with some education on PromQL query optimization) would be required. Here's one example result from a Google search for PromQL query optimization: https://thenewstack.io/query-optimization-in-the-prometheus-world/

Assuming the queries are suboptimal, then yes, that might be one way to address this. However, if they cannot be optimized, we'll need to do something else. I just don't know whether throwing RAM at it is the way to go... it could be. It might also just be a game of whack-a-mole until adding RAM is no longer feasible.
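One standard alternative to adding RAM is pre-aggregation, e.g. Prometheus recording rules, which evaluate an expensive expression on a schedule so dashboards query the precomputed series instead of raw samples. A toy sketch of why this helps (the sample counts and window size are illustrative, not our actual series):

```python
# Sketch of the pre-aggregation idea behind Prometheus recording rules:
# the expensive aggregation runs once in the background, so each dashboard
# refresh scans far fewer points.

def raw_query(samples):
    """Dashboard query over raw samples: O(len(samples)) on every refresh."""
    return sum(samples) / len(samples)

def precompute(samples, window=300):
    """Recording-rule analogue: collapse raw samples into per-window means."""
    return [sum(samples[i:i + window]) / len(samples[i:i + window])
            for i in range(0, len(samples), window)]

samples = [1.0] * 3000        # e.g. 3000 raw CPU samples in the query range
coarse = precompute(samples)  # 10 pre-aggregated points

# The mean of the window means equals the raw mean for equal-sized windows,
# but the dashboard-time query now touches 300x fewer points.
assert raw_query(samples) == raw_query(coarse) == 1.0
print(len(samples), "->", len(coarse))  # 3000 -> 10
```

Whether this applies here depends on the investigation above: recording rules trade a little storage and scrape-time work for much cheaper dashboard queries, which directly targets the timeout symptom.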

@balajialg (Contributor, Author) commented:

@shaneknapp This seems like an interesting idea. Curious to see how my laptop would handle this volume of data (assuming it must be a large dataset). Looking forward to learning more about this.

@felder Got it. Based on what you said, it seems like this is more of an experimental project, like Google Filestore, in terms of a) figuring out whether there are alternatives to increasing RAM and optimizing queries, and b) finding a way to optimize the queries.

@balajialg balajialg changed the title [EPIC] Revamping Grafana Dashboards [EPIC] Improve accuracy of Grafana Dashboards Feb 1, 2023
@balajialg balajialg changed the title [EPIC] Improve accuracy of Grafana Dashboards [EPIC] Improve accuracy and reliability of Grafana Dashboards Feb 1, 2023
@balajialg balajialg added the priority: high High priority tasks label Feb 2, 2023
@balajialg (Contributor, Author) commented Feb 3, 2023

  • Bump the version of JupyterHub (properly define the latest JupyterHub changes and decide)
  • Scope this for the maintenance window during spring break?
  • Email the datahub-announce list to ask about usage during the maintenance window and to announce it!


4 participants