
[EPIC] Improve accuracy and reliability of Grafana Dashboards #2812

Open
2 of 5 tasks
balajialg opened this issue Sep 29, 2021 · 21 comments
Labels: automation (Manual things that shouldn't be), priority: high (High priority tasks)

Comments

@balajialg (Contributor) commented Sep 29, 2021

Summary

Grafana dashboards are a comprehensive and useful tool that serves the following purposes:

  1. Highlighting important metrics such as the number of active users
  2. Highlighting hub performance via indicators such as memory distribution, pod latency, etc.

We use Grafana extensively, which highlights the critical role the tool plays. However, the Grafana metrics can be confusing to interpret in certain cases. The purpose of this enhancement request is to revamp the reporting to make it easier to interpret.

The graph below is titled "Active users over 24 hours", yet it reports metrics for the entire month. In addition, we are not sure whether this is an accurate count of the unique users on Datahub during the past 24 hours.

image (3)

The graph below reports monthly active users as 9k. However, it is not clear whether this counts unique users, and if so, whether the reported numbers are accurate.

Capture

Some of the data is not being reported in the dashboard at all:

Screen Shot 2021-09-29 at 1 33 41 PM
Screen Shot 2021-09-29 at 1 31 12 PM
Screen Shot 2021-09-29 at 1 29 52 PM

Sometimes the dashboard allows saving changes, which can be confusing:

Screen Shot 2021-09-29 at 1 37 47 PM

In certain cases, graphs appear twice:

Screen Shot 2021-09-29 at 1 44 20 PM

The categorization of the dashboards is not intuitive. For example, what is the difference between JupyterHub, JupyterHub Dashboard, and JupyterHub Original Dashboard? Categorizing these clearly would be valuable.

image (2)

User Stories

  • As a team member, I want dashboard metrics that are reliable and accurate for troubleshooting and evangelizing purposes.

Acceptance criteria

  • Given an outage, I have all the information required in Grafana to debug the issue
  • Given a workshop or a meeting with leadership, I have accurate and reliable metrics that can be shared with stakeholders

Tasks to complete

@balajialg self-assigned this Sep 29, 2021
@balajialg added the automation label Sep 29, 2021
@yuvipanda (Contributor) commented Sep 30, 2021

Thanks a lot for opening this, @balajialg!

github.com/jupyterhub/jupyterhub-grafana deploys a set of common dashboards to a particular folder - in our case, https://grafana.datahub.berkeley.edu/dashboards/f/70E5EE84-1217-4021-A89E-1E3DE0566D93/jupyterhub-default-dashboards. I think this is the only one that's reliably consistent - the dashboards are version controlled, documented and somewhat understood by the broader community. I think every other dashboard is really one of us 'playing around', and I'm not sure how much I trust most of them.

Here's a suggestion on how to proceed.

  • Move all dashboards we are 'playing around with' to their own folder, so there is less confusion
  • Add descriptions to all dashboards and panels in https://github.com/jupyterhub/grafana-dashboards
  • Develop the 'usage metrics' dashboard some more - it can be very helpful for evangelism, but isn't in a usable state now.

How does this sound, @balajialg? If this sounds good, what kinds of metrics would be useful for evangelism?

@balajialg (Contributor, Author) commented Sep 30, 2021

@yuvipanda These are awesome next steps! Makes a lot of sense. Moving the dashboards we are playing around with to separate folders and adding descriptions to all the dashboards would make them easier to interpret.

In terms of metrics required for evangelism, I am looking at articulating our story in terms of our reach, impact, and the technical brilliance of the tool.

  1. REACH: What do our unique Daily Active Users (DAU) and Monthly Active Users (MAU) numbers look like? (Dissecting this data across hubs)
  2. IMPACT: How much time are our users spending cumulatively across all the hubs (daily/monthly/yearly)? Articulating that in terms of years would be a powerful metric for evangelizing the extensive usage we are observing.
  3. IMPACT: How many assignments are completed cumulatively on a daily/monthly/yearly basis?
  4. TECHNICAL BRILLIANCE: Metrics around the time it takes for us to auto-scale to thousands of users. Any other metric articulating the value proposition from a technology standpoint would be amazing.
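For the DAU/MAU numbers in (1), here is a minimal sketch of how unique-user counts could be derived from server-start log events. The event tuples and field layout are illustrative assumptions, not DataHub's real log schema:

```python
from collections import defaultdict
from datetime import date

# Hypothetical log records: (username, date of a server start).
# Duplicate starts by the same user on the same day must not inflate DAU.
events = [
    ("alice", date(2021, 9, 28)),
    ("alice", date(2021, 9, 28)),  # repeat start on the same day
    ("bob",   date(2021, 9, 28)),
    ("alice", date(2021, 9, 29)),
]

def daily_active_users(events):
    """Count unique users per day, deduplicating repeat logins."""
    per_day = defaultdict(set)
    for user, day in events:
        per_day[day].add(user)
    return {day: len(users) for day, users in per_day.items()}

def monthly_active_users(events):
    """Count unique users per (year, month)."""
    per_month = defaultdict(set)
    for user, day in events:
        per_month[(day.year, day.month)].add(user)
    return {month: len(users) for month, users in per_month.items()}

print(daily_active_users(events))    # 2 unique users on 9/28, 1 on 9/29
print(monthly_active_users(events))  # 2 unique users in Sep 2021
```

The same grouping could be run per hub by adding a hub field to the key, which would cover the "dissecting across hubs" part.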

@yuvipanda These are desirable metrics. Let me know how many of these requests are actually feasible. Thanks for taking the time to look into this. Appreciate it.

@yuvipanda (Contributor) commented:

I've organized the dashboards into folders now:

image

The 'production' dashboards are clearly labelled now.

@balajialg (Contributor, Author) commented Oct 15, 2021

@yuvipanda Thanks for this.

When you are back from vacation on Monday (10/18), can you prioritize the following request?

For the tech strategy meeting on 10/21, Jim Colliander, Erfan, and Bill Allison (CTO) are joining us. @ericvd-ucb suggested that we use this time to share an interim update on fall usage and make the case for additional resourcing. Would you be able to fill in some of the details required for this deck? Please let me know which data points are not feasible to fetch at this juncture.

@balajialg (Contributor, Author) commented:

@yuvipanda Bringing this back to your attention to get your perspective!

@yuvipanda (Contributor) commented:

@balajialg I'm looking at slides 7, 8 and 9. I can produce data for 7 and 9, but I don't understand what 8 refers to. Can you expand a little bit?

@yuvipanda (Contributor) commented:

@balajialg you should be able to get cost information from https://console.cloud.google.com/billing/013554-935B0A-B97AA1/reports;grouping=GROUP_BY_SKU;projects=ucb-datahub-2018?project=ucb-datahub-2018

@balajialg (Contributor, Author) commented:

@yuvipanda I don't have the required permission to view the billing information. Can you elevate the privileges for me?

image

@yuvipanda (Contributor) commented:

done

@balajialg (Contributor, Author) commented Oct 19, 2021

@yuvipanda There seems to be a minor discrepancy between the Grafana data and the raw data shared with me. Sharing snapshots for the last 30 days:

Grafana data from the past 30 days
image

Analysis based on the raw data shared for the past 30 days:

image

Link to R notebook where I did the above analysis.

As I mentioned in the chart, it would be amazing to calculate the total time users spend in the hub, from an evangelizing perspective. Given the mismatch between start and stop actions, it would be great to figure out a way to log users' stop actions (whenever the culler shuts down inactive servers). We have a huge discrepancy between start and stop actions in the data below:

start: 158703
stop: 2099
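To illustrate the session-length idea: a rough sketch that pairs each start with the next stop for the same user, and attributes a fallback duration (assumed here to be a one-hour culler inactivity timeout; DataHub's actual setting may differ) to the 150k+ starts that never got a stop logged. All names, timestamps, and the timeout are illustrative:

```python
from datetime import datetime, timedelta

# Assumed culler inactivity timeout (an illustrative guess, not DataHub's config).
CULLER_TIMEOUT = timedelta(hours=1)

# Hypothetical event stream: (user, action, timestamp). Most real users have
# a "start" but no matching "stop", because culled servers are not logged.
events = [
    ("alice", "start", datetime(2021, 10, 19, 9, 0)),
    ("alice", "stop",  datetime(2021, 10, 19, 10, 30)),
    ("bob",   "start", datetime(2021, 10, 19, 9, 15)),  # never stopped
]

def session_lengths(events, fallback=CULLER_TIMEOUT):
    """Pair each start with that user's next stop; if no stop was recorded,
    assume the session ran until the culler timeout elapsed."""
    open_starts = {}
    lengths = []
    for user, action, ts in sorted(events, key=lambda e: e[2]):
        if action == "start":
            open_starts[user] = ts
        elif action == "stop" and user in open_starts:
            lengths.append(ts - open_starts.pop(user))
    # Unmatched starts get the fallback duration attributed to them.
    lengths.extend(fallback for _ in open_starts)
    return lengths

total = sum(session_lengths(events), timedelta())
print(total)  # 1:30 for alice + 1:00 fallback for bob = 2:30:00
```

With 158703 starts against 2099 stops, almost every session would go through the fallback branch, so the estimate is only as good as the assumed timeout; logging actual culler shutdowns would make it exact.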

@yuvipanda (Contributor) commented:

We should completely discount the Grafana usage data - it was an experiment that hasn't been validated at all. The log data is definitely more accurate.

I agree on session lengths! I'll try to think of a way to properly measure that.

@balajialg (Contributor, Author) commented:

@yuvipanda That will be awesome!

@balajialg (Contributor, Author) commented Feb 18, 2022

@yuvipanda Highlighting some of our Grafana woes. Copy-pasting some of the Grafana results for CPU allocation today. We definitely need a way to solve this issue of Grafana not fetching data, to unblock @felder whenever he needs to analyze things.
image
Screen Shot 2022-02-18 at 12 53 19 PM

@balajialg (Contributor, Author) commented Feb 24, 2022

  • @felder to work with @yuvipanda to fix some of the issues with Grafana. It might involve bumping up Prometheus's limits!

@balajialg (Contributor, Author) commented Nov 8, 2022

@shaneknapp Just an FYI - Grafana improvements are something I wanted to discuss during our sprint planning meeting but missed adding to our monthly backlog. Some of the graphs in the dashboard often break when changing time intervals, resulting in empty responses. Here is an example from my exploration today:

image

Either the graph needs to be fixed, or it should provide a better error message from a user standpoint.

@felder (Contributor) commented Nov 8, 2022

@balajialg I believe the graphs are breaking because the responses are taking too long. If you click the red exclamation mark, you should be able to get an indication of why it's unhappy.

For example:

Screen Shot 2022-11-08 at 3 46 49 PM

Screen Shot 2022-11-08 at 3 46 58 PM

Basically, the queries are timing out, meaning Grafana is not getting the information as quickly as it expects. So it's not really a matter of fixing the graph, unless the query that generates it is grossly inefficient.

In the past, the solution was to allocate more RAM to Prometheus to speed it up. I'm not sure how sustainable that solution is: the more data is collected, the slower it gets.

@balajialg (Contributor, Author) commented Nov 8, 2022

@felder Got it. So, optimizing the queries is the way forward? How does your PR here relate to this objective?

@shaneknapp (Contributor) commented Nov 9, 2022

@balajialg optimizing might work... maybe?

you can get the same graphs/reports by running prometheus locally on your laptop and executing the queries there. perhaps it's time for me to remember to have @felder show me (and now you) how to do this. :)

@felder (Contributor) commented Nov 9, 2022

@balajialg That PR was about fixing the queries because they were returning incorrect information. It doesn't have any relation to improving the efficiency of those queries.

I also don't know whether optimizing the queries is the way forward, as I don't know whether they can be optimized further. An investigation (along with some education on PromQL query optimization) would be required. Here's one example result from a Google search for PromQL query optimization: https://thenewstack.io/query-optimization-in-the-prometheus-world/

Assuming the queries are suboptimal, then yes, that might be one way to address this. However, if they cannot be optimized, we'll need to do something else. I just don't know whether throwing RAM at it is the way to go... it could be. It might also just be a game of whack-a-mole until adding RAM is no longer feasible.
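One standard alternative to adding RAM is pre-aggregation, e.g. Prometheus recording rules, which evaluate an expensive expression on a schedule so dashboards query the precomputed series instead of raw samples. A toy sketch of why this helps (the sample counts and window size are illustrative, not our actual series):

```python
# Sketch of the pre-aggregation idea behind Prometheus recording rules:
# the expensive aggregation runs once in the background, so each dashboard
# refresh scans far fewer points.

def raw_query(samples):
    """Dashboard query over raw samples: O(len(samples)) on every refresh."""
    return sum(samples) / len(samples)

def precompute(samples, window=300):
    """Recording-rule analogue: collapse raw samples into per-window means."""
    return [sum(samples[i:i + window]) / len(samples[i:i + window])
            for i in range(0, len(samples), window)]

samples = [1.0] * 3000        # e.g. 3000 raw CPU samples in the query range
coarse = precompute(samples)  # 10 pre-aggregated points

# The mean of the window means equals the raw mean for equal-sized windows,
# but the dashboard-time query now touches 300x fewer points.
assert raw_query(samples) == raw_query(coarse) == 1.0
print(len(samples), "->", len(coarse))  # 3000 -> 10
```

Whether this applies here depends on the investigation above: recording rules trade a little storage and scrape-time work for much cheaper dashboard queries, which directly targets the timeout symptom.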

@balajialg (Contributor, Author) commented:

@shaneknapp This seems like an interesting idea. Curious to see how my laptop would handle this volume of data (assuming it must be a large dataset). Looking forward to learning more about this.

@felder Got it. Based on what you said, it seems like this is more of an experimental project, like Google Filestore, in terms of a) figuring out whether there are alternatives to increasing RAM and optimizing queries, and b) finding a way to optimize the queries.

@balajialg balajialg changed the title [EPIC] Revamping Grafana Dashboards [EPIC] Improve accuracy of Grafana Dashboards Feb 1, 2023
@balajialg balajialg changed the title [EPIC] Improve accuracy of Grafana Dashboards [EPIC] Improve accuracy and reliability of Grafana Dashboards Feb 1, 2023
@balajialg balajialg added the priority: high High priority tasks label Feb 2, 2023
@balajialg (Contributor, Author) commented Feb 3, 2023

  • Bump the version of JupyterHub (properly define the latest JupyterHub changes and decide)
  • Scope this for the maintenance window during spring break?
  • Email the datahub-announce list to ask about usage during the maintenance window and to announce it!


4 participants