Update ScyllaDB Monitoring to 4.4.5 #1456

Merged
1 commit merged into scylladb:master from the bump-monitoring branch on Oct 13, 2023

Conversation

tnozicka
Contributor

@tnozicka tnozicka commented Oct 6, 2023

Description of your changes:
This updates ScyllaDB Monitoring to 4.4.5, structures the dashboards and gzips their content so it will fit into a ConfigMap.

Which issue is resolved by this Pull Request:
Resolves #1449

@tnozicka tnozicka added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 6, 2023
@scylla-operator-bot scylla-operator-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 6, 2023
@scylla-operator-bot scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 6, 2023
@tnozicka
Contributor Author

tnozicka commented Oct 9, 2023

/hold
to manually verify upgrade from previous version

@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2023
@tnozicka
Contributor Author

tnozicka commented Oct 9, 2023

ouch :(

I1009 18:58:58.742218   76485 record/event.go:307] "Event occurred" object="test/example-grafana-scylladb-dashboards" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Warning" reason="UpdateConfigMapFailed" message="Failed to update ConfigMap test/example-grafana-scylladb-dashboards: ConfigMap \"example-grafana-scylladb-dashboards\" is invalid: []: Too long: must have at most 1048576 bytes"
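
The 1048576 bytes quoted in the event above is 1 MiB. A minimal sketch of checking whether gzipped dashboards would stay under that figure is shown here. It assumes the limit is compared against the combined size of keys and values; the helper names and the `entries` mapping are hypothetical, and how the operator actually encodes the compressed content is not shown in this thread.

```python
import gzip
import json
from typing import Dict

# Limit quoted in the UpdateConfigMapFailed event (1 MiB).
CONFIGMAP_LIMIT_BYTES = 1048576


def gzip_dashboard(dashboard: dict) -> bytes:
    """Serialize a dashboard compactly and gzip it (hypothetical helper)."""
    raw = json.dumps(dashboard, separators=(",", ":")).encode("utf-8")
    # mtime=0 fixes the gzip header timestamp so repeated runs give the same bytes.
    return gzip.compress(raw, mtime=0)


def payload_size(entries: Dict[str, bytes]) -> int:
    """Sum key and value sizes; assumed to approximate what the limit counts."""
    return sum(len(name.encode("utf-8")) + len(blob) for name, blob in entries.items())


def fits_in_configmap(entries: Dict[str, bytes]) -> bool:
    """Report whether the combined payload stays within the quoted limit."""
    return payload_size(entries) <= CONFIGMAP_LIMIT_BYTES
```

Calling `fits_in_configmap` on a mapping of file names to gzipped blobs before an update would surface the size problem earlier than the API server's validation error does.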

@tnozicka tnozicka force-pushed the bump-monitoring branch 3 times, most recently from 0c4942f to cee188e Compare October 10, 2023 14:50
    annotations:
      description: 'OOM Kill on {{ $labels.instance }}'
      summary: A process was terminated on Instance {{ $labels.instance }}
  - alert: tooManyFiles
Contributor Author


fwiw, this fires on my toy cluster, but this is out of scope for this PR

Contributor


The same in my case.

@tnozicka tnozicka force-pushed the bump-monitoring branch 2 times, most recently from c1c2a3c to bc7edd9 Compare October 10, 2023 15:02
@tnozicka tnozicka changed the title [WIP] Update ScyllaDB Monitoring to 4.5 Update ScyllaDB Monitoring to 4.4.5 Oct 10, 2023
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2023
@tnozicka
Contributor Author

/cc @amnonh

@scylla-operator-bot scylla-operator-bot bot requested a review from amnonh October 10, 2023 15:03
@@ -0,0 +1,4680 @@
{
Contributor Author


@amnonh this one is not updated as I haven't found a source for it in 4.4.5 - where can I find it?
(also this will break with grafana 10.0 for the 4.5 bump)

@tnozicka
Contributor Author

@zimnx @rzetelskik @amnonh PTAL, I'd like to take this to v1.11 branch tomorrow so we can have it in v1.11.0-rc.1 to ensure stable monitoring with latest ScyllaDB versions.

@rzetelskik
Member

Something is off with the commit names - one says Update ScyllaDB Monitoring to 4.5, the other Structure monitoring manifests, update them to 4.4.5 and gzip dashboards, while the PR title is Update ScyllaDB Monitoring to 4.4.5. Should the commits have been squashed or the first commit renamed?

@tnozicka
Contributor Author

tnozicka commented Oct 11, 2023

Thanks, it was a leftover from when I initially tried 4.5 (the docs already refer to it, but it's not released yet). It also breaks the serverless dashboard (it bumps Grafana to 10.0), so I went with 4.4.5. Updated.

@zimnx
Collaborator

zimnx commented Oct 11, 2023

The dashboards require the ScyllaCluster name to be set to scylla; otherwise the panels show no data because of a wrong metric tag.
Was it always like this?

@tnozicka tnozicka requested review from zimnx and rzetelskik October 12, 2023 12:48
@tnozicka tnozicka force-pushed the bump-monitoring branch 2 times, most recently from 61e58a6 to 7704251 Compare October 12, 2023 13:57
Collaborator

@zimnx zimnx left a comment


/assign rzetelskik
lgtm, thanks

@rzetelskik
Member

/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Oct 12, 2023
@tnozicka
Contributor Author

tnozicka commented Oct 12, 2023

I've tried the upgrade, but I was surprised that the SaaS dashboard broke, given that we only ended up upgrading a minor version of Grafana.

An unexpected error happened
TypeError: F is undefined

V@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:1:1430
ve@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:1:3841
div
div
div
b@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:2718:7432
Ot@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:26:4481
div
d@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:1:199
He@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:26:14543
An@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:174:126969

@amnonh can you pls help us figure out why this breaks?

@tnozicka
Contributor Author

logger=frontend t=2023-10-12T19:18:07.672139034Z level=error msg="TypeError: F is undefined" url="https://localhost:3000/?orgId=1&refresh=30s" user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" event_id=0702769497f4440285deadf93adb7ad6 original_timestamp=2023-10-12T19:18:07.663Z stacktrace="TypeError: F is undefined\n  at dispatch (core|webpack://grafana/./public/app/features/panel/state/actions.ts:36:4)\n  at ? (core|webpack://grafana/../yarn/cache/redux-npm-4.2.0-4688cc8d65-75f3955c89.zip/node_modules/redux/es/redux.js:691:27)\n  at next (core|webpack://grafana/./.yarn/__virtual__/redux-thunk-virtual-b59ff2637b/2/yarn/cache/redux-thunk-npm-2.4.2-3acdaaf7b0-c7f757f6c3.zip/node_modules/redux-thunk/es/index.js:20:15)\n  at next (core|webpack://grafana/./.yarn/__virtual__/@reduxjs-toolkit-virtual-75570b104b/2/yarn/cache/@reduxjs-toolkit-npm-1.9.3-cbad5da1de-d965fc6197.zip/node_modules/@reduxjs/toolkit/dist/query/rtk-query.esm.js:2023:26)\n  at next (core|webpack://grafana/./.yarn/__virtual__/@reduxjs-toolkit-virtual-75570b104b/2/yarn/cache/@reduxjs-toolkit-npm-1.9.3-cbad5da1de-d965fc6197.zip/node_modules/@reduxjs/toolkit/dist/query/rtk-query.esm.js:2023:26)\n  at listener (core|webpack://grafana/../yarn/cache/redux-npm-4.2.0-4688cc8d65-75f3955c89.zip/node_modules/redux/es/redux.js:297:6)\n  at ? (core|webpack://grafana/./.yarn/__virtual__/react-redux-virtual-bdf97cb622/2/yarn/cache/react-redux-npm-7.2.6-134f5ed64d-0bf142ce0d.zip/node_modules/react-redux/es/utils/Subscription.js:90:19)\n  at ? (core|webpack://grafana/./.yarn/__virtual__/react-redux-virtual-bdf97cb622/2/yarn/cache/react-redux-npm-7.2.6-134f5ed64d-0bf142ce0d.zip/node_modules/react-redux/es/utils/Subscription.js:85:14)\n  at ? 
(core|webpack://grafana/./.yarn/__virtual__/react-redux-virtual-bdf97cb622/2/yarn/cache/react-redux-npm-7.2.6-134f5ed64d-0bf142ce0d.zip/node_modules/react-redux/es/utils/Subscription.js:15:12)\n  at ig (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:244:189)\n  at jg (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:122:427)\n  at ? (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:123:63)\n  at Nf (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:122:324)\n  at b (core|webpack://grafana/../yarn/cache/scheduler-npm-0.20.2-90beaecfba-c4b35cf967.zip/node_modules/scheduler/cjs/scheduler.production.min.js:18:342)\n  at c (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:123:114)\n  at Tj (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:243:162)\n  at ak (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:250:137)\n  at Y (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:250:280)\n  at 
a (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:250:349)\n  at Ch (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:267:459)\n  at c (core|webpack://grafana/./.yarn/__virtual__/react-dom-virtual-48394057d2/2/yarn/cache/react-dom-npm-17.0.2-f551215af1-1c1eaa3bca.zip/node_modules/react-dom/cjs/react-dom.production.min.js:157:136)\n  at onShowPanelLinks (core|webpack://grafana/./public/app/features/dashboard/dashgrid/PanelLinks.tsx:27:22)\n  at ? (core|webpack://grafana/./public/app/features/dashboard/utils/getPanelChromeProps.tsx:55:52)\n  at ? (core|webpack://grafana/./public/app/features/panel/panellinks/linkSuppliers.ts:151:19)\n  at ? (core|webpack://grafana/./public/app/features/panel/panellinks/linkSuppliers.ts:152:28)\n  at ? 
(core|webpack://grafana/./public/app/features/panel/panellinks/link_srv.ts:301:6)" context_react_componentStack="\nV@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:1:1430\nve@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:1:3841\ndiv\ndiv\ndiv\nb@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:2718:7432\nOt@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:26:4481\ndiv\nd@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:1:199\nHe@https://localhost:3000/public/build/9371.5e68c13ff532901274cb.js:26:14543\nAn@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:174:126969\ndiv\n71625/J<@https://localhost:3000/public/build/3776.9139c4cd6f4ddf42c096.js:69:10599\nC@https://localhost:3000/public/build/7307.46717cbd898458fddd30.js:5:6364\nm@https://localhost:3000/public/build/7307.46717cbd898458fddd30.js:3:2611\nf@https://localhost:3000/public/build/3776.9139c4cd6f4ddf42c096.js:69:13780\ndiv\nt@https://localhost:3000/public/build/3776.9139c4cd6f4ddf42c096.js:71:2668\ndiv\ndiv\nD@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:181:108172\ndiv\nue@https://localhost:3000/public/build/3776.9139c4cd6f4ddf42c096.js:69:6702\ndiv\ndiv\ndiv\nQ@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:139:13278\no@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:475:3166\ndiv\nF@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:4935:2828\n_e@https://localhost:3000/public/build/7669.9b1c93089cb391e3452c.js:387:425\nDashboardPage\nAn@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:174:126969\nSuspense\nf@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:1278:180\nGm@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:9876:330\nYe@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:178:24526\nYe@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:178:26722\ndiv\nmain\n
Lm@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:9873:1833\nYe@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:178:20821\ndiv\nl@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:2567:3977\nv@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:49:51341\ni@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:5255:18040\nf@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:1278:180\nc@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:1278:754\nS@https://localhost:3000/public/build/4067.67791818270ded8beee8.js:174:124661\nZm@https://localhost:3000/public/build/9274.093be82c7b62ab45e801.js:9889:122" user_email=admin@localhost user_id=1
logger=frontend t=2023-10-12T19:18:07.718876591Z level=error msg="UnhandledRejection: Non-Error promise rejection captured with keys: cancelled, config, data, status" url="https://localhost:3000/?orgId=1&refresh=30s" user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" event_id=66e9dec20bc54fbfad634e46f9ca1f84 original_timestamp=2023-10-12T19:18:07.708Z stacktrace="UnhandledRejection: Non-Error promise rejection captured with keys: cancelled, config, data, status" user_email=admin@localhost user_id=1

@tnozicka
Contributor Author

tnozicka commented Oct 13, 2023

the errors are meaningless to me, but I have manually bisected all panels and identified these 4 that break it

$ jq '[.panels[21,22,23,24]] | map({"id": .id, "title": .title})' assets/monitoring/grafana/v1alpha1/dashboards/saas/overview.json
[
  {
    "id": 22,
    "title": "Connections"
  },
  {
    "id": 23,
    "title": "CQL OPs"
  },
  {
    "id": 24,
    "title": "Node Latency"
  },
  {
    "id": 25,
    "title": "Shard Latency"
  }
]

not yet sure why
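
The manual bisection described above can be sketched as a recursive halving search. This is an illustration only: `find_breaking` and the `breaks` callback are hypothetical names, the callback stands in for loading a dashboard that contains just the given panels and checking whether Grafana errors out, and the search assumes each offending panel fails on its own rather than only in combination with others.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def find_breaking(items: Sequence[T], breaks: Callable[[Sequence[T]], bool]) -> List[T]:
    """Return every item that triggers the failure, found by recursive halving.

    Assumes each offending item breaks independently of the others.
    """
    if not items or not breaks(items):
        return []  # nothing in this subset triggers the failure
    if len(items) == 1:
        return [items[0]]  # a failing subset of size one is a culprit
    mid = len(items) // 2
    # Search both halves, since more than one item may be at fault.
    return find_breaking(items[:mid], breaks) + find_breaking(items[mid:], breaks)
```

With several adjacent culprits, as in the four panels listed above, the search skips whole clean halves but still has to test each culprit individually.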

@tnozicka
Contributor Author

tnozicka commented Oct 13, 2023

OK, this seems to be what fixes it. I'm not even sure what these links were meant to do, as Grafana expects a corresponding URL.

-            "links": [
-                {
-                    "title": "The number of connections per shard should be balanced"
-                }
-            ],
+            "links": [],
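
A minimal sketch of applying the same cleanup programmatically, dropping every panel link that has no URL. The function name is hypothetical, and the handling of nested `panels` (as found under row panels) is an assumption about the dashboard layout rather than something shown in this PR.

```python
from typing import Any, Dict


def drop_links_without_url(dashboard: Dict[str, Any]) -> int:
    """Remove panel links lacking a non-empty "url"; return how many were removed."""
    removed = 0
    stack = list(dashboard.get("panels") or [])
    while stack:
        panel = stack.pop()
        # Assumption: row panels may nest further panels under "panels".
        stack.extend(panel.get("panels") or [])
        links = panel.get("links")
        if not isinstance(links, list):
            continue  # panel has no links list, leave it untouched
        kept = [link for link in links if isinstance(link, dict) and link.get("url")]
        removed += len(links) - len(kept)
        panel["links"] = kept
    return removed
```

The function edits the dashboard in place, so the caller would load the JSON file, run it, and write the result back only if the returned count is non-zero.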

@scylla-operator-bot scylla-operator-bot bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2023
@tnozicka
Contributor Author

Tested that the upgraded Grafana works with the SaaS dashboard once the dummy links are removed.
/hold cancel

@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 13, 2023
@amnonh

amnonh commented Oct 13, 2023

@tnozicka Do you need my help? There is an issue with the dashboard for serverless: it has no repository yet. I wanted it to be part of Scylla Monitoring, but Tzach was against it.

@tnozicka
Contributor Author

@amnonh I think this has fixed it
https://github.com/scylladb/scylla-operator/compare/7704251aec18b67f7a819ad8cb6983e2f9840e02..5ee2c8ee133867e7e0d0d35b51227c473c795484
although it seems outdated compared to the other dashboards. But we can deal with that later on.

Maybe you could have it within monitoring, test it and just not publish it when you make a release?

@tnozicka
Contributor Author

But I have spent the day doing manual bisects between the panels and brute-forcing through every available option, which quite sucked and delays the release. Eventually I got lucky, but I believe the SaaS dashboard needs to be maintained with the other ones so that you propagate the changes and test it when you change the Grafana version requirement.

Collaborator

@zimnx zimnx left a comment


/lgtm
Good job tracing what breaks it!

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2023
@scylla-operator-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tnozicka
Contributor Author

flaked on #1476

@scylla-operator-bot scylla-operator-bot bot merged commit 4bc44d9 into scylladb:master Oct 13, 2023
@tnozicka tnozicka deleted the bump-monitoring branch October 13, 2023 15:04
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Scylla Monitoring version is outdated
5 participants