Commit 1e18fc2

Extended Grafana section
1 parent 8077a99 commit 1e18fc2

1 file changed (+44, -0 lines)


docs/maintenance-operations/monitoring/accessing-grafana.md

@@ -37,3 +37,47 @@ sbcli cluster get-secret <CLUSTER_ID>
**Credentials**<br/>
Username: **admin**<br/>
Password: **<CLUSTER_SECRET>**

## Grafana Dashboards

All dashboards are stored in per-cluster folders. Each cluster folder contains the following dashboards:

- Cluster
- Storage node
- Device
- Logical Volume
- Storage Pool

Dashboard widgets are designed to be self-explanatory.
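
If you prefer to locate a cluster's dashboards programmatically, Grafana's HTTP API can list them using the `admin` credentials described above. The sketch below is a minimal, unofficial example built on the standard Grafana endpoints `GET /api/folders` and `GET /api/search`. It assumes that Grafana is reachable at a base URL you supply (`http://grafana.local:3000` is a placeholder) and that the per-cluster folder can be identified by its title, which is an assumption about how the folders are named. The helper names `grafana_request` and `list_cluster_dashboards` are hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request


def grafana_request(base_url: str, path: str, params: dict,
                    password: str, user: str = "admin") -> urllib.request.Request:
    """Build an authenticated GET request for the Grafana HTTP API (basic auth)."""
    query = urllib.parse.urlencode(params)
    url = base_url.rstrip("/") + path + (f"?{query}" if query else "")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def list_cluster_dashboards(base_url: str, folder_title: str, password: str) -> list:
    """Return the titles of the dashboards stored in the folder titled `folder_title`.

    Assumption (hypothetical): the per-cluster folder is identified by its title.
    """
    # Step 1: find the folder UID by title.
    with urllib.request.urlopen(grafana_request(base_url, "/api/folders", {}, password)) as resp:
        folders = json.load(resp)
    uid = next((f["uid"] for f in folders if f["title"] == folder_title), None)
    if uid is None:
        raise LookupError(f"no Grafana folder titled {folder_title!r}")
    # Step 2: search for dashboards (type dash-db) inside that folder.
    req = grafana_request(base_url, "/api/search",
                          {"type": "dash-db", "folderUIDs": uid}, password)
    with urllib.request.urlopen(req) as resp:
        return [hit["title"] for hit in json.load(resp)]
```

For example, `list_cluster_dashboards("http://grafana.local:3000", "<FOLDER_TITLE>", "<CLUSTER_SECRET>")`. Depending on your Grafana version, the search endpoint may expect numeric `folderIds` instead of `folderUIDs`.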

By default, each dashboard shows data for all objects in the cluster (e.g., all devices). You can, however, filter a dashboard by particular objects (e.g., devices, storage nodes, or logical volumes) and change its timescale and time window.
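
The same filters and time settings can be encoded in a dashboard link, which is handy for bookmarks or for sharing a specific view. Grafana reads the time range from the `from`/`to` URL parameters, the auto-refresh interval from `refresh`, and template-variable selections from `var-<name>` parameters (repeated for multiple values). The sketch below builds such a link; the dashboard UID and the variable name `device` are hypothetical, so check the variable names defined in your dashboard's settings.

```python
from urllib.parse import urlencode


def dashboard_url(base_url, dashboard_uid, variables,
                  time_from="now-6h", time_to="now", refresh=None):
    """Build a Grafana dashboard link with object filters and a time range.

    `variables` maps a template-variable name to one value or a list of values;
    each value becomes a separate `var-<name>=<value>` query parameter.
    """
    params = [("from", time_from), ("to", time_to)]
    if refresh:
        params.append(("refresh", refresh))  # e.g. "30s"
    for name, values in variables.items():
        if isinstance(values, str):
            values = [values]
        params.extend((f"var-{name}", value) for value in values)
    return f"{base_url.rstrip('/')}/d/{dashboard_uid}?{urlencode(params)}"


# Hypothetical example: two devices, last hour, refreshing every 30 seconds.
print(dashboard_url("http://grafana.local:3000", "device-dashboard-uid",
                    {"device": ["dev-1", "dev-2"]}, time_from="now-1h", refresh="30s"))
```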

Dashboards show how physical and logical capacity utilization develop over time, as well as IOPS, I/O throughput, and latency, each broken down into read, write, and unmap operations. Although all event log data is currently stored in Prometheus as well, the dashboards do not use it at the time of writing.

## Alerting

By default, Grafana is configured to send alerts to Slack channels. Grafana also supports email notifications, but these require an authorized SMTP server to send messages.
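
Once an SMTP server is available, Grafana's email settings live in the `[smtp]` section of its configuration, and, like any Grafana setting, each key can also be supplied as a `GF_SMTP_<KEY>` environment variable. The sketch below renders those variables, for example to feed an env file for the Grafana container. How Grafana is deployed in your management stack, and therefore where these variables must be set, is an assumption to verify; host, port, user, and addresses are placeholders, and the helper names are hypothetical.

```python
def grafana_smtp_env(host: str, port: int, user: str, password: str,
                     from_address: str, from_name: str = "Grafana") -> dict:
    """Return Grafana [smtp] settings as GF_SMTP_* environment variables."""
    return {
        "GF_SMTP_ENABLED": "true",
        "GF_SMTP_HOST": f"{host}:{port}",  # Grafana expects host:port
        "GF_SMTP_USER": user,
        "GF_SMTP_PASSWORD": password,
        "GF_SMTP_FROM_ADDRESS": from_address,
        "GF_SMTP_FROM_NAME": from_name,
    }


def to_env_file(env: dict) -> str:
    """Render the variables as KEY=value lines, one per line."""
    return "\n".join(f"{key}={value}" for key, value in env.items())


print(to_env_file(grafana_smtp_env("smtp.example.com", 587, "alerts@example.com",
                                   "<SMTP_PASSWORD>", "alerts@example.com")))
```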

An SMTP server is currently not part of the management stack and must be deployed separately.

Alerts can be triggered by one-time or interval-based thresholds on the collected statistics (I/O statistics, capacity information) or by events from the cluster event log.
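
To make the difference between the two threshold types concrete, the following sketch evaluates a series of samples (for example, capacity utilization in percent) both ways. It assumes that "one-time" means a single sample crossing the threshold and "interval-based" means the threshold must stay exceeded over a window; it illustrates the idea only and is not how Grafana or simplyblock implement it. In Grafana, interval-based behavior roughly corresponds to a rule's pending period. Counting the window in samples rather than time is a simplification.

```python
from typing import Sequence


def one_time_breach(samples: Sequence[float], threshold: float) -> bool:
    """Fire as soon as any single sample exceeds the threshold."""
    return any(value > threshold for value in samples)


def interval_breach(samples: Sequence[float], threshold: float, window: int) -> bool:
    """Fire only if `window` consecutive samples all exceed the threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0  # reset on any sample at or below
        if run >= window:
            return True
    return False


utilization = [70.0, 92.0, 85.0, 91.0, 93.0, 80.0]  # made-up samples, percent
print(one_time_breach(utilization, 90.0))     # True: 92.0 is above 90
print(interval_breach(utilization, 90.0, 3))  # False: at most 2 consecutive samples above 90
```

The interval-based variant trades reaction time for robustness: short spikes are ignored, which reduces noisy alerts.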

### Pre-Defined Alerts

The following pre-defined alerts are available:

| Alert              | Trigger                                                       |
|--------------------|---------------------------------------------------------------|
| device-unavailable | Device status changed from online to unavailable              |
| device-read-only   | Device status changed from online to read-only                |
| sn-offline         | Storage node status changed from online to offline            |
| crit-cap-reached   | Critical absolute capacity utilization in cluster was reached |
| crit-prov-reached  | Critical provisioned capacity in cluster was reached          |
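
As worded in the table, the status-based alerts are tied to a change of status (for example, from online to unavailable) rather than to the status itself. The sketch below expresses that edge-triggered logic by comparing two status snapshots. It is illustrative only: the object kinds, status strings, and the function name `transition_alerts` are hypothetical and need not match the values the cluster actually reports.

```python
# Status transitions that raise a pre-defined alert, following the table.
# Object kinds and status strings are hypothetical placeholders.
TRANSITION_ALERTS = {
    ("device", "online", "unavailable"): "device-unavailable",
    ("device", "online", "read-only"): "device-read-only",
    ("storage-node", "online", "offline"): "sn-offline",
}


def transition_alerts(kind: str, previous: dict, current: dict) -> list:
    """Compare two {object_id: status} snapshots and return (alert, object_id) pairs."""
    alerts = []
    for object_id, new_status in current.items():
        old_status = previous.get(object_id)
        alert = TRANSITION_ALERTS.get((kind, old_status, new_status))
        if alert is not None:
            alerts.append((alert, object_id))
    return alerts


before = {"dev-1": "online", "dev-2": "online", "dev-3": "unavailable"}
after = {"dev-1": "unavailable", "dev-2": "online", "dev-3": "unavailable"}
# dev-3 was already unavailable, so only dev-1 produces an alert.
print(transition_alerts("device", before, after))
```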

The Slack webhook for alerting can be configured during cluster creation or changed at a later point.
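
Before entering a webhook during cluster creation, or when changing it later, it can help to confirm that the webhook actually posts to the intended channel. Slack incoming webhooks accept an HTTP POST with a JSON body containing a `text` field. The sketch below builds and sends such a request with the Python standard library; the helper names are hypothetical, the webhook URL is a placeholder, and this check is independent of how the cluster stores the webhook.

```python
import json
import urllib.request


def slack_test_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request for a Slack incoming webhook with a plain-text message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(webhook_url, data=body,
                                  headers={"Content-Type": "application/json"},
                                  method="POST")


def send(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code (200 on success)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

For example, `send(slack_test_request("<WEBHOOK_URL>", "Test message for cluster alerting"))` should return `200` and show the message in the channel if the webhook is valid.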
