Extended Grafana section

noctarius · noctarius · commit 1e18fc2681f0 · 2025-03-12T08:36:28.000+01:00
diff --git a/docs/maintenance-operations/monitoring/accessing-grafana.md b/docs/maintenance-operations/monitoring/accessing-grafana.md
@@ -37,3 +37,47 @@ sbcli cluster get-secret <CLUSTER_ID>
 **Credentials**<br/>
 Username: **admin**<br/>
 Password: **<CLUSTER_SECRET>**
+
+## Grafana Dashboards
+
+All dashboards are stored in per-cluster folders. Each cluster contains the following dashboard entries:
+
+- Cluster
+- Storage node
+- Device
+- Logical Volume
+- Storage Pool
+
+Dashboard widgets are designed to be self-explanatory.
+
+By default, each of these dashboards contains data for all objects (e.g. all devices) in a cluster. It is, however,
+possible to filter them by particular objects (e.g. devices, storage nodes, or logical volumes) and to change the
+timescale and window.
+
+Dashboards include physical and logical capacity utilization dynamics, IOPS, I/O throughput, and latency dynamics (each
+shown separately for read, write, and unmap). Although all event log data is currently stored in Prometheus, it is not
+used in the dashboards at the time of writing.
+
+## Alerting
+
+By default, Grafana is configured to send alerts to Slack channels. Grafana also supports alerting via email
+notifications, but this requires an authorized SMTP server to send the messages.
+
+An SMTP server is currently not part of the management stack and must be deployed separately. Alerts can be triggered
+based on one-time or interval-based thresholds of collected statistical data (I/O statistics, capacity information) or
+based on events from the cluster event log.
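
The difference between a threshold checked against a single sample and one that must hold over an interval can be sketched as follows. This is only an illustration of the idea, not how Grafana or the management stack evaluates alert rules: the function names, the sample list, and the `window` parameter are all hypothetical.

```python
from typing import Sequence


def one_time_breach(samples: Sequence[float], threshold: float) -> bool:
    """Fire as soon as the most recent sample exceeds the threshold."""
    return bool(samples) and samples[-1] > threshold


def interval_breach(samples: Sequence[float], threshold: float, window: int) -> bool:
    """Fire only if each of the last `window` samples exceeds the threshold."""
    if window <= 0 or len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])


# Hypothetical capacity utilization samples in percent, oldest first.
utilization = [70.0, 85.0, 92.0, 95.0]
print(one_time_breach(utilization, 90.0))
print(interval_breach(utilization, 90.0, window=3))
```

An interval-based check of this kind trades reaction time for fewer false alarms, since a single spike does not trigger it.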
+
+### Pre-Defined Alerts
+
+The following pre-defined alerts are available:
+
+| Alert              | Trigger                                                       |
+|--------------------|---------------------------------------------------------------|
+| device-unavailable | Device status changed from online to unavailable              |
+| device-read-only   | Device status changed from online to read-only                |
+| sn-offline         | Storage node status changed from online to offline            |
+| crit-cap-reached   | Critical absolute capacity utilization in cluster was reached |
+| crit-prov-reached  | Critical absolute provisioned capacity in cluster was reached |
+
+The Slack webhook used for alerting can be configured during cluster creation or modified later.
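
Before relying on a webhook for alerts, it can help to confirm that the URL accepts messages. The sketch below assumes a standard Slack incoming webhook, which accepts an HTTP POST with a JSON body containing a `text` field. The function names and the test message are illustrative and not part of `sbcli` or Grafana.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a plain-text message for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url: str, text: str = "Grafana alerting test") -> int:
    """Send the message and return the HTTP status code of the response."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_test_message("<SLACK_WEBHOOK_URL>")` with the real webhook URL should post the test message to the configured channel. An invalid or revoked URL raises `urllib.error.HTTPError` instead.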