Swap, RAID, Free space, Node alarms #7162

DavidePrincipi · 2024-11-20T15:37:02Z

Implement new alarms for clusters with an active subscription.

Proposed solution

The current monitoring script, node-monitor, is very simple and difficult to improve further. It is time to look at standard solutions, such as Netdata or Prometheus (or both), to implement the monitoring of cluster nodes and applications.

Alternative solutions

Implement missing alarms and notifications in the existing script.

See also

Sub-issues below


gsanchietti · 2025-02-25T09:34:42Z

The node-monitor has been replaced with a new metrics module that introduces a stack composed of:

Prometheus to gather the metrics and generate the alerts
Alertmanager to route the alerts
alert-proxy to forward alerts to my.nethesis.it and my.nethserver.com

Test case 1: new installation

Install a new cluster with 2 nodes using the latest testing image
Verify metrics module is installed and running only on the leader node: look for /home/metrics1 and check that Prometheus, Alertmanager and alert-proxy are running:
- ss -lanp | grep 9091 for Prometheus
- ss -lanp | grep 9094 for Alertmanager
- ss -lanp | grep 9095 for alert-proxy
Verify node_exporter is running on all nodes: systemctl status node-exporter
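
The three port checks above can be combined into one pass. The following is a minimal sketch, not part of the metrics module: the function names `port_is_listening` and `check_metrics_ports` are hypothetical, and it uses `ss -ltn` (listening TCP sockets only) instead of `ss -lanp | grep`, so that a port number appearing in an unrelated connection does not count as a match. The port-to-service mapping is the one listed in this test case.

```shell
# port_is_listening PORT [SS_OUTPUT]
# Succeeds if a listening TCP socket is bound to PORT.
# SS_OUTPUT defaults to the live output of `ss -ltn`; passing it explicitly
# lets the parsing be tried on captured output.
port_is_listening() {
    local port=":$1"
    local output="${2-$(ss -ltn)}"
    # Column 4 of `ss -ltn` is "Local Address:Port"; compare its ":PORT" suffix.
    printf '%s\n' "$output" | awk -v p="$port" '
        NR > 1 && length($4) >= length(p) &&
            substr($4, length($4) - length(p) + 1) == p { found = 1 }
        END { exit !found }'
}

# check_metrics_ports [SS_OUTPUT]
# Prints one line per service and fails if any expected port has no listener.
check_metrics_ports() {
    local output="${1-$(ss -ltn)}"
    local rc=0 entry port name
    for entry in 9091:Prometheus 9094:Alertmanager 9095:alert-proxy; do
        port="${entry%%:*}"
        name="${entry#*:}"
        if port_is_listening "$port" "$output"; then
            echo "$name ($port): listening"
        else
            echo "$name ($port): NOT listening"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `check_metrics_ports` on the leader should report all three services as listening; on a worker node all three are expected to be reported as not listening.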

Test case 2: update

Install a new cluster with 2 nodes using the latest stable image
Install node_exporter module on both nodes
Update the core on both nodes
Verify metrics module is installed and running only on the leader node: look for /home/metrics1 and check that Prometheus, Alertmanager and alert-proxy are running:
- ss -lanp | grep 9091 for Prometheus
- ss -lanp | grep 9094 for Alertmanager
- ss -lanp | grep 9095 for alert-proxy
Verify old node_exporter instances have been removed: systemctl status node-exporterX
Verify node_exporter is running on all nodes: systemctl status node-exporter
Verify node-monitor is not running: systemctl status node-monitor

Test case 3: switch leader

After test case 1 or 2
Promote the worker node to leader
Verify metrics module is installed and running only on the leader node: look for /home/metrics1 and check that Prometheus, Alertmanager and alert-proxy are running:
- ss -lanp | grep 9091 for Prometheus
- ss -lanp | grep 9094 for Alertmanager
- ss -lanp | grep 9095 for alert-proxy
Verify the above services are no longer running on the old leader, which is now a worker node: the /home/metricsX directory must be gone from it

Test case 4: alert to my.nethesis.it

After test case 1 or 2
Disable node_exporter on the worker node: systemctl stop node_exporter

Verify that no alert is sent. The logs should contain something like:

Feb 25 08:48:53 rl1.leader.cluster0.gs.nethserver.net alert-proxy[137405]: ALERT CRITICAL node:offline:1
Feb 25 08:48:53 rl1.leader.cluster0.gs.nethserver.net alert-proxy[137405]: No auth token, alert not sent

Restart node_exporter on the worker node: systemctl start node_exporter
Register the cluster with a valid subscription token
Disable node_exporter on the worker node: systemctl stop node_exporter
Verify the alert is received by my.nethesis.it
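
The log check in the first half of this test case can be scripted. This is a sketch under assumptions: the function name `alert_dropped_without_token` is hypothetical, and it matches only the two message strings shown in the log excerpt above. It reads log lines on standard input and succeeds when an `ALERT` line from alert-proxy is followed by `No auth token, alert not sent`.

```shell
# alert_dropped_without_token
# Reads log lines on stdin. Succeeds if alert-proxy logged an alert and then
# reported that the alert was not sent because no auth token is configured.
alert_dropped_without_token() {
    awk '
        /alert-proxy\[[0-9]+\]: ALERT /               { seen_alert = 1; next }
        seen_alert && /No auth token, alert not sent/ { dropped = 1 }
        END { exit !dropped }'
}
```

If these messages land in the system journal, the function could be fed with `journalctl -t alert-proxy --since '15min ago' | alert_dropped_without_token`. The `alert-proxy` syslog identifier is inferred from the excerpt and may differ on other setups.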

Test case 4: access to Prometheus

Enable access to Prometheus: api-cli run module/metrics1/configure-module --data '{"prometheus_path": "prometheus", "grafana_path": "", "lets_encrypt": false}'
Verify Prometheus is accessible from the browser at: https://<node_fqdn>/prometheus
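
The browser check above can also be done from a shell. The sketch below is illustrative only: `status_is_reachable` and `check_url` are hypothetical helper names, and accepting 3xx responses is an assumption, made because a web UI published under a path prefix may answer with a redirect to a sub-page.

```shell
# status_is_reachable CODE
# Succeeds for HTTP 2xx and 3xx status codes.
status_is_reachable() {
    case "$1" in
        2[0-9][0-9] | 3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# check_url URL
# Fetches URL and prints its HTTP status. -k skips certificate verification,
# assumed to be necessary while lets_encrypt is false.
check_url() {
    local code
    # curl prints 000 as the status code when the request itself fails.
    code=$(curl -k -s -o /dev/null --max-time 10 -w '%{http_code}' "$1") || true
    if status_is_reachable "$code"; then
        echo "OK $code $1"
    else
        echo "FAIL $code $1"
        return 1
    fi
}
```

For example, `check_url "https://<node_fqdn>/prometheus"`, with `<node_fqdn>` replaced by the name of the leader node.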

Test case 4: access to Grafana

Enable access to Grafana: api-cli run module/metrics1/configure-module --data '{"prometheus_path": "prometheus", "grafana_path": "grafana", "lets_encrypt": false}'
Verify Grafana is accessible from the browser at: https://<node_fqdn>/grafana
Login with default credentials: admin / admin
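
The configure-module calls in the two access test cases differ only in the values of three fields, and hand-typed JSON inside shell quotes is easy to get wrong. A small helper can print the payload instead. This is a sketch: `metrics_payload` is a hypothetical name, it covers only the three fields used in these two test cases, and it does not JSON-escape its arguments, so it is suitable only for simple path values.

```shell
# metrics_payload PROMETHEUS_PATH GRAFANA_PATH LETS_ENCRYPT
# Prints the JSON body for configure-module. LETS_ENCRYPT must be the literal
# word true or false. Arguments are inserted verbatim (no JSON escaping).
metrics_payload() {
    case "$3" in
        true | false) ;;
        *) echo "lets_encrypt must be true or false" >&2; return 1 ;;
    esac
    printf '{"prometheus_path": "%s", "grafana_path": "%s", "lets_encrypt": %s}\n' \
        "$1" "$2" "$3"
}
```

It can then be used as `api-cli run module/metrics1/configure-module --data "$(metrics_payload prometheus grafana false)"`.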

Test case 5: Grafana dashboard

After test case 4
Verify that the following dashboards are present and contain data:
- Loki: it must show metrics from Loki (some charts could be empty)
- Nodes: hardware metrics of all nodes, nodes can be selected from the top bar
If an alert is raised, it must be visible under the Alerting section

Test case 6: Grafana and Prometheus certificates

After test case 5
Request a valid certificate by executing: api-cli run module/metrics1/configure-module --data '{"prometheus_path": "prometheus", "grafana_path": "grafana", "lets_encrypt": true, "mail_to": [], "mail_from": "", "mail_template": ""}'
Wait a bit
Make sure that both /prometheus and /grafana are served using a valid certificate
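
Certificate validity can also be checked without a browser by letting curl verify the chain. A minimal sketch, with hypothetical function names `describe_tls_result` and `check_certificate`: it relies on curl exiting with code 60 when the peer certificate cannot be authenticated against the known CA certificates, and treats any other non-zero exit code as an unrelated failure.

```shell
# describe_tls_result CURL_EXIT_CODE
# Maps a curl exit code to a short verdict.
# 0 = request succeeded, 60 = peer certificate could not be verified.
describe_tls_result() {
    case "$1" in
        0)  echo "certificate accepted" ;;
        60) echo "certificate NOT trusted" ;;
        *)  echo "request failed (curl exit $1)" ;;
    esac
}

# check_certificate URL
# Requests URL with certificate verification enabled (no -k).
check_certificate() {
    local rc=0
    curl -s -o /dev/null --max-time 10 "$1" || rc=$?
    describe_tls_result "$rc"
}
```

For example, run `check_certificate "https://<node_fqdn>/prometheus"` and `check_certificate "https://<node_fqdn>/grafana"`; both should print `certificate accepted` once the certificate has been issued.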

Test case 7: mail alerts

After test case 5
Access the Settings page of the cluster and configure a remote or local mail server to send the notifications
Configure the mail notification: api-cli run module/metrics1/configure-module --data '{"prometheus_path": "prometheus", "grafana_path": "grafana", "lets_encrypt": false, "mail_to": ["[email protected]"], "mail_from": "[email protected]", "mail_template": ""}'
Generate an alert
Verify a mail is sent to the destination (the link to Alertmanager inside the mail will point to a nonexistent URL)

Test case 8: custom alert

After test case 4
Follow instructions on how to create a custom alert: https://github.com/NethServer/ns8-metrics?tab=readme-ov-file#customimze-alert-rules
Verify the alert is generated

Test case 9: mail notification

After test case 7
Follow instructions on how to create a custom mail template: https://github.com/NethServer/ns8-metrics?tab=readme-ov-file#customize-alert-mail-template
Verify the sent mail uses the new template

NethServer/dev#7162

DavidePrincipi · 2025-02-27T16:09:21Z

In testing core 3.5.0-dev.5 (see the test cases from the previous comment).

DavidePrincipi added the milestone goal 👑 label Nov 20, 2024

DavidePrincipi added this to the NethServer M8.3 milestone Nov 20, 2024

DavidePrincipi added this to NethServer Nov 20, 2024

DavidePrincipi moved this to ToDo in NethServer Nov 20, 2024

gsanchietti changed the title ~~Backup, Swap, RAID, Free space alarms~~ Swap, RAID, Free space alarms Nov 22, 2024

gsanchietti modified the milestones: NethServer M8.3, NethServer 8.4 Nov 22, 2024

DavidePrincipi changed the title ~~Swap, RAID, Free space alarms~~ Swap, RAID, Free space, Node alarms Nov 26, 2024

DavidePrincipi assigned gsanchietti Feb 20, 2025

DavidePrincipi moved this from ToDo to In Progress in NethServer Feb 20, 2025

gsanchietti added a commit to NethServer/ns8-repomd that referenced this issue Feb 27, 2025

feat: add metrics core module (#44)

f1751bd

NethServer/dev#7162

gsanchietti mentioned this issue Feb 27, 2025

Monitoring: integrate metrics inside the core NethServer/ns8-core#816

Merged

DavidePrincipi unassigned gsanchietti Feb 27, 2025

DavidePrincipi added the testing label Feb 27, 2025

nethbot moved this from In Progress to Testing in NethServer Feb 27, 2025
