Swap, RAID, Free space, Node alarms #7162

Open
DavidePrincipi opened this issue Nov 20, 2024 · 2 comments
Labels
milestone goal 👑 This describes an announced milestone goal testing Packages are available from testing repositories

Comments

@DavidePrincipi
Member

DavidePrincipi commented Nov 20, 2024

Implement new alarms for clusters with an active subscription.

Proposed solution

The current monitoring script, node-monitor, is very simple and difficult to improve further. It is time to look at standard solutions, like Netdata or Prometheus (or both), to implement the monitoring of cluster nodes and applications.

Alternative solutions

Implement missing alarms and notifications in the existing script.

See also

  • Sub-issues below
@DavidePrincipi DavidePrincipi added the milestone goal 👑 This describes an announced milestone goal label Nov 20, 2024
@DavidePrincipi DavidePrincipi added this to the NethServer M8.3 milestone Nov 20, 2024
@DavidePrincipi DavidePrincipi moved this to ToDo in NethServer Nov 20, 2024
@gsanchietti gsanchietti changed the title Backup, Swap, RAID, Free space alarms Swap, RAID, Free space alarms Nov 22, 2024
@DavidePrincipi DavidePrincipi changed the title Swap, RAID, Free space alarms Swap, RAID, Free space, Node alarms Nov 26, 2024
@DavidePrincipi DavidePrincipi moved this from ToDo to In Progress in NethServer Feb 20, 2025
@gsanchietti
Member

gsanchietti commented Feb 25, 2025

The node-monitor script has been replaced with a new metrics module that introduces a stack composed of:

  • Prometheus to gather the metrics and generate the alerts
  • Alertmanager to route the alerts
  • alert-proxy to forward alerts to my.nethesis.it and my.nethserver.com

Test case 1: new installation

  • Install a new cluster with 2 nodes using the latest testing image
  • Verify metrics module is installed and running only on the leader node: look for /home/metrics1 and check that Prometheus, Alertmanager and alert-proxy are running:
    • ss -lanp | grep 9091 for Prometheus
    • ss -lanp | grep 9094 for Alertmanager
    • ss -lanp | grep 9095 for alert-proxy
  • Verify node_exporter is running on all nodes: systemctl status node-exporter
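
The port checks above can also be scripted. What follows is a minimal Python sketch, not part of the metrics module: it parses the text printed by ss -lanp and reports which of the three services has no listening socket. The function names are hypothetical, and the port-to-service mapping is copied from the steps above. Unlike a plain grep, it only counts sockets in the LISTEN state, so a PID or a peer port that happens to contain 9091 is not mistaken for a match.

```python
import subprocess

# Port-to-service mapping taken from the test steps above.
EXPECTED_PORTS = {9091: "Prometheus", 9094: "Alertmanager", 9095: "alert-proxy"}


def listening_ports(ss_output: str) -> set[int]:
    """Return the local ports of all sockets in LISTEN state.

    Assumes the default column layout of `ss -lanp`:
    Netid State Recv-Q Send-Q Local-Address:Port Peer-Address:Port ...
    """
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[1] != "LISTEN":
            continue  # header line, UDP/UNIX sockets, established connections
        _, _, port = fields[4].rpartition(":")
        if port.isdigit():
            ports.add(int(port))
    return ports


def missing_services(ss_output: str) -> list[str]:
    """Return the names of expected services that are not listening."""
    found = listening_ports(ss_output)
    return [name for port, name in EXPECTED_PORTS.items() if port not in found]


def check_local_node() -> list[str]:
    """Run ss on the current node and return the missing services."""
    result = subprocess.run(
        ["ss", "-lanp"], capture_output=True, text=True, check=True
    )
    return missing_services(result.stdout)
```

On the leader node check_local_node() should return an empty list; on a worker node all three names are expected.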

Test case 2: update

  • Install a new cluster with 2 nodes using the latest stable image
  • Install node_exporter module on both nodes
  • Update the core on both nodes
  • Verify metrics module is installed and running only on the leader node: look for /home/metrics1 and check that Prometheus, Alertmanager and alert-proxy are running:
    • ss -lanp | grep 9091 for Prometheus
    • ss -lanp | grep 9094 for Alertmanager
    • ss -lanp | grep 9095 for alert-proxy
  • Verify old node_exporter instances have been removed: systemctl status node-exporterX
  • Verify node_exporter is running on all nodes: systemctl status node-exporter
  • Verify node-monitor is not running: systemctl status node-monitor
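
The unit checks in this test case could be automated along these lines. This is only a sketch: the unit names are the ones written in the steps above and may need adjusting on a real node, and unit_is_active and unexpected_units are hypothetical helper names. It relies on systemctl is-active, which prints the unit state on standard output.

```python
import subprocess
from typing import Callable

# Expected state after the update, using the unit names from the steps above.
EXPECTED = {"node-exporter": True, "node-monitor": False}


def unit_is_active(unit: str) -> bool:
    """Return True when `systemctl is-active` reports the unit as active."""
    result = subprocess.run(
        ["systemctl", "is-active", unit], capture_output=True, text=True
    )
    return result.stdout.strip() == "active"


def unexpected_units(
    expected: dict[str, bool],
    probe: Callable[[str], bool] = unit_is_active,
) -> list[str]:
    """Compare the real unit states with the expected ones.

    `probe` can be replaced with a fake for testing without systemd.
    """
    problems = []
    for unit, should_run in expected.items():
        if probe(unit) != should_run:
            wanted = "active" if should_run else "not active"
            problems.append(f"{unit}: expected {wanted}")
    return problems
```

Calling unexpected_units(EXPECTED) on each node returns an empty list when the update left the node in the expected state.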

Test case 3: switch leader

  • After test case 1 or 2
  • Promote the worker node to leader
  • Verify metrics module is installed and running only on the leader node: look for /home/metrics1 and check that Prometheus, Alertmanager and alert-proxy are running:
    • ss -lanp | grep 9091 for Prometheus
    • ss -lanp | grep 9094 for Alertmanager
    • ss -lanp | grep 9095 for alert-proxy
  • Verify the above services are no longer running on the old leader (now a worker node): its /home/metricsX directory must be gone
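
To check for leftover module homes after the switch, a small sketch like the following may help. It assumes that every metrics instance lives in a directory named metrics followed by a number directly under /home, as the /home/metrics1 and /home/metricsX paths above suggest. The function name is hypothetical.

```python
from pathlib import Path


def metrics_homes(home: str = "/home") -> list[str]:
    """Return the names of directories under `home` that start with 'metrics'."""
    return sorted(p.name for p in Path(home).glob("metrics*") if p.is_dir())
```

On the new leader the result should contain exactly one entry, and on the old leader it should be empty.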

Test case 4: alert to my.nethesis.it

  • After test case 1 or 2
  • Disable node_exporter on the worker node: systemctl stop node_exporter
  • Verify no alert is sent; the logs should contain lines like:
    Feb 25 08:48:53 rl1.leader.cluster0.gs.nethserver.net alert-proxy[137405]: ALERT CRITICAL node:offline:1
    Feb 25 08:48:53 rl1.leader.cluster0.gs.nethserver.net alert-proxy[137405]: No auth token, alert not sent
    
  • Restart node_exporter on the worker node: systemctl start node_exporter
  • Register the cluster with a valid subscription token
  • Disable node_exporter on the worker node: systemctl stop node_exporter
  • Verify the alert is received by my.nethesis.it
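
The log check in this test case can be expressed as a short parser. The sketch below only knows the two alert-proxy messages quoted above; any other log format is outside what this issue documents. The names ALERT_RE and summarize are hypothetical.

```python
import re

# Matches lines such as:
#   ... alert-proxy[137405]: ALERT CRITICAL node:offline:1
ALERT_RE = re.compile(r"alert-proxy\[\d+\]: ALERT (\S+) (\S+)")
NOT_SENT = "No auth token, alert not sent"


def summarize(lines):
    """Return (alerts, unsent) for the given log lines.

    alerts is a list of (severity, alert_id) tuples;
    unsent counts the "alert not sent" messages.
    """
    alerts = []
    unsent = 0
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            alerts.append((match.group(1), match.group(2)))
        elif NOT_SENT in line:
            unsent += 1
    return alerts, unsent
```

Before the cluster is registered, each raised alert should be paired with one "not sent" message. After registration, the unsent count for new alerts should stay at zero.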

Test case 4: access to Prometheus

  • Enable access to Prometheus: api-cli run module/metrics1/configure-module --data '{"prometheus_path": "prometheus", "grafana_path": "", "lets_encrypt": false}'
  • Verify Prometheus is accessible from the browser at: https://<node_fqdn>/prometheus
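
Once Prometheus is reachable, its standard HTTP API can be used to see which scrape targets are down. The sketch below builds the query URL and interprets the JSON answer of an instant query for the built-in up metric. It assumes, without confirmation from this issue, that the API is exposed under the same path prefix as the web UI and needs no extra authentication. The function names are hypothetical.

```python
from urllib.parse import urlencode


def query_url(node_fqdn: str, prometheus_path: str, expr: str) -> str:
    """Build the URL of an instant query against the Prometheus HTTP API."""
    prefix = prometheus_path.strip("/")
    params = urlencode({"query": expr})
    return f"https://{node_fqdn}/{prefix}/api/v1/query?{params}"


def down_instances(payload: dict) -> list[str]:
    """Return the instances whose `up` sample is 0.

    `payload` is the decoded JSON body of /api/v1/query?query=up,
    where each result carries "metric" labels and a [timestamp, "value"] pair.
    """
    if payload.get("status") != "success":
        raise ValueError("query failed")
    return sorted(
        sample["metric"].get("instance", "?")
        for sample in payload["data"]["result"]
        if sample["value"][1] == "0"
    )
```

With node_exporter stopped on the worker node, that node's instance is expected to appear in the list returned by down_instances.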

Test case 4: access to Grafana

  • Enable access to Grafana: api-cli run module/metrics1/configure-module --data '{"prometheus_path": "prometheus", "grafana_path": "grafana", "lets_encrypt": false}'
  • Verify Grafana is accessible from the browser at: https://<node_fqdn>/grafana
  • Login with default credentials: admin / admin
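
Typing the JSON payload by hand makes it easy to lose a quote. As a convenience, the command line can be generated instead. This is a sketch under the assumption that the three keys shown above are all the action needs for this test case; configure_command is a hypothetical helper, not part of api-cli.

```python
import json
import shlex


def configure_command(
    prometheus_path: str,
    grafana_path: str,
    lets_encrypt: bool,
    module_id: str = "metrics1",
) -> str:
    """Return a shell-safe api-cli command line for configure-module."""
    payload = {
        "prometheus_path": prometheus_path,
        "grafana_path": grafana_path,
        "lets_encrypt": lets_encrypt,
    }
    # shlex.join quotes the JSON argument so the shell passes it unchanged.
    return shlex.join(
        [
            "api-cli",
            "run",
            f"module/{module_id}/configure-module",
            "--data",
            json.dumps(payload),
        ]
    )
```

For example, configure_command("prometheus", "grafana", False) produces the same command as the Grafana step above.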

Test case 5: Grafana dashboard

  • After test case 4
  • Verify that the following dashboards are present and contain data:
    • Loki: it must show metrics from Loki (some charts could be empty)
    • Nodes: hardware metrics of all nodes, nodes can be selected from the top bar
  • If an alert is raised, it must be visible under the Alerting section

Test case 6: Grafana and Prometheus certificates

  • After test case 5
  • Request a valid certificate by executing: api-cli run module/metrics1/configure-module --data '{"prometheus_path": "prometheus", "grafana_path": "grafana", "lets_encrypt": true, "mail_to": [], "mail_from": "", "mail_template": ""}'
  • Wait a bit
  • Make sure that both /prometheus and /grafana are served using a valid certificate
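
The certificate check can be done from a script as well as from the browser. The sketch below uses only the Python standard library: the default SSL context verifies the chain and the host name, so a self-signed certificate makes the handshake fail. The function names are hypothetical, and since both paths are served by the same host, one connection covers both.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect with full verification and return the certificate expiry.

    Raises ssl.SSLCertVerificationError when the certificate is not
    trusted or does not match `host`.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now_epoch: float) -> float:
    """Return the number of days between `now_epoch` and the expiry date."""
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / 86400
```

A successful fetch_not_after("<node_fqdn>") call already shows that the certificate is trusted; days_remaining with time.time() as the second argument then tells how long it stays valid.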

Test case 7: mail alerts

  • After test case 5
  • Access the Settings page of the cluster and configure a remote or local mail server to send the notifications
  • Configure the mail notification: api-cli run module/metrics1/configure-module --data '{"prometheus_path": "prometheus", "grafana_path": "grafana", "lets_encrypt": false, "mail_to": ["[email protected]"], "mail_from": "[email protected]", "mail_template": ""}'
  • Generate an alert
  • Verify a mail is sent to the destination (the link to alertmanager inside the mail will point to a non-existing URL)

Test case 8: custom alert

Test case 9: mail notification

@DavidePrincipi
Member Author

In testing core 3.5.0-dev.5 (see the test cases from the previous comment).

@DavidePrincipi DavidePrincipi added the testing Packages are available from testing repositories label Feb 27, 2025
@nethbot nethbot moved this from In Progress to Testing in NethServer Feb 27, 2025
Projects
Status: Testing