HOWTO: Setup NAGIOS for an HPC cluster

Setting up Nagios

This guide walks through the basic steps of setting up Nagios on a RHEL6-based system. Nagios always needs customising for your particular hardware, but these instructions should give you the steps you need to create a new environment.

Assumptions

You have already installed a Linux instance on which to run Nagios itself (including the web GUI).
You have one or more machines that you want to monitor (via the Nagios NRPE agent).
The machines can all communicate with each other over a network.
You have run the Puppet recipe to install Nagios and NRPE where you need them.

Configuring Nagios

Check that you can log in to the Nagios web GUI; the default username and password are nagiosadmin. You should see a basic configuration already in place, with just the trio machines configured.
Add the machines in your cluster to the /etc/nagios/objects/cluster.cfg file, remembering to include any VM instances that you're using; e.g.

define host{
use generic
host_name headnode1
alias headnode1.mgt.symphony.local
address headnode1
}

define host{
use generic
host_name node01
alias node01.mgt.symphony.local
address node01
}

define host{
use generic
host_name login1
alias login1.mgt.symphony.local
address login1
}
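
For larger clusters, writing each host block by hand gets tedious. As a rough sketch (not part of the original recipe), a small shell loop can print host definitions in the same shape as the examples above. The print_host helper and the node01 to node08 range are illustrative assumptions; the "generic" template and the mgt.symphony.local domain are simply copied from the examples.

```shell
#!/bin/sh
# Hypothetical helper: print one host definition in the format used above.
# The "generic" template and the mgt.symphony.local domain are copied from
# the examples; adjust both for your site.
print_host() {
    cat <<EOF
define host{
use generic
host_name $1
alias $1.mgt.symphony.local
address $1
}

EOF
}

# Assumed naming scheme: node01 .. node08
i=1
while [ "$i" -le 8 ]; do
    print_host "$(printf 'node%02d' "$i")"
    i=$((i + 1))
done
```

Redirect the output to a scratch file and review it before appending it to /etc/nagios/objects/cluster.cfg.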

Add groups for your new hosts as well; e.g.

define hostgroup{
hostgroup_name headnodes
alias Headnode systems
members headnode1
}

define hostgroup{
hostgroup_name compute
alias Compute node systems
members node01,node02,node03,node04,node05,node06,node07,node08
}

define hostgroup{
hostgroup_name logins
alias Login node systems
members login1
}
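
The comma-separated members line is easy to mistype for a long run of compute nodes. The following is a minimal sketch, assuming zero-padded two-digit node names as in the example above; node_list is a hypothetical helper name, not something shipped with Nagios.

```shell
#!/bin/sh
# Hypothetical helper: node_list PREFIX FIRST LAST
# Prints PREFIX01,PREFIX02,... as a single comma-separated line.
# Pass FIRST and LAST without leading zeros (1, not 01).
node_list() {
    prefix=$1
    i=$2
    last=$3
    list=
    while [ "$i" -le "$last" ]; do
        # ${list:+$list,} adds a comma only when the list is non-empty
        list="${list:+$list,}$(printf '%s%02d' "$prefix" "$i")"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

echo "members $(node_list node 1 8)"
```

The printed line can be pasted into the compute hostgroup definition in place of a hand-typed list.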

Enable services for the new nodes by adding them to the hostgroup_name field for each service. Uncomment services that you need for these hosts. Remember that it's best to monitor things only once in your environment - e.g. if you're monitoring how full a shared filesystem is, only monitor it on the server, not on every node.
Add any new commands you need to /etc/nagios/objects/commands.cfg.
Update the commands file for the standard location of plugins (necessary if you're using a template from the previous alcesstack version); e.g. sed -i 's?/var/lib/alces/share/nagios/plugins/?/usr/lib64/nagios/plugins/alces/?g' /etc/nagios/objects/commands.cfg
Update the /etc/nagios/nrpe.cfg file on each monitored machine to include the remote commands to be run there.
Update the sudoers file for Nagios plugins that need to be run as root:
- Disable requiretty.
- Add a new command alias called MONITOR, listing the commands that need to be run with elevated privileges; e.g.

    Cmnd_Alias MONITOR = /usr/lib64/nagios/plugins/alces/check_PSUs, \
                         /usr/lib64/nagios/plugins/alces/check_dirvish

- Allow the nagios and nrpe users to run commands in that alias; e.g.

    nagios,nrpe    ALL=(ALL)       NOPASSWD: MONITOR
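
A syntax error in sudoers can lock out sudo entirely, so it is worth checking the fragment before installing it. The sketch below writes the lines to a scratch file and runs visudo in check mode if it is available. The write_monitor_fragment helper is hypothetical, and scoping the requiretty change to the nagios and nrpe users with a Defaults line is an assumption about how you choose to disable it; the original steps only say to disable requiretty.

```shell
#!/bin/sh
# Hypothetical helper: write the MONITOR sudoers fragment to the given file.
write_monitor_fragment() {
    cat > "$1" <<'EOF'
Defaults:nagios,nrpe !requiretty
Cmnd_Alias MONITOR = /usr/lib64/nagios/plugins/alces/check_PSUs, \
                     /usr/lib64/nagios/plugins/alces/check_dirvish
nagios,nrpe    ALL=(ALL)       NOPASSWD: MONITOR
EOF
}

fragment=$(mktemp)
write_monitor_fragment "$fragment"

# visudo -c checks syntax only; -f points it at our scratch file
# instead of the live sudoers file.
if command -v visudo >/dev/null 2>&1; then
    visudo -cf "$fragment" || echo "visudo reported a problem with $fragment"
else
    echo "visudo not found; skipping syntax check"
fi
```

Only copy the lines into the real sudoers file (using visudo) once the check passes.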

Install any plugins you need in the correct place; custom ones are usually added to the /usr/lib64/nagios/plugins/alces/ directory on each host being monitored. Remember that normal plugins need to be installed on the symphony-monitor machine, while plugins run via NRPE need to be installed on the machine being monitored; e.g.

    yum install nagios-plugins-disk nagios-plugins-load nagios-plugins-users  nagios-plugins-procs

On the symphony-monitor machine:

    yum install nagios-plugins-nrpe freeipmi

Add the cluster IPMI username and password to the /etc/freeipmi/freeipmi.conf file, and change the driver type from KCS to lanplus.
Edit the /etc/cron.hourly/symphonymonitor-ipmi-check file to get information for the relevant nodes.
Edit the /etc/cron.daily/symphonymonitor-ecc-check file to get information for the relevant nodes.
Install and configure the "alces temps" service on the headnode to broadcast temperature data via ganglia. N.B. you may need to set "-I lanplus" in the actions file to allow ipmitool to talk to certain hardware; the actions file to edit is /var/lib/alces/nodeware/opt/symphony/lib/stack/lib/actions/temps on the headnode. Test with alces temps node01 on the headnode; if it works, chkconfig the alces-temps service on and start the service. If necessary, edit the service start script /etc/init.d/alces-temps to point to the new BIN location: BIN="/var/lib/alces/nodeware/bin/alces"
Ensure firewall ports are open on monitored machines for NRPE:

    firewall-cmd --add-port 5666/tcp --zone private --permanent
    firewall-cmd --reload
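
Once the port is open, it helps to confirm from the monitoring server that each NRPE agent actually answers. This is a minimal sketch, assuming check_nrpe was installed by the nagios-plugins-nrpe package under /usr/lib64/nagios/plugins (verify the path on your system). The probe_nrpe and report_nrpe helper names and the host list are illustrative only.

```shell
#!/bin/sh
# Assumed plugin path; override CHECK_NRPE if yours differs.
CHECK_NRPE=${CHECK_NRPE:-/usr/lib64/nagios/plugins/check_nrpe}

# With only -H, check_nrpe asks the remote agent for its version, which is
# enough to show the port is open and this server is allowed to connect.
probe_nrpe() {
    "$CHECK_NRPE" -H "$1" >/dev/null 2>&1
}

# Print one status line per host given on the command line.
report_nrpe() {
    for host in "$@"; do
        if probe_nrpe "$host"; then
            echo "$host: NRPE reachable"
        else
            echo "$host: NRPE NOT reachable"
        fi
    done
}

# Example host names taken from the definitions earlier in this guide
report_nrpe headnode1 node01 login1
```

A host reported as not reachable usually points to the firewall, the NRPE service not running, or the monitoring server missing from allowed_hosts in nrpe.cfg on that host.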

Enabling notifications

The final step, once you're happy that the checks are all working properly, is to enable live notifications. Only do this when you're prepared to have emails sent to you about any checks that return warnings or errors.

Set up an email forward so you can receive the emails. The default settings send emails to the root user of the server where Nagios is running, so add an email address in /root/.forward there if you want to forward the emails to your own address.
Enable notifications in the /etc/nagios/nagios.cfg file on the monitoring server, then restart Nagios to apply the change.
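
In Nagios Core the global switch for this is the enable_notifications directive in nagios.cfg. As a hedged sketch, the change can be scripted with sed; the example below works on a scratch file with sample contents rather than the live configuration, and the enable_notifications shell function is a hypothetical helper. It assumes GNU sed for the -i option, as used elsewhere in this guide.

```shell
#!/bin/sh
# Work on a scratch copy; on a real system CFG would be /etc/nagios/nagios.cfg.
CFG=$(mktemp)
printf '%s\n' 'log_file=/var/log/nagios/nagios.log' 'enable_notifications=0' > "$CFG"

# Hypothetical helper: flip enable_notifications from 0 to 1 in the given file.
enable_notifications() {
    sed -i 's/^enable_notifications=0$/enable_notifications=1/' "$1"
}

enable_notifications "$CFG"

# Show the resulting setting
grep '^enable_notifications=' "$CFG"

# On the monitoring server, validate and restart afterwards (not run here):
#   nagios -v /etc/nagios/nagios.cfg && service nagios restart
```

Running nagios -v before the restart catches configuration errors that would otherwise stop the service from coming back up.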
