
HOWTO: Setup NAGIOS for an HPC cluster

RuanEllis edited this page Dec 4, 2014 · 5 revisions

Setting up Nagios

This guide walks through the basic steps of setting up Nagios on a RHEL6-based system. By its nature, Nagios needs customising for your particular hardware, but these instructions should give you the steps you need to create a new environment.

Assumptions

  • You have already installed a Linux instance that you want to run Nagios itself on (including the web GUI).
  • You have one or more machines that you want to monitor (via the Nagios NRPE agent).
  • The machines can all communicate with each other on a network.
  • You have run the puppet recipe to install Nagios and NRPE where you need them.

Configuring Nagios

  • Check that you can log in to the Nagios web GUI; the default username and password are both nagiosadmin. You should see a basic configuration already in place, with just the trio machines configured.
  • Add the machines in your cluster to the /etc/nagios/objects/cluster.cfg file; remember to also add any VM instances that you're using, e.g.

define host{
use generic
host_name headnode1
alias headnode1.mgt.symphony.local
address headnode1
}

define host{
use generic
host_name node01
alias node01.mgt.symphony.local
address node01
}

define host{
use generic
host_name login1
alias login1.mgt.symphony.local
address login1
}

  • Add groups for your new hosts as well; e.g.

define hostgroup{
hostgroup_name headnodes
alias Headnode systems
members headnode1
}

define hostgroup{
hostgroup_name compute
alias Compute node systems
members node01,node02,node03,node04,node05,node06,node07,node08
}

define hostgroup{
hostgroup_name logins
alias Login node systems
members login1
}
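For clusters with many similarly named nodes, the host definitions above can be generated rather than typed by hand. The following is a minimal sketch, assuming the "generic" template and the mgt.symphony.local domain from the examples above, and a node range of 1 to 8 to match the compute hostgroup; gen_hosts is a hypothetical helper name, not part of Nagios.

```shell
#!/bin/sh
# Sketch: print Nagios host definitions for a numbered range of compute nodes.
# Assumptions: the "generic" template and the mgt.symphony.local domain come
# from the examples in this guide; the range 1-8 mirrors the compute hostgroup.
gen_hosts() {
    first=$1
    last=$2
    i=$first
    while [ "$i" -le "$last" ]; do
        # Zero-pad the node number to two digits (node01, node02, ...).
        name=$(printf 'node%02d' "$i")
        printf 'define host{\n'
        printf 'use generic\n'
        printf 'host_name %s\n' "$name"
        printf 'alias %s.mgt.symphony.local\n' "$name"
        printf 'address %s\n' "$name"
        printf '}\n\n'
        i=$((i + 1))
    done
}

gen_hosts 1 8
```

Redirect the output to a scratch file, review it, and then append it to /etc/nagios/objects/cluster.cfg.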

  • Enable services for the new nodes by adding them to the hostgroup_name field for each service. Uncomment services that you need for these hosts. Remember that it's best to monitor things only once in your environment - e.g. if you're monitoring how full a shared filesystem is, only monitor it on the server, not on every node.

  • Add any new commands you need to /etc/nagios/objects/commands.cfg

  • Update the commands file for the standard location of plugins (necessary if you're using a template from the previous alcesstack version); e.g. sed -i 's?/var/lib/alces/share/nagios/plugins/?/usr/lib64/nagios/plugins/alces/?g' /etc/nagios/objects/commands.cfg

  • Update the /etc/nagios/nrpe.cfg file on each monitored machine to define the remote commands that NRPE is allowed to run there
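As an illustration of what NRPE remote commands look like, the sketch below writes two command definitions to a scratch file for review. The choice of check_load and check_users and all of the thresholds are assumptions made for the example; use whichever plugins and limits suit your hardware.

```shell
#!/bin/sh
# Sketch: stage NRPE command definitions in a scratch file for review.
# Assumptions: the plugins live in /usr/lib64/nagios/plugins, and the
# warning/critical thresholds below are illustrative only.
plugin_dir=/usr/lib64/nagios/plugins
snippet=$(mktemp)

cat > "$snippet" <<EOF
command[check_load]=$plugin_dir/check_load -w 15,10,5 -c 30,25,20
command[check_users]=$plugin_dir/check_users -w 5 -c 10
EOF

cat "$snippet"

# After merging these lines into /etc/nagios/nrpe.cfg on the monitored host
# and restarting nrpe, the monitoring server can call a command by name, e.g.:
#   /usr/lib64/nagios/plugins/check_nrpe -H node01 -c check_load
```

The name inside the square brackets is what the monitoring server passes to check_nrpe with -c, so it has to match the name used in the service definitions.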

  • Update the sudoers file for Nagios plugins that need to be run as root:

    • Disable requiretty
    • Add a new command alias called MONITOR; e.g.
      Cmnd_Alias MONITOR = /usr/lib64/nagios/plugins/alces/check_PSUs, \
                           /usr/lib64/nagios/plugins/alces/check_dirvish
    • Add any commands which need to be run with elevated privileges to that alias
    • Allow the nagios and nrpe users to run commands in that alias; e.g.
      nagios,nrpe    ALL=(ALL)       NOPASSWD: MONITOR
  • Install any plugins you need in the correct place; custom ones are usually added to the /usr/lib64/nagios/plugins/alces/ directory on each host being monitored. Remember that normal plugins need to be installed on the symphony-monitor machine, and plugins run via NRPE need to be installed on the machine being monitored; e.g.
    yum install nagios-plugins-disk nagios-plugins-load nagios-plugins-users nagios-plugins-procs
  • On the symphony-monitor machine:
    yum install nagios-plugins-nrpe freeipmi
  • Add the cluster IPMI username and password to the /etc/freeipmi/freeipmi.conf file, and change the driver type from KCS to lanplus
  • Edit the /etc/cron.hourly/symphonymonitor-ipmi-check file to get information for the relevant nodes
  • Edit the /etc/cron.daily/symphonymonitor-ecc-check file to get information for the relevant nodes
  • Install and configure the "alces temps" service on the headnode to broadcast temperature data via ganglia. N.B. you may need to set "-I lanplus" in the actions file to allow ipmitool to talk to certain hardware; the file to edit is /var/lib/alces/nodeware/opt/symphony/lib/stack/lib/actions/temps on the headnode. Test with alces temps node01 on the headnode; if it works, enable the alces-temps service with chkconfig and start it. If necessary, edit the service start script /etc/init.d/alces-temps to point to the new BIN location: BIN="/var/lib/alces/nodeware/bin/alces"
  • Ensure firewall ports are open on monitored machines for NRPE:
    firewall-cmd --add-port 5666/tcp --zone private --permanent
    firewall-cmd --reload
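A mistake in the sudoers file can lock out sudo entirely, so it is worth syntax-checking the rules from the sudoers step before installing them. The sketch below stages the MONITOR alias from this guide in a scratch file and checks it with visudo. The per-user "Defaults:... !requiretty" lines are an assumption: they are a narrower alternative to disabling requiretty globally, which is what the guide describes.

```shell
#!/bin/sh
# Sketch: stage the sudoers rules in a scratch file and syntax-check them.
rules=$(mktemp)

cat > "$rules" <<'EOF'
# Assumption: per-user alternative to disabling requiretty globally.
Defaults:nagios !requiretty
Defaults:nrpe !requiretty

Cmnd_Alias MONITOR = /usr/lib64/nagios/plugins/alces/check_PSUs, \
                     /usr/lib64/nagios/plugins/alces/check_dirvish

nagios,nrpe    ALL=(ALL)       NOPASSWD: MONITOR
EOF

# visudo -c runs in check-only mode; -f points it at the scratch file
# instead of the live /etc/sudoers.
if command -v visudo >/dev/null 2>&1; then
    visudo -c -f "$rules"
else
    echo "visudo not found; skipping syntax check" >&2
fi
```

Once the check passes, copy the rules into the sudoers configuration on each monitored host using visudo.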

Enabling notifications

The final step, once you're happy that the checks are all working properly, is to enable live notifications. Only do this when you're prepared to have emails sent to you about any checks that return warnings or errors.

  • Set up an email forward so you can receive the emails. The default settings send emails to the root user of the server where Nagios is running, so add an email address in /root/.forward there if you want to forward the emails to your own address.
  • Enable notifications in the /etc/nagios/nagios.cfg file on the monitoring server. Restart Nagios to enable the change.
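The two steps above can be sketched as follows. This example assumes that notifications are controlled by the enable_notifications directive in nagios.cfg and that it is currently set to 0. It works on a scratch copy so that the substitution can be checked before touching the live file; the enable_notifications shell function and the email address are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: switch enable_notifications from 0 to 1 in a nagios.cfg-style file.
enable_notifications() {
    # $1: path to the file to edit in place (GNU sed, as shipped with RHEL).
    sed -i 's/^enable_notifications=0$/enable_notifications=1/' "$1"
}

# Try the edit on a scratch file first.
scratch=$(mktemp)
printf 'enable_notifications=0\nexecute_service_checks=1\n' > "$scratch"
enable_notifications "$scratch"
grep '^enable_notifications=' "$scratch"

# On the monitoring server, as root (paths from this guide):
#   echo 'you@example.com' > /root/.forward   # placeholder address
#   enable_notifications /etc/nagios/nagios.cfg
#   nagios -v /etc/nagios/nagios.cfg          # verify the configuration first
#   service nagios restart
```

Running nagios -v before the restart catches configuration errors that would otherwise stop the service from coming back up.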