Delay watchdog checks while any other nsm_sensor_ps script runs #21

petiepooo · 2018-05-09T19:14:07Z

This check pauses the --if-stale checks run by the watchdog cron job if it
detects any other nsm_sensor_ps script running. This prevents race conditions
such as when rule_update or daily_restart is starting/stopping processes, and
watchdog interferes, leaving an extra process running.

History:

I would often find an extra pcap_agent (and occasionally a snort alert process) running after the automated daily restarts and/or rule-update processes run. The most recent showed these symptoms:

$ ps -ef|grep [p]cap_agent
root 18533 1 0 12:04 ? 00:00:00 su - sguil -- /usr/bin/pcap_agent.tcl -c /etc/nsm/mysensor-em2/pcap_agent.conf
sguil 18535 18533 0 12:04 ? 00:00:00 tclsh /usr/bin/pcap_agent.tcl -c /etc/nsm/mysensor-em2/pcap_agent.conf
root 18567 1 0 12:04 ? 00:00:00 su - sguil -- /usr/bin/pcap_agent.tcl -c /etc/nsm/mysensor-em2/pcap_agent.conf
sguil 18569 18567 0 12:04 ? 00:00:00 tclsh /usr/bin/pcap_agent.tcl -c /etc/nsm/mysensor-em2/pcap_agent.conf

(relevant portion of watchdog.log)

Tue May 8 12:04:01 UTC 2018

stale PID file found, deleting!

stopping: pcap_agent (sguil) (not running)[ WARN ]

starting: pcap_agent (sguil)[ OK ]

Note that the timestamp in the logfile matches the time that the duplicate processes were created. That is four minutes after the sensor-newday cron job runs, and exactly when the nsm-watchdog cron job runs.

Solution:

In the lib-nsm-common-utils library, in the function that handles restarts of stale processes (when run with the --if-stale flag by the watchdog cron job), check whether any other nsm_sensor_ps-* script is running, and if so, pause, checking every second up to 30 times. If the other scripts end, continue with the --if-stale checks. If they're still running after 30 checks, abort.
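The wait-and-abort behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `wait_for_quiet` is a hypothetical name, and the check for other running scripts is passed in as a command because the detection method used by the patch is not shown in this description.

```shell
#!/bin/bash
# Hypothetical sketch of the polling loop; names are illustrative only.
# usage: wait_for_quiet CHECK_CMD [MAX_CHECKS] [INTERVAL_SECONDS]
# CHECK_CMD must return 0 while another process control script is running.
wait_for_quiet() {
    local check_cmd=$1 max_checks=${2:-30} interval=${3:-1} checks=0
    while "$check_cmd"; do
        checks=$((checks + 1))
        if [ "$checks" -ge "$max_checks" ]; then
            # Still busy after the last allowed check: give up.
            echo "aborting after $checks checks" >&2
            return 1
        fi
        sleep "$interval"
    done
    # Only mention the delay if we actually had to wait.
    if [ "$checks" -gt 0 ]; then
        echo "continuing after $checks checks"
    fi
    return 0
}
```

A caller would run the --if-stale checks only when `wait_for_quiet` returns 0, and skip them otherwise.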

Note that it reports the error using echo_msg_end, but it also echoes to stderr, which generates an email from cron to root if you have a mailer configured on the system. This is to alert on timeouts rather than leave them hidden in the watchdog.log file. At most, it will generate an email every 5 minutes, so there should not be an uncontrollable flood, and emails only ever go out if a mailer such as exim4 is installed and configured on the sensor.
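To illustrate why the extra echo to stderr matters: the watchdog cron job appends stdout to watchdog.log, so only output on stderr is left for cron to mail. The sketch below uses a hypothetical `report_abort` function in place of the patch's real calls (it does not use echo_msg_end, whose signature is not shown here).

```shell
#!/bin/bash
# Hypothetical stand-in for the patch's error reporting.
report_abort() {
    local msg=$1
    # stdout: captured by the cron job's ">> /var/log/nsm/watchdog.log" redirect
    echo "$msg [ FAIL ]"
    # stderr: not redirected by the cron job, so cron mails it if a mailer exists
    echo "$msg" >&2
}
```

Running `report_abort "some message" >> some.log` from a terminal shows the same split: the first line lands in the file and the second still appears on screen.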

Testing:

The patch author's recommended testing method:

1. Install SecurityOnion in advanced mode, with multiple snort IDS processes and multiple monitored interfaces if possible (as many as your memory/resources allow).
2. (optional) Increase the frequency of watchdog checks from every 5 minutes to every 1 minute by replacing '4,9,14,19,24,29,34,39,44,49,54,59' with '*' in /etc/cron.d/nsm-watchdog.
3. Follow the end of the watchdog.log file using 'tail -f /var/log/nsm/watchdog.log'.
4. In another window, continuously restart the sensor processes by running 'nsm_sensor_ps-restart' until a log entry shows that watchdog restarted a stale process.
5. Verify that there is an additional process, as identified in the logfile, started by the watchdog script (failure condition).
6. Install this patch.
7. Follow the end of the watchdog.log file using 'tail -f /var/log/nsm/watchdog.log'.
8. Restart the sensor processes until you see a watchdog.log warning along the lines of "waiting for other nsmnow process control script to stop (continuing after 2 checks)" (new feature: watchdog action delayed).
9. Pause a status check (nsm --sensor --status) by quickly pressing Ctrl-S (aka XOFF) while it is running, and wait for a log entry failure along the lines of "waiting for other nsmnow process control script to stop (aborting after 30 checks)", then resume by pressing Ctrl-Q (aka XON) (feature failsafe: watchdog action aborted).
10. (optional) If a mailer is installed, check for an email with the message body "nsm_sensor_ps process_restart_if_stale function call exited with error" (failsafe feature: stderr alert on fail).
11. Kill one or more sensor processes using the "kill" or "pkill" command (e.g. 'sudo pkill snort' and/or 'sudo pkill -f "tclsh /usr/bin/.*agent.tcl" ') and wait for a watchdog log entry showing all those processes have been restarted (routine action still works).
12. Check that all processes show OK after the watchdog restart by running 'nsm --all --status'.
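The check for an additional process in the steps above could be scripted rather than done by eye. The following is only a sketch: `find_duplicate_cmdlines` is a hypothetical helper, and it assumes that a duplicated sensor process shows up as two identical command lines, as in the pcap_agent listing under History.

```shell
#!/bin/bash
# Hypothetical helper: reads one command line per input line on stdin and
# prints each command line that occurs more than once.
find_duplicate_cmdlines() {
    sort | uniq -d
}

# Example use on a live sensor (commented out so the sketch has no side effects):
#   ps -eo args= | grep '[p]cap_agent' | find_duplicate_cmdlines
```

Empty output means no command line was seen twice. Any line printed is a candidate duplicate to compare against the watchdog.log entries.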


dougburks · 2018-05-10T14:54:01Z

Hi @petiepooo ,

Thanks for the pull request and detailed notes!

I'll review as time allows.

petiepooo · 2018-05-24T20:07:39Z

To follow up, I'm currently running this patch on my systems. Every couple of days, always at 12:04:01 UTC, in /var/log/nsm/watchdog.log*, I now see this:

Wed May 23 12:04:01 UTC 2018
^M * waiting for other nsmnow process control script to stop (continuing after 2 checks)[ WARN ]
^M * waiting for other nsmnow process control script to stop (continuing after 2 checks)[ WARN ]

And once in a while, I believe when someone is manually restarting the services just as the watchdog is triggered, I get this:

Thu May 24 07:19:01 UTC 2018
^M * waiting for other nsmnow process control script to stop (aborting after 30 checks)[ FAIL ]

When that happens, I also get an email (since I have exim4-daemon-light installed) like:

From: [email protected] (Cron Daemon)
To: [email protected]
Subject: Cron [email protected] ( date ; /usr/sbin/nsm_server_ps-restart --if-stale ; /usr/sbin/nsm_sensor_ps-restart --if-stale) >> /var/log/nsm/watchdog.log
Date: Thu, 24 May 2018 07:19:31 +0000

nsm_sensor_ps process_restart_if_stale function call exited with error

I no longer get duplicate snort or pcap_agent processes like I used to, which tells me it's working as planned.

I like the email as a failsafe to ensure the patch isn't completely breaking the watchdog. It would not be generated unless a mailer like exim4 is installed, so base installs would never even notice. If it's not needed, the echo call above the exit in the patch could be removed.

Alternatively, the loop count could be increased to more than 30 (but less than 240 or so lest the watchdog calls start overlapping) if you feel it should wait longer before giving up.

I'm really surprised this isn't a huge, oft-reported issue, as I've been experiencing it since I started using SecurityOnion three+ years ago. Yet, searching through https://groups.google.com/forum/#!forum/security-onion, I can't find anyone else talking about this. The biggest effect is when it happens to snort, and there are duplicate snort processes using the same config file and vying for access to the same alert file. That breaks things badly, as it starts dropping alerts.

dougburks mentioned this pull request Aug 10, 2018

NSM: Delay watchdog checks while any other nsm_sensor_ps script runs Security-Onion-Solutions/security-onion#1292

Closed

dougburks merged commit 8da9695 into Security-Onion-Solutions:master Oct 29, 2018

petiepooo deleted the fix-wd-race branch October 29, 2018 20:58
