This repository has been archived by the owner on Apr 19, 2021. It is now read-only.
Delay watchdog checks while any other nsm_sensor_ps script runs #21
This check pauses the --if-stale checks run by the watchdog cron job if it
detects any other nsm_sensor_ps script running. This prevents race conditions
such as when rule_update or daily_restart is starting/stopping processes, and
watchdog interferes, leaving an extra process running.
History:
I would often find an extra pcap_agent process (and occasionally an extra snort alert process) running after the automated daily restart and/or rule-update jobs had run. The most recent occurrence showed these symptoms:
(relevant portion of watchdog.log)
Note that the timestamp in the logfile matches the time that the duplicate processes were created. That is four minutes after the sensor-newday cron job runs, and exactly when the nsm-watchdog cron job runs.
Solution:
In the lib-nsm-common-utils library, the function that handles restarts of stale processes (when run with the --if-stale flag by the watchdog cron job) now checks whether any other nsm_sensor_ps-* script is running. If so, it pauses, rechecking every second up to 30 times. If the other scripts finish, it continues with the --if-stale checks. If they are still running after 30 checks, it aborts.
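For readers who want a feel for the wait loop described above, here is a minimal bash sketch. It is not the code from the patch: the function names `other_nsm_script_running` and `wait_for_other_nsm_scripts` are hypothetical, and detecting the other scripts by matching `nsm_sensor_ps-` in the `ps` output is an assumption about how such a check could be done.

```shell
#!/bin/bash
# Illustrative sketch only; function names are hypothetical and the
# detection method (matching "nsm_sensor_ps-" in ps output) is an assumption.

# Succeeds (returns 0) if some other nsm_sensor_ps-* script appears to be
# running. The current shell and its direct children (including the ps and
# awk processes of this pipeline) are excluded so the caller never matches
# itself.
other_nsm_script_running() {
    ps -eo pid=,ppid=,args= | awk -v self="$$" '
        $1 != self && $2 != self && /nsm_sensor_ps-/ { found = 1 }
        END { exit !found }'
}

# Poll until no other script is running.
#   $1: command used for the check (default: other_nsm_script_running)
#   $2: maximum number of checks   (default: 30)
#   $3: seconds between checks     (default: 1)
# Returns 0 once the check reports nothing running, 1 on timeout.
wait_for_other_nsm_scripts() {
    local check_cmd="${1:-other_nsm_script_running}"
    local max_tries="${2:-30}"
    local interval="${3:-1}"
    local tries=0

    while "$check_cmd"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 1    # still busy after max_tries checks: give up
        fi
        sleep "$interval"
    done
    return 0
}
```

A caller would run the --if-stale checks only when `wait_for_other_nsm_scripts` returns 0, and abort otherwise. Passing the check command as an argument is purely a convenience of this sketch so the loop can be exercised without real sensor processes.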
Note that it reports the error using echo_msg_end, and it also echoes the error to stderr, which makes cron send an email to root if you have a mailer configured on the system. This is so that timeouts raise an alert rather than sitting unnoticed in the watchdog.log file. It will generate at most one email every 5 minutes, so there should not be an uncontrollable flood, and emails only ever go out if a mailer such as exim4 is installed and configured on the sensor.
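The dual reporting can be pictured with the following bash sketch. It is only an illustration: `report_watchdog_timeout` is a hypothetical helper, and the plain append to a log file stands in for echo_msg_end, whose real behavior is not shown in this description. What the sketch relies on is standard cron behavior: any output a job produces is mailed to the job's owner when a mailer is available.

```shell
#!/bin/bash
# Illustrative sketch only; report_watchdog_timeout is a hypothetical helper
# and the log-file append is a stand-in for the library's echo_msg_end.

# Record a message in a log file and also write it to stderr.
#   $1: message text
#   $2: log file path (default: /dev/null, i.e. no log)
report_watchdog_timeout() {
    local msg="$1"
    local logfile="${2:-/dev/null}"

    # Timestamped entry for the log file.
    printf '%s %s\n' "$(date '+%F %T')" "$msg" >> "$logfile"

    # Copy on stderr: when run from cron, this output is what triggers the
    # email, provided a mailer is installed and configured.
    printf '%s\n' "$msg" >&2
}
```

A call such as `report_watchdog_timeout "other nsm_sensor_ps scripts still running, aborting" "$logfile"` would then leave one line in the log and one line on stderr, with `$logfile` being whatever path the caller uses for its log.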
Testing:
The patch author's recommended testing method: