Script to Monitor Nagios Logs – Detect Nagios Daemon Failure and Restart

by | Jul 18, 2016 | Nagios Core

This script handles the case where the Nagios daemon fails to start, or where Nagios silently stops sending alerts. When you check the log at “/usr/local/nagios/var/log/nagios.log” you might come across messages like "Caught SIGSEGV, shutting down". These messages need to be detected and acted on so that Nagios keeps working. The check runs as a cron job on the host, since Nagios itself cannot detect its own failure.

I have written a Bash script for this, which has been working successfully. Feel free to explore other options around the script; suggestions are always welcome.

Nagios Daemon Check Script

#!/bin/bash

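######################################  VARIABLES ############################################################
# Convert the epoch timestamps to local time, keep the "Caught ..." lines and store them as
# "Mon dd hh:mm:ss message". The day is padded to two characters so it lines up with the
# "%_d" format used below; tee writes the temp file and also fills NAGIOS_LOG for the report.
NAGIOS_LOG=$(perl -pe 's/(\d+)/localtime($1)/e' /usr/local/nagios/var/log/nagios.log | grep Caught | awk '{printf "%s %2s %s %s %s %s %s%s\n", $2, $3, $4, $6, $7, $8, $9, $10}' | tee /usr/local/nagios/var/log/tmp_log)

# Count the entries stamped within the last 60 minutes (plain string comparison on "Mon dd hh:mm").
NAGIOS_LOG_COUNT=$(awk -v d1="$(date --date="-60 min" "+%b %_d %H:%M")" -v d2="$(date "+%b %_d %H:%M")" '$0 > d1 && $0 < d2 || $0 ~ d2' /usr/local/nagios/var/log/tmp_log | wc -l)

SERVICE_NAG_COUNT=$(/etc/init.d/nagios status | grep -c running)
####################################### DEC END ##############################################################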

if [ "$NAGIOS_LOG_COUNT" -eq 0 ]; then
    echo "Nagios is running OK"
elif [ "$NAGIOS_LOG_COUNT" -ge 1 ]; then
    echo "Nagios Service Outage" >> /usr/local/nagios/var/nagios_service_check_log
    echo "=====================" >> /usr/local/nagios/var/nagios_service_check_log
    echo "$NAGIOS_LOG" >> /usr/local/nagios/var/nagios_service_check_log
    echo "## Restarting Nagios Service ##" >> /usr/local/nagios/var/nagios_service_check_log
    /etc/init.d/nagios restart >> /usr/local/nagios/var/nagios_service_check_log
    sleep 2

    ############# VARIABLE ###############################
    # Query the status again here: a value captured before the restart would be stale.
    SERVICE_NAG_COUNT=$(/etc/init.d/nagios status | grep -c running)
    SERVICE_NAG=$(/etc/init.d/nagios status | grep running)
    ######################################################

    if [ "$SERVICE_NAG_COUNT" -eq 1 ]; then
        # Append the result first, then mail the whole report and remove it.
        echo "OK - $SERVICE_NAG" >> /usr/local/nagios/var/nagios_service_check_log
        mail -s "NOTIFICATION - Nagios Service Outage" [email protected]/ < /usr/local/nagios/var/nagios_service_check_log && rm -f /usr/local/nagios/var/nagios_service_check_log
    else
        echo "CRITICAL - Nagios Service restart failed" >> /usr/local/nagios/var/nagios_service_check_log
        mail -s "CRITICAL - Nagios Service Outage - Escalation Needed" [email protected]/ < /usr/local/nagios/var/nagios_service_check_log && rm -f /usr/local/nagios/var/nagios_service_check_log
    fi
fi
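
The cron job mentioned in the introduction could look something like the entry below. This is only a sketch: the script path /usr/local/nagios/libexec/nagios_daemon_check.sh is a hypothetical location, not one given in this article, and the hourly schedule is an assumption chosen to roughly line up with the 60-minute window the script looks back over, so that the same log entry is not acted on by several consecutive runs.

```crontab
# Hypothetical crontab entry (path and schedule are assumptions).
# Runs the check at the top of every hour and discards the "Nagios is running OK" output.
0 * * * * /usr/local/nagios/libexec/nagios_daemon_check.sh > /dev/null 2>&1
```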

How does the above script work?

  • It starts by converting the UNIX timestamps in the Nagios log to human-readable form, grepping for the "Caught" messages, and formatting each result as “Month dd hh:mm:ss” followed by the message, so that the entries can be filtered for a specific time period (the NAGIOS_LOG line).
  • It then counts the errors logged during the last hour. If there are none, it echoes “Nagios is running OK”. If the error occurred at least once, it restarts the Nagios daemon and sends a notification saying whether the restart succeeded. You can send the email to your support department or to whoever is responsible for Nagios monitoring.
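
As one of the "other options" around the script, the time-window step can also be done on the raw epoch timestamps, without converting them to text first. The function below is a minimal sketch of that idea and not part of the script above; the name count_recent_matches and its three parameters are made up for illustration, and it assumes log lines of the form "[epoch] message".

```shell
#!/bin/bash

# Hypothetical helper (not part of the original script).
# Usage: count_recent_matches LOG_FILE SINCE_EPOCH PATTERN
# Prints how many lines match PATTERN and carry a timestamp >= SINCE_EPOCH.
count_recent_matches() {
    local log_file="$1" since="$2" pattern="$3"
    # Splitting on "[" and "]" puts the epoch timestamp of a line such as
    # "[1468836000] Caught SIGSEGV, shutting down..." into field 2.
    awk -F'[][]' -v since="$since" -v pat="$pattern" \
        '$0 ~ pat && $2 >= since { n++ } END { print n + 0 }' "$log_file"
}
```

It could be called as count_recent_matches /usr/local/nagios/var/log/nagios.log "$(date --date='-60 min' +%s)" 'Caught SIGSEGV'. Comparing numbers rather than formatted dates sidesteps the month and day ordering problems of string comparison. Passing a narrower pattern than plain "Caught" may also be worth considering, since Nagios Core typically logs a "Caught SIGTERM, shutting down..." line during an ordinary stop or restart.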

I hope this script has helped you. Please share it; feedback and suggestions for improvement are always welcome.
