This script monitors for failures where the Nagios daemon fails to start, or where Nagios silently stops sending alerts. When you check the log at “/usr/local/nagios/var/log/nagios.log” you might come across messages like "Caught SIGSEGV, shutting down".
These messages need to be monitored and acted on so that Nagios keeps working properly. The check runs from a cron job on the host, because Nagios itself cannot detect its own failure.
I have written a Bash script for this, which has been working successfully. Feel free to explore other options around the script; suggestions are always welcome.
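As a rough sketch of the cron side, the snippet below builds an hourly crontab entry, which matches the 60-minute window the script looks back over. The script path /usr/local/nagios/libexec/nagios_daemon_check.sh and the hourly schedule are assumptions for illustration only; use whatever location and interval suit your host.

```shell
#!/bin/bash
# Hypothetical location of the check script; adjust to wherever you saved it.
SCRIPT=/usr/local/nagios/libexec/nagios_daemon_check.sh

# Run at minute 0 of every hour (assumed schedule) and discard the script's output.
CRON_LINE="0 * * * * $SCRIPT >/dev/null 2>&1"

# Print the entry so it can be reviewed before installing it.
echo "$CRON_LINE"

# To install for the current user (left commented out on purpose):
# ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
```

The install line appends to the existing crontab rather than replacing it, so entries already present are kept.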
Nagios Daemon Check Script
#!/bin/bash
###################################### VARIABLES ############################################################
LOG=/usr/local/nagios/var/log/nagios.log
TMP_LOG=/usr/local/nagios/var/log/tmp_log
CHECK_LOG=/usr/local/nagios/var/nagios_service_check_log
MAILTO="root"   # placeholder recipient (not in the original script); change to your support address
# Convert the UNIX timestamps to readable dates and keep only the "Caught SIG..." lines
perl -pe 's/(\d+)/localtime($1)/e' "$LOG" | grep Caught | awk '{print $2" "$3" "$4" "$6" "$7" "$8" "$9$10}' > "$TMP_LOG"
# Keep the lines from the last 60 minutes (%-d gives an unpadded day, matching the awk output above)
NAGIOS_LOG=`awk -v d1="$(date --date="-60 min" "+%b %-d %H:%M")" -v d2="$(date "+%b %-d %H:%M")" '$0 > d1 && $0 < d2 || $0 ~ d2' "$TMP_LOG"`
NAGIOS_LOG_COUNT=`echo "$NAGIOS_LOG" | grep -c Caught`
####################################### DEC END ##############################################################
if [ "$NAGIOS_LOG_COUNT" -eq 0 ]; then
    echo "Nagios is running OK"
else
    echo "Nagios Service Outage" >> "$CHECK_LOG"
    echo "=====================" >> "$CHECK_LOG"
    echo "$NAGIOS_LOG" >> "$CHECK_LOG"
    echo "## Restarting Nagios Service ##" >> "$CHECK_LOG"
    /etc/init.d/nagios restart >> "$CHECK_LOG" 2>&1
    sleep 2
    ############# VARIABLE ###############################
    # Checked after the restart; "is running" is matched so that "nagios is not running" does not count
    SERVICE_NAG=`/etc/init.d/nagios status | grep "is running"`
    if [ -n "$SERVICE_NAG" ]; then
        echo "OK - $SERVICE_NAG" >> "$CHECK_LOG"
        mail -s "NOTIFICATION - Nagios Service Outage" "$MAILTO" < "$CHECK_LOG" && rm -f "$CHECK_LOG"
    else
        echo "CRITICAL - Nagios Service restart failed" >> "$CHECK_LOG"
        mail -s "CRITICAL - Nagios Service Outage - Escalation Needed" "$MAILTO" < "$CHECK_LOG" && rm -f "$CHECK_LOG"
    fi
fi
How does the above script work?
- It starts by converting the UNIX timestamps in the Nagios log to a human-readable format, greps for the error lines and reformats each match to start with “Month dd hh:mm:ss”, so that the result can be filtered for a specific time period (the NAGIOS_LOG line).
- It then filters those lines for errors in the last hour. If there are none, it echoes “Nagios is running OK”. If the error occurred at least once, the script restarts the Nagios daemon and sends a notification saying that the error occurred and whether the restart fixed it. You can send the email to your support department, for example, or to whoever is responsible for Nagios monitoring.
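The time-window step can also be done directly on the raw epoch timestamps, which avoids comparing date strings (string comparison can misorder entries around midnight or a month change). The following is only a sketch of that alternative, not the method used in the script above. The function name count_recent_crashes and its arguments are made up for illustration, and it assumes every log line starts with [epoch] as in the standard nagios.log format.

```shell
#!/bin/bash
# Hypothetical helper: count "Caught SIG..." lines newer than a given window.
#   $1 = path to the Nagios log
#   $2 = window in seconds (defaults to 3600, i.e. the last hour)
count_recent_crashes() {
    local log_file=$1
    local window=${2:-3600}
    local since=$(( $(date +%s) - window ))

    # Split each line on '[' and ']' so that $2 is the epoch timestamp,
    # then count matching lines whose timestamp falls inside the window.
    awk -F'[][]' -v since="$since" \
        '/Caught SIG/ && $2 >= since { n++ } END { print n + 0 }' "$log_file"
}
```

It could be called as count_recent_crashes /usr/local/nagios/var/log/nagios.log 3600, and the printed number used in place of NAGIOS_LOG_COUNT.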
I hope this script helps you. Please share it; feedback and suggestions for improvement are always welcome.