Script to Monitor Nagios Logs – Detect Nagios Daemon Failure and Restart

Jul 18, 2016 | Nagios Core

This script handles the case where the Nagios daemon fails to start, or where Nagios silently stops sending alerts. When you check the log at “/usr/local/nagios/var/log/nagios.log”, you might come across messages like "Caught SIGSEGV, shutting down". These messages need to be detected and acted on so that Nagios keeps working properly. The check runs from a cron job on the Nagios host, because Nagios itself cannot detect its own failure.
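As a rough sketch of the cron side, the entry below would run the check once an hour, which matches the 60-minute look-back window the script uses. The install path and file name are assumptions for illustration only; the post does not say where the script is stored or how often it is scheduled.

```shell
# Hypothetical install path for the check script (not given in the post)
CHECK_SCRIPT=/usr/local/nagios/libexec/nagios_daemon_check.sh

# Cron entry: minute 0 of every hour, output discarded (the script sends its own mail)
CRON_LINE="0 * * * * $CHECK_SCRIPT >/dev/null 2>&1"

# Print the entry so it can be pasted into root's crontab via `crontab -e`
echo "$CRON_LINE"
```

Running it more often than the look-back window would make the same log line trigger more than one restart, so the schedule and the window are best kept in step.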

I have written a script in Bash that has been working successfully. Feel free to explore other options around the script; suggestions are always welcome.

Nagios Daemon Check Script

#!/bin/bash

######################################  VARIABLES ############################################################
LOG_DIR=/usr/local/nagios/var/log
CHECK_LOG=/usr/local/nagios/var/nagios_service_check_log
MAIL_TO=admin@howtovmlinux.com

# Convert the UNIX timestamps to human-readable dates, keep only the "Caught ..." lines and
# reformat them as "Mon dd hh:mm:ss message" (day padded to two characters, like date's %_d)
perl -pe 's/(\d+)/localtime($1)/e' "$LOG_DIR/nagios.log" | grep Caught | awk '{printf "%s %2d %s %s %s %s %s%s\n", $2, $3, $4, $6, $7, $8, $9, $10}' > "$LOG_DIR/tmp_log"

# "Caught" lines from the last 60 minutes (string comparison on the date prefix)
NAGIOS_LOG=$(awk -v d1="$(date --date="-60 min" "+%b %_d %H:%M")" -v d2="$(date "+%b %_d %H:%M")" '$0 > d1 && $0 < d2 || $0 ~ d2' "$LOG_DIR/tmp_log")

NAGIOS_LOG_COUNT=$(printf '%s\n' "$NAGIOS_LOG" | grep -c Caught)
####################################### DEC END ##############################################################

if [ "$NAGIOS_LOG_COUNT" -eq 0 ]; then
    echo "Nagios is running OK"
else
    {
        echo "Nagios Service Outage"
        echo "====================="
        echo "$NAGIOS_LOG"
        echo "## Restarting Nagios Service ##"
        /etc/init.d/nagios restart
    } >> "$CHECK_LOG"

    sleep 2

    ############# VARIABLE ###############################
    # Read the status only after the restart; drop a possible "not running" line
    SERVICE_NAG=$(/etc/init.d/nagios status | grep running | grep -v "not running")
    ######################################################

    if [ -n "$SERVICE_NAG" ]; then
        echo "OK - $SERVICE_NAG" >> "$CHECK_LOG"
        mail -s "NOTIFICATION - Nagios Service Outage" "$MAIL_TO" < "$CHECK_LOG" && rm -f "$CHECK_LOG"
    else
        echo "CRITICAL - Nagios Service restart failed" >> "$CHECK_LOG"
        mail -s "CRITICAL - Nagios Service Outage - Escalation Needed" "$MAIL_TO" < "$CHECK_LOG" && rm -f "$CHECK_LOG"
    fi
fi
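The script waits a fixed `sleep 2` before checking whether the restart worked. On a slow host the daemon may need longer, so one possible variation is to poll the status a few times before declaring failure. The helper below is a minimal sketch of that idea; `wait_for_running` is a hypothetical name, and it assumes the status command prints a line containing "running" when the daemon is up.

```shell
# wait_for_running TRIES DELAY STATUS_CMD [ARGS...]
# Runs the status command up to TRIES times, DELAY seconds apart. Returns 0 as soon
# as its output has a "running" line that is not "not running", otherwise returns 1.
wait_for_running() {
    local tries=$1 delay=$2 attempt=0
    shift 2
    while [ "$attempt" -lt "$tries" ]; do
        if "$@" 2>/dev/null | grep running | grep -qv "not running"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Example use after the restart (same status command as the script):
# wait_for_running 5 2 /etc/init.d/nagios status || echo "CRITICAL - Nagios Service restart failed"
```

Passing the status command as arguments keeps the helper independent of the init system, so it can be tried against a stub command before pointing it at the real service.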

How does the Nagios daemon check script work?

  • It starts by converting the UNIX timestamps in the Nagios log to a human-readable format and grepping for the “Caught” error lines. The output is then reformatted to start with “Month dd hh:mm:ss” so that it can be filtered for a specific time period (the first command in the script, which writes tmp_log).
  • Then it filters those lines down to the last hour. If there are no errors, it echoes “Nagios is running OK”. If the error occurred at least once, it restarts the Nagios daemon and sends a notification saying whether the restart succeeded or needs escalation. You can send the email to your support department or to whoever is responsible for Nagios monitoring.
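The last-hour filter described above compares formatted date strings, which depends on how the day is padded and does not order correctly across a month boundary ("Aug" sorts before "Jul"). As an alternative sketch, not the method used in the script, the comparison can be done on the raw UNIX timestamps instead. It assumes every log line starts with `[epoch]`, as in the Nagios log; `count_recent_caught` is a hypothetical helper name and the sample lines are made up for illustration.

```shell
# count_recent_caught LOGFILE NOW_EPOCH WINDOW_SECONDS
# Prints how many "Caught" lines have a timestamp inside the window ending at NOW_EPOCH.
count_recent_caught() {
    awk -v now="$2" -v window="$3" '
        /Caught/ {
            ts = substr($1, 2) + 0      # "[1468800000]" -> 1468800000
            if (ts >= now - window && ts <= now) n++
        }
        END { print n + 0 }
    ' "$1"
}

# Demo on made-up sample lines
sample=$(mktemp)
printf '%s\n' \
    '[1468796000] Caught SIGSEGV, shutting down...' \
    '[1468801000] Caught SIGSEGV, shutting down...' \
    '[1468803000] SERVICE ALERT: web01;HTTP;CRITICAL' \
    > "$sample"
count_recent_caught "$sample" 1468803600 3600   # prints 1
rm -f "$sample"
```

Against the real log this would be called as `count_recent_caught /usr/local/nagios/var/log/nagios.log "$(date +%s)" 3600`, with no temporary file or date formatting involved.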

I hope this script has helped you. Please share it; feedback for improvements is always welcome.
