Invoking Cfengine from Nagios
21 Nov 2012
Nagios and Cf3 each have their strengths:
- Nagios has nicely-encapsulated checks for lots of different things, and I'm quite familiar with it.
- Cfengine is a nice way of sanely ensuring things are the way we want them to be (i.e., without running amok and restarting something an infinite number of times).
Nagios plugins, frankly, are hard to duplicate in Cfengine. Check out this Cf3 implementation of a web server check:
bundle agent check_tcp_response {
  vars:
    "read_web_srv_response" string => readtcp("php.net", "80", "GET /manual/en/index.php HTTP/1.1$(const.r)$(const.n)Host: php.net$(const.r)$(const.n)$(const.r)$(const.n)", 60);
  classes:
    "expectedResponse" expression => regcmp(".*200 OK.*\n.*", "$(read_web_srv_response)");
  reports:
    !expectedResponse::
      "Something is wrong with php.net - see for yourself: $(read_web_srv_response)";
}
That simply does not compare with this Nagios stanza:
define service{
    use                  local-service    ; Name of service template to use
    hostgroup_name       http-servers
    service_description  HTTP
    check_command        check_http
}
define command{
    command_name  check_http
    command_line  $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
}
My idea, which I totally stole from this article, was to invoke Cfengine from Nagios when necessary, and let Cf3 restart the service. Example: I've got this one service that monitors a disk array for faults. It's flaky, and needs to be restarted when it stops responding. I've already got a check for the service in Nagios, so I added an event handler:
define service{
    use                  local-service    ; Name of service template to use
    host_name            diskarray-mon
    service_description  diskarray-mon website
    check_command        check_http!-H diskmon.example.com -S -u /login.html
    event_handler        invoke_cfrunagent
}
define command{
    command_name  invoke_cfrunagent
    command_line  $USER2$/invoke_cfrunagent.sh -n "$SERVICEDESC$" -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -a $HOSTADDRESS$
}
Leaving out some getopt() stuff, invoke_cfrunagent.sh looks like this:
# Convert "diskarray-mon website" to "diskarray-mon_website":
SVC=${SVC// /_}
STATE="nagios_$STATE"
TYPE="nagios_$TYPE"
# Debugging
echo "About to run sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE" | /usr/bin/logger
# We allow this in sudoers:
sudo /var/cfengine/bin/cf-runagent -D "$SVC" -D "$STATE" -D "$TYPE"
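The getopt() part that was left out might look roughly like the following. This is a sketch, not the original code: only the flag letters (-n, -s, -t, -a) are taken from the Nagios command_line above, while the function name parse_args and the ADDR variable are assumptions made for illustration.

```shell
# Sketch of the omitted option parsing (not the original script).
# Flag letters match the Nagios command_line; parse_args and ADDR are
# hypothetical names.
parse_args() {
    OPTIND=1                      # reset so the function can be called again
    SVC="" STATE="" TYPE="" ADDR=""
    while getopts "n:s:t:a:" opt; do
        case "$opt" in
            n) SVC=$OPTARG ;;     # $SERVICEDESC$
            s) STATE=$OPTARG ;;   # $SERVICESTATE$
            t) TYPE=$OPTARG ;;    # $SERVICESTATETYPE$
            a) ADDR=$OPTARG ;;    # $HOSTADDRESS$
            *) echo "usage: invoke_cfrunagent.sh -n desc -s state -t type -a addr" >&2
               return 1 ;;
        esac
    done
}
```

Calling parse_args "$@" || exit 1 at the top of the script would fill in SVC, STATE and TYPE before the lines shown above run.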
cf-runagent is a request, not an order, to the running cf-serverd process to fulfill already-configured promises; it's like saying "If you don't mind, could you please run now?"
Finally, this was to be detected in Cf3 like so:
methods:
  diskarray-mon_website.nagios_CRITICAL.nagios_HARD::
    "Restart the diskarray monitoring service" usebundle => restart_diskarray_monitor();
(This stanza is in a bundle that I know is called on the disk array monitor.)
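The same CRITICAL-and-HARD condition could also be tested in the wrapper script, before Cfengine is contacted at all, since Nagios calls event handlers for soft state changes and recoveries too. A minimal sketch, assuming the raw $SERVICESTATE$ and $SERVICESTATETYPE$ values are passed in before the nagios_ prefix is added; should_invoke is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only for the CRITICAL + HARD combination,
# mirroring the class expression used in the Cf3 methods stanza.
should_invoke() {
    state=$1        # raw $SERVICESTATE$, e.g. OK, WARNING, CRITICAL
    state_type=$2   # raw $SERVICESTATETYPE$, SOFT or HARD
    [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]
}
```

The wrapper could then bail out early with something like should_invoke "$STATE" "$TYPE" || exit 0. Keeping the test only in Cfengine, as above, works just as well; this merely avoids a pointless cf-runagent call.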
Here's what works:
- If I run cf-agent -- not cf-runagent -- with those args ("-D diskarray-mon_website -D nagios_CRITICAL -D nagios_HARD"), it'll run the restart script.
What doesn't work:
- Running cf-runagent, either as root or as nagios. It seems to stop after analyzing classes and such, and not actually do anything. I'm probably misunderstanding how cf-runagent is meant to work.
- Nagios will only run an event handler when things change -- not all the time until things get better. That means that if the first attempt by Cf3 to restart doesn't work, for whatever reason, it won't get run again.
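The one-shot problem in the last point could be softened by retrying inside the handler itself. The following is only a sketch: retry, its arguments, and the idea of wrapping the restart command are all assumptions, and it relies on the wrapped command exiting non-zero on failure, which may not hold for cf-runagent.

```shell
# Hypothetical helper: retry MAX DELAY CMD...
# Runs CMD up to MAX times, sleeping DELAY seconds after each failure.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

Something like retry 3 60 some_restart_command would then try three times, a minute apart, before giving up. A more robust variant would re-run the Nagios check between attempts instead of trusting the exit status.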
What might work better is using this Cf3 wrapper for Nagios plugins (which I think is the same approach, or possibly code, discussed in this mailing list post).
Anyhow... this is a sort of half-assed attempt, made in one morning, to get something working. Not there yet.