First day back at $WORK after the winter break yesterday, and some...interesting...things. Like finding out about the service that didn't come back after a power outage three weeks ago. Fuck. Add the check to Nagios, bring it up; when the light turns green, the trap is clean.
Or when I got a page about a service that I recognized as having something, vaguely, to do with a webapp we monitor, but had no real recollection of what it does or why it's important. Go talk to my boss, find out he's restarted it and it'll be up in a minute, get the 25-word version of what it does, add him to the contact list for that service, and add the info to the documentation.
I start to think about how to include a link to documentation in Nagios alerts, and a quick search turns up "Default monitoring alerts are awful", a blog post by Jeff Goldschrafe about just this. His approach looks damned cool, and I'm hoping he'll share how he does this. Inna meantime, there's the Nagios config options "notes", "notes_url" and "action_url", which I didn't know about. I'll start adding stuff to the Nagios config. (Which really makes me wish I had a way of generating Nagios config...sigh. Maybe NConf?)
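Something like this, say; the host and wiki URL are made up, but notes and notes_url are stock Nagios service directives:

define service{
        use                     local-service
        host_name               webapp01
        service_description     mystery-helper
        check_command           check_http
        notes                   Boss knows this one, safe to restart
        notes_url               https://wiki.example.com/services/mystery-helper
        }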
But also on Jeff's blog I found a post about Kaboli, which lets you interact with Nagios/Icinga through email. That's cool. Repo here.
Planning. I want to do something better with planning. I've got RT to catch problems as they emerge, and track them to completion. Combined with orgmode, it's pretty good at giving me a handy reference for what I'm working on (RT #666) and having the whole history available. What it's not good at is big-picture planning...everything is just a big list of stuff to do, not sorted by priority or labelled by project, and it's a big intimidating mess. I heard about Kanban when I was at LISA this year, and I want to give it a try...not sure if it's exactly right, but it seems close.
And then I came across Behaviour-driven infrastructure through Cucumber, a blog post from Lindsay Holmwood. Which is damn cool, and about which I'll write more another time. Which led to the Github repo for a cucumber/nagios plugin, and reading more about Cucumber, and behaviour-driven development versus test-driven development (hint: they're almost exactly the same thing).
My god, it's full of stars.
So yesterday I got an email from another sysadmin: "Hey, looks like there's a lot of blocked connections to your server X. Anything happening there?" Me: "Well, I moved email from X to Y on Tuesday...but I changed the MX to point at Y. What's happening there?"
Turns out I'd missed a fucking domain: I'd left its MX pointing at the old server instead of moving it to the new one. And when I turned off the mail server on the old machine, delivery for that domain stopped. Fortunately I was able to get things going again: I changed the MX to point at the new server, and turned the old server back on to handle things until the new record propagated.
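For next time: checking this takes seconds with dig (example.org standing in for the real domain, ns1.example.org for its authoritative server):

# What the world currently sees:
dig +short MX example.org

# Ask the authoritative nameserver directly, to verify the new record
# without waiting for caches to expire:
dig +short MX example.org @ns1.example.org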
So how in hell did this happen? I can see two things I did wrong:
Poor planning: my plans and checklists included all the steps I needed to do, but did not mention the actual domains being moved. I relied on memory, which meant I remembered (and tested) two and forgot the third. I should have included the actual domains: both a note to check the settings and a test of email delivery.
No email delivery check by Nagios: Nagios checks that the email server is up, displays a banner and so on, but does not check actual email delivery for the domains I'm responsible for. There's a plugin for that, of course, and I'm going to be adding that.
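Whatever plugin I end up with, the idea is simple enough to sketch in shell: send a uniquely tagged probe message and make sure it actually lands. (Everything below, the address, mailbox path and timing, is made up for illustration.)

#!/bin/sh
# Rough end-to-end delivery probe, following Nagios plugin exit codes.
DOMAIN="example.org"
TAG="probe-$$-$(date +%s)"
MAILBOX="/var/mail/nagiostest"

echo "delivery probe" | mail -s "$TAG" "nagiostest@$DOMAIN"
sleep 30

if grep -q "$TAG" "$MAILBOX"; then
    echo "MAIL OK: probe $TAG delivered for $DOMAIN"
    exit 0
else
    echo "MAIL CRITICAL: probe $TAG not delivered for $DOMAIN"
    exit 2
fi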
I try to make a point of writing down things that go bad at $WORK, along with things that go well. This is one of those things.
Nagios and Cf3 each have their strengths:
Nagios plugins, frankly, are hard to duplicate in Cfengine. Check out this Cf3 implementation of a web server check:
bundle agent check_tcp_response {
  vars:
    "read_web_srv_response"
      string => readtcp("php.net", "80",
        "GET /manual/en/index.php HTTP/1.1$(const.r)$(const.n)Host: php.net$(const.r)$(const.n)$(const.r)$(const.n)",
        60);

  classes:
    "expectedResponse" expression => regcmp(".*200 OK.*\n.*", "$(read_web_srv_response)");

  reports:
    !expectedResponse::
      "Something is wrong with php.net - see for yourself: $(read_web_srv_response)";
}
That simply does not compare with this Nagios stanza:
define service{
        use                     local-service   ; Name of service template to use
        hostgroup_name          http-servers
        service_description     HTTP
        check_command           check_http
        }
define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
        }
My idea, which I totally stole from this article, was to invoke Cfengine from Nagios when necessary, and let Cf3 restart the service. Example: I've got this one service that monitors a disk array for faults. It's flaky, and needs to be restarted when it stops responding. I've already got a check for the service in Nagios, so I added an event handler:
define service{
        use                     local-service   ; Name of service template to use
        host_name               diskarray-mon
        service_description     diskarray-mon website
        check_command           check_http!-H diskmon.example.com -S -u /login.html
        event_handler           invoke_cfrunagent
        }
define command{
        command_name    invoke_cfrunagent
        command_line    $USER2$/invoke_cfrunagent.sh -n "$SERVICEDESC$" -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -a $HOSTADDRESS$
        }
Leaving out some getopt() stuff, invoke_cfrunagent.sh looks like this:
# Convert "diskarray-mon website to disarray-mon_website":
SVC=${SVC/ /_}
STATE="nagios_$STATE"
TYPE="nagios_$TYPE"
# Debugging
echo "About to run sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE" | /usr/bin/logger
# We allow this in sudoers:
sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE
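The sudoers side is a one-liner (assuming Nagios runs the handler as the nagios user, which is typical):

# /etc/sudoers.d/nagios: let the nagios user kick off cf-runagent
nagios ALL=(root) NOPASSWD: /var/cfengine/bin/cf-runagent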
cf-runagent is a request, not an order: it asks the running cf-serverd process to fulfill its already-configured promises. It's like saying "If you don't mind, could you please run now?"
Finally, this was to be detected in Cf3 like so:
methods:
  diskarray-mon_website.nagios_CRITICAL.nagios_HARD::
    "Restart the diskarray monitoring service"
      usebundle => restart_diskarray_monitor;
(This stanza is in a bundle that I know is called on the disk array monitor.)
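(restart_diskarray_monitor itself would be nothing fancy, just a commands promise along these lines; the init script path is hypothetical, and in_shell comes from the standard library:)

bundle agent restart_diskarray_monitor {
  commands:
    # Hypothetical path: whatever actually restarts the flaky monitor.
    "/etc/init.d/diskarray-mon restart"
      contain => in_shell;
}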
Here's what works: the Nagios side of things. The check fails, the event handler fires, and invoke_cfrunagent.sh gets run with the right arguments (the logger line shows up in syslog).
What doesn't work:
Running cf-runagent, either as root or as nagios: it seems to stop after analyzing classes and such, and not actually do anything. I'm probably misunderstanding how cf-runagent is meant to work.
Nagios will only run an event handler when a service's state changes -- not repeatedly until things get better. That means that if Cf3's first attempt at a restart doesn't work, for whatever reason, the handler won't run again.
What might work better is using this Cf3 wrapper for Nagios plugins (which I think is the same approach, or possibly code, discussed in this mailing list post).
Anyhow...This is a sort of half-assed attempt in a morning to get something working. Not there yet.
Ran into a problem at $WORK today when trying to use the Nagios plugin check_mysql_query. I wanted to verify that a table had > 0 rows. The configuration looked like this:
define command{
        command_name    check_table_size
        command_line    $USER1$/check_mysql_query -q "select count(*) from example;" -H $HOSTADDRESS$ $ARG1$
        }
define service{
        host_name               foo
        service_description     Table size
        check_command           check_table_size!-u nagios -p password -d database -w 1:1000000000 -c 1:1000000000
        }
I could run it fine from the command line, but Nagios kept getting CRITICAL, and the only output it had was "(null)". I turned on debugging, and kept seeing this:
Done. Final output: '/usr/lib64/nagios/plugins/check_mysql_query -q 'select count(*) from example'
I finally figured out that the semicolon in the -q argument was messing things up: Nagios treats semicolons in its config files as the start of a comment, so everything after it was being silently dropped. Removing it and restarting Nagios fixed the problem.
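For the record, the working definition is the same stanza minus the semicolon (I've also seen it said that \; escapes a literal semicolon in Nagios config, though I haven't tested that):

define command{
        command_name    check_table_size
        command_line    $USER1$/check_mysql_query -q "select count(*) from example" -H $HOSTADDRESS$ $ARG1$
        }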
Two things I should remember: semicolons act as comment characters in Nagios config files, and turning on debugging is the quickest way to see the exact command Nagios is actually running.
icli, by the way, works well enough with plain Nagios:

icli -c /var/spool/nagios/objects.cache -f /var/spool/nagios/status.dat
(assuming you've got the Nagios files in those locations). And by "well enough", I mean that it will show the current state of all services.