So one thing that's been hanging around in my mailbox for (checks mailbox) good God, three weeks now is an exchange I had with Jess Males (aka Hefeweizen on IRC, which is a damn good name). He wrote to me about my Infrastructure Code Testing BoF, and asked:
I don't see notes to this effect, so I'll ask: what's the difference between monitoring and test-driven infrastructure? Monitoring started as a way to tell us that the infrastructure we need is available and operating as expected. Test-driven infrastructure serves the role of verifying that the environment we're describing (in code) is being implemented, and thus, operating as expected. Those sound awfully similar to me. Is there a nuance that flies over my head?
Before I insert my uninformed opinions, I'll point you to an excellent article from this year's SysAdvent by Yvonne Lam called How To Talk About Monitors, Tests, and Diagnostics. But if you're still interested, here goes...
First, often (though not always) it's a pain in the ass to point already-existing monitoring at possibly ephemeral VMs, dev machines and the like. Just think of the pain involved in adding a new machine + a bunch of services AND remembering to disable alerts while you do so. Not to say it can't be done, just that it's a source of friction, which means it's less likely to be done.
Second, there are often times when we're building something new, and we don't have already-existing monitoring to point at it. Case in point: I recently set up RabbitMQ at work; this was new to us, and I was completely unfamiliar with it. The tests I added can go on to form the basis of new monitoring, but they emerged from my desire to get familiar with it.
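To give a flavour, here's a minimal serverspec-style sketch of the sort of thing I mean (not my actual tests; the package and service names are the stock RabbitMQ defaults, and the spec path is made up):

# spec/rabbit01/rabbitmq_spec.rb -- hypothetical host directory
require 'spec_helper'

describe package('rabbitmq-server') do
  it { should be_installed }
end

describe service('rabbitmq-server') do
  it { should be_enabled }
  it { should be_running }
end

# 5672 is the default AMQP listener port.
describe port(5672) do
  it { should be_listening }
end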
Third, these tests were also about getting familiar with RabbitMQ (and Puppet, which is new to me), and doubtless there are some things in there that will not be needed for monitoring. These are valuable to have in testing, but don't always need to be kept around.
I fully stipulate that monitoring, as often implemented, falls woefully short of our ideal. More often than not, monitoring is a ping check or a port check. Our test-driven environment should check for page load times or members behind a load-balancer, or &c. If what we really want is better, more accurate environment measurement, then know there's a refreshing reimagination of monitoring with #monitoringlove. If they're already marching in our direction, let's join ranks.
True story. I've shot myself in the foot more times than I care to remember by, for example, testing that port 80's open without checking the content coming back.
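In serverspec terms, the difference is something like this (the URL and the expected string are placeholders):

# Port 80 being open only proves that something answered.
describe port(80) do
  it { should be_listening }
end

# Actually fetch the page and look for something meaningful in it.
describe command('curl -s http://localhost/') do
  its(:exit_status) { should eq 0 }
  its(:stdout) { should match /Welcome to Example App/ }
end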
Now that I've said this, I think I'm starting to answer my own question of TDI (test-driven infrastructure) vs monitoring. I begin to see these points: write the tests first (duh, devs have been saying this for years), and better generation of tests/monitoring artifacts (ideally, automatic).
Test first: yes, particularly when starting with a new technology (see above re: RabbitMQ). Also, in theory you can rip stuff out and try something else in its place (think nginx vs Apache); if the tests still pass, you're golden. Still missing: better test generation. I'd love something that ate serverspec tests and spat out Nagios configs; even as a first draft of the tests, it'd be valuable.
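As far as I know, nothing like that exists. Just to make the idea concrete, here's an entirely hypothetical back-of-the-napkin sketch: walk the spec files, grep for port checks, and emit Nagios service stanzas. A real tool would parse the Ruby properly instead of regexing it.

#!/usr/bin/env ruby
# Toy sketch only: turn serverspec port checks into Nagios service
# definitions. Hypothetical; handles nothing but the dumbest case.
Dir.glob('spec/**/*_spec.rb').each do |file|
  host = File.basename(File.dirname(file))   # spec/<host>/foo_spec.rb
  File.read(file).scan(/describe port\((\d+)\)/).each do |(port)|
    puts <<-EOF
define service {
    use                 generic-service
    host_name           #{host}
    service_description TCP port #{port}
    check_command       check_tcp!#{port}
}
    EOF
  end
end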
I've tripped over this error a few times; time to write it down.
A few times now, I've run serverspec-init, added a couple of tests, then had the first rake fail like so:
Circular dependency detected: TOP => default => spec => spec:all =>
spec:default => spec:all
Tasks: TOP => default => spec => spec:all => spec:default
(See full trace by running task with --trace)
Turns out that this is a known problem in Serverspec, but it's not exactly a bug. The problem appears to be that some part of the Vagrantfile I'm using is named "default". The original reporter said it was the hostname, but I'm not sure I have that in mine. In any case, this causes problems with the Rakefile: the target is default, but that also matches the hostname, and so it's circular and Baby Jesus cries.
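My best guess about where the "default" comes from: Vagrant calls a machine "default" unless you name it explicitly, and serverspec-init names the spec directory after the machine, which is what produces the spec:default task. If that's right, naming the VM in the Vagrantfile should sidestep the whole thing. An untested sketch (box and machine names are placeholders):

# Vagrantfile
Vagrant.configure('2') do |config|
  config.vm.box = 'precise64'            # placeholder box
  # Naming the machine means serverspec-init should create
  # spec/testbox/ rather than spec/default/, so no clash.
  config.vm.define 'testbox' do |node|
    node.vm.hostname = 'testbox'
  end
end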
(Side rant: I really wish the Serverspec project would use a proper bug tracker, rather than just having everything in pull requests. Grrr.)
One way around this is to change the Rakefile itself. Open it up and look for this part:
namespace :spec do
  targets = []
  Dir.glob('./spec/*').each do |dir|
    next unless File.directory?(dir)
    targets << File.basename(dir)
  end
  task :all => targets
  task :default => :all
Comment out that last line, task :default => :all:
namespace :spec do
  targets = []
  Dir.glob('./spec/*').each do |dir|
    next unless File.directory?(dir)
    targets << File.basename(dir)
  end
  task :all => targets
  # task :default => :all
Problem solved (though probably in a fairly hacky way...)
First day back at $WORK after the winter break yesterday, and some...interesting...things. Like finding out about the service that didn't come back after a power outage three weeks ago. Fuck. Add the check to Nagios, bring it up; when the light turns green, the trap is clean.
Or when I got a page about a service that I recognized as having something, somehow, to do with a webapp we monitor, but no real recollection of what it does or why it's important. Go talk to my boss, find out he's restarted it and it'll be up in a minute, get the 25-word version of what it does, add him to the contact list for that service and add the info to the documentation.
I start to think about how to include a link to documentation in Nagios alerts, and a quick search turns up "Default monitoring alerts are awful", a blog post by Jeff Goldschrafe about just this. His approach looks damned cool, and I'm hoping he'll share how he does this. Inna meantime, there are the Nagios config options "notes", "notes_url" and "action_url", which I didn't know about. I'll start adding stuff to the Nagios config. (Which really makes me wish I had a way of generating Nagios config...sigh. Maybe NConf?)
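For future me, here's roughly what that looks like in a service definition; the host, check, and wiki URLs are all placeholders:

define service {
    use                 generic-service
    host_name           webapp01
    service_description mystery-backend
    check_command       check_tcp!8080
    ; These show up as notes/links in the Nagios web UI:
    notes               Ask the boss before restarting this one.
    notes_url           https://wiki.example.com/mystery-backend
    action_url          https://wiki.example.com/mystery-backend/runbook
}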
But also on Jeff's blog I found a post about Kaboli, which lets you interact with Nagios/Icinga through email. That's cool. Repo here.
Planning. I want to do something better with planning. I've got RT to catch problems as they emerge, and track them to completion. Combined with orgmode, it's pretty good at giving me a handy reference for what I'm working on (RT #666) and having the whole history available. What it's not good at is big-picture planning...everything is just a big list of stuff to do, not sorted by priority or labelled by project, and it's a big intimidating mess. I heard about Kanban when I was at LISA this year, and I want to give it a try...not sure if it's exactly right, but it seems close.
And then I came across Behaviour-driven infrastructure through Cucumber, a blog post from Lindsay Holmwood. Which is damn cool, and about which I'll write more another time. Which led to the Github repo for a cucumber/nagios plugin, and reading more about Cucumber, and behaviour-driven development versus test-driven development (hint: they're almost exactly the same thing).
My god, it's full of stars.