It's never gonna happen unless I do it now

So one thing that's been hanging around in my mailbox for (checks mailbox) good God, three weeks now is an exchange I had with Jess Males (aka Hefeweizen on IRC, which is a damn good name). He wrote to me about my Infrastructure Code Testing BoF, and asked:

I don't see notes to this effect, so I'll ask: what's the difference between monitoring and test-driven infrastructure? Monitoring started as a way to tell us that the infrastructure we need is available and operating as expected. Test-driven infrastructure serves the role of verifying that the environment we're describing (in code) is being implemented, and thus, operating as expected. Those sound awfully similar to me. Is there a nuance that flies over my head?

Before I insert my uninformed opinions, I'll point you to an excellent article from this year's SysAdvent by Yvonne Lam called How To Talk About Monitors, Tests, and Diagnostics. But if you're still interested, here goes...

First, often (though not always) it's a pain in the ass to point already-existing monitoring at possibly ephemeral VMs, dev machines and the like. Just think of the pain involved in adding a new machine + a bunch of services AND remembering to disable alerts while you do so. Not to say it can't be done, just that it's a source of friction, which means it's less likely to be done.

Second, there are often times when we're building something new, and we don't have already-existing monitoring to point at it. Case in point: I recently set up RabbitMQ at work; this was new to us, and I was completely unfamiliar with it. The tests I added can go on to form the basis of new monitoring, but they emerged from my desire to get familiar with it in the first place.
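For flavour, here's roughly the kind of first smoke test I mean, sketched in Python rather than in serverspec. The hostname and credentials are placeholders, it assumes the RabbitMQ management plugin is enabled on its default port (15672), and it is not the actual test I wrote at work.

```python
#!/usr/bin/env python
# Toy smoke test for a RabbitMQ node: is the broker port open, and does the
# management API answer sanely? Hostname and credentials are placeholders,
# and the management plugin is assumed to be enabled on its default port.
import base64
import json
import socket
import urllib.request

HOST = "rabbit01.example.com"      # placeholder hostname
USER, PASSWORD = "guest", "guest"  # default credentials; replace in real life


def amqp_port_is_open(host, port=5672, timeout=5):
    """Check that something is listening on the AMQP port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def management_api_overview(host, port=15672):
    """Fetch /api/overview from the management plugin and return it as a dict."""
    req = urllib.request.Request("http://%s:%d/api/overview" % (host, port))
    token = base64.b64encode(("%s:%s" % (USER, PASSWORD)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    assert amqp_port_is_open(HOST), "nothing listening on 5672"
    overview = management_api_overview(HOST)
    # A running node reports its version; that's enough for a first smoke test.
    assert "rabbitmq_version" in overview, "management API gave an odd answer"
    print("rabbit looks alive: version", overview["rabbitmq_version"])
```

It's crude, but writing even that much forces you to learn where the broker listens and what the management API can tell you, which was half the point.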

Third, these tests were also about getting familiar with RabbitMQ (and with Puppet, which is new to me), and doubtless some of what's in there will never be needed for monitoring. Those bits are valuable to have while testing, but they don't all need to be kept around afterward.

I fully stipulate that monitoring, as often implemented, falls woefully short of our ideal. More often than not, monitoring is a ping check or a port check. Our test-driven environment should check for page load times, or members behind a load-balancer, &c. If what we really want is better, more accurate environment measurement, then know that there's a refreshing reimagination of monitoring going on under #monitoringlove. If they're already marching in our direction, let's join ranks.

True story. I've shot myself in the foot more times than I care to remember by, for example, testing that port 80's open without checking the content coming back.
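To make that concrete, here's a toy sketch (Python again; the URL and expected string are made up) of the check that's bitten me versus the check I actually want:

```python
#!/usr/bin/env python
# The check that bites: "is port 80 open?" passes even when the app is
# serving a blank page or an error. The better check also looks at the body.
# Host and expected text below are placeholders.
import socket
import urllib.request

HOST = "www.example.com"          # placeholder host
EXPECTED = "Welcome to Example"   # some string the real page must contain


def port_80_is_open(host, timeout=5):
    """The weak check: something, anything, is listening on port 80."""
    try:
        with socket.create_connection((host, 80), timeout=timeout):
            return True
    except OSError:
        return False


def page_looks_right(host, expected, timeout=5):
    """The stronger check: the page comes back 200 and contains what we expect."""
    with urllib.request.urlopen("http://%s/" % host, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        return resp.status == 200 and expected in body


if __name__ == "__main__":
    assert port_80_is_open(HOST), "port 80 isn't even open"
    assert page_looks_right(HOST, EXPECTED), "port's open but the content is wrong"
    print("port open AND content sane")
```

The first check happily passes while the app serves an error page from behind that same port; the second one is the one that would have saved me.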

Now that I've said this, I think I start to answer my own question of TDI (test-driven infrastructure) vs monitoring. I begin to see these points: write the tests first (duh, devs have been saying this for years), and better generation of tests (which double as monitoring artifacts), ideally automatic.

Test first: yes, particularly when starting with a new technology (see above re: RabbitMQ). Also, in theory you can rip stuff out and try something else in its place (think nginx vs Apache); if the tests still pass, you're golden. Still missing: better test generation. I'd love something that ate serverspec tests and spat out Nagios configs; even if it only produced a rough first draft, it'd be valuable.
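To be clear, I don't know of anything that does this, so what follows is pure hand-waving: a toy Python sketch that takes a small description of "ports that should be listening" (the sort of thing serverspec expresses with port checks) and emits first-draft Nagios service definitions built on the stock check_tcp plugin. The hostnames and the hard-coded data structure are invented; a real tool would have to parse the serverspec files themselves, which is the actual hard part.

```python
#!/usr/bin/env python
# Hand-waving sketch: turn a tiny description of "ports that should be
# listening" (the kind of thing serverspec's port(...) checks express)
# into first-draft Nagios service definitions. The hostnames, the data
# structure, and the template name are all made up for illustration.

# In real life this would be parsed out of the serverspec files themselves;
# here it's hard-coded data standing in for that step.
CHECKS = {
    "web01.example.com": [80, 443],
    "rabbit01.example.com": [5672, 15672],
}

SERVICE_TEMPLATE = """\
define service {{
    use                 generic-service
    host_name           {host}
    service_description port {port} listening
    check_command       check_tcp!{port}
}}
"""


def nagios_services(checks):
    """Render one Nagios service definition per (host, port) pair."""
    for host, ports in sorted(checks.items()):
        for port in ports:
            yield SERVICE_TEMPLATE.format(host=host, port=port)


if __name__ == "__main__":
    # First draft of the monitoring config; a human still has to review it.
    print("\n".join(nagios_services(CHECKS)))
```

Even something that dumb would give you monitoring that starts from the same assertions as your tests, which is the whole appeal.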