Posts tagged “monitoring”

May 26, 2010 Config file parsing
I've been setting up some new VMs for a separate project at work. I've realized that this is painful for two reasons: Bacula and Nagios.

Both are important...can't have a service without monitoring, and can't have a machine without backups. But each of these are configured by vast files; Bacula's is monolithic (the director's, anyhow, which is where you add new jobs) and Nagios' is legion. And they're hard to configure automagically with sed/awk/perl or cfengine; their stanzas span lines, and whitespace is important.

I've recently added a short script to my Nagios config; it regenerates a file that monitors all the Bacula jobs and makes sure they happen often enough. This is good...and I want more.

I found pynag, a Python module to parse and configure Nagios files. This is a start. I've had problems getting its head around my config files, because it didn't understand recursion in hostgroups (which I think is a recent feature of Nagios) or a hostname equal to "*". I've got the first working, and I'm banging my head against the second. The three books I got recently on Python should help (wow, IronPython looks nice).

There are a lot of example scripts with pynag. None do exactly what I want, but it looks like it should be possible to generate Nagios config files from some kind of list of hosts and services. This would be a big improvement.

But then there's Augeas, which does bi-directional parsing of config files. Have a look at the walk-through...it's pretty astounding. I realized that I've been looking for something like this for a long time: an easier way of managing all sorts of config files. Cfengine (v2 to be sure) just isn't cutting it anymore for me.

Now, the problem with Augeas for my present task is that there isn't anything in the current tree that does what I want, either. There is a commit for parsing nagios.cfg -- not sure if it's been released, or if it will parse everything in a Nagios config_dir. There's nothing for Bacula, either. This will mean a lot more work to get my ideal configuration management tool.

(On a side note, my wife said something to me the other day that was quite striking: I need tasks that can be divvied up into 45-minute chunks. That's how much free time I've got in the morning, bus rides to and from work, and the evening. Commute + kids != long blocks of free time.)

And I've got a congenital weakness for grand overarching syntheses of all existing knowledge...or at least big tasks like managing config files. So I'm trying to be aware of my brain.

...and there's son #2 waking up. Time to post.
January 20, 2010 Checks
- This Nagios check looks for extra Wordpress admins.$ARG1$ should look include "-w min:max -c min:max", giving the acceptable ranges; in my case, I know I should have 3 and only 3 admins, so I have "-w 3:3 -c 3:3".
```
    define command{
```
```
    command_name check_wp_admins
    command_line $USER1$/check_mysql_query -q 'SELECT COUNT(wp_users.user_login) AS "Admins"
                           FROM wp_users, wp_usermeta
                           WHERE wp_usermeta.meta_value LIKE "%administrator%" AND
                           wp_usermeta.user_id=wp_users.ID' -H $HOSTADDRESS$ $ARG1$
```
```
    }
```
- This one looks for nasty Wordpress posts. Note the dependency on MySQL's regex command. In my case, I know that I do not have any posts with these words, so in $ARG1$ I have "-w 0:0 -c 0:0".
```
    define command{
```
```
    command_name check_wp_nasty_posts
    command_line $USER1$/check_mysql_query -q 'SELECT COUNT(*)
                           FROM wp_posts
                           WHERE post_content REGEXP "iframe|noscript|display"' -H $HOSTADDRESS$ $ARG1$
```
```
    }
```
- And a Python script that picks out a random client, job and file from Bacula's database and tries to retrieve it. It's not ideal -- checking for sanity is left as a DARPA Grand Challenge -- but at least it's one way of exercising backups. I anticipate running this often. If anyone's interested, let me know.
The more I work with Python, the more I don't just like it but admire it.

Ugh...not much more right now. I've got a blocked eustachian tube that I'm self-medicating with a Python script^W^Wcold medicine, and the acetominiphen in it is making me hazy.
December 31, 2009 Xmas maintenance
A nice thing about working at a university is that you get all this time off at Xmas, which is really nice; however, it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.

Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS' ATS, but discovered that NUT is different from APCUPSD in one important way: it doesn't easily allow you to say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.

It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesistant when I estimate how long something will take, because I feel like I have no way of knowing in advance (interruptions, unexpected obstacles...you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer that I needed.

One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. Upon closer inspection, they were reporting problems with half of the LUNs that the array was presenting to them -- and they were reporting them in different ways. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)

The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But:
1. You have to rebuild the drivers after a kernel change.
2. This only showed up on two servers because the third server had not upgraded its kernel (or indeed, any of its packages). Why? cfservd had refused its connection because I had the MaxConnections parameter too low.
3. And of the two that did upgrade, the one machine we'd tested the Linux drivers on still had an old multipath.conf file in /etc, which even though the multipathd. service wasn't starting up was enough to get drivers loaded. This took a while to figure out because I'd completely forgotten how to tell which driver was in use.
I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)

Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.

(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)

Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.

And best of 2010 to all of you!
October 05, 2009 Instructions for yak shaving
1. Install logwatch on Solaris fileserver.
2. Notice that logwatch emails are not coming in.
3. Log in and run logwatch by hand.
4. Inspect mail log and notice lack of any entries.
5. Notice that Postfix is in maintenance mode; start it up.
6. Notice continued lack of emails.
7. Notice that Postfix is running, which confused svcadm when told to start up Postfix. It fails to do so and fails to log this.
8. killall postfix, svcadm enable postfix.
9. man svcadm; svcadm clear postfix; svcadm enable postfix.
10. Run logwatch by hand; notice emailed report to "root@localhost.localdomain", which gets bounced by Postfix on the mail server because it's a non-existent host.
11. Resist temptation to go down that rabbit hole just now, and stick to the problem at hand.
12. Edit /opt/csw/etc/log.d/logwatch.conf and set MailTo to proper address.
13. Re-run logwatch and note that reports are still going to root@localhost.
14. After much swearing, notice that actually, logwatch is set to look in /opt/csw/etc/log.d/conf/logwatch.conf for configuration.
15. Edit that file, re-run logwatch.
16. Notice errors from Postfix: "postdrop[13848]: [ID 947731 mail.warning] warning: mailqueueenter: create file maildrop/908447.13848: Permission denied".
17. Run "postfix set-permissions". Test mail; still failing.
18. Check permissions on another system and set by hand.
19. Re-run logwatch. Still no email. Re-run with debug=high and get email.
20. Wonder idly about futility of self-aware log watching system that can't report on its own heisenbug-induced failure, crappy packaging practices, inability to check end-to-end email connectivity, other career options.
21. (Update) Realize that the emails show up if "Detail" is set to Medium or High ; Low, the default, makes the report silent.
22. (Update) Uninstall the package and reinstall, only to find that the symlink to conf/logwatch.conf is set up at installation, and that this is probably a case of $EDITOR breaking the symlink. Apply head to desk.
July 16, 2009 Cacti debugging
The saga of the UPS continues. Yesterday I got the SNMP card set up and working. I also found this Cacti template, which promised lots of pretty graphs. But there were a few bumps along the way.

First, Cacti was convinced that the UPS was down. Actually, it took me a while to figure this out because the logs didn't say anything abou this host at all. Eventually I tracked it down to Cacti using SNMP queries to see if it was up; turns out that this machine doesn't like being queried at the OID 0.1, and just doesn't respond. Changing the upness-detecting algorithm (heh) to TCP ping did the trick nicely.

Next, the graphs for the UPS were still not being produced, even though the RRDs were now being updated. I got the debug info for a graph and ran the rrdtool command by hand. The RRD does not contain an RRA matching the chosen CF was the response.

This thread showed a lot of people having the same problem. Since some of these problems were fixed by an upgrade, I did so; there were a few CentOS updates waiting for that machine anyhow. That made it worse: no graphs were being shown now. rrdtool said that there were no fonts present, so maybe fontconfig was out of order. Installing dejavu-lgc-fonts did the trick nicely, and I got my graphs back.

Well, all except the UPS ones I was after in the first place. I was still getting the error about not containing the chosen CF. Well, when all else fails keep reading the forum, right?

The rrdtool command used the LAST function; this was the culprit. If I ran s/LAST/AVERAGE/g on the command, it worked a treat. Thus, one option would have been to edit the template. However, I decided on an alternate approach, suggested in the forum: I removed the UPS RRDs, went to Data Sources -> RRAs in the Console menu, selected each RRA in turn and added LAST to the consolidation function.

Finally! Whee! Except for one: the graph of voltage vs input frequency. I still don't know what this means to me, but I wasn't about to give up now.

Again, rrdtool provided the error: "For a logarithmic yaxis you must specify a lower limit > 0". Bug reports to the rescue: Console -> Graph Templates -> Voltage/Freq, and set Lower Limit to 0.1.

All that and I'm still the only one looking at these graphs. Man, I should frame them.
September 29, 2007 Presentation(s), conference, nagios exchange, Project U-13, Project U-14
I've had a bunch of ideas lately. I'm inflicting them on you.

The presentation went well...I didn't get too nervous, or run too long, or start screaming at people (damn Induced Tourette's Syndrome) or anything. There were maybe 30 or so people there, and a bunch of them had questions at the end too. Nice! I was embiggened enough by the whole experience that, when the local LUG announced that they were having a newbie's night and asked for presenters to explain stuff, I volunteered. It's coming up in a few weeks; we'll see what happens.

And then I thought some more. A few days before I'd been listening to the almost-latest episode of LugRadio (nice new design!), where they were talking about GUADEC and PyCon UK. PyCon was especially interesting to hear about; the organizers had thought "Wouldn't it be cool to have a Python conference here in the UK?", so they made one.

So I thought, "It's a shame I'm not going to be able to go to LISA this year. Why don't we have our own conference here in Vancouver?" The more I thought about it, the better the idea seemed. We could have it at UBC in the summer, where I'm pretty sure there are cheap venues to be had. Start out modest — say, a day long the first time around. We could have, say, a training track and a papers track. I'm going to talk about this to some folks and see what they think.

Memo to myself: still on my list of stuff to do is to join pool.ntp.org. Do it, monkey boy!

Another idea I had: a while back I exchanged secondary DNS service, c/o ns2exchange.com. It's working pretty well so far, but I'm not monitoring it so it's hard for me to be sure that I can get rid of the other DNS servers I've got. (Everydns.net is fine, but they don't do TXT or IPv6 records.) I'm in the process of setting up Nagios to watch my own server, but of course that doesn't tell me what things look like from the outside.

So it hit me: what about Nagios exchange? I'll watch your services if you watch mine. You wouldn't want your business depending on me, of course, but this'd be fine for the slightly anal sysadmin looking to monitor his home machines. :-) The comment link's at the end of the article; let me know if you're interested, or if you think it's a good/bad/weird idea.

The presentation also made me think about how this job has been, in many ways, a lot like the last job: implementing a lot of Things That Really Should Be Done (I hate to say "Best Practices) in a small shop. Time is tight and there's a lot to do, so I've been slowly making my way through the list:
- Improving backups (Bacula, Amanda)
- Automated install (FAI, Jumpstart)
- Monitoring services (Nagios)
- Monitoring performance (MRTG, Cacti)
- Ticket system (RT)
- Automating management (Cfengine)
Some of these things have been held up by my trying to remember what I did the last time. And then there's just getting up to speed on bootstrapping a Cfengine installation (say).

So what if all these things were available in one easy package? Not an appliance, since we're sysadmins — but integrated nicely into one machine, easily broken up if needed, and ready to go? Furthermore, what if that tool was a Linux distro, with all its attendant tools and security? What if that tool was easily regenerated, and itself served as a nicely annotated set of files to get the newbie up and running?

Between FAI (because if it's not Debian, you're working too hard) and cfengine, it should be easy to make a machine look like this. Have it work on a live ISO, with installation afterward with saved customizations from when you were playing around with it.

Have it be a godsend for the newbie, a timesaver for the experienced, and a lifeline for those struggling in rapidly expanding shops. Make this the distro I'd want to take to the next job like this.

I'm tentatively calling this Project U-13. We'll see how it goes.

Oh, and over here we've got Project U-14. So, you know, I've got lots of spare time.
July 28, 2007 It's a love affair...mainly Nagios and my network
I can get really, really focussed sometimes. Every now and then that happens with Nagios.

Yesterday I had some time to kill before I went home, so I looked over my tickets in RT. (I work in a small shop, so a lot of the time the tickets in RT are a way of adding things to my to-do list.) There was one that said to watch for changes in our web site's main page; I'd added that one after MySQL'd had problems one time -- ran out of connections, I think -- and Mambo had displayed a nice "Whoops! Can someone please tell the sysadmin?" page (a nice change from the usual cryptic error when there's no database connection). Someone did, but it would've been nice to be paged about it.

At home I use WebSec to keep track of some pages that don't change very often (worse luck…), and I thought of using that. It sends you the new web page with the different bits highlighted, which is a nice touch. But I wanted something tied in with Nagios, rather than another separate and special system.

So I started looking at the Nagios plugins I had, and I was surprised to find that check_http has a raft of different options, including the ability to check for regexes in the content. Sweet! I added a couple strings that'll almost certainly be there until The Next Big Redesign(tm), and done.

I started looking at the other plugins, and noticed check_hpjd. A few minutes later I was checking our printers for errors...just in time to notice a weird error that someone had emailed me about 30 seconds before. Nice!

This morning (I work from home on Saturdays in return for getting Wednesdays off to take care of Arlo) I was checking Cacti (which rocks even if they do call it a solution). /home/visitors with no free space? Wha'? Someone had run a job that'd managed to fill the whole damned partition.

Well, there's check_disk, but that's only for mounted disks — and I don't want the monitoring machine freezing if there's a problem with NFS. SNMP should do this, right? Right — the net-snmp project has the ability to throw errors if there's less than a certain amount of free space on a disk. For some reason I'd never set that up before, nor got Nagios to monitor for it. A few minutes later and check_snmp was looking for non-empty error messages:
```
$USER1$/check_snmp -H $HOSTADDRESS$ -o UCD-SNMP-MIB::dskErrorMsg.$ARG1$ -s ""
```
I looked ahead in snmpd.conf and noticed the process section. Well, hell! It's all very good to check that the web server is running, but what if there are too many Apache processes? Or too few of MySQL? Or no Postfix? Can't believe I never set this up before…

I've finally come up for breath. This wasn't what I planned on doing this morning, but I love it when a plan will come together next time.