Carousel is a lie!

Entries tagged "cfengine".

1 != 2
26th August 2005

I love cfengine. If you haven't checked it out yet, do so. You can do really neat stuff like this:

editfiles::
        { /etc/Xprint/C/print/attributes/document
                BeginGroupIfNoLineMatching "^\*default-printer-resolution: 300"
                        CommentLinesMatching "^\*default-printer-resolution: 600"
                        LocateLineMatching "^# \*default-printer-resolution: 600"
                        InsertLine "*default-printer-resolution: 300"
                        DefineInGroup restart_xprint
                EndGroup
        }

shell::
        debian.restart_xprint::
                "/etc/init.d/xprint restart"

(Which, by the way, totally fixes the problem of Debian printing 'way huge stuff. Bug number 262958. You should totally look it up.) Look at that. It's lovely. It's obvious what it's looking for, what it'll do if it can't find it, and what'll happen after that. And it does it automagically. At night. From cron. The way God intended all system administration to be done. However -- and I cannot emphasize how important it is to keep this in mind -- it is absolutely NFG reading the documentation for an hour trying to figure out why the DefineInGroup statement just does not work if:

  1. you're reading the docs for cfengine v2, and
  2. you're working with cfengine v1.

It's my own fault for printing out v2 docs and not thinking much about it. However, in my own defense it would be nice if cfengine would complain about something it appears not to recognize. Not even with -d2 (which produces output along the lines of CheckingDateForSolarEclipseToday [no]) did it whisper a word about this.

Tags: cfengine.
cfengine classes and shellcommands
6th October 2005

cfengine is great, it really is. But there are some things that tripped me up. Often you want to set up a daemon to run The Right Way, which involves changing its config file. After that, of course, you want to restart it. What to do? The naive way (i.e., the first way I tried) of doing things is:

control::
        sequence ( editfiles shellcommands )

editfiles::
        debian:
                { /etc/foo.conf
                        BeginGroupIfNoLineMatching "bar"
                                AddLine "bar"
                                Define restart_foo
                        EndGroup
                }

        freebsd:

                { /usr/local/etc/foo.conf
                        BeginGroupIfNoLineMatching "bar"
                                AddLine "bar"
                                Define restart_foo
                        EndGroup
                }

shellcommands::
        debian.restart_foo:
                "/etc/init.d/foo restart"

        freebsd.restart_foo:
                "/usr/local/etc/rc.d/foo restart"

However, the correct way of doing this is:

control::
        sequence = ( editfiles shellcommands )
        AddInstallable = ( restart_foo )

editfiles::
        debian:
                { /etc/foo.conf
                        BeginGroupIfNoLineMatching "bar"
                                AddLine "bar"
                                DefineInGroup "restart_foo"
                        EndGroup
                }

        freebsd:
                { /usr/local/etc/foo.conf
                        BeginGroupIfNoLineMatching "bar"
                                AddLine "bar"
                                DefineInGroup "restart_foo"
                        EndGroup
                }

shellcommands::
        debian.restart_foo:
                "/etc/init.d/foo restart"

        freebsd.restart_foo:
                "/usr/local/etc/rc.d/foo restart"

Without both the enumeration of all your made-up classes in AddInstallable and the enclosing of that class in quotes, cfengine will fail to do what you want -- and will do so quietly and with no clue about why. God, that took me a long time to find.
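Since cfengine stays quiet about both mistakes, a small lint pass over the config can catch them before a run does. What follows is only a rough sketch in Python, not anything cfengine ships: the function name `check_classes` is made up, and it assumes the statements look exactly like the ones in the examples above (a class named after `Define`, `DefineInGroup` or `DefineClasses`, and a single `AddInstallable = ( ... )` list of space-separated names).

```python
import re

# Statements that name a class, as they appear in the examples above.
# Longest alternatives first so "DefineInGroup" is not read as "Define".
DEFINE = r'\b(?:DefineInGroup|DefineClasses|Define)\s+'


def check_classes(config_text):
    """Return (undeclared, unquoted) class names found in config_text.

    undeclared: quoted classes that are missing from AddInstallable.
    unquoted:   classes named without the surrounding double quotes.
    """
    declared = set()
    for body in re.findall(r'AddInstallable\s*=\s*\(([^)]*)\)', config_text):
        declared.update(body.split())

    quoted = re.findall(DEFINE + r'"([^"]*)"', config_text)
    unquoted = re.findall(DEFINE + r'([A-Za-z_]\w*)', config_text)

    undeclared = sorted(set(quoted) - declared)
    return undeclared, sorted(set(unquoted))


if __name__ == "__main__":
    sample = '''
control::
        AddInstallable = ( restart_foo )
editfiles::
        DefineInGroup "restart_foo"
        DefineInGroup "restart_bar"
'''
    print(check_classes(sample))
```

Run against the sample, it reports `restart_bar` as undeclared and nothing as unquoted. It is a text match, not a parser, so treat a clean result as a hint rather than proof.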

Tags: cfengine.
We're An American Band
13th March 2006

More fallout today from Saturday's power outage: two workstations that failed to boot up (BIOS checksum error for one of 'em, which is a new one for me), some NIS-related services that didn't get started properly (not sure what's going on there), and so on. Plus the return of the where-are-those-seven-machines? that didn't get done on Friday because of all of this.

But I did learn some stuff about Cfengine. For example, if you have something like:

my_url = ( http://www.example.com/foo/bar )

then you'd better precede it with:

split = ( "+" )

or some other character that isn't used. The colon is treated as a list separator by default, which means that later on, when you try and do something like:

shell::
    linux.need_some_file:
        "/bin/wget $(my_url)/baz"

what it'll actually do is this:

/bin/wget http/baz
/bin/wget //www.example.com/foo/bar/baz

'cos it's iterating over the two items in what it thinks is a list, see?
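To make the splitting concrete, here is a toy model of the behaviour in Python. It is not cfengine's code; `expand_commands` is a made-up helper that just splits the variable's value on the separator and runs the template once per piece, which is enough to reproduce the two commands above.

```python
def expand_commands(template, name, value, separator=":"):
    """Mimic the iteration described above: one command per list element."""
    placeholder = "$(" + name + ")"
    return [template.replace(placeholder, item)
            for item in value.split(separator)]


url = "http://www.example.com/foo/bar"

# Default separator: the colon in "http:" splits the URL in two.
for command in expand_commands("/bin/wget $(my_url)/baz", "my_url", url):
    print(command)

# With an unused character as the separator, the URL survives intact.
for command in expand_commands("/bin/wget $(my_url)/baz", "my_url", url,
                               separator="+"):
    print(command)
```

The first loop prints the two broken commands; the second prints the single command that was wanted in the first place.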

And SuSE's dhcp client, by default (I think), will change /etc/yp.conf without telling you, and then on exit put back the old version (saved conveniently at /etc/yp.conf.sv). It took me a long time to figure out that this was happening, and it pissed me off mightily. /etc/resolv.conf is filled with comments when the dhcp client modifies it -- hell, they even throw in the PID. So why not do that with yp.conf? At least you can turn it off by changing DHCLIENT_MODIFY_NIS_CONF in /etc/sysconfig/network/dhcp.

Tags: cfengine.
Little Green Bag
1st June 2006

Some days are fun days. I got this error on a Debian workstation when starting X:

Xlib: Connection to ":0.0" refused by server
Xlib: Protocol not supported by server.
Xrdb: Can't open display ':0'

Turns out that an .xsession file, with one commented-out line, caused that. Remove the line (so now it's empty) and everything works.

Next we got the same user, who's had his home directory moved around on the machine. Machines mounting his home dir via amd (FreeBSD, Debian) work fine, but the SuSE machines running autofs fail miserably with "permission denied" and the ever-popular:

$ cd
-bash: cd: /home/foo: Unknown error 521

Which, if you look up /usr/include/linux/errno.h -- which, you know, is the logical thing to do -- you see this:

/* Defined for the NFSv3 protocol */
#define EBADHANDLE      521     /* Illegal NFS file handle */

Another weird thing with AutoFS: I was running cfengine on a machine, and it hung when querying which RPMs were installed. strace on the rpm command shows it's trying to lock a file and failing; looking at /proc/number/fd shows that, yep, it's trying and failing to lock /var/lib/rpm/Packages, the Berkeley DB file that knows all and sees all. So lsof to see who's holding it open, and that hangs; strace shows it's hanging trying to access the home directory of a user whose machine is down right now for reinstall. Try to unmount that directory and it fails. So I bring up the machine with the user's home directory, which allows me to unmount his home directory on the SuSE machine, which allows cfengine to run rpm, which succeeds in locking the Berkeley DB file. Strange; possibly similar to this problem.

On top of everything else, someone asked me if I could be a "network prime". I think they mean "person we can talk to with authority to make network changes", or possibly "network contact". Not entirely sure.

But on the other hand: figured out how to run wpkg, package manager for Windows of the elder gods, as a service using Cygwin's cygrunsrv. The instructions are on the wiki for your viewing enjoyment.

Tags: amd, cfengine, windows.
Choose It
4th December 2006
Two sips from the cup of human kindness, and I'm shit-faced
Just laid to waste
If there's a choice between chance and flight, Choose it tonight.
"Choose It", The New Pornographers

Just got back from a whirlwind walk from the Lincoln Memorial to the Washington Monument to the White House. Beautiful, all of it...though a) the White House is small and b) there was something being filmed/videotaped in the courtyard, which made me think of Vancouver.

Training again. AFrisch was good, covering Cfengine quite well; would've liked to see more info about expect. (Apparently there are Perl/Python bindings...I had no idea.) Afternoon course was "Interviewing For System Administrators" by Adam Moskowitz and that was great -- lots of things I didn't know, lots of tips on doing it better next time.

Saw Tom Limoncelli in the hall during a break. Managed to restrain myself. I have the reputation for quiet restraint of a nation to uphold.

Very tired now. Time to go get beer.

Tags: cfengine, lisa.
Electric Version
7th December 2006
Sound of tires, sound of God...
"Electric Version", The New Pornographers.

Thursday morning came far too early. My roommate offered some of his 800mg ibuprofens, and I accepted. First thing I attended was the presentation "Drowning in the Data Tsunami" by Lee Damon and Evan Marcus. It was interesting, but seemed to be mostly about US data regulations (HIPAA/SOX et al.) and wasn't really relevant to me. I had been expecting more of an outline of, say, how in God's name we're going to preserve information for, say, a hundred years (heroic efforts of the Internet Archive notwithstanding). There was mention of an interesting approach to simply not accumulating cruft as you upgrade storage (because it's easier than sorting through to see what can be discarded; "Why bother weeding out 200MB when the new disk is 800GB?"): a paper by Radia Perlman (she of Spanning Tree Protocol fame) that proposes an encrypted data storage system (called The Ephemerizer) combined with key escrow that, to expire data, simply deletes the key when the time is up. Still, I moved on before too long.

...Which was good, because I sat in on Alva Couch's presentation on his and Mark Burgess' paper, "Modelling Next-Generation Configuration Management Tools". Some very, very confusing stuff about aspects, promises and closures -- confusing because the bastard didn't preface his talk with "This is what Hugh from Vancouver will need to know to understand this." (May be in the published paper; will check later.) Here's what I could gather:

I will do the right thing and read his paper, and I may update this later; these are just my notes and impressions, and aren't gospel. Couch is an incredibly enthusiastic speaker, and even though I didn't understand a lot of it I ended up excited anyway. :-) He gave another talk later in the week that Ricky went to, about how system administration will have to become more automatic; as a result, we'd all better learn how to think high-level and to be better communicators, because more and more of our stuff will be management -- and not just in the sense of managing computers. I'm going to seek out more of his stuff and see if it'll fit in my head.

After the break was a talk on "QA and the System Administrator", presented by a Google sysadmin. I went because it was Google, and frankly it wasn't that interesting. One thing that did jump out at me was when he described a Windows tool called Eggplant, a QA/validation tool. It has OCR built-in to recognize a menu, no matter where it is on the screen. This astounded me; when you start needing OCR to script things, that's broken. I don't doubt that it's a good tool, and I can think of lots of ways that would come in handy. But come on. I mean, a system that requires that is just so ugly.

I went out to lunch with Jay, a sysadmin from a shop that's just got permission from the boss to BSD a unit-testing program they've come up with for OpenBSD firewalls: it uses QEMU instances to fully test a firewall with production IP addresses, making sure that you're blocking and allowing everything you want. It sounds incredibly cool, and he's promised to send me a copy when he gets back. I can't wait to have a look at it.

After that was the meet-the-author session. I got to thank Tom Limoncelli for "Time Management for System Administrators", and got an autograph sticker from him and Strata Rose Chalup, his co-author for Ed 2. Sadly, I didn't get a chance to thank Tobias Oetiker (who I nearly ran into at lunch the day before).

Next up was the talk from Tom Limoncelli and Adam Moskowitz (Adam's looking for a job! Somebody hire him!) about how to get your paper accepted at LISA. Probably basic stuff if you've written a paper before, but I haven't so it was good to know. Things like how to write a good abstract, what kind of paper is good for LISA, and how you shouldn't say things like "...and if our paper is accepted, we'll start work right away on the solution." Jay asked whether a paper on the pf testing tool would be good, and they both nodded enthusiastically.

Must Google:

Quotes from the talk:

At this point I started getting fairly depressed. Part of it was just being tired, but I kept thinking that not only could I not think of something to write a paper about, I could not think of how I'd get to find something to write about. I wandered over to the next talk feeling rather sad and lost.

The next talk was from Andy Seely on being a sysadmin in US Armed Forces Command and Control. Jessica was there, and we chatted a bit about how this talk conflicted with Tom Limoncelli's Time Management Guru session, and maybe ducking over to see that. Then Andy came over and asked Jessica to snap some pictures, so she ended up staying. I was prepared to give it five minutes before deciding whether or not to leave.

Well, brother, let me tell you: Andy Seely is one of the best goddamned speakers on the planet. He was funny, engaging, and I could no more leave the room than I could get my jaw to undrop. Not only that, his talk was fascinating, and not just because he's a sysadmin for the US Armed Forces while simultaneously having a ponytail, earrings and tattoos. You can read the article in ;login: (FIXME: Add link) that it was based on, but he expanded on it considerably. Let me see what I can recall:

Longer story: Because of the nature of his work, he's got boxes that he has to keep working when he knows next to nothing about what they're meant to do. Case in point: a new Sun box arrives ("and it's literally painted black!"), but the person responsible for it wants to send it back because it doesn't work -- which means that when they click the icon to start the app it's meant to run, it doesn't launch and there's no visible sign that it's running. There's no documentation. And yet he's obligated to support this application. What do you do?

Even tracking down the path to the program launched by the icon is a challenge, but he does, tracks down the nested shell scripts and finally finds the jar that is the app ("Aha! It is Java!"). He finds log files which are verbose but useless. He contacts the company that wrote it, and is told he needs a support contract...which the government, when putting together the contract for the thing, did not think to include. So he calls back an hour later, talks to the help desk and tells them he's lost the number -- "Can you help a brother out?" They do, but they're stumped as well, and say they've never seen anything like this.

Time to pull out truss, which produces a huge amount of output. Somewhere in the middle of all that he notices a failing hard read of a file in /bin: it was trying to read 6 bytes and failing. Turns out the damned thing was trying to keep state in /bin, and failing because the file was zero bytes long. He removed the file, and suddenly the app works.

Andy also talked about trying to get a multiple GB dump file from Florida to Qatar. Physical transport was not an option, because arranging it would take too long. So he tries FTPing the file -- which works until he goes home for the day, at which point the network connection goes down and he loses a day. So he writes a Perl script that divides the file into 300MB chunks, then sends those one at a time. It works!

At this point, someone yells out "What about split?" Andy says, "What?" He hadn't known about it. There was a lot of good-natured laughter. He asked, "Is there an unsplit?" "Cat!" came the response from all over the room. He smacked his forehead and laughed. "This is why I come to LISA," he said. "At my job, I've been there 10 years. People come to me 'cos I'm the smart one. Here, I'm the dumb one. I love that."
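For the record, the whole trick fits in a few lines. This is my own illustration in Python rather than Andy's Perl script (which I never saw): `split_file` cuts a file into fixed-size pieces the way split does, `join_files` glues them back together the way cat does, and `round_trip_demo` checks that nothing got lost. All three names are invented for this sketch, and it reads a whole chunk into memory at a time, which you would want to rethink for 300MB pieces.

```python
import os
import tempfile


def split_file(path, chunk_size):
    """Write path.000, path.001, ... of at most chunk_size bytes each."""
    chunk_paths = []
    with open(path, "rb") as source:
        index = 0
        while True:
            data = source.read(chunk_size)
            if not data:
                break
            chunk_path = "%s.%03d" % (path, index)
            with open(chunk_path, "wb") as chunk:
                chunk.write(data)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths


def join_files(chunk_paths, destination):
    """Concatenate the chunks, in the order given, into destination."""
    with open(destination, "wb") as out:
        for chunk_path in chunk_paths:
            with open(chunk_path, "rb") as chunk:
                out.write(chunk.read())


def round_trip_demo():
    """Split a small temporary file, rejoin it, and compare the bytes."""
    payload = bytes(range(256)) * 10  # 2560 bytes of test data
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "dump.bin")
        with open(original, "wb") as handle:
            handle.write(payload)
        pieces = split_file(original, 1000)
        rejoined = os.path.join(workdir, "rejoined.bin")
        join_files(pieces, rejoined)
        with open(rejoined, "rb") as handle:
            return handle.read() == payload


if __name__ == "__main__":
    print(round_trip_demo())
```

The zero-padded suffixes matter: they keep the pieces in the right order when you list them, which is the same reason a shell glob handed to cat puts split's output back together correctly.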

There are two things I would like to say at this point.

First off, Andy is at least the tenth coolest person on the entire Eastern seaboard. No, he didn't know about split or cat -- but not only did he reimplement split in Perl rather than give up, he didn't even flinch when being told about it in the middle of giving a talk at LISA. I would probably have self-combusted from embarrassment ("foomp!"), and I would have felt awful. Andy's attitude? "I learned something." That's incredibly strong. (Although he told a story later about being in the elevator with some Google people. They recognized him and said, "Hey, it's the 'man cat' guy!")

Second, when he said, "Here, I'm the dumb one. I love that" I sat up straight and thought, "Holy shit, he's right." Here I am at LISA for the first time ever. I've met people who can help me, and people I can help. I've made a crapload of new friends and have learned more in one week than I would've thought possible. And I'm worried 'cos it might be a few years before I can think about presenting a paper? That's messed up. I tend to set unreasonably high goals for myself and then get depressed when I can't reach them. Andy's statement made me feel a whole lot better.

During Q & A I asked what he did for peer support, since his ability to (say) post to a mailing list asking for help must be pretty restricted. He said that he's started a wiki for internal use and it's getting used...but both the culture and the job function mean that it's slow going. He's also started a conference for fellow sysadmins: 100 or so this year, and he's hoping for more next year.

In conclusion: if you ever get the chance to go see him, do so. And then buy him a beer.

Tags: cfengine, lisa.
Presentation(s), conference, nagios exchange, Project U-13, Project U-14
29th September 2007

I've had a bunch of ideas lately. I'm inflicting them on you.

The presentation went well...I didn't get too nervous, or run too long, or start screaming at people (damn Induced Tourette's Syndrome) or anything. There were maybe 30 or so people there, and a bunch of them had questions at the end too. Nice! I was embiggened enough by the whole experience that, when the local LUG announced that they were having a newbie's night and asked for presenters to explain stuff, I volunteered. It's coming up in a few weeks; we'll see what happens.

And then I thought some more. A few days before I'd been listening to the almost-latest episode of LugRadio (nice new design!), where they were talking about GUADEC and PyCon UK. PyCon was especially interesting to hear about; the organizers had thought "Wouldn't it be cool to have a Python conference here in the UK?", so they made one.

So I thought, "It's a shame I'm not going to be able to go to LISA this year. Why don't we have our own conference here in Vancouver?" The more I thought about it, the better the idea seemed. We could have it at UBC in the summer, where I'm pretty sure there are cheap venues to be had. Start out modest — say, a day long the first time around. We could have, say, a training track and a papers track. I'm going to talk about this to some folks and see what they think.

Memo to myself: still on my list of stuff to do is to join pool.ntp.org. Do it, monkey boy!

Another idea I had: a while back I exchanged secondary DNS service, c/o ns2exchange.com. It's working pretty well so far, but I'm not monitoring it so it's hard for me to be sure that I can get rid of the other DNS servers I've got. (Everydns.net is fine, but they don't do TXT or IPv6 records.) I'm in the process of setting up Nagios to watch my own server, but of course that doesn't tell me what things look like from the outside.

So it hit me: what about Nagios exchange? I'll watch your services if you watch mine. You wouldn't want your business depending on me, of course, but this'd be fine for the slightly anal sysadmin looking to monitor his home machines. :-) The comment link's at the end of the article; let me know if you're interested, or if you think it's a good/bad/weird idea.

The presentation also made me think about how this job has been, in many ways, a lot like the last job: implementing a lot of Things That Really Should Be Done (I hate to say "Best Practices") in a small shop. Time is tight and there's a lot to do, so I've been slowly making my way through the list:

Some of these things have been held up by my trying to remember what I did the last time. And then there's just getting up to speed on bootstrapping a Cfengine installation (say).

So what if all these things were available in one easy package? Not an appliance, since we're sysadmins — but integrated nicely into one machine, easily broken up if needed, and ready to go? Furthermore, what if that tool was a Linux distro, with all its attendant tools and security? What if that tool was easily regenerated, and itself served as a nicely annotated set of files to get the newbie up and running?

Between FAI (because if it's not Debian, you're working too hard) and cfengine, it should be easy to make a machine look like this. Have it work on a live ISO, with installation afterward with saved customizations from when you were playing around with it.

Have it be a godsend for the newbie, a timesaver for the experienced, and a lifeline for those struggling in rapidly expanding shops. Make this the distro I'd want to take to the next job like this.

I'm tentatively calling this Project U-13. We'll see how it goes.

Oh, and over here we've got Project U-14. So, you know, I've got lots of spare time.

Tags: cfengine, conferenceorganization, dns, geekdad, monitoring, ntp, projectu13.
Stay on target...
21st December 2007

Holy crap, it's been a while since I last wrote here. Mainly that's because I've been working on web stuff at work and have felt very little like a sysadmin of late. Thankfully we've got a webmaster hired, and to some extent the work'll be shifted to him in the new year. Of course, that still leaves the redesign of the website and its back end…that's not done 'til it's done.

This week, though, has been slow, and I've been catching up a little on sysadmin work. Part of it was setting up a devel server for the webmaster, and detailing what I was doing in Cfengine as I went along. It was gratifying to get LDAP working (I haven't done that on a Linux machine before; shame on me), and irritating when I realized that I couldn't mount the home directories from the server because I hadn't restarted nscd on the server.

The last two days were spent trying to get encrypted Bacula working between here and $other_university. This was an enormous pain in the ass for two reasons:

  1. The Right Way (tm) of doing it is by using TLS, which is what the kids are calling SSL these days, and I have never fully grokked SSL, or the openssl command. I know that there's encryption going on; I know that there are certificates signed by CAs; I know that there's a lot of negotiating of different options. But start throwing in X.509 versus PEM, Diffie-Hellman parameters and the single most cryptic set of error messages I've ever come across, and I just feel thick. I was reduced to looking at tcpdump output of the negotiation to figure out what was going on, and I couldn't; the Bacula FD client complained that the Bacula Director wasn't producing a certificate, and that was all I knew. The otherwise incredibly excellent docs from Bacula were a trifle thin on all of this, and I couldn't find out much about my situation (going the self-CA route).

  2. So okay, fuckit, right? That's why God invented OpenSSH. So whee, start tunnelling port 9102 over SSH so the Director can contact the FD at $other_university, and 9103 back so the FD can contact the Storage Daemon. Only it turns out (my bad for not knowing this before) that not only does the client want to contact the SD, so does the director. Thus, my plan to tunnel to the firewall at the other end and tell the client that it could find the Storage Daemon there didn't work, 'cos the director wanted to contact it there too. (I did briefly try allowing the director to contact the tunnel at the other end: so even though the Storage was working on the same machine as the director, for that one job the Director's connection to it was going to the remote end and getting tunnelled back over SSH. But:

    1. that's horrible, and
    2. I was afraid that when it came time to restore, the Director would figure that it had to contact the Storage Daemon remotely again, complicating an already complicated setup.)

And why was I trying to connect to the remote firewall via SSH, rather than the client I'm trying to back up itself? Because that client is a Solaris machine authenticating against LDAP, and that turns out to bork key-based logins over SSH. What a crock.

Oh well. I did add three other machines here to Bacula this week, so that's good.

Project U-13 is coming along. I'm pretty close to a 0.0.2 release (woot), which should have the following working:

And by "working" I mean "installed". But I've got a decent setup on my laptop for building and testing it, which means I get up to a couple hours a day to work on it (New Westminster -> UBC == long). Thanks to Andy, he of the amazing speaking skills, for kicking my ass into action.

I'm learning a bit more about Mercurial in the process. After coming from CVS and Subversion, it seems really weird to me that the usual way of branching is "Go ahead, clone another repo! We're Mercurial! We don't care! Repos for everyone!" But if you figure on distributed development -- something more Linux-y than a controlled work environment -- then it makes sense. Not that I think I'll have lots of people working on this thing, but it makes sense that if someone were to take this for their own ends, they wouldn't want to bother copying all the branches…just the one(s) they're interested in.

Last word to my son:

Q: What does a Camel say, Arlo? A: Purhl!

Tags: cfengine, projectu13.
Coming up
18th January 2008

My laptop hard drive started giving scary errors a couple days ago on the way to work (I've got a 90-minute commute by public transit [uck] so I fill the time by reading, listening to podcasts, or working on Project U-13). Fortunately, working at a university means that there are two computer stores on campus. I ran out at lunch, picked up a 100GB drive, and had things back to normal by the next morning.

Well, normal modulo one false start with Debian; I decided to try encrypted filesystems just for fun. But then I suspended, came back with a newer kernel, and it could not read the encrypted LVM group anymore. Whoops.

Still lots of free space on this thing, and I'm thinking of installing Ubuntu, FreeBSD and maybe NetBSD just for fun. Of course, I've got to do it all via PXE since this thing doesn't have any CDROM drive, but that just adds to the geek points.

Project U-13 is coming up on 0.0.3, btw; Andy suggested adding Rackmonkey, which looks quite cool. There's no package for it, so I'm having to do some rather ugly scripted installation…but I can stand it for now. And I've got the barest skeleton of a cfengine file in there too. Watch the skies!

Tags: bsd, cfengine, hardware, projectu13.
Project U-13, 0.0.3
23rd January 2008

Version 0.0.3 of Project U-13, a distro for sysadmins, has been released!

The main change is the addition of RackMonkey, which its website describes as "a web-based tool for managing racks of equipment such as web servers, video encoders, routers and storage devices", at the suggestion of Andy Seely. Also, Lynx has been installed, and there's also the skeletal beginnings of a Cfengine config file.

The ISO has been signed with my public key. Share and enjoy, and comments on a postcard, please.

Tags: cfengine, projectu13.
cfengine: Received signal 2 (SIGKILL) while doing pre-lock-state
30th July 2008

Ran into a problem today when adding this stanza to cfengine on a Debian Etch machine:

editfiles:
        { /etc/aliases
                AppendIfNoSuchLine "root: sysadmin@pims.math.ca"
                DefineClasses "rebuild_aliases:restart_postfix"
        }

The cfengine reference file I've got, which sez it's for version 2.2.1, says you can define multiple classes in DefineClasses (or DefineInGroup), as long as they're separated by commas, spaces or dots. (The version in Etch is 2.2.20.)

However, when I ran cfagent, it just hung immediately after performing the edit, and gave this error when I ctrl-c'd it:

cfengine: Received signal 2 (SIGKILL) while doing [pre-lock-state]

Running cfengine with -d2 showed endless repetitions of AddClassToHeap() at this point, so either there's something wrong with my syntax or there's a bug in cfengine. (I'm guessing the former.) Searching for pre-lock-state and cfengine only turned up cases where the clients were syncing with the master; thus this note.

The fix was to just make it one class:

                DefineClasses "rebuild_aliases"

Asking to restart Postfix was probably a bit of overkill anyhow...

Tags: cfengine.
This is The Working Hour; we are paid by those who learn by our mistakes
18th November 2008

I'm in the process of setting up a bunch of new servers for $job_2. All but one are CentOS 5.2, kickstart installed and managed with cfengine. This is the third time I've gone through a cfengine setup, and it always feels like starting from scratch each time. It seems -- and I'm not at all sure this is fair or accurate -- that each time I set up one of these systems, there's a lot that I've lost from the last time and have to relearn. I'm fortunate this time that I can refer to $job_1's setup to see how I did things last time, but if I didn't have that I'd be significantly further behind than I am.

I'm not sure what the solution is. Part of me thinks I should just be more aggressive about taking notes, or committing stuff to a private repository, or writing it down here more; part of me thinks that this might be a clue that cfengine is too low-level for my head. It feels like when I was trying to learn C, and couldn't believe that I had to remember all this stuff just to print something, or read a file, or connect to another machine over the Internet. By contrast, Perl (or any other scripted language) was such a relief...just print, or open, or use the Net::Telnet module, or whatever. The details are there and they are important, sometimes very much so; that doesn't mean I want to learn more metallurgy every time I need a fork. (No, I don't think that metaphor's tortured; why do you ask?)

Another thing is that I'm trying to get multipath connections working for the first time. We've got two database servers, each of which is connected via dual SAS HBAs to outboard disk arrays. (I don't think anyone else calls them "outboard", but I like the sound of it. See this hard drive? It's outboard, baby!) The arrays are from Sun and come with drivers, but the documentation is confusing: it says it's available for RHEL 5 (aka CentOS 5), but the actual download says it's only for RHEL 4.

As a temporary respite, I'm trying to see if I can get these working using Linux's own multipath daemon, and it's also confusing. The documentation for it is tough to track down, and I just don't understand the different device names: am I meant to put /dev/dm-2 in fstab, or /dev/mpath/mpath2p1? If the latter, why does the name sometimes change to the WWID (/dev/mpath/$(cat /dev/random)) when I restart multipathd? (user_friendly_names is uncommented in the config file.) If the whole point of multipath is failover, why does this sequence:

(where /mnt is where I've got this array mounted, obvs) sometimes work, and sometimes end with "I/O error" being logged, and the filesystem being read-only? Is this the sort of thing that the Sun driver will fix? I can't find anything about this.

And I mentioned electrical problems. When we got our servers installed, the Sun guys told us they'd tripped breakers on the PDU and/or breakers in the room's electrical cabinet. Since it had a sign on it saying "100A", I figured we might be running up against power limits -- either in the room as a whole, if my figures were 'way out, or on individual PDUs. Turns out I was probably wrong: I missed the bit on the sign that said 3-phase, which means (deep breath) we probably have 3 x 100A power available (I think).

It's more complicated than that, because some of it is in 120V, some of it is in twist-lock 220V 30A circuits, and so on. But I should've checked before emailing the faculty member who, in a year or two, will be going into this room (we're there as guests of the department) and happens to sit on the facilities committee. He had asked how we were doing, so I sent him an email -- nice, polite, and including a bit about how grateful we were for the room and the help of the local sysadmins (all of which is true).

I was under the impression that he was asking for info now, so that he could bring it up for action in a few months when we were out. Instead, two hours later when I'm swearing at multipath, in come the facilities manager and one of the sysadmins I was dealing with, looking to find out just how much power we were using anyhow. I apologized profusely, and they were very cool about it. But when the committee guy asks questions, people jump. I had not anticipated this. Welcome to University Politics 101. I emailed again and explained my mistake.

There are lots of remedial courses I could take. However, today I would most like to take "Electricity and wiring for sysadmins".

And on another note: Ack! My laptop's home partition is 93% full! How the hell did that happen?

And again: How did I not know about apt-file? This is perfect!

(Touch o' the hat to Tears For Fears and Steve Kemp; I'm moving closer every day to switching to Chronicle.)

Tags: cfengine, hardware, linux, meta.
Long-term planning
28th January 2009

Another thing I'm trying to do at my new job is make/take more time for long-term planning. I've been dinged by mgt. for this in the past, and while it's not easy to hear I think there has been some validity to this. (My inclination is to concentrate hard on fixing the problems I'm faced with; giving up on something broken, even when doing so would make so much more sense and would free up resources to look for a replacement, just rankles and feels like...well, giving up.) Since the department I'm in is so new, it's even more important to pay attention to this.

Part of the problem is just recognizing that I need to make time. An hour a week to be isolated, and to (say) figure out what I'm going to need to do for the next month, is a habit I'm very consciously trying to adopt.

But another problem is how to keep track of all this. What I've done so far:

So where does that leave me? ATM, (paper planner Cycle) attempting some longer-term project tracking w/org-mode. I figure the TODO bits from org-mode will fit well with the planner, and the flexibility of Emacs and org-mode (different from paper...oh, how I wish I could grep paper) will work well for projects...the records for which should, ideally, be suitable for pasting into wiki-based documentation.

If anyone has any suggestions, please let me know. If I make it to LISA this year, I'll be looking for a BOF about this. (Or maybe I'll just tackle Tom Limoncelli to the ground and holler "I love you, man!" a la "Say Anything".)

Moving on:

And now it is time for bed.

Tags: career, cfengine, emacs, time.
Bacula, gossip, advice
2nd July 2009
This sounds like when I was at my previous employer and they asked if
I could develop a web-based system to take surveys.  I nearly said,
"yes" because, well, I know perl, I know CGI, and I could do it.
However, I was smart enough to say "no, but surveymonkey.com will do
it for cheap."  Best of all it was self-service and the HR person was
able to do it entirely without me.  If I had said I could write such a
program, it would have been days of back-and-forth changes which would
have driven me crazy.  Instead, she was happy to be empowered to do it
herself.  In fact, doing it herself without any help became a feather
in her cap.

The lesson I learned is that "can I do it?" includes "do I want to do
it?".  If I can do something but don't want to, the answer is, "No, I
don't know how" not "I know how but don't want to".  The first makes
you look like you know your limits.  The latter sounds like you are
just being difficult.
Tags: backups, cfengine, reading.
Bad Time Equals LDAP Failure
9th September 2009

Just ran into an interesting problem: after replacing memory on a server, CentOS booting hung at "Starting system message bus..."

So what does dbus have to do with anything? This turned out to be an LDAP failure; dbus was trying to run as UID root, and since the LDAP server couldn't be contacted it hung. Why couldn't the LDAP server be contacted? The LDAP server logs only showed this:

[09/Sep/2009:12:04:32 -0700] conn=41492 op=-1 fd=112 closed - SSL
peer cannot verify your certificate.

The CA cert I use was in place, and another machine had just rebooted w/o problems (all this is taken care of with cfengine, so they were identical in this respect). I could connect to the LDAP server on the right port without any problems.

I finally figured out what was going on when I ran:

openssl s_client -connect ldap.example.com:636 -CApath /path/to/cacert_directory

and saw:

Verify return code: 9 (certificate is not yet valid)

date said it was December 31, 2001. What the what now? ntpdate to set things correctly, then I got:

Verify return code: 0 (ok)

I figure the CMOS clock (or whatever the kids are calling it these days) got reset when we had to remove the CPU daughtercard to get at the memory underneath.
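
For the record, the whole failure boils down to comparing the clock against the certificate's validity window. Here's a minimal sketch of that comparison in Python. The validity dates are made up for illustration (I don't have the real certificate's dates), and the function only mimics the two OpenSSL verify results quoted above; it is not how OpenSSL is implemented.

```python
from datetime import datetime, timezone


def verify_dates(not_before, not_after, now):
    """Classify `now` against a certificate's validity window.

    Returns (code, message) pairs that mirror OpenSSL's verify
    return codes 0, 9 and 10. This is an illustration only.
    """
    if now < not_before:
        return (9, "certificate is not yet valid")
    if now > not_after:
        return (10, "certificate has expired")
    return (0, "ok")


# Hypothetical validity window for the LDAP server's certificate.
NOT_BEFORE = datetime(2009, 1, 1, tzinfo=timezone.utc)
NOT_AFTER = datetime(2011, 1, 1, tzinfo=timezone.utc)

# The clock after the reset, and the clock after ntpdate fixed it.
RESET_CLOCK = datetime(2001, 12, 31, tzinfo=timezone.utc)
FIXED_CLOCK = datetime(2009, 9, 9, 19, 4, tzinfo=timezone.utc)
```

With the reset clock the check lands in the "not yet valid" branch, which is exactly the confusing-looking error the client saw; with the corrected clock it passes.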

And now you know...the rest of the story.

Tags: cfengine, ldap.
chkconfig woes
14th December 2009

Irritating: chkconfig on RHEL/CentOS returns non-zero if a service isn't configured for a runlevel. IOW, you can do:

chkconfig --level 3 foo

and have 0 returned if it's on, 1 if it's not.

But not SuSE; nope, it just returns 0 whether or not it's enabled, or even if the service itself doesn't exist. Because, you know, grep doesn't get used enough.

I'm doing this because I'm trying to use cfengine 2 to manage services. This works well in CentOS, where you can add something like:

service_foo_on = (ReturnsZero("/sbin/chkconfig --level 3 foo"))

and it'll work. ("service_foo_on" is a bit of a misnomer, because I'm checking runlevels, not whether it's actually running.)
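
If the exit code can't be trusted, the fallback is to parse the listing instead. A rough sketch in Python, assuming the usual `chkconfig --list foo` layout of a service name followed by `runlevel:state` fields; the sample line is invented, and real output may differ between distributions.

```python
def runlevel_enabled(list_line, runlevel):
    """Return True if a `chkconfig --list <service>` line shows `runlevel` as on."""
    fields = list_line.split()
    # fields[0] is the service name; the rest should look like "3:on".
    states = dict(field.split(":", 1) for field in fields[1:] if ":" in field)
    return states.get(str(runlevel)) == "on"


# Invented sample line in the assumed format.
SAMPLE = "foo  0:off  1:off  2:on  3:on  4:on  5:on  6:off"
```

An empty or unparseable line simply counts as "not enabled", which is the conservative answer for a service that doesn't exist.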

Update: Nope, I'm wrong. chkconfig --check does exactly what I want. Many thanks to yaloki on #openSUSE-server for the help.

Tags: cfengine, opensuse, packagemanagement.
Xmas maintenance
31st December 2009

A nice thing about working at a university is that you get all this time off at Xmas; however, it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.

Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS' ATS, but discovered that NUT is different from APCUPSD in one important way: it doesn't easily allow you to say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.

It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesitant when I estimate how long something will take, because I feel like I have no way of knowing in advance (interruptions, unexpected obstacles...you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer that I needed.

One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. Upon closer inspection, they were reporting problems with half of the LUNs that the array was presenting to them -- and they were reporting them in different ways. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)

The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But:

  1. You have to rebuild the drivers after a kernel change.
  2. This only showed up on two servers because the third server had not upgraded its kernel (or indeed, any of its packages). Why? cfservd had refused its connection because I had the MaxConnections parameter too low.
  3. And of the two that did upgrade, the one machine we'd tested the Linux drivers on still had an old multipath.conf file in /etc, which, even though the multipathd service wasn't starting up, was enough to get drivers loaded. This took a while to figure out because I'd completely forgotten how to tell which driver was in use.

I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)

Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.

(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)

Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.

And best of 2010 to all of you!

Tags: centos, cfengine, monitoring, packagemanagement, serverroom, upgrades, work.
Presentation done
8th December 2010

My presentation on Cfengine 3 went pretty well yesterday. There were about 20 people there...I had been hoping for more, but that's a pretty good turnout. I was a little nervous beforehand, but I think I did okay during the talk. (I recorded it -- partly to review afterward, partly 'cos my dad wanted to hear the talk. :-)

One thing that did trip me up a bit was a run of questions from one person in the audience that went fairly deep into how to use Cfengine, what its requirements were and so on. Since this was meant to be an introduction and I only had an hour, I wasn't prepared for this. Also, the questions went on...and on...and I'm not good at taking charge of a conversation to prevent it being hijacked. The questions were good, and though he and I disagree on this subject I respect his views. It's just that it really threw off my timing, and would have been best left for after. Any tips?

At some point I'm going to put up more on Cf3 that I couldn't really get into in the talk -- how it compares to Cf2, some of the (IMHO) shortfalls, and so on.

Tags: cfengine.
Xmas Maintenance 2010: Lessons learned
11th January 2011

Xmas vacation is when I get to do big, disruptive maintenance with a fairly free hand. Here's some of what I did and what I learned this year.

Order of rebooting

I made the mistake of rebooting one machine first: the one that held the local CentOS mirror. I did this thinking that it would be a good guinea pig, but then other machines weren't able to fetch updates from it; I had to edit their repo files. Worse, there was no remote console on it, and no time (I thought) to take a look.

Automating patching

Last year I tried getting machines to upgrade using Cfengine like so:

centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
          "/usr/bin/yum -q -y clean all"
          "/usr/bin/yum -q -y upgrade"
          "/usr/bin/reboot"

This didn't work well: I hadn't pushed out the changes in advance, because I was paranoid that I'd miss something. When I did push it out, all the machines hit on the cfserver at the same time (more or less) and didn't get the updated files because the server was refusing connections. I ended up doing it by hand.
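
The class line in that snippet is just a big AND: every dot-separated class has to be defined at the moment the agent runs. Here's a small Python sketch of that evaluation. It's a simplification that only handles the dot operator, and the exact formatting of the time classes (two-digit hours, unpadded days) is my assumption based on the names used above.

```python
from datetime import datetime

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def time_classes(when):
    """Build the handful of time classes used in the snippet above."""
    return {
        f"Hr{when.hour:02d}",
        f"Day{when.day}",
        MONTHS[when.month - 1],
        f"Yr{when.year}",
    }


def all_defined(expression, defined):
    """Evaluate a dot-separated (AND-only) class expression."""
    return all(part in defined for part in expression.split("."))


EXPRESSION = "centos.some_group_of_servers.Hr14.Day29.December.Yr2009"
```

The practical upshot: the window is exactly one hour long, so any machine that hasn't fetched the updated policy before 14:00 on the day never runs the commands at all.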

This year I pushed out the changes in advance, but it still didn't work because of the problems with the repo. I ran cssh, edited the repos file and updated by hand.

This worked okay, but I had to do the machines in separate batches -- some needed to have their firewall tweaked to let them reach a mirror in the first place, some I wanted to watch more carefully, and so on. That meant going through a list of machines, trying to figure out if I'd missed any, adding them by hand to cssh sessions, and so on.

I may need to give in and look at RHEL, or perhaps func or better Cfengine tweaking will do the job.

Staggering reboots

Quick and dirty way to make sure you don't overload your PDUs:

sleep $(expr $RANDOM / 200 ) && reboot
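
Since bash's $RANDOM runs from 0 to 32767, dividing by 200 gives a delay of somewhere between 0 and 163 seconds. If you'd rather have a delay that's the same every time for a given machine (handy when you want to know in advance who reboots when), a hash of the hostname does the job. A sketch in Python; the function name and the hostnames in the usage note are made up.

```python
import hashlib


def reboot_delay(hostname, max_delay=163):
    """Map a hostname to a repeatable delay in seconds, 0..max_delay inclusive."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # The first four bytes are plenty to spread hosts across the window.
    return int.from_bytes(digest[:4], "big") % (max_delay + 1)
```

Calling `reboot_delay("db1.example.com")` always returns the same number for that name, so two runs of the script stagger the machines identically. It does nothing to stop two hosts from landing on the same second; it only spreads them out on average.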

Remote consoles

Rebooting one server took a long time because the ILOM was not working well, and had to be rebooted itself.

Upgrading the database servers w/the 3 TB arrays took a long time: stock MySQL packages conflicted with the official MySQL rpms, and fscking the arrays takes maybe an hour -- and there's no sign of life on the console while you're doing it. Problems with one machine's ILOM meant I couldn't even get a console for it.

OpenSuSE

Holy mother of god, what an awful time this was. I spent eight hours on upgrades for just nine desktop machines. Sadly, most of it was my fault, or at least bad configuration:

Special machines

These machines run some scientific software: one master, three slaves. When the master starts up at boot time, it tries to SSH to the slaves to copy over the binary. There appears to be no, or poor, rate throttling; if the slaves are not available when the master comes up, you end up with the following symptoms:

The problem is that umpty scp processes on the slave are holding open the binary, and the kernel gets confused trying to run it.

I also ran into problems with a duff cable on the master; confusingly, both the kernel and the switch said it was still up. This took a while to track down.

Virtual Machines

It turned out that a couple of my kvm-based VMs did not have jumbo frames turned on. I had to use virt-manager to shut down the machines, turn on virtio on the drivers, then reboot. However, kudzu on the VMs then saw these as new interfaces and did not configure them correctly. This caused problems because the machines were LDAP clients and hung when the network was unavailable.

Tags: cfengine, jumboframes, mysql, rant, toptip, work.
Cfengine 3: copying config files for services
28th January 2011

At $work I'm migrating slowly to Cfengine 3. One of the attractions is the ability to do what this page shows: loop over lists in a Cf-ish kind of way.

Here's the first bundle. (It's pretty much stolen from that page, but customized for my environment.) It tells you some basic details about the config file, the process name and the restart command for different daemons:

bundle common services {
  vars:
    redhat|centos::
      "cfg_file_prefix" string => "centos/5";

      "cfg_file[ssh]" string => "/etc/ssh/sshd_config";
      "daemon[ssh]"   string => "sshd";
      "start[ssh]"    string => "/sbin/service sshd restart";
      "enable[ssh]"   string => "/sbin/chkconfig sshd on";

      "cfg_file[iptables]" string => "/etc/sysconfig/iptables";
      "start[iptables]"    string => "/sbin/service iptables restart";
      "enable[iptables]"       string => "/sbin/chkconfig iptables on";
}

Here's the bundle that copies config files and restarts the daemon if necessary:

bundle agent fix_service(service) {
  files:
    "$(services.cfg_file[$(service)])"
      copy_from => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
      perms => mog("0600","root","root"),
      classes => if_repaired("$(service)_restart"),
      comment => "Copy a stock configuration file template from repository";

  processes:
    "$(services.daemon[$(service)])"
      comment => "Check that the server process is running, and start if necessary",
      restart_class => canonify("$(service)_restart");

  commands:
    "$(services.start[$(service)])"
      comment => "Method for starting this service",
      ifvarclass => canonify("$(service)_restart");

    "$(services.enable[$(service)])"
      comment => "Method for enabling this service",
      ifvarclass => canonify("$(service)_restart");
}

And here's the loop that puts it all together:

bundle agent redhat {
  vars:
    "service" slist => { "ssh", "iptables" };

  methods:
    "any" usebundle => fix_service("$(service)"),
      comment => "Make sure the basic application services are running";

}
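
If the Cf3 is hard to follow, the copy-and-flag part of the flow looks roughly like this in Python. This is only a sketch of the idea (copy a file when its contents differ, remember which services were touched); the function names are mine, and it leaves out permissions, the remote copy and the process check entirely.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Hash a file's contents so two copies can be compared."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def copy_if_different(source, destination):
    """Copy source over destination only when contents differ.

    Returns True when a copy was made (the "repaired" case).
    """
    source, destination = Path(source), Path(destination)
    if destination.exists() and file_digest(source) == file_digest(destination):
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    return True


def services_to_restart(cfg_files, master_root, target_root):
    """Loop over services, collecting those whose config file was replaced."""
    restart = []
    for name, cfg_file in cfg_files.items():
        relative = cfg_file.lstrip("/")
        if copy_if_different(Path(master_root) / relative, Path(target_root) / relative):
            restart.append(name)
    return restart
```

Run it twice against the same master directory and the second run returns an empty list, which is the behaviour you want from the real thing too.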

I ran into a problem with this, though: it would always, without fail, restart iptables even though no config file had been copied. The problem was with the process check: there's no process to check for with iptables. And from what I can tell, when the processes stanza was asked to check for a non-existent variable, it checked for the literal string $(services.daemon[$(service)]) -- that is, dollar-bracket-s-e-r-v-.... Since there was no such thing, it decided it needed restarting.

The way around this was to add this variable to the services bundle (the one that has all the info about the daemons):

"daemon[iptables]" string => "cf_null";

I also had to modify the processes stanza:

processes:
  "$(services.daemon[$(service)])"
  comment => "Check that the server process is running, and start if necessary",
  restart_class => canonify("$(service)_restart"),
  ifvarclass => canonify("$(services.daemon[$(service)])");

That ifvarclass check on the last line says to run iff there is a value for daemon. cf_null is a NULL value special to cfengine. Since the check fails for iptables, the process check isn't run and we only restart if we copy over a new config file.
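
Here's the failure and the fix modelled in Python, going by my reading of what happened above. None here stands in for "no daemon to check"; it is not a claim about how cf_null or ifvarclass work inside Cfengine, and the function names are made up.

```python
def needs_restart_unguarded(service, daemons, running, config_copied):
    """First version: a missing daemon name turns into literal, unexpanded text."""
    pattern = daemons.get(service, "$(services.daemon[$(service)])")
    # No process is ever called "$(services.daemon[...", so this is always "not running".
    return config_copied or pattern not in running


def needs_restart_guarded(service, daemons, running, config_copied):
    """Fixed version: skip the process check when there is no daemon."""
    daemon = daemons.get(service)
    if daemon is None:
        return config_copied
    return config_copied or daemon not in running


DAEMONS = {"ssh": "sshd"}    # iptables has no daemon entry
RUNNING = {"sshd", "crond"}  # pretend process table
```

With no config change, the unguarded check wants to restart iptables every single time, and the guarded one leaves it alone.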

Tags: cfengine.
Well, which one would YOU pick?
24th August 2011

At work, I'm about to open up the Rocks cluster to production, or at least beta. I'm finally setting up the attached disk array, along with home directories and quotas, and I've just bumped into an unsettled question:

How the hell do I manage this machine?

On our other servers, I use Cfengine. It's a mix of version 2 and 3, but I'm migrating to 3. I've used Cf3 on the front end of the cluster semi-regularly, and by hand, to set things like LDAP membership, automount, and so on -- basically, to install or modify files and make sure I've got the packages I want. Unlike the other machines, I'm not using cfexecd to run Cf3 continuously.

The assumption behind Cf3 and other configuration management tools -- at least in my mind -- is that if you're doing it once, you'll want to do it again. (Of course, there's also stuff like convergence, distributed management and resisting change, but leave that for now.) This has been a big help, because the changes I needed to apply to the Rocks FE were mostly duplicates of my usual setup.

If/when I change jobs/get hit by a bus, I've made it abundantly clear in my documentation that Cfengine is The Way I Do Things. For a variety of reasons, I think I'm fairly safe in the assumption that Cf3 will not be too hard for a successor to pick up. If someone wants to change it afterward, fine, but at least they know where to start.

OTOH, Rocks has the idea of a "Restore Roll" -- essentially a package you install on a new frontend (after the old one has burned down, say) to reinstall all the files you've customized. You can edit a particular file that creates this roll, and ask it to include more files. Edited /etc/bashrc? Add it to the list.

I think the assumption behind the Restore Roll is that, really, you set up a new FE once every N years -- that a working FE is the result of rare and precious work. The resulting configuration, like the hardware it rests on, is a unique gem. Replacing it is going to be a pain, no matter what you do. There aren't that many Rocks developers, and making it Really, Really Frickin' Nice is probably a waste of their time.

(I also think it fits in with the rest of Rocks, which seems like some really nice bits surrounded by furiously undocumented hacks and workarounds. But I'm probably just annoyed at YET ANOTHER UNDOCUMENTED SET OF HACKS AND WORKAROUNDS.)

And so you have both a number of places where you can list files to be restored, and an amusing uncertainty about whether the whole mechanism works:

I found that after a re-install of Rocks 5.0.3, not all the files I asked for were restored! I suspect it has to do with the order things get installed.

So now I'm torn.

Do I stick with Cf3? I haven't mentioned my unhappiness with its obtuseness and some poor choices in the language (nine positional arguments for a function? WTF?). I'm familiar with it because I've really dived into it and taken a course at LISA from Mark Burgess his own bad self, but it's taken a while to get here. But it is the way I do just about everything else.

Or do I use the Rocks Restore Roll mechanism? Considered on its own, it's the least surprising option for a successor or fill-in. I just wish I could be sure it would work, and I'm annoyed that I'd have to duplicate much of the effort I've put into Cf3.

Gah. What a mess.

Tags: cfengine, rant, rocks.
New workstation
12th January 2012

I've got a new workstation at $WORK. (Well, where else would it be?) It's pretty sweet: i7 quad-core processor, clock speed > 3GHz (honestly, I barely keep track anymore), and 8GB of RAM. 8GB! Insane.

When I arrived in 2008, I used a -- not cast-off, but unused P4 with 4 GB of RAM. I didn't want to make a big fuss about it; I saved the fuss, instead, for a nice business laptop from Dell that worked well with Linux. Since 90% of my work is Firefox + Emacs + XTerms, and my WM of choice at the moment is Awesome, speed was not a problem and the memory was fine.

Lately, though, I've discovered Vagrant. It looks pretty sweet, but my current machine is sloooow when I try to run a couple of VMs. (So's my laptop, despite a better processor; I suspect the 5400RPM drive.) I'm hoping that the new machine will make a big difference.

Just gotta install Ubuntu and move stuff over. Fortunately I've been pretty good about keeping my machine config in Cfengine, so that'll help. And then build some VMs. I'm always surprised at people who feel comfortable downloading random VM images from the Internet. Yeah, it's probably okay...but how do you know?

One thing that Vagrant is missing is integration with Cfengine. Fortunately, the documentation for extending it seems pretty good (plus, I can always kick things off with a shell script). This might be an excuse to learn Ruby.

Tags: cfengine, hardware, virtualization.
Cfengine 3 error: Redefinition of body "control" for "common" is a broken promise, near token '{'
19th January 2012

I tripped across this error today with Cfengine 3:

cf3:./inputs/promises.cf:1,22: Redefinition of body "control" for "common" is a broken promise, near token '{'

The weird thing was this was a stripped down promises.cf, and I could not figure out why it was complaining about redefinitions. I finally found the error:

body common control {
bundlesequence => { "test" };
inputs => { "promises.cf", "cfengine_stdlib.cf" };
}

Yep, including the promises.cf file itself in the inputs section borked everything; removing it fixed things right away.
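
A check like this would have saved me some head-scratching. It's a Python sketch that looks for entries in the inputs list that point back at the policy file itself; it assumes inputs are resolved relative to the directory holding the policy file, which I haven't verified against Cfengine's actual lookup rules.

```python
import os


def self_including_inputs(policy_file, inputs):
    """Return the entries of `inputs` that resolve to `policy_file` itself."""
    own_path = os.path.abspath(policy_file)
    base_dir = os.path.dirname(own_path)
    return [
        entry
        for entry in inputs
        if os.path.abspath(os.path.join(base_dir, entry)) == own_path
    ]
```

For the file above, `self_including_inputs("./inputs/promises.cf", ["promises.cf", "cfengine_stdlib.cf"])` flags the first entry and leaves the library alone.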

Tags: cfengine.
Cfengine 3 and SELinux
19th January 2012
Tags: cfengine, selinux.
PPD changes in Oneiric
20th January 2012

In Cfengine3, I had been setting up printers for people using lpadmin commands. Among other things, it used a particular PPD file for the local HP printer. It turns out that in Oneiric, those files are no longer present, or even available; judging by what I found on my laptop, the PPD file is (I think) generated automagically by /usr/share/cups/ppd-updaters/hplip-cups.

It's possible that I could figure this out for my new workstation. But right now, I don't think I can be bothered. I'm going to just set this up by hand, and hope that either I'll get a print server or I'll figure it out.

Tags: cfengine, ubuntu.
Cfengine 3 syntax
23rd January 2012

Cfengine 3 has a lot of things going for it. But its syntax is not one of them.

Consider this situation: you have CentOS machines, SuSE machines and Solaris machines. All of them should run, say, SSH, NTP and Apache (why not?). The files are slightly different between them, and so is the method of starting/stopping/enabling services, but mostly we're doing the same thing.

I've got a bundle in Cfengine that looks like this:

bundle common services {
  vars:
    redhat|centos::
      "cfg_file_prefix"     string => "centos/5";

      "cfg_file[httpd]"     string => "/etc/httpd/conf/httpd.conf";
      "daemon[httpd]"       string => "httpd";
      "start[httpd]"        string => "/sbin/service httpd start";
      "enable[httpd]"       string => "/sbin/chkconfig httpd on";

      "cfg_file[ssh]"       string => "/etc/ssh/sshd_config";
      "daemon[ssh]"         string => "sshd";
      "start[ssh]"          string => "/sbin/service sshd restart";
      "enable[ssh]"         string => "/sbin/chkconfig sshd on";

...and so on. We're basically setting up four hashes -- cfg_file, daemon, start and enable -- and populating them with the appropriate entries for Red Hat/CentOS ssh and Apache configs; you can imagine slightly different entries for Solaris and SuSE. The cfg_file_prefix allows me to put CentOS' config files in a separate directory from those of other OSes.

Then there's this bundle:

bundle agent fix_service(service) {
  files:
    "$(services.cfg_file[$(service)])"
      copy_from     => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
      classes       => if_repaired("$(service)_restart"),
      comment       => "Copy a stock configuration file template from repository";

  processes:
    "$(services.daemon[$(service)])"
      comment       => "Check that the server process is running, and start if necessary",
      restart_class => canonify("$(service)_restart"),
      ifvarclass    => canonify("$(services.daemon[$(service)])");

  commands:
    "$(services.start[$(service)])"
      comment       => "Method for starting this service",
      ifvarclass    => canonify("$(service)_restart");

    "$(services.enable[$(service)])"
      comment       => "Method for enabling this service",
      ifvarclass    => canonify("$(service)_restart");
}

This bundle takes a service name as an argument, and assigns it to the local variable "service". It copies the OS-and-service-appropriate config file into place if it needs to, and enables/starts the service if it needs to. How does it know if it needs to? By setting the class "$(service)_restart" if the service isn't running, or if the config file had to be copied.

So far, so good. Well, except for the mess of brackets. All those hashes are in the services bundle, so you need to be explicit about the scope. (There are provisions for global variables, but I've kept my use of 'em to a minimum.) And so what in Perl would be, say:

$services->{start}{$service}

becomes

"$(services.start[$(service)])"

Square brackets for the hash, round brackets for the string (and to indicate that you're using a variable -- IOW, it's "$(variable)", not "$variable" like you're used to), and dots to indicate scope ("services.start" == the start variable in the services bundle).
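
To convince myself I understood the nesting, here's a toy expander in Python that resolves references inside-out, innermost `$(...)` first. It is only a sketch of the reading described above: the scopes are plain dicts, the bundle and variable names come from the examples in this post, and it raises KeyError on anything it can't resolve, which is not what Cfengine does.

```python
import re

# An innermost reference: $( ... ) with no nested $( ) inside it.
INNERMOST = re.compile(r"\$\(([^()$]+)\)")
# Optional "scope." prefix, a name, and an optional [key].
REF = re.compile(r"^(?:(\w+)\.)?(\w+)(?:\[(\w+)\])?$")


def lookup(ref, scopes, local_scope):
    """Resolve one reference such as 'service' or 'services.start[ssh]'."""
    match = REF.match(ref)
    if match is None:
        raise KeyError(ref)
    scope, name, key = match.groups()
    value = scopes[scope or local_scope][name]
    return value[key] if key is not None else value


def expand(text, scopes, local_scope):
    """Repeatedly replace the innermost $(...) until none are left."""
    while True:
        match = INNERMOST.search(text)
        if match is None:
            return text
        value = str(lookup(match.group(1), scopes, local_scope))
        text = text[:match.start()] + value + text[match.end():]


SCOPES = {
    "services": {"start": {"ssh": "/sbin/service sshd restart"}},
    "fix_service": {"service": "ssh"},
}
```

Expanding `"$(services.start[$(service)])"` in the `fix_service` scope first turns `$(service)` into `ssh`, then looks up `start[ssh]` in the `services` scope.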

It's...well, it's an ugly mess o' brackets. But I can deal with that. And this arrangement/pattern, which came from the Cfengine documentation itself, has been pretty helpful to me for dealing with single config file services.

But what about the case where a service has more than one config file? Like autofs: you gotta copy around a map file but in SuSE you also need /etc/sysconfig/autofs to set the LDAP variables.

Again, in Perl this would be an anonymous array on top of a hash -- something like:

$services->{cfg_file}{"autofs"}[0] = "/etc/auto.master";
$services->{cfg_file}{"autofs"}[1] = "/etc/sysconfig/autofs";

and you'd walk it like so:

foreach my $i (@{ $services->{cfg_file}{"autofs"} }) { ... }  # something with $i

or even:

for (@{ $services->{cfg_file}{"autofs"} }) { ... }  # something with $_

(I think...I'm embarrassed sometimes at how rusty my Perl is.)

In Cfengine, you pile an anonymous array on top of a hash like so:

  "cfg_file[autofs]" slist => { "/etc/auto.master", "/etc/sysconfig/autofs" };

An slist is a list of strings. All right, fine; different layout, same idea, stick it in the services bundle and away we go. But: remote scalars can be referenced; remote lists cannot without gymnastics. From the docs:

During list expansion, only local lists can be expanded, thus global list references have to be mapped into a local context if you want to use them for iteration. Instead of doing this in some arbitrary way, with possibility of name collisions, cfengine asks you to make this explicit. There are two possible approaches.

The first of those two approaches is, I think, passing the list as a parameter, whereupon it just works? maybe? (It's a not-so-minor nitpick that there are lots of examples in the Cf3 handbook that are not explained and don't make much sense. They apparently work, but how is not at all clear, or discernible.) I think it's meant to be like Perl's let's-flatten-everything-into-a-list approach to passing variables.

The second is to just go ahead and redeclare the remote slist (array) as a local one that's set to the remote value. Again, from the docs:

bundle common va {
  vars:
   "tmpdirs"  slist => { "/tmp", "/var/tmp", "/usr/tmp"  };
}

bundle agent hardening {
  classes:
    "ok" expression => "any";

  vars:
   "other"    slist => { "/tmp", "/var/tmp" };
   "x"        slist => { @(va.tmpdirs) };

  reports:
    ok::
      "Do $(x)";
      "Other: $(other)";
}

which makes this prelude to all of that handwaving even more irritating:

Instead of doing this in some arbitrary way, with possibility of name collisions...

...

...I mean...

...I mean, what is the point of requiring explicit paths to variables in other scopes if you're just going to insert random speedbumps to assuage needless worries about name collisions? What the hell is with this let's-redeclare-it-AGAIN approach?

The rage, it fills me.

Did you just tell me to go fuck myself?

Tags: cfengine, didyoujusttellmetogofuckmyself.
Cfengine 3 Syntax Part II
24th January 2012

Mark Burgess was kind enough to respond to my earlier post about Cfengine syntax:

markburgess_osl: @saintaardvark (soothing) Syntax is definitely an acquired taste (re perl ;)). The list-ref prob can go away soon. Think doc not code 4 cf3

And then, via tweetsification, we were all like:

saintaardvark: .@markburgess_osl Heh, thanks for the reply -- I was going to ask you about this. Fair pt re: syntax being an acquired taste...[1/2]

saintaardvark: .@markburgess_osl ...but any chance the mess of brackets will be reduced? [2/2]

markburgess_osl: @saintaardvark trade one set of () for -> Don't see much point in that. $() has long precedence in sh / make etc. It delimits clearly in txt

saintaardvark: .@markburgess_osl Fair enough, but I'm also thinking of eg "$(services.cfg_file[$(service)])": dollar bracket scopedot square dollar bracket

markburgess_osl: @saintaardvark I agree it's clumsy, but it's also an edge case. You rarely write this if you make good use of patterns. Perl also ugly here.

But this layout came from their own dang documentation! I feel like I'm stuck here:

[old entry recovered from backup!]

That last point: what I mean is that the whole appeal of that layout (pattern/whatever) was that you could just say fix_service('foo'), and The Right Thing(tm) would happen. Now I have to rethink this; it seems to mean either having lots of bundles like "fix_ntp", "fix_autofs", etc -- with lots of sections like:

vars:
  SuSE::
    "files" slist => {"this", "that"};
  Centos::
    "files" string => "just_this";

...or else having separate "fix_service" bundles for each class. (Forgive me, I'm thinking about all this w/o having a Cf3 instance to play with in front of me.)

I'm trying not to sound whiny here; I'm grateful for Cf3, for the documentation (which is pretty extensive), and that Mark took the time to respond. But this is frustrating.

Tags: cfengine.
Cfengine 3 Syntax Part III (or, How to debate smart people in 140 character snippets)
9th February 2012

More conversations with Mark Burgess via Twitter (a continuation from here). I should note that this was all a week or so ago now; I've been meaning to put this up here.

markburgess_osl: @saintaardvark Doc is "what" code is "how". I believe the lasting intention comes before a specific implementation. #devops #sysadmin

saintaardvark: .@markburgess_osl Hm. So let's see if I've got this right: the programmer in me notices lots of overlap in my Cf3 config...

saintaardvark: .@markburgess_osl ...and wants to consolidate. Cf3 syntax makes this a hairy proposition at best. But this is not really a problem...

saintaardvark: .@markburgess_osl ...because I should be thinking about this as documentation (which can be long) of the desired system state...

saintaardvark: .@markburgess_osl ...rather than code (where the drive is for efficiency and lack of duplication). Have I got that right? #sysadmin

markburgess_osl: @saintaardvark Documentation => focus on end state (like GPS), Code => focus on start state + directions. The journey is irrelevant.

markburgess_osl: @saintaardvark Docs also improved by seeing themes and patterns. That is still WHAT not HOW. So no contradiction.

So putting this in practical (can't resist the temptation to say "less Yoda-like") terms: what I think he's saying is, don't worry about code duplication or getting clever; you're documenting desired system state, and it's okay to be verbose.

Using the example I started with, it's okay to have NTP settings in multiple places (because SuSE needs two files, Solaris 1, etc). The coder in me wants to clean those up because it's all NTP, but the documentationist ("writers", I think they're called) relaxes and says "Can't have too much documentation." Which is fair.

But then I worry about having Multiple Sources of Truth(tm). The advantage of the first setup is that when I change the NTP server, it's ALL in one place; in the second setup, I have to remember: did I change it for SuSE? Solaris? CentOS? I've learned the hard way to be wary of such setups. I nearly always miss something; that's why I'm aggressive about consolidating.

I'm still mulling all this over.

Tags: cfengine.
No questions, please, we're Cfengine
19th September 2012

Over the last two days, in a frenzy of activity, I got some awesome done at work: using git and Vagrant, I finally got Cfengine to install packages in Ubuntu without asking me any goram questions. There were two bits involved:

Now: Fully automated package installation FTMFW.

And did you know that Emacs can check your laptop battery status? I didn't.

Tags: cfengine, git, packagemanagement, ubuntu, vagrant.
Forensic accounting with Cfengine 3
1st November 2012

Just had a dream where I'd been called into Sun, just before Oracle's takeover, to figure out why they were spending so much money on eyeglasses for employees. "We think it's part of their benefits, but our accounting department doesn't have a separate line item for it," someone explained. My eyebrows lifted in disbelief. "Well, then, it's damned lucky for you I've got Cfengine."

Tags: cfengine, wtf.
Invoking Cfengine from Nagios
21st November 2012

Nagios and Cf3 each have their strengths:

Nagios plugins, frankly, are hard to duplicate in Cfengine. Check out this Cf3 implementation of a web server check:

bundle agent check_tcp_response {
  vars:
    "read_web_srv_response" string  => readtcp("php.net", "80", "GET /manual/en/index.php HTTP/1.1$(const.r)$(const.n)Host: php.net$(const.r)$(const.n)$(const.r)$(const.n)", 60);

  classes:
    "expectedResponse" expression   => regcmp(".*200 OK.*\n.*", "$(read_web_srv_response)");

  reports:
    !expectedResponse::
      "Something is wrong with php.net - see for yourself: $(read_web_srv_response)";

}

That simply does not compare with this Nagios stanza:

define service{
    use                             local-service         ; Name of service template to use
    hostgroup_name                  http-servers
    service_description             HTTP
    check_command                   check_http
}
define command{
    command_name                    check_http
    command_line                    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
}

My idea, which I totally stole from this article, was to invoke Cfengine from Nagios when necessary, and let Cf3 restart the service. Example: I've got this one service that monitors a disk array for faults. It's flaky, and needs to be restarted when it stops responding. I've already got a check for the service in Nagios, so I added an event handler:

define service{
    use                             local-service         ; Name of service template to use
    host_name                       diskarray-mon
    service_description             diskarray-mon website
    check_command                   check_http!-H diskmon.example.com -S -u /login.html
    event_handler                   invoke_cfrunagent
}
define command{
    command_name invoke_cfrunagent
    command_line $USER2$/invoke_cfrunagent.sh  -n "$SERVICEDESC$" -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -a $HOSTADDRESS$
}

Leaving out some getopt() stuff, invoke_cfrunagent.sh looks like this:

# Convert "diskarray-mon website" to "diskarray-mon_website" (bash syntax; // replaces every space):
SVC=${SVC// /_}
STATE="nagios_$STATE"
TYPE="nagios_$TYPE"

# Debugging
echo "About to run sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE" | /usr/bin/logger
# We allow this in sudoers:
sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE
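
The getopt() stuff left out above might look something like the following. This is only a sketch, not the original script: it assumes the four flags from the Nagios command definition (-n, -s, -t, -a), and the function names parse_args and make_classes, plus the ADDR variable, are made up for illustration. It uses the shell's built-in getopts rather than getopt(1).

```shell
#!/bin/sh
# Sketch only: option parsing for a script like invoke_cfrunagent.sh.
# parse_args, make_classes and ADDR are hypothetical names; the flags mirror
# the Nagios command_line (-n service, -s state, -t state type, -a address).

parse_args() {
    OPTIND=1    # reset so the function can be called more than once
    while getopts "n:s:t:a:" opt; do
        case "$opt" in
            n) SVC=$OPTARG ;;
            s) STATE=$OPTARG ;;
            t) TYPE=$OPTARG ;;
            a) ADDR=$OPTARG ;;
            *) echo "usage: $0 -n service -s state -t statetype -a address" >&2
               return 1 ;;
        esac
    done
}

# Turn the parsed values into names suitable for passing with -D.
make_classes() {
    # tr replaces every space and does not need bash's pattern substitution
    SVC=$(printf '%s' "$SVC" | tr ' ' '_')
    STATE="nagios_$STATE"
    TYPE="nagios_$TYPE"
}

# Only do anything when arguments were actually passed.
if [ $# -gt 0 ]; then
    parse_args "$@" && make_classes
fi
```

Called as in the event handler, with -n "diskarray-mon website" -s CRITICAL -t HARD, this leaves SVC, STATE and TYPE ready for the cf-runagent line.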

cf-runagent is a request, not an order, to the running cf-serverd process to fulfill already-configured processes; it's like saying "If you don't mind, could you please run now?"

Finally, this was to be detected in Cf3 like so:

  methods:
    diskarray-mon_website.nagios_CRITICAL.nagios_HARD::
      "Restart the diskarray monitoring service" usebundle => restart_diskarray_monitor();

(This stanza is in a bundle that I know is called on the disk array monitor.)

Here's what works:

What doesn't work:

What might work better is using this Cf3 wrapper for Nagios plugins (which I think is the same approach, or possibly code, discussed in this mailing list post).

Anyhow...This is a sort of half-assed attempt in a morning to get something working. Not there yet.

Tags: cfengine, nagios.
Deploying SELinux modules from Cfengine
23rd November 2012

Back in January, yo, I wrote about trying to figure out how to use Cfengine3 to do SELinux tasks; one of those was pushing out SELinux modules. These are encapsulated bits of policy, usually generated by piping SELinux logs to the audit2allow command. audit2allow usually makes two files: a source file that's human-readable, and a sorta-compiled version that's actually loaded by semodule.

So how do you deploy this sort of thing on multiple machines? One option would be to copy around the compiled module...but while that's technically possible, the SELinux developers don't guarantee it'll work (link lost, sorry). The better way is to copy around the source file, compile it, and then load it.

SANSNOC used this approach in puppet. I contacted them to ask if it was okay for me to copy their approach/translate their code to Cf3, and they said go for it. Here's my implementation:

bundle agent add_selinux_module(module) {
  # This whole approach copied/ported from the SANS Institute's puppet modules:
  # https://github.com/sansnoc/puppet
   files:
     centos::
       "/etc/selinux/local/."
         comment        => "Create local SELinux directory for modules, etc.",
         create         => "true",
         perms          => mog("700", "root", "root");

       "/etc/selinux/local/$(module).te"
         comment        => "Copy over module source.",
         copy_from      => secure_cp("$(g.masterfiles)/centos/5/etc/selinux/local/$(module).te", "$(g.masterserver)"),
         perms          => mog("440", "root", "root"),
         classes        => if_repaired("rebuild_$(module)");

       "/etc/selinux/local/setup.cf3_template"
         comment        => "Copy over setup script template.",
         copy_from      => secure_cp("$(g.masterfiles)/centos/5/etc/selinux/local/setup.cf3_template", "$(g.masterserver)"),
         perms          => mog("750", "root", "root"),
         classes        => if_repaired("rebuild_$(module)");

       "/etc/selinux/local/$(module)-setup.sh"
         comment        => "Create setup script. FIXME: This was easily done in one step in Puppet, and may be stupid for Cf3.",
         create         => "true",
         edit_line      => expand_template("/etc/selinux/local/setup.cf3_template"),
         perms          => mog("750", "root", "root"),
         edit_defaults  => empty,
         classes        => if_repaired("rebuild_$(module)");


  commands:
    centos::
      "/etc/selinux/local/$(module)-setup.sh"
        comment         => "Actually rebuild module.",
        ifvarclass      => canonify("rebuild_$(module)");
}

Here's how I invoke it as part of setting up a mail server:

bundle agent mail_server {
  vars:
    centos::
      "selinux_mailserver_modules" slist => { "postfixpipe",
                                              "dovecotdeliver" };

  methods:
    centos.selinux_on::
      "Add mail server SELinux modules" usebundle => add_selinux_module("$(selinux_mailserver_modules)");
}

(Yes, that really is all I do as part of setting up a mail server. Why do you ask? :-) )

So in the add_selinux_module bundle, a directory is created for local modules. The module source code, named after the module itself, is copied over, and a setup script created from a Cf3 template. The setup template looks like this:

#!/bin/sh
# This file is configured by cfengine.  Any local changes will be overwritten!
#
# Note that with template files, the variable needs to be referenced
# like so:
#
#   $(bundle_name.variable_name)

# Where to store selinux related files
SOURCE=/etc/selinux/local
BUILD=/etc/selinux/local

/usr/bin/checkmodule -M -m -o ${BUILD}/$(add_selinux_module.module).mod ${SOURCE}/$(add_selinux_module.module).te
/usr/bin/semodule_package -o ${BUILD}/$(add_selinux_module.module).pp -m ${BUILD}/$(add_selinux_module.module).mod
/usr/sbin/semodule -i ${BUILD}/$(add_selinux_module.module).pp

/bin/rm ${BUILD}/$(add_selinux_module.module).mod ${BUILD}/$(add_selinux_module.module).pp

Note the two kinds of disambiguating brackets here: {curly} to indicate shell variables, and (round) to indicate Cf3 variables.

As noted in the bundle comment, the template might be overkill; I think it would be easy enough to have the rebuild script just take the name of the module as an argument. But it was a good excuse to get familiar with Cf3 templates.
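
For what it's worth, the take-the-module-name-as-an-argument version might look roughly like this. It's a sketch I haven't wired into the bundle: the commands and paths are the ones from the template, but the script layout, the rebuild_module and run function names, and the DRY_RUN switch are all assumptions added for illustration.

```shell
#!/bin/sh
# Sketch only: rebuild an SELinux module named on the command line,
# instead of generating one script per module from a template.
# rebuild_module, run and DRY_RUN are hypothetical additions.

# Where to store selinux related files (same as the template)
SOURCE=/etc/selinux/local
BUILD=/etc/selinux/local

# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_module() {
    module=$1
    if [ -z "$module" ]; then
        echo "usage: rebuild_module <module-name>" >&2
        return 1
    fi
    # Compile, package, load, then clean up; stop at the first failure.
    run /usr/bin/checkmodule -M -m -o "$BUILD/$module.mod" "$SOURCE/$module.te" &&
    run /usr/bin/semodule_package -o "$BUILD/$module.pp" -m "$BUILD/$module.mod" &&
    run /usr/sbin/semodule -i "$BUILD/$module.pp" &&
    run /bin/rm "$BUILD/$module.mod" "$BUILD/$module.pp"
}

# Only act when a module name was actually passed on the command line.
if [ $# -gt 0 ]; then
    rebuild_module "$1"
fi
```

Running it with DRY_RUN=1 and a module name such as postfixpipe prints the four commands without touching anything, which is handy for checking paths before letting Cf3 call it.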

I've been using this bundle a lot in the last few days as I prep a new mail server, which will be running under SELinux, and it works well. Actually creating the module source file is something I'll put in another post. Also, at some point I should probably put this up on Github FWIW. (SANS had their stuff in the public domain, so I'll probably do BSD or some such... in the meantime, please use this if it's helpful to you.)

UPDATE: It's available on Github and my own server; released under the MIT license. Share and enjoy!

Tags: cfengine, selinux.
A sub for Cf3
28th November 2012

When sub was released by 37signals, I liked it a lot. Over the last couple of months I've been putting together a sub for Cfengine. Now it's up on Github, and of course my own repo. It's not pretty, but there are some pretty handy things in there. Enjoy!

Tags: cfengine.
Standalone bundles in Cf3
3rd December 2012

I always seem to forget how to do this, but it's actually pretty simple. Assume you want to test a new bundle called "test", and it's in a file called "test.cf". First, make sure your file has a control stanza like this:

body common control {
  inputs => { "/var/cfengine/inputs/cfengine_stdlib.cf" } ;
  bundlesequence => { "test" } ;
}

Note:

Second, invoke it like so:

sudo /var/cfengine/bin/cf-agent -KI -f /path/to/test.cf

Note:

Tags: cfengine.
Trying to make things easier
3rd January 2013

First day back at $WORK after the winter break yesterday, and some...interesting...things. Like finding out about the service that didn't come back after a power outage three weeks ago. Fuck. Add the check to Nagios, bring it up; when the light turns green, the trap is clean.

Or when I got a page about a service that I recognized as having, somehow, to do with a webapp we monitor, but no real recollection of what it does or why it's important. Go talk to my boss, find out he's restarted it and it'll be up in a minute, get the 25-word version of what it does, add him to the contact list for that service and add the info to documentation.

I start to think about how to include a link to documentation in Nagios alerts, and a quick search turns up "Default monitoring alerts are awful", a blog post by Jeff Goldschrafe about just this. His approach looks damned cool, and I'm hoping he'll share how he does this. Inna meantime, there's the Nagios config options "notes", "notes_url" and "action_url", which I didn't know about. I'll start adding stuff to the Nagios config. (Which really makes me wish I had a way of generating Nagios config...sigh. Maybe NConf?)
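
A poor man's config generator for this could be as small as a shell function. What follows is a sketch only: nagios_service_stanza is a hypothetical helper, the wiki URL in the usage note is made up, and "local-service" is just the template name from my earlier examples.

```shell
#!/bin/sh
# Sketch only: print a Nagios service stanza that carries a documentation
# link in notes_url. nagios_service_stanza is a hypothetical helper name.

nagios_service_stanza() {
    # $1 = host_name, $2 = service_description,
    # $3 = check_command, $4 = notes_url
    cat <<EOF
define service{
    use                             local-service
    host_name                       $1
    service_description             $2
    check_command                   $3
    notes_url                       $4
}
EOF
}
```

Something like nagios_service_stanza diskarray-mon "diskarray-mon website" "check_http" "https://wiki.example.com/diskarray-mon" would print a stanza that can be appended to a services file.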

But also on Jeff's blog I found a post about Kaboli, which lets you interact with Nagios/Icinga through email. That's cool. Repo here.

Planning. I want to do something better with planning. I've got RT to catch problems as they emerge, and track them to completion. Combined with orgmode, it's pretty good at giving me a handy reference for what I'm working on (RT #666) and having the whole history available. What it's not good at is big-picture planning...everything is just a big list of stuff to do, not sorted by priority or labelled by project, and it's a big intimidating mess. I heard about Kanban when I was at LISA this year, and I want to give it a try...not sure if it's exactly right, but it seems close.

And then I came across Behaviour-driven infrastructure through Cucumber, a blog post from Lindsay Holmwood. Which is damn cool, and about which I'll write more another time. Which led to the Github repo for a cucumber/nagios plugin, and reading more about Cucumber, and behaviour-driven development versus test-driven development (hint: they're almost exactly the same thing).

My god, it's full of stars.

Tags: cfengine, documentation, lisa, nagios, programming, sysadmin, testing.
