The Life of a Sysadmin

Carousel is a lie!

Entries tagged "cfengine".

1 != 2
2005-08-26 12:00:36

I love cfengine. If you haven't checked it out yet, do so. You can do really neat stuff like this:

editfiles:
        { /etc/Xprint/C/print/attributes/document
                BeginGroupIfNoLineMatching "^\*default-printer-resolution: 300"
                        CommentLinesMatching "^\*default-printer-resolution: 600"
                        LocateLineMatching "^# \*default-printer-resolution: 600"
                        InsertLine "*default-printer-resolution: 300"
                        DefineInGroup restart_xprint
                EndGroup
        }

shellcommands:
        debian.restart_xprint::
                "/etc/init.d/xprint restart"

(Which, by the way, totally fixes the problem of Debian printing 'way huge stuff. Bug number 262958. You should totally look it up.) Look at that. It's lovely. It's obvious what it's looking for, what it'll do if it can't find it, and what'll happen after that. And it does it automagically. At night. From cron. The way God intended all system administration to be done. However -- and I cannot emphasize enough how important it is to keep this in mind -- it is absolutely NFG reading the documentation for an hour trying to figure out why the DefineInGroup statement just does not work if:

  1. you're reading the docs for cfengine v2, and
  2. you're working with cfengine v1.

It's my own fault for printing out v2 docs and not thinking much about it. However, in my own defense it would be nice if cfengine would complain about something it appears not to recognize. Not even with -d2 (which produces output along the lines of CheckingDateForSolarEclipseToday [no]) did it whisper a word about this.

Tags: cfengine.
cfengine classes and shellcommands
2005-10-06 18:16:56

cfengine is great, it really is. But there are some things that tripped me up. Often you want to set up a daemon to run The Right Way, which involves changing its config file. After that, of course, you want to restart it. What to do? The naive way (ie, the first way I tried) of doing things is:

control:
        sequence ( editfiles shellcommands )

editfiles:
        debian::
                { /etc/foo.conf
                        BeginGroupIfNoLineMatching "bar"
                                AddLine "bar"
                                Define restart_foo
                        EndGroup
                }

        freebsd::
                { /usr/local/etc/foo.conf
                        BeginGroupIfNoLineMatching "bar"
                                AddLine "bar"
                                Define restart_foo
                        EndGroup
                }

shellcommands:
        debian.restart_foo::
                "/etc/init.d/foo restart"

        freebsd.restart_foo::
                "/usr/local/etc/rc.d/foo restart"

However, the correct way of doing this is:

control:
        sequence = ( editfiles shellcommands )
        AddInstallable = ( restart_foo )

editfiles:
        debian::
                { /etc/foo.conf
                        BeginGroupIfNoLineMatching "bar"
                                AddLine "bar"
                                DefineInGroup "restart_foo"
                        EndGroup
                }

        freebsd::
                { /usr/local/etc/foo.conf
                        BeginGroupIfNoLineMatching "bar"
                                AddLine "bar"
                                DefineInGroup "restart_foo"
                        EndGroup
                }

shellcommands:
        debian.restart_foo::
                "/etc/init.d/foo restart"

        freebsd.restart_foo::
                "/usr/local/etc/rc.d/foo restart"

Without both the enumeration of all your made-up classes in AddInstallable and the enclosing of that class in quotes, cfengine will fail to do what you want -- and will do so quietly and with no clue about why. God, that took me a long time to find.

Tags: cfengine.
We're An American Band
2006-03-13 20:27:33

More fallout today from Saturday's power outage: two workstations that failed to boot up (BIOS checksum error for one of 'em, which is a new one for me), some NIS-related services that didn't get started properly (not sure what's going on there), and so on. Plus the return of the where-are-those-seven-machines? that didn't get done on Friday because of all of this.

But I did learn some stuff about Cfengine. For example, if you have something like:

my_url = ( http://www.example.com/foo/bar )

then you'd better precede it with:

split = ( "+" )

or some other character that isn't used. The colon is treated as a list separator by default, which means that later on, when you try and do something like:

shellcommands:
    linux.need_some_file::
        "/bin/wget $(my_url)/baz"

what it'll actually do is this:

/bin/wget http/baz
/bin/wget //www.example.com/foo/bar/baz

'cos it's iterating over the two items in the list, see?
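To make that concrete, here is a rough Python sketch of the behaviour described above. It is not how cfengine is actually implemented; `expand_command` and its `split_char` parameter are made-up names for illustration. The only thing it assumes is what the paragraph above says: the variable's value gets cut up on the separator character (a colon by default), and the command is run once per resulting item.

```python
def expand_command(template, name, value, split_char=":"):
    """Return one command per list item, after splitting value on split_char.

    Hypothetical helper that only mimics the iteration described in the
    text; it is not taken from cfengine itself.
    """
    items = value.split(split_char)
    placeholder = "$(" + name + ")"
    return [template.replace(placeholder, item) for item in items]


url = "http://www.example.com/foo/bar"
template = "/bin/wget $(my_url)/baz"

# Default separator: the colon inside the URL cuts it into two items.
default = expand_command(template, "my_url", url)

# With an unused separator such as "+", the URL stays in one piece.
fixed = expand_command(template, "my_url", url, split_char="+")

for command in default + fixed:
    print(command)
```

The first two lines printed are the broken wget commands shown above; the third is the single command you actually wanted.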

And SuSE's dhcp client, by default (I think), will change /etc/yp.conf without telling you, and then on exit put back the old version (saved conveniently at /etc/yp.conf.sv). It took me a long time to figure out that this was happening, and it pissed me off mightily. /etc/resolv.conf is filled with comments when the dhcp client modifies it -- hell, they even throw in the PID. So why not do that with yp.conf? At least you can turn it off by changing DHCLIENT_MODIFY_NIS_CONF in /etc/sysconfig/network/dhcp.

Tags: cfengine.
Little Green Bag
2006-06-01 20:01:15

Some days are fun days. I got this error on a Debian workstation when starting X:

Xlib: Connection to ":0.0" refused by server
Xlib: Protocol not supported by server.
Xrdb: Can't open display ':0'

Turns out that an .xsession file, with one commented-out line, caused that. Remove the line (so now it's empty) and everything works.

Next we got the same user, who's had his home directory moved around on the machine. Machines mounting his home dir via amd (FreeBSD, Debian) work fine, but the SuSE machines running autofs fail miserably with "permission denied" and the ever-popular:

$ cd
-bash: cd: /home/foo: Unknown error 521

Which, if you look up /usr/include/linux/errno.h -- which, you know, is the logical thing to do -- you see this:

/* Defined for the NFSv3 protocol */
#define EBADHANDLE      521     /* Illegal NFS file handle */

Another weird thing with AutoFS: I was running cfengine on a machine, and it hung when querying which RPMs were installed. strace on the rpm command shows it's trying to lock a file and failing; looking at /proc/number/fd shows that, yep, it's trying and failing to lock /var/lib/rpm/Packages, the Berkeley DB file that knows all and sees all. So lsof to see who's holding it open, and that hangs; strace shows it's hanging trying to access the home directory of a user whose machine is down right now for reinstall. Try to unmount that directory and it fails. So I bring up the machine with the user's home directory, which allows me to unmount his home directory on the SuSE machine, which allows cfengine to run rpm, which succeeds in locking the Berkeley DB file. Strange; possibly similar to this problem.

On top of everything else, someone asked me if I could be a "network prime". I think they mean "person we can talk to with authority to make network changes", or possibly "network contact". Not entirely sure.

But on the other hand: figured out how to run wpkg, package manager for Windows of the elder gods, as a service using Cygwin's cygrunsrv. The instructions are on the wiki for your viewing enjoyment.

Tags: amd, cfengine, windows.
Choose It
2006-12-04 17:39:03
Two sips from the cup of human kindness, and I'm shit-faced
Just laid to waste
If there's a choice between chance and flight, Choose it tonight.
"Choose It", The New Pornographers

Just got back from a whirlwind walk from the Lincoln Memorial to the Washington Monument to the White House. Beautiful, all of it...though a) the White House is small and b) there was something being filmed/videotaped in the courtyard, which made me think of Vancouver.

Training again. AFrisch was good, covering Cfengine quite well; would've liked to see more info about expect. (Apparently there are Perl/Python bindings...I had no idea.) Afternoon course was "Interviewing For System Administrators" by Adam Moskowitz and that was great -- lots of things I didn't know, lots of tips on doing it better next time.

Saw Tom Limoncelli in the hall during a break. Managed to restrain myself. I have the reputation for quiet restraint of a nation to uphold.

Very tired now. Time to go get beer.

Tags: cfengine, lisa.
Electric Version
2006-12-07 17:28:24
Sound of tires, sound of God...
"Electric Version", The New Pornographers.

Thursday morning came far too early. My roommate offered some of his 800mg ibuprofens, and I accepted. First thing I attended was the presentation "Drowning in the Data Tsunami" by Lee Damon and Evan Marcus. It was interesting, but seemed to be mostly about US data regulations (HIPAA/SOX et al.) and wasn't really relevant to me. I had been expecting more of an outline of, say, how in God's name we're going to preserve information for, say, a hundred years (heroic efforts of the Internet Archive notwithstanding). There was mention of an interesting approach to simply not accumulating cruft as you upgrade storage (because it's easier than sorting through to see what can be discarded; "Why bother weeding out 200MB when the new disk is 800GB?"): a paper by Radia Perlman (she of spanning-tree fame) that proposes an encrypted data storage system combined with key escrow that, to expire data, simply deletes the key when the time is up. Still, I moved on before too long.

...Which was good, because I sat in on Alva Couch's presentation on his and Mark Burgess' paper, "Modelling Next-Generation Configuration Management Tools". Some very, very confusing stuff about aspects, promises and closures -- confusing because the bastard didn't preface his talk with "This is what Hugh from Vancouver will need to know to understand this." (May be in the published paper; will check later.) Here's what I could gather:

I will do the right thing and read his paper, and I may update this later; these are just my notes and impressions, and aren't gospel. Couch is an incredibly enthusiastic speaker, and even though I didn't understand a lot of it I ended up excited anyway. :-) He gave another talk later in the week that Ricky went to, about how system administration will have to become more automatic; as a result, we'd all better learn how to think high-level and to be better communicators, because more and more of our stuff will be management -- and not just in the sense of managing computers. I'm going to seek out more of his stuff and see if it'll fit in my head.

After the break was a talk on "QA and the System Administrator", presented by a Google sysadmin. I went because it was Google, and frankly it wasn't that interesting. One thing that did jump out at me was when he described a Windows tool called Eggplant, a QA/validation tool. It has OCR built-in to recognize a menu, no matter where it is on the screen. This astounded me; when you start needing OCR to script things, that's broken. I don't doubt that it's a good tool, and I can think of lots of ways that would come in handy. But come on. I mean, a system that requires that is just so ugly.

I went out to lunch with Jay, a sysadmin from a shop that's just got permission from the boss to BSD a unit-testing program they've come up with for OpenBSD firewalls: it uses QEMU instances to fully test a firewall with production IP addresses, making sure that you're blocking and allowing everything you want. It sounds incredibly cool, and he's promised to send me a copy when he gets back. I can't wait to have a look at it.

After that was the meet-the-author session. I got to thank Tom Limoncelli for "Time Management for System Administrators", and got an autograph sticker from him and Strata Rose Chalup, his co-author for Ed 2. Sadly, I didn't get a chance to thank Tobias Oetiker (who I nearly ran into at lunch the day before).

Next up was the talk from Tom Limoncelli and Adam Moskowitz (Adam's looking for a job! Somebody hire him!) about how to get your paper accepted at LISA. Probably basic stuff if you've written a paper before, but I haven't so it was good to know. Things like how to write a good abstract, what kind of paper is good for LISA, and how you shouldn't say things like "...and if our paper is accepted, we'll start work right away on the solution." Jay asked whether a paper on the pf testing tool would be good, and they both nodded enthusiastically.

Must Google:

Quotes from the talk:

At this point I started getting fairly depressed. Part of it was just being tired, but I kept thinking that not only could I not think of something to write a paper about, I could not think of how I'd get to find something to write about. I wandered over to the next talk feeling rather sad and lost.

The next talk was from Andy Seely on being a sysadmin in US Armed Forces Command and Control. Jessica was there, and we chatted a bit about how this talk conflicted with Tom Limoncelli's Time Management Guru session, and maybe ducking over to see that. Then Andy came over and asked Jessica to snap some picture, so she ended up staying. I was prepared to give it five minutes before deciding whether or not to leave.

Well, brother, let me tell you: Andy Seely is one of the best goddamned speakers on the planet. He was funny, engaging, and I could no more leave the room than I could get my jaw to undrop. Not only that, his talk was fascinating, and not just because he's a sysadmin for the US Armed Forces while simultaneously having a ponytail, earrings and tattoos. You can read the article in ;login: (FIXME: Add link) that it was based on, but he expanded on it considerably. Let me see what I can recall:

Longer story: Because of the nature of his work, he's got boxes that he has to keep working when he knows next to nothing about what they're meant to do. Case in point: a new Sun box arrives ("and it's literally painted black!"), but the person responsible for it wants to send it back because it doesn't work -- which means that when they click the icon to start the app it's meant to run, it doesn't launch and there's no visible sign that it's running. There's no documentation. And yet he's obligated to support this application. What do you do?

Even tracking down the path to the program launched by the icon is a challenge, but he does, tracks down the nested shell scripts and finally finds the jar that is the app ("Aha! It is Java!"). He finds log files which are verbose but useless. He contacts the company that wrote it, and is told he needs a support contract...which the government, when putting together the contract for the thing, did not think to include. So he calls back an hour later, talks to the help desk and tells them he's lost the number -- "Can you help a brother out?" They do, but they're stumped as well, and say they've never seen anything like this.

Time to pull out truss, which produces a huge amount of output. Somewhere in the middle of all that he notices a failing hard read of a file in /bin: it was trying to read 6 bytes and failing. Turns out the damned thing was trying to keep state in /bin, and failing because the file was zero bytes long. He removed the file, and suddenly the app works.

Andy also talked about trying to get a multiple GB dump file from Florida to Qatar. Physical transport was not an option, because arranging it would take too long. So he tries FTPing the file -- which works until he goes home for the day, at which point the network connection goes down and he loses a day. So he writes a Perl script that divides the file into 300MB chunks, then sends those one at a time. It works!

At this point, someone yells out "What about split?" Andy says, "What?" He hadn't known about it. There was a lot of good-natured laughter. He asked, "Is there an unsplit?" "Cat!" came the response from all over the room. He smacked his forehead and laughed. "This is why I come to LISA," he said. "At my job, I've been there 10 years. People come to me 'cos I'm the smart one. Here, I'm the dumb one. I love that."
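For the curious, here is roughly what the chunk-and-reassemble approach looks like. This is a sketch in Python rather than Andy's Perl script (which I haven't seen); the names `split_file` and `join_files` and the `.partNNNN` suffix are made up for illustration. In real life you'd just use split and cat, which is the whole point of the story.

```python
import os
import tempfile


def split_file(path, chunk_size):
    """Write path out as numbered chunk files; return their paths in order."""
    chunk_paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            chunk_path = "%s.part%04d" % (path, index)
            with open(chunk_path, "wb") as dst:
                dst.write(data)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths


def join_files(chunk_paths, out_path):
    """Concatenate the chunks, in order, into out_path (the job cat does)."""
    with open(out_path, "wb") as dst:
        for chunk_path in chunk_paths:
            with open(chunk_path, "rb") as src:
                dst.write(src.read())


# Small demonstration: 1000 random bytes in 300-byte chunks.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "dump.bin")
payload = os.urandom(1000)
with open(original, "wb") as f:
    f.write(payload)

parts = split_file(original, 300)
restored = os.path.join(workdir, "restored.bin")
join_files(parts, restored)

with open(restored, "rb") as f:
    restored_payload = f.read()
print(len(parts), restored_payload == payload)
```

Each chunk is read into memory whole, which is fine for a demo but something to think about at 300MB a piece. The payoff of chunking is that a dropped connection only costs you the chunk in flight, not the whole transfer.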

There are two things I would like to say at this point.

First off, Andy is at least the tenth coolest person on the entire Eastern seaboard. No, he didn't know about cat -- but not only did he reimplement it in Perl rather than give up, he didn't even flinch when being told about it in the middle of giving a talk at LISA. I would probably have self-combusted from embarrassment ("foomp!"), and I would have felt awful. Andy's attitude? "I learned something." That's incredibly strong. (Although he told a story later about being in the elevator with some Google people. They recognized him and said, "Hey, it's the 'man cat' guy!")

Second, when he said, "Here, I'm the dumb one. I love that" I sat up straight and thought, "Holy shit, he's right." Here I am at LISA for the first time ever. I've met people who can help me, and people I can help. I've made a crapload of new friends and have learned more in one week than I would've thought possible. And I'm worried 'cos it might be a few years before I can think about presenting a paper? That's messed up. I tend to set unreasonably high goals for myself and then get depressed when I can't reach them. Andy's statement made me feel a whole lot better.

During Q & A I asked what he did for peer support, since his ability to (say) post to a mailing list asking for help must be pretty restricted. He said that he's started a wiki for internal use and it's getting used...but both the culture and the job function mean that it's slow going. He's also started a conference for fellow sysadmins: 100 or so this year, and he's hoping for more next year.

In conclusion: if you ever get the chance to go see him, do so. And then buy him a beer.

Tags: cfengine, lisa.
Presentation(s), conference, nagios exchange, Project U-13, Project U-14
Sat Sep 29 19:19:39 PDT 2007

I've had a bunch of ideas lately. I'm inflicting them on you.

The presentation went well...I didn't get too nervous, or run too long, or start screaming at people (damn Induced Tourette's Syndrome) or anything. There were maybe 30 or so people there, and a bunch of them had questions at the end too. Nice! I was embiggened enough by the whole experience that, when the local LUG announced that they were having a newbie's night and asked for presenters to explain stuff, I volunteered. It's coming up in a few weeks; we'll see what happens.

And then I thought some more. A few days before I'd been listening to the almost-latest episode of LugRadio (nice new design!), where they were talking about GUADEC and PyCon UK. PyCon was especially interesting to hear about; the organizers had thought "Wouldn't it be cool to have a Python conference here in the UK?", so they made one.

So I thought, "It's a shame I'm not going to be able to go to LISA this year. Why don't we have our own conference here in Vancouver?" The more I thought about it, the better the idea seemed. We could have it at UBC in the summer, where I'm pretty sure there are cheap venues to be had. Start out modest — say, a day long the first time around. We could have, say, a training track and a papers track. I'm going to talk about this to some folks and see what they think.

Memo to myself: still on my list of stuff to do is to join pool.ntp.org. Do it, monkey boy!

Another idea I had: a while back I exchanged secondary DNS service, c/o ns2exchange.com. It's working pretty well so far, but I'm not monitoring it so it's hard for me to be sure that I can get rid of the other DNS servers I've got. (Everydns.net is fine, but they don't do TXT or IPv6 records.) I'm in the process of setting up Nagios to watch my own server, but of course that doesn't tell me what things look like from the outside.

So it hit me: what about Nagios exchange? I'll watch your services if you watch mine. You wouldn't want your business depending on me, of course, but this'd be fine for the slightly anal sysadmin looking to monitor his home machines. :-) The comment link's at the end of the article; let me know if you're interested, or if you think it's a good/bad/weird idea.

The presentation also made me think about how this job has been, in many ways, a lot like the last job: implementing a lot of Things That Really Should Be Done (I hate to say "Best Practices") in a small shop. Time is tight and there's a lot to do, so I've been slowly making my way through the list:

Some of these things have been held up by my trying to remember what I did the last time. And then there's just getting up to speed on bootstrapping a Cfengine installation (say).

So what if all these things were available in one easy package? Not an appliance, since we're sysadmins — but integrated nicely into one machine, easily broken up if needed, and ready to go? Furthermore, what if that tool was a Linux distro, with all its attendant tools and security? What if that tool was easily regenerated, and itself served as a nicely annotated set of files to get the newbie up and running?

Between FAI (because if it's not Debian, you're working too hard) and cfengine, it should be easy to make a machine look like this. Have it work on a live ISO, with installation afterward with saved customizations from when you were playing around with it.

Have it be a godsend for the newbie, a timesaver for the experienced, and a lifeline for those struggling in rapidly expanding shops. Make this the distro I'd want to take to the next job like this.

I'm tentatively calling this Project U-13. We'll see how it goes.

Oh, and over here we've got Project U-14. So, you know, I've got lots of spare time.

Tags: cfengine, conferenceorganization, dns, geekdad, monitoring, ntp, projectu13.
Stay on target...
Fri Dec 21 15:23:09 PST 2007

Holy crap, it's been a while since I last wrote here. Mainly that's because I've been working on web stuff at work and have felt very little like a sysadmin of late. Thankfully we've got a webmaster hired, and to some extent the work'll be shifted to him in the new year. Of course, that still leaves the redesign of the website and its back end…that's not done 'til it's done.

This week, though, has been slow, and I've been catching up a little on sysadmin work. Part of it was setting up a devel server for the webmaster, and detailing what I was doing in Cfengine as I went along. It was gratifying to get LDAP working (I haven't done that on a Linux machine before; shame on me), and irritating when I realized that I couldn't mount the home directories from the server because I hadn't restarted nscd on the server.

The last two days were spent trying to get encrypted Bacula working between here and $other_university. This was an enormous pain in the ass for two reasons:

  1. The Right Way (tm) of doing it is by using TLS, which is what the kids are calling SSL these days, and I have never fully grokked SSL, or the openssl command. I know that there's encryption going on; I know that there are certificates signed by CAs; I know that there's a lot of negotiating of different options. But start throwing in x509 versus PEM, Diffie-Hellman parameters and the single most cryptic set of error messages I've ever come across, and I just feel thick. I was reduced to looking at tcpdump output of the negotiation to figure out what was going on, and I couldn't; the Bacula FD client complained that the Bacula Director wasn't producing a certificate, and that was all I knew. The otherwise incredibly excellent docs from Bacula were a trifle thin on all of this, and I couldn't find out much about my situation (going the self-CA route).

  2. So okay, fuckit, right? That's why God invented OpenSSH. So whee, start tunnelling port 9102 over SSH so the Director can contact the FD at $other_university, and 9103 back so the FD can contact the Storage Daemon. Only it turns out (my bad for not knowing this before) that not only does the client want to contact the SD, so does the director. Thus, my plan to tunnel to the firewall at the other end and tell the client that it could find the Storage Daemon there didn't work, 'cos the director wanted to contact it there too. (I did briefly try allowing the director to contact the tunnel at the other end: so even though the Storage was working on the same machine as the director, for that one job the Director's connection to it was going to the remote end and getting tunnelled back over SSH. But:

    1. that's horrible, and
    2. I was afraid that when it came time to restore, the Director would figure that it had to contact the Storage Daemon remotely again, complicating an already complicated setup.)

And why was I trying to connect to the remote firewall via SSH, rather than the client I'm trying to back up itself? Because that client is a Solaris machine authenticating against LDAP, and that turns out to bork key-based logins over SSH. What a crock.

Oh well. I did add three other machines here to Bacula this week, so that's good.

Project U-13 is coming along. I'm pretty close to a 0.0.2 release (woot), which should have the following working:

And by "working" I mean "installed". But I've got a decent setup on my laptop for building and testing it, which means I get up to a couple hours a day to work on it (New Westminster -> UBC == long). Thanks to Andy, he of the amazing speaking skills, for kicking my ass into action.

I'm learning a bit more about Mercurial in the process. After coming from CVS and Subversion, it seems really weird to me that the usual way of branching is "Go ahead, clone another repo! We're Mercurial! We don't care! Repos for everyone!" But if you figure on distributed development -- something more Linux-y than a controlled work environment -- then it makes sense. Not that I think I'll have lots of people working on this thing, but it makes sense that if someone were to take this for their own ends, they wouldn't want to bother copying all the branches…just the one(s) they're interested in.

Last word to my son:

Q: What does a Camel say, Arlo? A: Purhl!

2 comments. Tags: cfengine, projectu13.
Coming up
Fri Jan 18 06:07:07 PST 2008

My laptop hard drive started giving scary errors a couple days ago on the way to work (I've got a 90-minute commute by public transit [uck] so I fill the time by reading, listening to podcasts, or working on Project U-13). Fortunately, working at a university means that there are two computer stores on campus. I ran out at lunch, picked up a 100GB drive, and had things back to normal by the next morning.

Well, normal modulo one false start with Debian; I decided to try encrypted filesystems just for fun. But then I suspended, came back with a newer kernel, and it could not read the encrypted LVM group anymore. Whoops.

Still lots of free space on this thing, and I'm thinking of installing Ubuntu, FreeBSD and maybe NetBSD just for fun. Of course, I've got to do it all via PXE since this thing doesn't have any CDROM drive, but that just adds to the geek points.

Project U-13 is coming up on 0.0.3, btw; Andy suggested adding RackMonkey, which looks quite cool. There's no package for it, so I'm having to do some rather ugly scripted installation…but I can stand it for now. And I've got the barest skeleton of a cfengine file in there too. Watch the skies!

Tags: bsd, cfengine, hardware, projectu13.
Project U-13, 0.0.3
Wed Jan 23 05:58:00 PST 2008

Version 0.0.3 of Project U-13, a distro for sysadmins, has been released!

The main change is the addition of RackMonkey, which its website describes as "a web-based tool for managing racks of equipment such as web servers, video encoders, routers and storage devices", at the suggestion of Andy Seely. Also, Lynx has been installed, and there's also the skeletal beginnings of a Cfengine config file.

The ISO has been signed with my public key. Share and enjoy, and comments on a postcard, please.

Tags: cfengine, projectu13.
cfengine: Received signal 2 (SIGKILL) while doing pre-lock-state
Wed Jul 30 11:29:58 PDT 2008

Ran into a problem today when adding this stanza to cfengine on a Debian Etch machine:

editfiles:
        { /etc/aliases
                AppendIfNoSuchLine "root: sysadmin@pims.math.ca"
                DefineClasses "rebuild_aliases:restart_postfix"
        }

The cfengine reference file I've got, which sez it's for version 2.2.1, says you can define multiple classes in DefineClasses (or DefineInGroup), as long as they're separated by commas, spaces or dots. (The version in Etch is 2.2.20.)

However, when I ran cfagent, it just hung immediately after performing the edit, and gave this error when I ctrl-c'd it:

cfengine: Received signal 2 (SIGKILL) while doing [pre-lock-state]

Running cfengine with -d2 showed endless repetitions of AddClassToHeap() at this point, so either there's something wrong with my syntax or there's a bug in cfengine. (I'm guessing the former: the colon I used between the class names isn't one of the separators the reference lists.) Searching for pre-lock-state and cfengine only turned up cases where the clients were syncing with the master; thus this note.

The fix was to just make it one class:

                DefineClasses "rebuild_aliases"

Asking to restart Postfix was probably a bit of overkill anyhow...

Tags: cfengine.
This is The Working Hour; we are paid by those who learn by our mistakes
Tue Nov 18 20:21:26 PST 2008

I'm in the process of setting up a bunch of new servers for $job_2. All but one are CentOS 5.2, kickstart installed and managed with cfengine. This is the third time I've gone through a cfengine setup, and it always feels like starting from scratch each time. It seems -- and I'm not at all sure this is fair or accurate -- that each time I set up one of these systems, there's a lot that I've lost from the last time and have to relearn. I'm fortunate this time that I can refer to $job_1's setup to see how I did things last time, but if I didn't have that I'd be significantly further behind than I am.

I'm not sure what the solution is. Part of me thinks I should just be more aggressive about taking notes, or committing stuff to a private repository, or writing it down here more; part of me thinks that this might be a clue that cfengine is too low-level for my head. It feels like when I was trying to learn C, and couldn't believe that I had to remember all this stuff just to print something, or read a file, or connect to another machine over the Internet. By contrast, Perl (or any other scripting language) was such a relief…just print, or open, or use the Net::Telnet module, or whatever. The details are there and they are important, sometimes very much so; that doesn't mean I want to learn more metallurgy every time I need a fork. (No, I don't think that metaphor's tortured; why do you ask?)

Another thing is that I'm trying to get multipath connections working for the first time. We've got two database servers, each of which is connected via dual SAS HBAs to outboard disk arrays. (I don't think anyone else calls them "outboard", but I like the sound of it. See this hard drive? It's outboard, baby!) The arrays are from Sun and come with drivers, but the documentation is confusing: it says it's available for RHEL 5 (aka CentOS 5), but the actual download says it's only for RHEL 4.

As a temporary respite, I'm trying to see if I can get these working using Linux's own multipath daemon, and it's also confusing. The documentation for it is tough to track down, and I just don't understand the different device names: am I meant to put /dev/dm-2 in fstab, or /dev/mpath/mpath2p1? If the latter, why does the name sometimes change to the WWID (/dev/mpath/$(cat /dev/random)) when I restart multipathd? (user_friendly_names is uncommented in the config file.) If the whole point of multipath is failover, why does this sequence:

(where /mnt is where I've got this array mounted, obvs) sometimes work, and sometimes end with "I/O error" being logged, and the filesystem being read-only? Is this the sort of thing that the Sun driver will fix? I can't find anything about this.

And I mentioned electrical problems. When we got our servers installed, the Sun guys told us they'd tripped breakers on the PDU and/or breakers in the room's electrical cabinet. Since it had a sign on it saying "100A", I figured we might be running up against power limits, either in the room as a whole, if my figures were 'way out, or on individual PDUs. Turns out I was probably wrong: I missed the bit on the sign that said 3-phase, which means (deep breath) we probably have 3 x 100A power available (I think).
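To put a rough number on "3 x 100A": the sketch below assumes a balanced three-phase feed at 120V line-to-neutral. That voltage is an assumption, not something the sign said, and three_phase_va is just a made-up helper name.

```shell
# three_phase_va VOLTS_LINE_TO_NEUTRAL AMPS_PER_PHASE
# Apparent power of a balanced three-phase feed: 3 * V_ln * I, in VA.
three_phase_va() {
    echo $(( 3 * $1 * $2 ))
}

# Assumed figures: 120 V line-to-neutral, 100 A per phase.
three_phase_va 120 100    # prints 36000 (VA, i.e. 36 kVA)
```

That is the ceiling for the whole feed under those assumptions; what any single PDU or circuit can draw is a separate question.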

It's more complicated than that, because some of it is in 120V, some of it is in twist-lock 220V 30A circuits, and so on. But I should've checked before emailing the faculty member who, in a year or two, will be going into this room (we're there as guests of the department) and happens to sit on the facilities committee. He had asked how we were doing, so I sent him an email — nice, polite, and including a bit about how grateful we were for the room and the help of the local sysadmins (all of which is true).

I was under the impression that he was asking for info now, so that he could bring it up for action in a few months when we were out. Instead, two hours later when I'm swearing at multipath, in come the facilities manager and one of the sysadmins I was dealing with, looking to find out just how much power we were using anyhow. I apologized profusely, and they were very cool about it. But when the committee guy asks questions, people jump. I had not anticipated this. Welcome to University Politics 101. I emailed again and explained my mistake.

There are lots of remedial courses I could take. However, today I would most like to take "Electricity and wiring for sysadmins".

And on another note: Ack! My laptop's home partition is 93% full! How the hell did that happen?

And again: How did I not know about apt-file? This is perfect!

(Touch o' the hat to Tears For Fears and Steve Kemp; I'm moving closer every day to switching to Chronicle.)

Tags: cfengine, hardware, linux, meta.
Long-term planning
Wed Jan 28 20:48:17 PST 2009

Another thing I'm trying to do at my new job is make/take more time for long-term planning. I've been dinged by mgt. for this in the past, and while it's not easy to hear I think there has been some validity to this. (My inclination is to concentrate hard on fixing the problems I'm faced with; giving up on something broken, even when doing so would make so much more sense and would free up resources to look for a replacement, just rankles and feels like…well, giving up.) Since the department I'm in is so new, it's even more important to pay attention to this.

Part of the problem is just recognizing that I need to make time. An hour a week to be isolated, and to (say) figure out what I'm going to need to do for the next month, is a habit I'm very consciously trying to adopt.

But another problem is how to keep track of all this. What I've done so far:

So where does that leave me? ATM, (paper planner Cycle) attempting some longer-term project tracking w/org-mode. I figure the TODO bits from org-mode will fit well with the planner, and the flexibility of Emacs and org-mode (different from paper…oh, how I wish I could grep paper) will work well for projects…the records for which should, ideally, be suitable for pasting into wiki-based documentation.

If anyone has any suggestions, please let me know. If I make it to LISA this year, I'll be looking for a BOF about this. (Or maybe I'll just tackle Tom Limoncelli to the ground and holler "I love you, man!" a la "Say Anything".)

Moving on:

And now it is time for bed.

Tags: career, cfengine, emacs, time.
Bacula, gossip, advice
Thu Jul 2 16:31:35 PDT 2009
This sounds like when I was at my previous employer and they asked if
I could develop a web-based system to take surveys.  I nearly said,
"yes" because, well, I know perl, I know CGI, and I could do it.
However, I was smart enough to say "no, but surveymonkey.com will do
it for cheap."  Best of all it was self-service and the HR person was
able to do it entirely without me.  If I had said I could write such a
program, it would have been days of back-and-forth changes which would
have driven me crazy.  Instead, she was happy to be empowered to do it
herself.  In fact, doing it herself without any help became a feather
in her cap.

The lesson I learned is that "can I do it?" includes "do I want to do
it?".  If I can do something but don't want to, the answer is, "No, I
don't know how" not "I know how but don't want to".  The first makes
you look like you know your limits.  The latter sounds like you are
just being difficult.
Tags: backups, cfengine, reading.
Bad Time Equals LDAP Failure
Wed Sep 9 12:28:27 PDT 2009

Just ran into an interesting problem: after replacing memory on a server, CentOS booting hung at "Starting system message bus..."

So what does dbus have to do with anything? This turned out to be an LDAP failure; dbus was trying to run as UID root, and since the LDAP server couldn't be contacted it hung. Why couldn't the LDAP server be contacted? The LDAP server logs only showed this:

[09/Sep/2009:12:04:32 -0700] conn=41492 op=-1 fd=112 closed - SSL peer cannot verify your certificate.

The CA cert I use was in place, and another machine had just rebooted w/o problems (all this is taken care of with cfengine, so they were identical in this respect). I could connect to the LDAP server on the right port without any problems.

I finally figured out what was going on when I ran:

openssl s_client -connect ldap.example.com:636 -CApath /path/to/cacert_directory

and saw:

Verify return code: 9 (certificate is not yet valid)

date said it was December 31, 2001. What the what now? ntpdate to set things correctly, then I got:

Verify return code: 0 (ok)

I figure the CMOS clock (or whatever the kids are calling it these days) got reset when we had to remove the CPU daughtercard to get at the memory underneath.
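A sanity check along these lines, run before anything that depends on TLS, would have pointed at the clock straight away. This is only a sketch: clock_is_sane and the cutoff value are hypothetical, and it assumes a date that supports +%s (GNU date does).

```shell
# clock_is_sane NOW FLOOR: succeed if epoch time NOW is not earlier than FLOOR.
# FLOOR stands in for "a moment we know is already in the past", for example
# the start of the CA certificate's validity period (hypothetical choice).
clock_is_sane() {
    now="$1"
    floor="$2"
    [ "$now" -ge "$floor" ]
}

# 2009-01-01 00:00:00 UTC as an illustrative floor.
floor=1230768000

if clock_is_sane "$(date +%s)" "$floor"; then
    echo "clock looks plausible"
else
    echo "clock is before the floor; fix the time before trusting TLS errors"
fi
```

A machine that thinks it is December 2001 fails this check, which is a lot quicker to read than "SSL peer cannot verify your certificate".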

And now you know...the rest of the story.

Tags: cfengine, ldap.
chkconfig woes
Mon Dec 14 15:40:57 PST 2009

Irritating: chkconfig on RHEL/CentOS returns non-zero if a service isn't configured for a runlevel. IOW, you can do:

chkconfig --level 3 foo

and have 0 returned if it's on, 1 if it's not.

But not SuSE; nope, it just returns 0 whether or not it's enabled, or even if the service itself doesn't exist. Because, you know, grep doesn't get used enough.

I'm doing this because I'm trying to use cfengine 2 to manage services. This works well in CentOS, where you can add something like:

service_foo_on = (ReturnsZero("/sbin/chkconfig --level 3 foo"))

and it'll work. ("service_foo_on" is a bit of a misnomer, because I'm checking runlevels, not whether it's actually running.)
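Where the exit status can't be trusted, the grep-flavoured workaround would be to parse the listing instead. A minimal sketch, assuming the RHEL-style "chkconfig --list foo" format (service name followed by 0:off ... 6:off fields); runlevel_state is a made-up name and the sample line is illustrative, not captured from a real machine:

```shell
# runlevel_state LEVEL: read one line of `chkconfig --list SERVICE` output
# on stdin and print "on" or "off" for the given runlevel.
# Exits non-zero if no field for that runlevel is found.
runlevel_state() {
    level="$1"
    awk -v lvl="$level" '
        {
            # Field 1 is the service name; the rest are LEVEL:STATE pairs.
            for (i = 2; i <= NF; i++) {
                split($i, kv, ":")
                if (kv[1] == lvl) { print kv[2]; found = 1 }
            }
        }
        END { exit found ? 0 : 1 }
    '
}

# Sample line in the assumed format (hypothetical service "foo").
sample='foo             0:off   1:off   2:on    3:on    4:on    5:on    6:off'

printf '%s\n' "$sample" | runlevel_state 3    # prints: on
printf '%s\n' "$sample" | runlevel_state 6    # prints: off
```

A wrapper that compares the printed state against "on" and exits accordingly could then be handed to ReturnsZero in place of the bare chkconfig call.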

Update: Nope, I'm wrong. chkconfig --check does exactly what I want. Many thanks to yaloki on #openSUSE-server for the help.

Tags: cfengine, opensuse, packagemanagement.
Xmas maintenance
Thu Dec 31 05:57:47 PST 2009

A nice thing about working at a university is that you get all this time off at Xmas, which is really nice; however, it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.

Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS' ATS, but discovered that NUT is different from APCUPSD in one important way: it doesn't easily allow you to say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.

It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesitant when I estimate how long something will take, because I feel like I have no way of knowing in advance (interruptions, unexpected obstacles...you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer that I needed.

One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. Upon closer inspection, they were reporting problems with half of the LUNs that the array was presenting to them -- and they were reporting them in different ways. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)

The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But:

  1. You have to rebuild the drivers after a kernel change.
  2. This only showed up on two servers because the third server had not upgraded its kernel (or indeed, any of its packages). Why? cfservd had refused its connection because I had the MaxConnections parameter too low.
  3. And of the two that did upgrade, the one machine we'd tested the Linux drivers on still had an old multipath.conf file in /etc, which, even though the multipathd service wasn't starting up, was enough to get drivers loaded. This took a while to figure out because I'd completely forgotten how to tell which driver was in use.

I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)

Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.

(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)

Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.

And best of 2010 to all of you!

Tags: centos, cfengine, monitoring, packagemanagement, serverroom, upgrades, work.