editfiles:
{ /etc/aliases
  AppendIfNoSuchLine "root: sysadmin@pims.math.ca"
  DefineClasses "rebuild_aliases:restart_postfix"
}
Today at $WORK I upgraded Cfengine on a server to 3.5.3. After that, I suddenly started seeing a lot of errors like this:
2014-06-05T13:52:47-0700 error: NetCopy to destination 'cfengine.example.com:/opt/sources/foo.tar.bz2.cfnew' security - failed attempt to exploit a race? (Not copied). (open: Permission denied)
2014-06-05T13:52:47-0700 error: /test/methods/'Copy /opt'/copy_opt_files/'/opt': Was not able to copy '/var/cfengine/files/ALL/opt/sources/foo.tar.bz2' to '/opt/sources/foo.tar.bz2'
Running in verbose mode gave a bit more info, but nothing helpful:
2014-06-05T13:44:12-0700 verbose: Destination file '/opt/sources/foo.tar.bz2' already exists
2014-06-05T13:44:12-0700 info: Cannot open file for hashing '/opt/sources/foo.tar.bz2'. (fopen: Permission denied)
2014-06-05T13:44:12-0700 verbose: Image file '/opt/sources/foo.tar.bz2' has a wrong digest/checksum, should be copy of '/var/cfengine/files/ALL/opt/sources/foo.tar.bz2'
2014-06-05T13:44:12-0700 error: NetCopy to destination 'cfengine.example.com:/opt/sources/foo.tar.bz2.cfnew' security - failed attempt to exploit a race? (Not copied). (open: Permission denied)
2014-06-05T13:44:12-0700 error: /test/methods/'Copy /opt'/copy_opt_files/'/opt': Was not able to copy '/var/cfengine/files/ALL/opt/sources/foo.tar.bz2' to '/opt/sources/foo.tar.bz2'
Wasn't SELinux, wasn't secret attributes...turned out that the new(er) version of Cf3 didn't like the fact that /opt was a symlink to /usr/opt. I'd set that up long ago and it was no longer needed, so I was free to just recreate it:
rm /opt          # remove the old symlink first
rm -rf /usr/opt
mkdir /opt
cf-agent -KI     # Which populates it as needed.
I've just upgraded to the latest version of Vagrant, which includes a plugin that lets you use Cfengine as a provisioner. It doesn't seem to be documented right now, so here's my first stab at laying out the options. Apologies for the rough notes.
am_policy_hub: From the source: "Policy hubs need to do additional things before they're ready to accept agents. Force that run now..." Runs "cf-agent -KI -f /var/cfengine/masterfiles/failsafe.cf [classes]", then "cf-agent -KI [classes] [extra_agent_args]".
extra_agent_args: Just what it says.
classes: Define extra classes; appends "-D [class]" args to cf-agent. Multiple classes must be separated by spaces. (Or is this a Ruby array?)
deb_repo_file, deb_repo_line: Specify a deb repo line, to be placed in deb_repo_file, before running "apt-get install [package_name]". deb_repo_file will be clobbered.
files_path: Copy the local path to /var/cfengine using the install_files method defined in cfengine/provisioner.rb. Example: you do a git checkout of your repo and want it copied to the machine.
force_bootstrap: Not sure; checked by cfengine/cap/linux/cfengine_needs_bootstrap.rb, but does not appear to do anything. FIXME: See where this module is called from.
install: "force" seems to be the only possible value, but it's not clear what it does. Doesn't seem to be mentioned anywhere else but in provisioner.rb.
mode: Possible values are :bootstrap (the default) and :single_run.
policy_server_address: Just what it says.
repo_gpg_key_url: Just what it says.
run_file: Single run if set? Uploads to the VM and runs "cf-agent -KI -f [file] [classes] [extra_agent_args]".
upload_path: Where to copy run_file. Default is /tmp/vagrant-cfengine-file.
yum_repo_file: Default is /etc/yum.repos.d/cfengine-community.repo. Probably clobbered.
yum_repo_url: Default is http://cfengine.com/pub/yum/.
package_name: For use by yum or apt. Default is cfengine-community.
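Nothing Cfengine-specific about kicking the provisioner off, by the way -- it's the usual Vagrant workflow:
# Provisioning runs as part of bringing the box up:
vagrant up
# ...or can be re-run against a VM that's already running:
vagrant provision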
I wanted to test a new version of Wordpress on $WORK's website, and ran into an interesting set of problems. I figured it would be worth setting them down here.
First: I set up a Vagrant box, got it to forward port 8000 to port 80 on the VM, told Cfengine to set it up as a webserver, then copied over the files and database. Turned out I'd forgotten a few things, like installation of mod_proxy_html. I really need to make that into an RPM, especially since it should be pretty trivial, but for now I settled for documenting and scripting my instructions. There were a few other things like that; it's always a good exercise to do this and see what you've left out. Eventually I got it down to a Makefile that I could run on the box itself:
OLD=wordpress-3.4.2
NEW=wordpress-3.5.2
CF=/var/cfengine/bin/cf-agent -f /vagrant/cfengine/masterfiles/promises.cf -KI
go: /var/cfengine/bin /etc/firstrun /www/www.example.com-wordpress /usr/bin/mysql /var/lib/mysql/example_wordpress /etc/httpd/modules/mod_proxy_html.so
sudo $(CF)
/var/cfengine/bin:
sudo rpm -ivh cfengine-community-3.3.0-1.x86_64.rpm
/etc/firstrun:
sudo $(CF) -Dinstall_now_please
sudo touch /etc/firstrun
/www/www.example.com-wordpress:
sudo tar -C /www -xvzf /vagrant/wordpress.tgz
/var/lib/mysql/example_wordpress:
mysql -u root -e"create database example_wordpress; grant all on example_wordpress.* to 'wordpress'@'localhost' identified by 's33kr1t'; flush privileges; use example_wordpress; source /vagrant/example-wordpresswp.sql;"
/usr/bin/mysql:
sudo /var/cfengine/bin/cf-agent -f /vagrant/cfengine/masterfiles/promises.cf -KI -Dinstall_now_please
/etc/httpd/modules/mod_proxy_html.so:
tar -C /tmp -xvjf /vagrant/mod_proxy_html.tar.bz2
sudo bash -c 'cd /tmp/mod_proxy_html ; /usr/sbin/apxs -I /usr/include/libxml2 -I . -c -i mod_proxy_html.c'
disable_plugins:
mysql -B -u root example_wordpress -e "select option_value from wp_options where option_name='active_plugins';" | sed -e's/^/update wp_options set option_value=QQQ/;s/$$/QQQ where option_name="active_plugins";/;' | tail -1 | sed -e"s/QQQ/'/g" > /tmp/restore
mysql -u root example_wordpress -e'update wp_options set option_value="a:0:{}" where option_name="active_plugins";'
enable_plugins:
mysql -u root example_wordpress < /tmp/restore
unpack_wp:
sudo tar -C /www/www.example.com-wordpress -xvzf /vagrant/wordpress-3.5.2.tar.gz
sudo mv /www/www.example.com-wordpress/wordpress $(NEW)
-sudo rm -r $(OLD)/wp-includes
-sudo rm -r $(OLD)/wp-admin
-sudo mv $(NEW)/wp-includes $(OLD)
-sudo mv $(NEW)/wp-admin $(OLD)
sudo find $(NEW) -maxdepth 1 -type f -exec cp -v {} $(OLD) \;
force_upgrade:
wget "http://localhost/wp-admin/upgrade.php?step=1&backto=%2Fwp-admin%2F"
upgrade: disable_plugins unpack_wp force_upgrade enable_plugins
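Driving it is then just a matter of invoking the targets above, something like:
# On the Vagrant box, from the directory holding the Makefile:
make go        # base install: Cfengine run, Wordpress files, database, mod_proxy_html
make upgrade   # disable plugins, unpack the new Wordpress, force the upgrade, re-enable plugins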
However, when I browsed to localhost:8000 it tried to redirect me to the actual URL for the work website (http://work.example.com), rather than simply showing me the page and serving it all locally. Turns out this is a known problem, and the solution is to use one of Wordpress' many ways to set the site URL. The original poster used the RELOCATE method, but I had better luck setting the URL manually:
define('WP_HOME','http://localhost:8000');
define('WP_SITEURL','http://localhost:8000');
I can do this manually, but it's better to get Cfengine to do this. First, we have an agent bundle to edit the file:
bundle agent configure_wp_for_vagrant_testing {
files:
vagrantup_com::
"/var/www/wordpress/wp-config.php"
edit_line => vagrant_testing_wpconfig;
}
We specify the lines to add. Rather than install the lines in two passes, which is non-convergent, we add just one line that happens to have an embedded newline:
bundle edit_line vagrant_testing_wpconfig {
insert_lines:
"define('WP_HOME','http://localhost:8000');
define('WP_SITEURL','http://localhost:8000');" location => wp_config_thatsallfolks;
}
(I found that on the Cfengine mailing list, but I've lost the link.) And finally, we specify the location. This depends on having the default comment in wp-config that indicates the end of user-settable vars, but it seems a safe bet:
body location wp_config_thatsallfolks {
select_line_matching => "^/\* That's all, stop editing. Happy blogging. \*\/.*$";
before_after => "before";
}
Second, the production webserver actually hosts a bunch of different sites, and we have separate config files for each of them. Since I was getting Cf3 to configure the VM just as if it was production, the VM got all these config files too. Turned out that browsing to http://localhost:8000 gave me what Apache thought was the default site -- which is the VirtualHost config listed first, which in our case was not our main site. I got around that by renaming our main site's config file to 000-www.example.com.conf (a trick I stole from Debian). Now I could see our main website at http://localhost:8000.
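The rename itself is nothing fancy (the original filename here is a guess at my layout):
# Sort our main site's config to the top of conf.d so Apache treats it as the default vhost:
sudo mv /etc/httpd/conf.d/www.example.com.conf /etc/httpd/conf.d/000-www.example.com.conf
sudo service httpd reload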
Third, testing: normally I rely on Nagios to do this sort of thing, but it's kind of hard to point it at a VM that might be only around for a few minutes. I could add tests to Cfengine, and that's probably a good idea; however, right now I wanted to try out serverspec, a Ruby-based test suite that lets you verify server attributes.
The serverspec docs say they can run tests on a Vagrant machine, and that all you have to do is tell it so when running "serverspec-init". However, I had problems with this; it asked me for a VM name, and I didn't have one...there was only one machine set up, and it didn't seem to like "default". I didn't spend a lot of time on this, but instead went to running the serverspec tests on the Vagrant box itself. That brought its own problems, since installing gems in CentOS 5 via the default Ruby (1.8.5) causes buffer overflows. A better person would build a newer RPM, rather than complain about non-standard repos. However, this Gist does the trick rather nicely (though I also removed the stock Ruby and didn't bother installing Chef).
Okay, so: running "serverspec-init" on the Vagrant box created a nice set of default tests for a website. I modified the test for the website config file to look for the right config file and server name:
describe file('/etc/httpd/conf.d/000-www.example.com.conf') do
it { should be_file }
it { should contain "ServerName www.example.com" }
end
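Running them is just the stock Rakefile that serverspec-init generates:
# On the Vagrant box, in the directory where serverspec-init was run:
rake spec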
First, a short function to attach a file when editing a Markdown page in ikiwiki:
(defun x-hugh-wiki-attach-file-to-wiki-page (filename)
"This is my way of doing things."
(interactive "fAttach file: ")
;; doubled slash, but this makes it clear
(let* ((page-name (file-name-nondirectory (file-name-sans-extension (buffer-file-name))))
(local-attachments-dir (format "%s/attachments/%s" (file-name-directory (buffer-file-name)) page-name))
(attachment-file (file-name-nondirectory filename))
(attachment-url (format "https://wiki.example.org/wiki/attachments/%s/%s" page-name attachment-file)))
(make-directory local-attachments-dir 1)
(copy-file filename local-attachments-dir 1)
(insert-string (format "[[%s|%s]]" attachment-file attachment-url))))
Note the way I'm organizing things: there's a directory in the wiki/tree called "attachments"; a subdirectory is created for each page; and the file is dumped there.
Second, a stupid copy-file-template function for Cfengine:
(defun x-hugh-cf3-insert-file-template (file)
"Insert a copy-file template."
(interactive "sFile to copy: ")
(newline-and-indent)
(insert (format "\"%s\"" file))
(newline-and-indent)
(insert (format " comment => \"Copy %s into place.\"," file))
(newline-and-indent)
(insert " perms => mog(\"0755\", \"root\", \"wheel\"),")
(newline-and-indent)
(insert (format " copy_from => secure_cp(\"$(g.masterfiles)/centos/5%s\", \"$(g.masterserver)\");" file)))
Both are mostly learning exercises and excuses to post.
First day back at $WORK after the winter break yesterday, and some...interesting...things. Like finding out about the service that didn't come back after a power outage three weeks ago. Fuck. Add the check to Nagios, bring it up; when the light turns green, the trap is clean.
Or when I got a page about a service that I recognized as having, somehow, to do with a webapp we monitor, but no real recollection of what it does or why it's important. Go talk to my boss, find out he's restarted it and it'll be up in a minute, get the 25-word version of what it does, add him to the contact list for that service and add the info to documentation.
I start to think about how to include a link to documentation in Nagios alerts, and a quick search turns up "Default monitoring alerts are awful", a blog post by Jeff Goldschrafe about just this. His approach looks damned cool, and I'm hoping he'll share how he does this. Inna meantime, there are the Nagios config options "notes", "notes_url" and "action_url", which I didn't know about. I'll start adding stuff to the Nagios config. (Which really makes me wish I had a way of generating Nagios config...sigh. Maybe NConf?)
But also on Jeff's blog I found a post about Kaboli, which lets you interact with Nagios/Icinga through email. That's cool. Repo here.
Planning. I want to do something better with planning. I've got RT to catch problems as they emerge, and track them to completion. Combined with org-mode, it's pretty good at giving me a handy reference for what I'm working on (RT #666) and having the whole history available. What it's not good at is big-picture planning...everything is just a big list of stuff to do, not sorted by priority or labelled by project, and it's a big intimidating mess. I heard about Kanban when I was at LISA this year, and I want to give it a try...not sure if it's exactly right, but it seems close.
And then I came across Behaviour-driven infrastructure through Cucumber, a blog post from Lindsay Holmwood. Which is damn cool, and about which I'll write more another time. Which led to the Github repo for a cucumber/nagios plugin, and reading more about Cucumber, and behaviour-driven development versus test-driven development (hint: they're almost exactly the same thing).
My god, it's full of stars.
I always seem to forget how to do this, but it's actually pretty simple. Assume you want to test a new bundle called "test", and it's in a file called "test.cf". First, make sure your file has a control stanza like this:
body common control {
inputs => { "/var/cfengine/inputs/cfengine_stdlib.cf" } ;
bundlesequence => { "test" } ;
}
Note:

inputs must not include the file "test.cf" itself -- otherwise, you'll get the error "Redefinition of body "control" for "common" is a broken promise, near token '{'". Think of "inputs" as really being named "additional inputs". I'm including the cfengine_stdlib.cf file; you should too.

bundlesequence is set to your bundle (which I'm leaving out of this entry for simplicity).
Second, invoke it like so:
sudo /var/cfengine/bin/cf-agent -KI -f /path/to/test.cf
Note:

-K means "run no matter how soon after the last time it was run."
-I shows a list of promises repaired.
-f gives the path to the file you're testing.

When sub was released by 37signals, I liked it a lot. Over the last couple of months I've been putting together a sub for Cfengine. Now it's up on Github, and of course my own repo. It's not pretty, but there are some pretty handy things in there. Enjoy!
Back in January, yo, I wrote about trying to figure out how to use Cfengine3 to do SELinux tasks; one of those was pushing out SELinux modules. These are encapsulated bits of policy, usually generated by piping SELinux logs to the audit2allow command. audit2allow usually makes two files: a source file that's human-readable, and a sorta-compiled version that's actually loaded by semodule.
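For reference, the generation step usually looks something like this (the grep pattern and module name here are just examples):
# Turn recent AVC denials into a module; -M writes both postfixpipe.te (source) and postfixpipe.pp (compiled):
grep postfix /var/log/audit/audit.log | audit2allow -M postfixpipe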
So how do you deploy this sort of thing on multiple machines? One option would be to copy around the compiled module...but while that's technically possible, the SELinux developers don't guarantee it'll work (link lost, sorry). The better way is to copy around the source file, compile it, and then load it.
SANSNOC used this approach in puppet. I contacted them to ask if it was okay for me to copy their approach/translate their code to Cf3, and they said go for it. Here's my implementation:
bundle agent add_selinux_module(module) {
# This whole approach copied/ported from the SANS Institute's puppet modules:
# https://github.com/sansnoc/puppet
files:
centos::
"/etc/selinux/local/."
comment => "Create local SELinux directory for modules, etc.",
create => "true",
perms => mog("700", "root", "root");
"/etc/selinux/local/$(module).te"
comment => "Copy over module source.",
copy_from => secure_cp("$(g.masterfiles)/centos/5/etc/selinux/local/$(module).te", "$(g.masterserver)"),
perms => mog("440", "root", "root"),
classes => if_repaired("rebuild_$(module)");
"/etc/selinux/local/setup.cf3_template"
comment => "Copy over module source.",
copy_from => secure_cp("$(g.masterfiles)/centos/5/etc/selinux/local/setup.cf3_template", "$(g.masterserver)"),
perms => mog("750", "root", "root"),
classes => if_repaired("rebuild_$(module)");
"/etc/selinux/local/$(module)-setup.sh"
comment => "Create setup script. FIXME: This was easily done in one step in Puppet, and may be stupid for Cf3.",
create => "true",
edit_line => expand_template("/etc/selinux/local/setup.cf3_template"),
perms => mog("750", "root", "root"),
edit_defaults => empty,
classes => if_repaired("rebuild_$(module)");
commands:
centos::
"/etc/selinux/local/$(module)-setup.sh"
comment => "Actually rebuild module.",
ifvarclass => canonify("rebuild_$(module)");
}
Here's how I invoke it as part of setting up a mail server:
bundle agent mail_server {
vars:
centos::
"selinux_mailserver_modules" slist => { "postfixpipe",
"dovecotdeliver" };
methods:
centos.selinux_on::
"Add mail server SELinux modules" usebundle => add_selinux_module("$(selinux_mailserver_modules)");
}
(Yes, that really is all I do as part of setting up a mail server. Why do you ask? :-) )
So in the add_selinux_module
bundle, a directory is created for
local modules. The module source code, named after the module itself,
is copied over, and a setup script created from a Cf3 template. The
setup template looks like this:
#!/bin/sh
# This file is configured by cfengine. Any local changes will be overwritten!
#
# Note that with template files, the variable needs to be referenced
# like so:
#
# $(bundle_name.variable_name)
# Where to store selinux related files
SOURCE=/etc/selinux/local
BUILD=/etc/selinux/local
/usr/bin/checkmodule -M -m -o ${BUILD}/$(add_selinux_module.module).mod ${SOURCE}/$(add_selinux_module.module).te
/usr/bin/semodule_package -o ${BUILD}/$(add_selinux_module.module).pp -m ${BUILD}/$(add_selinux_module.module).mod
/usr/sbin/semodule -i ${BUILD}/$(add_selinux_module.module).pp
/bin/rm ${BUILD}/$(add_selinux_module.module).mod ${BUILD}/$(add_selinux_module.module).pp
Note the two kinds of disambiguating brackets here: {curly} to indicate shell variables, and (round) to indicate Cf3 variables.
As noted in the bundle comment, the template might be overkill; I think it would be easy enough to have the rebuild script just take the name of the module as an argument. But it was a good excuse to get familiar with Cf3 templates.
I've been using this bundle a lot in the last few days as I prep a new mail server, which will be running under SELinux, and it works well. Actually creating the module source file is something I'll put in another post. Also, at some point I should probably put this up on Github FWIW. (SANS had their stuff in the public domain, so I'll probably do BSD or some such... in the meantime, please use this if it's helpful to you.)
UPDATE: It's available on Github and my own server; released under the MIT license. Share and enjoy!
Nagios and Cf3 each have their strengths:
Nagios plugins, frankly, are hard to duplicate in Cfengine. Check out this Cf3 implementation of a web server check:
bundle agent check_tcp_response {
vars:
"read_web_srv_response" string => readtcp("php.net", "80", "GET /manual/en/index.php HTTP/1.1$(const.r)$(const.n)Host: php.net$(const.r)$(const.n)$(const.r)$(const.n)", 60);
classes:
"expectedResponse" expression => regcmp(".*200 OK.*\n.*", "$(read_web_srv_response)");
reports:
!expectedResponse::
"Something is wrong with php.net - see for yourself: $(read_web_srv_response)";
}
That simply does not compare with this Nagios stanza:
define service{
use local-service ; Name of service template to use
hostgroup_name http-servers
service_description HTTP
check_command check_http
}
define command{
command_name check_http
command_line $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
}
My idea, which I totally stole from this article, was to invoke Cfengine from Nagios when necessary, and let Cf3 restart the service. Example: I've got this one service that monitors a disk array for faults. It's flaky, and needs to be restarted when it stops responding. I've already got a check for the service in Nagios, so I added an event handler:
define service{
use local-service ; Name of service template to use
host_name diskarray-mon
service_description diskarray-mon website
check_command check_http!-H diskmon.example.com -S -u /login.html
event_handler invoke_cfrunagent
}
define command{
command_name invoke_cfrunagent
command_line $USER2$/invoke_cfrunagent.sh -n "$SERVICEDESC$" -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -a $HOSTADDRESS$
}
Leaving out some getopt()
stuff, invoke_cfrunagent.sh looks like this:
# Convert "diskarray-mon website to disarray-mon_website":
SVC=${SVC/ /_}
STATE="nagios_$STATE"
TYPE="nagios_$TYPE"
# Debugging
echo "About to run sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE" | /usr/bin/logger
# We allow this in sudoers:
sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE
cf-runagent is a request, not an order, to the running cf-serverd process to fulfill already-configured promises; it's like saying "If you don't mind, could you please run now?"
Finally, this was to be detected in Cf3 like so:
methods:
diskarray-mon_website.nagios_CRITICAL.nagios_HARD::
"Restart the diskarray monitoring service" usebundle => restart_diskarray_monitor();
(This stanza is in a bundle that I know is called on the disk array monitor.)
Here's what works:
What doesn't work:
running cf-runagent, either as root or as nagios. It seems to stop after analyzing classes and such, and not actually do anything. I'm probably misunderstanding how cf-runagent is meant to work.
Nagios will only run an event handler when things change -- not all the time until things get better. That means that if the first attempt by Cf3 to restart doesn't work, for whatever reason, it won't get run again.
What might work better is using this Cf3 wrapper for Nagios plugins (which I think is the same approach, or possibly code, discussed in this mailing list post).
Anyhow...This is a sort of half-assed attempt in a morning to get something working. Not there yet.
Just had a dream where I'd been called into Sun, just before Oracle's takeover, to figure out why they were spending so much money on eyeglasses for employees. "We think it's part of their benefits, but our accounting department doesn't have a separate line item for it," someone explained. My eyebrows lifted in disbelief. "Well, then, it's damned lucky for you I've got Cfengine."
Over the last two days, in a frenzy of activity, I got some awesome done at work: using git and Vagrant, I finally got Cfengine to install packages in Ubuntu without asking me any goram questions. There were two bits involved:
Telling Apt to use old config files (sketched below, after the next point). This prevents it from asking for confirmation when it comes across your already-installed-with-Cfengine config file. Cfengine doesn't do things in a particular order, and in any case I do package installation once an hour -- so I might well have an NTP config file show up before the NTP package itself.
Preseeding, which I've known about for a while but have not had a chance to get right. My summer student came up with a script to do this, and I hope to be able to release it.
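The Apt side boils down to the usual dpkg options plus a non-interactive frontend, and the preseeding side is debconf-set-selections; a rough sketch (the package and the preseed line are just examples, not what we actually ship):
# Keep existing config files and never prompt:
export DEBIAN_FRONTEND=noninteractive
apt-get -y \
  -o Dpkg::Options::="--force-confdef" \
  -o Dpkg::Options::="--force-confold" \
  install ntp
# Preseed answers before the package ever gets a chance to ask:
echo "postfix postfix/main_mailer_type select No configuration" | debconf-set-selections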
Now: Fully automated package installation FTMFW.
And did you know that Emacs can check your laptop battery status? I didn't.
More conversations with Mark Burgess via Twitter (a continuation from here). I should note that this was all a week or so ago now; I've been meaning to put this up here.
markburgess_osl: @saintaardvark Doc is "what" code is "how". I believe the lasting intention comes before a specific implementation. #devops #sysadmin
saintaardvark: .@markburgess_osl Hm. So let's see if I've got this right: the programmer in me notices lots of overlap in my Cf3 config...
saintaardvark: .@markburgess_osl ...and wants to consolidate. Cf3 syntax makes this a hairy proposition at best. But this is not really a problem...
saintaardvark: .@markburgess_osl ...because I should be thinking about this as documentation (which can be long) of the desired system state...
saintaardvark: .@markburgess_osl ...rather than code (where the drive is for efficiency and lack of duplication). Have I got that right? #sysadmin
markburgess_osl: @saintaardvark Documentation => focus on end state (like GPS), Code => focus on start state + directions. The journey is irrelevant.
markburgess_osl: @saintaardvark Docs also improved by seeing themes and patterns. That is still WHAT not HOW. So no contradiction.
So putting this in practical (can't resist the temptation to say "less Yoda-like") terms: what I think he's saying is, don't worry about code duplication or getting clever; you're documenting desired system state, and it's okay to be verbose.
Using the example I started with, it's okay to have NTP settings in multiple places (because SuSE needs two files, Solaris 1, etc). The coder in me wants to clean those up because it's all NTP, but the documentationist ("writers", I think they're called) relaxes and says "Can't have too much documentation." Which is fair.
But then I worry about having Multiple Sources of Truth(tm). The advantage of the first setup is that when I change the NTP server, it's ALL in one place; in the second setup, I have to remember: did I change it for SuSE? Solaris? CentOS? I've learned the hard way to be wary of such setups. I nearly always miss something; that's why I'm aggressive about consolidating.
I'm still mulling all this over.
Mark Burgess was kind enough to respond to my earlier post about Cfengine syntax:
markburgess_osl: @saintaardvark (soothing) Syntax is definitely an acquired taste (re perl ;)). The list-ref prob can go away soon. Think doc not code 4 cf3
And then, via tweetsification, we were all like:
saintaardvark: .@markburgess_osl Heh, thanks for the reply -- I was going to ask you about this. Fair pt re: syntax being an acquired taste...[1/2]
saintaardvark: .@markburgess_osl ...but any chance the mess of brackets will be reduced? [2/2]
markburgess_osl: @saintaardvark trade one set of () for -> Don't see much point in that. $() has long precedence in sh / make etc. It delimits clearly in txt
saintaardvark: .@markburgess_osl Fair enough, but I'm also thinking of eg "$(services.cfg_file[$(service)])": dollar bracket scope dot square dollar bracket
markburgess_osl: @saintaardvark I agree it's clumsy, but it's also an edge case. You rarely write this if you make good use of patterns. Perl also ugly here.
But this layout came from their own dang documentation! I feel like I'm stuck here:
[old entry recovered from backup!]
That last point: what I mean is that the whole appeal of that layout (pattern/whatever) was that you could just say fix_service('foo'), and The Right Thing(tm) would happen. Now I have to rethink this; it seems to mean either having lots of bundles like "fix_ntp", "fix_autofs", etc -- with lots of sections like:
vars:
SuSE::
"files" slist => {"this", "that"};
Centos::
"files" string => "just_this";
...or else having separate "fix_service" bundles for each class. (Forgive me, I'm thinking about all this w/o having a Cf3 instance to play with in front of me.)
I'm trying not to sound whiny here; I'm grateful for Cf3, for the documentation (which is pretty extensive), and that Mark took the time to respond. But this is frustrating.
Cfengine 3 has a lot of things going for it. But its syntax is not one of them.
Consider this situation: you have CentOS machines, SuSE machines and Solaris machines. All of them should run, say, SSH, NTP and Apache -- why not? The files are slightly different between them, and so is the method of starting/stopping/enabling services, but mostly we're doing the same thing.
I've got a bundle in Cfengine that looks like this:
bundle common services {
vars:
redhat|centos::
"cfg_file_prefix" string => "centos/5";
"cfg_file[httpd]" string => "/etc/httpd/conf/httpd.conf";
"daemon[httpd]" string => "httpd";
"start[httpd]" string => "/sbin/service httpd start";
"enable[httpd]" string => "/sbin/chkconfig httpd on";
"cfg_file[ssh]" string => "/etc/ssh/sshd_config";
"daemon[ssh]" string => "sshd";
"start[ssh]" string => "/sbin/service sshd restart";
"enable[ssh]" string => "/sbin/chkconfig sshd on";
...and so on. We're basically setting up four hashes -- daemon, start, enable and cfg_file -- and populating them with the appropriate entries for Red Hat/CentOS ssh and Apache configs; you can imagine slightly different entries for Solaris and SuSE. The cfg_file_prefix allows me to put CentOS' config files in a separate directory from other OSes.
Then there's this bundle:
bundle agent fix_service(service) {
files:
"$(services.cfg_file[$(service)])"
copy_from => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
classes => if_repaired("$(service)_restart"),
comment => "Copy a stock configuration file template from repository";
processes:
"$(services.daemon[$(service)])"
comment => "Check that the server process is running, and start if necessary",
restart_class => canonify("$(service)_restart"),
ifvarclass => canonify("$(services.daemon[$(service)])");
commands:
"$(services.start[$(service)])"
comment => "Method for starting this service",
ifvarclass => canonify("$(service)_restart");
"$(services.enable[$(service)])"
comment => "Method for enabling this service",
ifvarclass => canonify("$(service)_restart");
}
This bundle takes a service name as an argument, and assigns it to the local variable "service". It copies the OS-and-service-appropriate config file into place if it needs to, and enables/starts the service if it needs to. How does it know if it needs to? By setting the class "$(service)_restart" if the service isn't running, or if the config file had to be copied.
So far, so good. Well, except for the mess of brackets. All those hashes are in the services bundle, so you need to be explicit about the scope. (There are provisions for global variables, but I've kept my use of 'em to a minimum.) And so what in Perl would be, say:
$services->{start}{$service}
becomes
"$(services.start[$(service)])"
Square brackets for the hash, round brackets for the string (and to indicate that you're using a variable -- IOW, it's "$(variable)", not "$variable" like you're used to), and dots to indicate scope ("services.start" == the start variable in the services bundle).
It's...well, it's an ugly mess o' brackets. But I can deal with that. And this arrangement/pattern, which came from the Cfengine documentation itself, has been pretty helpful to me for dealing with single config file services.
But what about the case where a service has more than one config file? Like autofs: you gotta copy around a map file but in SuSE you also need /etc/sysconfig/autofs to set the LDAP variables.
Again, in Perl this would be an anonymous array on top of a hash -- something like:
$services->cfg_file{"autofs"}[0] = "/etc/auto.master
$services->cfg_file{"autofs"}[1] = "/etc/sysconfig/aufofs"
and you'd walk it like so:
foreach my $i (@{ $services->{cfg_file}{autofs} }) { # something with $i }
or even:
for (@{ $services->{cfg_file}{autofs} }) { # something with $_ }
(I think...I'm embarrassed sometimes at how rusty my Perl is.)
In Cfengine, you pile an anonymous array on top of a hash like so:
"cfg_file[autofs]" slist => { "/etc/auto.master", "/etc/sysconfig/autofs" };
An slist is a list of strings. All right, fine; different layout, same idea, stick it in the services bundle and away we go. But: remote scalars can be referenced; remote lists cannot without gymnastics. From the docs:
During list expansion, only local lists can be expanded, thus global list references have to be mapped into a local context if you want to use them for iteration. Instead of doing this in some arbitrary way, with possibility of name collisions, cfengine asks you to make this explicit. There are two possible approaches.
The first of those two approaches is, I think, passing the list as a parameter, whereupon it just works? maybe? (It's a not-so-minor nitpick that there are lots of examples in the Cf3 handbook that are not explained and don't make much sense. They apparently work, but how is not at all clear, or discernible.) I think it's meant to be like Perl's let's-flatten-everything-into-a-list approach to passing variables.
The second is to just go ahead and redeclare the remote slist (array) as a local one that's set to the remote value. Again, from the docs:
bundle common va {
vars:
"tmpdirs" slist => { "/tmp", "/var/tmp", "/usr/tmp" };
}
bundle agent hardening {
classes:
"ok" expression => "any";
vars:
"other" slist => { "/tmp", "/var/tmp" };
"x" slist => { @(va.tmpdirs) };
reports:
ok::
"Do $(x)";
"Other: $(other)";
}
which makes this prelude to all of that handwaving even more irritating:
Instead of doing this in some arbitrary way, with possibility of name collisions...
...
...I mean...
...I mean, what is the point of requiring explicit paths to variables in other scopes if you're just going to insert random speedbumps to assuage needless worries about name collisions? What the hell is with this let's-redeclare-it-AGAIN approach?
The rage, it fills me.
In Cfengine3, I had been setting up printers for people using lpadmin commands. Among other things, it used a particular PPD file for the local HP printer. It turns out that in Oneiric, those files are no longer present, or even available; judging by what I found on my laptop, the PPD file is (I think) generated automagically by /usr/share/cups/ppd-updaters/hplip-cups.
It's possible that I could figure this out for my new workstation. But right now, I don't think I can be bothered. I'm going to just set this up by hand, and hope that either I'll get a print server or I'll figure it out.
No native support in Cf3 for SELinux.
I've added a bundle that enables/disables booleans and have used it on one machine; this is pretty trivial.
File contexts and restorecon appear to be mainly controlled by plain old files in /etc/selinux/targeted/contexts/files, but there are stern warnings about letting libselinux manage them. However, this thread on the SELinux mailing list seems to say it's okay to copy them around.
Puppet appears to be further ahead in this. This guy compiles policy files locally using Puppet; this other dude has a couple of posts on this. There are yet other other folks using Puppet to do this, and it would be worth checking them out as a source of ideas.
I need to improve my collection of collective pronouns.
I tripped across this error today with Cfengine 3:
cf3:./inputs/promises.cf:1,22: Redefinition of body "control" for "common" is a broken promise, near token '{'
The weird thing was this was a stripped down promises.cf, and I could not figure out why it was complaining about redefinitions. I finally found the error:
body common control {
bundlesequence => { "test" };
inputs => { "promises.cf", "cfengine_stdlib.cf" };
}
Yep, including the promises.cf file itself in the inputs section borked everything; removing it fixed things right away.
I've got a new workstation at $WORK. (Well, where else would it be?) It's pretty sweet: i7 quad-core processor, clock speed > 3GHz (honestly, I barely keep track anymore), and 8GB of RAM. 8GB! Insane.
When I arrived in 2008, I used a -- not cast-off, but unused P4 with 4 GB of RAM. I didn't want to make a big fuss about it; I saved the fuss, instead, for a nice business laptop from Dell that worked well with Linux. Since 90% of my work is Firefox + Emacs + XTerms, and my WM of choice at the moment is Awesome, speed was not a problem and the memory was fine.
Lately, though, I've discovered Vagrant. It looks pretty sweet, but my current machine is sloooow when I try to run a couple of VMs. (So's my laptop, despite a better processor; I suspect the 5400RPM drive.) I'm hoping that the new machine will make a big difference.
Just gotta install Ubuntu and move stuff over. Fortunately I've been pretty good about keeping my machine config in Cfengine, so that'll help. And then build some VMs. I'm always surprised at people who feel comfortable downloading random VM images from the Internet. Yeah, it's probably okay...but how do you know?
One thing that Vagrant is missing is integration with Cfengine. Fortunately, the documentation for extending it seems pretty good (plus, I can always kick things off with a shell script). This might be an excuse to learn Ruby.
At work, I'm about to open up the Rocks cluster to production, or at least beta. I'm finally setting up the attached disk array, along with home directories and quotas, and I've just bumped into an unsettled question:
How the hell do I manage this machine?
On our other servers, I use Cfengine. It's a mix of version 2 and 3, but I'm migrating to 3. I've used Cf3 on the front end of the cluster semi-regularly, and by hand, to set things like LDAP membership, automount, and so on -- basically, to install or modify files and make sure I've got the packages I want. Unlike the other machines, I'm not using cfexecd to run Cf3 continuously.
The assumption behind Cf3 and other configuration management tools -- at least in my mind -- is that if you're doing it once, you'll want to do it again. (Of course, there's also stuff like convergence, distributed management and resisting change, but leave that for now.) This has been a big help, because the changes I needed to apply to the Rocks FE were mostly duplicates of my usual setup.
If/when I change jobs/get hit by a bus, I've made it abundantly clear in my documentation that Cfengine is The Way I Do Things. For a variety of reasons, I think I'm fairly safe in the assumption that Cf3 will not be too hard for a successor to pick up. If someone wants to change it afterward, fine, but at least they know where to start.
OTOH, Rocks has the idea of a "Restore Roll" -- essentially a package you install on a new frontend (after the old one has burned down, say) to reinstall all the files you've customized. You can edit a particular file that creates this roll, and ask it to include more files. Edited /etc/bashrc? Add it to the list.
I think the assumption behind the Restore Roll is that, really, you set up a new FE once every N years -- that a working FE is the result of rare and precious work. The resulting configuration, like the hardware it rests on, is a unique gem. Replacing it is going to be a pain, no matter what you do. There aren't that many Rocks developers, and making it Really, Really Frickin' Nice is probably a waste of their time.
(I also think it fits in with the rest of Rocks, which seems like some really nice bits surrounded by furiously undocumented hacks and workarounds. But I'm probably just annoyed at YET ANOTHER UNDOCUMENTED SET OF HACKS AND WORKAROUNDS.)
And so you have both a number of places where you can list files to be restored, and an amusing uncertainty about whether the whole mechanism works:
I found that after a re-install of Rocks 5.0.3, not all the files I asked for were restored! I suspect it has to do with the order things get installed.
So now I'm torn.
Do I stick with Cf3? I haven't mentioned my unhappiness with its obtuseness and some poor choices in the language (nine positional arguments for a function? WTF?). I'm familiar with it because I've really dived into it and taken a course at LISA from Mark Burgess his own bad self, but it's taken a while to get here. But it is the way I do just about everything else.
Or do I use the Rocks Restore Roll mechanism? Considered on its own, it's the least surprising option for a successor or fill-in. I just wish I could be sure it would work, and I'm annoyed that I'd have to duplicate much of the effort I've put into Cf3.
Gah. What a mess.
At $work I'm migrating slowly to Cfengine 3. One of the attractions is the ability to do what this page shows: loop over lists in a Cf-ish kind of way.
Here's the first bundle. (It's pretty much stolen from that page, but customized for my environment.) It tells you some basic details about the config file, the process name and the restart command for different daemons:
bundle common services {
vars:
redhat|centos::
"cfg_file_prefix" string => "centos/5";
"cfg_file[ssh]" string => "/etc/ssh/sshd_config";
"daemon[ssh]" string => "sshd";
"start[ssh]" string => "/sbin/service sshd restart";
"enable[ssh]" string => "/sbin/chkconfig sshd on";
"cfg_file[iptables]" string => "/etc/sysconfig/iptables";
"start[iptables]" string => "/sbin/service iptables restart";
"enable[iptables]" string => "/sbin/chkconfig iptables on";
}
Here's the bundle that copies config files and restarts the daemon if necessary:
bundle agent fix_service(service) {
files:
"$(services.cfg_file[$(service)])"
copy_from => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
perms => mog("0600","root","root"),
classes => if_repaired("$(service)_restart"),
comment => "Copy a stock configuration file template from repository";
processes:
"$(services.daemon[$(service)])"
comment => "Check that the server process is running, and start if necessary",
restart_class => canonify("$(service)_restart");
commands:
"$(services.start[$(service)])"
comment => "Method for starting this service",
ifvarclass => canonify("$(service)_restart");
"$(services.enable[$(service)])"
comment => "Method for enabling this service",
ifvarclass => canonify("$(service)_restart");
}
And here's the loop that puts it all together:
bundle agent redhat {
vars:
"service" slist => { "ssh", "iptables" };
methods:
"any" usebundle => fix_service("$(service)"),
comment => "Make sure the basic application services are running";
}
I ran into a problem with this, though: it would always, without
fail, restart iptables even though no config file had been copied.
The problem was with the process check: there's no process to check
for with iptables. And from what I can tell, when the processes
stanza was asked to check for a non-existent variable, it checked for
the literal string $(services.daemon[$(service)])
-- that is,
dollar-bracket-s-e-r-v-.... Since there was no such thing, it decided
it needed restarting.
The way around this was to add this variable to the services bundle (the one that has all the info about the daemons):
"daemon[iptables]" string => "cf_null";
I also had to modify the processes stanza:
processes:
$(services.daemon[$(service)])"
comment => "Check that the server process is running, and start if necessary",
restart_class => canonify("$(service)_restart"),
ifvarclass => canonify("$(services.daemon[$(service)])");
That ifvarclass
check on the last line says to run iff there is a
value for daemon. cf_null
is a NULL value special to cfengine.
Since the check fails for iptables, the process check isn't run and
we only restart if we copy over a new config file.
Xmas vacation is when I get to do big, disruptive maintenance with a fairly free hand. Here's some of what I did and what I learned this year.
I made the mistake of rebooting one machine first: the one that held the local CentOS mirror. I did this thinking that it would be a good guinea pig, but then other machines weren't able to fetch updates from it; I had to edit their repo files. Worse, there was no remote console on it, and no time (I thought) to take a look.
Last year I tried getting machines to upgrade using Cfengine like so:
centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
"/usr/bin/yum -q -y clean all"
"/usr/bin/yum -q -y upgrade"
"/usr/bin/reboot"
This didn't work well: I hadn't pushed out the changes in advance, because I was paranoid that I'd miss something. When I did push it out, all the machines hit on the cfserver at the same time (more or less) and didn't get the updated files because the server was refusing connections. I ended up doing it by hand.
This year I pushed out the changes in advance, but it still didn't work because of the problems with the repo. I ran cssh, edited the repos file and updated by hand.
This worked okay, but I had to do the machines in separate batches -- some needed to have their firewall tweaked to let them reach a mirror in the first place, some I wanted to watch more carefully, and so on. That meant going through a list of machines, trying to figure out if I'd missed any, adding them by hand to cssh sessions, and so on.
I may need to give in and look at RHEL, or perhaps func or better Cfengine tweaking will do the job.
Quick and dirty way to make sure you don't overload your PDUs:
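# $RANDOM maxes out at 32767, so dividing by 200 staggers things over roughly 0-160 seconds: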
sleep $(expr $RANDOM / 200 ) && reboot
Rebooting one server took a long time because the ILOM was not working well, and had to be rebooted itself.
Upgrading the database servers w/the 3 TB arrays took a long time: stock MySQL packages conflicted with the official MySQL rpms, and fscking the arrays takes maybe an hour -- and there's no sign of life on the console while you're doing it. Problems with one machine's ILOM meant I couldn't even get a console for it.
Holy mother of god, what an awful time this was. I spent eight hours on upgrades for just nine desktop machines. Sadly, most of it was my fault, or at least bad configuration:
Graphics drivers: awful. Four different versions, and I'd used the local install scripts rather than creating an RPM and installing that. (Though to be fair, that would just rebuild the driver from scratch when it was installed, rather than do something sane like build a set of modules for a particular kernel.) And I didn't figure out where the uninstall script was 'til 7pm, meaning lots of fun trying to figure out why the hell one machine wouldn't start X.
Lesson: This really needs to be automated.
Lesson: The ATI uninstall script is at /usr/share/ati/fglrx-uninstall.sh. Use it.
Lesson: Next time, uninstall the driver and build a goddamn RPM.
Lesson: A better way of managing xorg.conf would be nice.
Lesson: Look for prefetch options for zypper. And start a local mirror.
Lesson: Pick a working version of the driver, and commit that fucker to Subversion.
These machines run some scientific software: one master, three slaves. When the master starts up at boot time, it tries to SSH to the slaves to copy over the binary. There appears to be no, or poor, rate throttling; if the slaves are not available when the master comes up, you end up with the following symptoms:
The problem is that umpty scp processes on the slave are holding open the binary, and the kernel gets confused trying to run it.
I also ran into problems with a duff cable on the master; confusingly, both the kernel and the switch said it was still up. This took a while to track down.
It turned out that a couple of my kvm-based VMs did not have jumbo frames turned on. I had to use virt-manager to shut down the machines, turn on virtio on the drivers, then reboot. However, kudzu on the VMs then saw these as new interfaces and did not configure them correctly. This caused problems because the machines were LDAP clients and hung when the network was unavailable.
My presentation on Cfengine 3 went pretty well yesterday. There were about 20 people there...I had been hoping for more, but that's a pretty good turnout. I was a little nervous beforehand, but I think I did okay during the talk. (I recorded it -- partly to review afterward, partly 'cos my dad wanted to hear the talk. :-)
One thing that did trip me up a bit was the questions from one person in the audience that went fairly deep into how to use Cfengine, what its requirements were and so on. Since this was meant to be an introduction and I only had an hour, I wasn't prepared for this. Also, the questions went on...and on...and I'm not good at taking charge of a conversation to prevent it being hijacked. The questions were good, and though he and I disagree on this subject I respect his views. It's just that it really threw off my timing, and would have been best left for after. Any tips?
At some point I'm going to put up more on Cf3 that I couldn't really get into in the talk -- how it compares to Cf2, some of the (IMHO) shortfalls, and so on.
A nice thing about working at a university is that you get all this time off at Xmas, which is really nice; however, it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.
Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS' ATS, but discovered that NUT is different from APCUPSD in one important way: it doesn't easily allow you to say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.
It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesitant when I estimate how long something will take, because I feel like I have no way of knowing in advance (interruptions, unexpected obstacles...you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer that I needed.
One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. Upon closer inspection, they were reporting problems with half of the LUNs that the array was presenting to them -- and they were reporting them in different ways. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)
The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But:
cfservd had refused its connection because I had the MaxConnections parameter too low. I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)
Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.
(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)
Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.
And best of 2010 to all of you!
Irritating: chkconfig
on RHEL/CentOS returns non-zero if a service
isn't configured for a runlevel. IOW, you can do:
chkconfig --level 3 foo
and have 0 returned if it's on, 1 if it's not.
But not SuSE; nope, it just returns 0 whether or not it's enabled, or
even if the service itself doesn't exist. Because, you know, grep
doesn't get used enough.
I'm doing this because I'm trying to use cfengine 2 to manage services. This works well in CentOS, where you can add something like:
service_foo_on = (ReturnsZero("/sbin/chkconfig --level 3 foo"))
and it'll work. ("service_foo_on" is a bit of a misnomer, because I'm checking runlevels, not whether it's actually running.)
Update: Nope, I'm wrong. chkconfig --check
does exactly what I
want. Many thanks to yaloki on #openSUSE-server for the help.
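So the check becomes something like this (I believe the exact runlevel-argument syntax is in SuSE's chkconfig man page):
# Exits 0 if the service is enabled, non-zero if not:
chkconfig --check foo && echo "foo is enabled"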
Just ran into an interesting problem: after replacing memory on a server, CentOS booting hung at "Starting system message bus..."
So what does dbus have to do with anything? This turned out to be an
LDAP failure; dbus was trying to run as UID root
, and since the LDAP
server couldn't be contacted it hung. Why couldn't the LDAP server be
contacted? The LDAP server logs only showed this:
[09/Sep/2009:12:04:32 -0700] conn=41492 op=-1 fd=112 closed - SSL
peer cannot verify your certificate.
The CA cert I use was in place, and another machine had just rebooted w/o problems (all this is taken care of with cfengine, so they were identical in this respect). I could connect to the LDAP server on the right port without any problems.
I finally figured out what was going on when I ran:
openssl s_client -connect ldap.example.com:636 -CApath /path/to/cacert_directory
and saw:
Verify return code: 9 (certificate is not yet valid)
date said it was December 31, 2001. What the what now? Ran ntpdate to set things correctly, then I got:
Verify return code: 0 (ok)
I figure the CMOS clock (or whatever the kids are calling it these days) got reset when we had to remove the CPU daughtercard to get at the memory underneath.
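One extra step that's probably worth doing after the ntpdate, so the fix sticks around: write the corrected system time back to the hardware clock.
# Sync the hardware (CMOS) clock from the now-correct system clock:
hwclock --systohc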
And now you know...the rest of the story.
label barcode just failed miserably. (Neat command that.) And I had thought that DTE meant the arm, but no: upon reflection, it's a subtle/obtuse (not the right word, but oh well) way of referring to the tape drive itself.

This sounds like when I was at my previous employer and they asked if I could develop a web-based system to take surveys. I nearly said, "yes" because, well, I know perl, I know CGI, and I could do it. However, I was smart enough to say "no, but surveymonkey.com will do it for cheap." Best of all it was self-service and the HR person was able to do it entirely without me. If I had said I could write such a program, it would have been days of back-and-forth changes which would have driven me crazy. Instead, she was happy to be empowered to do it herself. In fact, doing it herself without any help became a feather in her cap.
The lesson I learned is that "can I do it?" includes "do I want to do it?". If I can do something but don't want to, the answer is, "No, I don't know how" not "I know how but don't want to". The first makes you look like you know your limits. The latter sounds like you are just being difficult.
Another thing I'm trying to do at my new job is make/take more time for long-term planning. I've been dinged by mgt. for this in the past, and while it's not easy to hear I think there has been some validity to this. (My inclination is to concentrate hard on fixing the problems I'm faced with; giving up on something broken, even when doing so would make so much more sense and would free up resources to look for a replacement, just rankles and feels like...well, giving up.) Since the department I'm in is so new, it's even more important to pay attention to this.
Part of the problem is just recognizing that I need to make time. An hour a week to be isolated, and to (say) figure out what I'm going to need to do for the next month, is a habit I'm very conciously trying to adopt.
But another problem is how to keep track of all this. What I've done so far:
I'm a huge fan of Tom Limoncelli's Time Management for System Administrators, and his Cycle system has served me well. I've become a big fan of a paper organizer, so that's how I keep track of things. But it works best as a way of tracking day-to-day stuff; it's not so good at tracking a project that takes weeks, or months, or years.
I've read GTD, and that seems like a good system — but it's very different from The Cycle. I don't want to give up the Cycle, I want to graft on to it. And I'm not sure how well I can do that w/GTD.
I've tried org-mode in Emacs. I'm pretty happy with this, and in fact I switched to it for a while when I first started at this job back in July. It worked well for tracking day-to-day stuff, but I missed the flexibility and ubiquitousness of paper.
So where does that leave me? ATM, (paper planner Cycle) plus attempting some longer-term project tracking w/org-mode. I figure the TODO bits from org-mode will fit well with the planner, and the flexibility of Emacs and org-mode (different from paper...oh, how I wish I could grep paper) will work well for projects...the records for which should, ideally, be suitable for pasting into wiki-based documentation.
If anyone has any suggestions, please let me know. If I make it to LISA this year, I'll be looking for a BOF about this. (Or maybe I'll just tackle Tom Limoncelli to the ground and holler "I love you, man!" a la "Say Anything".)
Moving on:
I really like TrueType font support in Emacs 23, and ttf-inconsolata in particular. Thanks to Emacs-fu for both suggestions.
I and a co-worker picked up the servers that had, for the last two years or so, been racked at BC Women's hospital (of all places...my sons were both born there). We both had the same reaction when we saw them on a cart, ready to be loaded into our truck: "They're so small!" Seven little 1U servers plus one disk array...you start to think of them as larger-than-life when you're not looking at them all the time, and it's easy to forget just how small they are.
Some interesting discussion on the Cfengine mailing list about how Cfengine should handle packages.
And now it is time for bed.
I'm in the process of setting up a bunch of new servers for $job_2. All but one are CentOS 5.2, kickstart installed and managed with cfengine. This is the third time I've gone through a cfengine setup, and it always feels like starting from scratch each time. It seems -- and I'm not at all sure this is fair or accurate -- that each time I set up one of these systems, there's a lot that I've lost from the last time and have to relearn. I'm fortunate this time that I can refer to $job_1's setup to see how I did things last time, but if I didn't have that I'd be significantly further behind than I am.
I'm not sure what the solution is. Part of me thinks I should just be more aggressive about taking notes, or committing stuff to a private repository, or writing it down here more; part of me thinks that this might be a clue that cfengine is too low-level for my head. It feels like when I was trying to learn C, and couldn't believe that I had to remember all this stuff just to print something, or read a file, or connect to another machine over the Internet. By contrast, Perl (or any other scripted language) was such a relief...just print, or open, or use the Net::Telnet module, or whatever. The details are there and they are important, sometimes very much so; that doesn't mean I want to learn more metallurgy every time I need a fork. (No, I don't think that metaphor's tortured; why do you ask?)
Another thing is that I'm trying to get multipath connections working for the first time. We've got two database servers, each of which is connected via dual SAS HBAs to outboard disk arrays. (I don't think anyone else calls them "outboard", but I like the sound of it. See this hard drive? It's outboard, baby!) The arrays are from Sun and come with drivers, but the documentation is confusing: it says it's available for RHEL 5 (aka CentOS 5), but the actual download says it's only for RHEL 4.
As a temporary respite, I'm trying to see if I can get these working using Linux's own multipath daemon, and it's also confusing. The documentation for it is tough to track down, and I just don't understand the different device names: am I meant to put /dev/dm-2 in fstab, or /dev/mpath/mpath2p1? If the latter, why does the name sometimes change to the WWID (/dev/mpath/$(cat /dev/random)) when I restart multipathd? (user_friendly_names is uncommented in the config file.) If the whole point of multipath is failover, why does this sequence:
(where /mnt is where I've got this array mounted, obvs) sometimes work, and sometimes end with "I/O error" being logged, and the filesystem being read-only? Is this the sort of thing that the Sun driver will fix? I can't find anything about this.
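For my own notes, here's roughly how I understand the naming is supposed to work -- a sketch based on my reading of the multipath docs, not something I've verified on these arrays, and the WWID and alias below are made up. The /dev/dm-N names aren't stable, so the idea seems to be to mount via the mapper name, and user_friendly_names plus an explicit alias is what keeps that name from flipping back to the raw WWID:
# /etc/multipath.conf -- sketch only
defaults {
    user_friendly_names yes       # mpathN-style names instead of raw WWIDs
}
multipaths {
    multipath {
        wwid  3600a0b80001234560000abcd1234ef56   # hypothetical; from 'multipath -ll'
        alias dbarray                              # a name that won't change on restart
    }
}
# /etc/fstab -- mount the mapper device, not /dev/dm-2
/dev/mapper/dbarrayp1  /mnt  ext3  defaults  0 0   # partition suffix may vary by version
If that's right, the dm-N and mpathN names are just different views of the same device-mapper node, and the alias is the one worth writing down.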
And I mentioned electrical problems. When we got our servers installed, the Sun guys told us they'd tripped breakers on the PDU and/or breakers in the room's electrical cabinet. Since it had a sign on it saying "100A", I figured we might be running up against power limits -- either in the room as a whole, if my figures were 'way out, or on individual PDUs. Turns out I was probably wrong: I missed the bit on the sign that said 3-phase, which means (deep breath) we probably have 3 x 100A power available (I think).
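For what it's worth, a quick back-of-the-envelope, assuming 120 V per leg -- which may well not be how this panel is actually wired:
echo $((3 * 120 * 100))    # 36000 W, call it roughly 36 kW total, best case
More headroom than I'd feared, if that's right.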
It's more complicated than that, because some of it is in 120V, some of it is in twist-lock 220V 30A circuits, and so on. But I should've checked before emailing the faculty member who, in a year or two, will be going into this room (we're there as guests of the department) and happens to sit on the facilities committee. He had asked how we were doing, so I sent him an email -- nice, polite, and including a bit about how grateful we were for the room and the help of the local sysadmins (all of which is true).
I was under the impression that he was asking for info now, so that he could bring it up for action in a few months when we were out. Instead, two hours later when I'm swearing at multipath, in come the facilities manager and one of the sysadmins I was dealing with, looking to find out just how much power we were using anyhow. I apologized profusely, and they were very cool about it. But when the committee guy asks questions, people jump. I had not anticipated this. Welcome to University Politics 101. I emailed again and explained my mistake.
There are lots of remedial courses I could take. However, today I would most like to take "Electricity and wiring for sysadmins".
And on another note: Ack! My laptop's home partition is 93% full! How the hell did that happen?
And again: How did I not know about apt-file? This is perfect!
(Touch o' the hat to Tears For Fears and Steve Kemp; I'm moving closer every day to switching to Chronicle.)
Ran into a problem today when adding this stanza to cfengine on a Debian Etch machine:
editfiles: ``` { /etc/aliases AppendIfNoSuchLine "root: sysadmin@pims.math.ca" DefineClasses "rebuild_aliases:restart_postfix" }
The cfengine reference file I've got, which sez it's for version 2.2.1, says you can define multiple classes in DefineClasses (or DefineInGroup), as long as they're separated by commas, spaces or dots. (The version in Etch is 2.2.20.)
However, when I ran cfagent, it just hung immediately after performing the edit, and gave this error when I ctrl-c'd it:
cfengine: Received signal 2 (SIGKILL) while doing [pre-lock-state]
Running cfengine with -d2 showed endless repetitions of AddClassToHeap() at this point, so either there's something wrong with my syntax or there's a bug in cfengine. (I'm guessing the former.) Searching for pre-lock-state and cfengine only turned up cases where the clients were syncing with the master; thus this note.
The fix was to just make it one class:
DefineClasses "rebuild_aliases"
Asking to restart Postfix was probably a bit of overkill anyhow...
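For future reference, here's roughly how I'd wire that single class up to an actual action -- an untested sketch, with the cfengine 2 syntax written from memory and the newaliases path a guess for Debian:
control:
   actionsequence = ( editfiles shellcommands )
   AddInstallable = ( rebuild_aliases )      # made-up classes have to be declared here

editfiles:
   { /etc/aliases
   AppendIfNoSuchLine "root: sysadmin@pims.math.ca"
   DefineClasses "rebuild_aliases"
   }

shellcommands:
   rebuild_aliases::
      "/usr/bin/newaliases"                  # rebuild the aliases db; Postfix picks it up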
Version 0.0.3 of Project U-13, a distro for sysadmins, has been released!
The main change is the addition of RackMonkey, which its website describes as "a web-based tool for managing racks of equipment such as web servers, video encoders, routers and storage devices", at the suggestion of Andy Seely. Also, Lynx has been installed, and there's also the skeletal beginnings of a Cfengine config file.
The ISO has been signed with my GPG key. Share and enjoy, and comments on a postcard, please.
My laptop hard drive started giving scary errors a couple days ago on the way to work (I've got a 90-minute commute by public transit [uck] so I fill the time by reading, listening to podcasts, or working on Project U-13). Fortunately, working at a university means that there are two computer stores on campus. I ran out at lunch, picked up a 100GB drive, and had things back to normal by the next morning.
Well, normal modulo one false start with Debian; I decided to try encrypted filesystems just for fun. But then I suspended, came back with a newer kernel, and it could not read the encrypted LVM group anymore. Whoops.
Still lots of free space on this thing, and I'm thinking of installing Ubuntu, FreeBSD and maybe NetBSD just for fun. Of course, I've got to do it all via PXE since this thing doesn't have any CDROM drive, but that just adds to the geek points.
Project U-13 is coming up on 0.0.3, btw; Andy suggested adding Rackmonkey, which looks quite cool. There's no package for it, so I'm having to do some rather ugly scripted installation…but I can stand it for now. And I've got the barest skeleton of a cfengine file in there too. Watch the skies!
Holy crap, it's been a while since I last wrote here. Mainly that's because I've been working on web stuff at work and have felt very little like a sysadmin of late. Thankfully we've got a webmaster hired, and to some extent the work'll be shifted to him in the new year. Of course, that still leaves the redesign of the website and its back end…that's not done 'til it's done.
This week, though, has been slow, and I've been catching up a little on sysadmin work. Part of it was setting up a devel server for the webmaster, and detailing what I was doing in Cfengine as I went along. It was gratifying to get LDAP working (I haven't done that on a Linux machine before; shame on me), and irritating when I realized that I couldn't mount the home directories from the server because I hadn't restarted nscd on the server.
The last two days were spent trying to get encrypted Bacula working between here and $other_university. This was an enormous pain in the ass for two reasons:
The Right Way (tm) of doing it is by using TLS, which is what the kids are calling SSL these days, and I have never fully grokked SSL, or the openssl command. I know that there's encryption going on; I know that there are certificates signed by CAs; I know that there's a lot of negotiating of different options. But start throwing in x509 versus PEM, Diffie-Hellman parameters and the single most cryptic set of error messages I've ever come across, and I just feel thick. I was reduced to looking at tcpdump output of the negotiation to figure out what was going on, and I couldn't; the Bacula FD client complained that the Bacula Director wasn't producing a certificate, and that was all I knew. The otherwise incredibly excellent docs from Bacula were a trifle thin on all of this, and I couldn't find out much about my situation (going the self-CA route).
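For my own future reference, the self-CA dance I was attempting went roughly like this -- generic openssl usage with made-up filenames, not a recipe from the Bacula docs, and it still didn't get me past the certificate errors:
# One-time: make yourself a CA (key plus self-signed cert).
openssl genrsa -out ca.key 2048
openssl req -new -x509 -key ca.key -days 3650 -out ca.crt

# Per daemon (director, FD, SD): key, signing request, cert signed by the CA.
openssl genrsa -out bacula-fd.key 2048
openssl req -new -key bacula-fd.key -out bacula-fd.csr
openssl x509 -req -in bacula-fd.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -days 365 -out bacula-fd.crt
Then, if I'm reading the docs right, each daemon gets pointed at its own key, its own cert and the CA cert in the TLS bits of its config.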
So okay, fuckit, right? That's why God invented OpenSSH. So whee, start tunnelling port 9102 over SSH so the Director can contact the FD at $other_university, and 9103 back so the FD can contact the Storage Daemon. Only it turns out (my bad for not knowing this before) that not only does the client want to contact the SD, so does the director. Thus, my plan to tunnel to the firewall at the other end and tell the client that it could find the Storage Daemon there didn't work, 'cos the director wanted to contact it there too. (I did briefly try allowing the director to contact the tunnel at the other end: so even though the Storage was working on the same machine as the director, for that one job the Director's connection to it was going to the remote end and getting tunnelled back over SSH. But:
And why was I trying to connect to the remote firewall via SSH, rather than the client I'm trying to back up itself? Because that client is a Solaris machine authenticating against LDAP, and that turns out to bork key-based logins over SSH. What a crock.
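For the record, the tunnelling I was aiming for looked roughly like this, run from the director's machine (hostnames hypothetical):
# Local 9102 reaches the remote FD; remote 9103 comes back to our SD.
ssh -f -N \
    -L 9102:client.other-university.example:9102 \
    -R 9103:localhost:9103 \
    backup@firewall.other-university.example
The plan was to tell the Director the FD lives at localhost:9102 and tell the FD the SD lives on the firewall at 9103 -- which, as described above, falls over because the Director wants to reach the SD at that same address too.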
Oh well. I did add three other machines here to Bacula this week, so that's good.
Project U-13 is coming along. I'm pretty close to a 0.0.2 release (woot), which should have the following working:
And by "working" I mean "installed". But I've got a decent setup on my laptop for building and testing it, which means I get up to a couple hours a day to work on it (New Westminster -> UBC == long). Thanks to Andy, he of the amazing speaking skills, for kicking my ass into action.
I'm learning a bit more about Mercurial in the process. After coming from CVS and Subversion, it seems really weird to me that the usual way of branching is "Go ahead, clone another repo! We're Mercurial! We don't care! Repos for everyone!" But if you figure on distributed development — something more Linux-y than a controlled work environment — then it makes sense. Not that I think I'll have lots of people working on this thing, but it makes sense that if someone were to take this for their own ends, they wouldn't want to bother copying all the branches…just the one(s) they're interested in.
Last word to my son:
Q: What does a Camel say, Arlo? A: Purhl!
I've had a bunch of ideas lately. I'm inflicting them on you.
The presentation went well...I didn't get too nervous, or run too long, or start screaming at people (damn Induced Tourette's Syndrome) or anything. There were maybe 30 or so people there, and a bunch of them had questions at the end too. Nice! I was embiggened enough by the whole experience that, when the local LUG announced that they were having a newbie's night and asked for presenters to explain stuff, I volunteered. It's coming up in a few weeks; we'll see what happens.
And then I thought some more. A few days before I'd been listening to the almost-latest episode of LugRadio (nice new design!), where they were talking about GUADEC and PyCon UK. PyCon was especially interesting to hear about; the organizers had thought "Wouldn't it be cool to have a Python conference here in the UK?", so they made one.
So I thought, "It's a shame I'm not going to be able to go to LISA this year. Why don't we have our own conference here in Vancouver?" The more I thought about it, the better the idea seemed. We could have it at UBC in the summer, where I'm pretty sure there are cheap venues to be had. Start out modest — say, a day long the first time around. We could have, say, a training track and a papers track. I'm going to talk about this to some folks and see what they think.
Memo to myself: still on my list of stuff to do is to join pool.ntp.org. Do it, monkey boy!
Another idea I had: a while back I exchanged secondary DNS service, c/o ns2exchange.com. It's working pretty well so far, but I'm not monitoring it, so it's hard for me to be sure that I can get rid of the other DNS servers I've got. (Everydns.net is fine, but they don't do TXT or IPv6 records.) I'm in the process of setting up Nagios to watch my own server, but of course that doesn't tell me what things look like from the outside.
So it hit me: what about Nagios exchange? I'll watch your services if you watch mine. You wouldn't want your business depending on me, of course, but this'd be fine for the slightly anal sysadmin looking to monitor his home machines. :-) The comment link's at the end of the article; let me know if you're interested, or if you think it's a good/bad/weird idea.
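If it happens, the check on my end would be something like this -- a sketch using the stock check_dns plugin, with the zone and host names made up:
# commands.cfg
define command {
    command_name  check_secondary_dns
    command_line  $USER1$/check_dns -H example.org -s $HOSTADDRESS$   # ask their server for my zone
}

# services.cfg (assumes the usual generic-service template)
define service {
    use                  generic-service
    host_name            ns2-partner
    service_description  Secondary DNS answers for my zone
    check_command        check_secondary_dns
}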
The presentation also made me think about how this job has been, in many ways, a lot like the last job: implementing a lot of Things That Really Should Be Done (I hate to say "Best Practices") in a small shop. Time is tight and there's a lot to do, so I've been slowly making my way through the list:
Some of these things have been held up by my trying to remember what I did the last time. And then there's just getting up to speed on bootstrapping a Cfengine installation (say).
So what if all these things were available in one easy package? Not an appliance, since we're sysadmins — but integrated nicely into one machine, easily broken up if needed, and ready to go? Furthermore, what if that tool was a Linux distro, with all its attendant tools and security? What if that tool was easily regenerated, and itself served as a nicely annotated set of files to get the newbie up and running?
Between FAI (because if it's not Debian, you're working too hard) and cfengine, it should be easy to make a machine look like this. Have it work on a live ISO, with installation afterward with saved customizations from when you were playing around with it.
Have it be a godsend for the newbie, a timesaver for the experienced, and a lifeline for those struggling in rapidly expanding shops. Make this the distro I'd want to take to the next job like this.
I'm tentatively calling this Project U-13. We'll see how it goes.
Oh, and over here we've got Project U-14. So, you know, I've got lots of spare time.
Sound of tires, sound of God...
"Electric Version", The New Pornographers.
Thursday morning came far too early. My roommate offered some of his 800mg Ibuprofens, and I accepted. First thing I attended was the presentation "Drowning in the Data Tsunami" by Lee Damon and Evan Marcus. It was interesting, but seemed to be mostly about US data regulations (HIPAA/SOX et al.) and wasn't really relevant to me. I had been expecting more of an outline of, say, how in God's name we're going to preserve information for, say, a hundred years (heroic efforts of the Internet Archive notwithstanding). There was mention of an interesting approach to simply not accumulating cruft as you upgrade storage (because it's easier than sorting through to see what can be discarded; "Why bother weeding out 200MB when the new disk is 800GB?"): a paper by Radia Perlman (she of spanning-tree fame) that proposes an encrypted data storage system (called The Ephemerizer) combined with key escrow that, to expire data, simply deletes the key when the time is up. Still, I moved on before too long.
...Which was good, because I sat in on Alva Couch's presentation on his and Mark Burgess' paper, "Modelling Next-Generation Configuration Management Tools". Some very, very confusing stuff about aspects, promises and closures -- confusing because the bastard didn't preface his talk with "This is what Hugh from Vancouver will need to know to understand this." (May be in the published paper; will check later.) Here's what I could gather:
I will do the right thing and read his paper, and I may update this later; these are just my notes and impressions, and aren't gospel. Couch is an incredibly enthusiastic speaker, and even though I didn't understand a lot of it I ended up excited anyway. :-) He gave another talk later in the week that Ricky went to, about how system administration will have to become more automatic; as a result, we'd all better learn how to think high-level and to be better communicators, because more and more of our stuff will be management -- and not just in the sense of managing computers. I'm going to seek out more of his stuff and see if it'll fit in my head.
After the break was a talk on "QA and the System Administrator", presented by a Google sysadmin. I went because it was Google, and frankly it wasn't that interesting. One thing that did jump out at me was when he described a Windows tool called Eggplant, a QA/validation tool. It has OCR built-in to recognize a menu, no matter where it is on the screen. This astounded me; when you start needing OCR to script things, that's broken. I don't doubt that it's a good tool, and I can think of lots of ways that would come in handy. But come on. I mean, a system that requires that is just so ugly.
I went out to lunch with Jay, a sysadmin from a shop that's just got permission from the boss to BSD-license a unit-testing program they've come up with for OpenBSD firewalls: it uses QEMU instances to fully test a firewall with production IP addresses, making sure that you're blocking and allowing everything you want. It sounds incredibly cool, and he's promised to send me a copy when he gets back. I can't wait to have a look at it.
After that was the meet-the-author session. I got to thank Tom Limoncelli for "Time Management for System Administrators", and got an autograph sticker from him and Strata Rose Chalup, his co-author for Ed 2. Sadly, I didn't get a chance to thank Tobias Oetiker (who I nearly ran into at lunch the day before).
Next up was the talk from Tom Limoncelli and Adam Moskowitz (Adam's looking for a job! Somebody hire him!) about how to get your paper accepted at LISA. Probably basic stuff if you've written a paper before, but I haven't, so it was good to know. Things like how to write a good abstract, what kind of paper is good for LISA, and how you shouldn't say things like "...and if our paper is accepted, we'll start work right away on the solution." Jay asked whether a paper on the pf testing tool would be good, and they both nodded enthusiastically.
Must Google:
Quotes from the talk:
At this point I started getting fairly depressed. Part of it was just being tired, but I kept thinking that not only could I not think of something to write a paper about, I could not think of how I'd get to find something to write about. I wandered over to the next talk feeling rather sad and lost.
The next talk was from Andy Seely on being a sysadmin in US Armed Forces Command and Control. Jessica was there, and we chatted a bit about how this talk conflicted with Tom Limoncelli's Time Management Guru session, and maybe ducking over to see that. Then Andy came over and asked Jessica to snap some pictures, so she ended up staying. I was prepared to give it five minutes before deciding whether or not to leave.
Well, brother, let me tell you: Andy Seely is one of the best goddamned speakers on the planet. He was funny, engaging, and I could no more leave the room than I could get my jaw to undrop. Not only that, his talk was fascinating, and not just because he's a sysadmin for the US Armed Forces while simultaneously having a ponytail, earrings and tattoos. You can read the article in ;login: (FIXME: Add link) that it was based on, but he expanded on it considerably. Let me see what I can recall:
Longer story: Because of the nature of his work, he's got boxes that he has to keep working when he knows next to nothing about what they're meant to do. Case in point: a new Sun box arrives ("and it's literally painted black!"), but the person responsible for it wants to send it back because it doesn't work -- which means that when they click the icon to start the app it's meant to run, it doesn't launch and there's no visible sign that it's running. There's no documentation. And yet he's obligated to support this application. What do you do?
Even tracking down the path to the program launched by the icon is a challenge, but he does, tracks down the nested shell scripts and finally finds the jar that is the app ("Aha! It is Java!"). He finds log files which are verbose but useless. He contacts the company that wrote it, and is told he needs a support contract...which the government, when putting together the contract for the thing, did not think to include. So he calls back an hour later, talks to the help desk and tells them he's lost the number -- "Can you help a brother out?" They do, but they're stumped as well, and say they've never seen anything like this.
Time to pull out truss, which produces a huge amount of output. Somewhere in the middle of all that he notices a failing hard read of a file in /bin: it was trying to read 6 bytes and failing. Turns out the damned thing was trying to keep state in /bin, and failing because the file was zero bytes long. He removed the file, and suddenly the app works.
Andy also talked about trying to get a multiple GB dump file from Florida to Qatar. Physical transport was not an option, because arranging it would take too long. So he tries FTPing the file -- which works until he goes home for the day, at which point the network connection goes down and he loses a day. So he writes a Perl script that divides the file into 300MB chunks, then sends those one at a time. It works!
At this point, someone yells out "What about split?" Andy says, "What?" He hadn't known about it. There was a lot of good-natured laughter. He asked, "Is there an unsplit?" "Cat!" came the response from all over the room. He smacked his forehead and laughed. "This is why I come to LISA," he said. "At my job, I've been there 10 years. People come to me 'cos I'm the smart one. Here, I'm the dumb one. I love that."
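(For the record, the split/cat route is about as simple as it gets; filenames made up:
split -b 300M bigdump.dat chunk_     # produces chunk_aa, chunk_ab, ...
# ...transfer the chunks, re-sending individual ones as needed...
cat chunk_* > bigdump.dat            # reassemble on the far side
md5sum bigdump.dat                   # sanity-check against the original's checksum
Not that I'd have done any better under the circumstances.)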
There are two things I would like to say at this point.
First off, Andy is at least the tenth coolest person on the entire Eastern seaboard. No, he didn't know about cat -- but not only did he reimplement it in Perl rather than give up, he didn't even flinch when being told about it in the middle of giving a talk at LISA. I would probably have self-combusted from embarrassment ("foomp!"), and I would have felt awful. Andy's attitude? "I learned something." That's incredibly strong. (Although he told a story later about being in the elevator with some Google people. They recognized him and said, "Hey, it's the 'man cat' guy!")
Second, when he said, "Here, I'm the dumb one. I love that" I sat up straight and thought, "Holy shit, he's right." Here I am at LISA for the first time ever. I've met people who can help me, and people I can help. I've made a crapload of new friends and have learned more in one week than I would've thought possible. And I'm worried 'cos it might be a few years before I can think about presenting a paper? That's messed up. I tend to set unreasonably high goals for myself and then get depressed when I can't reach them. Andy's statement made me feel a whole lot better.
During Q & A I asked what he did for peer support, since his ability to (say) post to a mailing list asking for help must be pretty restricted. He said that he's started a wiki for internal use and it's getting used...but both the culture and the job function mean that it's slow going. He's also started a conference for fellow sysadmins: 100 or so this year, and he's hoping for more next year.
In conclusion: if you ever get the chance to go see him, do so. And then buy him a beer.
Two sips from the cup of human kindness, and I'm shit-faced
Just laid to waste
If there's a choice between chance and flight, Choose it tonight.
"Choose It", The New Pornographers
Just got back from a whirlwind walk from the Lincoln Memorial to the Washington Monument to the White House. Beautiful, all of it...though a) the White House is small and b) there was something being filmed/videotaped in the courtyard, which made me think of Vancouver.
Training again. AFrisch was good, covering Cfengine quite well; would've liked to see more info about expect. (Apparently there are Perl/Python bindings...I had no idea.) Afternoon course was "Interviewing For System Administrators" by Adam Moskowitz and that was great -- lots of things I didn't know, lots of tips on doing it better next time.
Saw Tom Limoncelli in the hall during a break. Managed to restrain myself. I have the reputation for quiet restraint of a nation to uphold.
Very tired now. Time to go get beer.
Some days are fun days. I got this error on a Debian workstation when starting X:
Xlib: Connection to ":0.0" refused by server
Xlib: Protocol not supported by server
Xrdb: Can't open display ':0'
Turns out that an .xsession file, with one commented-out line, caused that. Remove the line (so now it's empty) and everything works.
Next we got the same user, who's had his home directory moved around on the machine. Machines mounting his home dir via amd (FreeBSD, Debian) work fine, but the SuSE machines running autofs fail miserably with "permission denied" and the ever-popular:
$ cd
-bash: cd: /home/foo: Unknown error 521
Which, if you look up /usr/include/linux/errno.h -- which, you know, is the logical thing to do -- you see this:
/* Defined for the NFSv3 protocol */
#define EBADHANDLE 521 /* Illegal NFS file handle */
Another weird thing with AutoFS: I was running cfengine on a machine, and it hung when querying which RPMs were installed. strace on the rpm command shows it's trying to lock a file and failing; looking at /proc/<number>/fd shows that, yep, it's trying and failing to lock /var/lib/rpm/Packages, the Berkeley DB file that knows all and sees all. So lsof to see who's holding it open, and that hangs; strace shows it's hanging trying to access the home directory of a user whose machine is down right now for reinstall. Try to unmount that directory and it fails. So I bring up the machine with the user's home directory, which allows me to unmount his home directory on the SuSE machine, which allows cfengine to run rpm, which succeeds in locking the Berkeley DB file. Strange; possibly similar to this problem.
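The rough diagnostic path, in case I hit this again (the PID is whatever rpm's happens to be):
strace -f rpm -qa                    # shows rpm blocking on a file lock
ls -l /proc/<pid of rpm>/fd          # the offending fd points at /var/lib/rpm/Packages
lsof /var/lib/rpm/Packages           # hangs too, while it pokes the dead automount
mount | grep autofs                  # find the stale home-directory mount to clear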
On top of everything else, someone asked me if I could be a "network prime". I think they mean "person we can talk to with authority to make network changes", or possibly "network contact". Not entirely sure.
But on the other hand: figured out how to run wpkg, package manager for Windows of the elder gods, as a service using Cygwin's cygrunsrv. The instructions are on the wiki for your viewing enjoyment.
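The short version (service name and wrapper-script path made up; the wiki has the details):
# Install wpkg's wrapper script as a Windows service, then start it.
cygrunsrv --install wpkg --path /usr/local/bin/run-wpkg.sh
cygrunsrv --start wpkg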
More fallout today from Saturday's power outage: two workstations that failed to boot up (BIOS checksum error for one of 'em, which is a new one for me), some NIS-related services that didn't get started properly (not sure what's going on there), and so on. Plus the return of the where-are-those-seven-machines? that didn't get done on Friday because of all of this.
But I did learn some stuff about Cfengine. For example, if you have something like:
my_url = ( http://www.example.com/foo/bar )
then you'd better precede it with:
split = ( "+" )
or some other character that isn't used. The colon is treated as a list separator by default, which means that later on, when you try and do something like:
shell::
linux.need_some_file:
"/bin/wget $(my_url)/baz"
what it'll actually do is this:
/bin/wget http/baz
/bin/wget //www.example.com/foo/bar/baz
'cos it's iterating over the two lists, see?
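Putting the pieces together, the working version looks something like this -- cfengine 2 syntax from memory, untested:
control:
   split  = ( "+" )       # any character that won't appear in the URL
   my_url = ( http://www.example.com/foo/bar )

shellcommands:
   linux.need_some_file::
      "/bin/wget $(my_url)/baz"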
And SuSE's dhcp client, by default (I think), will change /etc/yp.conf without telling you, and then on exit put back the old version (saved conveniently at /etc/yp.conf.sv). It took me a long time to figure out that this was happening, and it pissed me off mightily. /etc/resolv.conf is filled with comments when the dhcp client modifies it -- hell, they even throw in the PID. So why not do that with yp.conf? At least you can turn it off by changing DHCLIENT_MODIFY_NIS_CONF in /etc/sysconfig/networking/dhcp.
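Which, if I'm remembering the format of those sysconfig files right (it's a yes/no variable, I believe), means something like:
# Stop the dhcp client from rewriting /etc/yp.conf behind my back.
DHCLIENT_MODIFY_NIS_CONF="no"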
cfengine is great, it really is. But there are some things that tripped me up. Often you want to set up a daemon to run The Right Way, which involves changing its config file. After that, of course, you want to restart it. What to do? The naive way (ie, the first way I tried) of doing things is:
control::
sequence = ( editfiles shellcommands )
editfiles::
debian:
{ /etc/foo.conf
BeginGroupIfNoLineMatching "bar"
AddLine "bar"
Define restart_foo
EndGroup
}
freebsd:
{ /usr/local/etc/foo.conf
BeginGroupIfNoLineMatching "bar"
AddLine "bar"
Define restart_foo
EndGroup
}
shellcommands::
debian.restart_foo:
"/etc/init.d/foo restart"
freebsd.restart_foo:
"/usr/local/etc/rc.d/foo restart"
However, the correct way of doing this is:
control::
sequence = ( editfiles shellcommands )
AddInstallable = ( restart_foo )
editfiles::
debian:
{ /etc/foo.conf
BeginGroupIfNoLineMatching "bar"
AddLine "bar"
DefineInGroup "restart_foo"
EndGroup
}
freebsd:
{ /usr/local/etc/foo.conf
BeginGroupIfNoLineMatching "bar"
AddLine "bar"
DefineInGroup "restart_foo"
EndGroup
}
shellcommands::
debian.restart_foo:
"/etc/init.d/foo restart"
freebsd.restart_foo:
"/usr/local/etc/rc.d/foo restart"
Without both the enumeration of all your made-up classes in AddInstallable and the enclosing of that class in quotes, cfengine will fail to do what you want -- and will do so quietly and with no clue about why. God, that took me a long time to find.
I love cfengine. If you haven't checked it out yet, do so. You can do really neat stuff like this:
editfiles::
{ /etc/Xprint/C/print/attributes/document
BeginGroupIfNoLineMatching "^\*default-printer-resolution: 300"
CommentLinesMatching "^\*default-printer-resolution: 600"
LocateLineMatching "^# \*default-printer-resolution: 600"
InsertLine "*default-printer-resolution: 300"
DefineInGroup restart_xprint
EndGroup
}
shell::
debian.restart_xprint::
"/etc/init.d/xprint restart"
(Which, by the way, totally fixes the problem of Debian printing 'way huge stuff. Bug number 262958. You should totally look it up.)
Look at that. It's lovely. It's obvious what it's looking for, what it'll do if it can't find it, and what'll happen after that. And it does it automagically. At night. From cron. The way God intended all system administration to be done. However -- and I cannot emphasize how important it is to keep this in mind -- it is absolutely NFG reading the documentation for an hour trying to figure out why the DefineInGroup statement just does not work if:
It's my own fault for printing out v2 docs and not thinking much about it. However, in my own defense it would be nice if cfengine would complain about something it appears not to recognize. Not even with -d2 (which produces output along the lines of CheckingDateForSolarEclipseToday [no]) did it whisper a word about this.