editfiles:
{ /etc/aliases
  AppendIfNoSuchLine "root: sysadmin@pims.math.ca"
  DefineClasses "rebuild_aliases:restart_postfix"
}
Today at $WORK I upgraded Cfengine on a server to 3.5.3. After that, I suddenly started seeing a lot of errors like this:
2014-06-05T13:52:47-0700 error: NetCopy to destination 'cfengine.example.com:/opt/sources/foo.tar.bz2.cfnew' security - failed attempt to exploit a race? (Not copied). (open: Permission denied)
2014-06-05T13:52:47-0700 error: /test/methods/'Copy /opt'/copy_opt_files/'/opt': Was not able to copy '/var/cfengine/files/ALL/opt/sources/foo.tar.bz2' to '/opt/sources/foo.tar.bz2'
Running in verbose mode gave a bit more info, but nothing helpful:
2014-06-05T13:44:12-0700 verbose: Destination file '/opt/sources/foo.tar.bz2' already exists
2014-06-05T13:44:12-0700 info: Cannot open file for hashing '/opt/sources/foo.tar.bz2'. (fopen: Permission denied)
2014-06-05T13:44:12-0700 verbose: Image file '/opt/sources/foo.tar.bz2' has a wrong digest/checksum, should be copy of '/var/cfengine/files/ALL/opt/sources/foo.tar.bz2'
2014-06-05T13:44:12-0700 error: NetCopy to destination 'cfengine.example.com:/opt/sources/foo.tar.bz2.cfnew' security - failed attempt to exploit a race? (Not copied). (open: Permission denied)
2014-06-05T13:44:12-0700 error: /test/methods/'Copy /opt'/copy_opt_files/'/opt': Was not able to copy '/var/cfengine/files/ALL/opt/sources/foo.tar.bz2' to '/opt/sources/foo.tar.bz2'
Wasn't SELinux, wasn't secret attributes...turned out that the new(er) version of Cf3 didn't like the fact that /opt was a symlink to /usr/opt. I'd set that up long ago and it was no longer needed, so I was free to just recreate it:
rm /opt          # remove the old symlink first
rm -rf /usr/opt
mkdir /opt
cf-agent -KI     # Which populates it as needed.
I've just upgraded to the latest version of Vagrant, which includes a plugin that lets you use Cfengine as a provisioner. It doesn't seem to be documented right now, so here's my first stab at laying out the options. Apologies for the rough notes.
am_policy_hub: From the source: "Policy hubs need to do additional things before they're ready to accept agents. Force that run now..." Runs "cf-agent -KI -f /var/cfengine/masterfiles/failsafe.cf [classes]", then "cf-agent -KI [classes] [extra_agent_args]".
extra_agent_args: Just what it says.
classes: Define extra classes; appends "-D [class]" args to cf-agent. Multiple classes must be separated by spaces. (Or is this a Ruby array?)
deb_repo_file, deb_repo_line: Specify a deb repo line, to be placed in deb_repo_file, before running "apt-get install [package_name]". deb_repo_file will be clobbered.
files_path: Copy the local path to /var/cfengine using the install_files method defined in cfengine/provisioner.rb. Example: you do a git checkout of your repo and want it copied to the machine.
force_bootstrap: Not sure; checked by cfengine/cap/linux/cfengine_needs_bootstrap.rb, but does not appear to do anything. FIXME: See where this module is called from.
install: "force" seems to be the only possible value, but it's not clear what it does. Doesn't seem to be mentioned anywhere else but in provisioner.rb.
mode: Possible values are :bootstrap (the default) and :single_run.
policy_server_address: Just what it says.
repo_gpg_key_url: Just what it says.
run_file: Single run if set? Uploads to the VM and runs "cf-agent -KI -f [file] [classes] [extra_agent_args]".
upload_path: Where to copy run_file. Default is /tmp/vagrant-cfengine-file.
yum_repo_file: Default is /etc/yum.repos.d/cfengine-community.repo. Probably clobbered.
yum_repo_url: Default is http://cfengine.com/pub/yum/.
package_name: For use by yum or apt. Default is cfengine-community.
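Nothing Cfengine-specific about kicking the provisioner off, by the way -- it's the usual Vagrant workflow:
# Provisioning runs as part of bringing the box up:
vagrant up
# ...or can be re-run against a VM that's already running:
vagrant provision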
I wanted to test a new version of Wordpress on $WORK's website, and ran into an interesting set of problems. I figured it would be worth setting them down here.
First: I set up a Vagrant box, got it to forward port 8000 to port 80 on the VM, told Cfengine to set it up as a webserver, then copied over the files and database. Turned out I'd forgotten a few things, like installation of mod_proxy_html. I really need to make that into an RPM, especially since it should be pretty trivial, but for now I settled for documenting and scripting my instructions. There were a few other things like that; it's always a good exercise to do this and see what you've left out. Eventually I got it down to a Makefile that I could run on the box itself:
OLD=wordpress-3.4.2
NEW=wordpress-3.5.2
CF=/var/cfengine/bin/cf-agent -f /vagrant/cfengine/masterfiles/promises.cf -KI
go: /var/cfengine/bin /etc/firstrun /www/www.example.com-wordpress /usr/bin/mysql /var/lib/mysql/example_wordpress /etc/httpd/modules/mod_proxy_html.so
sudo $(CF)
/var/cfengine/bin:
sudo rpm -ivh cfengine-community-3.3.0-1.x86_64.rpm
/etc/firstrun:
sudo $(CF) -Dinstall_now_please
sudo touch /etc/firstrun
/www/www.example.com-wordpress:
sudo tar -C /www -xvzf /vagrant/wordpress.tgz
/var/lib/mysql/example_wordpress:
mysql -u root -e"create database example_wordpress; grant all on example_wordpress.* to 'wordpress'@'localhost' identified by 's33kr1t'; flush privileges; use example_wordpress; source /vagrant/example-wordpresswp.sql;"
/usr/bin/mysql:
sudo /var/cfengine/bin/cf-agent -f /vagrant/cfengine/masterfiles/promises.cf -KI -Dinstall_now_please
/etc/httpd/modules/mod_proxy_html.so:
tar -C /tmp -xvjf /vagrant/mod_proxy_html.tar.bz2
sudo bash -c 'cd /tmp/mod_proxy_html ; /usr/sbin/apxs -I /usr/include/libxml2 -I . -c -i mod_proxy_html.c'
disable_plugins:
mysql -B -u root example_wordpress -e "select option_value from wp_options where option_name='active_plugins';" | sed -e's/^/update wp_options set option_value=QQQ/;s/$$/QQQ where option_name="active_plugins";/;' | tail -1 | sed -e"s/QQQ/'/g" > /tmp/restore
mysql -u root example_wordpress -e'update wp_options set option_value="a:0:{}" where option_name="active_plugins";'
enable_plugins:
mysql -u root example_wordpress < /tmp/restore
unpack_wp:
sudo tar -C /www/www.example.com-wordpress -xvzf /vagrant/wordpress-3.5.2.tar.gz
sudo mv /www/www.example.com-wordpress/wordpress $(NEW)
-sudo rm -r $(OLD)/wp-includes
-sudo rm -r $(OLD)/wp-admin
-sudo mv $(NEW)/wp-includes $(OLD)
-sudo mv $(NEW)/wp-admin $(OLD)
sudo find $(NEW) -maxdepth 1 -type f -exec cp -v {} $(OLD) \;
force_upgrade:
wget "http://localhost/wp-admin/upgrade.php?step=1&backto=%2Fwp-admin%2F"
upgrade: disable_plugins unpack_wp force_upgrade enable_plugins
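Driving it is then just a matter of invoking the targets above, something like:
# On the Vagrant box, from the directory holding the Makefile:
make go        # base install: Cfengine run, Wordpress files, database, mod_proxy_html
make upgrade   # disable plugins, unpack the new Wordpress, force the upgrade, re-enable plugins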
However, when I browsed to localhost:8000 it tried to redirect me to the actual URL for the work website (http://work.example.com), rather than simply showing me the page and serving it all locally. Turns out this is a known problem, and the solution is to use one of Wordpress' many ways to set the site URL. The original poster used the RELOCATE method, but I had better luck setting the URL manually:
define('WP_HOME','http://localhost:8000');
define('WP_SITEURL','http://localhost:8000');
I can do this manually, but it's better to get Cfengine to do this. First, we have an agent bundle to edit the file:
bundle agent configure_wp_for_vagrant_testing {
files:
vagrantup_com::
"/var/www/wordpress/wp-config.php"
edit_line => vagrant_testing_wpconfig;
}
We specify the lines to add. Rather than install the lines in two passes, which is non-convergent, we add just one line that happens to have an embedded newline:
bundle edit_line vagrant_testing_wpconfig {
insert_lines:
"define('WP_HOME','http://localhost:8000');
define('WP_SITEURL','http://localhost:8000');" location => wp_config_thatsallfolks;
}
(I found that on the Cfengine mailing list, but I've lost the link.) And finally, we specify the location. This depends on having the default comment in wp-config that indicates the end of user-settable vars, but it seems a safe bet:
body location wp_config_thatsallfolks {
select_line_matching => "^/\* That's all, stop editing. Happy blogging. \*\/.*$";
before_after => "before";
}
Second, the production webserver actually hosts a bunch of different sites, and we have separate config files for each of them. Since I was getting Cf3 to configure the VM just as if it was production, the VM got all these config files too. Turned out that browsing to http://localhost:8000 gave me what Apache thought was the default site -- which is the VirtualHost config listed first, which in our case was not our main site. I got around that by renaming our main site's config file to 000-www.example.com.conf (a trick I stole from Debian). Now I could see our main website at http://localhost:8000.
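The rename itself is nothing fancy (the original filename here is a guess at my layout):
# Sort our main site's config to the top of conf.d so Apache treats it as the default vhost:
sudo mv /etc/httpd/conf.d/www.example.com.conf /etc/httpd/conf.d/000-www.example.com.conf
sudo service httpd reload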
Third, testing: normally I rely on Nagios to do this sort of thing, but it's kind of hard to point it at a VM that might be only around for a few minutes. I could add tests to Cfengine, and that's probably a good idea; however, right now I wanted to try out serverspec, a Ruby-based test suite that lets you verify server attributes.
The serverspec docs say they can run tests on a Vagrant machine, and that all you have to do is tell it so when running "serverspec-init". However, I had problems with this; it asked me for a VM name, and I didn't have one...there was only one machine set up, and it didn't seem to like "default". I didn't spend a lot of time on this, but instead went to running the serverspec tests on the Vagrant box itself. That brought its own problems, since installing gems in CentOS 5 via the default Ruby (1.8.5) causes buffer overflows. A better person would build a newer RPM, rather than complain about non-standard repos. However, this Gist does the trick rather nicely (though I also removed the stock Ruby and didn't bother installing Chef).
Okay, so: running "serverspec-init" on the Vagrant box created a nice set of default tests for a website. I modified the test for the website config file to look for the right config file and server name:
describe file('/etc/httpd/conf.d/000-www.example.com.conf') do
it { should be_file }
it { should contain "ServerName www.example.com" }
end
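Running them is just the stock Rakefile that serverspec-init generates:
# On the Vagrant box, in the directory where serverspec-init was run:
rake spec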
First, a short function to attach a file when editing a Markdown page in ikiwiki:
(defun x-hugh-wiki-attach-file-to-wiki-page (filename)
"This is my way of doing things."
(interactive "fAttach file: ")
;; doubled slash, but this makes it clear
(let* ((page-name (file-name-nondirectory (file-name-sans-extension (buffer-file-name))))
(local-attachments-dir (format "%s/attachments/%s" (file-name-directory (buffer-file-name)) page-name))
(attachment-file (file-name-nondirectory filename))
(attachment-url (format "https://wiki.example.org/wiki/attachments/%s/%s" page-name attachment-file)))
(make-directory local-attachments-dir 1)
(copy-file filename local-attachments-dir 1)
(insert-string (format "[[%s|%s]]" attachment-file attachment-url))))
Note the way I'm organizing things: there's a directory in the wiki/tree called "attachments"; a subdirectory is created for each page; and the file is dumped there.
Second, a stupid copy-file-template function for Cfengine:
(defun x-hugh-cf3-insert-file-template (file)
"Insert a copy-file template."
(interactive "sFile to copy: ")
(newline-and-indent)
(insert (format "\"%s\"" file))
(newline-and-indent)
(insert (format " comment => \"Copy %s into place.\"," file))
(newline-and-indent)
(insert " perms => mog(\"0755\", \"root\", \"wheel\"),")
(newline-and-indent)
(insert (format " copy_from => secure_cp(\"$(g.masterfiles)/centos/5%s\", \"$(g.masterserver)\");" file)))
Both are mostly learning exercises and excuses to post.
First day back at $WORK after the winter break yesterday, and some...interesting...things. Like finding out about the service that didn't come back after a power outage three weeks ago. Fuck. Add the check to Nagios, bring it up; when the light turns green, the trap is clean.
Or when I got a page about a service that I recognized as having, somehow, to do with a webapp we monitor, but no real recollection of what it does or why it's important. Go talk to my boss, find out he's restarted it and it'll be up in a minute, get the 25-word version of what it does, add him to the contact list for that service and add the info to documentation.
I start to think about how to include a link to documentation in Nagios alerts, and a quick search turns up "Default monitoring alerts are awful", a blog post by Jeff Goldschrafe about just this. His approach looks damned cool, and I'm hoping he'll share how he does this. Inna meantime, there are the Nagios config options "notes", "notes_url" and "action_url", which I didn't know about. I'll start adding stuff to the Nagios config. (Which really makes me wish I had a way of generating Nagios config...sigh. Maybe NConf?)
But also on Jeff's blog I found a post about Kaboli, which lets you interact with Nagios/Icinga through email. That's cool. Repo here.
Planning. I want to do something better with planning. I've got RT to catch problems as they emerge, and track them to completion. Combined with org-mode, it's pretty good at giving me a handy reference for what I'm working on (RT #666) and having the whole history available. What it's not good at is big-picture planning...everything is just a big list of stuff to do, not sorted by priority or labelled by project, and it's a big intimidating mess. I heard about Kanban when I was at LISA this year, and I want to give it a try...not sure if it's exactly right, but it seems close.
And then I came across Behaviour-driven infrastructure through Cucumber, a blog post from Lindsay Holmwood. Which is damn cool, and about which I'll write more another time. Which led to the Github repo for a cucumber/nagios plugin, and reading more about Cucumber, and behaviour-driven development versus test-driven development (hint: they're almost exactly the same thing).
My god, it's full of stars.
I always seem to forget how to do this, but it's actually pretty simple. Assume you want to test a new bundle called "test", and it's in a file called "test.cf". First, make sure your file has a control stanza like this:
body common control {
inputs => { "/var/cfengine/inputs/cfengine_stdlib.cf" } ;
bundlesequence => { "test" } ;
}
Note:

inputs must not include the file "test.cf" itself -- otherwise, you'll get the error "Redefinition of body "control" for "common" is a broken promise, near token '{'". Think of "inputs" as really being named "additional inputs". I'm including the cfengine_stdlib.cf file; you should too.

bundlesequence is set to your bundle (which I'm leaving out of this entry for simplicity).
Second, invoke it like so:
sudo /var/cfengine/bin/cf-agent -KI -f /path/to/test.cf
Note:

-K means "run no matter how soon after the last time it was run."
-I shows a list of promises repaired.
-f gives the path to the file you're testing.

When sub was released by 37signals, I liked it a lot. Over the last couple of months I've been putting together a sub for Cfengine. Now it's up on Github, and of course my own repo. It's not pretty, but there are some pretty handy things in there. Enjoy!
Back in January, yo, I wrote about trying to figure out how to use Cfengine3 to do SELinux tasks; one of those was pushing out SELinux modules. These are encapsulated bits of policy, usually generated by piping SELinux logs to the audit2allow command. audit2allow usually makes two files: a source file that's human-readable, and a sorta-compiled version that's actually loaded by semodule.
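For reference, the generation step usually looks something like this (the grep pattern and module name here are just examples):
# Turn recent AVC denials into a module; -M writes both postfixpipe.te (source) and postfixpipe.pp (compiled):
grep postfix /var/log/audit/audit.log | audit2allow -M postfixpipe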
So how do you deploy this sort of thing on multiple machines? One option would be to copy around the compiled module...but while that's technically possible, the SELinux developers don't guarantee it'll work (link lost, sorry). The better way is to copy around the source file, compile it, and then load it.
SANSNOC used this approach in puppet. I contacted them to ask if it was okay for me to copy their approach/translate their code to Cf3, and they said go for it. Here's my implementation:
bundle agent add_selinux_module(module) {
# This whole approach copied/ported from the SANS Institute's puppet modules:
# https://github.com/sansnoc/puppet
files:
centos::
"/etc/selinux/local/."
comment => "Create local SELinux directory for modules, etc.",
create => "true",
perms => mog("700", "root", "root");
"/etc/selinux/local/$(module).te"
comment => "Copy over module source.",
copy_from => secure_cp("$(g.masterfiles)/centos/5/etc/selinux/local/$(module).te", "$(g.masterserver)"),
perms => mog("440", "root", "root"),
classes => if_repaired("rebuild_$(module)");
"/etc/selinux/local/setup.cf3_template"
comment => "Copy over module source.",
copy_from => secure_cp("$(g.masterfiles)/centos/5/etc/selinux/local/setup.cf3_template", "$(g.masterserver)"),
perms => mog("750", "root", "root"),
classes => if_repaired("rebuild_$(module)");
"/etc/selinux/local/$(module)-setup.sh"
comment => "Create setup script. FIXME: This was easily done in one step in Puppet, and may be stupid for Cf3.",
create => "true",
edit_line => expand_template("/etc/selinux/local/setup.cf3_template"),
perms => mog("750", "root", "root"),
edit_defaults => empty,
classes => if_repaired("rebuild_$(module)");
commands:
centos::
"/etc/selinux/local/$(module)-setup.sh"
comment => "Actually rebuild module.",
ifvarclass => canonify("rebuild_$(module)");
}
Here's how I invoke it as part of setting up a mail server:
bundle agent mail_server {
vars:
centos::
"selinux_mailserver_modules" slist => { "postfixpipe",
"dovecotdeliver" };
methods:
centos.selinux_on::
"Add mail server SELinux modules" usebundle => add_selinux_module("$(selinux_mailserver_modules)");
}
(Yes, that really is all I do as part of setting up a mail server. Why do you ask? :-) )
So in the add_selinux_module
bundle, a directory is created for
local modules. The module source code, named after the module itself,
is copied over, and a setup script created from a Cf3 template. The
setup template looks like this:
#!/bin/sh
# This file is configured by cfengine. Any local changes will be overwritten!
#
# Note that with template files, the variable needs to be referenced
# like so:
#
# $(bundle_name.variable_name)
# Where to store selinux related files
SOURCE=/etc/selinux/local
BUILD=/etc/selinux/local
/usr/bin/checkmodule -M -m -o ${BUILD}/$(add_selinux_module.module).mod ${SOURCE}/$(add_selinux_module.module).te
/usr/bin/semodule_package -o ${BUILD}/$(add_selinux_module.module).pp -m ${BUILD}/$(add_selinux_module.module).mod
/usr/sbin/semodule -i ${BUILD}/$(add_selinux_module.module).pp
/bin/rm ${BUILD}/$(add_selinux_module.module).mod ${BUILD}/$(add_selinux_module.module).pp
Note the two kinds of disambiguating brackets here: {curly} to indicate shell variables, and (round) to indicate Cf3 variables.
As noted in the bundle comment, the template might be overkill; I think it would be easy enough to have the rebuild script just take the name of the module as an argument. But it was a good excuse to get familiar with Cf3 templates.
I've been using this bundle a lot in the last few days as I prep a new mail server, which will be running under SELinux, and it works well. Actually creating the module source file is something I'll put in another post. Also, at some point I should probably put this up on Github FWIW. (SANS had their stuff in the public domain, so I'll probably do BSD or some such... in the meantime, please use this if it's helpful to you.)
UPDATE: It's available on Github and my own server; released under the MIT license. Share and enjoy!
Nagios and Cf3 each have their strengths:
Nagios plugins, frankly, are hard to duplicate in Cfengine. Check out this Cf3 implementation of a web server check:
bundle agent check_tcp_response {
vars:
"read_web_srv_response" string => readtcp("php.net", "80", "GET /manual/en/index.php HTTP/1.1$(const.r)$(const.n)Host: php.net$(const.r)$(const.n)$(const.r)$(const.n)", 60);
classes:
"expectedResponse" expression => regcmp(".*200 OK.*\n.*", "$(read_web_srv_response)");
reports:
!expectedResponse::
"Something is wrong with php.net - see for yourself: $(read_web_srv_response)";
}
That simply does not compare with this Nagios stanza:
define service{
use local-service ; Name of service template to use
hostgroup_name http-servers
service_description HTTP
check_command check_http
}
define command{
command_name check_http
command_line $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
}
My idea, which I totally stole from this article, was to invoke Cfengine from Nagios when necessary, and let Cf3 restart the service. Example: I've got this one service that monitors a disk array for faults. It's flaky, and needs to be restarted when it stops responding. I've already got a check for the service in Nagios, so I added an event handler:
define service{
use local-service ; Name of service template to use
host_name diskarray-mon
service_description diskarray-mon website
check_command check_http!-H diskmon.example.com -S -u /login.html
event_handler invoke_cfrunagent
}
define command{
command_name invoke_cfrunagent
command_line $USER2$/invoke_cfrunagent.sh -n "$SERVICEDESC$" -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -a $HOSTADDRESS$
}
Leaving out some getopt()
stuff, invoke_cfrunagent.sh looks like this:
# Convert "diskarray-mon website to disarray-mon_website":
SVC=${SVC/ /_}
STATE="nagios_$STATE"
TYPE="nagios_$TYPE"
# Debugging
echo "About to run sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE" | /usr/bin/logger
# We allow this in sudoers:
sudo /var/cfengine/bin/cf-runagent -D $SVC -D $STATE -D $TYPE
cf-runagent is a request, not an order, to the running cf-serverd process to fulfill already-configured promises; it's like saying "If you don't mind, could you please run now?"
Finally, this was to be detected in Cf3 like so:
methods:
diskarray-mon_website.nagios_CRITICAL.nagios_HARD::
"Restart the diskarray monitoring service" usebundle => restart_diskarray_monitor();
(This stanza is in a bundle that I know is called on the disk array monitor.)
Here's what works:
What doesn't work:
running cf-runagent, either as root or as nagios. It seems to stop after analyzing classes and such, and not actually do anything. I'm probably misunderstanding how cf-runagent is meant to work.
Nagios will only run an event handler when things change -- not all the time until things get better. That means that if the first attempt by Cf3 to restart doesn't work, for whatever reason, it won't get run again.
What might work better is using this Cf3 wrapper for Nagios plugins (which I think is the same approach, or possibly code, discussed in this mailing list post).
Anyhow...This is a sort of half-assed attempt in a morning to get something working. Not there yet.
Just had a dream where I'd been called into Sun, just before Oracle's takeover, to figure out why they were spending so much money on eyeglasses for employees. "We think it's part of their benefits, but our accounting department doesn't have a separate line item for it," someone explained. My eyebrows lifted in disbelief. "Well, then, it's damned lucky for you I've got Cfengine."
Over the last two days, in a frenzy of activity, I got some awesome done at work: using git and Vagrant, I finally got Cfengine to install packages in Ubuntu without asking me any goram questions. There were two bits involved:
Telling Apt to use old config files (sketched below, after the next point). This prevents it from asking for confirmation when it comes across your already-installed-with-Cfengine config file. Cfengine doesn't do things in a particular order, and in any case I do package installation once an hour -- so I might well have an NTP config file show up before the NTP package itself.
Preseeding, which I've known about for a while but have not had a chance to get right. My summer student came up with a script to do this, and I hope to be able to release it.
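The Apt side boils down to the usual dpkg options plus a non-interactive frontend, and the preseeding side is debconf-set-selections; a rough sketch (the package and the preseed line are just examples, not what we actually ship):
# Keep existing config files and never prompt:
export DEBIAN_FRONTEND=noninteractive
apt-get -y \
  -o Dpkg::Options::="--force-confdef" \
  -o Dpkg::Options::="--force-confold" \
  install ntp
# Preseed answers before the package ever gets a chance to ask:
echo "postfix postfix/main_mailer_type select No configuration" | debconf-set-selections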
Now: Fully automated package installation FTMFW.
And did you know that Emacs can check your laptop battery status? I didn't.
More conversations with Mark Burgess via Twitter (a continuation from here). I should note that this was all a week or so ago now; I've been meaning to put this up here.
markburgess_osl: @saintaardvark Doc is "what" code is "how". I believe the lasting intention comes before a specific implementation. #devops #sysadmin
saintaardvark: .@markburgess_osl Hm. So let's see if I've got this right: the programmer in me notices lots of overlap in my Cf3 config...
saintaardvark: .@markburgess_osl ...and wants to consolidate. Cf3 syntax makes this a hairy proposition at best. But this is not really a problem...
saintaardvark: .@markburgess_osl ...because I should be thinking about this as documentation (which can be long) of the desired system state...
saintaardvark: .@markburgess_osl ...rather than code (where the drive is for efficiency and lack of duplication). Have I got that right? #sysadmin
markburgess_osl: @saintaardvark Documentation => focus on end state (like GPS), Code => focus on start state + directions. The journey is irrelevant.
markburgess_osl: @saintaardvark Docs also improved by seeing themes and patterns. That is still WHAT not HOW. So no contradiction.
So putting this in practical (can't resist the temptation to say "less Yoda-like") terms: what I think he's saying is, don't worry about code duplication or getting clever; you're documenting desired system state, and it's okay to be verbose.
Using the example I started with, it's okay to have NTP settings in multiple places (because SuSE needs two files, Solaris 1, etc). The coder in me wants to clean those up because it's all NTP, but the documentationist ("writers", I think they're called) relaxes and says "Can't have too much documentation." Which is fair.
But then I worry about having Multiple Sources of Truth(tm). The advantage of the first setup is that when I change the NTP server, it's ALL in one place; in the second setup, I have to remember: did I change it for SuSE? Solaris? CentOS? I've learned the hard way to be wary of such setups. I nearly always miss something; that's why I'm aggressive about consolidating.
I'm still mulling all this over.
Mark Burgess was kind enough to respond to my earlier post about Cfengine syntax:
markburgess_osl: @saintaardvark (soothing) Syntax is definitely an acquired taste (re perl ;)). The list-ref prob can go away soon. Think doc not code 4 cf3
And then, via tweetsification, we were all like:
saintaardvark: .@markburgess_osl Heh, thanks for the reply -- I was going to ask you about this. Fair pt re: syntax being an acquired taste...[1/2]
saintaardvark: .@markburgess_osl ...but any chance the mess of brackets will be reduced? [2/2]
markburgess_osl: @saintaardvark trade one set of () for -> Don't see much point in that. $() has long precedence in sh / make etc. It delimits clearly in txt
saintaardvark: .@markburgess_osl Fair enough, but I'm also thinking of eg "$(services.cfg_file[$(service)])": dollar bracket scope dot square dollar bracket
markburgess_osl: @saintaardvark I agree it's clumsy, but it's also an edge case. You rarely write this if you make good use of patterns. Perl also ugly here.
But this layout came from their own dang documentation! I feel like I'm stuck here:
[old entry recovered from backup!]
That last point: what I mean is that the whole appeal of that layout (pattern/whatever) was that you could just say fix_service('foo'), and The Right Thing(tm) would happen. Now I have to rethink this; it seems to mean either having lots of bundles like "fix_ntp", "fix_autofs", etc -- with lots of sections like:
vars:
SuSE::
"files" slist => {"this", "that"};
Centos::
"files" string => "just_this";
...or else having separate "fix_service" bundles for each class. (Forgive me, I'm thinking about all this w/o having a Cf3 instance to play with in front of me.)
I'm trying not to sound whiny here; I'm grateful for Cf3, for the documentation (which is pretty extensive), and that Mark took the time to respond. But this is frustrating.
Cfengine 3 has a lot of things going for it. But its syntax is not one of them.
Consider this situation: you have CentOS machines, SuSE machines and Solaris machines. All of them should run, say, SSH, NTP and Apache -- why not? The files are slightly different between them, and so is the method of starting/stopping/enabling services, but mostly we're doing the same thing.
I've got a bundle in Cfengine that looks like this:
bundle common services {
vars:
redhat|centos::
"cfg_file_prefix" string => "centos/5";
"cfg_file[httpd]" string => "/etc/httpd/conf/httpd.conf";
"daemon[httpd]" string => "httpd";
"start[httpd]" string => "/sbin/service httpd start";
"enable[httpd]" string => "/sbin/chkconfig httpd on";
"cfg_file[ssh]" string => "/etc/ssh/sshd_config";
"daemon[ssh]" string => "sshd";
"start[ssh]" string => "/sbin/service sshd restart";
"enable[ssh]" string => "/sbin/chkconfig sshd on";
...and so on. We're basically setting up four hashes -- daemon, start, enable and cfg_file -- and populating them with the appropriate entries for Red Hat/CentOS ssh and Apache configs; you can imagine slightly different entries for Solaris and SuSE. The cfg_file_prefix allows me to put CentOS' config files in a separate directory from other OSes.
Then there's this bundle:
bundle agent fix_service(service) {
files:
"$(services.cfg_file[$(service)])"
copy_from => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
classes => if_repaired("$(service)_restart"),
comment => "Copy a stock configuration file template from repository";
processes:
"$(services.daemon[$(service)])"
comment => "Check that the server process is running, and start if necessary",
restart_class => canonify("$(service)_restart"),
ifvarclass => canonify("$(services.daemon[$(service)])");
commands:
"$(services.start[$(service)])"
comment => "Method for starting this service",
ifvarclass => canonify("$(service)_restart");
"$(services.enable[$(service)])"
comment => "Method for enabling this service",
ifvarclass => canonify("$(service)_restart");
}
This bundle takes a service name as an argument, and assigns it to the local variable "service". It copies the OS-and-service-appropriate config file into place if it needs to, and enables/starts the service if it needs to. How does it know if it needs to? By setting the class "$(service)_restart" if the service isn't running, or if the config file had to be copied.
So far, so good. Well, except for the mess of brackets. All those hashes are in the services bundle, so you need to be explicit about the scope. (There are provisions for global variables, but I've kept my use of 'em to a minimum.) And so what in Perl would be, say:
$services->{start}{$service}
becomes
"$(services.start[$(service)])"
Square brackets for the hash, round brackets for the string (and to indicate that you're using a variable -- IOW, it's "$(variable)", not "$variable" like you're used to), and dots to indicate scope ("services.start" == the start variable in the services bundle).
It's...well, it's an ugly mess o' brackets. But I can deal with that. And this arrangement/pattern, which came from the Cfengine documentation itself, has been pretty helpful to me for dealing with single config file services.
But what about the case where a service has more than one config file? Like autofs: you gotta copy around a map file but in SuSE you also need /etc/sysconfig/autofs to set the LDAP variables.
Again, in Perl this would be an anonymous array on top of a hash -- something like:
$services->cfg_file{"autofs"}[0] = "/etc/auto.master
$services->cfg_file{"autofs"}[1] = "/etc/sysconfig/aufofs"
and you'd walk it like so:
foreach my $i (@{ $services->{cfg_file}{autofs} }) { # something with $i }
or even:
for (@{ $services->{cfg_file}{autofs} }) { # something with $_ }
(I think...I'm embarrassed sometimes at how rusty my Perl is.)
In Cfengine, you pile an anonymous array on top of a hash like so:
"cfg_file[autofs]" slist => { "/etc/auto.master", "/etc/sysconfig/autofs" };
An slist is a list of strings. All right, fine; different layout, same idea, stick it in the services bundle and away we go. But: remote scalars can be referenced; remote lists cannot without gymnastics. From the docs:
During list expansion, only local lists can be expanded, thus global list references have to be mapped into a local context if you want to use them for iteration. Instead of doing this in some arbitrary way, with possibility of name collisions, cfengine asks you to make this explicit. There are two possible approaches.
The first of those two approaches is, I think, passing the list as a parameter, whereupon it just works? maybe? (It's a not-so-minor nitpick that there are lots of examples in the Cf3 handbook that are not explained and don't make much sense. They apparently work, but how is not at all clear, or discernible.) I think it's meant to be like Perl's let's-flatten-everything-into-a-list approach to passing variables.
The second is to just go ahead and redeclare the remote slist (array) as a local one that's set to the remote value. Again, from the docs:
bundle common va {
vars:
"tmpdirs" slist => { "/tmp", "/var/tmp", "/usr/tmp" };
}
bundle agent hardening {
classes:
"ok" expression => "any";
vars:
"other" slist => { "/tmp", "/var/tmp" };
"x" slist => { @(va.tmpdirs) };
reports:
ok::
"Do $(x)";
"Other: $(other)";
}
which makes this prelude to all of that handwaving even more irritating:
Instead of doing this in some arbitrary way, with possibility of name collisions...
...
...I mean...
...I mean, what is the point of requiring explicit paths to variables in other scopes if you're just going to insert random speedbumps to assuage needless worries about name collisions? What the hell is with this let's-redeclare-it-AGAIN approach?
The rage, it fills me.
In Cfengine3, I had been setting up printers for people using lpadmin commands. Among other things, it used a particular PPD file for the local HP printer. It turns out that in Oneiric, those files are no longer present, or even available; judging by what I found on my laptop, the PPD file is (I think) generated automagically by /usr/share/cups/ppd-updaters/hplip-cups.
It's possible that I could figure this out for my new workstation. But right now, I don't think I can be bothered. I'm going to just set this up by hand, and hope that either I'll get a print server or I'll figure it out.
No native support in Cf3 for SELinux.
I've added a bundle that enables/disables booleans and have used it on one machine; this is pretty trivial.
File contexts and restorecon appear to be mainly controlled by plain old files in /etc/selinux/targeted/contexts/files, but there are stern warnings about letting libselinux manage them. However, this thread on the SELinux mailing list seems to say it's okay to copy them around.
Puppet appears to be further ahead in this. This guy compiles policy files locally using Puppet; this other dude has a couple of posts on this. There are yet other other folks using Puppet to do this, and it would be worth checking them out as a source of ideas.
I need to improve my collection of collective pronouns.
I tripped across this error today with Cfengine 3:
cf3:./inputs/promises.cf:1,22: Redefinition of body "control" for "common" is a broken promise, near token '{'
The weird thing was this was a stripped down promises.cf, and I could not figure out why it was complaining about redefinitions. I finally found the error:
body common control {
bundlesequence => { "test" };
inputs => { "promises.cf", "cfengine_stdlib.cf" };
}
Yep, including the promises.cf file itself in the inputs section borked everything; removing it fixed things right away.
I've got a new workstation at $WORK. (Well, where else would it be?) It's pretty sweet: i7 quad-core processor, clock speed > 3GHz (honestly, I barely keep track anymore), and 8GB of RAM. 8GB! Insane.
When I arrived in 2008, I used a -- not cast-off, but unused P4 with 4 GB of RAM. I didn't want to make a big fuss about it; I saved the fuss, instead, for a nice business laptop from Dell that worked well with Linux. Since 90% of my work is Firefox + Emacs + XTerms, and my WM of choice at the moment is Awesome, speed was not a problem and the memory was fine.
Lately, though, I've discovered Vagrant. It looks pretty sweet, but my current machine is sloooow when I try to run a couple of VMs. (So's my laptop, despite a better processor; I suspect the 5400RPM drive.) I'm hoping that the new machine will make a big difference.
Just gotta install Ubuntu and move stuff over. Fortunately I've been pretty good about keeping my machine config in Cfengine, so that'll help. And then build some VMs. I'm always surprised at people who feel comfortable downloading random VM images from the Internet. Yeah, it's probably okay...but how do you know?
One thing that Vagrant is missing is integration with Cfengine. Fortunately, the documentation for extending it seems pretty good (plus, I can always kick things off with a shell script). This might be an excuse to learn Ruby.
At work, I'm about to open up the Rocks cluster to production, or at least beta. I'm finally setting up the attached disk array, along with home directories and quotas, and I've just bumped into an unsettled question:
How the hell do I manage this machine?
On our other servers, I use Cfengine. It's a mix of version 2 and 3, but I'm migrating to 3. I've used Cf3 on the front end of the cluster semi-regularly, and by hand, to set things like LDAP membership, automount, and so on -- basically, to install or modify files and make sure I've got the packages I want. Unlike the other machines, I'm not using cfexecd to run Cf3 continuously.
The assumption behind Cf3 and other configuration management tools -- at least in my mind -- is that if you're doing it once, you'll want to do it again. (Of course, there's also stuff like convergence, distributed management and resisting change, but leave that for now.) This has been a big help, because the changes I needed to apply to the Rocks FE were mostly duplicates of my usual setup.
If/when I change jobs/get hit by a bus, I've made it abundantly clear in my documentation that Cfengine is The Way I Do Things. For a variety of reasons, I think I'm fairly safe in the assumption that Cf3 will not be too hard for a successor to pick up. If someone wants to change it afterward, fine, but at least they know where to start.
OTOH, Rocks has the idea of a "Restore Roll" -- essentially a package you install on a new frontend (after the old one has burned down, say) to reinstall all the files you've customized. You can edit a particular file that creates this roll, and ask it to include more files. Edited /etc/bashrc? Add it to the list.
I think the assumption behind the Restore Roll is that, really, you set up a new FE once every N years -- that a working FE is the result of rare and precious work. The resulting configuration, like the hardware it rests on, is a unique gem. Replacing it is going to be a pain, no matter what you do. There aren't that many Rocks developers, and making it Really, Really Frickin' Nice is probably a waste of their time.
(I also think it fits in with the rest of Rocks, which seems like some really nice bits surrounded by furiously undocumented hacks and workarounds. But I'm probably just annoyed at YET ANOTHER UNDOCUMENTED SET OF HACKS AND WORKAROUNDS.)
And so you have both a number of places where you can list files to be restored, and an amusing uncertainty about whether the whole mechanism works:
I found that after a re-install of Rocks 5.0.3, not all the files I asked for were restored! I suspect it has to do with the order things get installed.
So now I'm torn.
Do I stick with Cf3? I haven't mentioned my unhappiness with its obtuseness and some poor choices in the language (nine positional arguments for a function? WTF?). I'm familiar with it because I've really dived into it and taken a course at LISA from Mark Burgess his own bad self, but it's taken a while to get here. But it is the way I do just about everything else.
Or do I use the Rocks Restore Roll mechanism? Considered on its own, it's the least surprising option for a successor or fill-in. I just wish I could be sure it would work, and I'm annoyed that I'd have to duplicate much of the effort I've put into Cf3.
Gah. What a mess.
At $work I'm migrating slowly to Cfengine 3. One of the attractions is the ability to do what this page shows: loop over lists in a Cf-ish kind of way.
Here's the first bundle. (It's pretty much stolen from that page, but customized for my environment.) It tells you some basic details about the config file, the process name and the restart command for different daemons:
bundle common services {
vars:
redhat|centos::
"cfg_file_prefix" string => "centos/5";
"cfg_file[ssh]" string => "/etc/ssh/sshd_config";
"daemon[ssh]" string => "sshd";
"start[ssh]" string => "/sbin/service sshd restart";
"enable[ssh]" string => "/sbin/chkconfig sshd on";
"cfg_file[iptables]" string => "/etc/sysconfig/iptables";
"start[iptables]" string => "/sbin/service iptables restart";
"enable[iptables]" string => "/sbin/chkconfig iptables on";
}
Here's the bundle that copies config files and restarts the daemon if necessary:
bundle agent fix_service(service) {
files:
"$(services.cfg_file[$(service)])"
copy_from => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
perms => mog("0600","root","root"),
classes => if_repaired("$(service)_restart"),
comment => "Copy a stock configuration file template from repository";
processes:
"$(services.daemon[$(service)])"
comment => "Check that the server process is running, and start if necessary",
restart_class => canonify("$(service)_restart");
commands:
"$(services.start[$(service)])"
comment => "Method for starting this service",
ifvarclass => canonify("$(service)_restart");
"$(services.enable[$(service)])"
comment => "Method for enabling this service",
ifvarclass => canonify("$(service)_restart");
}
And here's the loop that puts it all together:
bundle agent redhat {
vars:
"service" slist => { "ssh", "iptables" };
methods:
"any" usebundle => fix_service("$(service)"),
comment => "Make sure the basic application services are running";
}
I ran into a problem with this, though: it would always, without
fail, restart iptables even though no config file had been copied.
The problem was with the process check: there's no process to check
for with iptables. And from what I can tell, when the processes
stanza was asked to check for a non-existent variable, it checked for
the literal string $(services.daemon[$(service)])
-- that is,
dollar-bracket-s-e-r-v-.... Since there was no such thing, it decided
it needed restarting.
The way around this was to add this variable to the services bundle (the one that has all the info about the daemons):
"daemon[iptables]" string => "cf_null";
I also had to modify the processes stanza:
processes:
$(services.daemon[$(service)])"
comment => "Check that the server process is running, and start if necessary",
restart_class => canonify("$(service)_restart"),
ifvarclass => canonify("$(services.daemon[$(service)])");
That ifvarclass
check on the last line says to run iff there is a
value for daemon. cf_null
is a NULL value special to cfengine.
Since the check fails for iptables, the process check isn't run and
we only restart if we copy over a new config file.
Xmas vacation is when I get to do big, disruptive maintenance with a fairly free hand. Here's some of what I did and what I learned this year.
I made the mistake of rebooting one machine first: the one that held the local CentOS mirror. I did this thinking that it would be a good guinea pig, but then other machines weren't able to fetch updates from it; I had to edit their repo files. Worse, there was no remote console on it, and no time (I thought) to take a look.
Last year I tried getting machines to upgrade using Cfengine like so:
centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
"/usr/bin/yum -q -y clean all"
"/usr/bin/yum -q -y upgrade"
"/usr/bin/reboot"
This didn't work well: I hadn't pushed out the changes in advance, because I was paranoid that I'd miss something. When I did push it out, all the machines hit on the cfserver at the same time (more or less) and didn't get the updated files because the server was refusing connections. I ended up doing it by hand.
This year I pushed out the changes in advance, but it still didn't work because of the problems with the repo. I ran cssh, edited the repos file and updated by hand.
This worked okay, but I had to do the machines in separate batches -- some needed to have their firewall tweaked to let them reach a mirror in the first place, some I wanted to watch more carefully, and so on. That meant going through a list of machines, trying to figure out if I'd missed any, adding them by hand to cssh sessions, and so on.
I may need to give in and look at RHEL, or perhaps func or better Cfengine tweaking will do the job.
Quick and dirty way to make sure you don't overload your PDUs:
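# $RANDOM maxes out at 32767, so dividing by 200 staggers things over roughly 0-160 seconds: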
sleep $(expr $RANDOM / 200 ) && reboot
Rebooting one server took a long time because the ILOM was not working well, and had to be rebooted itself.
Upgrading the database servers w/the 3 TB arrays took a long time: stock MySQL packages conflicted with the official MySQL rpms, and fscking the arrays takes maybe an hour -- and there's no sign of life on the console while you're doing it. Problems with one machine's ILOM meant I couldn't even get a console for it.
Holy mother of god, what an awful time this was. I spent eight hours on upgrades for just nine desktop machines. Sadly, most of it was my fault, or at least bad configuration:
Graphics drivers: awful. Four different versions, and I'd used the local install scripts rather than creating an RPM and installing that. (Though to be fair, that would just rebuild the driver from scratch when it was installed, rather than do something sane like build a set of modules for a particular kernel.) And I didn't figure out where the uninstall script was 'til 7pm, meaning lots of fun trying to figure out why the hell one machine wouldn't start X.
Lesson: This really needs to be automated.
Lesson: The ATI uninstall script is at /usr/share/ati/fglrx-uninstall.sh. Use it.
Lesson: Next time, uninstall the driver and build a goddamn RPM.
Lesson: A better way of managing xorg.conf would be nice.
Lesson: Look for prefetch options for zypper. And start a local mirror.
Lesson: Pick a working version of the driver, and commit that fucker to Subversion.
These machines run some scientific software: one master, three slaves. When the master starts up at boot time, it tries to SSH to the slaves to copy over the binary. There appears to be no, or poor, rate throttling; if the slaves are not available when the master comes up, you end up with the following symptoms:
The problem is that umpty scp processes on the slave are holding open the binary, and the kernel gets confused trying to run it.
I also ran into problems with a duff cable on the master; confusingly, both the kernel and the switch said it was still up. This took a while to track down.
It turned out that a couple of my kvm-based VMs did not have jumbo frames turned on. I had to use virt-manager to shut down the machines, turn on virtio on the drivers, then reboot. However, kudzu on the VMs then saw these as new interfaces and did not configure them correctly. This caused problems because the machines were LDAP clients and hung when the network was unavailable.
My presentation on Cfengine 3 went pretty well yesterday. There were about 20 people there...I had been hoping for more, but that's a pretty good turnout. I was a little nervous beforehand, but I think I did okay during the talk. (I recorded it -- partly to review afterward, partly 'cos my dad wanted to hear the talk. :-)
One thing that did trip me up a bit was the questions from one person in the audience that went fairly deep into how to use Cfengine, what its requirements were and so on. Since this was meant to be an introduction and I only had an hour, I wasn't prepared for this. Also, the questions went on...and on...and I'm not good at taking charge of a conversation to prevent it being hijacked. The questions were good, and though he and I disagree on this subject I respect his views. It's just that it really threw off my timing, and would have been best left for after. Any tips?
At some point I'm going to put up more on Cf3 that I couldn't really get into in the talk -- how it compares to Cf2, some of the (IMHO) shortfalls, and so on.
A nice thing about working at a university is that you get all this time off at Xmas, which is really nice; however, it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.
Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS' ATS, but discovered that NUT is different from APCUPSD in one important way: it doesn't easily allow you to say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.
It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesitant when I estimate how long something will take, because I feel like I have no way of knowing in advance (interruptions, unexpected obstacles...you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer that I needed.
One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. Upon closer inspection, they were reporting problems with half of the LUNs that the array was presenting to them -- and they were reporting them in different ways. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)
The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But:
cfservd had refused its connection because I had the MaxConnections parameter too low. I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)
Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.
(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)
Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.
And best of 2010 to all of you!
Irritating: chkconfig
on RHEL/CentOS returns non-zero if a service
isn't configured for a runlevel. IOW, you can do:
chkconfig --level 3 foo
and have 0 returned if it's on, 1 if it's not.
But not SuSE; nope, it just returns 0 whether or not it's enabled, or
even if the service itself doesn't exist. Because, you know, grep
doesn't get used enough.
I'm doing this because I'm trying to use cfengine 2 to manage services. This works well in CentOS, where you can add something like:
service_foo_on = (ReturnsZero("/sbin/chkconfig --level 3 foo"))
and it'll work. ("service_foo_on" is a bit of a misnomer, because I'm checking runlevels, not whether it's actually running.)
Update: Nope, I'm wrong. chkconfig --check
does exactly what I
want. Many thanks to yaloki on #openSUSE-server for the help.
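So the check becomes something like this (I believe the exact runlevel-argument syntax is in SuSE's chkconfig man page):
# Exits 0 if the service is enabled, non-zero if not:
chkconfig --check foo && echo "foo is enabled"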
Just ran into an interesting problem: after replacing memory on a server, CentOS booting hung at "Starting system message bus..."
So what does dbus have to do with anything? This turned out to be an
LDAP failure; dbus was trying to run as UID root
, and since the LDAP
server couldn't be contacted it hung. Why couldn't the LDAP server be
contacted? The LDAP server logs only showed this:
[09/Sep/2009:12:04:32 -0700] conn=41492 op=-1 fd=112 closed - SSL
peer cannot verify your certificate.
The CA cert I use was in place, and another machine had just rebooted w/o problems (all this is taken care of with cfengine, so they were identical in this respect). I could connect to the LDAP server on the right port without any problems.
I finally figured out what was going on when I ran:
openssl s_client -connect ldap.example.com:636 -CApath /path/to/cacert_directory
and saw:
Verify return code: 9 (certificate is not yet valid)
date said it was December 31, 2001. What the what now? Ran ntpdate to set things correctly, then I got:
Verify return code: 0 (ok)
I figure the CMOS clock (or whatever the kids are calling it these days) got reset when we had to remove the CPU daughtercard to get at the memory underneath.
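One extra step that's probably worth doing after the ntpdate, so the fix sticks around: write the corrected system time back to the hardware clock.
# Sync the hardware (CMOS) clock from the now-correct system clock:
hwclock --systohc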
And now you know...the rest of the story.
label barcode just failed miserably. (Neat command that.) And I had thought that DTE meant the arm, but no: upon reflection, it's a subtle/obtuse (not the right word, but oh well) way of referring to the tape drive itself.

This sounds like when I was at my previous employer and they asked if I could develop a web-based system to take surveys. I nearly said, "yes" because, well, I know perl, I know CGI, and I could do it. However, I was smart enough to say "no, but surveymonkey.com will do it for cheap." Best of all it was self-service and the HR person was able to do it entirely without me. If I had said I could write such a program, it would have been days of back-and-forth changes which would have driven me crazy. Instead, she was happy to be empowered to do it herself. In fact, doing it herself without any help became a feather in her cap.
The lesson I learned is that "can I do it?" includes "do I want to do it?". If I can do something but don't want to, the answer is, "No, I don't know how" not "I know how but don't want to". The first makes you look like you know your limits. The latter sounds like you are just being difficult.
Another thing I'm trying to do at my new job is make/take more time for long-term planning. I've been dinged by mgt. for this in the past, and while it's not easy to hear I think there has been some validity to this. (My inclination is to concentrate hard on fixing the problems I'm faced with; giving up on something broken, even when doing so would make so much more sense and would free up resources to look for a replacement, just rankles and feels like...well, giving up.) Since the department I'm in is so new, it's even more important to pay attention to this.
Part of the problem is just recognizing that I need to make time. An hour a week to be isolated, and to (say) figure out what I'm going to need to do for the next month, is a habit I'm very conciously trying to adopt.
But another problem is how to keep track of all this. What I've done so far:
I'm a huge fan of Tom Limoncelli's Time Management for System Administrators, and his Cycle system has served me well. I've become a big fan of a paper organizer, so that's how I keep track of things. But it works best as a way of tracking day-to-day stuff; it's not so good at tracking a project that takes weeks, or months, or years.
I've read GTD, and that seems like a good system — but it's very different from The Cycle. I don't want to give up the Cycle, I want to graft on to it. And I'm not sure how well I can do that w/GTD.
I've tried org-mode in Emacs. I'm pretty happy with this, and in fact I switched to it for a while when I first started at this job back in July. It worked well for tracking day-to-day stuff, but I missed the flexibility and ubiquitousness of paper.
So where does that leave me? ATM, (paper planner Cycle) plus attempting some longer-term project tracking w/org-mode. I figure the TODO bits from org-mode will fit well with the planner, and the flexibility of Emacs and org-mode (different from paper...oh, how I wish I could grep paper) will work well for projects...the records for which should, ideally, be suitable for pasting into wiki-based documentation.
If anyone has any suggestions, please let me know. If I make it to LISA this year, I'll be looking for a BOF about this. (Or maybe I'll just tackle Tom Limoncelli to the ground and holler "I love you, man!" a la "Say Anything".)
Moving on:
I really like TrueType font support in Emacs 23, and ttf-inconsolata in particular. Thanks to Emacs-fu for both suggestions.
I and a co-worker picked up the servers that had, for the last two years or so, been racked at BC Women's hospital (of all places...my sons were both born there). We both had the same reaction when we saw them on a cart, ready to be loaded into our truck: "They're so small!" Seven little 1U servers plus one disk array...you start to think of them as larger-than-life when you're not looking at them all the time, and it's easy to forget just how small they are.
Some interesting discussion on the Cfengine mailing list about how Cfengine should handle packages.
And now it is time for bed.
I'm in the process of setting up a bunch of new servers for $job_2. All but one are CentOS 5.2, kickstart installed and managed with cfengine. This is the third time I've gone through a cfengine setup, and it always feels like starting from scratch each time. It seems -- and I'm not at all sure this is fair or accurate -- that each time I set up one of these systems, there's a lot that I've lost from the last time and have to relearn. I'm fortunate this time that I can refer to $job_1's setup to see how I did things last time, but if I didn't have that I'd be significantly further behind than I am.
I'm not sure what the solution is. Part of me thinks I should just be more aggressive about taking notes, or committing stuff to a private repository, or writing it down here more; part of me thinks that this might be a clue that cfengine is too low-level for my head. It feels like when I was trying to learn C, and couldn't believe that I had to remember all this stuff just to print something, or read a file, or connect to another machine over the Internet. By contrast, Perl (or any other scripted language) was such a relief...just print, or open, or use the Net::Telnet module, or whatever. The details are there and they are important, sometimes very much so; that doesn't mean I want to learn more metallurgy every time I need a fork. (No, I don't think that metaphor's tortured; why do you ask?)
Another thing is that I'm trying to get multipath connections working for the first time. We've got two database servers, each of which is connected via dual SAS HBAs to outboard disk arrays. (I don't think anyone else calls them "outboard", but I like the sound of it. See this hard drive? It's outboard, baby!) The arrays are from Sun and come with drivers, but the documentation is confusing: it says it's available for RHEL 5 (aka CentOS 5), but the actual download says it's only for RHEL 4.
As a temporary respite, I'm trying to see if I can get these working using Linux's own multipath daemon, and it's also confusing. The documentation for it is tough to track down, and I just don't understand the different device names: am I meant to put /dev/dm-2 in fstab, or /dev/mpath/mpath2p1? If the latter, why does the name sometimes change to the WWID (/dev/mpath/$(cat /dev/random)) when I restart multipathd? (user_friendly_names is uncommented in the config file.) If the whole point of multipath is failover, why does this sequence:
(where /mnt is where I've got this array mounted, obvs) sometimes work, and sometimes end with "I/O error" being logged, and the filesystem being read-only? Is this the sort of thing that the Sun driver will fix? I can't find anything about this.
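For my own notes, here's roughly how I understand the naming is supposed to work -- a sketch based on my reading of the multipath docs, not something I've verified on these arrays, and the WWID and alias below are made up. The /dev/dm-N names aren't stable, so the idea seems to be to mount via the mapper name, and user_friendly_names plus an explicit alias is what keeps that name from flipping back to the raw WWID:
# /etc/multipath.conf -- sketch only
defaults {
    user_friendly_names yes       # mpathN-style names instead of raw WWIDs
}
multipaths {
    multipath {
        wwid  3600a0b80001234560000abcd1234ef56   # hypothetical; from 'multipath -ll'
        alias dbarray                              # a name that won't change on restart
    }
}
# /etc/fstab -- mount the mapper device, not /dev/dm-2
/dev/mapper/dbarrayp1  /mnt  ext3  defaults  0 0   # partition suffix may vary by version
If that's right, the dm-N and mpathN names are just different views of the same device-mapper node, and the alias is the one worth writing down.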
And I mentioned electrical problems. When we got our servers installed, the Sun guys told us they'd tripped breakers on the PDU and/or breakers in the room's electrical cabinet. Since it had a sign on it saying "100A", I figured we might be running up against power limits -- either in the room as a whole, if my figures were 'way out, or on individual PDUs. Turns out I was probably wrong: I missed the bit on the sign that said 3-phase, which means (deep breath) we probably have 3 x 100A power available (I think).
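For what it's worth, a quick back-of-the-envelope, assuming 120 V per leg -- which may well not be how this panel is actually wired:
echo $((3 * 120 * 100))    # 36000 W, call it roughly 36 kW total, best case
More headroom than I'd feared, if that's right.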
It's more complicated than that, because some of it is in 120V, some of it is in twist-lock 220V 30A circuits, and so on. But I should've checked before emailing the faculty member who, in a year or two, will be going into this room (we're there as guests of the department) and happens to sit on the facilities committee. He had asked how we were doing, so I sent him an email -- nice, polite, and including a bit about how grateful we were for the room and the help of the local sysadmins (all of which is true).
I was under the impression that he was asking for info now, so that he could bring it up for action in a few months when we were out. Instead, two hours later when I'm swearing at multipath, in come the facilities manager and one of the sysadmins I was dealing with, looking to find out just how much power we were using anyhow. I apologized profusely, and they were very cool about it. But when the committee guy asks questions, people jump. I had not anticipated this. Welcome to University Politics 101. I emailed again and explained my mistake.
There are lots of remedial courses I could take. However, today I would most like to take "Electricity and wiring for sysadmins".
And on another note: Ack! My laptop's home partition is 93% full! How the hell did that happen?
And again: How did I not know about apt-file? This is perfect!
(Touch o' the hat to Tears For Fears and Steve Kemp; I'm moving closer every day to switching to Chronicle.)
Ran into a problem today when adding this stanza to cfengine on a Debian Etch machine:
editfiles: ``` { /etc/aliases AppendIfNoSuchLine "root: sysadmin@pims.math.ca" DefineClasses "rebuild_aliases:restart_postfix" }
The cfengine reference file I've got, which sez it's for version 2.2.1, says you can define multiple classes in DefineClasses (or DefineInGroup), as long as they're separated by commas, spaces or dots. (The version in Etch is 2.2.20.)
However, when I ran cfagent, it just hung immediately after performing the edit, and gave this error when I ctrl-c'd it:
cfengine: Received signal 2 (SIGKILL) while doing [pre-lock-state]
Running cfengine with -d2 showed endless repetitions of AddClassToHeap() at this point, so either there's something wrong with my syntax or there's a bug in cfengine. (I'm guessing the former.) Searching for pre-lock-state and cfengine only turned up cases where the clients were syncing with the master; thus this note.
The fix was to just make it one class:
DefineClasses "rebuild_aliases"
Asking to restart Postfix was probably a bit of overkill anyhow...
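For future reference, here's roughly how I'd wire that single class up to an actual action -- an untested sketch, with the cfengine 2 syntax written from memory and the newaliases path a guess for Debian:
control:
   actionsequence = ( editfiles shellcommands )
   AddInstallable = ( rebuild_aliases )      # made-up classes have to be declared here

editfiles:
   { /etc/aliases
   AppendIfNoSuchLine "root: sysadmin@pims.math.ca"
   DefineClasses "rebuild_aliases"
   }

shellcommands:
   rebuild_aliases::
      "/usr/bin/newaliases"                  # rebuild the aliases db; Postfix picks it up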
Version 0.0.3 of Project U-13, a distro for sysadmins, has been released!
The main change is the addition of RackMonkey, which its website describes as "a web-based tool for managing racks of equipment such as web servers, video encoders, routers and storage devices", at the suggestion of Andy Seely. Also, Lynx has been installed, and there's also the skeletal beginnings of a Cfengine config file.
The ISO has been signed with my GPG key. Share and enjoy, and comments on a postcard, please.
My laptop hard drive started giving scary errors a couple days ago on the way to work (I've got a 90-minute commute by public transit [uck] so I fill the time by reading, listening to podcasts, or working on Project U-13). Fortunately, working at a university means that there are two computer stores on campus. I ran out at lunch, picked up a 100GB drive, and had things back to normal by the next morning.
Well, normal modulo one false start with Debian; I decided to try encrypted filesystems just for fun. But then I suspended, came back with a newer kernel, and it could not read the encrypted LVM group anymore. Whoops.
Still lots of free space on this thing, and I'm thinking of installing Ubuntu, FreeBSD and maybe NetBSD just for fun. Of course, I've got to do it all via PXE since this thing doesn't have any CDROM drive, but that just adds to the geek points.
Project U-13 is coming up on 0.0.3, btw; Andy suggested adding Rackmonkey, which looks quite cool. There's no package for it, so I'm having to do some rather ugly scripted installation…but I can stand it for now. And I've got the barest skeleton of a cfengine file in there too. Watch the skies!
Holy crap, it's been a while since I last wrote here. Mainly that's because I've been working on web stuff at work and have felt very little like a sysadmin of late. Thankfully we've got a webmaster hired, and to some extent the work'll be shifted to him in the new year. Of course, that still leaves the redesign of the website and its back end…that's not done 'til it's done.
This week, though, has been slow, and I've been catching up a little on sysadmin work. Part of it was setting up a devel server for the webmaster, and detailing what I was doing in Cfengine as I went along. It was gratifying to get LDAP working (I haven't done that on a Linux machine before; shame on me), and irritating when I realized that I couldn't mount the home directories from the server because I hadn't restarted nscd on the server.
The last two days were spent trying to get encrypted Bacula working between here and $other_university. This was an enormous pain in the ass for two reasons:
The Right Way (tm) of doing it is by using TLS, which is what the kids are calling SSL these days, and I have never fully grokked SSL, or the openssl command. I know that there's encryption going on; I know that there are certificates signed by CAs; I know that there's a lot of negotiating of different options. But start throwing in x509 versus PEM, Diffie-Hellman parameters and the single most cryptic set of error messages I've ever come across, and I just feel thick. I was reduced to looking at tcpdump output of the negotiation to figure out what was going on, and I couldn't; the Bacula FD client complained that the Bacula Director wasn't producing a certificate, and that was all I knew. The otherwise incredibly excellent docs from Bacula were a trifle thin on all of this, and I couldn't find out much about my situation (going the self-CA route).
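For my own future reference, the self-CA dance I was attempting went roughly like this -- generic openssl usage with made-up filenames, not a recipe from the Bacula docs, and it still didn't get me past the certificate errors:
# One-time: make yourself a CA (key plus self-signed cert).
openssl genrsa -out ca.key 2048
openssl req -new -x509 -key ca.key -days 3650 -out ca.crt

# Per daemon (director, FD, SD): key, signing request, cert signed by the CA.
openssl genrsa -out bacula-fd.key 2048
openssl req -new -key bacula-fd.key -out bacula-fd.csr
openssl x509 -req -in bacula-fd.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -days 365 -out bacula-fd.crt
Then, if I'm reading the docs right, each daemon gets pointed at its own key, its own cert and the CA cert in the TLS bits of its config.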
So okay, fuckit, right? That's why God invented OpenSSH. So whee, start tunnelling port 9102 over SSH so the Director can contact the FD at $other_university, and 9103 back so the FD can contact the Storage Daemon. Only it turns out (my bad for not knowing this before) that not only does the client want to contact the SD, so does the director. Thus, my plan to tunnel to the firewall at the other end and tell the client that it could find the Storage Daemon there didn't work, 'cos the director wanted to contact it there too. (I did briefly try allowing the director to contact the tunnel at the other end: so even though the Storage was working on the same machine as the director, for that one job the Director's connection to it was going to the remote end and getting tunnelled back over SSH. But:
And why was I trying to connect to the remote firewall via SSH, rather than the client I'm trying to back up itself? Because that client is a Solaris machine authenticating against LDAP, and that turns out to bork key-based logins over SSH. What a crock.
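For the record, the tunnelling I was aiming for looked roughly like this, run from the director's machine (hostnames hypothetical):
# Local 9102 reaches the remote FD; remote 9103 comes back to our SD.
ssh -f -N \
    -L 9102:client.other-university.example:9102 \
    -R 9103:localhost:9103 \
    backup@firewall.other-university.example
The plan was to tell the Director the FD lives at localhost:9102 and tell the FD the SD lives on the firewall at 9103 -- which, as described above, falls over because the Director wants to reach the SD at that same address too.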
Oh well. I did add three other machines here to Bacula this week, so that's good.
Project U-13 is coming along. I'm pretty close to a 0.0.2 release (woot), which should have the following working:
And by "working" I mean "installed". But I've got a decent setup on my laptop for building and testing it, which means I get up to a couple hours a day to work on it (New Westminster -> UBC == long). Thanks to Andy, he of the amazing speaking skills, for kicking my ass into action.
I'm learning a bit more about Mercurial in the process. After coming from CVS and Subversion, it seems really weird to me that the usual way of branching is "Go ahead, clone another repo! We're Mercurial! We don't care! Repos for everyone!" But if you figure on distributed development — something more Linux-y than a controlled work environment — then it makes sense. Not that I think I'll have lots of people working on this thing, but it makes sense that if someone were to take this for their own ends, they wouldn't want to bother copying all the branches…just the one(s) they're interested in.
Last word to my son:
Q: What does a Camel say, Arlo? A: Purhl!
I've had a bunch of ideas lately. I'm inflicting them on you.
The presentation went well...I didn't get too nervous, or run too long, or start screaming at people (damn Induced Tourette's Syndrome) or anything. There were maybe 30 or so people there, and a bunch of them had questions at the end too. Nice! I was embiggened enough by the whole experience that, when the local LUG announced that they were having a newbie's night and asked for presenters to explain stuff, I volunteered. It's coming up in a few weeks; we'll see what happens.
And then I thought some more. A few days before I'd been listening to the almost-latest episode of LugRadio (nice new design!), where they were talking about GUADEC and PyCon UK. PyCon was especially interesting to hear about; the organizers had thought "Wouldn't it be cool to have a Python conference here in the UK?", so they made one.
So I thought, "It's a shame I'm not going to be able to go to LISA this year. Why don't we have our own conference here in Vancouver?" The more I thought about it, the better the idea seemed. We could have it at UBC in the summer, where I'm pretty sure there are cheap venues to be had. Start out modest — say, a day long the first time around. We could have, say, a training track and a papers track. I'm going to talk about this to some folks and see what they think.
Memo to myself: still on my list of stuff to do is to join pool.ntp.org. Do it, monkey boy!
Another idea I had: a while back I exchanged secondary DNS service, c/o ns2exchange.com. It's working pretty well so far, but I'm not monitoring it, so it's hard for me to be sure that I can get rid of the other DNS servers I've got. (Everydns.net is fine, but they don't do TXT or IPv6 records.) I'm in the process of setting up Nagios to watch my own server, but of course that doesn't tell me what things look like from the outside.
So it hit me: what about Nagios exchange? I'll watch your services if you watch mine. You wouldn't want your business depending on me, of course, but this'd be fine for the slightly anal sysadmin looking to monitor his home machines. :-) The comment link's at the end of the article; let me know if you're interested, or if you think it's a good/bad/weird idea.
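If it happens, the check on my end would be something like this -- a sketch using the stock check_dns plugin, with the zone and host names made up:
# commands.cfg
define command {
    command_name  check_secondary_dns
    command_line  $USER1$/check_dns -H example.org -s $HOSTADDRESS$   # ask their server for my zone
}

# services.cfg (assumes the usual generic-service template)
define service {
    use                  generic-service
    host_name            ns2-partner
    service_description  Secondary DNS answers for my zone
    check_command        check_secondary_dns
}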
The presentation also made me think about how this job has been, in many ways, a lot like the last job: implementing a lot of Things That Really Should Be Done (I hate to say "Best Practices") in a small shop. Time is tight and there's a lot to do, so I've been slowly making my way through the list:
Some of these things have been held up by my trying to remember what I did the last time. And then there's just getting up to speed on bootstrapping a Cfengine installation (say).
So what if all these things were available in one easy package? Not an appliance, since we're sysadmins — but integrated nicely into one machine, easily broken up if needed, and ready to go? Furthermore, what if that tool was a Linux distro, with all its attendant tools and security? What if that tool was easily regenerated, and itself served as a nicely annotated set of files to get the newbie up and running?
Between FAI (because if it's not Debian, you're working too hard) and cfengine, it should be easy to make a machine look like this. Have it work on a live ISO, with installation afterward with saved customizations from when you were playing around with it.
Have it be a godsend for the newbie, a timesaver for the experienced, and a lifeline for those struggling in rapidly expanding shops. Make this the distro I'd want to take to the next job like this.
I'm tentatively calling this Project U-13. We'll see how it goes.
Oh, and over here we've got Project U-14. So, you know, I've got lots of spare time.
Sound of tires, sound of God...
"Electric Version", The New Pornographers.
Thursday morning came far too early. My roommate offered some of his 800mg Ibuprofens, and I accepted. First thing I attended was the presentation "Drowning in the Data Tsunami" by Lee Damon and Evan Marcus. It was interesting, but seemed to be mostly about US data regulations (HIPAA/SOX et al.) and wasn't really relevant to me. I had been expecting more of an outline of, say, how in God's name we're going to preserve information for, say, a hundred years (heroic efforts of the Internet Archive notwithstanding). There was mention of an interesting approach to simply not accumulating cruft as you upgrade storage (because it's easier than sorting through to see what can be discarded; "Why bother weeding out 200MB when the new disk is 800GB?"): a paper by Radia Perlman (she of spanning-tree fame) that proposes an encrypted data storage system (called The Ephemerizer) combined with key escrow that, to expire data, simply deletes the key when the time is up. Still, I moved on before too long.
...Which was good, because I sat in on Alva Couch's presentation on his and Mark Burgess' paper, "Modelling Next-Generation Configuration Management Tools". Some very, very confusing stuff about aspects, promises and closures -- confusing because the bastard didn't preface his talk with "This is what Hugh from Vancouver will need to know to understand this." (May be in the published paper; will check later.) Here's what I could gather:
I will do the right thing and read his paper, and I may update this later; these are just my notes and impressions, and aren't gospel. Couch is an incredibly enthusiastic speaker, and even though I didn't understand a lot of it I ended up excited anyway. :-) He gave another talk later in the week that Ricky went to, about how system administration will have to become more automatic; as a result, we'd all better learn how to think high-level and to be better communicators, because more and more of our stuff will be management -- and not just in the sense of managing computers. I'm going to seek out more of his stuff and see if it'll fit in my head.
After the break was a talk on "QA and the System Administrator", presented by a Google sysadmin. I went because it was Google, and frankly it wasn't that interesting. One thing that did jump out at me was when he described a Windows tool called Eggplant, a QA/validation tool. It has OCR built-in to recognize a menu, no matter where it is on the screen. This astounded me; when you start needing OCR to script things, that's broken. I don't doubt that it's a good tool, and I can think of lots of ways that would come in handy. But come on. I mean, a system that requires that is just so ugly.
I went out to lunch with Jay, a sysadmin from a shop that's just got permission from the boss to BSD-license a unit-testing program they've come up with for OpenBSD firewalls: it uses QEMU instances to fully test a firewall with production IP addresses, making sure that you're blocking and allowing everything you want. It sounds incredibly cool, and he's promised to send me a copy when he gets back. I can't wait to have a look at it.
After that was the meet-the-author session. I got to thank Tom Limoncelli for "Time Management for System Administrators", and got an autograph sticker from him and Strata Rose Chalup, his co-author for Ed 2. Sadly, I didn't get a chance to thank Tobias Oetiker (who I nearly ran into at lunch the day before).
Next up was the talk from Tom Limoncelli and Adam Moskowitz (Adam's looking for a job! Somebody hire him!) about how to get your paper accepted at LISA. Probably basic stuff if you've written a paper before, but I haven't, so it was good to know. Things like how to write a good abstract, what kind of paper is good for LISA, and how you shouldn't say things like "...and if our paper is accepted, we'll start work right away on the solution." Jay asked whether a paper on the pf testing tool would be good, and they both nodded enthusiastically.
Must Google:
Quotes from the talk:
At this point I started getting fairly depressed. Part of it was just being tired, but I kept thinking that not only could I not think of something to write a paper about, I could not think of how I'd get to find something to write about. I wandered over to the next talk feeling rather sad and lost.
The next talk was from Andy Seely on being a sysadmin in US Armed Forces Command and Control. Jessica was there, and we chatted a bit about how this talk conflicted with Tom Limoncelli's Time Management Guru session, and maybe ducking over to see that. Then Andy came over and asked Jessica to snap some pictures, so she ended up staying. I was prepared to give it five minutes before deciding whether or not to leave.
Well, brother, let me tell you: Andy Seely is one of the best goddamned speakers on the planet. He was funny, engaging, and I could no more leave the room than I could get my jaw to undrop. Not only that, his talk was fascinating, and not just because he's a sysadmin for the US Armed Forces while simultaneously having a ponytail, earrings and tattoos. You can read the article in ;login: (FIXME: Add link) that it was based on, but he expanded on it considerably. Let me see what I can recall:
Longer story: Because of the nature of his work, he's got boxes that he has to keep working when he knows next to nothing about what they're meant to do. Case in point: a new Sun box arrives ("and it's literally painted black!"), but the person responsible for it wants to send it back because it doesn't work -- which means that when they click the icon to start the app it's meant to run, it doesn't launch and there's no visible sign that it's running. There's no documentation. And yet he's obligated to support this application. What do you do?
Even tracking down the path to the program launched by the icon is a challenge, but he does, tracks down the nested shell scripts and finally finds the jar that is the app ("Aha! It is Java!"). He finds log files which are verbose but useless. He contacts the company that wrote it, and is told he needs a support contract...which the government, when putting together the contract for the thing, did not think to include. So he calls back an hour later, talks to the help desk and tells them he's lost the number -- "Can you help a brother out?" They do, but they're stumped as well, and say they've never seen anything like this.
Time to pull out truss, which produces a huge amount of output. Somewhere in the middle of all that he notices a failing hard read of a file in /bin: it was trying to read 6 bytes and failing. Turns out the damned thing was trying to keep state in /bin, and failing because the file was zero bytes long. He removed the file, and suddenly the app works.
Andy also talked about trying to get a multiple GB dump file from Florida to Qatar. Physical transport was not an option, because arranging it would take too long. So he tries FTPing the file -- which works until he goes home for the day, at which point the network connection goes down and he loses a day. So he writes a Perl script that divides the file into 300MB chunks, then sends those one at a time. It works!
At this point, someone yells out "What about split?" Andy says, "What?" He hadn't known about it. There was a lot of good-natured laughter. He asked, "Is there an unsplit?" "Cat!" came the response from all over the room. He smacked his forehead and laughed. "This is why I come to LISA," he said. "At my job, I've been there 10 years. People come to me 'cos I'm the smart one. Here, I'm the dumb one. I love that."
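(For the record, the split/cat route is about as simple as it gets; filenames made up:
split -b 300M bigdump.dat chunk_     # produces chunk_aa, chunk_ab, ...
# ...transfer the chunks, re-sending individual ones as needed...
cat chunk_* > bigdump.dat            # reassemble on the far side
md5sum bigdump.dat                   # sanity-check against the original's checksum
Not that I'd have done any better under the circumstances.)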
There are two things I would like to say at this point.
First off, Andy is at least the tenth coolest person on the entire Eastern seaboard. No, he didn't know about cat -- but not only did he reimplement it in Perl rather than give up, he didn't even flinch when being told about it in the middle of giving a talk at LISA. I would probably have self-combusted from embarrassment ("foomp!"), and I would have felt awful. Andy's attitude? "I learned something." That's incredibly strong. (Although he told a story later about being in the elevator with some Google people. They recognized him and said, "Hey, it's the 'man cat' guy!")
Second, when he said, "Here, I'm the dumb one. I love that" I sat up straight and thought, "Holy shit, he's right." Here I am at LISA for the first time ever. I've met people who can help me, and people I can help. I've made a crapload of new friends and have learned more in one week than I would've thought possible. And I'm worried 'cos it might be a few years before I can think about presenting a paper? That's messed up. I tend to set unreasonably high goals for myself and then get depressed when I can't reach them. Andy's statement made me feel a whole lot better.
During Q & A I asked what he did for peer support, since his ability to (say) post to a mailing list asking for help must be pretty restricted. He said that he's started a wiki for internal use and it's getting used...but both the culture and the job function mean that it's slow going. He's also started a conference for fellow sysadmins: 100 or so this year, and he's hoping for more next year.
In conclusion: if you ever get the chance to go see him, do so. And then buy him a beer.
Two sips from the cup of human kindness, and I'm shit-faced
Just laid to waste
If there's a choice between chance and flight, Choose it tonight.
"Choose It", The New Pornographers
Just got back from a whirlwind walk from the Lincoln Memorial to the Washington Monument to the White House. Beautiful, all of it...though a) the White House is small and b) there was something being filmed/videotaped in the courtyard, which made me think of Vancouver.
Training again. AFrisch was good, covering Cfengine quite well; would've liked to see more info about expect. (Apparently there are Perl/Python bindings...I had no idea.) Afternoon course was "Interviewing For System Administrators" by Adam Moskowitz and that was great -- lots of things I didn't know, lots of tips on doing it better next time.
Saw Tom Limoncelli in the hall during a break. Managed to restrain myself. I have the reputation for quiet restraint of a nation to uphold.
Very tired now. Time to go get beer.
Some days are fun days. I got this error on a Debian workstation when starting X:
Xlib: Connection to ":0.0" refused by server
Xlib: Protocol not supported by server
Xrdb: Can't open display ':0'
Turns out that an .xsession file, with one commented-out line, caused that. Remove the line (so now it's empty) and everything works.
Next we got the same user, who's had his home directory moved around on the machine. Machines mounting his home dir via amd (FreeBSD, Debian) work fine, but the SuSE machines running autofs fail miserably with "permission denied" and the ever-popular:
$ cd
-bash: cd: /home/foo: Unknown error 521
Which, if you look up /usr/include/linux/errno.h -- which, you know, is the logical thing to do -- you see this:
/* Defined for the NFSv3 protocol */
#define EBADHANDLE 521 /* Illegal NFS file handle */
Another weird thing with AutoFS: I was running cfengine on a machine, and it hung when querying which RPMs were installed. strace on the rpm command shows it's trying to lock a file and failing; looking at /proc/<number>/fd shows that, yep, it's trying and failing to lock /var/lib/rpm/Packages, the Berkeley DB file that knows all and sees all. So lsof to see who's holding it open, and that hangs; strace shows it's hanging trying to access the home directory of a user whose machine is down right now for reinstall. Try to unmount that directory and it fails. So I bring up the machine with the user's home directory, which allows me to unmount his home directory on the SuSE machine, which allows cfengine to run rpm, which succeeds in locking the Berkeley DB file. Strange; possibly similar to this problem.
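The rough diagnostic path, in case I hit this again (the PID is whatever rpm's happens to be):
strace -f rpm -qa                    # shows rpm blocking on a file lock
ls -l /proc/<pid of rpm>/fd          # the offending fd points at /var/lib/rpm/Packages
lsof /var/lib/rpm/Packages           # hangs too, while it pokes the dead automount
mount | grep autofs                  # find the stale home-directory mount to clear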
On top of everything else, someone asked me if I could be a "network prime". I think they mean "person we can talk to with authority to make network changes", or possibly "network contact". Not entirely sure.
But on the other hand: figured out how to run wpkg, package manager for Windows of the elder gods, as a service using Cygwin's cygrunsrv. The instructions are on the wiki for your viewing enjoyment.
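The short version (service name and wrapper-script path made up; the wiki has the details):
# Install wpkg's wrapper script as a Windows service, then start it.
cygrunsrv --install wpkg --path /usr/local/bin/run-wpkg.sh
cygrunsrv --start wpkg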
More fallout today from Saturday's power outage: two workstations that failed to boot up (BIOS checksum error for one of 'em, which is a new one for me), some NIS-related services that didn't get started properly (not sure what's going on there), and so on. Plus the return of the where-are-those-seven-machines? that didn't get done on Friday because of all of this.
But I did learn some stuff about Cfengine. For example, if you have something like:
my_url = ( http://www.example.com/foo/bar )
then you'd better precede it with:
split = ( "+" )
or some other character that isn't used. The colon is treated as a list separator by default, which means that later on, when you try and do something like:
shell::
linux.need_some_file:
"/bin/wget $(my_url)/baz"
what it'll actually do is this:
/bin/wget http/baz
/bin/wget //www.example.com/foo/bar/baz
'cos it's iterating over the two lists, see?
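Putting the pieces together, the working version looks something like this -- cfengine 2 syntax from memory, untested:
control:
   split  = ( "+" )       # any character that won't appear in the URL
   my_url = ( http://www.example.com/foo/bar )

shellcommands:
   linux.need_some_file::
      "/bin/wget $(my_url)/baz"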
And SuSE's dhcp client, by default (I think), will change /etc/yp.conf without telling you, and then on exit put back the old version (saved conveniently at /etc/yp.conf.sv). It took me a long time to figure out that this was happening, and it pissed me off mightily. /etc/resolv.conf is filled with comments when the dhcp client modifies it -- hell, they even throw in the PID. So why not do that with yp.conf? At least you can turn it off by changing DHCLIENT_MODIFY_NIS_CONF in /etc/sysconfig/networking/dhcp.
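Which, if I'm remembering the format of those sysconfig files right (it's a yes/no variable, I believe), means something like:
# Stop the dhcp client from rewriting /etc/yp.conf behind my back.
DHCLIENT_MODIFY_NIS_CONF="no"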
cfengine is great, it really is. But there are some things that tripped me up. Often you want to set up a daemon to run The Right Way, which involves changing its config file. After that, of course, you want to restart it. What to do? The naive way (ie, the first way I tried) of doing things is:
control::
sequence = ( editfiles shellcommands )
editfiles::
debian:
{ /etc/foo.conf
BeginGroupIfNoLineMatching "bar"
AddLine "bar"
Define restart_foo
EndGroup
}
freebsd:
{ /usr/local/etc/foo.conf
BeginGroupIfNoLineMatching "bar"
AddLine "bar"
Define restart_foo
EndGroup
}
shellcommands::
debian.restart_foo:
"/etc/init.d/foo restart"
freebsd.restart_foo:
"/usr/local/etc/rc.d/foo restart"
However, the correct way of doing this is:
control::
sequence = ( editfiles shellcommands )
AddInstallable = ( restart_foo )
editfiles::
debian:
{ /etc/foo.conf
BeginGroupIfNoLineMatching "bar"
AddLine "bar"
DefineInGroup "restart_foo"
EndGroup
}
freebsd:
{ /usr/local/etc/foo.conf
BeginGroupIfNoLineMatching "bar"
AddLine "bar"
DefineInGroup "restart_foo"
EndGroup
}
shellcommands::
debian.restart_foo:
"/etc/init.d/foo restart"
freebsd.restart_foo:
"/usr/local/etc/rc.d/foo restart"
Without both the enumeration of all your made-up classes in AddInstallable and the enclosing of that class in quotes, cfengine will fail to do what you want -- and will do so quietly and with no clue about why. God, that took me a long time to find.
I love cfengine. If you haven't checked it out yet, do so. You can do really neat stuff like this:
editfiles::
{ /etc/Xprint/C/print/attributes/document
BeginGroupIfNoLineMatching "^\*default-printer-resolution: 300"
CommentLinesMatching "^\*default-printer-resolution: 600"
LocateLineMatching "^# \*default-printer-resolution: 600"
InsertLine "*default-printer-resolution: 300"
DefineInGroup restart_xprint
EndGroup
}
shell::
debian.restart_xprint::
"/etc/init.d/xprint restart"
(Which, by the way, totally fixes the problem of Debian printing 'way huge stuff. Bug number 262958. You should totally look it up.)
Look at that. It's lovely. It's obvious what it's looking for, what it'll do if it can't find it, and what'll happen after that. And it does it automagically. At night. From cron. The way God intended all system administration to be done. However -- and I cannot emphasize how important it is to keep this in mind -- it is absolutely NFG reading the documentation for an hour trying to figure out why the DefineInGroup statement just does not work if:
It's my own fault for printing out v2 docs and not thinking much about it. However, in my own defense it would be nice if cfengine would complain about something it appears not to recognize. Not even with -d2 (which produces output along the lines of CheckingDateForSolarEclipseToday [no]) did it whisper a word about this.