I'm in the process of setting up a bunch of new servers for
$job_2. All but one are CentOS 5.2, kickstart installed and managed
with cfengine. This is the third time I've gone through a cfengine
setup, and it always feels like starting from scratch. It
seems -- and I'm not at all sure this is fair or accurate -- that each
time I set up one of these systems, there's a lot that I've lost from
the last time and have to relearn. I'm fortunate this time that I can
refer to $job_1's setup to see how I did things last time, but if I
didn't have that I'd be significantly further behind than I am.
I'm not sure what the solution is. Part of me thinks I should just be
more aggressive about taking notes, or committing stuff to a private
repository, or writing it down here more; part of me thinks that this
might be a clue that cfengine is too low-level for my head. It feels
like when I was trying to learn C, and couldn't believe that I had to
remember all this stuff just to print something, or read a file, or
connect to another machine over the Internet. By contrast, Perl (or
any other scripting language) was such a relief...just print, or open, or
use the Net::Telnet module, or whatever. The details are there and
they are important, sometimes very much so; that doesn't mean I want
to learn more metallurgy every time I need a fork. (No, I don't think
that metaphor's tortured; why do you ask?)
Another thing is that I'm trying to get multipath connections working
for the first time. We've got two database servers, each of which is
connected via dual SAS HBAs to outboard disk arrays. (I don't think
anyone else calls them "outboard", but I like the sound of it. See
this hard drive? It's outboard, baby!) The arrays are from Sun and
come with drivers, but the documentation is confusing: it says they're
available for RHEL 5 (aka CentOS 5), but the actual download says it's
only for RHEL 4.
As a temporary respite, I'm trying to see if I can get these working
using Linux's own multipath daemon, and it's also confusing. The
documentation for it is tough to track down, and I just don't
understand the different device names: am I meant to put /dev/dm-2 in
fstab, or /dev/mpath/mpath2p1? If the latter, why does the name
sometimes change to the WWID (/dev/mpath/$(cat /dev/random)) when I
restart multipathd? (user_friendly_names is uncommented in the config
file.) If the whole point of multipath is failover, why does this
sequence:
- touch /mnt/1
- remove first cable
- rm /mnt/1
- replace first cable
- touch /mnt/2
- remove second cable
- rm /mnt/2
- replace second cable
(where /mnt is where I've got this array mounted, obvs) sometimes
work, and sometimes end with "I/O error" being logged, and the
filesystem being read-only? Is this the sort of thing that the Sun
driver will fix? I can't find anything about this.
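
For what it's worth, the cable-pull test above is easy to get subtly
different from one run to the next, so here is a rough sketch of it as
a Python script. The function names and the `pause` callback are made
up for illustration; nothing here comes from multipath-tools or the Sun
driver. It only reports the error the filesystem hands back (EIO, EROFS
and so on) and does not explain why failover did or didn't happen.

```python
import errno
import os


def touch_then_remove(mount_point, name, pause, cable):
    """Run one touch / pull cable / rm / replace cable round.

    Returns None if every step worked, otherwise the errno name of the
    first failure (for example "EIO" or "EROFS").
    """
    path = os.path.join(mount_point, name)
    failure = None
    try:
        # touch: create the file while both paths are still connected.
        with open(path, "w") as handle:
            handle.flush()
            os.fsync(handle.fileno())
        pause(f"Remove the {cable} cable, then press Enter: ")
        # rm: this has to be served by the one remaining path.
        os.remove(path)
    except OSError as exc:
        failure = errno.errorcode.get(exc.errno, str(exc.errno))
    # Always ask for the cable back, even after a failure.
    pause(f"Replace the {cable} cable, then press Enter: ")
    return failure


def run_sequence(mount_point, pause=input):
    """Repeat the round for each cable; return [(cable, failure), ...]."""
    return [
        (cable, touch_then_remove(mount_point, name, pause, cable))
        for cable, name in (("first", "1"), ("second", "2"))
    ]
```

Calling `run_sequence("/mnt")` walks through both cables and prompts at
each step. Because writes are cached, a failed path may only show up
later in the kernel log, so a clean result here is a hint and not
proof.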
And I mentioned electrical problems. When we got our servers
installed, the Sun guys told us they'd tripped breakers on the PDU
and/or breakers in the room's electrical cabinet. Since it had a sign
on it saying "100A", I figured we might be running up against power
limits -- either in the room as a whole, if my figures were 'way out,
or on individual PDUs. Turns out I was probably wrong: I missed the
bit on the sign that said 3-phase, which means (deep breath) we
probably have 3 x 100A power available (I think).
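
To put a rough number on "3 x 100A", here is a back-of-the-envelope
sketch in Python. It assumes a balanced wye service with 120V from each
leg to neutral, which the sign did not actually say, and it uses an 80%
allowance for continuous load as a rule of thumb, not something checked
against the local electrical code. The function names are made up for
illustration.

```python
def three_phase_va(volts_line_to_neutral, amps_per_phase):
    """Apparent power (VA) of a balanced wye service: three legs of V x I."""
    return 3 * volts_line_to_neutral * amps_per_phase


def continuous_va(va, percent=80):
    """Derate for continuous load; 80% is an assumed rule of thumb."""
    return va * percent // 100


room_va = three_phase_va(120, 100)  # assumes 120V line-to-neutral legs

print(room_va)                 # 36000
print(continuous_va(room_va))  # 28800
```

So under those assumptions the room has something like 36 kVA on
paper, or a bit under 29 kVA of continuous load, which is a long way
from a single 100A circuit.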
It's more complicated than that, because some of it is in 120V, some
of it is in twist-lock 220V 30A circuits, and so on. But I should've
checked before emailing the faculty member who, in a year or two, will
be going into this room (we're there as guests of the department) and
happens to sit on the facilities committee. He had asked how we were
doing, so I sent him an email -- nice, polite, and including a bit
about how grateful we were for the room and the help of the local
sysadmins (all of which is true).
I was under the impression that he was asking for info now, so that he
could bring it up for action in a few months when we were
out. Instead, two hours later when I'm swearing at multipath, in come
the facilities manager and one of the sysadmins I was dealing with,
looking to find out just how much power we were using anyhow. I
apologized profusely, and they were very cool about it. But when the
committee guy asks questions, people jump. I had not anticipated
this. Welcome to University Politics 101. I emailed again and
explained my mistake.
There are lots of remedial courses I could take. However, today I
would most like to take "Electricity and wiring for sysadmins".
And on another note: Ack! My laptop's home partition is 93% full! How
the hell did that happen?
And again: How did I not know about apt-file? This is perfect!
(Touch o' the hat to Tears For Fears and Steve Kemp; I'm moving
closer every day to switching to Chronicle.)