11 Jan 2011
Xmas vacation is when I get to do big, disruptive maintenance with a
fairly free hand. Here's some of what I did and what I learned this year.
Order of rebooting
I made the mistake of rebooting one machine first: the one that held
the local CentOS mirror. I did this thinking that it would be a good
guinea pig, but then other machines weren't able to fetch updates from
it; I had to edit their repo files. Worse, there was no remote
console on it, and no time (I thought) to take a look.
Automating patching
Last year I tried getting machines to upgrade using Cfengine like so:
centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
"/usr/bin/yum -q -y clean all"
"/usr/bin/yum -q -y upgrade"
"/usr/bin/reboot"
This didn't work well: I hadn't pushed out the changes in advance,
because I was paranoid that I'd miss something. When I did push it
out, all the machines hit on the cfserver at the same time (more or
less) and didn't get the updated files because the server was refusing
connections. I ended up doing it by hand.
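The thundering-herd part, at least, is what Cfengine's splay setting is meant to handle. A minimal sketch, assuming Cfengine 2's cfagent.conf control section (the value is in minutes):
control:
    SplayTime = ( 10 )   # each host waits a pseudo-random 0-10 minutes before contacting the server
That would have spread the load on the cfserver without my staggering anything by hand.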
This year I pushed out the changes in advance, but it still didn't
work because of the problems with the repo. I ran cssh, edited the
repos file and updated by hand.
This worked okay, but I had to do the machines in separate batches --
some needed to have their firewall tweaked to let them reach a mirror
in the first place, some I wanted to watch more carefully, and so
on. That meant going through a list of machines, trying to figure out
if I'd missed any, adding them by hand to cssh sessions, and so on.
- Lesson: I need a better way of doing this.
- Lesson: I need a way to check whether updates are needed.
I may need to give in and look at RHEL, or perhaps func or better
Cfengine tweaking will do the job.
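As for checking whether updates are needed: one low-tech option is a shell loop over a host list. A sketch only -- the host list path is made up, and it relies on "yum check-update" exiting 100 when updates are pending:
#!/bin/sh
# Report which hosts still have pending updates.
# 'yum check-update' exits 0 when nothing is pending, 100 when updates are available.
for h in $(cat /root/hostlist); do
    if ssh "$h" '/usr/bin/yum -q check-update >/dev/null 2>&1; test $? -eq 100'; then
        echo "$h: updates pending"
    fi
done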
Staggering reboots
Quick and dirty way to make sure you don't overload your PDUs:
sleep $(expr $RANDOM / 200 ) && reboot
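$RANDOM tops out at 32767, so dividing by 200 means each box waits up to about 163 seconds. To spread the reboots over a wider window -- say, ten minutes -- something like this would do:
sleep $(( RANDOM % 600 )) && reboot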
Remote consoles
Rebooting one server took a long time because the ILOM was not working
well, and had to be rebooted itself.
- Lesson: I need to test the SP before doing big upgrades; the simplest way of doing this may just be rebooting them.
Upgrading the database servers w/the 3 TB arrays took a long time:
stock MySQL packages conflicted with the official MySQL rpms, and
fscking the arrays takes maybe an hour -- and there's no sign of
life on the console while you're doing it. Problems with one machine's
ILOM meant I couldn't even get a console for it.
- Lesson: Again, make sure the SP is okay before doing an upgrade.
- Lesson: Fscking a few TB will take an hour with ext3.
- Lesson: Start the console session on those machines before you reboot, so that you can at least see the progress of the boot messages up until the time it starts fscking.
- Lesson: Might be worth editing fstab so that they're not mounted at boot time; you can fsck them manually afterward. However, you'll need to remember to edit fstab again and reboot (just to make sure)...this may be more trouble than it's worth.
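An alternative I might try instead: leave the arrays in fstab but set the fsck pass number (the last field) to 0, so boot doesn't block on the check, and fsck by hand afterward. A sketch, with a made-up device and mount point:
# /etc/fstab -- a 0 in the last field skips the boot-time fsck; the mount still happens
/dev/sdb1   /data   ext3   defaults   0 0
# later, at leisure:
#   umount /data && fsck -C -f /dev/sdb1 && mount /data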
OpenSuSE
Holy mother of god, what an awful time this was. I spent eight hours
on upgrades for just nine desktop machines. Sadly, most of it was my
fault, or at least bad configuration:
- Two of the machines were running OpenSuSE 11.1; the rest were
running 11.2. The latter lets you upgrade to the latest release
from the command line using "zypper dist-upgrade"; the former does
not, and you need to run over with a DVD to upgrade them.
- By default, zypper fetches one package, installs it, then fetches
the next. I'm not certain, but I think that means
there's a lot more TCP overhead and less chance to ratchet up the
speed. Sure as hell seemed slow downloading 1.8GB x 9 machines this
way.
- Graphics drivers: awful. Four different versions, and I'd used the
local install scripts rather than creating an RPM and installing
that. (Though to be fair, that would just rebuild the driver from
scratch when it was installed, rather than do something sane like
build a set of modules for a particular kernel.) And I didn't
figure out where the uninstall script was 'til 7pm, meaning lots of
fun trying to figure out why the hell one machine wouldn't start X.
- Lesson: This really needs to be automated.
- Lesson: The ATI uninstall script is at /usr/share/ati/fglrx-uninstall.sh. Use it.
- Lesson: Next time, uninstall the driver and build a goddamn RPM.
- Lesson: A better way of managing xorg.conf would be nice.
- Lesson: Look for prefetch options for zypper (see the sketch after this list). And start a local mirror.
- Lesson: Pick a working version of the driver, and commit that fucker to Subversion.
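On the zypper prefetch lesson: I haven't tried it yet, but newer zypper releases are supposed to support a download mode that fetches everything before installing anything. A sketch -- check your version's man page before trusting it:
# per run:
zypper dup --download in-advance
# or make it the default in /etc/zypp/zypp.conf:
#   commit.downloadMode = DownloadInAdvance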
Special machines
These machines run some scientific software: one master, three slaves.
When the master starts up at boot time, it tries to SSH to the slaves
to copy over the binary. There appears to be no, or poor, rate
throttling; if the slaves are not available when the master comes up,
you end up with the following symptoms:
- Lots of SSH/scp processes on the master
- Lots of SSH/scp processes on the slave (if it's up)
- If you try to run the slave binary on the slave, you get errors like
"lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)" (from strace) or
"Text file busy" (ETXTBSY, from running it in the shell).
The problem is that umpty scp processes on the slave are holding open
the binary, and the kernel gets confused trying to run it.
- Lesson: Bring up the slaves first, then bring up the master.
- Lesson: There are lots of interesting and obscure Unix errors.
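If it happens again, a quick way to confirm what's pinning the binary (the path here is hypothetical):
/sbin/fuser -v /path/to/slave_binary    # lists the scp processes holding it open
lsof /path/to/slave_binary              # same idea, more detail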
I also ran into problems with a duff cable on the master; confusingly,
both the kernel and the switch said it was still up. This took a
while to track down.
- Lesson: Network cables are surprisingly fragile at the connection
with the jack.
Virtual Machines
It turned out that a couple of my kvm-based VMs did not have jumbo
frames turned on. I had to use virt-manager to shut down the
machines, switch their NICs over to virtio drivers, then reboot. However, kudzu
on the VMs then saw these as new interfaces and did not configure them
correctly. This caused problems because the machines were LDAP
clients and hung when the network was unavailable.
- Lesson: To get around this, go into single-user mode and copy
/etc/sysconfig/network-scripts/ifcfg-eth0.bak to ifcfg-eth0. (A
sketch of that file follows below.)
- Lesson: Be sure you're monitoring everything in Nagios; it's a
sysadmin's regression test.
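For the record, a minimal sketch of what the restored ifcfg-eth0 contains on one of these CentOS KVM guests -- the addresses below are placeholders:
# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
MTU=9000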
Tags:
work
cfengine
jumboframes
rant
toptip
mysql
05 Jan 2011
In the spirit of Chris Siebenmann, and to kick off the new year,
here's a post that's partly documentation for myself and partly an
attempt to ensure I do it right: how I manage my tasks using Org Mode,
The Cycle, my Daytimer and Request Tracker.
- Org Mode is awesome, even more awesome than the window manager. I
love Emacs and I love Org Mode's flexibility.
- Tom Limoncelli's "Time Management for System Administrators."
Really, I shouldn't have to tell you this.
- DayTimer: because I love paper and pen. It's instant boot time,
and it's maybe $75 to replace (and durable) instead of $500 (and
delicate). And there is something so satisfying about crossing
off an item on a list; C-c C-c just isn't as fun.
- RT: Email, baby. Problem? Send an email. There's even
rt-liberation for integration with Emacs (and probably Org Mode,
though I haven't done that yet).
So:
Problems that crop up, I email to RT -- especially if I'm not
going to deal with them right away. This is perfect for when you're
tackling one problem and you notice something else non-critical.
Thus, RT is often a global todo list for me.
If I take a ticket in RT (I'm a shop of one), that means I'm
planning to work on it in the next week or so.
Planning for projects, or keeping track of time spent on various
tasks or for various departments, is kept in Org Mode. I also use
it for things like end-of-term maintenance lists. (I work at a
university.) It's plain text, I check it into SVN nightly, and
Emacs is The One True Editor.
My DayTimer is where I write down what I'm going to do today, or
that appointment I've got in two weeks at 3pm. I carry it
everywhere, so I can always check before making a commitment. (That
bit sampled pretty much directly from TL.)
Every Monday (or so; sometimes I get delayed) I look through things
to see what has to be done:
- Org mode for projects or next-step sorta things
- RT for tickets that are active
- DayTimer for upcoming events
I plan out my week. "A" items need to be done today; "B" items
should be done by the end of the week; "C" items are done if I have
time.
Once every couple of months, I go through RT and look at the list of
tickets. Sometimes things have been done (or have become
irrelevant) and can be closed; sometimes they've become more
important and need to be worked.
I try to plan out what I want to get done in the term ahead at the
beginning of the term, or better yet just before the term starts;
often there are new people starting with a new term, and it's always
a bit hectic.
Tags:
work
23 Dec 2010
This took me a while to figure out. (All my war stories start with
that sentence...)
A faculty member is getting a new cluster next year. In the meantime,
I've been setting up Rocks on a test bed of older machines to get
familiar with it. This week I've been working out how Torque, Maui
and MPI work, and today I tried running something non-trivial.
CHARMM is used for molecular simulations; it's mostly (I think)
written in Fortran and has been around since the 80s. It's not the
worst-behaved scientific program I've had to work with.
I had an example script from the faculty member to run. I was able to
run it on the head node of the cluster like so:
mpirun -np 8 /path/to/charmm < stream.inp > out ZZZ=testscript.inp
8 CHARMM processes still running after, like, 5 days. (These things
run forever, and I got distracted.) Sweet!
Now to use the cluster the way it was intended: by running the
processes on the internal nodes. Just a short script and away we go:
$ cat test_mpi_charmm.sh
#PBS -N test_charmm
#PBS -S /bin/sh
#PBS -N mpi_charmm
#PBS -l nodes=2:ppn=4
. /opt/torque/etc/openmpi-setup.sh
mpirun /path/to/charmm < stream.inp ZZZ=testscript.inp
$ qsub test_mpi_charmm.sh
But no, it wasn't working. The error file showed:
At line 3211 of file ensemble.f
Fortran runtime error: No such file or directory
mpirun has exited due to process rank 0 with PID 9494 on
node compute-0-1.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
Well, that's helpful...but the tail of the output file showed:
CHARMM> ensemble open unit 19 read card name -
CHARMM> restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
Parameter: FILEROOT -> "TEST_RUN"
Parameter: PREV -> "FOO"
Parameter: NREP -> "1"
Parameter: NODE -> "0"
ENSEMBLE> REPLICA NODE 0
ENSEMBLE> OPENING FILE restart/test_run_foo_nr1_nd0
ENSEMBLE> ON UNIT 19
ENSEMBLE> WITH FORMAT FORMATTED AND ACCESS READ
What the what now?
Turns out CHARMM has the ability to checkpoint work as it goes along,
saving its work in a restart file that can be read when starting up
again. This is a Good Thing(tm) when calculations can take weeks and
might be interrupted. From the charmm docs, the restart-relevant command is:
IUNREA -1 Fortran unit from which the dynamics restart file should
be read. A value of -1 means don't read any file.
(I'm guessing a Fortran unit is something like a file descriptor;
haven't had time to look it up yet.)
The name of the restart file is set in this bit of the test script:
iunrea 19 iunwri 21 iuncrd 20
Next is this bit:
ensemble open unit 19 read card name -
"restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
An @ sign indicates a variable, it seems. And it's Fortran, and
Fortran's been around forever, so it's case-insensitive. So the
restart file is being set to
"@FILEROOT_@PREV_nr@NREP_nd@NODE.rst". Snipping from the input file,
here are where the variables are set:
set fileroot test
set prev minim
set node ?whoiam
set nrep ?nensem
test" appears to be just a string. I'm assuming "minim" is some kind
of numerical constant. But "whoiam" and "nensem" are set by MPI and
turned into CHARMM variables. From charmm's documentation:
The CHARMM run is started using MPI commands to specify the number of processes
(replicas) to use, each of which is an identical copy of charmm. This number
is automatically passed by MPI to each copy of the executable, and it is set to
the internal CHARMM variable 'nensem', which can be used in scripts, e.g.
The other internal variable set automatically via MPI is 'whoiam', e.g.
These are useful for giving different file names to different nodes.
So remember the way charmm was being invoked in the two jobs? The way it worked:
mpirun -np 8 /path/to/charmm < stream.inp > out ZZZ=testscript.inp
...and the way it didn't:
mpirun /path/to/charmm < stream.inp ZZZ=testscript.inp
Aha! Follow the bouncing ball:
- The input script wants to load a checkpoint file...
- ...which is named after the number of processes mpi was told to run...
- ...and the script barfs if it's not there.
At first I thought that I could get away with increasing the number of
copies of charmm that would run by fiddling with
Torque's server_priv/nodes file -- telling it that the nodes had 4
processors each (so total of 8) rather than 2. (These really are old
machines.) And then I'd change the PBS line to "-l nodes=2:ppn=4",
and we're done! Hurrah! Except no: I got the same error as before.
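For reference, that file is just a list of node names and processor counts; a sketch with made-up node names (pbs_server needs a restart after editing it):
# $TORQUE_HOME/server_priv/nodes
compute-0-0 np=4
compute-0-1 np=4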
What does work is changing the mpirun args in the qsub file:
However, what that does is run 8 copies on one compute node --
which works, hurrah, but it's not what I think we want:
- many copies on many nodes
- communicating as necessary
I think this is a problem for the faculty member to solve, though.
It's taken me a whole day to figure this out, and I'm pretty sure I
wouldn't understand the implications of just (say) deking out the bit
that looks for a restart file. (Besides, there are many such bits.)
(Oh, and incidentally, just moving the files out of the way doesn't
help...still barfs and dies.) I'll email him about this and let him
deal with it.
Tags:
rocks
warstory
cluster
14 Dec 2010
Today I was adding a couple network cards to a new firewall...and
realized that I had no idea if I had any screws to hold them in
place. It's been a long time since I've opened up a machine to
replace a part, and a longer time since I've been doing it on a
regular basis.
Today, for example: drafting an RFP and going over acceptance testing
with the PI; decoding bits of protein modelling procedures so I can
understand what he's doing; reading up on Torque, Maui and trying to
figure out the difference between OpenMP and OpenMPI.
Five years ago...no, longer, six, seven even: debugging crappy DLink
switches; getting workstations from auction; blowing dust bunnies out
of the central fileserver.
Tags:
10 Dec 2010
As mentioned on a mailing list recently: Apache Internals and
Debugging.
Tags:
08 Dec 2010
My presentation on Cfengine 3 went pretty well yesterday. There were
about 20 people there...I had been hoping for more, but that's a
pretty good turnout. I was a little nervous beforehand, but I think I
did okay during the talk. (I recorded it -- partly to review
afterward, partly 'cos my dad wanted to hear the talk. :-)
One thing that did trip me up a bit was the questions from one person in
the audience that went fairly deep into how to use Cfengine, what its
requirements were and so on. Since this was meant to be an
introduction and I only had an hour, I wasn't prepared for this.
Also, the questions went on...and on...and I'm not good at taking
charge of a conversation to prevent it being hijacked. The questions
were good, and though he and I disagree on this subject I respect his
views. It's just that it really threw off my timing, and would have
been best left for after. Any tips?
At some point I'm going to put up more on Cf3 that I couldn't really
get into in the talk -- how it compares to Cf2, some of the (IMHO)
shortfalls, and so on.
Tags:
cfengine
05 Dec 2010
I just got in this morning from seeing Saturn for the first time ever
through a telescope. It was through my Galileoscope, a cheap
(but decent!) 50mm refractor.
I've taken up astronomy again for the first time since I was a kid,
and it's been a lot of fun. I've been heading out nights with the
scope, some binoculars and sky maps to learn the sky. So far
through the scope I've seen Jupiter, the Moon, Albireo, and
Venus. (Venus was yesterday, when I was trying to find Saturn...)
Saturn, though...that was something else. It was small, but I could
distinctly see the rings. It was absolutely breathtaking. I've
wanted to see Saturn through a scope for a long, long time, and it was
incredible to finally do so.
Tags:
astronomy
galileoscope
02 Dec 2010
Bad: Sorry, but can we have a budget by Monday? New rules mean we
need to get your budget approved by this time next week, instead of
four months from now, so I need to look over it by Monday. (To be
fair, he was quite apologetic.)
Good: Seeing the crescent moon and Venus (or Spica? I'll never know)
peeking through the clouds on the way to work this morning.
Tags:
mrsgarrettwasright
astronomy
01 Dec 2010
So we want a cluster at $WORK. I don't know a lot about this, so I
figure that something like Rocks or OSCAR is the way to go.
OSCAR didn't look like it had been worked on in a while, so Rocks
it is. I downloaded the CDs and got ready to install on a handful of
old machines.
(Incidentally, I was a bit off-base on OSCAR. It is being worked on,
but the last "production-ready" release was version 5.0, in 2006. The
newest release is 6.0.5, but as the documentation says:
Note that the OSCAR-6.0.x version is not necessarily suitable for
production. OSCAR-6.0.x is actually very similar to KDE-4.0.x: this
version is not necessarily "designed" for the users who need all the
capabilities traditionally shipped with OSCAR, but this is a good new
framework to include and develop new capabilities and move forward. If
you are looking for all the capabilities normally supported by OSCAR,
we advice you to wait for a later release of OSCAR-6.1.
So yeah, right now it's Rocks.)
Rocks promises to be easy: it installs a frontend, then that frontend
installs all your compute nodes. You install different rolls:
collections of packages. Everything is easy. Whee!
Only it's not that way, at least not consistently.
I'm reinstalling this time because I neglected to install the Torque
roll last time. In theory you can install a roll to an
already-existing frontend; I couldn't get it to work.
A lot of stuff -- no, that's not true. Some stuff in Rocks is
just not documented very well, and it's the little but important and
therefore irritating-when-it's-missing stuff. For example: want the
internal compute nodes to be LDAP clients, rather than syncing
/etc/passwd all around? That means modifying /var/411/Files.mk to
include things like /etc/nsswitch.conf and /etc/ldap.conf. That's
documented in the 4.x series, but it has been left out of the
5.x series. I can't tell why; it's still in use.
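A sketch of the sort of addition I mean -- double-check the variable name against your own /var/411/Files.mk, since it shifts a bit between Rocks releases:
# /var/411/Files.mk on the frontend: push the LDAP client config out to the nodes
FILES += /etc/ldap.conf
FILES += /etc/nsswitch.conf
# then, on the frontend:
#   make -C /var/411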
When you boot from the Rocks install CD, you're directed to type
"Build" (to build a new frontend) or "Rescue" (to go into rescue
mode). What it doesn't tell you is that if you don't type in
something quickly enough, it's going to boot into a
regular-looking-but-actually-non-functional CentOS install and after
a few minutes will crap out, complaining that it can't find certain
files -- instead of either booting from the hard drive or waiting
for you to type something. You have to reboot again in order to get
another chance.
Right now I'm reinstalling the front end for the THIRD TIME in two
days. For some reason, the installation is crapping out and refusing
to store the static IP address for the outward-facing interface of the
front end. Reinstalling means sitting in the server room feeding CDs
(no network installation without an already-existing front end)
into old servers (which have no DVD drives) for an hour, then waiting
another half hour to see what's gone wrong this time.
Sigh.
Tags:
rocks
cluster
12 Nov 2010
How many times have I tried
Just to get away from you, and you reel me back?
How many times have I lied
That there's nothing that I can do?
Friday morning started with a quick look at Telemundo ("PRoxima:
Esclavas del sexo!" -- roughly, "Coming up: Sex slaves!"), then a walk
to Phillz coffee. This time I got the Tesora blend (their hallmark)
and wow, that's good coffee. Passed a woman pulling two tiny dogs across the street:
"C'mon, peeps!" Back at the tables I checked my email and got an
amazing bit of spam about puppies, and how I could buy some rare
breeds for ch33p.
First up was the Dreamworks talk. But before that, I have to relate
something.
Earlier in the week I ran into Sean Kamath, who was giving the
talk, and told him it looked interesting and that I'd be sure to be
there. "Hah," he said, "Wanna bet? Tom Limoncelli's talk is opposite
mine, and EVERYONE goes to a Tom Limoncelli talk. There's gonna be no
one at mine."
Then yesterday I happened to be sitting next to Tom during a break,
and he was discussing attendance at the different presentations.
"Mine's tomorrow, and no one's going to be there." "Why not?"
"Mine's opposite the Dreamworks talk, and EVERYONE goes to Dreamworks
talks."
Both were quite amused -- and possibly a little relieved -- to learn
what the other thought.
But back at the ranch: FIXME in 2008, Sean gave a talk on
Dreamworks and someone asked afterward "So why do you use NFS anyway?"
This talk was meant to answer that.
So, why? Two reasons:
- They use lots of local caching (their filers come from NetApp, and
they also have a caching box), a global namespace, data hierarchy
(varying on the scales of fast, reliable and expensive), leverage the
automounter to the max, and 10GB core links everywhere, and it works.
- What else are you gonna use? Hm? FTP/rcp/rdist? Nope. SSH? Won't
handle the load. AFS lacks commercial support -- and it's hard to get
the head of a billion-dollar business to buy into anything without
commercial support.
They cache for two reasons: global availability and scalability.
First, people in different locations -- like on different sides of the
planet (oh, what an age we live in!) -- need access to the same files.
(Most data has location affinity, but this will not necessarily be
true in the future.) Geographical distribution and the speed of light
do cause some problems: while data reads and getattr() are helped a
lot by the caches, first open, sync()s and writes are slow when the
file is in India and it's being opened in Redwood. They're thinking
about improvements to the UI to indicate what's happening to reduce
user frustration. But overall, it works and works well.
Scalability is just as important: thousands of machines hitting the
same filer will melt it, and the way scenes are rendered, you will
have just that situation. Yes, it adds latency, but it's still faster
than an overloaded filer. (It also requires awareness of
close-to-open consistency.)
Automounter abuse is rampant at DW; if one filer is overloaded, they
move some data somewhere else and change the automount maps. (They're
grateful for the automounter version in RHEL 5: it no longer requires
that the node be rebooted to reload the maps.) But like everything
else it requires a good plan, or it gets confusing quickly.
Oh, and quick bit of trivia: they're currently sourcing workstations
with 96GB of RAM.
One thing he talked about was that there are two ways to do sysadmin:
rule-enforcing and policy-driven ("No!") or creative, flexible
approaches to helping people get their work done. The first is
boring; the second is exciting. But it does require careful
attention to customers' needs.
So for example: the latest film DW released was "Megamind". This
project was given a quota of 85 TB of storage; they finished the
project with 75 TB in use. Great! But that doesn't account for
35 TB of global temp space that they used.
When global temp space was first brought up, the admins said, "So let
me be clear: this is non-critical and non-backed up. Is that okay
with you?" "Oh sure, great, fine." So the admins bought
cheap-and-cheerful SATA storage: not fast, not reliable, but man it's
cheap.
Only it turns out that non-backed up != non-critical. See, the
artists discovered that this space was incredibly handy during
rendering of crowds. And since space was only needed overnight, say,
the space used could balloon up and down without causing any long-term
problems. The admins discovered this when the storage went down for
some reason, and the artists began to cry -- a day or two of
production was lost because the storage had become important to one
side without the other realizing it.
So the admins fixed things and moved on, because the artists need to
get things done. That's why he's there. And if he does his job
well, the artists can do wonderful things. He described watching
"Madegascar", and seeing the crowd scenes -- the ones the admins and
artists had sweated over. And they were good. But the rendering of
the water in other scenes was amazing -- it blew him away, it was so
realistic. And the artists had never even mentioned that; they'd just
made magic.
Understand that your users are going to use your infrastructure in
ways you never thought possible; what matters is what gets put on the
screen.
Challenges remain:
Sometimes data really does need to be at another site, and caching
doesn't always prevent problems. And problems in a data render farm
(which is using all this data) tend to break everything else.
Much needs to be automated: provisioning, re-provisioning and
allocating storage is mostly done by hand.
Disk utilization is hard to get in real time with > 4 PB of storage
world wide; it can take 12 hours to get a report on usage by
department on 75 TB, and that doesn't make the project managers
happy. Maybe you need a team for that...or maybe you're too busy
recovering from knocking over the filer by walking 75 TB of data to
get usage by department.
Notifications need to be improved. He'd love to go from "Hey, a
render farm just fell over!" to "Hey, a render farm's about to fall
over!"
They still need configuration management. They have a homegrown one
that's working so far. QOTD: "You can't believe how far you can
get with duct tape and baling wire and twine and epoxy and post-it
notes and Lego and...we've abused the crap out of free tools."
I went up afterwards and congratulated him on a good talk; his passion
really came through, and it was amazing to me that a place as big as
DW uses the same tools I do, even if it is on a much larger scale.
I highly recommend watching his talk (FIXME: slides only for now).
Do it now; I'll be here when you get back.
During the break I got to meet Ben Rockwood at last. I've
followed his blog for a long time, and it was a pleasure to talk with
him. We chatted about Ruby on Rails, Twitter starting out on Joyent,
upcoming changes in Illumos now that they've got everyone from Sun but
Jonathan Schwartz (no details except to expect awesome and a renewed
focus on servers, not desktops), the joke that Joyent should just come
out with it and call itself "Sun". Oh, and Joyent has an office in
Vancouver. Ben, next time you're up drop me a line!
Next up: Twitter. 165 million users, 90 million tweets per day, 1000
tweets per second....unless the Lakers win, in which case it peaks at
3085 tweets per second. (They really do get TPS reports.) 75% of
those are by API -- not the website. And that percentage is
increasing.
Lessons learned:
Nothing works the first time; scale using the best available tech
and plan to build everything more than once.
(Cron + ntp) x many machines == enough load on, say, the central
syslog collector to cause micro outages across the site. (Oh, and
speaking of logging: don't forget that syslog truncates messages >
MTU of packet.)
RRDtool isn't good for them, because by the time you want to figure
out what that one minute outage was about two weeks ago, RRDtool has
averaged away the data. (At this point Tobi Oetiker, a few seats
down from me, said something I didn't catch. Dang.)
Ops mantra: find the weakest link; fix; repeat. OPS stats: MTTD
(mean time to detect problem) and MTTR (MT to recover from problem).
It may be more important to fix the problem and get things going
again than to have a post-mortem right away.
At this scale, at this time, system administration turns into a
large programming project (because all your info is in your
config. mgt tool, correct?). They use Puppet + hundreds of Puppet
modules + SVN + post-commit hooks to ensure code reviews.
Occasionally someone will make a local change, then change
permissions so that Puppet won't change it. This has led to a
sysadmin mantra at Twitter: "You can't chattr +i with broken
fingers."
Curve fitting and other basic statistical tools can really help --
they were able to predict the Twitpocalypse (first tweet ID > 2^32)
to within a few hours.
Decomposition is important to resiliency. Take your app and break
it into n different independent, non-interlocked services. Put each
of them on a farm of 20 machines, and now you no longer care if a
machine that does X fails; it's not the machine that does X.
Because of this Nagios was not a good fit for them; they don't want
to be alerted about every single problem, they want to know when 20%
of the machines that do X are down.
Config management + LDAP for users and machines at an early, early
stage made a huge difference in ease of management. But this was
a big culture change, and management support was important.
And then...lunch with Victor and his sister. We found Good
Karma, which had really, really good vegan food. I'm definitely
a meatatarian, but this was very tasty stuff. And they've got good
beer on tap; I finally got to try Pliny the Elder, and now I know
why everyone tries to clone it.
Victor talked about one of the good things about config mgt for him:
yes, he's got a smaller number of machines, but when he wants to set
up a new VM to test something or other, he can get that many more
tests done because he's not setting up the machine by hand each time.
I hadn't thought of this advantage before.
After that came the Facebook talk. I paid a little less attention to
this, because it was the third ZOMG-they're-big talk I'd been to
today. But there were some interesting bits:
Everyone talks about avoiding hardware as a single point of failure,
but software is a single point of failure too. Don't compound
things by pushing errors upstream.
During the question period I asked them if it would be totally crazy
to try different versions of software -- something like the security
papers I've seen that push web pages through two different VMs to
see if any differences emerge (though I didn't put it nearly so
well). Answer: we push lots of small changes all the time for
other reasons (problems emerge quickly, so easier to track down), so
in a way we do that already (because of staged pushes).
Because we've decided to move fast, it's inevitable that problems
will emerge. But you need to learn from those problems. The
Facebook outage was an example of that.
Always do a post-mortem when problems emerge, and if you focus on
learning rather than blame you'll get a lot more information,
engagement and good work out of everyone. (And maybe the lesson
will be that no one was clearly designated as responsible for X, and
that needs to happen now.)
The final speech of the conference was David Blank-Edelman's keynote
on the resemblance between superheroes and sysadmins. I watched for a
while and then left. I think I can probably skip closing keynotes in
the future.
And then....that was it. I said goodbye to Bob the Norwegian and
Claudio, then I went back to my room and rested. I should have slept
but I didn't; too bad, 'cos I was exhausted. After a while I went out
and wandered around San Jose for an hour to see what I could see.
There was the hipster cocktail bar called "Cantini's" or something;
billiards, flood pants, cocktails, and the sign on the door saying "No
tags -- no colours -- this is a NEUTRAL ZONE."
I didn't go there; I went to a generic looking restaurant with room at the bar.
I got a beer and a burger, and went back to the hotel.
Tags:
lisa
scaryvikingsysadmins
11 Nov 2010
I missed my chance, but I think I'm gonna get another...
-- Sloan
Thursday morning brought Brendan Gregg's (nee Sun, then Oracle, and
now Joyent) talk about data visualization. He introduced himself as
the shouting guy, and talked about how heat maps allowed him so
see what the video demonstrated in a much more intuitive way. But
in turn, these require accurate measurement and quantification of
performance: not just "I/O sucks" but "the whole op takes 10 ms, 1 of
which is CPU and 9 of which is latency."
Some assumptions to avoid when dealing with metrics:
The available metrics are correctly implemented. Are you sure
there's not a kernel bug in how something is measured? He's come
across them.
The available metrics are designed by performance experts. Mostly,
they're kernel developers who were trying to debug their work, and
found that their tool shipped.
The available metrics are complete. Unless you're using DTrace, you
simply won't always find what you're looking for.
He's not a big fan of using IOPS to measure performance. There are a
lot of questions when you start talking about IOPS. Like what layer?
- app
- library
- sync call
- VFS
- filesystem
- RAID
- device
(He didn't add political and financial, but I think that would have
been funny.)
Once you've got a number, what's good or bad? The number can change
radically depending on things like library/filesystem prefetching or
readahead (IOPS inflation), read caching or write cancellation
(deflation), the size of a read (he had an example demonstrating how
measured capacity/busy-ness changes depending on the size of
reads)...probably your company's stock price, too. And iostat or your
local equivalent averages things, which means you lose
outliers...and those outliers are what slow you down.
IOPS and bandwidth are good for capacity planning, but latency is a
much better measure of performance.
And what's the best way of measuring latency? That's right,
heatmaps. Coming from someone who worked on Fishworks, that's
not surprising, but he made a good case. It was interesting to see
how it's as much art as science...and given that he's exploiting the
visual cortex to make things clear that never were, that's true in a
few different ways.
This part of the presentation was so visual that it's best for you to
go view the recording (and anyway, my notes from that part suck).
During the break, I talked with someone who had worked at Nortel
before it imploded. Sign that things were going wrong: new execs
come in (RUMs: Redundant Unisys Managers) and alla sudden everyone is
on the chargeback model. Networks charges ops for bandwidth; ops
charges networks for storage and monitoring; both are charged by
backups for backups, and in turn are charged by them for bandwidth
and storage and monitoring.
The guy I was talking to figured out a way around this, though.
Backups had a penalty clause for non-performance that no one ever took
advantage of, but he did: he requested things from backup and proved
that the backups were corrupt. It got to the point where the backup
department was paying his department every month. What a clusterfuck.
After that, a quick trip to the vendor area to grab stickers for the
kids, then back to the presentations.
Next was the 2nd day of Practice and Experience Reports ("Lessons
Learned"). First up was the network admin (?) for ARIN about
IPv6 migration. This was interesting, particularly as I'd naively
assumed that, hey, they're ARIN and would have no problems at all on
this front...instead of realizing that they're out in front to take a
bullet for ALL of us, man. Yeah. They had problems, they screwed up
a couple times, and came out battered but intact. YEAH!
Interesting bits:
Routing is not as reliable, not least because for a long time (and
perhaps still) admins were treating IPv6 as an experiment, something
still in beta: there were times when whole countries in Europe
would disappear off the map for weeks as admins tried out different
things.
Understanding ICMPv6 is a must. Naive assumptions brought over from
IPv4 firewalls like "Hey, let's block all ICMP except ping" will
break things in wonderfully subtle ways. Like: it's up to the
client, not the router, to fragment packets. That means the
client needs to discover the path MTU, and that depends on ICMPv6.
(See the firewall sketch after this list.)
Not all transit is equal; ask your vendor if they're using the same
equipment to route both protocols, or if IPv6 is on the old, crappy
stuff they were going to eBay. Ask if they're using tunnels;
tunnels aren't bad in themselves, but can add multiple layers to
things and make things trickier to debug. (This goes double if
you've decided to firewall ICMPv6...)
They're very happy with OpenBSD's pf as an IPv6 firewall.
Dual-stack OSes make things easy, but can make policy complex. Be
aware that DHCPv6 is not fully supported (and yes, you need it to
hand out things like DNS and NTP), and some clients (believe he said
XP) would not do DNS lookups over v6 -- only v4, though they'd
happily go to v6 servers once they got the DNS records.
IPv6 security features are a double-edged sword: yes, you can set
up encrypted VPNs, but so can botnets. Security vendors are behind
on this; he's watching for neat tricks that'll allow you to figure
out private keys for traffic and thus decrypt eg. botnet C&C, but
it's not there yet. (My notes are fuzzy on this, so I may have it
wrong.)
Multicast is an attack-and-discovery protocol, and he's a bit
worried about possible return of reflection attacks (Smurfv6). He's
hopeful that the many, many lessons learned since then mean it won't
happen, but it is a whole new protocol and set of stacks for baddies
to explore and discover. (RFC 4942 was apparently
important...)
Proxies are good for v4-only hosts: mod_proxy, squid and
6tunnel have worked well (6tunnel in particular).
Gotchas: reverse DNS can be painful, because v6 macros/generate
statements don't work in BIND yet; IPv6 takes precedence in most
cases, so watch for SSH barfing when it suddenly starts seeing new
hosts.
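As promised above, a sketch of the minimal ICMPv6 exceptions for an ip6tables firewall -- illustrative only, not a complete ruleset:
ip6tables -A INPUT -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbour-solicitation -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbour-advertisement -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type echo-request -j ACCEPT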
Next up: Internet on the Edge, a good war story about bringing
wireless through trees for DARPA that won best PER. Worth watching.
(Later on, I happened across the person who presented and his boss in
the elevator, and I congratulated him on his presentation. "See?"
says his boss, and digs him in the ribs. "He didn't want to present
it.")
Finally there was the report from one of the admins who helped set up
Blue Gene 6, purchased from IBM. (The speaker was much younger
than the others: skinny, pale guy with a black t-shirt that said GET
YOUR WAR ON. "If anyone's got questions, I'm into that...") This
report was extremely interesting to me, especially since I've got an
upcoming purchase for a (much, much smaller) cluster coming up.
Blue Gene is a supercomputer with something like 10k nodes, and it
uses 10 Gb/s Myrinet/Myricom (FIXME: Clarify which that is)
cards/network for communication. Each node does source routing, and
so latency is extremely low, throughput correspondingly high, and core
routers correspondingly simple. To make this work, every card needs
to have a map of the network so they know where to send stuff, and
that map needs to be generated by a daemon that then distributes the
map everywhere. Fine, right? Wrong:
The Myricom switch is admin'd by a web interface only: no CLI of
any sort, no logging to syslog, nothing. Using this web interface
becomes impractical when you've got thousands of nodes...
There's an inherent fragility in this design: a problem with a card
means you need to turn off the whole node; a problem with the
mapping daemon means things can get corrupt real quick.
And guess what? They had problems with the cards: a bad batch of
transceivers meant that, over the 2-year life of the machine, they
lost a full year's worth of computing. It took a long time to
realize the problem, it took a long time to get the vendor to realize
it, and it took longer to get it fixed (FIXME: Did he ever get it
fixed?)
So, lessons learned:
Vendor relations should not start with a problem. If the first time
you call them up is to say "Your stuff is breaking", you're doomed.
Dealing with vendor problems calls for social skills first, and tech
skills second. Get to know more than just the sales team; get familiar
with the tech team before you need them.
Know your systems inside and out before they break; part of their
problem was not being as familiar with things as they should have
been.
Have realistic expectations when someone says "We'll give you such a
deal on this equipment!" That's why they went w/Myricom -- it was
dirt cheap. They saved money on that, but it would have been better
spent on hiring more people. (I realize that doesn't exactly make
sense, but that's what's in my notes.)
Don't pay the vendor 'til it works. Do your acceptance testing, but
be aware of subcontractor relations. In this case, IBM was
providing Blue Gene but had subcontracted Myricom -- and already
paid them. Oops, no leverage. (To be fair, he said that Myricom
did help once they were convinced...but see the next point.)
Have an agreement in advance with your vendor about how much
failure is too much. In their case, the failure rate was slow but
steady, and Myricom kept saying "Oh, let's just let it shake out a
little longer..." It took a lot of work to get them to agree to
replace the cards.
Don't let vendors talk to each other through you. In their case,
IBM would tell them something, and they'd have to pass that on to
Myricom, and then the process would reverse. There were lots of
details to keep track of, and no one had the whole picture. Setting
up a weekly phone meeting with the vendors helped immensely.
Don't wait for the vendors to do your work. Don't assume that
they'll troubleshoot something for you.
Don't buy stuff with a web-only interface. Make sure you can
monitor things. (I'm looking at you, Dell C6500.)
Stay positive at all costs! This was a huge, long-running problem
that impaired an expensive and important piece of equipment, and
resisting pessimism was important. Celebrate victories locally;
give positive feedback to the vendors; keep reminding everyone that
you are making progress.
Question from me: How much of this advice depends on being involved
in negotiations? Answer: maybe 50%; acceptance testing is a big part
of it (and see previous comments about that) but vendor relations
is the other part.
I was hoping to talk to the presenter afterward, but it didn't happen;
there were a lot of other people who got to him first. :-) But what I
heard (and heard again later from Victor) confirmed the low opinion of
the Myrinet protocol/cards...man, there's nothing there to inspire
confidence.
And after that came the talk by Adam Moskowitz on becoming a senior
sysadmin. It was a list of (at times strongly) suggested skills --
hard, squishy, and soft -- that you'll need. Overarching all of it
was the importance of knowing the business you're in and the people
you're responsible to: why you're doing something ("it supports the
business by making X, Y and Z easier" is the correct answer; "it's
cool" is not) , explaining it to the boss and the boss' boss,
respecting the people you work with and not looking down on them
because they don't know computers. Worth watching.
That night, Victor, his sister and I drove up to San Francisco to
meet Noah and Sarah at the 21st Amendment brewpub. The drive took two
hours (four accidents on the way), but it was worth it: good beer,
good food, good friends, great time. Sadly I was not able to bring
any back; the Noir et Blanc was awesome.
One good story to relate: there was an illustrator at the party who
told us about (and showed pictures of) a coin she's designing for a
client. They gave her the Three Wolves artwork to put on the
coin. Yeah.
Tags:
lisa
scaryvikingsysadmins
11 Nov 2010

+10 LART of terror. (Quote from Matt.)
Tags:
lisa
scaryvikingsysadmins
10 Nov 2010
I raise my glass to the cut-and-dried,
To the amplified
I raise my glass to the b-side.
-- Sloan, "A-Side Wins"
Tuesday morning I got paged at 4:30am about /tmp filling up on a
webserver at work, and I couldn't get back to sleep after that. I
looked out my window at Venus, Saturn, Spica and Arcturus for a while,
blogged & posted, then went out for coffee. It was cold -- around 4
or 5C. I walked past the Fairmont and wondered at the expensive cars
in their front parking space; I'd noticed something fancy happening
last night, and I've been meaning to look it up.
Two buses with suits pulled up in front of the Convention Centre; I
thought maybe there was going to be a rumble, but they were here for
the Medevice Conference that's in the other half of the Centre.
(The Centre, by the way, is enormous. It's a little creepy to walk
from one end to the other, in this enormous empty marble hall,
followed by Kenny G the whole way.)
And then it was tutorial time: Cfengine 3 all day. I'd really been
looking forward to this, and it was pretty darn good. (Note to
myself: fill out the tutorial evaluation form.) Mark Burgess his own
bad self was the instructor. His focus was on getting things done
with Cfengine 3: start small and expand the scope as you learn more.
At times it dragged a little; there was a lot of time spent on
niceties of syntax and the many, many particular things you can do
with Cf3. (He spent three minutes talking about granularity of time
measurement in Cf3.)
Thus, by the 3rd quarter of the day we were only halfway through his
100+ slides. But then he sped up by popular request, and this was
probably the most valuable part for me: explaining some of the
principles underlying the language itself. He cleared up a lot of
things that I had not understood before, and I think I've got a much
better idea of how to use it. (Good thing, too, since I'm giving a
talk on Cf2 and Cf3 for a user group in December.)
During the break, I asked him about the Community Library. This is a
collection of promises -- subroutines, basically -- that do high-level
things like add packages, or comment-out sections of a file. When I
started experimenting with Cf3, I followed the tutorials and noticed
that there were a few times where the CL promises had changed (new
names, different arguments, etc). I filed a bug and the documentation
was fixed, but this worried me; I felt like libc's printf() had
suddenly been renamed showstuff(). Was this going to happen all the
time?
The answer was no: the CL is meant to be immutable; new features are
appended, and don't replace old ones. In a very few cases, promises
have been rewritten if they were badly implemented in the first place.
At lunch, I listened to some people in Federal labs talk about
supercomputer/big cluster purchases. "I had a thirty-day burnin and
found x, y and z wrong..." "You had 30 days? Man, we only have 14
days." "Well, this was 10 years ago..." I was surprised by this; why
wouldn't you take a long time to verify that your expensive hardware
actually worked?
User pressure is one part; they want it now. But the other part is
management. They know that vendors hate long burn-in periods, because
there's a bunch of expensive shiny that you haven't been paid for yet
getting banged around. So management will use this as a bargaining
chip in the bidding process: we'll cut down burn-in if you'll give us
something else. It's frustrating for the sysadmins; you hope
management knows what they're doing.
I talked with another sysadmin who was in the Cf3 class. He'd
recently gone through the Cf2 -> Cf3 conversion; it took 6 months and
was very, very hard. Cf3 is so radically different from Cf2 that it
took a long time to wrap his head around how it/Mark Burgess thought.
And then they'd come across bugs in documentation, or bugs in
implementation, and that would hold things up.
In fact, version 3.1 has apparently just come out, fixing a bug that
he'd tripped across: inserting a file into the middle of another file
truncated that file. Cf3 would divide the first file in two (as
requested), insert the bit you wanted, then throw away the second half
rather than glom it back on. Whoops.
As a result, they're evaluating Puppet -- yes, even after 6 months of
effort to port...in fact, because it took 6 months of effort to
port. And because Puppet does hierarchical inheritance, whereas Cf3
only does sets and unions of sets. (Which MB says is much more
flexible and simple: do Java class hierarchies really simplify
anything?)
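To make the sets-versus-inheritance point concrete, a minimal Cf3 sketch with made-up group names:
bundle agent groups
{
classes:
    "web_and_db" and => { "webservers", "dbservers" };
    "any_server" or  => { "webservers", "dbservers", "mailservers" };
reports:
  web_and_db::
    "This host is in both the webservers and dbservers sets.";
}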
After all of that, it was time for supper. Matt and I met up with a
few others and headed to The Loft, based on some random tweet I'd
seen. There was a long talk about interviews, and I talked to one of
the people about what it's like to work in a secret/secretive
environment.
Secrecy is something I keep bumping up against at LISAs; there are
military folks, government folks (and not just US), and folks from
private companies that just don't talk a lot about what they do. I'm
very curious about all of this, but I'm always reluctant to ask...I
don't want to put anyone in an awkward spot. OTOH, they're probably
used to it.
After that, back to the hotels to continue the conversation with the
rapidly dwindling supplies of free beer, then off to the Fedora 14 BoF
that I promised Beth Lynn I'd attend. It was interesting,
particularly the mention of Fedora CSI ("Tonight on NBC!"), a set
of CC-licensed system administration documentation. David Nalley
introduced it by saying that, if you change jobs every few years like
he does, you probably find yourself building the same damn
documentation from scratch over and over again. Oh, and the Fedora
project is looking for a sysadmin after burning through the first
one. Interesting...
And then to bed. I'm not getting nearly as much sleep here as I
should.
Tags:
lisa
scaryvikingsysadmins
09 Nov 2010
Growing up was wall-to-wall excitement, but I don't recall
Another who could understand at all...
-- Sloan
Monday: day two of tutorials. I found Beth Lynn in the lobby
and congratulated her on being very close to winning her bet; she's a
great deal closer than I would have guessed. She convinced me to show
up at the Fedora 14 BoF tomorrow.
First tutorial was "NASes for the Masses" with Lee Damon, which was
all about how to do cheap NASes that are "mostly reliable" -- which
can be damn good if your requirements are lower, or your budget
smaller. You can build a multi-TB RAID array for about $8000 these
days, which is not that bad at all. He figures these will top out at
around 100 users...200-300 users and you want to spend the money on
better stuff.
The tutorial was good, and a lot of it was stuff I'd have liked to
know about five years ago when I had no budget. (Of course, the disk
prices weren't nearly so good back then...) At the moment I've got a
good-ish budget -- though, like Damon, Oracle's ending of their
education discount has definitely cut off a preferred supplier -- so
it's not immediately relevant for me.
QOTD:
Damon: People load up their file servers with too much. Why
would you put MSSQL on your file server?
Me: NFS over SQL.
Matt: I think I'm going to be sick.
Damon also told us about his experience with Linux as an NFS server:
two identical machines, two identical jobs run, but one ran with the
data mounted from Linux and the other with the data mounted from
FreeBSD. The FreeBSD server gave a 40% speed increase. "I will never
use Linux as an NFS server again."
Oh, and a suggestion from the audience: smallnetbuilder.com for
benchmarks and reviews of small NASes. Must check it out.
During the break I talked to someone from a movie studio who talked
about the legal hurdles he had to jump in his work. F'r example:
waiting eight weeks to get legal approval to host a local copy of a
CSS file (with an open-source license) that added mouseover effects,
as opposed to just referring to the source on its original host.
Or getting approval for showing 4 seconds of one of their movies in a
presentation he made. Legal came back with questions: "How big will
the screen be? How many people will be there? What quality will you
be showing it at?" "It's a conference! There's going to be a big
screen! Lots of people! Why?" "Oh, so it's not going to be 20
people huddled around a laptop? Why didn't you say so?" Copyright
concerns? No: they wanted to make sure that the clip would be shown
at a suitably high quality, showing off their film to the best
effect. "I could get in a lot of trouble for showing a clip at
YouTube quality," he said.
The afternoon was "Recovering from Linux Hard Drive Disasters" with
Ted Ts'o, and this was pretty amazing. He covered a lot of
material, starting with how filesystems worked and ending with deep
juju using debugfs. If you ever get the chance to take this course, I
highly recommend it. It is choice.
Bits:
ReiserFS: designed to be very, very good at handling lots of little
files, because of Reiser's belief that the line between databases
and filesystems should be erased (or at least a lot thinner than it
is). "Thus, ReiserFS is the perfect filesystem if you want to store
a Windows registry."
Fsck for ReiserFS works pretty well most of the time; it scans the
partition looking for btree nodes (is that the right term?)
(ReiserFS uses btrees throughout the filesystem) and then
reconstructs the btree (ie, your filesystem) with whatever it finds.
Where that falls down is if you've got VM images which themselves
have ReiserFS filesystems...everything gets glommed together and it
is a big, big mess.
BtrFS and ZFS both very cool, and nearly feature-identical though
they take very different paths to get there. Both complex enough
that you almost can't think of them as a filesystem, but need to
think of them in software engineering terms.
ZFS was the cure for the "filesystems are done" syndrome. But it
took many, many years of hard work to get it fast and stable. BtrFS
is coming up from behind, and still struggles with slow reads and
slowness in an aged FS.
Copy-on-write FS like ZFS and BtrFS struggle with aged filesystems
and fragmentation; benchmarking should be done on aged FS to get an
accurate idea of how it'll work for you.
Live demos with debugfs: Wow.
I got to ask him about fsync() O_PONIES; he basically said if you
run bleeding edge distros on your laptop with closed-source graphics
drivers, don't come crying to him when you lose data. (He said it
much, much nicer than that.) This happens because ext4 assumes a
stable system -- one that's not crashing every few minutes -- and so
it can optimize for speed (which means, say, delaying sync()s for a
bit). If you are running bleeding edge stuff, then you need to
optimize for conservative approaches to data preservation and you lose
speed. (That's an awkward sentence, I realize.)
I also got to ask him about RAID partitions for databases. At $WORK
we've got a 3TB disk array that I made into one partition, slapped
ext3 on, and put MySQL there. One of the things he mentioned during
his tutorial made me wonder if that was necessary, so I asked him what
the advantages/disadvantages were.
Answer: it's a tradeoff, and it depends on what you want to do.
DB vendors benchmark on raw devices because it gets a lot of kernel
stuff out of the way (volume management, filesystems). And if you've
got a SAN where you can a) say "Gimme a 2.25TB LUN" without problems,
and b) expand it on the fly because you bought an expensive SAN (is
there any other kind?), then you've got both speed and flexibility.
OTOH, maybe you've got a direct-attached array like us and you can't
just tell the array to double the LUN size. So what you do is hand
the raw device to LVM and let it take care of resizing and such --
maybe with a filesystem, maybe not. You get flexibility, but you have
to give up a bit of speed because of the extra layers (vol mgt,
filesystem).
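A sketch of that LVM route, with made-up device and volume names:
pvcreate /dev/sdb                         # hand the raw array to LVM
vgcreate data_vg /dev/sdb
lvcreate -l 100%FREE -n mysql data_vg
mkfs.ext3 /dev/data_vg/mysql
# later, after the underlying LUN or array grows:
#   pvresize /dev/sdb && lvextend -r -l +100%FREE /dev/data_vg/mysql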
Or maybe you just say "Screw it" like we have, and put a partition and
filesystem on like any other disk. It's simple, it's quick, it's
obvious that there's something important there, and it works if you
don't really need the flexibility. (We don't; we fill up 3TB and
we're going to need something new anyhow.)
And that was that. I called home and talked to the wife and kids,
grabbed a bite to eat, then headed to the OpenDNS BoF. David
Ulevitch did a live demo of how anycast works for them, taking down
one of their servers to show the routing tables adjust. (If your DNS
lookup took an extra few seconds in Amsterdam, that's why.) It was a
little unsettling to see the log of queries flash across the screen,
but it was quick and I didn't see anything too interesting.
After that, it was off to the Gordon Biersch pub just down the street.
The food was good, the beer was free (though the Marzen tasted
different than at the Fairmont...weird), and the conversation was
good. Matt and Claudio tried to set me straight on US voter
registration (that is, registering as a
Democrat/Republican/Independent); I think I understand now, but it
still seems very strange to me.
Tags:
lisa
scaryvikingsysadmins
beer
mysql
08 Nov 2010
Hey you!
We've been around for a while.
If you'll admit that you were wrong, then we'll admit that we're right.
-- Sloan
After posting last night, a fellow UBCianiite and I went looking for
drinks. We eventually settled on the bar at the Fairmont. The
Widsomething Imperial IPA was lovely, as was the Gordon Biersch
(spelling, I'm sure) Marzen...never had a Marzen before and it was
lovely. (There was a third beer, but it wasn't very good. Mentioning
it would ruin my rhythm.) What was even lovelier was that the
coworker picked up the tab for the night. I'm going to invite him out
drinking a lot more from now on.
Sunday was day one of tutorials. In the morning was "Implementing
DNSSEC". As some of the complaints on Twitter mentioned, the
implementation details were saved for the last quarter of the
tutorial. I'm not very familiar with DNSSEC, though, so I was happy
with the broader scope...and as the instructor pointed out, BIND 9.7
has made a lot of it pretty easy, and the walkthrough is no longer as
detailed as it once had to be.
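(For my own reference, the by-hand version goes roughly like this. It
is not the instructor's exact walkthrough, the key file names are made
up, and BIND 9.7 automates most of it.)
dnssec-keygen -a RSASHA256 -b 1024 -n ZONE example.com           # zone-signing key
dnssec-keygen -a RSASHA256 -b 2048 -n ZONE -f KSK example.com    # key-signing key
dnssec-signzone -o example.com -k Kexample.com.+008+11111 \
    example.com.zone Kexample.com.+008+22222.key
rndc reload example.com      # after pointing named at the .signed file it produces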
Some interesting things:
He mentioned not being a big believer in dynamic zones
previously...and now he runs 40 zones and they're ALL dynamic. This
is an especially nice thing now that he's running DNSSEC.
Rackspace is authoritative for 1.1 million zones...so startup
time of the DNS server is important; you can't sit twiddling your
thumbs for several hours while you wait for the records to load.
BIND 10 (did I mention he works for the ISC?) will have a database
backend built right in. Not sure if he meant that text records
would go away entirely, or if this would be another backend, or if
it'd be used to generate text files. Still, interesting.
DNSSEC failure -- ie, a failure of your upstream resolver to
validate the records/keys/whatever -- is reported as SERVFAIL rather
than something more specific. Why? To keep (say) Windows 3.1
clients, necessary to the Education Department of the fictional
state of East Carolina, working...they are not going to be updated,
and you can't break DNS for them.
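You can see that behaviour from the client side with dig (the resolver
address is made up; dnssec-failed.org is, if I remember right, a
deliberately broken test zone):
dig @192.0.2.1 www.dnssec-failed.org A        # validating resolver: SERVFAIL, nothing more
dig @192.0.2.1 www.dnssec-failed.org A +cd    # checking disabled: you get the answer anyway
dig @192.0.2.1 www.isc.org A +dnssec          # a good zone comes back with the AD flag set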
Zone signatures: root (.) is signed (by Verisign; uh-oh); .net is
signed as of last week; .com is due next March. And there are still
registrars that shrug when you ask them when they're going to
support DS records. As he said, implement it now or start
hemorrhaging customers.
Another reason to implement it now, if you're an ISP: because the
people who will call in to notify you of problems are the techie
early adopters. Soon, it'll be Mom and Dad, and they're not going
to be able to help you diagnose it at all.
Go look at dnsviz.net
Question that he gets a lot: what kind of hardware do I need to
serve X many customers? Answer: there isn't one; too many
variables. But what he does suggest is to take your hardware
budget, divide by 3, and buy what you can for that much.
Congratulations: you now have 3 redundant DNS servers, which is a
lot better than trying to guess the right size for just one.
A crypto offload card might be a good thing to look at if you have a
busy resolver. But they're expensive. If your OS supports it, look
into GPU support; a high-end graphics card is only a few hundred
dollars, and apparently works quite well.
On why DNSSEC is important:
"I put more faith in the DNS system than I do in the public water
system. I check my email in bed with my phone before I have a
shower in the morning."
"As much as I have privacy concerns about Google, I have a lot
more concerns about someone pretending to be Google."
On stupid firewall assumptions about DNS:
AOL triggered heartburn a ways back when replies re: MX records
started exceeding 512 bytes...which everyone knew was impossible
and/or wrong. (It's not.) Suddenly people had weird problems
trying to email AOL.
Some version of Cisco's stateful packet inspection assumes that any
DNS reply over 512 bytes is clearly bogus. It's not, especially
with DNSSEC.
If I remember correctly (notes are fuzzy on this point), a reply over
512 bytes gets you a truncated UDP packet with the TC flag set, telling
the client to retry the query over TCP for the full answer. But
there are a large number of firewall tutorials that advise you to
turn off DNS over TCP. (My own firewall may be set up like
that...need to fix that when I get back.)
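Note to self: for a nameserver behind the firewall, the fix is a couple
of lines, something like this (iptables, with interface and state
matching left out):
iptables -A INPUT -p udp --dport 53 -j ACCEPT
iptables -A INPUT -p tcp --dport 53 -j ACCEPT   # don't drop this one "because DNS is only UDP"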
When giving training on DNS in early 2008, he came to a slide about
cache poisoning. There was another ISC engineer there to help him
field questions, give details, etc, and he was turning paler and paler
as he talked about this. This was right before the break; as soon as
the class was dismissed, the engineer came up to him and said, "How
many more of those damn slides do you have?" "That's all, why?" "I
can't tell you. But let's just say that in a year, DNSSEC will be a
lot more important."
The instructor laughed in his face, because he'd been banging his head
against that brick wall for about 10 years. But the engineer was one
of the few who knew about the Kaminsky attack, and had been sworn to
secrecy.
Lunch! Good lunch, and I happened, along with Bob the Norwegian, to
be nearly first in line. Talked to tablemates from a US gov't lab,
and they mentioned the competition between labs. They described how
they moved an old supercomputer over beside a new supercomputing
cluster, and made the Top 500 list for...a week, 'til someone else
got it. And there were a couple admins from the GPS division of John
Deere, because tractors are all GPS-guided these days when plowing the
fields.
Sunday afternoon was "Getting it out the door successfully", a
tutorial on project management, with Strata Rose-Chalup. This was
good; there were some things I already knew (but was glad to see
confirmed), and a lot more besides...including stuff I need to
implement. Like: if startup error messages are benign, then a) don't
emit them, and b) at least document them so that other people
(customers, testers, future coders) know this.
QOTD:
- "Point to the wall, where you have a velvet-covered clue-by-four and
chocolate. Ask, 'Which would you like to have your behaviour
modified by today?'"
"What do you do if your product owner is an insane jackass?" "If
your product owner is an insane jackass, then you have a typical
product..." But srsly: many people choose to act like this when they
feel they're not being listened to. Open up your meetings and
let them see what's on the table. Bring in their peers, too; that way
their choice will be to act like a jackass in front of their peers, or
to moderate their demands.
Tip from the audience: when faced with impossible requests, don't say
"No". Just bring up the list of stuff you're already working on, and
the requests/features/bugfixes that have already been agreed to, and
ask them where this fits in. They'll either modify their request
('cos it's not that important to them), or you'll find a lot of
other stuff moved out of your way ('cos that other stuff isn't that
important to them).
After that was supper with Andy, who I hadn't seen since last year's
LISA. We hit up a small Mexican place for supper (not bad), the
Britannia Arms for a beer (where Matt tried to rope us into
Karaoke and kept asking us to do "Freebird" with him), then the
Fairmont hotel bar so Andy could get his Manhattan. (He's a bit
intense about Manhattans.) It was a good time.
Tags:
lisa
scaryvikingsysadmins
06 Nov 2010
There's been debate and some speculation
Have you heard?
Sloan
I figure two months is long enough.
I'm at LISA again, this time in sunny San Jose. I took the train down
this year (no reason, why do you ask?), which...well, it
took a long time: I got on a bus to Seattle at 5:30am on Friday, and
arrived at the San Jose train station at 10am on Saturday. I went
coach; a sleeper would have been a nice addition, as the chairs are
not completely comfortable for sleeping. (Probably would have got me
access to the wireless too, which Amtrak's website does not mention is
only available to T3h El33+.)
But oh, the leg room! I nearly wept. And the people-watching....my
wife is the champ, but I can get into it too. Overheard snippets
of conversation in the observation car were the best. Like this guy
with silver hair, kind of like the man from Glad:
Silver: So yeah, she got into animal husbandry then and just started
doing every drug on the planet. I mean, when I started doing pot, I
told my parents. I told my grandparents. But she...I mean, EVERY
drug on the planet.
Or the two blue-collar guys who met in the observation car and became
best buds:
Buddy: Aw man, you ever go to the casinos? Now that I'm up in
Washington now, I think I'm gonna check 'em out.
Guy: I dunno, I go with my friends sometimes. I don't gamble, but
I'll have a few beers.
Buddy: You hear who's coming to the Tulalip? Joe Satriani, man.
JOOOOOOOOOE. Joe Satriani!
Guy: Yeah, I'll hit the buffet...
And then later:
Silver: I knew it was a bad thing. I mean, she was a ten. I'm okay,
but she was a TEN, you know what I mean? The other tenants were
going to get jealous, and I only had enough of them to pay the
mortgage.
Buddy: (separate conversation) And we caught one of those red crabs
when we were up in Alaska?
Guy: Man, you won't catch me eatin' that shit.
Silver: And then she says, do you mind if I take a trip up the
mountains with this doctor I met? I say, what do I have to say about
it?
Buddy: What? Man, they're good eatin'. We just dropped it in a pot
and boiled the sonuvabitch.
Silver: And that's when I realize she thinks we're in a relationship. I
guess she's got this thing about men.
I slept badly, woke up at 3:30am and read for a while before realizing
that the book of disturbing scifi stories is not really good 3:30am
reading. I watched San Francisco and San Jose approach from the
observation car; tons and tons of industrial land, occasionally
interrupted by beautiful parks and seashore.
San Jose came at last. I had thought about walking to the convention
centre, but decided against it. Glad I did, since a) it's a little
further than I thought; b) it's surprisingly warm here; c) more
industrial land, and d) when I did go out walking later on I managed
to get completely turned around twice. I was looking for Phillz
Coffee, based on a recommendation from Twitter (can't bring myself yet
to say "tweet"; give me six months) and got lost in Innitek land
(complete with Adobe) and a Vietnamese neighbourhood before finding it
at last. The coffee was pretty good; they have about two dozen
varieties and they make it one cup at a time. Not sure it was worth
$3.50 for a 12 oz coffee, though...NOT THAT I'M UNGRATEFUL. Thank
you, @perwille.
Gotta say, downtown SJ on a Saturday is...dead. I saw maybe a dozen
people in six blocks despite stores, a nearby university (they call
them high schools here) and I think three museums. I have no idea
where one might go for a fun time at night, but I imagine it involves
another city.
So then I took a bus to sunny Cupertino. Why? To visit the retail
outlet of Orion Telescopes. I've got into astronomy again (loved
it as a kid), and I'm thinking of buying one of their telescopes in
about a year. Since the store was only ten miles away, why not go?
And since the bus goes right from the hotel to, well, pretty close,
seems like it's a requirement.
Now that was fun; even more people-watching on the train. Like the
Hispanic gentleman w/a handlebar moustache, a cowboy hat, tight
polyester pants (he had the roundest buttocks I've ever seen on a man.
I could only wonder in great admiration), a silk shirt with "K-Paz"
embroidered on the back, and a button that said, in Spanish, something
that was probably "DO X NOW! ASK ME HOW!" And the study in
ringtones: the elderly Hispanic grandmother who had Mexican accordion
music vs. the middle-aged African-American guy who had Michael
Jackson's "Thriller." Man, you just don't get that where I come
from.
And the contrast in neighbourhoods between San Jose (out of downtown,
it was all Hispanic shops), Santa Clara ("ALL-AMERICAN CITY 2001" said
the sign; Restoration Hardware to prevent white panic) and Cupertino
(duelling car dealerships (Audi, Land Rover and Lexus) and antivirus
companies (Symantec and Trend Micro); Critical Mass, only with
scooters instead of bikes; Harley driver wearing a leather jacket with
an Ed Hardy embroidered patch on the back).
Anyhow, the telescopes were neat; it was the first chance I'd really
had to look at them closely. I didn't buy one (relax, Pre!). They
didn't have a floor model of the one I really want, but I've got a
better idea of the size, and of what I want out of one.
And now...to post, then register. Which means going to the business
centre, since Internet access costs $78/day at the Hilton with a 3KB
daily cap. And the Russian mob's attempt to get my banking data by
setting up a "Free DownTown WiFi" network is NOT GOING TO WORK,
tvaritch.
Tags:
lisa
scaryvikingsysadmins
02 Sep 2010
Last year I bought a Galileoscope for $15. It's a cheap (though
well-made) telescope that was meant to celebrate the 400th anniversary
of Galileo's first astronomical observations. It was $15 -- so
cheap!
Jupiter has been visible all this month out our bedroom window around
4:30am, and this morning I pointed the telescope at it and saw its
moons and, I think, a band across the middle. If I had a tripod to
hook it up to, I would have got an even better view...but even
balanced on the window, it's amazing what you can see.
Work yesterday was interesting -- which is good, because it's been a
bit of a slow month. A vendor bought me coffee, and it was actually
an interesting conversation. I finally got an LDAP server migrated to
a VM in preparation for re-installing the host it's on; this took a
while because I refused to read my own instructions for how to set up
replication (sigh). And that brought up other problems, like the
fact that my check for jumbo frames being enabled wasn't actually
complaining about non-jumbo frames...or that the OpenSuSE machines
I've got didn't get their LDAP configuration from Cfengine the way I
thought.
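Something like this is what that check should have been doing
(interface name made up):
mtu=$(ip -o link show eth0 | sed -n 's/.* mtu \([0-9]*\).*/\1/p')
[ "$mtu" -ge 9000 ] || echo "eth0: jumbo frames NOT enabled (mtu=$mtu)"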
All stuff to solve tomorrow...I mean, today. (Dang getting up at 4am...)
Tags:
astronomy
25 Aug 2010
At work, I've been playing with a tiny cluster: 3 Sun V20z servers,
each with a 2.2GHz dual-core Opteron and 2GB of memory. It's nothing
special at all, but it's been a good way of getting familiar with
Rocks.
One thing that's bitten me a few times is the documentation. The 411
service is described only in the 4.x documents, but still appears to
be a going concern in the 5.x series...indeed, that's how I got LDAP
and autofs working. And to test the cluster, the HPC roll
documentation says to use cluster-fork...yet running cluster-fork gives
me the message "cluster-fork has been replaced with 'rocks run host'",
which is documented in the base roll.
Tags:
23 Aug 2010
My in-laws got us a family membership at Science World for Xmas
last year. Yesterday I got to take my 4 year-old (how should that be
hyphenated?) son for the morning. It was his third trip and my
second.
We headed right for the Eureka room, which is aimed at the young 'uns,
and he ran around showing me everything. "Daddy, here's a big tube
where you can shoot out parachutes! And this air gun shoots balls up
into the water!" We found out that you could stuff three plastic
balls into the air gun at once (poom poom poom!).
Oh, and when we got home he wanted to do an experiment. He got some
pennies and put them in a jar with water, to leave them for a few days
and see if they would dissolve. I had a maple syrup candy in my
pocket (no idea where I got it), so I threw that in too. The candy
has dissolved and made the water brown, so I'm curious to see what he
makes of that.
Science World is just incredible. I long to go see the grownup stuff,
but even the kid stuff is enormously fun and moderately educational
(though that's not my son's priority right now) (dang kids). I grew
up in small towns as a kid, so trips to museums like this were rare,
enormous fun. (And I never did get to go to Science North...)
It's amazing to me that this stuff is right here, only a half hour
away by transit. I'm still a little shocked we don't go, like, every
weekend.
In other news, I've got a starter going for a batch of beer next
weekend. It's a Belgian yeast, harvested from my January
batch. The yeast was washed following these instructions,
and the starter took off in about 18 hours. It seems to be doing
quite nicely; I'll probably stick it in the fridge on Wednesday or so
and cold-crash it.
The ingredients are pretty much whatever I have around the
house: the last of my Gambrinus ESB, some biscuit and wheat malt, a
bit of roasted barley, and the hops are Centennial, Goldings and Mt
Hood. My father-in-law would call this a "minestrone" -- Italian not
just for that kind of soup, but for "dog's breakfast" or "big ol'
mixup". (I kind of like the idea of an Italian sounding like he's
from Missouri.)
Still looking for a name; suggestions on a postcard, please.
Sponsorship options are available. :-)
After that's in the bag, it's time to head back to Dan's for a
shopping trip. This time, I think it'll be a 50-lb bag of plain ol'
pale malt, and I'll see what difference that makes.
Tags:
geekdad
beer
17 Aug 2010
I'm trying to get Bacula to make a separate copy of monthly full
backups that can be kept off-site. To do this, I'm experimenting with
its "Copy" directive. I was hoping to get a complete set of tapes
ready to keep offsite before I left, but it was taking much longer
than anticipated (2 days to copy 2 tapes). So I cancelled the jobs,
typed "unmount" at bconsole, and went home thinking Bacula would just
grab the right tape from the autochanger when backups came.
What I should have typed was "release". release lets Bacula grab
whatever tape it needs; unmount leaves Bacula unwilling to do
anything on its own, and it waits for the operator (ie, me) to do
something.
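In bconsole terms, the difference is something like this (storage name
made up):
*release storage=Autochanger        # frees the drive, but Bacula will load tapes itself later
*unmount storage=Autochanger        # drive sits idle until an operator mounts something
*mount storage=Autochanger slot=5   # what I had to do when I got back (slot number invented)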
Result: 3 weeks of no backups. Welcome back, chump.
There are a number of things I can do to make sure this doesn't happen
again. There's a thread on the Bacula-users mailing list (came up in
my absence, even) detailing how to make sure something's mounted. I
can use release the way Kern intended. I can set up a separate
check that goes to my cell phone directly, and not through Nagios. I
can run a small backup job manually on Fridays just to make sure it's
going to work. And on it goes.
I knew enough not to make changes as root on Friday before going on
vacation. But now I know that includes backups.
Tags:
backups
fail
work