Xmas Maintenance 2010: Lessons learned

Xmas vacation is when I get to do big, disruptive maintenance with a fairly free hand. Here's some of what I did and what I learned this year.

Order of rebooting

I made the mistake of rebooting one machine first: the one that held the local CentOS mirror. I did this thinking that it would be a good guinea pig, but then other machines weren't able to fetch updates from it; I had to edit their repo files. Worse, there was no remote console on it, and no time (I thought) to take a look.

  • Lesson: Don't do that.

Automating patching

Last year I tried getting machines to upgrade using Cfengine like so:

shellcommands:

  centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
          "/usr/bin/yum -q -y clean all"
          "/usr/bin/yum -q -y upgrade"
          "/usr/bin/reboot"

This didn't work well: I hadn't pushed out the changes in advance, because I was paranoid that I'd miss something. When I did push them out, all the machines hit the cfserver at (more or less) the same time and didn't get the updated files, because the server was refusing connections. I ended up doing it by hand.

This year I pushed out the changes in advance, but it still didn't work because of the problems with the repo. I ran cssh, edited the repo files and updated by hand.

This worked okay, but I had to do the machines in separate batches -- some needed to have their firewall tweaked to let them reach a mirror in the first place, some I wanted to watch more carefully, and so on. That meant going through a list of machines, trying to figure out if I'd missed any, adding them by hand to cssh sessions, and so on.

  • Lesson: I need a better way of doing this.
  • Lesson: I need a way to check whether updates are needed.

I may need to give in and look at RHEL, or perhaps func or better Cfengine tweaking will do the job.
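
In the meantime, the checking half at least is easy enough to script. Here's a rough sketch of what I have in mind -- not something I've actually run, and "hosts.txt" is a made-up host list:

#!/bin/sh
# yum check-update exits 0 when a box is current, 100 when updates are
# pending, and 1 on error.
while read -r host; do
    rc=$(ssh -o BatchMode=yes "$host" 'yum -q check-update >/dev/null 2>&1; echo $?')
    case "$rc" in
        0)   echo "$host: up to date" ;;
        100) echo "$host: updates available" ;;
        *)   echo "$host: check failed (got '$rc')" ;;
    esac
done < hosts.txt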

Staggering reboots

Quick and dirty way to make sure you don't overload your PDUs:

# $RANDOM tops out at 32767, so this sleeps somewhere between 0 and ~163 seconds
sleep $(expr $RANDOM / 200) && reboot

Remote consoles

Rebooting one server took a long time because the ILOM was not working well, and had to be rebooted itself.

  • Lesson: I need to test the SP before doing big upgrades; the simplest way of doing this may just be rebooting them.

Upgrading the database servers with the 3 TB arrays took a long time: the stock MySQL packages conflicted with the official MySQL RPMs, and fscking the arrays takes maybe an hour -- with no sign of life on the console the whole time. Problems with one machine's ILOM meant I couldn't even get a console for it.

  • Lesson: Again, make sure the SP is okay before doing an upgrade.
  • Lesson: Fscking a few TB will take an hour with ext3.
  • Lesson: Start the console session on those machines before you reboot, so that you can at least see the progress of the boot messages up until the time it starts fscking.
  • Lesson: Might be worth editing fstab so that they're not mounted at boot time; you can fsck them manually afterward. However, you'll need to remember to edit fstab again and reboot (just to make sure)...this may be more trouble than it's worth.
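
If I do go that route, the fstab tweak would look something like this -- device and mount point are made up, and I haven't actually run the machines this way:

# "noauto" keeps the array out of "mount -a" at boot; the final 0 keeps
# boot-time fsck from touching it.
/dev/sdb1   /data   ext3   defaults,noauto   0 0
# Afterward, fsck and mount it by hand:
#   fsck.ext3 -C0 /dev/sdb1 && mount /data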

OpenSuSE

Holy mother of god, what an awful time this was. I spent eight hours on upgrades for just nine desktop machines. Sadly, most of it was my fault, or at least bad configuration:

  • Two of the machines were running OpenSuSE 11.1; the rest were running 11.2. The latter lets you upgrade to the latest release from the command line using "zypper dist-upgrade"; the former does not, and you need to run over with a DVD to upgrade them.
  • By default, zypper fetches one package, installs it, then fetches the next. I'm not certain, but I think that means a lot more TCP overhead and less chance for the transfer speed to ramp up. It sure as hell seemed slow downloading 1.8GB x 9 machines this way.
  • Graphics drivers: awful. Four different versions, and I'd used the local install scripts rather than creating an RPM and installing that. (Though to be fair, that would just rebuild the driver from scratch when it was installed, rather than do something sane like build a set of modules for a particular kernel.) And I didn't figure out where the uninstall script was 'til 7pm, meaning lots of fun trying to figure out why the hell one machine wouldn't start X.

  • Lesson: This really needs to be automated.

  • Lesson: The ATI uninstall script is at /usr/share/ati/fglrx-uninstall.sh. Use it.

  • Lesson: Next time, uninstall the driver and build a goddamn RPM.

  • Lesson: A better way of managing xorg.conf would be nice.

  • Lesson: Look for prefetch options for zypper (a guess at what that might look like is sketched after this list). And start a local mirror.

  • Lesson: Pick a working version of the driver, and commit that fucker to Subversion.
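
On the zypper prefetch lesson: newer zypper releases have a download-mode knob that sounds like exactly what I want. I haven't checked whether it exists on 11.1/11.2, so treat this as a guess:

# Fetch all packages up front, then install (assuming this option exists
# in the release you're running):
zypper --non-interactive dup --download in-advance
# Or set it permanently in /etc/zypp/zypp.conf:
#   commit.downloadMode = DownloadInAdvance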

Special machines

These machines run some scientific software: one master, three slaves. When the master starts up at boot time, it tries to SSH to the slaves to copy over the binary. There appears to be little or no rate throttling; if the slaves are not available when the master comes up, you end up with the following symptoms:

  • Lots of SSH/scp processes on the master
  • Lots of SSH/scp processes on the slave (if it's up)
  • If you try to run the slave binary on the slave, you get errors like "lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)" (from strace) or "Text file busy" (ETXTBSY) from the shell.

The problem is that umpteen scp processes on the slave are holding the binary open, and the kernel won't exec a binary that's still open for writing.
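
A quick way to confirm that, and to clean up afterward (the binary path is made up, obviously):

# Who's holding the slave binary open?
lsof /path/to/slave_binary           # or: fuser -v /path/to/slave_binary
# Kill the stuck scp processes that have it open:
fuser -k /path/to/slave_binary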

  • Lesson: Bring up the slaves first, then bring up the master.
  • Lesson: There are lots of interesting and obscure Unix errors.

I also ran into problems with a duff cable on the master; confusingly, both the kernel and the switch said it was still up. This took a while to track down.

  • Lesson: Network cables are surprisingly fragile at the connection with the jack.

Virtual Machines

It turned out that a couple of my KVM-based VMs did not have jumbo frames turned on. I had to use virt-manager to shut down the machines, switch their network devices to virtio, then boot them again. However, kudzu on the VMs then saw these as new interfaces and did not configure them correctly. This caused problems because the machines were LDAP clients, and they hung when the network was unavailable.

  • Lesson: To get around this, go into single-user mode and copy /etc/sysconfig/network-scripts/ifcfg-eth0.bak to ifcfg-eth0.
  • Lesson: Be sure you're monitoring everything in Nagios; it's a sysadmin's regression test.
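
The check I'd like Nagios to do looks something like this -- a sketch only, and the interface name is assumed:

# Is the guest really using virtio, and really at MTU 9000?
ethtool -i eth0 | grep '^driver:'            # expect "driver: virtio_net"
ip link show eth0 | awk '/mtu/ {print $5}'   # prints the MTU; expect 9000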

Tags: work cfengine jumboframes rant toptip mysql

Org Mode + The Cycle + DayTimer + RT

In the spirit of Chris Siebenmann, and to kick off the new year, here's a post that's partly documentation for myself and partly an attempt to ensure I do it right: how I manage my tasks using Org Mode, The Cycle, my Daytimer and Request Tracker.

  • Org Mode is awesome, even more awesome than the window manager. I love Emacs and I love Org Mode's flexibility.

  • Tom Limoncelli's "Time Management for System Administrators." Really, I shouldn't have to tell you this.

  • DayTimer: because I love paper and pen. It's instant boot time, and it's maybe $75 to replace (and durable) instead of $500 (and delicate). And there is something so satisfying about crossing off an item on a list; C-c C-c just isn't as fun.

  • RT: Email, baby. Problem? Send an email. There's even rt-liberation for integration with Emacs (and probably Org Mode, though I haven't done that yet).

So:

  • Problems that crop up, I email to RT -- especially if I'm not going to deal with them right away. This is perfect for when you're tackling one problem and you notice something else non-critical. Thus, RT is often a global todo list for me.

  • If I take a ticket in RT (I'm a shop of one), that means I'm planning to work on it in the next week or so.

  • Planning for projects, or keeping track of time spent on various tasks or for various departments, is kept in Org Mode. I also use it for things like end-of-term maintenance lists. (I work at a university.) It's plain text, I check it into SVN nightly, and Emacs is The One True Editor.

  • My DayTimer is where I write down what I'm going to do today, or that appointment I've got in two weeks at 3pm. I carry it everywhere, so I can always check before making a commitment. (That bit sampled pretty much directly from TL.)

  • Every Monday (or so; sometimes I get delayed) I look through things to see what has to be done:

    • Org mode for projects or next-step sorta things
    • RT for tickets that are active
    • DayTimer for upcoming events
  • I plan out my week. "A" items need to be done today; "B" items should be done by the end of the week; "C" items are done if I have time.

  • Once every couple of months, I go through RT and look at the list of tickets. Sometimes things have been done (or have become irrelevant) and can be closed; sometimes they've become more important and need to be worked.

  • I try to plan out what I want to get done in the term ahead at the beginning of the term, or better yet just before the term starts; often there are new people starting with a new term, and it's always a bit hectic.

Tags: work

How I spent my day

This took me a while to figure out. (All my war stories start with that sentence...)

A faculty member is getting a new cluster next year. In the meantime, I've been setting up Rocks on a test bed of older machines to get familiar with it. This week I've been working out how Torque, Maui and MPI work, and today I tried running something non-trivial.

CHARMM is used for molecular simulations; it's mostly (I think) written in Fortran and has been around since the 80s. It's not the worst-behaved scientific program I've had to work with.

I had an example script from the faculty member to run. I was able to run it on the head node of the cluster like so:

mpirun -np 8 /path/to/charmm < stream.inp > out ZZZ=testscript.inp

8 CHARMM processes still running after, like, 5 days. (These things run forever, and I got distracted.) Sweet!

Now to use the cluster the way it was intended: by running the processes on the internal nodes. Just a short script and away we go:

$ cat test_mpi_charmm.sh
#PBS -N test_charmm
#PBS -S /bin/sh
#PBS -N mpi_charmm
#PBS -l nodes=2:ppn=4

. /opt/torque/etc/openmpi-setup.sh
mpirun /path/to/charmm < stream.inp ZZZ=testscript.inp
$ qsub test_mpi_charmm.sh

But no, it wasn't working. The error file showed:

At line 3211 of file ensemble.f
Fortran runtime error: No such file or directory

mpirun has exited due to process rank 0 with PID 9494 on
node compute-0-1.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).


Well, that's helpful...but the tail of the output file showed:

CHARMM>    ensemble open unit 19 read card name    -
 CHARMM> restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
 Parameter: FILEROOT -> "TEST_RUN"
 Parameter: PREV -> "FOO"
 Parameter: NREP -> "1"
 Parameter: NODE -> "0"
 ENSEMBLE>   REPLICA NODE   0
 ENSEMBLE>   OPENING FILE restart/test_run_foo_nr1_nd0
 ENSEMBLE>   ON UNIT  19
 ENSEMBLE>   WITH FORMAT FORMATTED       AND ACCESS READ

What the what now?

Turns out CHARMM has the ability to checkpoint work as it goes along, saving its work in a restart file that can be read when starting up again. This is a Good Thing(tm) when calculations can take weeks and might be interrupted. From the charmm docs, the restart-relevant command is:

IUNREA     -1     Fortran unit from which the dynamics restart file should
          be read. A value of -1 means don't read any file.

(I'm guessing a Fortran unit is something like a file descriptor; haven't had time to look it up yet.)

The name of the restart file is set in this bit of the test script:

iunrea 19 iunwri 21 iuncrd 20

Next is this bit:

ensemble open unit 19 read card name     -
"restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"

An @ sign indicates a variable, it seems. And it's Fortran, and Fortran's been around forever, so it's case-insensitive. So the restart file is being set to "restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst". Snipping from the input file, here is where the variables are set:

set fileroot  test
set prev minim
set node ?whoiam
set nrep ?nensem

test" appears to be just a string. I'm assuming "minim" is some kind of numerical constant. But "whoiam" and "nensem" are set by MPI and turned into CHARMM variables. From charmm's documentation:

The CHARMM run is started using MPI commands to specify the number of processes
(replicas) to use, each of which is an identical copy of charmm. This number
is automatically passed by MPI to each copy of the executable, and it is set to
the internal CHARMM variable 'nensem', which can be used in scripts, e.g.
    set nrep ?nensem
The other internal variable set automatically via MPI is 'whoiam', e.g.
    set node ?whoiam
These are useful for giving different file names to different nodes.

So remember the way charmm was being invoked in the two jobs? The way it worked:

mpirun -np 8 ...

...and the way it didn't:

mpirun ...

Aha! Follow the bouncing ball:

  • The input script wants to load a checkpoint file...
  • ...which is named after the number of processes mpi was told to run...
  • ...and the script barfs if it's not there.

At first I thought that I could get away with increasing the number of copies of charmm that would run by fiddling with Torque's server_priv/nodes file -- telling it that the nodes had 4 processors each (so a total of 8) rather than 2. (These really are old machines.) And then I'd change the PBS line to "-l nodes=2:ppn=4", and we're done! Hurrah! Except no: I got the same error as before.
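
One thing that would have saved me some head-scratching, and which I'll add to the job script next time (my addition, not part of the original script): print what Torque actually allocated before mpirun runs.

# $PBS_NODEFILE is the hostfile Torque hands the job, one line per slot.
echo "Slots allocated: $(wc -l < "$PBS_NODEFILE")"
sort "$PBS_NODEFILE" | uniq -c        # slots per node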

What does work is changing the mpirun args in the qsub file:

mpirun -np 8 ...

However, what that does is run 8 copies on one compute node -- which works, hurrah, but it's not what I think we want:

  • many copies on many nodes
  • communicating as necessary

I think this is a problem for the faculty member to solve, though. It's taken me a whole day to figure this out, and I'm pretty sure I wouldn't understand the implications of just (say) deking out the bit that looks for a restart file. (Besides, there are many such bits.) (Oh, and incidentally, just moving the files out of the way doesn't help...still barfs and dies.) I'll email him about this and let him deal with it.

Tags: rocks warstory cluster

How things change

Today I was adding a couple network cards to a new firewall...and realized that I had no idea if I had any screws to hold them in place. It's been a long time since I've opened up a machine to replace a part, and a longer time since I've been doing it on a regular basis.

Today, for example: drafting an RFP and going over acceptance testing with the PI; decoding bits of protein modelling procedures so I can understand what he's doing; reading up on Torque, Maui and trying to figure out the difference between OpenMP and OpenMPI.

Five years ago...no, longer, six, seven even: debugging crappy DLink switches; getting workstations from auction; blowing dust bunnies out of the central fileserver.

Tags:

Awesome Apache debugging resource

As mentioned on a mailing list recently: Apache Internals and Debugging.

Tags:

Presentation done

My presentation on Cfengine 3 went pretty well yesterday. There were about 20 people there...I had been hoping for more, but that's a pretty good turnout. I was a little nervous beforehand, but I think I did okay during the talk. (I recorded it -- partly to review afterward, partly 'cos my dad wanted to hear the talk. :-)

One thing that did trip me up a bit was a string of questions from one person in the audience that went fairly deep into how to use Cfengine, what its requirements were and so on. Since this was meant to be an introduction and I only had an hour, I wasn't prepared for this. Also, the questions went on...and on...and I'm not good at taking charge of a conversation to prevent it being hijacked. The questions were good, and though he and I disagree on this subject I respect his views. It's just that it really threw off my timing, and would have been best left for after. Any tips?

At some point I'm going to put up more on Cf3 that I couldn't really get into in the talk -- how it compares to Cf2, some of the (IMHO) shortfalls, and so on.

Tags: cfengine

Saturn

I just got in this morning from seeing Saturn for the first time ever through a telescope. It was through my Galileoscope, a cheap (but decent!) 50mm refractor.

I've taken up astronomy again for the first time since I was a kid, and it's been a lot of fun. I've been heading out nights with the scope, some binoculars and sky maps to learn the sky. So far through the scope I've seen Jupiter, the Moon, Albireo, and Venus. (Venus was yesterday, when I was trying to find Saturn...)

Saturn, though...that was something else. It was small, but I could distinctly see the rings. It was absolutely breathtaking. I've wanted to see Saturn through a scope for a long, long time, and it was incredible to finally do so.

Tags: astronomy galileoscope

You take the good, you take the bad

Bad: Sorry, but can we have a budget by Monday? New rules mean we need to get your budget approved by this time next week, instead of four months from now, so I need to look over it by Monday. (To be fair, he was quite apologetic.)

Good: Seeing the crescent moon and Venus (or Spica? I'll never know) peeking through the clouds on the way to work this morning.

Tags: mrsgarrettwasright astronomy

Working with Rocks

So we want a cluster at $WORK. I don't know a lot about this, so I figure that something like Rocks or OSCAR is the way to go. OSCAR didn't look like it had been worked on in a while, so Rocks it is. I downloaded the CDs and got ready to install on a handful of old machines.

(Incidentally, I was a bit off-base on OSCAR. It is being worked on, but the last "production-ready" release was version 5.0, in 2006. The newest release is 6.0.5, but as the documentation says:

Note that the OSCAR-6.0.x version is not necessarily suitable for production. OSCAR-6.0.x is actually very similar to KDE-4.0.x: this version is not necessarily "designed" for the users who need all the capabilities traditionally shipped with OSCAR, but this is a good new framework to include and develop new capabilities and move forward. If you are looking for all the capabilities normally supported by OSCAR, we advice you to wait for a later release of OSCAR-6.1.

So yeah, right now it's Rocks.)

Rocks promises to be easy: it installs a frontend, then that frontend installs all your compute nodes. You install different rolls: collections of packages. Everything is easy. Whee!

Only it's not that way, at least not consistently.

  • I'm reinstalling this time because I neglected to install the Torque roll last time. In theory you can install a roll to an already-existing frontend; I couldn't get it to work.

  • A lot of stuff -- no, that's not true. Some stuff in Rocks is just not documented very well, and it's the little but important and therefore irritating-when-it's-missing stuff. For example: want the internal compute nodes to be LDAP clients, rather than syncing /etc/passwd all around? That means modifying /var/411/Files.mk to include things like /etc/nsswitch.conf and /etc/ldap.conf (there's a sketch of that after this list). That's documented in the 4.x series, but it has been left out of the 5.x series. I can't tell why; it's still in use.

  • When you boot from the Rocks install CD, you're directed to type "Build" (to build a new frontend) or "Rescue" (to go into rescue mode). What it doesn't tell you is that if you don't type in something quickly enough, it's going to boot into a regular-looking-but-actually-non-functional CentOS install and after a few minutes will crap out, complaining that it can't find certain files -- instead of either booting from the hard drive or waiting for you to type something. You have to reboot again in order to get another chance.
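
For the record, the 411 change I mentioned above looks roughly like this. I'm going from the 4.x docs, so double-check the variable name against the Files.mk your release ships:

# Appended to /var/411/Files.mk:
FILES += /etc/nsswitch.conf
FILES += /etc/ldap.conf
# Then rebuild and push the files out to the compute nodes:
#   make -C /var/411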

Right now I'm reinstalling the front end for the THIRD TIME in two days. For some reason, the installation is crapping out and refusing to store the static IP address for the outward-facing interface of the front end. Reinstalling means sitting in the server room feeding CDs (no network installation without an already-existing front end) into old servers (which have no DVD drives) for an hour, then waiting another half hour to see what's gone wrong this time.

Sigh.

Tags: rocks cluster

Can't Face Up

How many times have I tried  
Just to get away from you, and you reel me back?  
How many times have I lied  
That there's nothing that I can do?  
-- Sloan  

Friday morning started with a quick look at Telemundo ("PRoxima: Esclavas del sexo!"), then a walk to Phillz coffee. This time I got the Tesora blend (their hallmark) and wow, that's good coffee. Passed a woman pulling two tiny dogs across the street: "C'mon, peeps!" Back at the tables I checked my email and got an amazing bit of spam about puppies, and how I could buy some rare breeds for ch33p.

First up was the Dreamworks talk. But before that, I have to relate something.

Earlier in the week I ran into Sean Kamath, who was giving the talk, and told him it looked interesting and that I'd be sure to be there. "Hah," he said, "Wanna bet? Tom Limoncelli's talk is opposite mine, and EVERYONE goes to a Tom Limoncelli talk. There's gonna be no one at mine."

Then yesterday I happened to be sitting next to Tom during a break, and he was discussing attendance at the different presentations. "Mine's tomorrow, and no one's going to be there." "Why not?" "Mine's opposite the Dreamworks talk, and EVERYONE goes to Dreamworks talks."

Both were quite amused -- and possibly a little relieved -- to learn what the other thought.

But back at the ranch: FIXME in 2008, Sean gave a talk on Dreamworks and someone asked afterward "So why do you use NFS anyway?" This talk was meant to answer that.

So, why? Two reasons:

  • Because it works.

They use lots of local caching (their filers come from NetApp, and they also have a caching box), a global namespace, data hierarchy (varying on the scales of fast, reliable and expensive), leverage the automounter to the max, and 10Gb core links everywhere, and it works.

  • What else are you gonna use? Hm?

FTP/rcp/rdist? Nope. SSH? Won't handle the load. AFS lacks commercial support -- and it's hard to get the head of a billion-dollar business to buy into anything without commercial support.

They cache for two reasons: global availability and scalability. First, people in different locations -- like on different sides of the planet (oh, what an age we live in!) -- need access to the same files. (Most data has location affinity, but this will not necessarily be true in the future.) Geographical distribution and the speed of light do cause some problems: while data reads and getattr()s are helped a lot by the caches, first opens, sync()s and writes are slow when the file is in India and it's being opened in Redwood. They're thinking about improvements to the UI to indicate what's happening, to reduce user frustration. But overall, it works and works well.

Scalability is just as important: thousands of machines hitting the same filer will melt it, and the way scenes are rendered, you will have exactly that situation. Yes, caching adds latency, but it's still faster than an overloaded filer. (It also requires awareness of close-to-open consistency.)

Automounter abuse is rampant at DW; if one filer is overloaded, they move some data somewhere else and change the automount maps. (They're grateful for the automounter version in RHEL 5: it no longer requires that the node be rebooted to reload the maps.) But like everything else it requires a good plan, or it gets confusing quickly.

Oh, and quick bit of trivia: they're currently sourcing workstations with 96GB of RAM.

One thing he talked about was that there are two ways to do sysadmin: rule-enforcing and policy-driven ("No!") or creative, flexible approaches to helping people get their work done. The first is boring; the second is exciting. But it does require careful attention to customers' needs.

So for example: the latest film DW released was "Megamind". This project was given a quota of 85 TB of storage; they finished the project with 75 TB in use. Great! But that doesn't account for the 35 TB of global temp space that they used.

When global temp space was first brought up, the admins said, "So let me be clear: this is non-critical and non-backed up. Is that okay with you?" "Oh sure, great, fine." So the admins bought cheap-and-cheerful SATA storage: not fast, not reliable, but man it's cheap.

Only it turns out that non-backed up != non-critical. See, the artists discovered that this space was incredibly handy during rendering of crowds. And since space was only needed overnight, say, the space used could balloon up and down without causing any long-term problems. The admins discovered this when the storage went down for some reason, and the artists began to cry -- a day or two of production was lost because the storage had become important to one side without the other realizing it.

So the admins fixed things and moved on, because the artists need to get things done. That's why he's there. And if he does his job well, the artists can do wonderful things. He described watching "Madagascar", and seeing the crowd scenes -- the ones the admins and artists had sweated over. And they were good. But the rendering of the water in other scenes was amazing -- it blew him away, it was so realistic. And the artists had never even mentioned that; they'd just made magic.

Understand that your users are going to use your infrastructure in ways you never thought possible; what matters is what gets put on the screen.

Challenges remain:

  • Sometimes data really does need to be at another site, and caching doesn't always prevent problems. And problems in a data render farm (which is using all this data) tend to break everything else.

  • Much needs to be automated: provisioning, re-provisioning and allocating storage is mostly done by hand.

  • Disk utilization is hard to get in real time with > 4 PB of storage world wide; it can take 12 hours to get a report on usage by department on 75 TB, and that doesn't make the project managers happy. Maybe you need a team for that...or maybe you're too busy recovering from knocking over the filer by walking 75 TB of data to get usage by department.

  • Notifications need to be improved. He'd love to go from "Hey, a render farm just fell over!" to "Hey, a render farm's about to fall over!"

  • They still need configuration management. They have a homegrown one that's working so far. QOTD: "You can't believe how far you can get with duct tape and baling wire and twine and epoxy and post-it notes and Lego and...we've abused the crap out of free tools."

I went up afterwards and congratulated him on a good talk; his passion really came through, and it was amazing to me that a place as big as DW uses the same tools I do, even if it is on a much larger scale.

I highly recommend watching his talk (FIXME: slides only for now). Do it now; I'll be here when you get back.

During the break I got to meet Ben Rockwood at last. I've followed his blog for a long time, and it was a pleasure to talk with him. We chatted about Ruby on Rails, Twitter starting out on Joyent, upcoming changes in Illumos now that they've got everyone from Sun but Jonathan Schwartz (no details except to expect awesome and a renewed focus on servers, not desktops), and the joke that Joyent should just come out with it and call itself "Sun". Oh, and Joyent has an office in Vancouver. Ben, next time you're up, drop me a line!

Next up: Twitter. 165 million users, 90 million tweets per day, 1000 tweets per second....unless the Lakers win, in which case it peaks at 3085 tweets per second. (They really do get TPS reports.) 75% of those are by API -- not the website. And that percentage is increasing.

Lessons learned:

  • Nothing works the first time; scale using the best available tech and plan to build everything more than once.

  • (Cron + ntp) x many machines == enough load on, say, the central syslog collector to cause micro outages across the site. (Oh, and speaking of logging: don't forget that syslog truncates messages > MTU of packet.)

  • RRDtool isn't good for them, because by the time you want to figure out what that one-minute outage two weeks ago was all about, RRDtool has averaged away the data. (At this point Toby Oetiker, a few seats down from me, said something I didn't catch. Dang.)

  • Ops mantra: find the weakest link; fix it; repeat. Ops stats: MTTD (mean time to detect a problem) and MTTR (mean time to recover from one).

  • It may be more important to fix the problem and get things going again than to have a post-mortem right away.

  • At this scale, at this time, system administration turns into a large programming project (because all your info is in your config. mgt tool, correct?). They use Puppet + hundreds of Puppet modules + SVN + post-commit hooks to ensure code reviews.

  • Occasionally someone will make a local change, then change permissions so that Puppet won't change it. This has led to a sysadmin mantra at Twitter: "You can't chattr +i with broken fingers."

  • Curve fitting and other basic statistical tools can really help -- they were able to predict the Twitpocalypse (first tweet ID > 2^32) to within a few hours.

  • Decomposition is important to resiliency. Take your app and break it into n different independent, non-interlocked services. Put each of them on a farm of 20 machines, and now you no longer care if a machine that does X fails; it's not the machine that does X.

  • Because of this Nagios was not a good fit for them; they don't want to be alerted about every single problem, they want to know when 20% of the machines that do X are down.

  • Config management + LDAP for users and machines at an early, early stage made a huge difference in ease of management. But this was a big culture change, and management support was important.

And then...lunch with Victor and his sister. We found Good Karma, which had really, really good vegan food. I'm definitely a meatatarian, but this was very tasty stuff. And they've got good beer on tap; I finally got to try Pliny the Elder, and now I know why everyone tries to clone it.

Victor talked about one of the good things about config mgt for him: yes, he's got a smaller number of machines, but when he wants to set up a new VM to test something or other, he can get that many more tests done because he's not setting up the machine by hand each time. I hadn't thought of this advantage before.

After that came the Facebook talk. I paid a little less attention to this, because it was the third ZOMG-they're-big talk I'd been to today. But there were some interesting bits:

  • Everyone talks about avoiding hardware as a single point of failure, but software is a single point of failure too. Don't compound things by pushing errors upstream.

  • During the question period I asked them if it would be totally crazy to try different versions of software -- something like the security papers I've seen that push web pages through two different VMs to see if any differences emerge (though I didn't put it nearly so well). Answer: we push lots of small changes all the time for other reasons (problems emerge quickly, so easier to track down), so in a way we do that already (because of staged pushes).

  • Because we've decided to move fast, it's inevitable that problems will emerge. But you need to learn from those problems. The Facebook outage was an example of that.

  • Always do a post-mortem when problems emerge, and if you focus on learning rather than blame you'll get a lot more information, engagement and good work out of everyone. (And maybe the lesson will be that no one was clearly designated as responsible for X, and that needs to happen now.)

The final speech of the conference was David Blank-Edelman's keynote on the resemblance between superheroes and sysadmins. I watched for a while and then left. I think I can probably skip closing keynotes in the future.

And then....that was it. I said goodbye to Bob the Norwegian and Claudio, then I went back to my room and rested. I should have slept but I didn't; too bad, 'cos I was exhausted. After a while I went out and wandered around San Jose for an hour to see what I could see. There was the hipster cocktail bar called "Cantini's" or something; billiards, flood pants, cocktails, and the sign on the door saying "No tags -- no colours -- this is a NEUTRAL ZONE."

I didn't go there; I went to a generic looking restaurant with room at the bar. I got a beer and a burger, and went back to the hotel.

Tags: lisa scaryvikingsysadmins

Anyone who's anyone

I missed my chance, but I think I'm gonna get another...

-- Sloan

Thursday morning brought Brendan Gregg's (nee Sun, then Oracle, and now Joyent) talk about data visualization. He introduced himself as the shouting guy, and talked about how heat maps allowed him to see what the video demonstrated in a much more intuitive way. But in turn, these require accurate measurement and quantification of performance: not just "I/O sucks" but "the whole op takes 10 ms, 1 of which is CPU and 9 of which is latency."

Some assumptions to avoid when dealing with metrics:

  • The available metrics are correctly implemented. Are you sure there's not a kernel bug in how something is measured? He's come across them.

  • The available metrics are designed by performance experts. Mostly, they're kernel developers who were trying to debug their work, and found that their tool shipped.

  • The available metrics are complete. Unless you're using DTrace, you simply won't always find what you're looking for.

He's not a big fan of using IOPS to measure performance. There are a lot of questions when you start talking about IOPS. Like what layer?

  • app
  • library
  • sync call
  • VFS
  • filesystem
  • RAID
  • device

(He didn't add political and financial, but I think that would have been funny.)

Once you've got a number, what's good or bad? The number can change radically depending on things like library/filesystem prefetching or readahead (IOPS inflation), read caching or write cancellation (deflation), the size of a read (he had an example demonstrating how measured capacity/busy-ness changes depending on the size of reads)...probably your company's stock price, too. And iostat or your local equivalent averages things, which means you lose outliers...and those outliers are what slow you down.

IOPS and bandwidth are good for capacity planning, but latency is a much better measure of performance.

And what's the best way of measuring latency? That's right, heatmaps. Coming from someone who worked on Fishworks, that's not surprising, but he made a good case. It was interesting to see how it's as much art as science...and given that he's exploiting the visual cortex to make things clear that never were, that's true in a few different ways.

This part of the presentation was so visual that it's best for you to go view the recording (and anyway, my notes from that part suck).

During the break, I talked with someone who had worked at Nortel before it imploded. Sign that things were going wrong: new execs come in (RUMs: Redundant Unisys Managers) and alla sudden everyone is on the chargeback model. Networks charges ops for bandwidth; ops charges networks for storage and monitoring; both are charged by backups for backups, and in turn are charged by them for bandwidth and storage and monitoring.

The guy I was talking to figured out a way around this, though. Backups had a penalty clause for non-performance that no one ever took advantage of, but he did: he requested things from backup and proved that the backups were corrupt. It got to the point where the backup department was paying his department every month. What a clusterfuck.

After that, a quick trip to the vendor area to grab stickers for the kids, then back to the presentations.

Next was the 2nd day of Practice and Experience Reports ("Lessons Learned"). First up was the network admin (?) for ARIN, talking about IPv6 migration. This was interesting, particularly as I'd naively assumed that, hey, they're ARIN and would have no problems at all on this front...instead of realizing that they're out in front to take a bullet for ALL of us, man. Yeah. They had problems, they screwed up a couple of times, and came out battered but intact. YEAH!

Interesting bits:

  • Routing is not as reliable, not least because for a long time (and perhaps still) admins were treating IPv6 as an experiment, something still in beta: there were times when whole countries in Europe would disappear off the map for weeks as admins tried out different things.

  • Understanding ICMPv6 is a must. Naive assumptions brought over from IPv4 firewalls, like "Hey, let's block all ICMP except ping", will break things in wonderfully subtle ways. For example: it's up to the client, not the router, to fragment packets. That means the client needs to discover the path MTU, and that depends on ICMPv6.

  • Not all transit is equal; ask your vendor if they're using the same equipment to route both protocols, or if IPv6 is on the old, crappy stuff they were going to eBay. Ask if they're using tunnels; tunnels aren't bad in themselves, but can add multiple layers to things and make things trickier to debug. (This goes double if you've decided to firewall ICMPv6...)

  • They're very happy with OpenBSD's pf as an IPv6 firewall.

  • Dual-stack OSes make things easy, but can make policy complex. Be aware that DHCPv6 is not fully supported (and yes, you need it to hand out things like DNS and NTP), and some clients (believe he said XP) would not do DNS lookups over v6 -- only v4, though they'd happily go to v6 servers once they got the DNS records.

  • IPv6 security features are a double-edged sword: yes, you can set up encrypted VPNs, but so can botnets. Security vendors are behind on this; he's watching for neat tricks that'll allow you to figure out private keys for traffic and thus decrypt eg. botnet C&C, but it's not there yet. (My notes are fuzzy on this, so I may have it wrong.)

  • Multicast is an attack-and-discovery protocol, and he's a bit worried about a possible return of reflection attacks (Smurfv6). He's hopeful that the many, many lessons learned since then mean it won't happen, but it is a whole new protocol and set of stacks for baddies to explore and discover. (RFC 4942 was apparently important...)

  • Proxies are good for v4-only hosts: mod_proxy, squid and 6tunnel have worked well (6tunnel in particular).

  • Gotchas: reverse DNS can be painful, because v6 macros/generate statements don't work in BIND yet; IPv6 takes precedence in most cases, so watch for SSH barfing when it suddenly starts seeing new hosts.

Next up: Internet on the Edge, a good war story about bringing wireless through trees for DARPA that won best PER. Worth watching. (Later on, I happened across the person who presented and his boss in the elevator, and I congratulated him on his presentation. "See?" says his boss, and digs him in the ribs. "He didn't want to present it.")

Finally there was the report from one of the admins who helped set up Blue Gene 6, purchased from IBM. (The speaker was much younger than the others: skinny, pale guy with a black t-shirt that said GET YOUR WAR ON. "If anyone's got questions, I'm into that...") This report was extremely interesting to me, especially since I've got an upcoming purchase for a (much, much smaller) cluster coming up.

Blue Gene is a supercomputer with something like 10k nodes, and it uses 10Gb/s Myrinet/Myricom (FIXME: Clarify which that is) cards/network for communication. Each node does source routing, so latency is extremely low, throughput correspondingly high, and core routers correspondingly simple. To make this work, every card needs to have a map of the network so it knows where to send stuff, and that map needs to be generated by a daemon that then distributes the map everywhere. Fine, right? Wrong:

  • The Myricom switch is admin'd by a web interface only: no CLI of any sort, no logging to syslog, nothing. Using this web interface becomes impractical when you've got thousands of nodes...

  • There's an inherent fragility in this design: a problem with a card means you need to turn off the whole node; a problem with the mapping daemon means things can get corrupt real quick.

And guess what? They had problems with the cards: a bad batch of transceivers meant that, over the 2-year life of the machine, they lost a full year's worth of computing. It took a long time to realize the problem, it took a long time to get the vendor to realize it, and it took longer to get it fixed (FIXME: Did he ever get it fixed?)

So, lessons learned:

  • Vendor relations should not start with a problem. If the first time you call them up is to say "Your stuff is breaking", you're doomed. Dealing with vendor problems calls for social skills first, and tech skills second. Get to know more than just the sales team; get familiar with the tech team before you need them.

  • Know your systems inside and out before they break; part of their problem was not being as familiar with things as they should have been.

  • Have realistic expectations when someone says "We'll give you such a deal on this equipment!" That's why they went w/Myricom -- it was dirt cheap. They saved money on that, but it would have been better spent on hiring more people. (I realize that doesn't exactly make sense, but that's what's in my notes.)

  • Don't pay the vendor 'til it works. Do your acceptance testing, but be aware of subcontractor relations. In this case, IBM was providing Blue Gene but had subcontracted Myricom -- and already paid them. Oops, no leverage. (To be fair, he said that Myricom did help once they were convinced...but see the next point.)

  • Have an agreement in advance with your vendor about how much failure is too much. In their case, the failure rate was slow but steady, and Myricom kept saying "Oh, let's just let it shake out a little longer..." It took a lot of work to get them to agree to replace the cards.

  • Don't let vendors talk to each other through you. In their case, IBM would tell them something, and they'd have to pass that on to Myricom, and then the process would reverse. There were lots of details to keep track of, and no one had the whole picture. Setting up a weekly phone meeting with the vendors helped immensely.

  • Don't wait for the vendors to do your work. Don't assume that they'll troubleshoot something for you.

  • Don't buy stuff with a web-only interface. Make sure you can monitor things. (I'm looking at you, Dell C6500.)

  • Stay positive at all costs! This was a huge, long-running problem that impaired an expensive and important piece of equipment, and resisting pessimism was important. Celebrate victories locally; give positive feedback to the vendors; keep reminding everyone that you are making progress.

Question from me: How much of this advice depends on being involved in negotiations? Answer: maybe 50%; acceptance testing is a big part of it (and see previous comments about that) but vendor relations is the other part.

I was hoping to talk to the presenter afterward, but it didn't happen; there were a lot of other people who got to him first. :-) But what I heard (and heard again later from Victor) confirmed the low opinion of the Myrinet protocol/cards...man, there's nothing there to inspire confidence.

And after that came the talk by Adam Moskowitz on becoming a senior sysadmin. It was a list of (at times strongly) suggested skills -- hard, squishy, and soft -- that you'll need. Overarching all of it was the importance of knowing the business you're in and the people you're responsible to: why you're doing something ("it supports the business by making X, Y and Z easier" is the correct answer; "it's cool" is not) , explaining it to the boss and the boss' boss, respecting the people you work with and not looking down on them because they don't know computers. Worth watching.

That night, Victor, his sister and I drove up to San Francisco to meet Noah and Sarah at the 21st Amendment brewpub. The drive took two hours (four accidents on the way), but it was worth it: good beer, good food, good friends, great time. Sadly I was not able to bring any back; the Noir et Blanc was awesome.

One good story to relate: there was an illustrator at the party who told us about (and showed pictures of) a coin she's designing for a client. They gave her the Three Wolves artwork to put on the coin. Yeah.


Tags: lisa scaryvikingsysadmins

Scary Viking Sysadmins

+10 LART of terror. (Quote from Matt.)

Tags: lisa scaryvikingsysadmins

A-Side Wins

I raise my glass to the cut-and-dried,  
To the amplified  
I raise my glass to the b-side.  

-- Sloan, "A-Side Wins"  

Tuesday morning I got paged at 4:30am about /tmp filling up on a webserver at work, and I couldn't get back to sleep after that. I looked out my window at Venus, Saturn, Spica and Arcturus for a while, blogged & posted, then went out for coffee. It was cold -- around 4 or 5C. I walked past the Fairmont and wondered at the expensive cars in their front parking space; I'd noticed something fancy happening last night, and I've been meaning to look it up.

Two buses with suits pulled up in front of the Convention Centre; I thought maybe there was going to be a rumble, but they were here for the Medevice Conference that's in the other half of the Centre. (The Centre, by the way, is enormous. It's a little creepy to walk from one end to the other, in this enormous empty marble hall, followed by Kenny G the whole way.)

And then it was tutorial time: Cfengine 3 all day. I'd really been looking forward to this, and it was pretty darn good. (Note to myself: fill out the tutorial evaluation form.) Mark Burgess his own bad self was the instructor. His focus was on getting things done with Cfengine 3: start small and expand the scope as you learn more.

At times it dragged a little; there was a lot of time spent on niceties of syntax and the many, many particular things you can do with Cf3. (He spent three minutes talking about granularity of time measurement in Cf3.)

Thus, by the 3rd quarter of the day we were only halfway through his 100+ slides. But then he sped up by popular request, and this was probably the most valuable part for me: explaining some of the principles underlying the language itself. He cleared up a lot of things that I had not understood before, and I think I've got a much better idea of how to use it. (Good thing, too, since I'm giving a talk on Cf2 and Cf3 for a user group in December.)

During the break, I asked him about the Community Library. This is a collection of promises -- subroutines, basically -- that do high-level things like add packages, or comment-out sections of a file. When I started experimenting with Cf3, I followed the tutorials and noticed that there were a few times where the CL promises had changed (new names, different arguments, etc). I filed a bug and the documentation was fixed, but this worried me; I felt like libc's printf() had suddenly been renamed showstuff(). Was this going to happen all the time?

The answer was no: the CL is meant to be immutable; new features are appended, and don't replace old ones. In a very few cases, promises have been rewritten if they were badly implemented in the first place.

At lunch, I listened to some people in Federal labs talk about supercomputer/big cluster purchases. "I had a thirty-day burnin and found x, y and z wrong..." "You had 30 days? Man, we only have 14 days." "Well, this was 10 years ago..." I was surprised by this; why wouldn't you take a long time to verify that your expensive hardware actually worked?

User pressure is one part; they want it now. But the other part is management. They know that vendors hate long burn-in periods, because there's a bunch of expensive shiny that you haven't been paid for yet getting banged around. So management will use this as a bargaining chip in the bidding process: we'll cut down burn-in if you'll give us something else. It's frustrating for the sysadmins; you hope management knows what they're doing.

I talked with another sysadmin who was in the Cf3 class. He'd recently gone through the Cf2 -> Cf3 conversion; it took 6 months and was very, very hard. Cf3 is so radically different from Cf2 that it took a long time to wrap his head around how it/Mark Burgess thought. And then they'd come across bugs in documentation, or bugs in implementation, and that would hold things up.

In fact, version 3.1 has apparently just come out, fixing a bug that he'd tripped across: inserting a file into the middle of another file truncated that file. Cf3 would divide the first file in two (as requested), insert the bit you wanted, then throw away the second half rather than glom it back on. Whoops.

As a result, they're evaluating Puppet -- yes, even after 6 months of effort to port...in fact, because it took 6 months of effort to port. And because Puppet does hierarchical inheritance, whereas Cf3 only does sets and unions of sets. (Which MB says is much more flexible and simple: do Java class hierarchies really simplify anything?)

After all of that, it was time for supper. Matt and I met up with a few others and headed to The Loft, based on some random tweet I'd seen. There was a long talk about interviews, and I talked to one of the people about what it's like to work in a secret/secretive environment.

Secrecy is something I keep bumping up against at LISAs; there are military folks, government folks (and not just US), and folks from private companies that just don't talk a lot about what they do. I'm very curious about all of this, but I'm always reluctant to ask...I don't want to put anyone in an awkward spot. OTOH, they're probably used to it.

After that, back to the hotels to continue the conversation with the rapidly dwindling supplies of free beer, then off to the Fedora 14 BoF that I promised Beth Lynn I'd attend. It was interesting, particularly the mention of Fedora CSI ("Tonight on NBC!"), a set of CC-licensed system administration documentation. David Nalley introduced it by saying that, if you change jobs every few years like he does, you probably find yourself building the same damn documentation from scratch over and over again. Oh, and the Fedora project is looking for a sysadmin after burning through the first one. Interesting...

And then to bed. I'm not getting nearly as much sleep here as I should.

Tags: lisa scaryvikingsysadmins

Nothing left to make me want to stay

Growing up was wall-to-wall excitement, but I don't recall
Another who could understand at all...

-- Sloan

Monday: day two of tutorials. I found Beth Lynn in the lobby and congratulated her on being very close to winning her bet; she's a great deal closer than I would have guessed. She convinced me to show up at the Fedora 14 BoF tomorrow.

First tutorial was "NASes for the Masses" with Lee Damon, which was all about how to do cheap NASes that are "mostly reliable" -- which can be damn good if your requirements are lower, or your budget smaller. You can build a multi-TB RAID array for about $8000 these days, which is not that bad at all. He figures these will top out at around 100 users...200-300 users and you want to spend the money on better stuff.

The tutorial was good, and a lot of it was stuff I'd have liked to know about five years ago when I had no budget. (Of course, the disk prices weren't nearly so good back then...) At the moment I've got a good-ish budget -- though, like Damon, Oracle's ending of their education discount has definitely cut off a preferred supplier -- so it's not immediately relevant for me.

QOTD:

Damon: People load up their file servers with too much. Why would you put MSSQL on your file server?

Me: NFS over SQL.

Matt: I think I'm going to be sick.

Damon also told us about his experience with Linux as an NFS server: two identical machines, two identical jobs run, but one ran with the data mounted from Linux and the other with the data mounted from FreeBSD. The FreeBSD server gave a 40% speed increase. "I will never use Linux as an NFS server again."

Oh, and a suggestion from the audience: smallnetbuilder.com for benchmarks and reviews of small NASes. Must check it out.

During the break I talked to someone from a movie studio who talked about the legal hurdles he had to jump in his work. F'r example: waiting eight weeks to get legal approval to host a local copy of a CSS file (with an open-source license) that added mouseover effects, as opposed to just referring to the source on its original host.

Or getting approval for showing 4 seconds of one of their movies in a presentation he made. Legal came back with questions: "How big will the screen be? How many people will be there? What quality will you be showing it at?" "It's a conference! There's going to be a big screen! Lots of people! Why?" "Oh, so it's not going to be 20 people huddled around a laptop? Why didn't you say so?" Copyright concerns? No: they wanted to make sure that the clip would be shown at a suitably high quality, showing off their film to the best effect. "I could get in a lot of trouble for showing a clip at YouTube quality," he said.

The afternoon was "Recovering from Linux Hard Drive Disasters" with Ted Ts'o, and this was pretty amazing. He covered a lot of material, starting with how filesystems work and ending with deep juju using debugfs. If you ever get the chance to take this course, I highly recommend it. It is choice.

Bits:

  • ReiserFS: designed to be very, very good at handling lots of little files, because of Reiser's belief that the line between databases and filesystems should be erased (or at least a lot thinner than it is). "Thus, ReiserFS is the perfect filesystem if you want to store a Windows registry."

  • Fsck for ReiserFS works pretty well most of the time; it scans the partition looking for btree nodes (is that the right term?) (ReiserFS uses btrees throughout the filesystem) and then reconstructs the btree (ie, your filesystem) with whatever it finds. Where that falls down is if you've got VM images which themselves have ReiserFS filesystems...everything gets glommed together and it is a big, big mess.

  • BtrFS and ZFS both very cool, and nearly feature-identical though they take very different paths to get there. Both complex enough that you almost can't think of them as a filesystem, but need to think of them in software engineering terms.

  • ZFS was the cure for the "filesystems are done" syndrome. But it took many, many years of hard work to get it fast and stable. BtrFS is coming up from behind, and still struggles with slow reads and slowness in an aged FS.

  • Copy-on-write FS like ZFS and BtrFS struggle with aged filesystems and fragmentation; benchmarking should be done on aged FS to get an accurate idea of how it'll work for you.

  • Live demos with debugfs: Wow.

I got to ask him about fsync() O_PONIES; he basically said if you run bleeding edge distros on your laptop with closed-source graphics drivers, don't come crying to him when you lose data. (He said it much, much nicer than that.) This happens because ext4 assumes a stable system -- one that's not crashing every few minutes -- and so it can optimize for speed (which means, say, delaying sync()s for a bit). If you are running bleeding edge stuff, then you need to optimize for conservative approaches to data preservation and you lose speed. (That's an awkward sentence, I realize.)

I also got to ask him about RAID partitions for databases. At $WORK we've got a 3TB disk array that I made into one partition, slapped ext3 on, and put MySQL there. One of the things he mentioned during his tutorial made me wonder if that was necessary, so I asked him what the advantages/disadvantages were.

Answer: it's a tradeoff, and it depends on what you want to do. DB vendors benchmark on raw devices because it gets a lot of kernel stuff out of the way (volume management, filesystems). And if you've got a SAN where you can a) say "Gimme a 2.25TB LUN" without problems, and b) expand it on the fly because you bought an expensive SAN (is there any other kind?), then you've got both speed and flexibility.

OTOH, maybe you've got a direct-attached array like us and you can't just tell the array to double the LUN size. So what you do is hand the raw device to LVM and let it take care of resizing and such -- maybe with a filesystem, maybe not. You get flexibility, but you have to give up a bit of speed because of the extra layers (vol mgt, filesystem).

Or maybe you just say "Screw it" like we have, and put a partition and filesystem on like any other disk. It's simple, it's quick, it's obvious that there's something important there, and it works if you don't really need the flexibility. (We don't; we fill up 3TB and we're going to need something new anyhow.)
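
If we ever did want the flexibility, the middle (LVM) option is a pretty short sketch. Something like this -- device names and sizes are made up, and it's ext3 since that's what we've got:

    # Hand the whole array to LVM instead of partitioning it directly.
    pvcreate /dev/sdb                      # the array as a physical volume
    vgcreate dbvg /dev/sdb                 # volume group for the database
    lvcreate -n mysql -L 2T dbvg           # leave some room to grow
    mkfs.ext3 /dev/dbvg/mysql
    mount /dev/dbvg/mysql /var/lib/mysql

    # Later, when you need the space:
    lvextend -L +500G /dev/dbvg/mysql
    resize2fs /dev/dbvg/mysql              # ext3 can grow online on recent kernels

The extra layers cost a bit of speed, but lvextend plus resize2fs later is a lot nicer than redoing a partition.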

And that was that. I called home and talked to the wife and kids, grabbed a bite to eat, then headed to the OpenDNS BoF. David Ulevitch did a live demo of how anycast works for them, taking down one of their servers to show the routing tables adjust. (If your DNS lookup took an extra few seconds in Amsterdam, that's why.) It was a little unsettling to see the log of queries flash across the screen, but it was quick and I didn't see anything too interesting.

After that, it was off to the Gordon Biersch pub just down the street. The food was good, the beer was free (though the Marzen tasted different than at the Fairmont...weird), and the conversation was good. Matt and Claudio tried to set me straight on US voter registration (that is, registering as a Democrat/Republican/Independent); I think I understand now, but it still seems very strange to me.

Tags: lisa scaryvikingsysadmins beer mysql

Money City Maniacs

Hey you!
We've been around for a while.
If you'll admit that you were wrong, then we'll admit that we're right.

-- Sloan

After posting last night, a fellow UBC-ite and I went looking for drinks. We eventually settled on the bar at the Fairmont. The Widsomething Imperial IPA was lovely, as was the Gordon Biersch (spelling, I'm sure) Marzen...never had a Marzen before and it was lovely. (There was a third beer, but it wasn't very good. Mentioning it would ruin my rhythm.) What was even lovelier was that the coworker picked up the tab for the night. I'm going to invite him drinking a lot more from now on.

Sunday was day one of tutorials. In the morning was "Implementing DNSSEC". As some of the complaints on Twitter mentioned, the implementation details were saved for the last quarter of the tutorial. I'm not very familiar with DNSSEC, though, so I was happy with the broader scope...and as the instructor pointed out, BIND 9.7 has made a lot of it pretty easy, and the walkthrough is no longer as detailed as it once had to be.

Some interesting things:

  • He mentioned not being a big believer in dynamic zones previously...and now he runs 40 zones and they're ALL dynamic. This is an especially nice thing now that he's running DNSSEC.

  • Rackspace is authoritative for 1.1 million zones...so startup time of the DNS server is important; you can't sit twiddling your thumbs for several hours while you wait for the records to load.

  • BIND 10 (did I mention he works for the ISC?) will have a database backend built right in. Not sure if he meant that text records would go away entirely, or if this would be another backend, or if it'd be used to generate text files. Still, interesting.

  • DNSSEC failure -- ie, a failure of your upstream resolver to validate the records/keys/whatever -- is reported as SERVFAIL rather than something more specific. Why? To keep (say) Windows 3.1 clients, necessary to the Education Department of the fictional state of East Carolina, working...they are not going to be updated, and you can't break DNS for them.

  • Zone signatures: root (.) is signed (by Verisign; uh-oh); .net is signed as of last week; .com is due next March. And there are still registrars that shrug when you ask them when they're going to support DS records. As he said, implement it now or start hemorrhaging customers.

  • Another reason to implement it now, if you're an ISP: because the people who will call in to notify you of problems are the techie early adopters. Soon, it'll be Mom and Dad, and they're not going to be able to help you diagnose it at all.

  • Go look at dnsviz.net

  • Question that he gets a lot: what kind of hardware do I need to serve X many customers? Answer: there isn't one; too many variables. But what he does suggest is to take your hardware budget, divide by 3, and buy what you can for that much. Congratulations: you now have 3 redundant DNS servers, which is a lot better than trying to guess the right size for just one.

  • A crypto offload card might be a good thing to look at if you have a busy resolver. But they're expensive. If your OS supports it, look into GPU support; a high-end graphics card is only a few hundred dollars, and apparently works quite well.

On why DNSSEC is important:

  • "I put more faith in the DNS system than I do in the public water system. I check my email in bed with my phone before I have a shower in the morning."

  • "As much as I have privacy concerns about Google, I have a lot more concerns about someone pretending to be Google."

On stupid firewall assumptions about DNS:

  • AOL triggered heartburn a ways back when replies re: MX records started exceeding 512 bytes...which everyone knew was impossible and/or wrong. (It's not.) Suddenly people had weird problems trying to email AOL.

  • Some version of Cisco's stateful packet inspection assumes that any DNS reply over 512 bytes is clearly bogus. It's not, especially with DNSSEC.

  • If I remember correctly (notes are fuzzy on this point), a reply over 512 bytes gets you a truncated UDP packet that holds what it can, with the TC flag set to say "retry this query over TCP for the full answer." But there are a large number of firewall tutorials that advise you to turn off DNS over TCP. (My own firewall may be set up like that...need to fix that when I get back; quick sketch below.)
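
Since I mentioned my own firewall, here's roughly what I want to check and fix when I get home. The resolver address is a placeholder, and which iptables chain you touch depends on where the resolver sits relative to the firewall:

    # Does a big (DNSSEC-sized) reply make it back over UDP?
    dig +dnssec +bufsize=4096 . DNSKEY @192.0.2.1    # placeholder resolver
    # Does falling back to TCP work at all?
    dig +tcp . DNSKEY @192.0.2.1

    # If not: DNS needs TCP as well as UDP.
    iptables -A FORWARD -p udp --dport 53 -j ACCEPT
    iptables -A FORWARD -p tcp --dport 53 -j ACCEPT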

When giving training on DNS in early 2008, he came to a slide about cache poisoning. There was another ISC engineer there to help him field questions, give details and so on, and that engineer kept turning paler and paler as the cache-poisoning material went by. This was right before the break; as soon as the class was dismissed, the engineer came up to him and said, "How many more of those damn slides do you have?" "That's all, why?" "I can't tell you. But let's just say that in a year, DNSSEC will be a lot more important."

The instructor laughed in his face, because he'd been banging his head against that brick wall for about 10 years. But the engineer was one of the few who knew about the Kaminsky attack, and had been sworn to secrecy.

Lunch! Good lunch, and I happened, along with Bob the Norwegian, to be nearly first in line. Talked to tablemates from a US gov't lab, and they mentioned the competition between labs. They described how they moved an old supercomputer over beside a new supercomputing cluster, and landed a spot on the Top 500 list for...a week, 'til someone else got it. And there were a couple of admins from the GPS division of John Deere, because tractors are all GPS-guided these days when plowing the fields.

Sunday afternoon was "Getting it out the door successfully", a tutorial on project management, with Strata Rose-Chalup. This was good; there were some things I already knew (but was glad to see confirmed), and a lot more besides...including stuff I need to implement. Like: if startup error messages are benign, then a) ideally don't emit them, or b) at least document them so that other people (customers, testers, future coders) know they're benign.

QOTD:

  • "Point to the wall, where you have a velvet-covered clue-by-four and chocolate. Ask, 'Which would you like to have your behaviour modified by today?'"

"What do you do if your product owner is an insane jackass?" "If your product owner is an insane jackass, then you have a typical product..." But srsly: many people choose to act like this when they feel they're not being listened to them. Open up your meetings and let them see what's on the table. Bring in their peers, too; that way their choice will be to act like a jackass in front of their peers, or to moderate their demands.

Tip from the audience: when faced with impossible requests, don't say "No". Just bring up the list of stuff you're already working on, and the requests/features/bugfixes that have already been agreed to, and ask them where this fits in. They'll either modify their request ('cos it's not that important to them), or you'll find a lot of other stuff moved out of your way ('cos that other stuff isn't that important to them).

After that was supper with Andy, who I hadn't seen since last year's LISA. We hit up a small Mexican place for supper (not bad), the Britannia Arms for a beer (where Matt tried to rope us into Karaoke and kept asking us to do "Freebird" with him), then the Fairmont hotel bar so Andy could get his Manhattan. (He's a bit intense about Manhattans.) It was a good time.

Tags: lisa scaryvikingsysadmins

C'mon C'mon C'mon

There's been debate and some speculation
Have you heard?

-- Sloan

I figure two months is long enough.

I'm at LISA again, this time in sunny San Jose. I took the train down this year (no reason, why do you ask?), which...well, it took a long time: I got on a bus to Seattle at 5:30am on Friday, and arrived at the San Jose train station at 10am on Saturday. I went coach; a sleeper would have been a nice addition, as the chairs are not completely comfortable for sleeping. (Probably would have got me access to the wireless too, which Amtrak's website does not mention is only available to T3h El33+.)

But oh, the leg room! I nearly wept. And the people-watching....my wife is the champ, but I can get into it too. Overheard snippets of conversation in the observation car were the best. Like this guy with silver hair, kind of like the man from Glad:

Silver: So yeah, she got into animal husbandry then and just started doing every drug on the planet. I mean, when I started doing pot, I told my parents. I told my grandparents. But she...I mean, EVERY drug on the planet.

Or the two blue-collar guys who met in the observation car and became best buds:

Buddy: Aw man, you ever go to the casinos? Now that I'm up in Washington now, I think I'm gonna check 'em out.

Guy: I dunno, I go with my friends sometimes. I don't gamble, but I'll have a few beers.

Buddy: You hear who's coming to the Tulalip? Joe Satriani, man. JOOOOOOOOOE. Joe Satriani!

Guy: Yeah, I'll hit the buffet...

And then later:

Silver: I knew it was a bad thing. I mean, she was a ten. I'm okay, but she was a TEN, you know what I mean? The other tenants were going to get jealous, and I only had enough of them to pay the mortgage.

Buddy: (separate conversation) And we caught one of those red crabs when we were up in Alaska?

Guy: Man, you won't catch me eatin' that shit.

Silver: And then she says, do you mind if I take a trip up the mountains with this doctor I met? I say, what do I have to say about it?

Buddy: What? Man, they're good eatin'. We just dropped it in a pot and boiled the sonuvabitch.

Silver: And that's when I realize she thinks we're in a relationship. I guess she's got this thing about men.

I slept badly, woke up at 3:30am and read for a while before realizing that the book of disturbing scifi stories is not really good 3:30am reading. I watched San Francisco and San Jose approach from the observation car; tons and tons of industrial land, occasionally interrupted by beautiful parks and seashore.

San Jose came at last. I had thought about walking to the convention centre, but decided against it. Glad I did, since a) it's a little further than I thought; b) it's surprisingly warm here; c) more industrial land, and d) when I did go out walking later on I managed to get completely turned around twice. I was looking for Philz Coffee, based on a recommendation from Twitter (can't bring myself yet to say "tweet"; give me six months) and got lost in Innitek land (complete with Adobe) and a Vietnamese neighbourhood before finding it at last. The coffee was pretty good; they have about two dozen varieties and they make it one cup at a time. Not sure it was worth $3.50 for a 12 oz coffee, though...NOT THAT I'M UNGRATEFUL. Thank you, @perwille.

Gotta say, downtown SJ on a Saturday is...dead. I saw maybe a dozen people in six blocks despite stores, a nearby university (they call them high schools here) and I think three museums. I have no idea where one might go for a fun time at night, but I imagine it involves another city.

So then I took a bus to sunny Cupertino. Why? To visit the retail outlet of Orion Telescopes. I've got into astronomy again (loved it as a kid), and I'm thinking of buying one of their telescopes in about a year. Since the store was only ten miles away, why not go? And since the bus goes right from the hotel to, well, pretty close, seems like it's a requirement.

Now that was fun; even more people-watching on the train. Like the Hispanic gentleman w/a handlebar moustache, a cowboy hat, tight polyester pants (he had the roundest buttocks I've ever seen on a man. I could only wonder in great admiration), a silk shirt with "K-Paz" embroidered on the back, and a button that said, in Spanish, something that was probably "DO X NOW! ASK ME HOW!" And the study in ringtones: the elderly Hispanic grandmother who had Mexican accordion music vs. the middle-aged African-American guy who had Michael Jackson's "Thriller." Man, you just don't get that where I come from.

And the contrast in neighbourhoods between San Jose (out of downtown, it was all Hispanic shops), Santa Clara ("ALL-AMERICAN CITY 2001" said the sign; Restoration Hardware to prevent white panic) and Cupertino (duelling car dealerships (Audi, Land Rover and Lexus) and antivirus companies (Symantec and Trend Micro); Critical Mass, only with scooters instead of bikes; Harley driver wearing a leather jacket with an Ed Hardy embroidered patch on the back).

Anyhow, the telescopes were neat; it was the first chance I'd really had to look at them closely. I didn't buy one (relax, Pre!). They didn't have a floor model of the one I really want, but I've got a better idea of the size now, and of what I want out of one.

And now...to post, then register. Which means going to the business centre, since Internet access costs $78/day at the Hilton with a 3KB daily cap. And the Russian mob's attempt to get my banking data by setting up a "Free DownTown WiFi" network is NOT GOING TO WORK, tvaritch.

Tags: lisa scaryvikingsysadmins

Watching Jupiter

Last year I bought a Galileoscope: a cheap (though well-made) telescope meant to celebrate the 400th anniversary of Galileo's first astronomical observations. It was $15 -- so cheap!

Jupiter has been visible all this month out our bedroom window around 4:30am, and this morning I pointed the telescope at it and saw its moons and, I think, a band across the middle. If I had a tripod to hook it up to, I would have got an even better view...but even balanced on the window, it's amazing what you can see.

Work yesterday was interesting -- which is good, because it's been a bit of a slow month. A vendor bought me coffee, and it was actually an interesting conversation. I finally got an LDAP server migrated to a VM in preparation for re-installing the host it's on; this took a while because I refused to read my own instructions for how to set up replication (sigh). And that brought up other problems, like the fact that my check for jumbo frames being enabled wasn't actually complaining about non-jumbo frames...or that the OpenSuSE machines I've got didn't get their LDAP configuration from Cfengine the way I thought.
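
The jumbo frame check I need to fix boils down to something like this -- a sketch, not the actual script, and it assumes eth0 and an MTU of 9000:

    #!/bin/bash
    # Complain loudly if jumbo frames aren't enabled on the given interface.
    IFACE=eth0
    MTU=$(ip -o link show "$IFACE" | sed -n 's/.* mtu \([0-9]*\) .*/\1/p')
    if [ "${MTU:-0}" -ne 9000 ]; then
        echo "WARNING: $IFACE MTU is ${MTU:-unknown}, not 9000 -- jumbo frames off?"
        exit 1
    fi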

All stuff to solve tomorrow...I mean, today. (Dang getting up at 4am...)

Tags: astronomy

A real live cluster

At work, I've been playing with a tiny cluster: 3 Sun V20z servers, each with a 2.2GHz dual-core Opteron and 2GB of memory. It's nothing special at all, but it's been a good way of getting familiar with Rocks.

One thing that's bitten me a few times is the documentation. The 411 service is described only in the 4.x documents, but still appears to be a going concern in the 5.x series...indeed, that's how I got LDAP and autofs working. And to test the cluster, the HPC roll documentation says to use cluster-fork...yet running cluster-fork gives me the message "cluster-fork has been replaced with 'rocks run host'", which is documented in the base roll.
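
For my own notes, the translation looks roughly like this; I'm going from memory on the exact rocks syntax, so treat it as a sketch:

    # What the HPC roll docs still say:
    cluster-fork uptime
    # What Rocks 5.x actually wants (see the base roll docs):
    rocks run host compute command="uptime"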

Tags:

Trip to Science World

My in-laws got us a family membership at Science World for Xmas last year. Yesterday I got to take my 4 year-old (how should that be hyphenated?) son for the morning. It was his third trip and my second.

We headed right for the Eureka room, which is aimed at the young 'uns, and he ran around showing me everything. "Daddy, here's a big tube where you can shoot out parachutes! And this air gun shoots balls up into the water!" We found out that you could stuff three plastic balls into the air gun at once (poom poom poom!).

Oh, and when we got home he wanted to do an experiment. He got some pennies and put them in a jar with water, to leave them for a few days and see if they would dissolve. I had a maple syrup candy in my pocket (no idea where I got it), so I threw that in too. The candy has dissolved and made the water brown, so I'm curious to see what he makes of that.

Science World is just incredible. I long to go see the grownup stuff, but even the kid stuff is enormously fun and moderately educational (though that's not my son's priority right now) (dang kids). I grew up in small towns as a kid, so trips to museums like this were rare, enormous fun. (And I never did get to go to Science North...) It's amazing to me that this stuff is right here, only a half hour away by transit. I'm still a little shocked we don't go, like, every weekend.

In other news, I've got a starter going for a batch of beer next weekend. It's a Belgian yeast, harvested from my January batch. The yeast was washed following these instructions, and the starter took off in about 18 hours. It seems to be doing quite nicely; I'll probably stick it in the fridge on Wednesday or so and cold-crash it.

The ingredients are pretty much whatever I have around the house: the last of my Gambrinus ESB, some biscuit and wheat malt, a bit of roasted barley, and the hops are Centennial, Goldings and Mt Hood. My father-in-law would call this a "ministrone" -- Italian not just for that kind of soup, but for "dog's breakfast" or "big ol' mixup". (I kind of like the idea of an Italian sounding like he's from Missouri.)

Still looking for a name; suggestions on a postcard, please. Sponsorship options are available. :-)

After that's in the bag, it's time to head back to Dan's for a shopping trip. This time, I think it'll be a 50-lb bag of plain ol' pale malt, and I'll see what difference that makes.

Tags: geekdad beer

Rule

I'm trying to get Bacula to make a separate copy of monthly full backups that can be kept off-site. To do this, I'm experimenting with its "Copy" directive. I was hoping to get a complete set of tapes ready to keep offsite before I left, but it was taking much longer than anticipated (2 days to copy 2 tapes). So I cancelled the jobs, typed unmount at bconsole, and went home thinking Bacula would just grab the right tape from the autochanger when backups came.

What I should have typed was release. release lets Bacula grab whatever tape it needs. unmount leaves Bacula unwilling to do anything on its own, and it waits for the operator (ie, me) to do something.

Result: 3 weeks of no backups. Welcome back, chump.

There are a number of things I can do to make sure this doesn't happen again. There's a thread on the Bacula-users mailing list (came up in my absence, even) detailing how to make sure something's mounted. I can use release the way Kern intended. I can set up a separate check that goes to my cell phone directly, and not through Nagios. I can run a small backup job manually on Fridays just to make sure it's going to work. And on it goes.
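
For future me, the difference (and the Friday check) in bconsole terms -- storage and job names are placeholders for whatever's in your config:

    # unmount: the drive is left to the operator, and Bacula sits and waits.
    # release: the drive is freed, and Bacula will mount whatever tape it needs.
    echo "release storage=Autochanger" | bconsole

    # Friday sanity check: kick off a small job and make sure it actually runs.
    echo "run job=TestBackup yes" | bconsole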

I knew enough not to make changes as root on Friday before going on vacation. But now I know that includes backups.

Tags: backups fail work