Christmas With Jesus

And my conscience has it stripped down to science
Why does everything displease me?
Still, I'm trying...

"Christmas with Jesus", Josh Rouse

At 3am my phone went off with a page from $WORK. It was benign, but do you think I could get back to sleep? Could I bollocks. I gave up at 5am and came down to the hotel lobby (where the wireless does NOT cost $11/day for 512 Kb/s, or $15 for 3Mb/s) to get some work done and email my family. The music volume was set to 11, and after I heard the covers of "Living Thing" (Beautiful South) and "Stop Me If You Think That You've Heard This One Before" (Mark Ronson; disco) I retreated to my hotel room to sit on my balcony and watch the airplanes. The airport is right by both the hotel and the downtown, so when you're flying in you get this amazing view of the buildings OH CRAP RIGHT THERE; from my balcony I can hear them coming in but not see them. But I can see the ones that are, I guess, flying to Japan; they go straight up, slowly, and the contrails against the morning twilight look like rockets ascending to space. Sigh.

Abluted (ablated? hm...) and then down to the conference lounge to stock up on muffins and have conversations. I talked to the guy giving the .EDU workshop ("What we've found is that we didn't need a bachelor's degree in LDAP and iptables"), and to someone else about kids these days ("We had a rich heritage of naming schemes. Do you think they're going to name their desktop after Lord of the Rings?" "Naw, it's all gonna be Twilight and Glee.").

Which brought up another story of network debugging. After an organizational merger, network problems persisted until someone figured out that each network had its own DNS servers, with inconsistent views between them. To make matters worse, one set was named Kirk and Picard, and the other was named Gandalf and Frodo. Our Hero knew then what to do, and in the Executive Summary of the post-mortem's Root Cause Diagnosis wrote "Genre Mismatch." [rimshot]

(6.48 am and the sun is rising right this moment. The earth, she is a beautiful place.)

And but so on to the HPC workshop, which intimidated me. I felt unprepared. I felt too small, too newbieish to be there. And when the guy from fucking Oak Ridge got up and said sheepishly, "I'm probably running one of the smaller clusters here," I cringed. But I needn't have worried. For one, maybe 1/3rd of the people introduced themselves as having small clusters (smallest I heard was 10 nodes, 120 cores), or being newbies, or both. For two, the host/moderator/glorious leader was truly excellent, in the best possible Bill and Ted sense, and made time for everyone's questions. For three, the participants were also generous with time and knowledge, and whether I asked questions or just sat back and listened, I learned so much.

Participants: Oak Ridge, Los Alamos, a lot of universities, and a financial trading firm that does a lot of modelling and has some really interesting, regulatory-driven filesystem requirements: nothing can be deleted for 7 years. So if someone's job blows up and litters the filesystem with crap, you can't remove the files. Sure, they're only 10-100 MB each, but with a million jobs a day that adds up. You can archive...but if the SEC shows up asking for files, they have to be produced within four hours.

The guy from Oak Ridge runs at least one of his clusters diskless: fewer moving parts to fail. Everything gets saved to Lustre. This became a requirement after a node in an earlier cluster failed with Very Important Data on its local scratch disk, and recovery took a long time. The PI (==principal investigator, for those not from an .EDU; prof/faculty member/etc who leads a lab) said, "I want to be able to walk into your server room, fire a shotgun at a random node, and have it back within 20 minutes." So, diskless. (He's also lucky because he gets biweekly maintenance windows. Another admin announces his quarterly outages a year in advance.)

There were a lot of people who ran configuration management (Cf3, Puppet, etc) on their compute nodes, which surprised me. I've thought about doing that, but assumed I'd be stealing precious CPU cycles from the science. Overwhelming response: Meh, they'll never notice. OTOH, using more than one management tool is going to cause admin confusion or state flapping, and you don't want to do that.
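
(For what it's worth, the pattern I'd try at home is a one-shot agent run out of cron with a bit of splay, rather than a daemon; the interval and splay below are numbers I made up, not anything the workshop prescribed.)

    # /etc/cron.d/puppet -- one-shot Puppet run every 30 minutes, splayed so a
    # whole rack doesn't wake up and compile catalogs at the same moment.
    # (Interval and splay picked out of a hat.)
    */30 * * * * root /usr/bin/puppet agent --onetime --no-daemonize --splay --splaylimit 600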

One guy said (both about this and the question of what installer to use), "Why are you using anything but Rocks? It's federally funded, so you've already paid for it. It works and it gets you a working cluster quickly. You should use it unless you have a good reason not to." "I think I can address that..." (laughter) Answer: inconsistency with installations; not all RPMs get installed when you're doing 700 nodes at once, so he uses Rocks for a bare-ish install and Cf3 after that -- a lot like I do with Cobbler for servers. And FAI was mentioned too, which apparently has support for CentOS now.

One .EDU admin gloms all his lab's desktops into the cluster, and uses Condor to tie it all together. "If it's idle, it's part of the cluster." No head node, jobs can be submitted from anywhere, and the dev environment matches the run environment. There's a wide mix of hardware, so part of user education is a) getting people to specify minimal CPU and memory requirements and b) letting them know that the ideal job is 2 hours long. (Actually, there were a lot of people who talked about high-turnover jobs like that, which is different from what I expected; I always thought of HPC as letting your cluster go to town for 3 weeks on something. Perhaps that's a function of my lab's work, or of having a smaller cluster.)
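
(For the curious: in Condor those requests live in the submit description file. A minimal sketch, with made-up names and values rather than anything anyone actually showed me:)

    # sketch.submit -- minimal HTCondor submit description (illustrative values).
    # Small, explicit requests match more machines in a mixed-hardware pool.
    universe        = vanilla
    executable      = analyze.sh
    arguments       = chunk_$(Process).dat
    request_cpus    = 1
    request_memory  = 2048
    log             = sketch.log
    output          = sketch.$(Process).out
    error           = sketch.$(Process).err
    queue 10

Submit with condor_submit sketch.submit; request_memory is in MB if you don't give units.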

User education was something that came up over and over again: telling people how to efficiently use the cluster, how to tweak settings (and then vetting jobs with scripts).
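
I didn't write down anyone's actual vetting script, but the shape of the idea as I understood it is a thin filter in front of the scheduler that bounces obviously bad requests before they hit the queue. A hypothetical sketch -- the checks and limits are mine, not anyone's site policy:

    #!/usr/bin/env python
    # Hypothetical pre-submission vetting sketch, not anyone's production
    # filter: catch obviously bad resource requests before they hit the queue.
    import sys

    MAX_WALLTIME_H = 48         # assumed site maximum
    MAX_MB_PER_CORE = 4096      # assumed; tune to your nodes

    def vet(cores, mem_mb, walltime_h):
        """Return a list of human-readable problems with a resource request."""
        problems = []
        if cores < 1:
            problems.append("ask for at least one core")
        if walltime_h <= 0:
            problems.append("no walltime given; the scheduler can't plan around you")
        elif walltime_h > MAX_WALLTIME_H:
            problems.append("walltime over %dh; checkpoint and resubmit instead"
                            % MAX_WALLTIME_H)
        if mem_mb <= 0:
            problems.append("no memory request; you may land on a node that can't fit you")
        elif mem_mb > cores * MAX_MB_PER_CORE:
            problems.append("over %d MB/core; ask for more cores or less memory"
                            % MAX_MB_PER_CORE)
        return problems

    if __name__ == "__main__":
        # usage: vet-job.py CORES MEM_MB WALLTIME_HOURS
        cores, mem_mb, hours = int(sys.argv[1]), int(sys.argv[2]), float(sys.argv[3])
        problems = vet(cores, mem_mb, hours)
        for p in problems:
            sys.stderr.write("rejected: %s\n" % p)
        sys.exit(1 if problems else 0)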

I asked about how people learned about HPC; there's not nearly the wealth of resources that there is for programming, sysadmin, networking, etc. Answer: yep, it's pretty quiet out there. Mailing lists tend to be product-specific (though they're pretty excellent), vendor training is always good if you can get it, but generally you need to look around a lot. ACM has started a SIG for HPC.

I asked about checkpointing, which I'd been very fuzzy about. Here's the skinny:

* The easiest and best by far is for the app to do it.  It knows its
  state intimately and is in the best position to save it.  The catch
  is that the app has to support this.  It's not necessary for it to
  explicitly save the process (as in, kernel-resident memory image,
  registers, etc); if it can look at logs or something and say "Oh,
  I'm 3/4 done", then that's good too.  (A minimal sketch of the idea
  follows this list.)

* The Condor scheduler supports checkpointing, *but* you have to link
  in its special libraries when you compile your program.  And none of
  the big vendors (Matlab, Mathematica, etc) do this.

* BLCR: "It's 90% working, but the 10% will kill you." Segfaults,
  restarts only work 2/3 of the time, etc.  Open-source project from a
  federal lab and until very recently not funded -- so the response to
  "There's this bug..." was "Yeah, we're not funded. Can't do nothing
  for you." Funding has been obtained recently, so keep your fingers
  crossed.
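
To make the first point concrete, here's roughly what app-level
checkpointing looks like.  This is my own illustration, not code from
the workshop; the filename and interval are arbitrary:

    # Minimal sketch of application-level checkpointing: record how far
    # you've gotten, and on startup resume from there instead of starting over.
    import json
    import os

    CHECKPOINT = "state.json"       # arbitrary filename

    def load_checkpoint():
        """Return the first item still to be processed (0 if starting fresh)."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["next_item"]
        return 0

    def save_checkpoint(next_item):
        # Write-then-rename so a crash mid-write can't leave a corrupt checkpoint.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_item": next_item}, f)
        os.rename(tmp, CHECKPOINT)

    def process(item):
        pass                        # stand-in for the actual science

    if __name__ == "__main__":
        start = load_checkpoint()
        for i in range(start, 1000000):
            process(i)
            if i % 10000 == 0:      # checkpoint every so often, not every iteration
                save_checkpoint(i + 1)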

One admin had problems with his nodes:  random slowdowns, not caused
by cstates or the other usual suspects.  It's a BIOS problem of some
sort and they're working it out with the vendor, but in the meantime
the only way around it is to pull the affected node and let the power
drain completely.  This was pointed out by a user ("Hey, why is my job
suddenly taking so long?") who was clever enough to write a
dirt-simple 10 million iteration for-loop that very, very obviously
took a lot longer on the affected node than the others.  At this point
I asked if people were doing regular benchmarking on their clusters to
pick up problems like this.  Answer: no.  They'll do benchmarking on
their cluster when it's stood up so they have something to compare it
to later, but users will unfailingly tell them if something's slow.
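
My reconstruction of that canary -- not the user's actual script -- is
something like this: a fixed chunk of pointless arithmetic, timed, so
the sick node sticks out when you run it everywhere.

    # Dirt-simple node canary (my reconstruction, not the user's script):
    # ten million iterations of busywork, timed, with the hostname in the
    # output so you can compare nodes side by side.
    import socket
    import time

    N = 10 * 1000 * 1000

    start = time.time()
    total = 0
    for i in range(N):
        total += i * i
    elapsed = time.time() - start

    print("%s: %.2f seconds" % (socket.gethostname(), elapsed))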

I asked about HPL; my impression when setting up the cluster was, yes,
benchmark your own stuff, but benchmark HPL too 'cos that's what you
do with a cluster.  This brought up a host of problems for me, like
compiling it and figuring out the best parameters for it.  Answers:

* Yes, HPL is a bear.  Oak Ridge: "We've got someone for that and
  that's all he does."  (Response: "That's your answer for everything
  at Oak Ridge.")

* Fiddle with the params P, Q and N, and leave the rest alone.  You
  can predict the FLOPS you should get on your hardware, and if you
  get within 90% or so of that, you're fine.  (See the sketch after
  this list for the arithmetic.)

* HPL is not that relevant for most people, and if you tune your
  cluster for linear algebra (which is what HPL does) you may get
  crappy performance on your real work.

* You can benchmark it if you want (and download Intel's binary if you
  do; FIXME: add link), but it's probably better and easier to stick
  to your own apps.
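
The "predict the FLOPS" part is just arithmetic: peak is nodes x cores
per node x clock x FLOPs per core per cycle, and you compare HPL's
measured number against that.  A sketch with made-up hardware numbers
(check your own CPU's FLOPs/cycle figure):

    # Back-of-the-envelope HPL efficiency, with made-up hardware numbers.
    nodes = 8
    cores_per_node = 12
    clock_ghz = 2.6
    flops_per_cycle = 8             # e.g. double precision with AVX; check your CPU

    rpeak_gflops = nodes * cores_per_node * clock_ghz * flops_per_cycle
    rmax_gflops = 1800.0            # what HPL actually reported (invented here)

    print("Rpeak: %.0f GFLOPS" % rpeak_gflops)
    print("Efficiency: %.0f%%" % (100.0 * rmax_gflops / rpeak_gflops))
    # Around 90% or better: stop fiddling with P, Q and N and move on.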

Random:

* There's a significant number of clusters that expose interactive
  sessions to users via qlogin; that had not occurred to me.

* Recommended tools:
  * ubmod: accounting graphs
  * Healthcheck scripts (Warewulf)
  * stress: cluster stress test tool
  * munin: to collect arbitrary info from a machine
  * collectl: good for e.g. millisecond resolution of traffic spikes

* "So if a box gets knocked over -- and this is just anecdotal -- my
  experience is that the user that logs back in first is the one who
  caused it."

* A lot of the discussion was prompted by questions like "Is anyone
  else doing X?" or "How many people here are doing Y?"  Very helpful.

* If you have to return warranty-covered disks to the vendor but you
  really don't want the data to go, see if they'll accept the metal
  cover of the disk.  You get to keep the spinning rust.

* A lot of talk about OOM-killing in the bad old days ("I can't tell
  you how many times it took out init.").  One guy insisted it's a lot
  better now (3.x series).

* "The question of changing schedulers comes up in my group every six
  months."

* "What are you doing for log analysis?" "We log to /dev/null."
  (laughter) "No, really, we send syslog to /dev/null."

* Splunk is eye-wateringly expensive: 1.5 TB data/day =~ $1-2 million
  annual license.

* On how much disk space Oak Ridge has:  "It's...I dunno, 12 or 13 PB?
  It's 33 tons of disks, that's what I remember."

* Cheap and cheerful NFS:  OpenSolaris or FreeBSD running ZFS. For
  extra points, use an Aztec Zeus for a ZIL: a battery-backed 8GB
  DIMM that dumps to a compact flash card if the power goes out.

* Some people monitor not just for overutilization, but for
  underutilization: it's a chance for user education ("You're paying
  for my time and the hardware; let me help you get the best value for
  that").  For Oak Ridge, though, there's less pressure for that:
  scientists get billed no matter what.

* "We used to blame the network when there were problems.  Now their
  app relies on SQL Server and we blame that."

* Sweeping for expired data is important.  If it's scratch, then
  *treat* it as such: negotiate expiry dates and sweep regularly.  (A
  sketch of such a sweep is at the end of this list.)

* Celebrity resemblances: Michael Moore and the guy from Dead Poets
  Society/The Good Wife.  (Those are two different sysadmins, btw.)

* Asked about my .TK file problem; no insight.  Take it to the lists.
  (Don't think I've written about this, and I should.)

* On why one lab couldn't get Vendor X to supply DKMS kernel modules
  for their hardware:  "We're three orders of magnitude away from
  their biggest customer.  We have *no* influence."

* Another vote for SoftwareCarpentry.org as a way to get people up to
  speed on Linux.

* A lot of people encountered problems upgrading to Torque 4.x and
  rolled back to 2.5.  "The source code is disgusting.  Have you ever
  looked at it?  There's 15 years of cruft in there. The devs
  acknowledged the problem and announced they were going to be taking
  steps to fix things. One step: they're migrating to C++.
  [Kif sigh]"

* "Has anyone here used Moab Web Services? It's as scary as it sounds.
  Tomcat...yeah, I'll stop there." "You've turned the web into RPC. Again."

* "We don't have regulatory issues, but we do have a
  physicist/geologist issue."

* 1/3 of the Top 500 use SLURM as a scheduler.  SLURM's srun =~
  Torque's pbsdsh; I have the impression it does not use MPI (well,
  okay, neither does Torque, but a lot of people use Torque + mpirun),
  but I really need to do more reading.

* Lmod (FIXME: add link) is an Environment Modules-compatible
  replacement (it works with old module files) that fixes some
  problems with the old EM; it's actively developed and written in
  Lua.

* People have had lots of bad experiences with external Fermi GPU
  boxes from Dell, particularly when attached to non-Dell equipment.

* Puppet has git hooks that let you pull out a particular branch on a node.
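
Going back to the scratch-sweeping point a few bullets up, what I have
in mind is something like this; the path and the 30-day policy are
placeholders, not anyone's stated policy:

    # Minimal scratch-expiry sweep (a sketch; path and 30-day limit are
    # placeholders).  Deletes files that haven't been modified in 30 days.
    import os
    import time

    SCRATCH = "/scratch"            # assumed mount point
    MAX_AGE = 30 * 24 * 3600        # the negotiated expiry: 30 days, say
    DRY_RUN = True                  # flip to False once you trust it

    now = time.time()
    for dirpath, dirnames, filenames in os.walk(SCRATCH):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                age = now - os.lstat(path).st_mtime
            except OSError:
                continue            # file vanished underneath us; jobs do that
            if age > MAX_AGE:
                print("expiring %s (%.0f days old)" % (path, age / 86400.0))
                if not DRY_RUN:
                    os.remove(path)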

And finally:

Q: How do you know you're with a Scary Viking Sysadmin?

A: They ask for Thor's Skullsplitter Mead at the Google BoF.