The other day at $WORK, a user asked me why the jobs she was
submitting to the cluster were being deferred. They only needed one
core each, and showq showed lots of free cores, so WTF?
By the time I checked on the state of these deferred jobs, the jobs
were already running -- and yeah, there were lots of cores free.
The checkjob
command showed something interesting, though:
$ checkjob 34141 | grep Messages
Messages: cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'
At first I thought this referred to the node the job was running on now:
$ qstat -f 34141 | grep exec_host
exec_host = compute-3-5/19
but that was a red herring. (I could've also got the host from checkjob | grep -2 'Allocated Nodes'.) Instead, grepping through maui.log showed that compute-1-11 was the real problem:
/opt/maui/log $ sudo grep 34141 maui.log.3 maui.log.2 maui.log.1 maui.log |grep -E 'WARN|ERROR'
maui.log.3:03/05 16:21:48 ERROR: job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:48 ERROR: job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:50 ERROR: job '34141' cannot be started: (rc: 15041 errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN' hostlist: 'compute-1-11')
maui.log.3:03/05 16:21:50 WARNING: cannot start job '34141' through resource manager
maui.log.3:03/05 16:21:50 ERROR: cannot start job '34141' in partition DEFAULT
maui.log.3:03/05 17:21:56 ERROR: job '34141' cannot be started: (rc: 15041 errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN' hostlist: 'compute-1-11')
There were lots of messages like this; I think the scheduler kept retrying and only gave up on that node hours later.
checknode showed nothing wrong; in fact, the node was currently running a job and had 4 free cores:
$ checknode compute-1-11
checking node compute-1-11
State: Busy (in current state for 6:23:11:32)
Configured Resources: PROCS: 12 MEM: 47G SWAP: 46G DISK: 1M
Utilized Resources: PROCS: 8
Dedicated Resources: PROCS: 8
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 13.610
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [default 0:12]
Total Time: INFINITY Up: INFINITY (98.74%) Active: INFINITY (18.08%)
Reservations:
Job '33849'(8) -6:23:12:03 -> 93:00:47:56 (99:23:59:59)
JobList: 33849
maui.log showed an alert:
maui.log.10:03/03 22:32:26 ALERT: RM state corruption. job '34001' has idle node 'compute-1-11' allocated (node forced to active state)
but that was another red herring; this is common and benign.
dmesg on compute-1-11 showed the problem:
compute-1-11 $ dmesg | tail
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
<<vendor>> ASC=0x80 ASCQ=0x87ASC=0x80 <<vendor>> ASCQ=0x87
Info fld=0x10489
end_request: I/O error, dev sda, sector 66697
Aborting journal on device sda1.
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
(Linux)|Wed Mar 06 09:37:20|[compute-1-11:~]$ mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda5 on /state/partition1 type ext3 (rw)
/dev/sda2 on /var type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
sophie:/export/scratch on /share/networkscratch type nfs (rw,addr=10.1.1.1)
mount: warning /etc/mtab is not writable (e.g. read-only filesystem).
It's possible that information reported by mount(8) is not
up to date. For actual information about system mount points
check the /proc/mounts file.
but this was also logged on the head node in /var/log/messages:
$ sudo grep compute-1-11.local /var/log/* |grep -vE 'automount|snmpd|qmgr|smtp|pam_unix|Accepted publickey' > ~/rt_1526/compute-1-11.syslog
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in remtree, unlink failed on /opt/torque/mom_priv/jobs/34038.sophie.TK
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed
and in /var/log/kern:
$ sudo tail /var/log/kern
Mar 5 10:05:00 compute-1-11.local kernel: Aborting journal on device sda1.
Mar 5 10:05:01 compute-1-11.local kernel: ext3_abort called.
Mar 5 10:05:01 compute-1-11.local kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Mar 5 10:05:01 compute-1-11.local kernel: Remounting filesystem read-only
Mar 7 05:18:06 compute-1-11.local kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
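At that point there's not much to do but take the node out of the pool until it can be repaired and rebooted. A sketch of what I mean -- standard Torque commands, not what I necessarily ran at the time:
# mark the node offline so pbs_server stops offering it to Maui
$ sudo pbsnodes -o compute-1-11
# confirm: it should now show up in the down/offline list
$ pbsnodes -l
# ...and once the node has been fsck'd/rebooted, clear the offline flag
$ sudo pbsnodes -c compute-1-11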
There are a few things I've learned from this:
I've started to put some of these commands in a sub -- that's a really awesome framework from 37signals to collect commonly-used commands together. In this case, I've named the sub "sophie", after the cluster I work on (named in turn after the daughter of the PI). You can find it on GitHub or my own server (GitHub is great, but what happens when it goes away? ...but that's a rant for another day.) Right now there are only a few things in there, and they're somewhat specific to my environment, and doubtless they could be improved -- but it's helping a lot so far.
Q from a user today that took two hours (not counting this entry) to track down: why are my jobs idle when showq shows there are a lot of free processors? (Background: we have a Rocks cluster, 39 nodes, 492 cores. Torque + Maui, pretty vanilla config.)
First off, showq did show a lot of free cores:
$ showq
[lots of jobs]
192 Active Jobs 426 of 492 Processors Active (86.59%)
38 of 38 Nodes Active (100.00%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
32542 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:55
32543 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:55
32544 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:56
Okay, so why? Let's take one of those jobs:
$ checkjob 32542
checking job 32542
State: Idle
Creds: user:jdoe group:example class:default qos:DEFAULT
WallTime: 00:00:00 of 1:00:00:00
SubmitTime: Fri Feb 15 13:55:55
(Time Queued Total: 2:22:55:26 Eligible: 2:22:55:26)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 196 StartCount: 0
PartitionMask: [ALL]
Flags: HOSTLIST RESTARTABLE
HostList:
[compute-1-3:1]
Reservation '32542' (21:17:14 -> 1:21:17:14 Duration: 1:00:00:00)
PE: 1.00 StartPriority: 4255
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 216 feasible procs: 0
Rejection Reasons: [State : 1][HostList : 38]
Note the bit that says:
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
If we run "checkjob -v", we see some additional info (all the rest is the same):
Detailed Node Availability Information:
compute-2-1 rejected : HostList
compute-1-1 rejected : HostList
compute-3-2 rejected : HostList
compute-1-3 rejected : State
compute-1-4 rejected : HostList
compute-1-5 rejected : HostList
[and on it goes...]
This means that compute-1-3, one of the nodes we have, has been assigned to the job. It's busy, so it'll get to the job Real Soon Now. Problem solved!
Well, no. Because if you run something like this:
showq -u jdoe |awk '/Idle/ {print "checkjob -v " $1}' | sh
then a) you're probably in a state of sin, and b) you'll see that there are a lot of jobs assigned to compute-1-3. WTF?
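For the record, here's a slightly gentler version of the same idea -- a sketch that just greps each idle job's checkjob output for the HostList entry shown above:
# list jdoe's idle jobs whose Maui host list pins them to compute-1-3
for j in $(showq -u jdoe | awk '/Idle/ {print $1}'); do
    checkjob "$j" | grep -A1 'HostList' | grep -q 'compute-1-3' \
        && echo "$j is pinned to compute-1-3"
done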
Well, a thread I found looks pretty close to what I'm seeing. And as it turns out, the user in question submitted a lot of jobs (hundreds) all at the same time. Ganglia lost track of all the nodes for a while, so I assume that Torque did as well. (Haven't checked into that yet...trying to get this down first; documenting stuff for Rocks is always a problem for me.) The thread reply suggests qalter, but that doesn't seem to work.
While I'm at it, here's a list of stuff that doesn't work:
(Oh, and btw turns out runjob is supposed to be replaced by mjobctl, but mjobctl doesn't appear to work. True story.)
So at this point I'm stuck suggesting two things to the user:
God I hate HPC sometimes.
Helpful links that explained some of this:
And my conscience has it stripped down to science
Why does everything displease me?
Still, I'm trying...
"Christmas with Jesus", Josh Rouse
At 3am my phone went off with a page from $WORK. It was benign, but do you think I could get back to sleep? Could I bollocks. I gave up at 5am and came down to the hotel lobby (where the wireless does NOT cost $11/day for 512 Kb/s, or $15 for 3Mb/s) to get some work done and email my family. The music volume was set to 11, and after I heard the covers of "Living Thing" (Beautiful South) and "Stop Me If You Think You've Heard This One Before" (Mark Ronson; disco) I retreated back to my hotel room to sit on my balcony and watch the airplanes. The airport is right by both the hotel and the downtown, so when you're flying in you get this amazing view of the buildings OH CRAP RIGHT THERE; from my balcony I can hear them coming in but not see them. But I can see the ones that are, I guess, flying to Japan; they go straight up, slowly, and the contrail against the morning twilight looks like rockets ascending to space. Sigh.
Abluted (ablated? hm...) and then down to the conference lounge to stock up on muffins and have conversations. I talked to the guy giving the .EDU workshop ("What we've found is that we didn't need a bachelor's degree in LDAP and iptables"), and with someone else about kids these days ("We had a rich heritage of naming schemes. Do you think they're going to name their desktop after Lord of the Rings?" "Naw, it's all gonna be Twilight and Glee.")
Which brought up another story of network debugging. After an organizational merger, network problems persisted until someone figured out that each network had its own DNS servers that had inconsistent views. To make matters worse, one set was named Kirk and Picard, and the other was named Gandalf and Frodo. Our Hero knew then what to do, and in the post-mortem Root Cause Diagnosis, Executive Summary, wrote "Genre Mismatch." [rimshot]
(6.48 am and the sun is rising right this moment. The earth, she is a beautiful place.)
And but so on to the HPC workshop, which intimidated me. I felt unprepared. I felt too small, too newbieish to be there. And when the guy from fucking Oak Ridge got up and said sheepishly, "I'm probably running one of the smaller clusters here," I cringed. But I needn't have worried. For one, maybe 1/3rd of the people introduced themselves as having small clusters (smallest I heard was 10 nodes, 120 cores), or being newbies, or both. For two, the host/moderator/glorious leader was truly excellent, in the best possible Bill and Ted sense, and made time for everyone's questions. For three, the participants were also generous with time and knowledge, and whether I asked questions or just sat back and listened, I learned so much.
Participants: Oak Ridge, Los Alamos, a lot of universities, and a financial trading firm that does a lot of modelling and has some really interesting, regulatory-driven filesystem requirements: nothing can be deleted for 7 years. So if someone's job blows up and it litters the filesystem with crap, you can't remove the files. Sure, they're only 10-100 MB each, but with a million jobs a day that adds up. You can archive...but if the SEC shows up asking for files, they need to have them within four hours.
The guy from Oak Ridge runs at least one of his clusters diskless: fewer moving parts to fail. Everything gets saved to Lustre. This became a requirement when, in an earlier cluster, a node failed and it had Very Important Data on a local scratch disk, and it took a long time to recover. The PI (==principal investigator, for those not from an .EDU; prof/faculty member/etc who leads a lab) said, "I want to be able to walk into your server room, fire a shotgun at a random node, and have it back within 20 minutes." So, diskless. (He's also lucky because he gets biweekly maintenance windows. Another admin announces his quarterly outages a year in advance.)
There were a lot of people who ran configuration management (Cf3, Puppet, etc) on their compute nodes, which surprised me. I've thought about doing that, but assumed I'd be stealing precious CPU cycles from the science. Overwhelming response: Meh, they'll never notice. OTOH, using more than one management tool is going to cause admin confusion or state flapping, and you don't want to do that.
One guy said (both about this and the question of what installer to use), "Why are you using anything but Rocks? It's federally funded, so you've already paid for it. It works and it gets you a working cluster quickly. You should use it unless you have a good reason not to." "I think I can address that..." (laughter) Answer: inconsistency with installations; not all RPMs get installed when you're doing 700 nodes at once, so he uses Rocks for a bare-ish install and Cf3 after that -- a lot like I do with Cobbler for servers. And FAI was mentioned too, which apparently has support for CentOS now.
One .EDU admin gloms all his lab's desktops into the cluster, and uses Condor to tie it all together. "If it's idle, it's part of the cluster." No head node, jobs can be submitted from anywhere, and the dev environment matches the run environment. There's a wide mix of hardware, so part of user education is a) getting people to specify minimal CPU and memory requirements and b) letting them know that the ideal job is 2 hours long. (Actually, there were a lot of people who talked about high-turnover jobs like that, which is different from what I expected; I always thought of HPC as letting your cluster go to town for 3 weeks on something. Perhaps that's a function of my lab's work, or having a smaller cluster.)
User education was something that came up over and over again: telling people how to efficiently use the cluster, how to tweak settings (and then vetting jobs with scripts).
I asked about how people learned about HPC; there's not nearly the wealth of resources that there are for programming, sysadmin, networking, etc. Answer: yep, it's pretty quiet out there. Mailing lists tend to be product-specific (though are pretty excellent), vendor training is always good if you can get it, but generally you need to look around a lot. ACM has started a SIG for HPC.
I asked about checkpointing, which was something I've been very fuzzy about. Here's the skinny:
Checkpointing is freezing the process so that you can resurrect it later. It protects against node failures (maybe with automatic moving of the process/job to another node if one goes down) and outages (maybe caused by maintenance windows.)
Checkpointing can be done at a few different layers:
* The easiest and best by far is for the app to do it. It knows its
state intimately and is in the best position to do this. However,
the app needs to support this. Not necessary to have it explicitly
save the process (as in, kernel-resident memory image, registers,
etc); if it can look at logs or something and say "Oh, I'm 3/4
done", then that's good too.
* The Condor scheduler supports this, *but* you have to do this by
linking in its special libraries when you compile your program. And
none of the big vendors do this (Matlab, Mathematica, etc).
* BLCR: "It's 90% working, but the 10% will kill you." Segfaults,
restarts only work 2/3 of the time, etc. Open-source project from a
federal lab and until very recently not funded -- so the response to
"There's this bug..." was "Yeah, we're not funded. Can't do nothing
for you." Funding has been obtained recently, so keep your fingers
crossed.
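For my own reference, the basic BLCR workflow looks roughly like this (taken from the BLCR docs; I haven't used it in anger, so treat it as a sketch and the binary name as a placeholder):
cr_run ./my_app &             # start the app under BLCR's preloaded library
cr_checkpoint -f app.ckpt $!  # checkpoint the running process (by PID) to a file
cr_restart app.ckpt           # later, resurrect it from that checkpoint file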
One admin had problems with his nodes: random slowdowns, not caused
by cstates or the other usual suspects. It's a BIOS problem of some
sort and they're working it out with the vendor, but in the meantime
the only way around it is to pull the affected node and let the power
drain completely. This was pointed out by a user ("Hey, why is my job
suddenly taking so long?") who was clever enough to write a
dirt-simple 10 million iteration for-loop that very, very obviously
took a lot longer on the affected node than the others. At this point
I asked if people were doing regular benchmarking on their clusters to
pick up problems like this. Answer: no. They'll do benchmarking on
their cluster when it's stood up so they have something to compare it
to later, but users will unfailingly tell them if something's slow.
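Something like this would do as a canary, I think -- node names are placeholders, and you'd probably fan it out with pbsdsh or pdsh rather than a serial ssh loop:
# time the same trivial 10-million-iteration loop on a handful of nodes
for n in compute-1-9 compute-1-10 compute-1-11; do
    echo -n "$n: "
    ssh "$n" 'time (i=0; while [ $i -lt 10000000 ]; do i=$((i+1)); done)' 2>&1 | grep real
done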
I asked about HPL; my impression when setting up the cluster was, yes,
benchmark your own stuff, but benchmark HPL too 'cos that's what you
do with a cluster. This brought up a host of problems for me, like
compiling it and figuring out the best parameters for it. Answers:
* Yes, HPL is a bear. Oak Ridge: "We've got someone for that and
that's all he does." (Response: "That's your answer for everything
at Oak Ridge.")
* Fiddle with the params P, Q and N, and leave the rest alone. You
  can predict the FLOPS you should get on your hardware, and if you
  get to within 90% or so of that you're fine (see the
  back-of-the-envelope sketch after this list).
* HPL is not that relevant for most people, and if you tune your
cluster for linear algebra (which is what HPL does) you may get
crappy performance on your real work.
* You can benchmark it if you want (and download Intel's binary if you
do; FIXME: add link), but it's probably better and easier to stick
to your own apps.
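The back-of-the-envelope version mentioned above, using a dual-socket X5650 node like ours as the example (assumptions: 12 cores, 2.66 GHz, 4 double-precision FLOPs per core per cycle on Westmere):
# theoretical peak = cores x clock x FLOPs-per-cycle
$ echo '12 * 2.66 * 4' | bc
127.68
# so roughly 128 GFLOPS peak per node; an HPL run landing within ~90% of
# that (about 115 GFLOPS) would be considered healthy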
Random:
* There's a significant number of clusters that expose interactive
sessions to users via qlogin; that had not occurred to me.
* Recommended tools:
* ubmod: accounting graphs
* Healthcheck scripts (Warewulf)
* stress: cluster stress test tool
* munin: to collect arbitrary info from a machine
* collectl: good for, e.g., millisecond resolution of traffic spikes
* "So if a box gets knocked over -- and this is just anecdotal -- my
experience is that the user that logs back in first is the one who
caused it."
* A lot of the discussion was prompted by questions like "Is anyone
else doing X?" or "How many people here are doing Y?" Very helpful.
* If you have to return warranty-covered disks to the vendor but you
really don't want the data to go, see if they'll accept the metal
cover of the disk. You get to keep the spinning rust.
* A lot of talk about OOM-killing in the bad old days ("I can't tell
you how many times it took out init."). One guy insisted it's a lot
better now (3.x series).
* "The question of changing schedulers comes up in my group every six
months."
* "What are you doing for log analysis?" "We log to /dev/null."
(laughter) "No, really, we send syslog to /dev/null."
* Splunk is eye-wateringly expensive: 1.5 TB data/day =~ $1-2 million
annual license.
* On how much disk space Oak Ridge has: "It's...I dunno, 12 or 13 PB?
It's 33 tons of disks, that's what I remember."
* Cheap and cheerful NFS: OpenSolaris or FreeBSD running ZFS. For
extra points, use an Aztec Zeus for a ZIL: a battery-backed 8GB
DIMM that dumps to a compact flash card if the power goes out.
* Some people monitor not just for overutilization, but for
underutilization: it's a chance for user education ("You're paying
for my time and the hardware; let me help you get the best value for
that"). For Oak Ridge, though, there's less pressure for that:
scientists get billed no matter what.
* "We used to blame the network when there were problems. Now their
app relies on SQL Server and we blame that."
* Sweeping for expired data is important. If it's scratch, then
*treat* it as such: negotiate expiry dates and sweep regularly.
* Celebrity resemblances: Michael Moore and the guy from Dead Poets
Society/The Good Wife. (Those are two different sysadmins, btw.)
* Asked about my .TK file problem; no insight. Take it to the lists.
(Don't think I've written about this, and I should.)
* On why one lab couldn't get Vendor X to supply DKMS kernel modules
for their hardware: "We're three orders of magnitude away from
their biggest customer. We have *no* influence."
* Another vote for SoftwareCarpentry.org as a way to get people up to
speed on Linux.
* A lot of people encountered problems upgrading to Torque 4.x and
rolled back to 2.5. "The source code is disgusting. Have you ever
looked at it? There's 15 years of cruft in there. The devs
acknowledged the problem and announced they were going to be taking
steps to fix things. One step: they're migrating to C++.
[Kif sigh]"
* "Has anyone here used Moab Web Services? It's as scary as it sounds.
Tomcat...yeah, I'll stop there." "You've turned the web into RPC. Again."
* "We don't have regulatory issues, but we do have a
physicist/geologist issue."
* 1/3 of the Top 500 use SLURM as a scheduler. Slurm's srun =~
  Torque's pbsdsh; I have the impression it does not use MPI (well,
okay, neither does Torque, but a lot of people use Torque + mpirun),
but I really need to do more reading.
* lmod (FIXME: add link) is an Environment Modules-compatible
  replacement (it works with old module files) that fixes some problems
  with the old EM; actively developed, written in Lua.
* People have had lots of bad experiences with external Fermi GPU
boxes from Dell, particularly when attached to non-Dell equipment.
* Puppet has git hooks that let you pull out a particular branch on a node.
And finally:
Q: How do you know you're with a Scary Viking Sysadmin?
A: They ask for Thor's Skullsplitter Mead at the Google BoF.
I came across a problem compiling GotoBLAS2 at work today. It went well on a practice cluster, but on the new one I got this error:
gcc -c -O2 -Wall -m64 -DF_INTERFACE_G77 -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=24 -DASMNAME=strmm_ounncopy -DASMFNAME=strmm_ounncopy_ -DNAME=strmm_ounncopy_ -DCNAME=strmm_ounncopy -DCHAR_NAME=\"strmm_ounncopy_\o
../kernel/x86_64/gemm_ncopy_4.S: Assembler messages:
../kernel/x86_64/gemm_ncopy_4.S:192: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:193: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:194: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:195: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:197: Error: undefined symbol `WPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:345: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:346: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:348: Error: undefined symbol `WPREFETCHSIZE' in operation
The solution was simple:
gmake clean
gmake TARGET=NEHALEM
The problem appears to be that newer CPUs (Intel X5650 in my case) are not detected properly by the CPU ID routine in GotoBLAS2. You can verify this by checking the contents of config.h in the top-level directory. Without TARGET=NEHALEM, I saw this line:
#define INTEL_UNKNOWN
But with TARGET=NEHALEM, this becomes:
#define NEHALEM
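If you'd rather not read config.h by hand, a quick check is to grep for the macros in question:
$ grep -E 'INTEL_UNKNOWN|NEHALEM|GENERIC' config.h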
The problem with gemm_ncopy_4.S arises because it defines RPREFETCHSIZE and WPREFETCHSIZE using #ifdef statements depending on CPU type. There is an entry for #ifdef GENERIC, but that was not set for me in config.h.
In addition, if you type "gmake TARGET=NEHALEM" without "gmake clean" first, you get a little further before you run into a similar error:
usr/bin/ld: ../libgoto2_nehalemp-r1.13.a(ssymv_U.o): relocation R_X86_64_32S against `PREFETCHSIZE' can not be used when making a shared object; recompile with -fPIC
../libgoto2_nehalemp-r1.13.a(ssymv_U.o): could not read symbols: Bad value
If I were a better person, I'd have a look at how the sizes are defined and figure out what the right value is for newer CPUs, then modify cpuid.c (which I presume is what's being used to generate config.h, or at least this part of it). Maybe another day...
Oh god this week. I've been setting up the cluster (three chassis' worth of blades from Dell). I've installed Rocks on the front end (rackmount R710). After that:
All blades powered on.
Some installed, most did not. Not sure why. Grub Error 15 is the result, which is Grub for "File not found".
I find suggestions in the Rocks mailing list to turn off floppy controllers. Don't have floppy controllers exactly in these, but I do see boot order includes USB floppy and USB CDROM. Pick a blade, disable, PXE boot and reinstall. Whee, it works!
Try on another blade and find that reinstallation takes 90 minutes. Network looks fine; SSH to the reinstalling blade and wget all the RPMs in about twelve seconds. What the hell?
Discover Rocks' Avalanche Installer and how it uses BitTorrent to serve RPMs to nodes. Notice that the installing node is constantly ARPing to find nodes that aren't turned on (they're waiting for me to figure out what the hell's going on). Restart service rocks-tracker on the front end and HOLY CRAP now it's back down to a three minute installation. Make a mental note to file a bug about this.
Find out that Dell OpenManage Deploy Toolkit is the best way to populate a new machine w/BIOS settings, since the Chassis Management Console can't push that particular setting to blades. Download that, start reading.
Try fifteen different ways of connecting virtual media using CMC. Once I find out the correct syntax for NFS mounts (amusingly different between manuals), some blades find it and some don't; no obvious hints why. What the hell?
Give up, pick a random blade and tell it by hand where to find the goddamn ISO. (This ignores the problems of getting Java apps to work in Awesome [hint: use wmname], which is my own fault.) Collect settings before and after disabling USB CDROM and Floppy and find no difference; this setting is apparently not something they expose to this tool.
Give up and try PXE booting this blade even with the demon USB devices still enabled. It works; installation goes fine and after it reboots it comes up fine. What the hell?
Power cycle the blade to see if it still works and it reinstalls. Reinstalls Rocks. What the hell?
Discover /etc/init.d/rocks-grub, which at boot modifies grub.conf to PXE boot next time, and at graceful shutdown reverses the change, allowing the machine to boot normally. The thinking is that if you have to power cycle a machine you probably want to reinstall it anyhow.
Finally put this all together. Restart tracker, set all blades in one of the chassis' to reinstall. Pick a random couple of blades and fire up consoles. Power all the blades up. Installation fails with anaconda error, saying it can't find any more mirrors. What the hell?
eth0 is down on the front end; dmesg shows hundreds of "kernel: bnx2: eth0 NIC Copper Link is Down" messages starting approximately the time I power-cycled the blades.
I give up. I am going here tonight because my wife is a good person and is taking me there. And I am going to have much, and much-deserved, beer.
Torque is a resource manager; it's an open source project with a long history. It keeps track of resources -- typically compute nodes, but it's "flexible enough to handle scheduling a conference room". It knows how many compute nodes you have, how much memory, how many cores, and so on.
Maui is the job scheduler. It looks at the jobs being submitted, notes what resources you've asked for, and makes requests of Torque. It keeps track of what work is being done, needs to be done, or has been completed.
MPI stands for "Message Passing Interface". Like BLAS, it's a standard with different implementations. It's used by a lot of HPC/scientific programs to exchange messages between processes -- often but not necessarily on separate computers -- related to their work.
MPI is worth mentioning in the same breath as Torque and Maui because of mpiexec, which is part of OpenMPI; OpenMPI is a popular open-source implementation of MPI. mpiexec (aka mpirun, aka orterun) lets you launch processes in an OpenMPI environment, even if the process doesn't require MPI. IOW, there's no problem running something like "mpiexec echo 'Hello, world!'".
To focus on OpenMPI and mpiexec: you can run n copies of your program by using the "-np" argument. Thus, "-np 8" will run 8 copies of your program...but they'll all run on the machine you ran mpiexec on:
$ mpiexec -np 8 hostname
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
This isn't always useful -- why pay big money for all this hardware if you're not going to use it? -- so you can tell it to run on different hosts:
$ mpiexec -np 8 -host compute-0-0,compute-0-1 hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
And if you're going to do that, you might as well give it a file to read, right?
$ mpiexec -np 8 -hostfile /opt/openmpi/etc/openmpi-default-hostfile hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
That file is where Rocks sticks the hostfile, but it could be anywhere -- including in your home directory, if you decide that you want it to run on a particular set of machines.
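For what it's worth, a hand-rolled hostfile is just a list of nodes and slot counts; something like this (the filename is up to you, and "slots" is how many processes OpenMPI will put on each node before spilling to the next):
# ~/my_hostfile
compute-0-0 slots=2
compute-0-1 slots=2
$ mpiexec -np 4 -hostfile ~/my_hostfile hostname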
However, if you're doing that, then you're really setting yourself up as the resource manager. Isn't that Torque's job? Didn't we set all this up so that you wouldn't have to keep track of what machine is busy?
So OpenMPI can work with Torque; from the OpenMPI FAQ:
- How do I run jobs under Torque / PBS Pro?
The short answer is just to use mpirun as normal.
Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Torque / PBS Pro directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will use PBS/Torque-native mechanisms to launch and kill processes ([rsh] and/or ssh are not required).
Whee! So easy! Except that Rocks does not compile OpenMPI with Torque support!
Because the Rocks project is kind of a broad umbrella, with lots of sub-projects underneath, the Torque roll is separate from the OpenMPI roll. Besides, installing one doesn't mean you'll install the other, so it may not make sense to build OpenMPI that way.
The fine folks at Rocks asked the fine folks at OpenMPI and found a way around this: have every job that's submitted to Torque/Maui and uses MPI source /opt/torque/etc/openmpi-setup.sh. It works, though it's not efficient; the recommended way is to recompile OpenMPI with Torque installed so that it knows about Torque.
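I haven't dug into what that script actually does, but the shape of a job script using the workaround would be something like this (the resource requests and program name are made up):
#!/bin/bash
#PBS -l nodes=2:ppn=4
#PBS -l walltime=01:00:00
cd "$PBS_O_WORKDIR"
# Rocks' suggested workaround: set up OpenMPI from the Torque-provided node list
source /opt/torque/etc/openmpi-setup.sh
mpiexec ./my_mpi_program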
To me, this makes the whole Rocks installation less useful, particularly since it didn't seem terribly well documented. To be fair, it is there in the Torque roll documentation:
Although OpenMPI has support for the torque tm-interface (tm=taskmanager) it is not compiled into the library shipped with Rocks (the reason for this is that the OpenMPI build process needs to have access to libtm from torque to enable the interface). The best workaround is to recompile OpenMPI on a system with torque installed. Then the mpirun command can talk directly to the batch system to get the nodelist and start the parallel application using the torque daemon already running on the nodes. Job startup times for large parallel applications is significantly shorter using the tm-interface than using ssh to start the application on all nodes.
So maybe I should just shut my mouth.
In any event, I suspect I'll end up recompiling OpenMPI in order to get it to see Torque.
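If/when I do, the build is (I think) just a matter of pointing OpenMPI's configure at the Torque install -- a sketch, with made-up install paths:
$ ./configure --prefix=/share/apps/openmpi-tm --with-tm=/opt/torque
$ make -j4
$ sudo make install
# ompi_info should then list the tm components that talk to Torque
$ /share/apps/openmpi-tm/bin/ompi_info | grep tm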
There's a lot to clusters. I'm learning that now.
At $WORK, we're getting a cluster RSN -- rack fulla blades, head node, etc etc. I haven't worked w/a cluster before so I'm practicing with a test one: three little nodes, dual core CPUs, 2 GB memory each, set up as one head node and two compute nodes. I'm using Rocks to manage it.
Here a few stories about things I've learned along the way.
You find a lot of references to BLAS when you start reading software requirements for HPC, and not a lot explaining it.
BLAS stands for "Basic Linear Algebra Subprograms"; the original web page is here. Wikipedia calls it "a de facto application programming interface standard for publishing libraries to perform basic linear algebra operations such as vector and matrix multiplication." This is important to realize, because, as in the article, common usage of the term seems to refer to an API more than anything else; there's the reference implementation, but it's not really used much.
As I understand it -- and I invite corrections -- BLAS chugs through linear algebra and comes up with an answer at the end. Brute force is one way to do this sort of thing, but there are ways to speed up the process; these can make a huge difference in the amount of time it takes to do some calculation. Some of these are heuristics and algorithms that allow you to search more intelligently through the search space. Some are ways of compiling or writing the library routines differently, taking advantage of the capabilities of different processors to let you search more quickly.
There are two major open-source BLAS implementations:
The Goto BLAS library is a hand-optimized BLAS implementation that, by all accounts, is very fast. It's partly written in assembler, and the guy who wrote it basically crafted it the way (I think) Enzo Ferrari crafted cars.
ATLAS is another BLAS implementation. The ATLAS home page says "it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK." As noted in the articles attached to this page, ATLAS tries many, many different searches for a solution to a particular problem. It uses CPU capabilities to do these searches efficiently.
As such, compilation of ATLAS is a big deal, and the resulting binaries are tuned to the CPU they were built on. Not only do you need to turn off CPU throttling, but you need to build on the CPU you'll be running on. Pre-built packages are pretty much out.
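For the record, "turning off CPU throttling" on Linux usually means pinning the cpufreq governor to performance -- a sketch, assuming the sysfs cpufreq interface is present (on CentOS 5-era systems it may instead be a matter of stopping the cpuspeed service):
# pin every core's governor to "performance" before building ATLAS
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$g" > /dev/null
done
# or, on older RHEL/CentOS:
#   sudo service cpuspeed stop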
ATLAS used to be included in the HPC roll of the Rocks 4 series. Despite [irritatingly out-of-date information][13], this has not been the case in a while.
[LAPACK] "is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems." It needs a BLAS library. From the FAQ:
Why aren’t BLAS routines included when I download an LAPACK routine?
It is assumed that you have a machine-specific optimized BLAS library already available on the architecture to which you are installing LAPACK. If this is not the case, you can download a Fortran77 reference implementation of the BLAS from netlib.
Although a model implementation of the BLAS is available from netlib in the blas directory, it is not expected to perform as well as a specially tuned implementation on most high-performance computers -- on some machines it may give much worse performance -- but it allows users to run LAPACK software on machines that do not offer any other implementation of the BLAS.
Alternatively, you can automatically generate an optimized BLAS library for your machine, using ATLAS (http://www.netlib.org/atlas/).
(There is an RPM called "blas-3.0" available for Rocks; given the URL listed (http://www.netlib.org/lapack/), it appears that this is the model implementation listed above. This version is at /usr/lib64/libblas.so*, and is in ldconfig.)
Point is, you'll want a BLAS implementation, but you've got two (at least) to choose from. And you'll need to compile it yourself. I get the impression that the choice of BLAS library is something that can vary depending on religion, software, environment and so on...which means you'll probably want to look at something like modules to manage all this.
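For example (module names here are hypothetical -- whatever you end up calling your hand-built libraries):
$ module avail              # see which BLAS builds are installed
$ module load gotoblas2     # use GotoBLAS2 in this shell or job script
# $ module load atlas       # ...or swap in ATLAS instead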
Tomorrow: Torque, Maui and OpenMPI.