The other day at $WORK, a user asked me why the jobs she was
submitting to the cluster were being deferred. They only needed one
core each, and showq
showed lots free, so WTF?
By the time I checked on the state of these deferred jobs, the jobs
were already running -- and yeah, there were lots of cores free.
The checkjob
command showed something interesting, though:
$ checkjob 34141 | grep Messages
Messages: cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'
I thought this message came from the node the job was running on now:
$ qstat -f 34141 | grep exec_host
exec_host = compute-3-5/19
but that was a red herring. (I could've also got the host from checkjob 34141 | grep -2 'Allocated Nodes'.) Instead, grepping through maui.log showed that compute-1-11 was the real problem:
/opt/maui/log $ sudo grep 34141 maui.log.3 maui.log.2 maui.log.1 maui.log |grep -E 'WARN|ERROR'
maui.log.3:03/05 16:21:48 ERROR: job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:48 ERROR: job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:50 ERROR: job '34141' cannot be started: (rc: 15041 errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN' hostlist: 'compute-1-11')
maui.log.3:03/05 16:21:50 WARNING: cannot start job '34141' through resource manager
maui.log.3:03/05 16:21:50 ERROR: cannot start job '34141' in partition DEFAULT
maui.log.3:03/05 17:21:56 ERROR: job '34141' cannot be started: (rc: 15041 errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN' hostlist: 'compute-1-11')
There were lots of messages like this; I think the scheduler only gave up on that node hours later.
checknode showed nothing wrong; in fact, the node was currently running a job and had 4 free cores:
$ checknode compute-1-11
checking node compute-1-11
State: Busy (in current state for 6:23:11:32)
Configured Resources: PROCS: 12 MEM: 47G SWAP: 46G DISK: 1M
Utilized Resources: PROCS: 8
Dedicated Resources: PROCS: 8
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 13.610
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [default 0:12]
Total Time: INFINITY Up: INFINITY (98.74%) Active: INFINITY (18.08%)
Reservations:
Job '33849'(8) -6:23:12:03 -> 93:00:47:56 (99:23:59:59)
JobList: 33849
maui.log showed an alert:
maui.log.10:03/03 22:32:26 ALERT: RM state corruption. job '34001' has idle node 'compute-1-11' allocated (node forced to active state)
but that was another red herring; this is common and benign.
dmesg on compute-1-11 showed the problem:
compute-1-11 $ dmesg | tail
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
<<vendor>> ASC=0x80 ASCQ=0x87ASC=0x80 <<vendor>> ASCQ=0x87
Info fld=0x10489
end_request: I/O error, dev sda, sector 66697
Aborting journal on device sda1.
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
(Linux)|Wed Mar 06 09:37:20|[compute-1-11:~]$ mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda5 on /state/partition1 type ext3 (rw)
/dev/sda2 on /var type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
sophie:/export/scratch on /share/networkscratch type nfs (rw,addr=10.1.1.1)
mount: warning /etc/mtab is not writable (e.g. read-only filesystem).
It's possible that information reported by mount(8) is not
up to date. For actual information about system mount points
check the /proc/mounts file.
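As the warning says, /proc/mounts is the authoritative view; an "ro" among the options for / there confirms the read-only remount:
compute-1-11 $ grep ' / ' /proc/mounts    # look for 'ro' in the mount options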
but this was also logged on the head node in /var/log/messages:
$ sudo grep compute-1-11.local /var/log/* |grep -vE 'automount|snmpd|qmgr|smtp|pam_unix|Accepted publickey' > ~/rt_1526/compute-1-11.syslog
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in remtree, unlink failed on /opt/torque/mom_priv/jobs/34038.sophie.TK
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed
and in /var/log/kern:
$ sudo tail /var/log/kern
Mar 5 10:05:00 compute-1-11.local kernel: Aborting journal on device sda1.
Mar 5 10:05:01 compute-1-11.local kernel: ext3_abort called.
Mar 5 10:05:01 compute-1-11.local kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Mar 5 10:05:01 compute-1-11.local kernel: Remounting filesystem read-only
Mar 7 05:18:06 compute-1-11.local kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
There are a few things I've learned from this:
I've started to put some of these commands in a sub -- that's a really awesome framework from 37signals for collecting commonly-used commands together. In this case, I've named the sub "sophie", after the cluster I work on (named in turn after the daughter of the PI). You can find it on github or my own server (github is great, but what happens when it goes away? ...but that's a rant for another day). Right now there are only a few things in there, they're somewhat specific to my environment, and doubtless they could be improved -- but it's helping a lot so far.
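The checkjob/qstat/maui.log dance above is exactly the kind of thing that ends up in there. A rough sketch of what one of these commands might look like, using sub's libexec/sophie-<name> layout and its Summary/Usage comment conventions (the script itself is hypothetical; the guts are just the commands from above):
#!/usr/bin/env bash
# libexec/sophie-whyjob -- hypothetical sub command: why is the scheduler
# unhappy with this job?
# Summary: show scheduler and resource-manager complaints about a job
# Usage: sophie whyjob <jobid>
jobid="$1"
checkjob "$jobid" | grep Messages
qstat -f "$jobid" | grep exec_host
sudo grep "$jobid" /opt/maui/log/maui.log* | grep -E 'WARN|ERROR'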
Q from a user today that took two hours (not counting this entry) to track down: why are my jobs idle when showq shows there are a lot of free processors? (Background: we have a Rocks cluster, 39 nodes, 492 cores. Torque + Maui, pretty vanilla config.)
First off, showq did show a lot of free cores:
$ showq
[lots of jobs]
192 Active Jobs 426 of 492 Processors Active (86.59%)
38 of 38 Nodes Active (100.00%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
32542 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:55
32543 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:55
32544 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:56
Okay, so why? Let's take one of those jobs:
$ checkjob 32542
checking job 32542
State: Idle
Creds: user:jdoe group:example class:default qos:DEFAULT
WallTime: 00:00:00 of 1:00:00:00
SubmitTime: Fri Feb 15 13:55:55
(Time Queued Total: 2:22:55:26 Eligible: 2:22:55:26)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 196 StartCount: 0
PartitionMask: [ALL]
Flags: HOSTLIST RESTARTABLE
HostList:
[compute-1-3:1]
Reservation '32542' (21:17:14 -> 1:21:17:14 Duration: 1:00:00:00)
PE: 1.00 StartPriority: 4255
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 216 feasible procs: 0
Rejection Reasons: [State : 1][HostList : 38]
Note the bit that says:
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
If we run "checkjob -v", we see some additional info (all the rest is the same):
Detailed Node Availability Information:
compute-2-1 rejected : HostList
compute-1-1 rejected : HostList
compute-3-2 rejected : HostList
compute-1-3 rejected : State
compute-1-4 rejected : HostList
compute-1-5 rejected : HostList
[and on it goes...]
This means that compute-1-3, one of our nodes, has been assigned to the job; it was rejected only because of its State -- it's busy, so it'll get to the job Real Soon Now. Problem solved!
Well, no. Because if you run something like this:
showq -u jdoe |awk '/Idle/ {print "checkjob -v " $1}' | sh
then a) you're probably in a state of sin, and b) you'll see that there are a lot of jobs assigned to compute-1-3. WTF?
Well, a thread I turned up looks pretty close to what I'm seeing. And as it turns out, the user in question submitted a lot of jobs (hundreds) all at the same time. Ganglia lost track of all the nodes for a while, so I assume that Torque did as well. (Haven't checked into that yet...trying to get this down first; documenting stuff for Rocks is always a problem for me.) The thread's reply suggests qalter, but that doesn't seem to work.
While I'm at it, here's a list of stuff that doesn't work:
(Oh, and btw turns out runjob is supposed to be replaced by mjobctl, but mjobctl doesn't appear to work. True story.)
So at this point I'm stuck suggesting two things to the user:
God I hate HPC sometimes.
Helpful links that explained some of this:
Quick note: I just tracked down a problem with our Rocks cluster
(which uses Torque and Maui) where suddenly submitted jobs were
nearly always failing instantly and without any output -- even a
simple echo hello world
failed with zero output. Turned out one of
the nodes had filled up / (which, in a default Rocks install, includes
/opt/torque and /tmp) completely with the output from a month-old job
that ran amok. This node happened to be allocated to most (but not
all...) jobs, and so caused a lot of disruption.
I don't know how best to monitor this...
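One low-tech possibility would be a cron job on the head node built around rocks run host; a sketch only, with the threshold, the recipient and the collate option as placeholders to adjust for your setup:
#!/bin/sh
# Hypothetical head-node cron job: flag compute nodes whose / is 90% full or more.
# collate=yes (if your rocks version supports it) prefixes each line with the node name.
report=$(rocks run host compute command="df -P /" collate=yes | awk '$6+0 >= 90')
[ -n "$report" ] && echo "$report" | mail -s "compute node / filling up" root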
Last week I was running some benchmarks on the new cluster at $WORK; I was trying to see what effect compiling a new, Torque-aware version of OpenMPI would have. As you may remember, the stock version of OpenMPI that comes with Rocks is not Torque-aware, so a workaround was added that told OpenMPI which nodes Torque had allocated to it.
The change was in the Torque submission script. Stock version:
source /opt/torque/etc/openmpi-setup.sh
/opt/openmpi/bin/mpiexec -n $NUM_PROCS /usr/bin/emacs
New, Torque-aware version:
/path/to/new/openmpi/bin/mpiexec -n $NUM_PROCS /usr/bin/emacs
(Of course I benchmark the cluster using Emacs. Don't you have the code for the MPI-aware version?)
In the end, there wasn't a whole lot of difference in runtimes; that didn't surprise me too much, since (as I understand it) the difference between the two methods is mainly in the starting of jobs -- the overhead at the beginning, rather than in the running of the job itself.
For fun, I tried running the job with MPICH2, another MPI implementation:
/opt/mpich2/gnu/bin/mpiexec -n $NUM_PROCS /usr/bin/emacs
and found pretty terrible performance. It turned out that it wasn't running on all the nodes...in fact, it was only running on one node, with as many processes as the total number of CPUs I'd specified. Since this was meant to be a 4-node, 8-CPU/node version, that meant 32 copies of Emacs on one node. Damn right it was slow.
So what the hell? My first thought was that maybe this was a library-versus-launching mismatch. You compile MPI applications using the OpenMPI or MPICH2 versions of the GNU compilers -- which are basically just wrappers around the regular tools that set library paths and such correctly. So if your application links against OpenMPI but you launch it with MPICH2, maybe that's the problem.
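Something along these lines should show whether that's the case -- assuming mpicc lives next to mpiexec in each tree, and with my_mpi_program standing in for the real binary:
$ /opt/openmpi/bin/mpicc --showme:link    # what the OpenMPI wrapper links against
$ /opt/mpich2/gnu/bin/mpicc -show         # ditto for the MPICH2 wrapper
$ ldd ./my_mpi_program | grep -i mpi      # which libmpi the binary actually resolves to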
I still need to test that. However, I think what's more likely is that MPICH2 is not Torque-aware. The ever-excellent Debian Clusters has an excellent page on this, with a link to the other other other mpiexec page. Now I need to figure out if the Rocks people have changed anything since 2008, and if the Torque Roll documentation is incomplete or just misunderstood (by me).
At work, I'm about to open up the Rocks cluster to production, or at least beta. I'm finally setting up the attached disk array, along with home directories and quotas, and I've just bumped into an unsettled question:
How the hell do I manage this machine?
On our other servers, I use Cfengine. It's a mix of version 2 and 3, but I'm migrating to 3. I've used Cf3 on the front end of the cluster semi-regularly, and by hand, to set things like LDAP membership, automount, and so on -- basically, to install or modify files and make sure I've got the packages I want. Unlike the other machines, I'm not using cfexecd to run Cf3 continuously.
The assumption behind Cf3 and other configuration management tools -- at least in my mind -- is that if you're doing it once, you'll want to do it again. (Of course, there's also stuff like convergence, distributed management and resisting change, but leave that for now.) This has been a big help, because the changes I needed to apply to the Rocks FE were mostly duplicates of my usual setup.
If/when I change jobs/get hit by a bus, I've made it abundantly clear in my documentation that Cfengine is The Way I Do Things. For a variety of reasons, I think I'm fairly safe in the assumption that Cf3 will not be too hard for a successor to pick up. If someone wants to change it afterward, fine, but at least they know where to start.
OTOH, Rocks has the idea of a "Restore Roll" -- essentially a package you install on a new frontend (after the old one has burned down, say) to reinstall all the files you've customized. You can edit a particular file that creates this roll, and ask it to include more files. Edited /etc/bashrc? Add it to the list.
I think the assumption behind the Restore Roll is that, really, you set up a new FE once every N years -- that a working FE is the result of rare and precious work. The resulting configuration, like the hardware it rests on, is a unique gem. Replacing it is going to be a pain, no matter what you do. There aren't that many Rocks developers, and making it Really, Really Frickin' Nice is probably a waste of their time.
(I also think it fits in with the rest of Rocks, which seems like some really nice bits surrounded by furiously undocumented hacks and workarounds. But I'm probably just annoyed at YET ANOTHER UNDOCUMENTED SET OF HACKS AND WORKAROUNDS.)
And so you have both a number of places where you can list files to be restored, and an amusing uncertainty about whether the whole mechanism works:
I found that after a re-install of Rocks 5.0.3, not all the files I asked for were restored! I suspect it has to do with the order things get installed.
So now I'm torn.
Do I stick with Cf3? I haven't mentioned my unhappiness with its obtuseness and some poor choices in the language (nine positional arguments for a function? WTF?). I'm familiar with it because I've really dived into it and taken a course at LISA from Mark Burgess his own bad self, but it's taken a while to get here. But it is the way I do just about everything else.
Or do I use the Rocks Restore Roll mechanism? Considered on its own, it's the least surprising option for a successor or fill-in. I just wish I could be sure it would work, and I'm annoyed that I'd have to duplicate much of the effort I've put into Cf3.
Gah. What a mess.
I've been puttering away at work getting the cluster going. It's hard, because there are a lot of things I'm having to learn on the go. One of the biggest chunks is Torque and Maui, and how they interact with each other and Rocks as a whole.
For example: today I tried submitting a crapton of jobs all at once.
After a while I checked the queue with showq
(a Maui command; not to
be confused with qstat
, which is Torque) and found that a lot of jobs
were listed as "Deferred" rather than "Idle". I watched, and the idle
ones ran; the deferred ones just stayed in place, even after the list
of running jobs was all done.
At first I thought this might be something to do with fairness. There are a lot of knobs to twiddle in Maui, and since I hadn't looked at the configuration after installation I wasn't really sure what was there. But near as I could tell, there wasn't anything happening there; the config file for Maui was empty, and I couldn't seem to find any mention of what the default settings were. I followed the FAQ and ran the various status commands, but couldn't really see anything obvious there.
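For future reference, Maui's own diagnostic commands are the place to look for this sort of thing -- something like:
$ showconfig | less    # every parameter Maui is actually using, defaults included
$ diagnose -f          # fairshare usage and targets
$ diagnose -p          # how job priorities are being computed
$ checkjob <jobid>     # per-job view, including any holds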
Then I tried looking in the Torque logs (/opt/torque/server_logs), and found this:
06/08/2011 14:42:09;0008;PBS_Server;Job;8356.example.com;send of job to compute-3-2 failed error = 15008
06/08/2011 14:42:09;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Access from host not allowed, or unknown host (15008) in send_job, child failed in previous commit request for job 8356.example.com
And on compute-3-2 (/opt/torque/mom_logs)
06/08/2011 14:42:01;0080; pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=QueueJob, from PBS_Server@example.local
That's weird. I ran rocks sync config
out of superstition, but
nothing changed. I found a suggestion that it might be a bug in
Torque, and to run momctl -d
to see if the head node was in the
trusted client list. It was not. I tried running that command on
all the nodes (sudo rocks run host compute command="momctl -d3 | grep Trusted | grep 10.1.1.1"); turned out that only 10 of them were. What
the hell?
I'm still not sure exactly where this gets set, but I did notice that
/opt/torque/mom_priv/config
listed the head node as the server, and
was identical on all machines. On a hunch, I tried restarting the pbs
service on all the nodes; suddenly they all came up. I submitted a
bunch more jobs, and they all ran through -- none were deferred. And
running momctl -d
showed that, yes, the head node was now in the
trusted client list.
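In concrete terms, the fix and the re-check looked something like this (the init script is called pbs on my nodes; it may be pbs_mom on other Torque installs):
$ sudo rocks run host compute command="service pbs restart"
$ sudo rocks run host compute command="momctl -d3 | grep Trusted | grep 10.1.1.1"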
Thoughts:
None of this was shown by Ganglia (which just monitors load) or showq (which is a Maui command; the problem was with Torque).
Doubtless there were commands I should've been running in Torque to show these things.
While the head node is running a syslog server and collects stuff from the client nodes, Torque logs are not among them; I presume Torque is not using syslog. (Must check that out.)
I still don't know how the trusted client list is set. If it's in a text file, that's something that I think Rocks should manage.
I'm not sure if tracking down the problem this way is exactly the right way to go. I think it's important to understand this, but I suspect the Rocks approach would be "just reboot or reinstall". There's value to that, but I intensely dislike not knowing why things are happening and sometimes that gets in my way.
I came across a problem compiling GotoBLAS2 at work today. It went well on a practice cluster, but on the new one I got this error:
gcc -c -O2 -Wall -m64 -DF_INTERFACE_G77 -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=24 -DASMNAME=strmm_ounncopy -DASMFNAME=strmm_ounncopy_ -DNAME=strmm_ounncopy_ -DCNAME=strmm_ounncopy -DCHAR_NAME=\"strmm_ounncopy_\o
../kernel/x86_64/gemm_ncopy_4.S: Assembler messages:
../kernel/x86_64/gemm_ncopy_4.S:192: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:193: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:194: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:195: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:197: Error: undefined symbol `WPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:345: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:346: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:348: Error: undefined symbol `WPREFETCHSIZE' in operation
The solution was simple:
gmake clean
gmake TARGET=NEHALEM
The problem appears to be that newer CPUs (Intel X5650 in my case) are
not detected properly by the CPU ID routine in GotoBlas2. You can
verify this by checking the contents of config.h
in the top-level
directory. Without TARGET=NEHALEM
, I saw this line:
#define INTEL_UNKNOWN
But with TARGET=NEHALEM
, this becomes:
#define NEHALEM
The problem with gemm_ncopy_4.S
arises because it defines
RPREFETCHSIZE
and WPREFETCHSIZE
using #ifdef
statements depending
on CPU type. There is an entry for #ifdef GENERIC
, but that was not
set for me in config.h
.
In addition, if you type "gmake TARGET=NEHALEM" without "gmake clean" first, you get a little further before you run into a similar error:
/usr/bin/ld: ../libgoto2_nehalemp-r1.13.a(ssymv_U.o): relocation R_X86_64_32S against `PREFETCHSIZE' can not be used when making a shared object; recompile with -fPIC
../libgoto2_nehalemp-r1.13.a(ssymv_U.o): could not read symbols: Bad value
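Putting it together, the sequence that worked, plus a quick sanity check on config.h:
$ cd GotoBLAS2           # or wherever the source tree is unpacked
$ gmake clean
$ gmake TARGET=NEHALEM
$ grep -E 'NEHALEM|INTEL_UNKNOWN|GENERIC' config.h    # should now show "#define NEHALEM"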
If I were a better person, I'd have a look at how the sizes are defined and figure out what the right value is for newer CPUs, then modify cpuid.c (which I presume is what's being used to generate config.h, or at least this part of it). Maybe another day...
Oh god this week. I've been setting up the cluster (three chassis' worth of blades from Dell). I've installed Rocks on the front end (rackmount R710). After that:
All blades powered on.
Some installed, most did not. Not sure why. Grub Error 15 is the result, which is Grub for "File not found".
I find suggestions in the Rocks mailing list to turn off floppy controllers. Don't have floppy controllers exactly in these, but I do see boot order includes USB floppy and USB CDROM. Pick a blade, disable, PXE boot and reinstall. Whee, it works!
Try on another blade and find that reinstallation takes 90 minutes. Network looks fine; SSH to the reinstalling blade and wget all the RPMs in about twelve seconds. What the hell?
Discover Rocks' Avalanche Installer and how it uses BitTorrent to serve RPMs to nodes. Notice that the installing node is constantly ARPing to find nodes that aren't turned on (they're waiting for me to figure out what the hell's going on). Restart service rocks-tracker on the front end and HOLY CRAP now it's back down to a three minute installation. Make a mental note to file a bug about this.
Find out that Dell OpenManage Deploy Toolkit is the best way to populate a new machine w/BIOS settings, since the Chassis Management Console can't push that particular setting to blades. Download that, start reading.
Try fifteen different ways of connecting virtual media using CMC. Once I find out the correct syntax for NFS mounts (amusingly different between manuals), some blades find it and some don't; no obvious hints why. What the hell?
Give up, pick a random blade and tell it by hand where to find the goddamn ISO. (This ignores the problems of getting Java apps to work in Awesome [hint: use wmname], which is my own fault.) Collect settings before and after disabling USB CDROM and Floppy and find no difference; this setting is apparently not something they expose to this tool.
Give up and try PXE booting this blade even with the demon USB devices still enabled. It works; installation goes fine and after it reboots it comes up fine. What the hell?
Power cycle the blade to see if it still works and it reinstalls. Reinstalls Rocks. What the hell?
Discover /etc/init.d/rocks-grub, which at boot modified grub.conf to pxe boot next time and at graceful shutdown reverses the change, allowing the machine to boot normally. The thinking is that if you have to power cycle a machine you probably want to reinstall it anyhow.
Finally put this all together. Restart tracker, set all blades in one of the chassis' to reinstall. Pick a random couple of blades and fire up consoles. Power all the blades up. Installation fails with anaconda error, saying it can't find any more mirrors. What the hell?
eth0 is down on the front end; dmesg shows hundreds of "kernel: bnx2: eth0 NIC Copper Link is Down" messages starting approximately the time I power-cycled the blades.
I give up. I am going here tonight because my wife is a good person and is taking me there. And I am going to have much, and much-deserved, beer.
Torque is a resource manager; it's an open source project with a long history. It keeps track of resources -- typically compute nodes, but it's "flexible enough to handle scheduling a conference room". It knows how many compute nodes you have, how much memory, how many cores, and so on.
Maui is the job scheduler. It looks at the jobs being submitted, notes what resources you've asked for, and makes requests of Torque. It keeps track of what work is being done, needs to be done, or has been completed.
MPI stands for "Message Passing Interface". Like BLAS, it's a standard with different implementations. It's used by a lot of HPC/scientific programs to exchange messages between processes -- often but not necessarily on separate computers -- related to their work.
MPI is worth mentioning in the same breath as Torque and Maui because of mpiexec, which is part of OpenMPI; OpenMPI is a popular open-source implementation of MPI. mpiexec (aka mpirun, aka orterun) lets you launch processes in an OpenMPI environment, even if the process doesn't require MPI. IOW, there's no problem running something like "mpiexec echo 'Hello, world!'".
To focus on OpenMPI and mpiexec: you can run n copies of your program by using the "-np" argument. Thus, "-np 8" will run 8 copies of your program...but they will all run on the machine you invoke mpiexec on:
$ mpiexec -np 8 hostname
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
This isn't always useful -- why pay big money for all this hardware if you're not going to use it? -- so you can tell it to run on different hosts:
$ mpiexec -np 8 -host compute-0-0,compute-0-1 hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
And if you're going to do that, you might as well give it a file to read, right?
$ mpiexec -np 8 -hostfile /opt/openmpi/etc/openmpi-default-hostfile hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
That file is where Rocks sticks the hostfile, but it could be anywhere -- including in your home directory, if you decide that you want it to run on a particular set of machines.
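The hostfile format itself is nothing special -- one node per line, with an optional slot count (OpenMPI's syntax; the node names are just the ones from above):
# my_hostfile: OpenMPI hostfile -- hostname plus optional slots=<cores per node>
compute-0-0 slots=8
compute-0-1 slots=8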
However, if you're doing that, then you're really setting yourself up as the resource manager. Isn't that Torque's job? Didn't we set all this up so that you wouldn't have to keep track of what machine is busy?
So OpenMPI can work with Torque:
- How do I run jobs under Torque / PBS Pro?
The short answer is just to use mpirun as normal.
Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Torque / PBS Pro directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will use PBS/Torque-native mechanisms to launch and kill processes ([rsh] and/or ssh are not required).
Whee! So easy! Except that Rocks does not compile OpenMPI with Torque support!
Because the Rocks project is kind of a broad umbrella, with lots of sub-projects underneath, the Torque roll is separate from the OpenMPI roll. Besides, installing one doesn't mean you'll install the other, so it may not make sense to build OpenMPI that way.
The fine folks at Rocks asked the fine folks at OpenMPI and found a way around this: have every MPI job submitted to Torque/Maui source /opt/torque/etc/openmpi-setup.sh. While not as efficient, it works; the recommended way, though, is to recompile OpenMPI on a system with Torque installed so that it knows about Torque.
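In a submission script, the difference boils down to this (paths as on a stock Rocks install; the resource request and program name are placeholders):
#!/bin/sh
#PBS -l nodes=2:ppn=4
# Stock Rocks OpenMPI (no tm support): tell mpirun which nodes Torque gave us.
. /opt/torque/etc/openmpi-setup.sh
/opt/openmpi/bin/mpirun ./my_mpi_program
# With a Torque-aware build, the setup line goes away and mpirun asks Torque directly:
#   /path/to/new/openmpi/bin/mpirun ./my_mpi_program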
To me, this makes the whole Rocks installation less useful, particularly since this didn't seem terribly well documented. To be fair, it is there in the Torque roll documentation:
Although OpenMPI has support for the torque tm-interface (tm=taskmanager) it is not compiled into the library shipped with Rocks (the reason for this is that the OpenMPI build process needs to have access to libtm from torque to enable the interface). The best workaround is to recompile OpenMPI on a system with torque installed. Then the mpirun command can talk directly to the batch system to get the nodelist and start the parallel application using the torque daemon already running on the nodes. Job startup times for large parallel applications is significantly shorter using the tm-interface than using ssh to start the application on all nodes.
So maybe I should just shut my mouth.
In any event, I suspect I'll end up recompiling OpenMPI in order to get it to see Torque.
There's a lot to clusters. I'm learning that now.
At $WORK, we're getting a cluster RSN -- rack fulla blades, head node, etc etc. I haven't worked w/a cluster before so I'm practicing with a test one: three little nodes, dual core CPUs, 2 GB memory each, set up as one head node and two compute nodes. I'm using Rocks to manage it.
Here a few stories about things I've learned along the way.
You find a lot of references to BLAS when you start reading software requirements for HPC, and not a lot explaining it.
BLAS stands for "Basic Linear Algebra Subprograms"; the original web page is here. Wikipedia calls it "a de facto application programming interface standard for publishing libraries to perform basic linear algebra operations such as vector and matrix multiplication." This is important to realize, because, as the Wikipedia article suggests, common usage of the term refers more to an API than to anything else; there's the reference implementation, but it's not really used much.
As I understand it -- and I invite corrections -- BLAS chugs through linear algebra and comes up with an answer at the end. Brute force is one way to do this sort of thing, but there are ways to speed up the process; these can make a huge difference in the amount of time it takes to do some calculation. Some of these are heuristics and algorithms that allow you to search more intelligently through the search space. Some are ways of compiling or writing the library routines differently, taking advantage of the capabilities of different processors to let you search more quickly.
There are two major open-source BLAS implementations:
The Goto BLAS library is a hand-optimized BLAS implementation that, by all accounts, is very fast. It's partly written in assembler, and the guy who wrote it basically crafted it the way (I think) Enzo Ferrari crafted cars.
ATLAS is another BLAS implementation. The ATLAS home page says "it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK." As noted in the articles attached to this page, ATLAS tries many, many different searches for a solution to a particular problem. It uses CPU capabilities to do these searches efficiently.
As such, compilation of ATLAS is a big deal, and the resulting binaries are tuned to the CPU they were built on. Not only do you need to turn off CPU throttling, but you need to build on the CPU you'll be running on. Pre-built packages are pretty much out.
ATLAS used to be included in the HPC roll of the Rocks 4 series. Despite irritatingly out-of-date information, this has not been the case in a while.
[LAPACK] "is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems." It needs a BLAS library. From the FAQ:
Why aren’t BLAS routines included when I download an LAPACK routine?
It is assumed that you have a machine-specific optimized BLAS library already available on the architecture to which you are installing LAPACK. If this is not the case, you can download a Fortran77 reference implementation of the BLAS from netlib.
Although a model implementation of the BLAS is available from netlib in the blas directory, it is not expected to perform as well as a specially tuned implementation on most high-performance computers -- on some machines it may give much worse performance -- but it allows users to run LAPACK software on machines that do not offer any other implementation of the BLAS.
Alternatively, you can automatically generate an optimized BLAS library for your machine, using ATLAS (http://www.netlib.org/atlas/).
(There is an RPM called "blas-3.0" available for rocks; given the URL listed (http://www.netlib.org/lapack/), it appears that this is the model implementation listed above. This version is at /usr/lib64/libblas.so*, and is in ldconfig.)
Point is, you'll want a BLAS implementation, but you've got two (at least) to choose from. And you'll need to compile it yourself. I get the impression that the choice of BLAS library is something that can vary depending on religion, software, environment and so on...which means you'll probably want to look at something like modules to manage all this.
Tomorrow: Torque, Maui and OpenMPI.
This took me a while to figure out. (All my war stories start with that sentence...)
A faculty member is getting a new cluster next year. In the meantime, I've been setting up Rocks on a test bed of older machines to get familiar with it. This week I've been working out how Torque, Maui and MPI work, and today I tried running something non-trivial.
CHARMM is used for molecular simulations; it's mostly (I think) written in Fortran and has been around since the 80s. It's not the worst-behaved scientific program I've had to work with.
I had an example script from the faculty member to run. I was able to run it on the head node of the cluster like so:
mpirun -np 8 /path/to/charmm < stream.inp > out ZZZ=testscript.inp
8 CHARMM processes were still running after, like, 5 days. (These things run forever, and I got distracted.) Sweet!
Now to use the cluster the way it was intended: by running the processes on the internal nodes. Just a short script and away we go:
$ cat test_mpi_charmm.sh
#PBS -N test_charmm
#PBS -S /bin/sh
#PBS -N mpi_charmm
#PBS -l nodes=2:ppn=4
. /opt/torque/etc/openmpi-setup.sh
mpirun /path/to/charmm < stream.inp ZZZ=testscript.inp
$ qsub test_mpi_charmm.sh
But no, it wasn't working. The error file showed:
At line 3211 of file ensemble.f
Fortran runtime error: No such file or directory
mpirun has exited due to process rank 0 with PID 9494 on
node compute-0-1.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
Well, that's helpful...but the tail of the output file showed:
CHARMM> ensemble open unit 19 read card name -
CHARMM> restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
Parameter: FILEROOT -> "TEST_RUN"
Parameter: PREV -> "FOO"
Parameter: NREP -> "1"
Parameter: NODE -> "0"
ENSEMBLE> REPLICA NODE 0
ENSEMBLE> OPENING FILE restart/test_run_foo_nr1_nd0
ENSEMBLE> ON UNIT 19
ENSEMBLE> WITH FORMAT FORMATTED AND ACCESS READ
What the what now?
Turns out CHARMM has the ability to checkpoint work as it goes along, saving its work in a restart file that can be read when starting up again. This is a Good Thing(tm) when calculations can take weeks and might be interrupted. From the charmm docs, the restart-relevant command is:
IUNREA -1 Fortran unit from which the dynamics restart file should
be read. A value of -1 means don't read any file.
(I'm guessing a Fortran unit is something like a file descriptor; haven't had time to look it up yet.)
The name of the restart file is set in this bit of the test script:
iunrea 19 iunwri 21 iuncrd 20
Next is this bit:
ensemble open unit 19 read card name -
"restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
An @ sign indicates a variable, it seems. And it's Fortran, and Fortran's been around forever, so it's case-insensitive. So the restart file is being set to "@FILEROOT_@PREV_nr@NREP_nd@NODE.rst". Snipping from the input file, here is where the variables are set:
set fileroot test
set prev minim
set node ?whoiam
set nrep ?nensem
test" appears to be just a string. I'm assuming "minim" is some kind of numerical constant. But "whoiam" and "nensem" are set by MPI and turned into CHARMM variables. From charmm's documentation:
The CHARMM run is started using MPI commands to specify the number of processes
(replicas) to use, each of which is an identical copy of charmm. This number
is automatically passed by MPI to each copy of the executable, and it is set to
the internal CHARMM variable 'nensem', which can be used in scripts, e.g.
set nrep ?nensem
The other internal variable set automatically via MPI is 'whoiam', e.g.
set node ?whoiam
These are useful for giving different file names to different nodes.
So remember the way charmm was being invoked in the two jobs? The way it worked:
mpirun -np 8 ...
...and the way it didn't:
mpirun ...
Aha! Follow the bouncing ball:
At first I thought that I could get away with increasing the number of copies of charmm that would run by fiddling with /opt/torque/server_priv/nodes -- telling it that the nodes had 4 processors each (so 8 total) rather than 2. (These really are old machines.) And then I'd change the PBS line to "-l nodes=2:ppn=4", and we're done! Hurrah! Except no: I got the same error as before.
What does work is changing the mpirun args in the qsub file:
mpirun -np 8 ...
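Concretely, the qsub file that runs (if not the way I'd like) is the same as before, just with the explicit -np:
#PBS -N mpi_charmm
#PBS -S /bin/sh
#PBS -l nodes=2:ppn=4
. /opt/torque/etc/openmpi-setup.sh
# forcing 8 ranks is what makes CHARMM's ?nensem come out as 8 again
mpirun -np 8 /path/to/charmm < stream.inp ZZZ=testscript.inp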
However, what that does is run 8 copies on one compute node -- which works, hurrah, but it's not what I think we want.
I think this is a problem for the faculty member to solve, though. It's taken me a whole day to figure this out, and I'm pretty sure I wouldn't understand the implications of just (say) deking out the bit that looks for a restart file. (Besides, there are many such bits.) (Oh, and incidentally, just moving the files out of the way doesn't help...still barfs and dies.) I'll email him about this and let him deal with it.
So we want a cluster at $WORK. I don't know a lot about this, so I figure that something like Rocks or OSCAR is the way to go. OSCAR didn't look like it had been worked on in a while, so Rocks it is. I downloaded the CDs and got ready to install on a handful of old machines.
(Incidentally, I was a bit off-base on OSCAR. It is being worked on, but the last "production-ready" release was version 5.0, in 2006. The newest release is 6.0.5, but as the documentation says:
Note that the OSCAR-6.0.x version is not necessarily suitable for production. OSCAR-6.0.x is actually very similar to KDE-4.0.x: this version is not necessarily "designed" for the users who need all the capabilities traditionally shipped with OSCAR, but this is a good new framework to include and develop new capabilities and move forward. If you are looking for all the capabilities normally supported by OSCAR, we advice you to wait for a later release of OSCAR-6.1.
So yeah, right now it's Rocks.)
Rocks promises to be easy: it installs a frontend, then that frontend installs all your compute nodes. You install different rolls: collections of packages. Everything is easy. Whee!
Only it's not that way, at least not consistently.
I'm reinstalling this time because I neglected to install the Torque roll last time. In theory you can install a roll to an already-existing frontend; I couldn't get it to work.
A lot of stuff -- no, that's not true. Some stuff in Rocks is just not documented very well, and it's the little but important and therefore irritating-when-it's-missing stuff. For example: want the internal compute nodes to be LDAP clients, rather than syncing /etc/passwd all around? That means modifying /var/411/Files.mk to include things like /etc/nsswitch.conf and /etc/ldap.conf, as sketched below. That's documented in the 4.x series, but it has been left out of the 5.x series. I can't tell why; it's still in use.
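If you're curious, the change is roughly this -- a sketch, since the variable name in Files.mk and the exact rebuild step vary between Rocks versions:
# In /var/411/Files.mk, append the extra files you want 411 to push, e.g.:
#   FILES += /etc/nsswitch.conf
#   FILES += /etc/ldap.conf
# then rebuild from the head node so the nodes pick them up:
$ sudo make -C /var/411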
When you boot from the Rocks install CD, you're directed to type "Build" (to build a new frontend) or "Rescue" (to go into rescue mode). What it doesn't tell you is that if you don't type in something quickly enough, it's going to boot into a regular-looking-but-actually-non-functional CentOS install and after a few minutes will crap out, complaining that it can't find certain files -- instead of either booting from the hard drive or waiting for you to type something. You have to reboot again in order to get another chance.
Right now I'm reinstalling the front end for the THIRD TIME in two days. For some reason, the installation is crapping out and refusing to store the static IP address for the outward-facing interface of the front end. Reinstalling means sitting in the server room feeding CDs (no network installation without an already-existing front end) into old servers (which have no DVD drives) for an hour, then waiting another half hour to see what's gone wrong this time.
Sigh.