When compiling CHARMM, I'll sometimes encounter errors like this:
charmm/lib/gnu/iniall.o: In function `stopch_':
iniall.f:(.text+0x1404): relocation truncated to fit: R_X86_64_PC32 against symbol `ldbia_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x14af): relocation truncated to fit: R_X86_64_PC32 against symbol `seldat_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x14d7): relocation truncated to fit: R_X86_64_32S against symbol `seldat_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x151b): relocation truncated to fit: R_X86_64_32S against symbol `seldat_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x1545): relocation truncated to fit: R_X86_64_PC32 against symbol `shakeq_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x1551): relocation truncated to fit: R_X86_64_PC32 against symbol `shakeq_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x1560): relocation truncated to fit: R_X86_64_PC32 against symbol `kspveci_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x156e): relocation truncated to fit: R_X86_64_PC32 against symbol `kspveci_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x16df): relocation truncated to fit: R_X86_64_PC32 against symbol `shpdat_' defined in COMMON section in charmm/lib/gnu/iniall.o
charmm/lib/gnu/iniall.o: In function `iniall_':
iniall.f:(.text+0x1cae): relocation truncated to fit: R_X86_64_PC32 against symbol `cluslo_' defined in COMMON section in charmm/lib/gnu/iniall.o
iniall.f:(.text+0x1cb8): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
The problem is that the linker is running out of room:
What this means is that the full 64-bit address of foovar, which now lives somewhere above 5 gigabytes, can't be represented within the 32-bit space allocated for it.
The reason for this error is the amount of static data you're using: it shows up when your program needs more than 2GB of data at link time. Who needs that much data baked in at compile time? I do, for one, and so do plenty of other people in the HPC world. For all of them the lifesaver (or maybe just a day-saver) is the compiler option -mcmodel.
This is a known problem with CHARMM's "huge" keyword.
There are a couple of solutions:
Edit the CHARMM makefiles to include GCC's -mcmodel argument (though you should be aware of the subtleties of that argument); a minimal sketch follows this list.
Switch to XXLARGE or some other, smaller memory keyword.
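For the first option, here's roughly what the change amounts to. The flag is GCC's; the exact file and compile line you touch depend on your CHARMM install, so treat this as illustrative only:
gfortran -c -O2 -mcmodel=medium iniall.f -o iniall.o
# -mcmodel=medium keeps code in the low 2GB but allows >2GB of static data (the big COMMON blocks);
# -mcmodel=large lifts the code restriction too, at some performance cost.
# The flag generally has to go on every object that touches the oversized COMMON blocks.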
This page, from the University of Alberta, also has excellent background information. (Oh, and also? They have a YouTube channel on using Linux clusters.)
Last week I was running some benchmarks on the new cluster at $WORK; I was trying to see what effect compiling a new, Torque-aware version of OpenMPI would have. As you may remember, the stock version of OpenMPI that comes with Rocks is not Torque-aware, so a workaround was added that told OpenMPI which nodes Torque had allocated to it.
The change was in the Torque submission script. Stock version:
source /opt/torque/etc/openmpi-setup.sh
/opt/openmpi/bin/mpiexec -n $NUM_PROCS /usr/bin/emacs
New, Torque-aware version:
/path/to/new/openmpi/bin/mpiexec -n $NUM_PROCS /usr/bin/emacs
(Of course I benchmark the cluster using Emacs. Don't you have the code for the MPI-aware version?)
In the end, there wasn't a whole lot of difference in runtimes; that didn't surprise me too much, since (as I understand it) the difference between the two methods is mainly in the starting of jobs -- the overhead at the beginning, rather than in the running of the job itself.
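Incidentally, if you want to confirm that a given OpenMPI build really is Torque-aware, a check like this should do it (hedged: the tm components only show up when OpenMPI was configured --with-tm):
/path/to/new/openmpi/bin/ompi_info | grep -i ' tm'
# expect lines like "MCA ras: tm ..." and "MCA plm: tm ..." if Torque support was compiled in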
For fun, I tried running the job with MPICH2, another MPI implementation:
/opt/mpich2/gnu/bin/mpiexec -n $NUM_PROCS /usr/bin/emacs
and found pretty terrible performance. It turned out that it wasn't running on all the nodes...in fact, it was only running on one node, and with as many processes as CPUs I'd specified. Since this was meant to be a 4-node, 8-CPU/node version, that meant 32 copies of Emacs on one node. Damn right it was slow.
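A quick way to see where the ranks are actually landing is to launch something trivial and count hostnames (a hedged sketch; substitute whichever mpiexec you're testing):
/opt/mpich2/gnu/bin/mpiexec -n $NUM_PROCS hostname | sort | uniq -c
# 32 lines from one host = everything piled onto a single node; 8 per host across 4 hosts = what we wanted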
So what the hell? First thought was that maybe this was a library-versus-launching mismatch. You compile MPI applications using the OpenMPI or MPICH2 versions of the Gnu compilers -- which are basically just wrappers around the regular tools that set library paths and such correctly. So if your application links to OpenMPI but you launch it with MPICH2, maybe that's the problem.
I still need to test that. However, I think what's more likely is that MPICH2 is not Torque-aware. The ever-excellent Debian Clusters has a helpful page on this, with a link to the other other other mpiexec page. Now I need to figure out if the Rocks people have changed anything since 2008, and if the Torque Roll documentation is incomplete or just misunderstood (by me).
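Checking for the library-versus-launcher mismatch is easy enough for a dynamically linked binary (hedged; the library names vary by MPI build):
ldd /path/to/your_mpi_app | grep -iE 'libmpi|libmpich'
# libmpi.so usually means OpenMPI, libmpich.so means MPICH2 -- launch with the matching mpiexec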
(Keep in mind: I don't know what I'm talking about. I've written this down because I've found very little that seems to explain this to a newbie. I'm probably wrong; if you know that I'm wrong, leave a comment.)
Linpack is "a software library for performing numerical linear algebra on digital computers". It has been superceded for that purpose by Lapack; now it's mainly used for benchmarking. The latest incarnation, used for scores on the Top 500 list, is called hpl (High Performance Linpack); it uses MPI for communication.
If you search for "linpack score", you'll find an astonishing number of people posting scores for their phone. If you're looking for information on maxing your score on your HTC Dream, I can't help you.
If you have a cluster, then near as I can tell there are two reasons to do this:
The first is the sort of measuring contest that you hope gets you into the Top 500 list. It probably affects your funding, and may affect your continued employment.
The second uses Linpack as a shakedown (did the test work? did anything break?), or as a way of benchmarking performance. Sometimes people will use Linpack scores as a baseline; they'll make tweaks to a cluster (add more memory, change MTU, turn off more daemons, twiddle BIOS settings, etc) and see what the effect is. Linpack is not perfect for this; it stresses CPU and FPU, possibly memory and network, and doesn't really check disk, power usage or other things. But it's a start, it's familiar, and it boils down to a single number.
(HAH.)
The theoretical peak score is:
CPU GHz x Flops/Hz x Cores/node x nodes
(Cite). Flops/Hz is CPU-specific; for the Nehalems, at least, it's 4/Hz. Thus, for the cluster I'm working on, the peak score is:
2.67 GHz x 4 Flops/Hz x 24 Cores/node x 35 nodes
(Note the assumption of HyperThreading turned on in the 24 cores/node figure; while I've got that turned on now, I should probably turn it off -- at least for running Linpack.)
Anyhow, comes out to about 8970 GFlops, or almost 9 TFlops. By contrast, Tianhe-1A, the top entry in the November 2010 Top 500 list, has an RPeak of 4701000 GFlops -- so 4701 TFlops, or about 4.7 Petaflops. So there's no need to buy me that "I made the Top 500 and all I got was this lousy t-shirt" t-shirt yet.
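For what it's worth, that arithmetic is easy to sanity-check from the shell:
# theoretical peak, in GFlops: clock x flops-per-cycle x cores-per-node x nodes
awk 'BEGIN { print 2.67 * 4 * 24 * 35 }'    # => 8971.2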
Of course, that's a theoretical peak, and a lot depends on the way your system is configured. For example, [[https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2010-May/047136.html][this post]] to the Rocks-discuss mailing list says:
FYI, I've got around 84-85% on a cluster with Infiniband and OpenMPI, but some people told me they get better results.
That's 85% of the theoretical max, and that's with Infiniband. Jeezum Crow.
Linpack uses a configuration file named HPL.dat. The format is a little non-obvious, but is documented here. Here's a sample file, as generated by cbench (about which more later):
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
108304 346573 368233 # N -- fantastically important; see ahead
1 # Number of block sizes
80 112 96 # Block sizes
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
25 # Ps
26 # Qs
8.0 threshold
1 # of panel fact
0 2 1 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 2 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 2 0 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 3 1 2 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
256 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
(Not the best format; sorry.)
P and Q, multiplied, should be the number of cores you want to use (thanks, Tim Doug) -- and so need to match the parameters you pass to Torque, or whatever batching system you're using.
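So, for example, a hypothetical 4-node, 8-core-per-node run would look something like this (paths and the HPL binary name are illustrative):
# 32 ranks total, so set P=4 and Q=8 in HPL.dat (P x Q must equal the rank count)
#PBS -l nodes=4:ppn=8
mpirun -np 32 /path/to/xhpl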
Block sizes: Tim Doug says: "128 works well for me. Others suggest 80, 160, or 256. Experiment." cbench uses 80, 112 and 96; I think this is just how they do things. I see a peak (in very early tests) around 150.
N: The biggie. This posting to the Rocks-discuss mailing list gives an excellent overview of how N works in the Linpack test:
If you really want to stress your cluster, you want to have your matrix size fill approximately 80% of memory. For an NxN matrix, you consume N*N*8 bytes. If you have a 16-node cluster, for example, with 8 GB memory/node, then you have 16*8 GB*0.80 = 102 GB. N would be approximately sqrt(102e9/8) ~ 113,000.
That's a pretty big matrix and takes O(113,000^3) or 1.4 quadrillion floating point ops. If your nodes were 8-core, 2.5 GHz, 4 Flops/cycle, then one would expect this matrix to factor (at reasonable computational efficiency) in the 30-90 minute range. The exact time depends on efficiency, the constant on the O(n^3) term, and the actual speed of your processors and network.
HPL will allow you to set up various matrix sizes. Set up something that will compute quickly, e.g. a 1000x1000 matrix, to verify that everything is happy. Then step through some sizes that will take 1-5 minutes to factor; this will let you calibrate the time you expect the full run to take. Remember each doubling of matrix size results in 8X the number of floating ops. You get more efficiency as you get larger (more computation relative to communication), but it starts to level off pretty quickly. For most interconnects, using ~20% of memory is usually a decent indicator of ultimate system performance: if 20% takes 1 minute to actually compute, you expect the 80% run (8X) to take about an hour.
Full machine LINPACK runs can take many hours to run.
(It's worth emphasizing that Linpack really does take a long time to run with large N; this discussion shows how to start small and ramp up your Linpack tests as you gain confidence.)
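That rule of thumb is easy to turn into a one-liner. Here's a hedged sketch using the 16-node, 8 GB/node example from the quote above (the variable names are mine; adjust for your own cluster):
NODES=16; GB_PER_NODE=8; FRACTION=0.80   # fraction of total memory to fill
awk -v n=$NODES -v g=$GB_PER_NODE -v f=$FRACTION \
    'BEGIN { printf "N ~ %d\n", sqrt(n * g * 1e9 * f / 8) }'
# => N ~ 113137, in line with the ~113,000 figure above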
cbench, which I mentioned above, is a suite of programs meant to exercise and benchmark a cluster. It's really quite excellent, but as with a lot of things in HPC the documentation isn't as explicit as it could be. For example, the N figures above aren't explained anywhere obvious; they correspond to roughly 20% of memory for a short run (good for a ballpark figure) and 80%-85% for longer runs (to really stress things), which is in good agreement with the approach outlined above.
So if you run Linpack a bunch of times and tweak the parameters, you'll see different results. This page discusses why:
The parallel solution of a system of linear equations requires some communication between the processors. To measure the loss of efficiency due to this communication, we solved systems of equations of varying size on a varying number of processors. The general rule is: larger N means more work for each CPU and less influence of communication. As you can see from Fig. 1, a 4-CPU setup comes very close to the single-CPU peak performance of 528 Mflops. This indicates that the solver that works in HPL is not significantly worse than ATLAS. The relative speed per CPU decreases with increasing number of CPUs, however.
The problem size N is limited by the total memory. Tina has 512 MByte per node, i.e. each node can hold at most an 8192x8192 matrix of double precision floats. In practice, the matrix has to be smaller since the system itself needs a bit of memory, too. If both CPUs on a node are operating, the maximum size reduces to 5790x5790 per CPU. To minimize the relative weight of communication, the memory load should be as high as possible on each node. In Fig. 2 you can see how the effective speed increases with increasing load factor. A load factor of 1 means that 256 MByte are required on each node to hold the NxN coefficient matrix.
[...]
With all 144 CPUs, communication becomes the major bottleneck. The current performance of 41 Gflops scales down to 284 Mflops/CPU [as compared to 528 MFlops for N=5000 on a single 4-CPU system]. The CPUs seem to spend almost half of their time chatting with each other...
Really, as someone else said, it's a black art. There are a ton of papers out there on optimizing Linpack parameters. There's even -- and I am crapping you negative -- a software project called ga-linhack that aims to "Develop a complete genetic algorithm tool set for determining optimal parameters for Linpack runs." Because as they say:
To most cluster engineers (the authors included) the tuning explanations of the hpl parameters yield little clue as to the underlying effect of varying these parameters. Not everyone can take a graduate mathematics course in advanced linear algebra in their free time.
Testify!
This page (mentioned above) also has this quote:
After this, look for the top 8 or 16 results, and refine the config file to use only the parameters that produced these results.
...which for me brought up a lot of questions I haven't answered yet. I'm still figuring all this out.
I've been puttering away at work getting the cluster going. It's hard, because there are a lot of things I'm having to learn on the go. One of the biggest chunks is Torque and Maui, and how they interact with each other and Rocks as a whole.
For example: today I tried submitting a crapton of jobs all at once.
After a while I checked the queue with showq (a Maui command; not to be confused with qstat, which is Torque's) and found that a lot of jobs were listed as "Deferred" rather than "Idle". I watched, and the idle ones ran; the deferred ones just stayed in place, even after the list of running jobs was all done.
At first I thought this might be something to do with fairness. There are a lot of knobs to twiddle in Maui, and since I hadn't looked at the configuration after installation I wasn't really sure what was there. But near as I could tell, there wasn't anything happening there; the config file for Maui was empty, and I couldn't seem to find any mention of what the default settings were. I followed the FAQ and ran the various status commands, but couldn't really see anything obvious there.
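In retrospect, one Maui command that would probably have helped is checkjob, which, pointed at a deferred job, usually prints the reason it's being held back (a hedged example; the job id is made up):
checkjob -v 8356    # look for the deferral/hold reason near the bottom of the output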
Then I tried looking in the Torque logs (/opt/torque/server_logs), and found this:
06/08/2011 14:42:09;0008;PBS_Server;Job;8356.example.com;send of job to compute-3-2 failed error = 15008
06/08/2011 14:42:09;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Access from host not allowed, or unknown host (15008) in send_job, child failed in previous commit request for job 8356.example.com
And on compute-3-2 (/opt/torque/mom_logs)
06/08/2011 14:42:01;0080; pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=QueueJob, from PBS_Server@example.local
That's weird. I ran rocks sync config out of superstition, but nothing changed. I found a suggestion that it might be a bug in Torque, and to run momctl -d to see if the head node was in the trusted client list. It was not. I tried running that command on all the nodes (sudo rocks run host compute command="momctl -d3 | grep Trusted | grep 10.1.1.1"); turned out that only 10 were. What the hell?
I'm still not sure exactly where this gets set, but I did notice that /opt/torque/mom_priv/config listed the head node as the server, and was identical on all machines. On a hunch, I tried restarting the pbs service on all the nodes; suddenly they all came up. I submitted a bunch more jobs, and they all ran through -- none were deferred. And running momctl -d showed that, yes, the head node was now in the trusted client list.
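For the record, the fix boils down to something like this (hedged: the init script may be called pbs_mom or just pbs, depending on how the Torque roll installed it):
# restart the MOM daemon on every compute node, then re-check the trusted client list
sudo rocks run host compute command="/etc/init.d/pbs_mom restart"
sudo rocks run host compute command="momctl -d3 | grep Trusted | grep 10.1.1.1"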
Thoughts:
None of this was shown by Ganglia (which just monitors load) or showq (which is a Maui command; the problem was with Torque).
Doubtless there were commands I should've been running in Torque to show these things.
While the head node is running a syslog server and collects stuff from the client nodes, Torque logs are not among them; I presume Torque is not using syslog. (Must check that out.)
I still don't know how the trusted client list is set. If it's in a text file, that's something that I think Rocks should manage.
I'm not sure if tracking down the problem this way is exactly the right way to go. I think it's important to understand this, but I suspect the Rocks approach would be "just reboot or reinstall". There's value to that, but I intensely dislike not knowing why things are happening and sometimes that gets in my way.
I came across a problem compiling GotoBLAS2 at work today. It went well on a practice cluster, but on the new one I got this error:
gcc -c -O2 -Wall -m64 -DF_INTERFACE_G77 -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=24 -DASMNAME=strmm_ounncopy -DASMFNAME=strmm_ounncopy_ -DNAME=strmm_ounncopy_ -DCNAME=strmm_ounncopy -DCHAR_NAME=\"strmm_ounncopy_\o
../kernel/x86_64/gemm_ncopy_4.S: Assembler messages:
../kernel/x86_64/gemm_ncopy_4.S:192: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:193: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:194: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:195: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:197: Error: undefined symbol `WPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:345: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:346: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:348: Error: undefined symbol `WPREFETCHSIZE' in operation
The solution was simple:
gmake clean
gmake TARGET=NEHALEM
The problem appears to be that newer CPUs (Intel X5650 in my case) are not detected properly by the CPU ID routine in GotoBLAS2. You can verify this by checking the contents of config.h in the top-level directory. Without TARGET=NEHALEM, I saw this line:
#define INTEL_UNKNOWN
But with TARGET=NEHALEM, this becomes:
#define NEHALEM
The problem with gemm_ncopy_4.S arises because it defines RPREFETCHSIZE and WPREFETCHSIZE using #ifdef statements depending on CPU type. There is an entry for #ifdef GENERIC, but that was not set for me in config.h.
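A quick sanity check before kicking off a long build, based on what ends up in config.h:
grep -E 'NEHALEM|INTEL_UNKNOWN' config.h   # INTEL_UNKNOWN means the CPU ID routine didn't recognize the chip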
In addition, if you type "gmake TARGET=NEHALEM" without "gmake clean" first, you get a little further before you run into a similar error:
usr/bin/ld: ../libgoto2_nehalemp-r1.13.a(ssymv_U.o): relocation R_X86_64_32S against `PREFETCHSIZE' can not be used when making a shared object; recompile with -fPIC
../libgoto2_nehalemp-r1.13.a(ssymv_U.o): could not read symbols: Bad value
If I were a better person, I'd have a look at how the sizes are defined and figure out what the right value is for newer CPUs, then modify cpuid.c (which I presume is what's being used to generate config.h, or at least this part of it). Maybe another day...
Oh god this week. I've been setting up the cluster (three chassis' worth of blades from Dell). I've installed Rocks on the front end (rackmount R710). After that:
All blades powered on.
Some installed, most did not. Not sure why. Grub Error 15 is the result, which is Grub for "File not found".
I find suggestions in the Rocks mailing list to turn off floppy controllers. Don't have floppy controllers exactly in these, but I do see boot order includes USB floppy and USB CDROM. Pick a blade, disable, PXE boot and reinstall. Whee, it works!
Try on another blade and find that reinstallation takes 90 minutes. Network looks fine; SSH to the reinstalling blade and wget all the RPMs in about twelve seconds. What the hell?
Discover Rocks' Avalanche Installer and how it uses BitTorrent to serve RPMs to nodes. Notice that the installing node is constantly ARPing to find nodes that aren't turned on (they're waiting for me to figure out what the hell's going on). Restart service rocks-tracker on the front end and HOLY CRAP now it's back down to a three minute installation. Make a mental note to file a bug about this.
Find out that Dell OpenManage Deploy Toolkit is the best way to populate a new machine w/BIOS settings, since the Chassis Management Console can't push that particular setting to blades. Download that, start reading.
Try fifteen different ways of connecting virtual media using CMC. Once I find out the correct syntax for NFS mounts (amusingly different between manuals), some blades find it and some don't; no obvious hints why. What the hell?
Give up, pick a random blade and tell it by hand where to find the goddamn ISO. (This ignores the problems of getting Java apps to work in Awesome [hint: use wmname], which is my own fault.) Collect settings before and after disabling USB CDROM and Floppy and find no difference; this setting is apparently not something they expose to this tool.
Give up and try PXE booting this blade even with the demon USB devices still enabled. It works; installation goes fine and after it reboots it comes up fine. What the hell?
Power cycle the blade to see if it still works and it reinstalls. Reinstalls Rocks. What the hell?
Discover /etc/init.d/rocks-grub, which at boot modifies grub.conf to PXE boot next time, and at graceful shutdown reverses the change, allowing the machine to boot normally. The thinking is that if you have to power cycle a machine, you probably want to reinstall it anyhow.
Finally put this all together. Restart tracker, set all blades in one of the chassis' to reinstall. Pick a random couple of blades and fire up consoles. Power all the blades up. Installation fails with anaconda error, saying it can't find any more mirrors. What the hell?
eth0 is down on the front end; dmesg shows hundreds of "kernel: bnx2: eth0 NIC Copper Link is Down" messages starting approximately the time I power-cycled the blades.
I give up. I am going here tonight because my wife is a good person and is taking me there. And I am going to have much, and much-deserved, beer.
This took me a while to figure out. (All my war stories start with that sentence...)
A faculty member is getting a new cluster next year. In the meantime, I've been setting up Rocks on a test bed of older machines to get familiar with it. This week I've been working out how Torque, Maui and MPI work, and today I tried running something non-trivial.
CHARMM is used for molecular simulations; it's mostly (I think) written in Fortran and has been around since the 80s. It's not the worst-behaved scientific program I've had to work with.
I had an example script from the faculty member to run. I was able to run it on the head node of the cluster like so:
mpirun -np 8 /path/to/charmm < stream.inp > out ZZZ=testscript.inp
8 CHARMM processes still running after, like, 5 days. (These things run forever, and I got distracted.) Sweet!
Now to use the cluster the way it was intended: by running the processes on the internal nodes. Just a short script and away we go:
$ cat test_mpi_charmm.sh
#PBS -N test_charmm
#PBS -S /bin/sh
#PBS -N mpi_charmm
#PBS -l nodes=2:ppn=4
. /opt/torque/etc/openmpi-setup.sh
mpirun /path/to/charmm < stream.inp ZZZ=testscript.inp
$ qsub test_mpi_charmm.sh
But no, it wasn't working. The error file showed:
At line 3211 of file ensemble.f
Fortran runtime error: No such file or directory
mpirun has exited due to process rank 0 with PID 9494 on
node compute-0-1.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
Well, that's helpful...but the tail of the output file showed:
CHARMM> ensemble open unit 19 read card name -
CHARMM> restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
Parameter: FILEROOT -> "TEST_RUN"
Parameter: PREV -> "FOO"
Parameter: NREP -> "1"
Parameter: NODE -> "0"
ENSEMBLE> REPLICA NODE 0
ENSEMBLE> OPENING FILE restart/test_run_foo_nr1_nd0
ENSEMBLE> ON UNIT 19
ENSEMBLE> WITH FORMAT FORMATTED AND ACCESS READ
What the what now?
Turns out CHARMM has the ability to checkpoint work as it goes along, saving its work in a restart file that can be read when starting up again. This is a Good Thing(tm) when calculations can take weeks and might be interrupted. From the charmm docs, the restart-relevant command is:
IUNREA -1 Fortran unit from which the dynamics restart file should
be read. A value of -1 means don't read any file.
(I'm guessing a Fortran unit is something like a file descriptor; haven't had time to look it up yet.)
The name of the restart file is set in this bit of the test script:
iunrea 19 iunwri 21 iuncrd 20
Next is this bit:
ensemble open unit 19 read card name -
"restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
An @ sign indicates a variable, it seems. And it's Fortran, and Fortran's been around forever, so it's case-insensitive. So the restart file is being set to "restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst". Snipping from the input file, here's where the variables are set:
set fileroot test
set prev minim
set node ?whoiam
set nrep ?nensem
test" appears to be just a string. I'm assuming "minim" is some kind of numerical constant. But "whoiam" and "nensem" are set by MPI and turned into CHARMM variables. From charmm's documentation:
The CHARMM run is started using MPI commands to specify the number of processes
(replicas) to use, each of which is an identical copy of charmm. This number
is automatically passed by MPI to each copy of the executable, and it is set to
the internal CHARMM variable 'nensem', which can be used in scripts, e.g.
set nrep ?nensem
The other internal variable set automatically via MPI is 'whoiam', e.g.
set node ?whoiam
These are useful for giving different file names to different nodes.
So remember the way charmm was being invoked in the two jobs? The way it worked:
mpirun -np 8 ...
...and the way it didn't:
mpirun ...
Aha! Follow the bouncing ball: the restart file names depend on ?nensem and ?whoiam -- that is, on how many MPI processes get launched and which rank each one is -- so launch CHARMM with a different number of processes and it goes looking for restart files that don't exist.
At first I thought that I could get away with increasing the number of copies of charmm that would run by fiddling with torque/server_priv/nodes -- telling it that the nodes had 4 processors each (so a total of 8) rather than 2. (These really are old machines.) And then I'd change the PBS line to "-l nodes=2:ppn=4", and we're done! Hurrah! Except no: I got the same error as before.
What does work is changing the mpirun args in the qsub file:
mpirun -np 8 ...
However, what that does is run 8 copies on one compute node -- which works, hurrah, but it's not what I think we want: all 8 processes crammed onto a single node instead of spread across the nodes Torque allocated.
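For what it's worth, the usual way to get one rank per allocated slot without hard-coding the count is to lean on the node file Torque hands the job (a hedged sketch; whether CHARMM is then happy with that rank count is a separate question):
NP=$(wc -l < $PBS_NODEFILE)    # one line per allocated processor slot
mpirun -np $NP -machinefile $PBS_NODEFILE /path/to/charmm < stream.inp ZZZ=testscript.inp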
I think this is a problem for the faculty member to solve, though. It's taken me a whole day to figure this out, and I'm pretty sure I wouldn't understand the implications of just (say) deking out the bit that looks for a restart file. (Besides, there are many such bits.) (Oh, and incidentally, just moving the files out of the way doesn't help...still barfs and dies.) I'll email him about this and let him deal with it.
So we want a cluster at $WORK. I don't know a lot about this, so I figure that something like Rocks or OSCAR is the way to go. OSCAR didn't look like it had been worked on in a while, so Rocks it is. I downloaded the CDs and got ready to install on a handful of old machines.
(Incidentally, I was a bit off-base on OSCAR. It is being worked on, but the last "production-ready" release was version 5.0, in 2006. The newest release is 6.0.5, but as the documentation says:
Note that the OSCAR-6.0.x version is not necessarily suitable for production. OSCAR-6.0.x is actually very similar to KDE-4.0.x: this version is not necessarily "designed" for the users who need all the capabilities traditionally shipped with OSCAR, but this is a good new framework to include and develop new capabilities and move forward. If you are looking for all the capabilities normally supported by OSCAR, we advice you to wait for a later release of OSCAR-6.1.
So yeah, right now it's Rocks.)
Rocks promises to be easy: it installs a frontend, then that frontend installs all your compute nodes. You install different rolls: collections of packages. Everything is easy. Whee!
Only it's not that way, at least not consistently.
I'm reinstalling this time because I neglected to install the Torque roll last time. In theory you can install a roll to an already-existing frontend; I couldn't get it to work.
A lot of stuff -- no, that's not true. Some stuff in Rocks is just not documented very well, and it's the little but important and therefore irritating-when-it's-missing stuff. For example: want the internal compute nodes to be LDAP clients, rather than syncing /etc/passwd all around? That means modifying /var/411/Files.mk to include things like /etc/nsswitch and /etc/ldap.conf. That's documented in the 4.x series, but it has been left out of the 5.x series. I can't tell why; it's still in use.
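For reference, the change amounts to something like this on the frontend (a hedged sketch; I'm going from memory on the variable name in Files.mk, so check yours):
# /var/411/Files.mk -- add the extra files you want pushed to the compute nodes
FILES += /etc/nsswitch.conf
FILES += /etc/ldap.conf
# then rebuild so 411 propagates the new files:
make -C /var/411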
When you boot from the Rocks install CD, you're directed to type "Build" (to build a new frontend) or "Rescue" (to go into rescue mode). What it doesn't tell you is that if you don't type in something quickly enough, it's going to boot into a regular-looking-but-actually-non-functional CentOS install and after a few minutes will crap out, complaining that it can't find certain files -- instead of either booting from the hard drive or waiting for you to type something. You have to reboot again in order to get another chance.
Right now I'm reinstalling the front end for the THIRD TIME in two days. For some reason, the installation is crapping out and refusing to store the static IP address for the outward-facing interface of the front end. Reinstalling means sitting in the server room feeding CDs (no network installation without an already-existing front end) into old servers (which have no DVD drives) for an hour, then waiting another half hour to see what's gone wrong this time.
Sigh.