How I spent my day

This took me a while to figure out. (All my war stories start with that sentence...)

A faculty member is getting a new cluster next year. In the meantime, I've been setting up Rocks on a test bed of older machines to get familiar with it. This week I've been working out how Torque, Maui and MPI work, and today I tried running something non-trivial.

CHARMM is used for molecular simulations; it's mostly (I think) written in Fortran and has been around since the 80s. It's not the worst-behaved scientific program I've had to work with.

I had an example script from the faculty member to run. I was able to run it on the head node of the cluster like so:

mpirun -np 8 /path/to/charmm < stream.inp > out ZZZ=testscript.inp

8 CHARMM processes still running after, like, 5 days. (These things run forever, and I got distracted.) Sweet!

Now to use the cluster the way it was intended: by running the processes on the internal nodes. Just a short script and away we go:

$ cat test_mpi_charmm.sh
#PBS -N test_charmm
#PBS -S /bin/sh
#PBS -l nodes=2:ppn=2

. /opt/torque/etc/openmpi-setup.sh
mpirun /path/to/charmm < stream.inp ZZZ=testscript.inp
$ qsub test_mpi_charmm.sh

But no, it wasn't working. The error file showed:

At line 3211 of file ensemble.f
Fortran runtime error: No such file or directory

mpirun has exited due to process rank 0 with PID 9494 on
node compute-0-1.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).


Well, that's helpful...but the tail of the output file showed:

CHARMM>    ensemble open unit 19 read card name    -
 CHARMM> restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
 Parameter: FILEROOT -> "TEST_RUN"
 Parameter: PREV -> "FOO"
 Parameter: NREP -> "1"
 Parameter: NODE -> "0"
 ENSEMBLE>   REPLICA NODE   0
 ENSEMBLE>   OPENING FILE restart/test_run_foo_nr1_nd0
 ENSEMBLE>   ON UNIT  19
 ENSEMBLE>   WITH FORMAT FORMATTED       AND ACCESS READ

What the what now?

Turns out CHARMM can checkpoint work as it goes along, saving its state in a restart file that can be read when starting up again. This is a Good Thing(tm) when calculations can take weeks and might be interrupted. From the CHARMM docs, the restart-relevant keyword is:

IUNREA     -1     Fortran unit from which the dynamics restart file should
          be read. A value of -1 means don't read any file.

(I'm guessing a Fortran unit is something like a file descriptor; haven't had time to look it up yet.)

The Fortran unit for the restart file is set in this bit of the test script:

iunrea 19 iunwri 21 iuncrd 20

Next, the file name gets attached to unit 19 in this bit:

ensemble open unit 19 read card name     -
"restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"

An @ sign indicates a variable, it seems. And it's Fortran, and Fortran's been around forever, so it's case-insensitive. So the restart file is being set to "restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst". Snipping from the input file, here's where the variables are set:

set fileroot  test
set prev minim
set node ?whoiam
set nrep ?nensem

"test" appears to be just a string, and I'm assuming "minim" tags some previous minimization run. But "whoiam" and "nensem" are set by MPI and turned into CHARMM variables. From CHARMM's documentation:

The CHARMM run is started using MPI commands to specify the number of processes
(replicas) to use, each of which is an identical copy of charmm. This number
is automatically passed by MPI to each copy of the executable, and it is set to
the internal CHARMM variable 'nensem', which can be used in scripts, e.g.
    set nrep ?nensem
The other internal variable set automatically via MPI is 'whoiam', e.g.
    set node ?whoiam
These are useful for giving different file names to different nodes.
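To make that concrete, here's a quick shell sketch -- not CHARMM itself, just mimicking the @-substitution, with variable names borrowed from the input script and an 8-replica run assumed, like the one that worked:

```shell
#!/bin/sh
# Mimic CHARMM's @-substitution for the restart file name.
# nrep stands in for ?nensem (total replicas, set by MPI);
# node stands in for ?whoiam (this replica's rank, 0..nrep-1).
fileroot=test
prev=minim
nrep=8
node=0
while [ "$node" -lt "$nrep" ]; do
    echo "restart/${fileroot}_${prev}_nr${nrep}_nd${node}.rst"
    node=$((node + 1))
done
```

With 8 replicas, each rank looks for its own file, restart/test_minim_nr8_nd0.rst through restart/test_minim_nr8_nd7.rst. Launch with a different process count and the names change, so the files from an earlier run won't be found.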

So remember the way charmm was being invoked in the two jobs? The way it worked:

mpirun -np 8 ...

...and the way it didn't:

mpirun ...

Aha! Follow the bouncing ball: the restart file names are built from ?nensem and ?whoiam, and MPI sets ?nensem from the number of processes launched. The run that worked had 8 replicas, so the restart files on disk are named for 8 of them; launch with any other process count and CHARMM goes looking for restart files that were never written. Hence the "No such file or directory".

At first I thought that I could get away with increasing the number of copies of charmm that would run by fiddling with Torque's server_priv/nodes file -- telling it that the nodes had 4 processors each (so 8 total) rather than 2. (These really are old machines.) Then I'd change the PBS line to "-l nodes=2:ppn=4", and we're done! Hurrah! Except no: I got the same error as before.
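For reference, that fiddling happens in Torque's node definition file, server_priv/nodes under the Torque home directory (/opt/torque/server_priv/nodes here -- path assumed from where openmpi-setup.sh lives above). A sketch of the fibbed version, using the test bed's Rocks-style node names:

```
compute-0-0 np=4
compute-0-1 np=4
```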

What does work is changing the mpirun args in the qsub file:

mpirun -np 8 ...

However, what that does is run all 8 copies on one compute node -- which works, hurrah, but it's not what I think we want.

I think this is a problem for the faculty member to solve, though. It's taken me a whole day to figure this out, and I'm pretty sure I wouldn't understand the implications of just (say) deking out the bit that looks for a restart file. (Besides, there are many such bits.) (Oh, and incidentally, just moving the files out of the way doesn't help...still barfs and dies.) I'll email him about this and let him deal with it.