How I spent my day
23 Dec 2010

This took me a while to figure out. (All my war stories start with that sentence...)
A faculty member is getting a new cluster next year. In the meantime, I've been setting up Rocks on a test bed of older machines to get familiar with it. This week I've been working out how Torque, Maui and MPI work, and today I tried running something non-trivial.
CHARMM is used for molecular simulations; it's mostly (I think) written in Fortran and has been around since the 80s. It's not the worst-behaved scientific program I've had to work with.
I had an example script from the faculty member to run. I was able to run it on the head node of the cluster like so:
mpirun -np 8 /path/to/charmm < stream.inp > out ZZZ=testscript.inp
Eight CHARMM processes were still running after, like, 5 days. (These things run forever, and I got distracted.) Sweet!
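Checking in on a run like that is nothing fancier than counting processes on the head node (a quick sketch; the pattern assumes the binary really is called charmm):
$ ps -ef | grep '[c]harmm' | wc -l    # the [c] keeps grep from counting itself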
Now to use the cluster the way it was intended: by running the processes on the internal nodes. Just a short script and away we go:
$ cat test_mpi_charmm.sh
#PBS -N mpi_charmm
#PBS -S /bin/sh
#PBS -l nodes=2:ppn=4
. /opt/torque/etc/openmpi-setup.sh
mpirun /path/to/charmm < stream.inp ZZZ=testscript.inp
$ qsub test_mpi_charmm.sh
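Torque drops the job's stdout and stderr into <jobname>.o<jobid> and <jobname>.e<jobid> files next to the submit script, so that's where to go digging once the job has run (the job ID here is made up):
$ qstat                 # watch the job's state
$ cat mpi_charmm.e42    # the error file
$ cat mpi_charmm.o42    # the output file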
But no, it wasn't working. The error file showed:
At line 3211 of file ensemble.f
Fortran runtime error: No such file or directory
mpirun has exited due to process rank 0 with PID 9494 on
node compute-0-1.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
Well, that's helpful...but the tail of the output file showed:
CHARMM> ensemble open unit 19 read card name -
CHARMM> restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
Parameter: FILEROOT -> "TEST_RUN"
Parameter: PREV -> "FOO"
Parameter: NREP -> "1"
Parameter: NODE -> "0"
ENSEMBLE> REPLICA NODE 0
ENSEMBLE> OPENING FILE restart/test_run_foo_nr1_nd0
ENSEMBLE> ON UNIT 19
ENSEMBLE> WITH FORMAT FORMATTED AND ACCESS READ
What the what now?
Turns out CHARMM can checkpoint its work as it goes along, saving state in a restart file that can be read back when starting up again. This is a Good Thing(tm) when calculations can take weeks and might be interrupted. From the CHARMM docs, the relevant restart option is:
IUNREA -1 Fortran unit from which the dynamics restart file should
be read. A value of -1 means don't read any file.
(I'm guessing a Fortran unit is something like a file descriptor; haven't had time to look it up yet.)
The name of the restart file is set in this bit of the test script:
iunrea 19 iunwri 21 iuncrd 20
Next is this bit:
ensemble open unit 19 read card name -
"restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
An @ sign indicates a variable, it seems. And it's Fortran, and Fortran's been around forever, so it's case-insensitive. So the restart file name is built from "restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst". Snipping from the input file, here is where those variables are set:
set fileroot test
set prev minim
set node ?whoiam
set nrep ?nensem
test" appears to be just a string. I'm assuming "minim" is some kind of numerical constant. But "whoiam" and "nensem" are set by MPI and turned into CHARMM variables. From charmm's documentation:
The CHARMM run is started using MPI commands to specify the number of processes
(replicas) to use, each of which is an identical copy of charmm. This number
is automatically passed by MPI to each copy of the executable, and it is set to
the internal CHARMM variable 'nensem', which can be used in scripts, e.g.
set nrep ?nensem
The other internal variable set automatically via MPI is 'whoiam', e.g.
set node ?whoiam
These are useful for giving different file names to different nodes.
So remember the way charmm was being invoked in the two jobs? The way it worked:
mpirun -np 8 ...
...and the way it didn't:
mpirun ...
Aha! Follow the bouncing ball:
- The input script wants to load a checkpoint file...
- ...which is named after the number of processes mpi was told to run...
- ...and the script barfs if it's not there.
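Spelled out in shell terms, the filenames each copy goes hunting for look something like this (a sketch; fileroot and prev are taken from the snippet above, and the replica count stands in for whatever mpirun was told to start):
nrep=8                                  # ?nensem: how many copies mpirun started
for node in $(seq 0 $((nrep - 1))); do  # ?whoiam: 0 .. nrep-1, one per copy
  echo "restart/test_minim_nr${nrep}_nd${node}.rst"
done
Change the process count and every one of those names changes with it, so restart files written by an 8-copy run simply don't exist as far as a differently-sized run is concerned.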
At first I thought I could get away with increasing the number of copies of CHARMM that would run by fiddling with Torque's server_priv/nodes file -- telling it that the nodes had 4 processors each (so 8 total) rather than 2. (These really are old machines.) Then I'd change the PBS line to "-l nodes=2:ppn=4", and we're done! Hurrah! Except no: I got the same error as before.
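(For the record, that fiddling is just editing Torque's nodes file and restarting pbs_server so it notices. The path is a guess based on where Rocks put the rest of Torque on this box, and the hostnames are the Rocks defaults:)
# /opt/torque/server_priv/nodes -- claim 4 cores per node
compute-0-0 np=4
compute-0-1 np=4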
What does work is changing the mpirun args in the qsub file:
mpirun -np 8 ...
However, what that does is run 8 copies on one compute node -- which works, hurrah, but it's not what I think we want:
- many copies on many nodes
- communicating as necessary
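A cheap way to see where the ranks actually land is to run something trivial under the same qsub script (swap this in for the charmm line):
mpirun -np 8 hostname | sort | uniq -c
# eight lines from a single host means everything is piled onto one node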
I think this is a problem for the faculty member to solve, though. It's taken me a whole day to figure this out, and I'm pretty sure I wouldn't understand the implications of just (say) deking out the bit that looks for a restart file. (Besides, there are many such bits.) (Oh, and incidentally, just moving the files out of the way doesn't help...still barfs and dies.) I'll email him about this and let him deal with it.
1 Comment
From: Andrew Ring
25 December 2010 23:27:38
Hello,
I may have read too quickly, but if I am catching the problem and you are using the mpich version of mpirun...
One needs to pass mpirun a list of the machines on which to run the job first. Torque keeps such a list in a variable called $PBS_NODEFILE.
I run something along these lines:
#This assumes two processors per node and only two nodes.
# $PBS_NODEFILE is set by Torque
# $NPROCS receives the number of processors allocated by Torque
set NPROCS=`wc -l < $PBS_NODEFILE`
cat $PBS_NODEFILE | /usr/bin/awk '{n++;if (n==2) {printf $1":2\n";n=0}}' > mach.tmp
#This sets up a prefix that is used in later execution
# mpirun is passed a list of the machines on which it will run
# As well as the total number of processors to run on.
# I think mpirun allocates the runs
setenv prefix "/usr/local/mpich-1.2.7/bin/mpirun -machinefile mach.tmp -np $NPROCS"
#Define $charmm, $mdin, and $mdout.
#Charmm is then run via:
$prefix $charmm < $mdin > $mdout
Hope this helps.