Rocks Lessons Part 2 -- Torque, Maui and OpenMPI

Torque is a resource manager; it's an open source project with a long history. It keeps track of resources -- typically compute nodes, but it's "flexible enough to handle scheduling a conference room". It knows how many compute nodes you have, how much memory, how many cores, and so on.

Maui is the job scheduler. It looks at the jobs being submitted, notes what resources you've asked for, and makes requests of Torque. It keeps track of what work is being done, what needs to be done, and what has been completed.
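To make that concrete, here's a minimal sketch of what submitting a job looks like; the script name, program name, and resource numbers are just made-up examples. The #PBS lines ask Torque for two nodes with four cores each and an hour of walltime, and qsub hands the job over to Torque/Maui to run when those resources come free:

$ cat myjob.sh
#!/bin/sh
#PBS -N myjob
#PBS -l nodes=2:ppn=4
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
./my_program

$ qsub myjob.sh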

MPI stands for "Message Passing Interface". Like BLAS, it's a standard with different implementations. A lot of HPC/scientific programs use it to exchange messages related to their work between processes -- often, but not necessarily, running on separate computers.

MPI is worth mentioning in the same breath as Torque and Maui because of mpiexec, which is part of OpenMPI, a popular open-source implementation of MPI. mpiexec (aka mpirun, aka orterun) lets you launch processes in an OpenMPI environment, even if those processes don't use MPI at all. In other words, there's no problem running something like "mpiexec echo 'Hello, world!'".

To focus on OpenMPI and mpiexec: you can run n copies of your program with the "-np" argument. Thus, "-np 8" will run 8 copies of your program...but all 8 will run on the machine you invoke mpiexec on:

$ mpiexec -np 8 hostname
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org

This isn't always useful -- why pay big money for all this hardware if you're not going to use it? -- so you can tell mpiexec to run on other hosts as well:

$ mpiexec -np 8 -host compute-0-0,compute-0-1 hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local

And if you're going to do that, you might as well give it a file to read, right?

$ mpiexec -np 8 -hostfile /opt/openmpi/etc/openmpi-default-hostfile hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local

That path is where Rocks puts the default hostfile, but a hostfile could live anywhere -- including in your home directory, if you decide you want your job to run on a particular set of machines.
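For example, a hostfile of your own might look something like this -- the host names match my compute nodes, and "slots" tells OpenMPI how many processes it may place on each host:

$ cat ~/my-hostfile
compute-0-0 slots=4
compute-0-1 slots=4

$ mpiexec -np 8 -hostfile ~/my-hostfile hostname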

However, if you're doing that, then you're really setting yourself up as the resource manager. Isn't that Torque's job? Didn't we set all this up so that you wouldn't have to keep track of what machine is busy?

Fortunately, OpenMPI can work with Torque. The OpenMPI FAQ puts it this way:

  1. How do I run jobs under Torque / PBS Pro?

The short answer is just to use mpirun as normal.

Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Torque / PBS Pro directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will use PBS/Torque-native mechanisms to launch and kill processes (rsh and/or ssh are not required).
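In other words, with a Torque-aware build of OpenMPI, a job script could be as simple as something like this (the program name is just a placeholder):

#!/bin/sh
#PBS -l nodes=2:ppn=4
cd $PBS_O_WORKDIR
mpirun ./my_mpi_program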

Whee! So easy! Except that Rocks does not compile OpenMPI with Torque support!

Because the Rocks project is kind of a broad umbrella, with lots of sub-projects underneath, the Torque roll is separate from the OpenMPI roll. And since installing one doesn't mean you'll install the other, it may not make sense to build OpenMPI with Torque support by default.

The fine folks at Rocks asked the fine folks at OpenMPI and found a way around this: have every MPI job submitted to Torque/Maui source /opt/torque/etc/openmpi-setup.sh. While not efficient, it works; the recommended way, though, is to recompile OpenMPI on a machine with Torque installed so that it knows about Torque.
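As I understand it, a job script using the workaround looks much like the sketch above, with one extra line to source the setup script before calling mpiexec:

#!/bin/sh
#PBS -l nodes=2:ppn=4
. /opt/torque/etc/openmpi-setup.sh
cd $PBS_O_WORKDIR
mpiexec ./my_mpi_program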

To me, this makes the whole Rocks installation less useful, particularly since this didn't seem terribly well documented. To be fair, it is there in the Torque roll documentation:

Although OpenMPI has support for the torque tm-interface (tm=taskmanager) it is not compiled into the library shipped with Rocks (the reason for this is that the OpenMPI build process needs to have access to libtm from torque to enable the interface). The best workaround is to recompile OpenMPI on a system with torque installed. Then the mpirun command can talk directly to the batch system to get the nodelist and start the parallel application using the torque daemon already running on the nodes. Job startup times for large parallel applications are significantly shorter using the tm-interface than using ssh to start the application on all nodes.

So maybe I should just shut my mouth.

In any event, I suspect I'll end up recompiling OpenMPI in order to get it to see Torque.
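If I do, my understanding is that the rebuild boils down to pointing OpenMPI's configure script at the Torque installation via its --with-tm option -- something like the following, assuming Torque lives under /opt/torque and OpenMPI should end up under /opt/openmpi:

$ ./configure --prefix=/opt/openmpi --with-tm=/opt/torque
$ make
$ make install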