Torque problems with Rocks

I've been puttering away at work getting the cluster going. It's hard, because there are a lot of things I'm having to learn on the go. One of the biggest chunks is Torque and Maui, and how they interact with each other and Rocks as a whole.

For example: today I tried submitting a crapton of jobs all at once. After a while I checked the queue with showq (a Maui command; not to be confused with qstat, which is Torque) and found that a lot of jobs were listed as "Deferred" rather than "Idle". I watched, and the idle ones ran; the deferred ones just stayed in place, even after the list of running jobs was all done.

At first I thought this might be something to do with fairness. There are a lot of knobs to twiddle in Maui, and since I hadn't looked at the configuration after installation I wasn't really sure what was there. But near as I could tell, there wasn't anything happening there; the config file for Maui was empty, and I couldn't seem to find any mention of what the default settings were. I followed the FAQ and ran the various status commands, but couldn't really see anything obvious there.

Then I tried looking in the Torque logs (/opt/torque/server_logs), and found this:

06/08/2011 14:42:09;0008;PBS_Server;Job;8356.example.com;send of job to compute-3-2 failed error = 15008
06/08/2011 14:42:09;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Access from host not allowed, or unknown host (15008) in send_job, child failed in previous commit request for job 8356.example.com

And on compute-3-2 (/opt/torque/mom_logs):

06/08/2011 14:42:01;0080;   pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=QueueJob, from PBS_Server@example.local
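To see how widespread the rejections are, something like the following could tally them per node. It's only a sketch: the function name rejecting_nodes is made up, and it assumes every failed send is logged in exactly the "send of job to NODE failed error = CODE" shape shown in the server log above.

```shell
# Hypothetical helper (not part of Torque): reads server log lines on stdin
# and prints a count per node that refused a job with the given error code.
# Assumes the "send of job to NODE failed error = CODE" format seen above.
rejecting_nodes() {
    code="${1:-15008}"   # default to the error code from the log excerpt
    sed -n "s/.*send of job to \([^ ]*\) failed error = ${code}\$/\1/p" |
        sort | uniq -c | sort -rn
}

# Possible usage on the head node:
#   cat /opt/torque/server_logs/* | rejecting_nodes
```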

That's weird. I ran rocks sync config out of superstition, but nothing changed. I found a suggestion that it might be a bug in Torque, and to run momctl -d to see if the head node was in the trusted client list. It was not. I tried running that command on all the nodes (sudo rocks run host compute command="momctl -d3 | grep Trusted | grep 10.1.1.1"); it turned out that only 10 of them listed the head node. What the hell?
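One catch with the plain grep above: "grep 10.1.1.1" treats the dots as wildcards and would also match an address like 10.1.1.10. A stricter check might look like the sketch below. The function name head_is_trusted is made up, and it assumes the trusted list shows up on a line containing "Trusted" with addresses separated by spaces, commas, or colons; I haven't confirmed the exact momctl output format.

```shell
# Hypothetical helper: reads `momctl -d3` output on stdin and succeeds only if
# a line containing "Trusted" holds the given address as a whole token.
# Assumption: addresses are separated by spaces, commas, colons, or tabs.
head_is_trusted() {
    addr="$1"
    grep 'Trusted' |
        tr -s ' ,:\t' '\n' |       # one token per line
        grep -Fxq -- "$addr"       # literal, whole-line match
}

# Possible usage on a compute node:
#   momctl -d3 | head_is_trusted 10.1.1.1 && echo "head node is trusted"
```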

I'm still not sure exactly where this gets set, but I did notice that /opt/torque/mom_priv/config listed the head node as the server, and was identical on all machines. On a hunch, I tried restarting the pbs service on all the nodes; suddenly they all came up. I submitted a bunch more jobs, and they all ran through -- none were deferred. And running momctl -d showed that, yes, the head node was now in the trusted client list.
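For the record, the restart can be pushed out to every compute node in one go with rocks run host, roughly as sketched here. The wrapper name restart_pbs_everywhere is made up, and "service pbs restart" is an assumption about how the init script is invoked; adjust the service name if your nodes call it something else.

```shell
# Hypothetical wrapper: restart the pbs service on all compute nodes.
# Assumption: the init script is named "pbs" and is driven via `service`.
restart_pbs_everywhere() {
    svc="${1:-pbs}"   # service name; override if it differs on your nodes
    rocks run host compute command="service ${svc} restart"
}

# Possible usage on the head node (as root or via sudo):
#   restart_pbs_everywhere
```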

Thoughts: