Torque problems with Rocks
08 Jun 2011
I've been puttering away at work getting the cluster going. It's hard, because there are a lot of things I'm having to learn on the go. One of the biggest chunks is Torque and Maui, and how they interact with each other and with Rocks as a whole.
For example: today I tried submitting a crapton of jobs all at once. After a while I checked the queue with showq (a Maui command; not to be confused with qstat, which is Torque) and found that a lot of jobs were listed as "Deferred" rather than "Idle". I watched, and the idle ones ran; the deferred ones just stayed in place, even after the list of running jobs was all done.
At first I thought this might be something to do with fairness. There are a lot of knobs to twiddle in Maui, and since I hadn't looked at the configuration after installation I wasn't really sure what was there. But near as I could tell, there wasn't anything happening there; the config file for Maui was empty, and I couldn't seem to find any mention of what the default settings were. I followed the FAQ and ran the various status commands, but couldn't really see anything obvious there.
Then I tried looking in the Torque logs (/opt/torque/server_logs), and found this:
06/08/2011 14:42:09;0008;PBS_Server;Job;8356.example.com;send of job to compute-3-2 failed error = 15008
06/08/2011 14:42:09;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Access from host not allowed, or unknown host (15008) in send_job, child failed in previous commit request for job 8356.example.com
And on compute-3-2 (/opt/torque/mom_logs):
06/08/2011 14:42:01;0080; pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=QueueJob, from PBS_Server@example.local
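With a lot of jobs in flight, reading the server log by eye gets tedious. A rough Python sketch of how one might tally which nodes rejected jobs with error 15008 follows. It assumes the semicolon-separated layout seen in the server log lines above (timestamp, event code, daemon, class, object, message) and the exact "send of job to ... failed" wording from that excerpt; the function name rejected_nodes is purely illustrative and not part of Torque.

```python
import re
from collections import Counter

# Message wording copied from the server log excerpt above.
# Assumption: other Torque versions may phrase this differently.
SEND_FAILED = re.compile(r"send of job to (\S+) failed error = (\d+)")


def rejected_nodes(log_lines, error_code="15008"):
    """Count, per node, how many job sends failed with the given error code."""
    counts = Counter()
    for line in log_lines:
        # Assumed layout: timestamp;event;daemon;class;object;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not a line in the expected format
        match = SEND_FAILED.search(fields[5])
        if match and match.group(2) == error_code:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # The two server log lines quoted in the post.
    sample = [
        "06/08/2011 14:42:09;0008;PBS_Server;Job;8356.example.com;"
        "send of job to compute-3-2 failed error = 15008",
        "06/08/2011 14:42:09;0001;PBS_Server;Svr;PBS_Server;"
        "LOG_ERROR::Access from host not allowed, or unknown host (15008) "
        "in send_job, child failed in previous commit request for job "
        "8356.example.com",
    ]
    for node, count in rejected_nodes(sample).most_common():
        print(node, count)
```

Fed the real files under /opt/torque/server_logs, this would give a quick list of which compute nodes are refusing the head node, which is the same set that needs checking with momctl.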
That's weird. I ran rocks sync config out of superstition, but nothing changed. I found a suggestion that it might be a bug in Torque, and to run momctl -d to see if the head node was in the trusted client list. It was not. I tried running that command on all the nodes (sudo rocks run host compute command="momctl -d3 | grep Trusted | grep 10.1.1.1"); it turned out that only 10 of them listed it. What the hell?
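One wrinkle with the grep approach is that a plain substring match on 10.1.1.1 would also match an address such as 10.1.1.10. A minimal Python sketch of a stricter check is below. It assumes only what the grep pipeline implies, namely that the relevant momctl output line contains the word "Trusted" followed by IP addresses; the exact line format, the sample strings, and the function names are hypothetical, not taken from Torque's documentation.

```python
import re

# Matches dotted-quad IPv4 addresses as whole tokens.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def head_node_trusted(momctl_output, head_ip="10.1.1.1"):
    """Return True if head_ip appears on a line mentioning 'Trusted'."""
    for line in momctl_output.splitlines():
        if "Trusted" in line:
            # Compare whole addresses so 10.1.1.1 does not match 10.1.1.10,
            # which a plain substring grep would allow.
            if head_ip in IPV4.findall(line):
                return True
    return False


def untrusting_nodes(outputs_by_host, head_ip="10.1.1.1"):
    """Given {hostname: momctl output}, list hosts that lack head_ip."""
    return sorted(
        host
        for host, text in outputs_by_host.items()
        if not head_node_trusted(text, head_ip)
    )


if __name__ == "__main__":
    # Hypothetical output lines, made up for illustration only.
    outputs = {
        "compute-0-0": "Trusted Client List:  10.1.1.1,127.0.0.1",
        "compute-3-2": "Trusted Client List:  127.0.0.1",
    }
    print(untrusting_nodes(outputs))
```

How the per-host output gets collected (for instance by capturing what rocks run host prints) is left out here, since that depends on the local setup.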
I'm still not sure exactly where this gets set, but I did notice that /opt/torque/mom_priv/config listed the head node as the server, and was identical on all machines. On a hunch, I tried restarting the pbs service on all the nodes; suddenly they all came up. I submitted a bunch more jobs, and they all ran through -- none were deferred. And running momctl -d showed that, yes, the head node was now in the trusted client list.
Thoughts:
None of this was shown by Ganglia (which just monitors load) or showq (which is a Maui command; the problem was with Torque).
Doubtless there were commands I should've been running in Torque to show these things.
While the head node is running a syslog server and collects stuff from the client nodes, Torque logs are not among them; I presume Torque is not using syslog. (Must check that out.)
I still don't know how the trusted client list is set. If it's in a text file, that's something that I think Rocks should manage.
I'm not sure if tracking down the problem this way is exactly the right way to go. I think it's important to understand this, but I suspect the Rocks approach would be "just reboot or reinstall". There's value to that, but I intensely dislike not knowing why things are happening and sometimes that gets in my way.