Don't do that, then

A question from a user today took two hours (not counting this entry) to track down: why are my jobs idle when showq shows a lot of free processors? (Background: we have a Rocks cluster with 39 nodes and 492 cores, running Torque + Maui with a pretty vanilla config.)

First off, showq did show a lot of free cores:

$ showq
[lots of jobs]

192 Active Jobs     426 of  492 Processors Active (86.59%)
            38 of   38 Nodes Active      (100.00%)
IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT QUEUETIME

32542                  jdoe       Idle     1  1:00:00:00  Fri Feb 15 13:55:55
32543                  jdoe       Idle     1  1:00:00:00  Fri Feb 15 13:55:55
32544                  jdoe       Idle     1  1:00:00:00  Fri Feb 15 13:55:56

Okay, so why? Let's take one of those jobs:

$ checkjob 32542
checking job 32542

State: Idle
Creds:  user:jdoe  group:example  class:default  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00:00
SubmitTime: Fri Feb 15 13:55:55
  (Time Queued  Total: 2:22:55:26  Eligible: 2:22:55:26)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 196  StartCount: 0
PartitionMask: [ALL]
Flags:       HOSTLIST RESTARTABLE
HostList:
  [compute-1-3:1]
Reservation '32542' (21:17:14 -> 1:21:17:14  Duration: 1:00:00:00)
PE:  1.00  StartPriority:  4255
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 216  feasible procs:   0

Rejection Reasons: [State        :    1][HostList     :   38]

Note the bit that says:

job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)

If we run "checkjob -v", we see some additional info (all the rest is the same):

Detailed Node Availability Information:

compute-2-1              rejected : HostList
compute-1-1              rejected : HostList
compute-3-2              rejected : HostList
compute-1-3              rejected : State
compute-1-4              rejected : HostList
compute-1-5              rejected : HostList
[and on it goes...]

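This means that compute-1-3, one of the nodes we have, has been assigned to the job: the job's HostList contains only that node, so every other node is rejected for "HostList", while compute-1-3 itself is rejected for "State" because it's busy. So it'll get to the job Real Soon Now. Problem solved!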

Well, no. Because if you run something like this:

showq -u jdoe | awk '/Idle/ {print "checkjob -v " $1}' | sh

then a) you're probably in a state of sin, and b) you'll see that there are a lot of jobs assigned to compute-1-3. WTF?
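For the record, here's a slightly less sinful way to get the same answer: a rough sketch that tallies which host each idle job is pinned to, instead of eyeballing hundreds of checkjob dumps. The function names (pinned_hosts, idle_host_tally) are made up for this entry, and the parsing assumes checkjob prints the "HostList:" block exactly as in the output above and that showq puts the job state in the third column. Your Maui version may format things differently.

```shell
#!/bin/sh
# Hypothetical helpers; they assume the checkjob/showq layouts shown in this entry.

# Read checkjob output on stdin and print the host names found in the
# "HostList:" block (entries look like "  [compute-1-3:1]").
pinned_hosts() {
    awk '
        /^HostList:/ { in_list = 1; next }      # entries start on the next line
        in_list && /^[ \t]*\[/ {
            line = $0
            gsub(/[ \t]/, "", line)             # drop indentation
            gsub(/\[/, " ", line)               # turn brackets into separators
            gsub(/\]/, " ", line)
            n = split(line, entries, " ")
            for (i = 1; i <= n; i++) {
                split(entries[i], parts, ":")   # "host:taskcount" -> host
                print parts[1]
            }
            next
        }
        { in_list = 0 }                         # any other line ends the block
    '
}

# Count how many idle jobs of one user are pinned to each host.
# Usage: idle_host_tally jdoe
idle_host_tally() {
    showq -u "$1" | awk '$3 == "Idle" { print $1 }' | while read -r job; do
        checkjob "$job" | pinned_hosts
    done | sort | uniq -c | sort -rn
}
```

If one host shows up with a count in the hundreds while everything else is absent, you're looking at the same problem I was.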

Well, this thread looks pretty close to what I'm seeing. And as it turns out, the user in question submitted a lot of jobs (hundreds) all at the same time. Ganglia lost track of all the nodes for a while, so I assume that Torque did as well. (Haven't checked into that yet; I'm trying to get this written down first, since documenting stuff for Rocks is always a problem for me.) The thread reply suggests qalter, but that doesn't seem to work.

While I'm at it, here's a list of stuff that doesn't work:

(Oh, and btw turns out runjob is supposed to be replaced by mjobctl, but mjobctl doesn't appear to work. True story.)

So at this point I'm stuck suggesting two things to the user:

God I hate HPC sometimes.

Helpful links that explained some of this: