Don't do that, then
18 Feb 2013

A question from a user today took two hours (not counting this entry) to track down: why are my jobs idle when showq shows there are a lot of free processors? (Background: we have a Rocks cluster, 39 nodes, 492 cores. Torque + Maui, pretty vanilla config.)
First off, showq did show a lot of free cores:
$ showq
[lots of jobs]
192 Active Jobs 426 of 492 Processors Active (86.59%)
38 of 38 Nodes Active (100.00%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
32542 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:55
32543 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:55
32544 jdoe Idle 1 1:00:00:00 Fri Feb 15 13:55:56
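For what it's worth, you can cross-check that picture straight from Torque and Maui; the exact output varies by version, so treat these as a rough sketch:
$ mdiag -n                                # Maui's per-node view: state plus configured/available procs
$ pbsnodes -a | grep -c 'state = free'    # Torque's view: how many nodes report themselves free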
Okay, so why? Let's take one of those jobs:
$ checkjob 32542
checking job 32542
State: Idle
Creds: user:jdoe group:example class:default qos:DEFAULT
WallTime: 00:00:00 of 1:00:00:00
SubmitTime: Fri Feb 15 13:55:55
(Time Queued Total: 2:22:55:26 Eligible: 2:22:55:26)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 196 StartCount: 0
PartitionMask: [ALL]
Flags: HOSTLIST RESTARTABLE
HostList:
[compute-1-3:1]
Reservation '32542' (21:17:14 -> 1:21:17:14 Duration: 1:00:00:00)
PE: 1.00 StartPriority: 4255
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 216 feasible procs: 0
Rejection Reasons: [State : 1][HostList : 38]
Note the bit that says:
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
If we run "checkjob -v", we see some additional info (all the rest is the same):
Detailed Node Availability Information:
compute-2-1 rejected : HostList
compute-1-1 rejected : HostList
compute-3-2 rejected : HostList
compute-1-3 rejected : State
compute-1-4 rejected : HostList
compute-1-5 rejected : HostList
[and on it goes...]
This means that compute-1-3, one of the nodes we have, has been assigned to the job. It's busy, so it'll get to the job Real Soon Now. Problem solved!
Well, no. Because if you run something like this:
showq -u jdoe | awk '/Idle/ {print "checkjob -v " $1}' | sh
then a) you're probably in a state of sin, and b) you'll see that there are a lot of jobs assigned to compute-1-3. WTF?
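If you'd rather have a tally than eyeball pages of checkjob output, something like this rough sketch does the counting (it assumes the node shows up on the line right after "HostList:", as in the output above):
for j in $(showq -u jdoe | awk '/Idle/ {print $1}'); do
    # grab the node named just below "HostList:" for each idle job
    checkjob $j | grep -A1 'HostList:' | grep -o 'compute-[0-9-]*'
done | sort | uniq -c | sort -rn
In this case it was compute-1-3 all the way down.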
Well, a thread I turned up looks pretty close to what I'm seeing. And as it turns out, the user in question submitted a lot of jobs (hundreds) all at the same time. Ganglia lost track of all the nodes for a while, so I assume that Torque did as well. (Haven't checked into that yet... trying to get this down first; documenting stuff for Rocks is always a problem for me.) The thread's reply suggests qalter, but that doesn't seem to work.
While I'm at it, here's a list of stuff that doesn't work (spelled out below):
- "qalter -l neednodes=<some other node>" ; maui restart
- "runjob -c <jobid>" ; maui restart
- "runjob -c <jobid>" ; "releasehold <jobid>" ; maui restart
(Oh, and btw turns out runjob is supposed to be replaced by mjobctl, but mjobctl doesn't appear to work. True story.)
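For the record, the attempts looked roughly like this (32542 and compute-2-1 are just examples pulled from the output above; each attempt was followed by a Maui restart, and none of it freed the jobs):
$ qalter -l neednodes=compute-2-1 32542    # try pointing the job at a different node
$ runjob -c 32542                          # try to clear/force-run the job via Maui
$ releasehold 32542                        # drop any holds on the job, just in case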
So at this point I'm stuck suggesting two things to the user:
- Don't submit umpty jobs at once
- Either wait for compute-1-3 to work through all your jobs, or cancel and resubmit them.
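If they go the cancel-and-resubmit route, something like this should clear out the queued jobs (a sketch; qselect and qdel are standard Torque commands, but eyeball the job list before piping it into qdel):
$ qselect -u jdoe -s Q                  # list jdoe's queued jobs
$ qselect -u jdoe -s Q | xargs qdel     # ...and delete them, to be resubmitted in smaller batches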
God I hate HPC sometimes.