Troubleshooting deferred jobs, episode 80

The other day at $WORK, a user asked me why the jobs she was submitting to the cluser were being deferred. They only needed one core each, and showq showed lots free, so WTF?

By the time I checked on the state of these deferred jobs, the jobs were already running -- and yeah, there were lots of cores free. The checkjob command showed something interesting, though:

$ checkjob 34141 | grep Messages
Messages:  cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'

I thought this was from the node that the job was on now:

$ qstat -f 34141 | grep exec_host
exec_host = compute-3-5/19

but that was a red herring. (I could've also got the host from "checkjob | grep -2 "Allocated Nodes".) Instead, grepping through maui.log showed that it had been compute-1-11 that was the real problem:

/opt/maui/log $ sudo grep 34141 maui.log.3 maui.log.2 maui.log.1 maui.log |grep -E 'WARN|ERROR'
maui.log.3:03/05 16:21:48 ERROR:    job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:48 ERROR:    job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:50 ERROR:    job '34141' cannot be started: (rc: 15041  errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'  hostlist: 'compute-1-11')
maui.log.3:03/05 16:21:50 WARNING:  cannot start job '34141' through resource manager
maui.log.3:03/05 16:21:50 ERROR:    cannot start job '34141' in partition DEFAULT
maui.log.3:03/05 17:21:56 ERROR:    job '34141' cannot be started: (rc: 15041  errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'  hostlist: 'compute-1-11')

There were lots of messages like this; I think the scheduler kept only gave up on that node much later (hours).

checknode showed nothing wrong; in fact, it was running a job currently and had 4 free cores:

$ checknode compute-1-11
checking node compute-1-11

State:      Busy  (in current state for 6:23:11:32)
Configured Resources: PROCS: 12  MEM: 47G  SWAP: 46G  DISK: 1M
Utilized   Resources: PROCS: 8
Dedicated  Resources: PROCS: 8
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:      13.610
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [default 0:12]

Total Time:   INFINITY  Up:   INFINITY (98.74%)  Active:   INFINITY (18.08%)

Reservations:
  Job '33849'(8)  -6:23:12:03 -> 93:00:47:56 (99:23:59:59)
JobList:  33849

maui.log showed an alert:

maui.log.10:03/03 22:32:26 ALERT: RM state corruption. job '34001' has idle node 'compute-1-11' allocated (node forced to active state)

but that was another red herring; this is common and benign.

dmesg on compute-1-11 showed the problem:

compute-1-11 $ dmesg | tail
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
<<vendor>> ASC=0x80 ASCQ=0x87ASC=0x80 <<vendor>> ASCQ=0x87
Info fld=0x10489
end_request: I/O error, dev sda, sector 66697
Aborting journal on device sda1.
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
(Linux)|Wed Mar 06 09:37:20|[compute-1-11:~]$ mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda5 on /state/partition1 type ext3 (rw)
/dev/sda2 on /var type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
sophie:/export/scratch on /share/networkscratch type nfs (rw,addr=10.1.1.1)

mount: warning /etc/mtab is not writable (e.g. read-only filesystem).
   It's possible that information reported by mount(8) is not
   up to date. For actual information about system mount points
   check the /proc/mounts file.

but this was also logged on the head node in /var/log/messages:

$ sudo grep compute-1-11.local /var/log/* |grep -vE 'automount|snmpd|qmgr|smtp|pam_unix|Accepted publickey' > ~/rt_1526/compute-1-11.syslog
/var/log/messages:Mar  6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed
/var/log/messages:Mar  6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in remtree, unlink failed on /opt/torque/mom_priv/jobs/34038.sophie.TK
/var/log/messages:Mar  6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed

and in /var/log/kern:

$ sudo tail /var/log/kern
Mar  5 10:05:00 compute-1-11.local kernel: Aborting journal on device sda1.
Mar  5 10:05:01 compute-1-11.local kernel: ext3_abort called.
Mar  5 10:05:01 compute-1-11.local kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Mar  5 10:05:01 compute-1-11.local kernel: Remounting filesystem read-only
Mar  7 05:18:06 compute-1-11.local kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range

There are a few things I've learned from this:

I've started to put some of these commands in a sub -- that's a really awesome framework from 37 signals to collect commonly-used commands together. In this case, I've named the sub "sophie", after the cluster I work on (named in turn after the daughter of the PI). You can find it on github or my own server (github is great, but what happens when it goes away? ...but that's a rant for another day.) Right now there are only a few things in there, and they're somewhat specific to my environment, and doubtless they could be improved -- but it's helping a lot so far.