After a lot of faffing about, I've accomplished the following on
the backup server at $WORK:
* Broken out /var/lib/mysql to a separate, mirrored, Linux software
  RAID-1 275 GB partition; it's using about 36 GB of that at the
  moment, which is 15% -- the lowest it's been in a long, LONG-ass
  time.
* Migrated the Bacula catalog db to InnoDB.
* Shrunk the RAID-0 spool partition to about 1.6 TB, down from 2
  TB; did this to free up the two disks for the mirrored partition.
* Ensured that MySQL will use /dev/shm as a temporary area.
* Sped up the restoration of files (which was mostly because of
  earlier "analyze" commands on the File table while it was still
  MyISAM).
* innodb_file_per_table is on; innodb_buffer_pool_size=10G;
  default_storage_engine=InnoDB.
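In my.cnf terms, that works out to something like this (a sketch of
the relevant stanza; tmpdir is the knob that points MySQL's temporary
area at /dev/shm):

```
[mysqld]
datadir                 = /var/lib/mysql   # now on the RAID-1 partition
tmpdir                  = /dev/shm         # temp tables and sorts in RAM
innodb_file_per_table   = 1
innodb_buffer_pool_size = 10G
default_storage_engine  = InnoDB
```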
I encountered the following problems:
* The stupid raid card in the backup server only supports two RAID drives --
thus, the mirrored drive for /var/lib/mysql is Linux software
raid. I'd have preferred to keep things consistent, but it was
not to be.
* The many "analyze" and "repair" steps took HOURS...only to turn
  out to be deadlocked because it was running out of tmp space.
* I had to copy the mysql files to the RAID-0 drive to have enough
  space to do the conversion.
* Knock-on effects included lack of sleep and backups not being run
  last night.

Basically, this took a lot of tries to get right, and about all
of my time for the last 24 hours.
I learned:
* The repair of the File table while MyISAM, with tmp files in
/dev/shm, took about 15 minutes. That compares with leaving it
overnight and still not having it done.
* You have to watch the mysql log file for errors about disk space,
  and/or watch df -h to see /tmp or whatever fill up.
* You can interrupt a repair and go back to it afterward if you
  have to. At least, I was able to...I wouldn't do it on a regular
  basis, but it gives me cautious optimism that it's not an
  automatic ticket to restoring from backups.
* Importing the File.sql file (nominally 18 GB but du shows 5
  GB...sparse?), which converted it to InnoDB, took 2.5 hours.
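The conversion itself was of the dump-and-reimport variety. In
outline, it looks like this (a sketch; the sed step is how I'd rewrite
the engine, and ALTER TABLE File ENGINE=InnoDB would also work if you
have the room to do the copy in place):

```sh
# Dump just the File table from the bacula database...
mysqldump bacula File > File.sql
# ...rewrite the storage engine in the dump...
sed -i 's/ENGINE=MyISAM/ENGINE=InnoDB/' File.sql
# ...and reimport.  This was the 2.5-hour step.
mysql bacula < File.sql
```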
I still have to do these things:
* Update documentation.
* Update Bacula checks to include /var/lib/mysql.
* Perhaps raise innodb_buffer_pool_size to 20 GB.
* Set up a slave server again.
* A better way of doing this might've been to set up LVM on md0, then use snapshots for database backup.
* Test with a reboot! Right now I need to get some sleep.
A couple of days ago a website stopped working, refusing to show
anything below its top menu bar. I knew it was written in PHP, but I
hadn't realized it used Smarty, a template engine. One of Smarty's
features -- which I only learned about during the outage -- is that it
"compiles" (generates, really) template files into PHP code on the
fly as they're accessed and/or updated. The idea is that if you need
to change the template files, you just upload them; the next
time that page is accessed, Smarty notices that it needs to recompile
the PHP file, does so, and writes out the new file to a directory
called "template_c".
This means that it needs to write to that directory (which Smarty docs
recommend stay out of the DocumentRoot). But I hadn't turned that on,
and in general this sort of thing makes me nervous. (Though I do
allow it for other websites, so I'm not being very consistent here.)
The development machine has the files rooted at ~user/public_html; to
deploy, we rsync to the server's /var/www/$PROJECT directory. The
template_c files have already been compiled in the process of testing,
and those compiled files include() other Smarty libraries -- and they
specify the full path to those libraries. Can you see where this is
going?
As far as I can tell, what happened was:

* The PHP was tested on the dev server; Smarty compiled the templates,
  and added include() directives for the library files using the full
  path to the user's home directory.
* The PHP was copied into place on the web server and tested.
* Smarty looked at the templates, found the compiled versions, and ran
  them. In the process, it tried to load other Smarty libraries.
* It worked, because autofs mounted this user's home directory when
  requested.
* Time passed.
* On Wednesday, someone tried using the site. Autofs, for some
  reason, couldn't mount the developer's home directory, so Smarty
  couldn't include those files, and we saw nothing after the page's
  menu bar.
I got around this by using sed on the compiled template files to set
the correct path to the library files, which at least got the site up
again.
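The fix amounted to a one-liner along these lines (a sketch; both
paths are stand-ins matching the story above):

```sh
# Rewrite the dev-machine path baked into the compiled templates
# to point at the production location.  (Paths are hypothetical.)
sed -i 's|/home/developer/public_html|/var/www/PROJECT|g' \
    /var/www/PROJECT/template_c/*.php
```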
I'm still not sure what the hell went wrong with autofs. On this
particular server I used to have a static map for home directories,
and it didn't work with the developer's home. I thought I had
replaced that with an LDAP-based map a long time ago...but there it
was, and I can't find anything in the Cfengine git repo that shows a
change, or that I'd deployed the LDAP map in the first place. I
pushed that out, and now this all works the way it should...
...except that it shouldn't work like this. I'm reluctant to let PHP
write to the filesystem, but that's what Smarty expects. I think it
supports relative paths -- gotta dig up my notes -- which'd be fine.
(I saw one suggestion on the Smarty support forums to include "." in
your PHP_INCLUDE path using .htaccess, which made me run around the
room hollering and cursing.) As a workaround, I'm going to move the
development files to /var/www/$PROJECT on the dev server, which will
match production; I'm unhappy with this because it breaks the real
advantages Smarty brings, and makes the deployment process a bit
harder...but I'm still a nervous nelly.
First an Ubuntu upgrade took much, much longer than anticipated when
a) upgrading MySQL failed, for some reason, making for an amusing
series of attempts at actually completing do_release_upgrade, and b)
.Xauthority in my home directory was owned by root, which is a very,
very fun way of borking your X session.
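(The fix for that last one, at least, is a one-liner; a sketch, with
the username as a stand-in:)

```sh
# ~/.Xauthority owned by root means your user can't authenticate
# to the X server; hand it back (run as root):
chown username: /home/username/.Xauthority
```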
Then a panicked call about a website that no longer worked. Turned
out to be PHP trying to include files from a home directory that was
not available on the server. Why did this ever work in the first
place? I'm still not sure. Good thing I've got beer at home.
We've had a three-day stretch of clear skies; that's not the first
since the last time I went out, but damn near, and definitely the
first that wasn't in the middle of the week (middle-aged sysadmin
needs his goram sleep) or covered by sickness.
We spent Martinmas at my in-laws' drinking new wine and eating
chestnuts, and by the time we got back it was late and Jupiter was
already up. I set up the scope on the steps near our townhouse and
showed the kids.
Jr/Fresco and I'd been talking about what eyepiece to use: the 40mm or
the 12mm? He grabbed the 40mm since it was bigger, and was really
surprised to see how much smaller Jupiter looked (30x) compared to the
12mm (100x). Both saw the NEB and SEB, and noticed Europa, Callisto
and Io.
It was clear skies then...but in the hour it took me to read them
stories, put them to bed and get out the door to the local park,
clouds moved in and all but obscured everything...except Jupiter, that
is. (Cue macho joke about KING OF THE GODS, that's who.) So I made
lemonade and spent my time looking at Jupiter.
It was wonderful. The seeing was quite steady, and that made up for
things not being quite as bright as they might have been. I was able
to get up to 320x, which is a feat for me -- not to mention being able
to simply keep it in view when it sails across the field of view like
that.
The North Polar Region, the NEB and the SEB were easily visible,
and I could just make out the Great Red Spot rotating out of view.
From time to time I could distinguish the north and south components
of the SEB, the north and south Temperate Bands, and what looked like
a thin dark band right across the equator (which I just barely see
hints of in these photos; not sure if I was imagining that or
not).
Another thing I saw was the reappearance of Ganymede from
occultation (that is, from behind Jupiter's disk). I knew when to
expect it; when the time came, I saw it and thought "Oh yeah,
neat...not as cool as a transit, though." I ignored it for a few more
minutes, then realized something: I was seeing a disk, not just a
point o' light...and that was only at 200x. I had my copy of the
RASC Observer's Handbook (okay, maybe it is handy to have
around), so I looked up Ganymede and saw that it was half again as big
as the moon. Wow. I had a closer look at the other moons, and while
I couldn't really see any disks, I did seem to see a sort of brownish
colour to Callisto (which may actually be accurate).
I came in after only an hour; the clouds were erratic, and I wanted to
get inside. Not the widest-ranging observing session, but lots of fun.
I write these blog entries in Markdown mode, but markdown-mode in
Emacs doesn't stick links at the end of the text the way God intended
(and the way footnote-mode does). This is close, but not yet working:
    (defun x-hugh-markdown-footnote (description)
      "A Small but Useful(tm) function to add a footnote in Markdown mode.
    FIXME: Not yet working, but close."
      (interactive "sDescription: ")
      (let ((current-line (line-number-at-pos))
            (last-line (line-number-at-pos (point-max)))
            (link (read-string "Link: "))
            (link-number (x-hugh-get-next-link)))
        (save-excursion
          ;; Make sure there's a blank line before the link list.
          (unless (> (- last-line current-line) 1)
            (insert "\n"))
          (goto-char (point-max))
          (insert (format "\n[%d]: %s" link-number link)))
        (insert (format "[%s][%d]" description link-number))))

    (defun x-hugh-get-next-link ()
      "Figure out the number for the next link."
      (interactive)
      (save-excursion
        (goto-char (point-max))
        (beginning-of-line)
        ;; [0-9]+ rather than [0-9], so links past [9] keep counting.
        (if (looking-at "\\[\\([0-9]+\\)]:")
            (eval (+ 1 (string-to-number (match-string 1))))
          (eval 0))))
Right now it craps out with a "Wrong type argument:
integer-or-marker-p, nil" when it runs x-hugh-get-next-link.
Doubtless I'm doing a bad thing in my half-baked attempt to return a
number. But still, close!
(UPDATE: I figured it out! To return a number, wrap it with
eval. Fixed above. Working!)
(Believe it or not, I started out to write about Github and
bioinformatics. Such is the life of the easily distracted.)
ObMusic: "The Balcony" by The Rumour Said Fire. Reminds me of 60s
folk. I like it.
Another busy day at $WORK, busy enough that I missed the Judea Pearl
lecture the CS dep't was hosting. On the way out the door I grabbed
the copy of "Communications of the ACM" with his face on the cover,
thinking I'd catch up. The two pieces on him were quite small,
though, so it was on to other fare.
"The Myth of the Elevator Pitch" caught my eye, but as I read it
I became more and more convinced that I was reading literary
criticism combined with mystic bullshit. Example:
> At first glance, it appears that the elevator pitch is a component
> of the envisioning process. The purpose of that practice is to tell
> a compelling story (a "vision") of how the world would be if the
> innovation idea were incorporated into it. [...] But the notion
> that a pitch is a precis of an envisioning story is not quite right.
> A closer examination reveals that a pitch is actually a combination
> of a precis of the envisioning story and an offer. The offer to
> make the vision happen is the heart of the third practice.
Ah, the heart of the third practice. (I feel like that should be
capitalized: "the Heart of the Third Practice." Better!)
> The standard idea of a pitch is that it is a communication -- a
> presentation transmitting information from a speaker to a listener.
> In contrast, the core idea in the Eight Practices of innovation is
> that each practice is a conversation that generates a commitment to
> produce an outcome essential to an innovation. [....] The problem
> with the communication idea is that communications do not elicit
> commitments. Conversations elicit commitments. Commitments produce
> action.
Mistaking it for communication -- yes, an easy mistake to make.
> We can now define a pitch as a short conversation that seeks a
> commitment to listen to an offer conversation. It is transitional
> between the envisioning and offering conversations.
And back to the literary criticism. I feel like I've just watched a
new literary form being born. I'm only surprised that it seems the
CompSci folks scooped the LitCrit dep't.
(Of course, while I'm busy spellchecking all this, I am most definitely
not being published in "Communications of the ACM". So I suck.)
Tonight I picked up a grain order I placed with my local homebrew
club. Pickup was at Parallel 49, where the president of the
club is the brewer. My oldest son came along, and said brewer was
kind enough to give him (and his dad!) a quick tour of the place. My
son was impressed, and so was I; I'd never seen a 10k litre
fermentation tank before. It occurred to me later how overwhelming
that would be for me: to be faced with this enormous volume waiting to
be filled. I'm in awe of someone who can look at that and say, "Yeah,
I know exactly what I want to put in there."
I also came away with a free bottle of their Salty Scot Scottish Ale;
haven't tried that yet, but I did like the growler of the India Pale
Lager (which I was happy to pay for). It's a nice twist on the usual
IPA fare. I do regret having to leave behind the milk stout,
though...next time.
I've been asked to revisit Hadoop at $WORK. About a year ago I got a
small cluster (3 nodes) working and was plugging away at
Myrna...but then our need for Myrna disappeared, and Hadoop was
left fallow. The need this time around seems more permanent.
So far I'm trying to get a simple streaming job working. The
initial script is pretty simple:
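For illustration, here's the shape of the thing -- a word-count-style
mapper.sh/reducer.sh pair plus the streaming invocation. This is a
sketch, not my actual script, and the jar path varies by Hadoop
version:

```sh
#!/bin/sh
# mapper.sh -- emit one "word<TAB>1" line per word on stdin
tr -s '[:space:]' '\n' | awk 'NF { print $1 "\t" 1 }'
```

```sh
#!/bin/sh
# reducer.sh -- sum the counts per word; streaming hands us the
# mapper output sorted by key
awk -F'\t' '{ count[$1] += $2 }
            END { for (w in count) print w "\t" count[w] }'
```

```sh
# Submit the job (input/output are HDFS paths):
hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
    -input in -output out \
    -mapper mapper.sh -reducer reducer.sh \
    -file mapper.sh -file reducer.sh
```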
Running the mapper.sh/reducer.sh files works fine; the problem is that
under Hadoop, it fails:
    2012-11-06 12:07:30,106 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
    2012-11-06 12:07:30,110 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
    2012-11-06 12:07:30,111 WARN org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
I'm unsure right now if that's [this error][3] or something else I've
done wrong. Oh well, it'll be fun to turn on debugging and see what's
going on under the hood...
...unless, of course, I'm wasting my time. A quick search
turned up a number of Hadoop-based bioinformatics tools
([Biodoop][4], [Seqpiq][5] and [Hadoop-Bam][6]), and I'm sure there
are a crapton more.
Other chores:
* Duplicating pythonbrew/modules work on another server since our
cluster is busy
* Migrating our mail server to a VM
* Setting up print accounting with Pykota (latest challenge:
  dealing with usernames that aren't in our LDAP tree)
* Accumulated paperwork
* Renewing lapsed support on a Very Important Server
Oh well, at least I'm registered for [LISA][7]. Woohoo!
A while back I upgraded the MySQL for Bacula at $WORK. I tested it
afterward to make sure it worked well, but evidently not thoroughly
enough: I discovered today that the query needed to restore a file was
taking forever. I lost patience at 30 minutes and killed the query;
fortunately, I was able to find another way to get the file from
Bacula. This is a big slowdown from previous experience. Time for
some benchmarking and exploratory tweaking...
Incidentally, the faster way to get the file was to select "Enter a
list of files to restore", rather than let it build the directory tree
for the whole thing. The fileset in question is not that big, so I
think I must be doing something pretty damned wrong to get MySQL so
slow.
Last week my 4.5-year-old came down with his usual asthma-inflamed
cold; this week it's my wife's standard sinus infection and
tonsillitis for my 6-year-old. It's been busy: two weekends in a row
spent at the doctor's is more than we usually aim for.
It's not all bad, though. In a way it's been nice to sit around and
just be with them at home, rather than think constantly about how to
entertain them out of the house. We've read books (of course), played
games, done crafts and pointed TV at our heads. I've even started
installing games on my laptop for them to play. (Complicated
succession rules mean I got my wife's old laptop recently.) 4.5 is
particularly enjoying TuxRacer (which I've just found out was
renamed to Extreme TuxRacer).
Next weekend it's Martinmas, plus a long weekend. I hope to have
a 25kg bag of pale malt by then, and it'd be nice to get some beer
in...something a bit more relaxing than illness.
At $WORK I typically have three things open on my desktop:
* Firefox
* Emacs
* 4 x 10^8 xterms
I use Awesome as a window manager, and despite RSI (I keep tucking my
thumb under my left hand to reach the Windows key) it has worked out
okay for me. There's not a lot I need the mouse for, especially if
I'm disciplined enough to keep the mouse far away from the keyboard.
However, there's one thing I need: cutting and pasting. When
troubleshooting I take notes in Org/Emacs, often on a remote
SSH/tmux session, and I need to cut-and-paste sixty different URLs
from Firefox into my notes. (Still not quite hardcore enough to run
w3m-mode regularly, which is a bit like saying "I only get drunk after
10am.") And for that, I have to reach for the mouse, which breaks
flow, which makes baby Linus cry.
This xdotool script (Hey Jordan, I thought your site was supposed
to be up by now!) comes very, very close to making that go away.
    #!/bin/sh
    # Activate firefox and copy the URL bar contents; also available
    # with "xclip -o".  OMG.

    # Get the current window and mouse location so we can come back
    pwid=`xdotool getactivewindow`
    eval $(xdotool getactivewindow getwindowgeometry --shell | grep -e X -e Y)

    # Find and activate the FF window
    wid=`xdotool search --name "Mozilla Firefox"`
    xdotool windowactivate $wid
    sleep 0.2

    # Go to location bar, select all, and copy
    xdotool key "ctrl+l"
    xdotool key "ctrl+a"
    xdotool key "ctrl+c"
    sleep 0.2

    # Go back, move the mouse a bit, and middle-click to paste
    xdotool windowactivate $pwid
    xdotool mousemove $((X + 50)) $((Y + 50))
    sleep 0.2
    xdotool click 2
It is a trifle flaky: it works in some environments (xterm+emacs
shell) but not others (xterm w/SSH/tmux/emacs). Of course, that's
just after a few minutes playing around, and I may need to look for
help from wmctrl. But oh, that's nice. Thanks, Jordan.
Like most Unix geeks, I've got a lovingly-crafted collection of
dotfiles that I've come to depend on. But as a sysadmin, I play
around on a larger collection of machines than most geeks probably get
to (a fact for which I'm profoundly grateful) and, as a result, a
larger set of home directories. I've come to love dfm, a
git-and-perl bit of duct tape that makes managing and centralizing
these dotfiles much, much easier...but it still depends on manually
merging files back into the master repo, rebasing, and so on --
which, let's face it, you don't always have time for.
It occurs to me that this is a lot like speciation -- in
particular, allopatric or parapatric speciation, driven by
(respectively) complete or partial geographical isolation. At some
point (and I've hit this a few times now), everything diverges so much
that there's just no way to get them back together without
supernatural levels of effort and swearing. They drift apart, and
some eventually die off. I wonder if you could use the ability to
create a patch as a proxy for genetic drift. There must be a study
somewhere on this...
Just had a dream where I'd been called into Sun, just before Oracle's
takeover, to figure out why they were spending so much money on
eyeglasses for employees. "We think it's part of their benefits, but
our accounting department doesn't have a separate line item for it,"
someone explained. My eyebrows lifted in disbelief. "Well, then,
it's damned lucky for you I've got Cfengine."
So at work there's this program called CHARMM that's my
responsibility to build. It's not the worst thing I've ever had
to compile, but it's close:
* the build script is an interactive tcsh script; you have to tell it
  where to find libraries, and can't pass these in as arguments;
* the whole thing is insanely sensitive to what compiler and library
  you use, but often that'll only turn up when you actually go to
  use the damn thing;
* CHARMM will often just crash if it doesn't like you or your
  input, and the reaction on the forums is often, "Oh yeah, you should
  just patch that";
* there are about 40 different build options, of which an amusing
  fraction are documented (and an even more amusing fraction
  documented correctly), and of course the combinations just
  explode; the wrong combinations will make CHARMM crash, and the
  reaction on the forums is often, "Oh yeah, they've been meaning to
  fix the documentation on that."
To get around this I built a script that collects arguments, applies a
patch, unpacks the tarball, echoes the right args to the build script
(which means, of course, that I am damned for all eternity) and then
builds the damned thing. Hurrah! Everything's fine!
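In outline, the wrapper is something like this (a sketch; the script
name, patch file and library path are stand-ins for illustration, not
CHARMM's real ones):

```sh
#!/bin/sh
# Collect arguments, apply the patch, unpack the tarball, and echo
# the right answers into the interactive build script.
set -e
version=$1      # which CHARMM version to build
compiler=$2     # which compiler to answer the build script with

tar xzf charmm-$version.tar.gz
cd charmm-$version
patch -p1 < ../charmm-$version-fixes.patch

# The build script asks questions instead of taking arguments, so
# echo the answers at it (for which I am damned for all eternity):
printf '%s\n' "/opt/libs/$compiler" "$compiler" yes | csh ./build.csh
```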
Except now I have to rebuild. And invoking the script is becoming
complicated:
* I want to name the destination directory after some combination of
  compiler, options, and patches;
* I'm writing qsub scripts that set all this up and then invoke
  my script, which invokes the build script;
* and there are, at last count, two versions of CHARMM, two patches
  that are required for one of those versions, and two sets of
  options. That's 8 scripts.
So I'm looking at all this and thinking, "Self, you're telling me I
need a templating system for my templating system. Have I got that
right?" And I think, "Self, you've got it." So I punch myself
in the face. And that's why it hurts to build CHARMM.
Yesterday I spent the morning in an unaccustomed role: that of the
older, wiser homebrewer. My friends Ross and Wes were doing their
first batch of homebrew, along with Kevin and Will, and wanted me
along to help out. I was happy to oblige. My kids came along to play
with their kids, and my wife to make homemade candy corn with the
other wives (seriously, it's like a farmhouse social), and Wes brought
beer for us all to drink, picked up from Central City Brewing the day
before.
It was a damn good time and a shockingly warm and half-sunny day. The
guys went all-grain for their first batch, and were using my
homegrown hops for bittering and flavour. Ross rocked the
clipboard while Wes and Kevin did the lifting.
A little bit of spilled wort on my shoes later ("Who the hell wears
good shoes to a homebrew session?"), everything was in the fermenter
and we were cleaned up. Gravity was a little lower than expected --
1.038 -- but hey, we can clean up efficiency later, right? And word
on the street is that the airlock was bubbling away later that night.
Awesome time. I might have to take this up professionally.
Last Friday was not a good day. First, I tried installing a couple of
hard drives I'd bought for a Dell MD3200 disk array, and it rejected
them; turns out that it will not work with drives that are not
pre-approved. It's right there in the documentation. I was aware of
last year's kerfuffle with Dell announcing that their servers would
no longer work w/unapproved drives, and then their backdown on
that...but the disk arrays are different. So now I have a couple
extra 3 TB SATA drives to find a place for, and a couple drives to
buy from Dell.
While I'm in the server room staring at the blinking error light on
the disk array and wondering if I'd just brought down the server it
was attached to, I notice another blinking error light. This one was
on the server that hosts Xen VMs that run LDAP, monitoring and a web
server. It had a failing drive. Good thing it's in RAID 6, right?
Sure, but it failed nearly a month ago -- I had not set up email
alerts on the server's ILOM, so I never got notified about this.
Fuck.
I send off an email to order a drive, then figure out how to get
alerted about this. Email alerts are configured, but belt and
suspenders: I get the CLI tool for the RAID card, find a Nagios
plugin that runs it, and add the check to Nagios, running on the
server's dom0. Hurrah, it alerts me! I ack the alert, and now it's
time to head home.
On my way home I start getting pages about the VMs on this machine
-- nothing down, but lots of timeouts. The machine recovers, then
stumbles and stays down. (These alerts were coming from a second
instance of Nagios I have set up, which is mostly there to monitor the
main instance that runs on this server.) My commute is 90 minutes,
and I have no net access along the way. When I finally get home, I
SSH to work and find that the machine is hung; as far as I can tell,
the CLI tool was just not exiting, and after enough instances had
accumulated, the RAID card just stopped responding entirely. I reboot
the machine, and
ten minutes later we're back up.
Ten minutes after that, I realize I'm still in trouble: I'm getting
pages about a few other machines that are not responding. Remember
how one of the VMs on the original server ran LDAP? It's one of three
LDAP servers I have, because I fucking hate it when LDAP goes down.
The clients are configured to fail over if their preferred server (the
VM) isn't responding. I check on one of the machines, and nscd had
about a thousand open sockets...which makes me think that the sequence
was something like this:
* During the hang, the VM was responding a little bit -- probably
  just enough to complete a 3-way handshake.
* nscd would keep that connection open, because it had decided that
  the server was there and would be answering. But it wouldn't.
* Another query would come along, and nscd would open another
  connection. Rinse and repeat.
* Eventually, nscd bumped up against the open FD limit (1024), and was
  unable to open up new connections to any LDAP server.
I'm thinking about putting in a check for the number of open FDs nscd
has, but I'm starting to second-guess myself; it feels a bit circular
somehow. Not the right word, but I'm tired and can't think of a
better one.
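If I do put it in, it'd be something like this (a sketch; the
thresholds are guesses, and the only real contract is the Nagios
exit-code convention):

```sh
#!/bin/sh
# Sketch: warn if nscd is sitting on too many open file descriptors.
warn=512
crit=900
pid=$(pidof nscd) || { echo "UNKNOWN: nscd not running"; exit 3; }
fds=$(ls "/proc/$pid/fd" | wc -l)
if [ "$fds" -ge "$crit" ]; then
    echo "CRITICAL: nscd has $fds open FDs"; exit 2
elif [ "$fds" -ge "$warn" ]; then
    echo "WARNING: nscd has $fds open FDs"; exit 1
fi
echo "OK: nscd has $fds open FDs"; exit 0
```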
Yesterday I did a long-anticipated firmware upgrade on a disk array at
$work. It's attached to the head node of a small cluster we have, and
holds the usual assortment of home directories and data. The process was
kind of involved:
* shut down the cluster to prevent disk I/O ("not mandatory, but
  strongly recommended" -- thx, I'll just go with "mandatory");
* remove the current management software from the head node, reboot
  and then reinstall;
* X was needed for the installation ("not mandatory, but--" Okay,
  right, got it, thx): twice via SSH, once by running startx locally;
* I couldn't upgrade directly to the new firmware itself, but had to
  install bridge firmware, wait 30 minutes for things to settle out
  (!), then install the new firmware;
* oh, and "due to limitations of the Linux environment", I couldn't
  install the firmware from the head node that had just had the
  management software upgraded -- instead, I had to install that
  software on another machine and install it from there.
Which is why this all took about four hours to do. But that's not
all:
* Before all that, I read the many, many manuals; did a dress
  rehearsal to shake out problems; and made sure I had a checklist
  (thank you, Tom Limoncelli and Orgmode) with the exact
  commands to run.
* During the upgrade, I took notes on things I'd forgotten and
  problems I'd encountered.
* After the upgrade, I did a postmortem: updated my documentation and
  filed bugs, notified the users that things were back up, and watched
  for problems.
Which is why a 4 hour upgrade took me 9.5 hours. I think there might
be a handy rule of thumb for big work like this, though I can't decide
if it's "it always takes twice as long" or "it always takes five hours
longer than you think." Heh.
One other top tip: stop NFS exports while you're working on a server
(but see the next paragraph!). One user started a session on another
machine, which automounted her home directory from the head node.
This was close to the end of my work, and while I could have used
another reboot, I elected not to because I didn't want to mess up her
session. Yes, the reboot was important, but I'd neglected to think
about this situation, and I didn't think she should have to pay for my
mistake.
And if you're going to turn off NFS exports, make damn sure you have
your monitoring system checking exports in the first place; that way,
you won't forget to turn it back on afterward. (/me scurries to add
that test to Nagios right now...)
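The check itself can be tiny -- something like this sketch (the
plugin name and the hard-coded export are stand-ins; the real list
should come from configuration):

```sh
#!/bin/sh
# check_nfs_exports (hypothetical): alert if the head node stops
# exporting /home.  Nagios exit codes: 0=OK, 2=CRITICAL.
if showmount -e headnode 2>/dev/null | grep -q '^/home '; then
    echo "OK: /home is exported"; exit 0
else
    echo "CRITICAL: /home is not exported"; exit 2
fi
```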
Tomorrow I'm upgrading firmware on a disk array that's attached to a
small cluster I manage; yesterday, in preparation for that, I ran a full
backup of the disks in question. I noticed that the home directories
were taking longer than I thought, so I checked out how full they
were. The answer was 97%. Oh, fuck.
The prof whose cluster this is asked for quotas to be set up for
everyone; he didn't have a lot of disk space to attach, and wanted to
impose some discipline on his lab. And I'd done so...only somehow,
the quotas were off now, probably because I'd left them off the last
time I'd had to fiddle with quotas. Because of that, one user was
taking up nearly half the disk, and another was taking up almost a
third. To make things worse, I had not set up my usual Nagios
monitoring for this machine (disk space, say) because Ganglia was set
up on it, and I'd vaguely thought that two such systems would be
silly...so I was not getting my usual "OMG WTF BBQ" messages from
Nagios.
It gets worse. I'd put in cron scripts that maintained the quota
files, nagged users by email and CC'd me...but the permissions were
544, which meant they never ran. No email? Well, then, everything
must be fine, right? Sigh.
So:
* I talked to the user w/half the disk space, and it turned out that
  almost all of it was in a directory called "old" which she could
  delete w/o problems. That got us space.
* I whipped up a simple Nagios plugin to check that quotas were on
  (see the sketch after this list), and made sure I got a complaint;
  I turned on quotas on another partition, and made sure Nagios told
  me it was fine.
* I fixed the permissions on the cron scripts, and made sure they ran
  (I left the debug setting on, and holy crap is it verbose...I'll
  need to fix that).
* I'm considering adding a Nagios plugin that checks for cron files
  (/etc/cron.*) that are not executable (although if I'm lucky, maybe
  there's something in the cron runner that'll complain about this).
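That quotas-are-on plugin amounts to this (a sketch; the partition is
a stand-in, and quotaon -p is doing the real work):

```sh
#!/bin/sh
# Sketch: Nagios check that user quotas are actually on for /home.
if quotaon -p /home 2>/dev/null | grep -q 'user quota.*is on'; then
    echo "OK: user quotas are on for /home"; exit 0
else
    echo "CRITICAL: user quotas are off for /home"; exit 2
fi
```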
And as a reminder to myself: if repquota gives horribly wrong
information, run "quotaon -p" to verify that quotas are, in fact,
on.