Holy God, It's Done At Last

After a lot of faffing about, I've accomplished the following on the backup server at $WORK:

  • Broken out /var/lib/mysql to a separate, mirrored 275 GB partition (Linux software RAID-1); it's using about 36 GB of that at the moment, which is 15% -- the lowest it's been in a long, LONG-ass time.

  • Migrated the Bacula catalog db to Innodb.

  • Shrunk the raid-0 spool partition to about 1.6 TB, down from 2 TB; did this to free up the two disks for the mirrored partition

  • Ensured that MySQL will use /dev/shm as a temporary area

  • Sped up the restoration of files (mostly thanks to earlier "analyze" commands on the File table while it was still MyISAM).

  • innodb_file_per_table is on; innodb_buffer_pool_size=10G; default_storage_engine=InnoDB.
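
For my own reference, a quick way to confirm those settings (and the /dev/shm tmpdir) actually took after a restart -- a sketch, assuming the mysql client can log in from the shell:

```
# Verify the InnoDB settings and the tmpdir in one shot.
mysql -e "SHOW VARIABLES WHERE Variable_name IN
  ('innodb_file_per_table', 'innodb_buffer_pool_size',
   'default_storage_engine', 'tmpdir');"
```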


I encountered the following problems:

  • The stupid raid card in the backup server only supports two RAID drives -- thus, the mirrored drive for /var/lib/mysql is Linux software raid. I'd have preferred to keep things consistent, but it was not to be.

  • The many "analyze" and "repair" steps took HOURS...only to turn out to be deadlocked because MySQL was running out of tmp space.

  • I had to copy the mysql files to the raid-0 drive to have enough space to do the conversion.

  • Knock-on effects included lack of sleep and backups not being run last night.

  • Basically, this took a lot of tries to get right, and about all of my time for the last 24 hours.


I learned:

  • The repair of the File table while MyISAM, with tmp files in /dev/shm, took about 15 minutes. That compares with leaving it overnight and still not having it done.

  • You have to watch the mysql log file for errors about disk space, and/or watch df -h to see /tmp or whatever fill up; a one-liner for this is sketched after this list.

  • You can interrupt a repair and go back to it afterward if you have to. At least, I was able to...I wouldn't do it on a regular basis, but it gives me cautious optimism that it's not an automatic ticket to restoring from backup.

  • Importing the File.sql file (nominally 18 GB but du shows 5 GB...sparse?), which converted it to InnoDB, took 2.5 hours.
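
The promised one-liner -- a sketch; /var/log/mysqld.log is an assumption, so point it at wherever your my.cnf sends the error log:

```
# Poll tmp space and the tail of the MySQL error log every 30 seconds
# while a long repair/import runs.
watch -n 30 'df -h /tmp /dev/shm /var/lib/mysql; tail -n 5 /var/log/mysqld.log'
```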


I still have to do these things:

   * Update documentation.
   * Update Bacula checks to include /var/lib/mysql.
   * Perhaps up innodb_buffer_pool_size to 20 GB.
   * Set up a slave server again.
   * A better way of doing this might've been to set up LVM on md0, then use snapshots for database backup (rough sketch after this list).
   * Test with a reboot!  Right now I need to get some sleep.
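
That LVM idea, roughly -- a sketch only, with hypothetical VG/LV names. The nice property is that an LVM snapshot of an InnoDB data directory is crash-consistent: restoring it is equivalent to recovering from a power cut.

```
# Snapshot the (hypothetical) LV holding /var/lib/mysql, mount it
# read-only, back it up, then throw the snapshot away.
lvcreate --snapshot --size 10G --name mysql-snap /dev/vg0/mysql
mkdir -p /mnt/mysql-snap
mount -o ro /dev/vg0/mysql-snap /mnt/mysql-snap
# ...point Bacula at /mnt/mysql-snap, then clean up:
umount /mnt/mysql-snap
lvremove -f /dev/vg0/mysql-snap
```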

Tags: bacula mysql

Smarty compile path problems

A couple days ago a website stopped working, refusing to show anything below its top menu bar. I knew it was written in PHP, but I hadn't realized -- not until the outage -- that it used Smarty, a template engine. One of Smarty's features is that it "compiles" (generates, really) template files into PHP code on the fly as they're accessed and/or updated. The idea is that if you need to change the template files, you just upload them; the next time that page is accessed, Smarty notices that it needs to recompile the PHP file, does so, and writes out the new file to a directory called "templates_c".

This means that Smarty needs write access to that directory (which the Smarty docs recommend keeping out of the DocumentRoot). But I hadn't granted that, and in general this sort of thing makes me nervous. (Though I do allow it for other websites, so I'm not being very consistent here.)

The development machine has the files rooted at ~user/public_html; to deploy, we rsync to the server's /var/www/$PROJECT directory. The templates_c files have already been compiled in the process of testing, and those compiled files include() other Smarty libraries -- specifying the full path to those libraries. Can you see where this is going?

As far as I can tell, what happened was:

  • The PHP was tested on the dev server; Smarty compiled the templates, and added include() directives for the library files using the full path to the user's home directory.

  • The PHP was copied into place on the web server and tested.

  • Smarty looked at the templates, found the compiled versions, and ran them. In the process, it tried to load other Smarty libraries.

  • It worked, because autofs mounted this user's home directory when requested.

  • Time passed.

  • On Wednesday, someone tried using the site. Autofs, for some reason, couldn't mount the developer's home directory, so Smarty couldn't include those files, and we saw nothing after the page's menu bar.

I got around this by using sed on the compiled template files to set the correct path to the library files, which at least got the site up again.
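
The fix amounted to something like this -- the paths here are illustrative stand-ins for the developer's home directory and the production docroot:

```
# Rewrite the dev-box paths baked into the compiled templates so
# include() stops reaching for the developer's automounted homedir.
# PROJECT is the site's directory name under /var/www.
sed -i 's#/home/developer/public_html#/var/www/'"$PROJECT"'#g' \
    /var/www/$PROJECT/templates_c/*.php
```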

I'm still not sure what the hell went wrong with autofs. On this particular server I used to have a static map for home directories, and it doesn't work with the developer's home. I thought I had replaced that with an LDAP-based map a long time ago...but there it was, and I can't find anything in the Cfengine git repo that shows a change, or that I'd deployed the LDAP map in the first place. I pushed that out, and now this all works the way it should...

...except that it shouldn't work like this. I'm reluctant to let PHP write to the filesystem, but that's what Smarty expects. I think it supports relative paths -- gotta dig up my notes -- which'd be fine. (I saw one suggestion on the Smarty support forums to include "." in your PHP include_path using .htaccess, which made me run around the room hollering and cursing.) As a workaround, I'm going to move the development files to /var/www/$PROJECT on the dev server, which will match production; I'm unhappy with this because it takes away some of the real advantages Smarty brings, and makes the deployment process a bit harder...but I'm still a nervous nelly.

Tags: php

Long day

First an Ubuntu upgrade took much, much longer than anticipated when a) upgrading MySQL failed, for some reason, making for an amusing series of attempts at actually completing do-release-upgrade, and b) .Xauthority in my home directory was owned by root, which is a very, very fun way of borking your X session.
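
For future me, the .Xauthority half of that is a one-line fix; $USER expands in your own shell before sudo runs, so it names you, not root:

```
# A root-owned ~/.Xauthority silently borks X session startup;
# hand it back to its rightful owner.
sudo chown "$USER:" "$HOME/.Xauthority"
```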

Then a panicked call about a website that no longer worked. Turned out to be PHP trying to include files from a home directory that was not available on the server. Why did this ever work in the first place? I'm still not sure. Good thing I've got beer at home.

Tags:

Observing Report -- Saturday, November 10

We've had a three-day stretch of clear skies; that's not the first since the last time I went out, but damn near, and definitely the first that wasn't in the middle of the week (middle-aged sysadmin needs his goram sleep) or covered by sickness.

We spent Martinmas at my in-laws', drinking new wine and eating chestnuts, and by the time we got back it was late and Jupiter was already up. I set up the scope on the steps near our townhouse and showed the kids. Jr/Fresco and I'd been talking about what eyepiece to use: the 40mm or the 12mm? He grabbed the 40mm since it was bigger, and was really surprised to see how much smaller Jupiter looked (30x) compared to the 12mm (100x). Both kids saw the NEB and SEB, and noticed Europa, Callisto and Io.

It was clear skies then...but in the hour it took me to read them stories, put them to bed and get out the door to the local park, clouds moved in and all but obscured everything...except Jupiter, that is. (Cue macho joke about KING OF THE GODS, that's who.) So I made lemonade and spent my time looking at Jupiter.

It was wonderful. The seeing was quite steady, and that made up for things not being quite as bright as they might have been. I was able to get up to 320x, which is a feat for me -- not to mention being able to simply keep it in view when it sails across the field of view like that. The North Polar Region, the NEB and the SEB were easily visible, and I could just make out the Great Red Spot rotating out of view. From time to time I could distinguish the north and south components of the SEB, the north and south Temperate Bands, and what looked like a thin dark band right across the equator (which I just barely see hints of in these photos; not sure if I was imagining that or not).

Another thing I saw was the reappearance of Ganymede from occultation (that is, from behind Jupiter's disk). I knew when to expect it; when the time came, I saw it and thought "Oh yeah, neat...not as cool as a transit, though." I ignored it for a few more minutes, then realized something: I was seeing a disk, not just a point o' light...and that was only at 200x. I had my copy of the RASC Observer's Handbook (okay, maybe it is handy to have around), so I looked up Ganymede and saw that it was half again as big as the Moon. Wow. I had a closer look at the other moons, and while I couldn't really see any disks, I did seem to see a sort of brownish colour to Callisto (which may actually be accurate).

I came in after only an hour; the clouds were erratic, and I wanted to get inside. Not the widest-ranging observing session, but lots of fun.

Tags: astronomy geekdad

Distracted by Emacs

I write these blog entries in Markdown mode, but markdown-mode in Emacs doesn't stick links at the end of the text the way God intended (and the way footnote-mode does). This is close, but not yet working:

(defun x-hugh-markdown-footnote (description)
  "A Small but Useful(tm) function to add a footnote in Markdown mode.

  FIXME: Not yet working, but close."
  (interactive "sDescription: ")
  (let ((current-line (line-number-at-pos))
        (last-line (line-number-at-pos (point-max)))
        (link (read-string "Link: "))
        (link-number (x-hugh-get-next-link)))
    (save-excursion
      (if (> (- last-line current-line) 1)
          ()
        (insert "\n"))
      (goto-char (point-max))
      (insert (format "\n[%d]: %s" link-number link)))
    (insert (format "[%s][%d]" description link-number))))

(defun x-hugh-get-next-link ()
  "Figure out the number for the next link."
  (interactive)
  (save-excursion
    (goto-char (point-max))
    (beginning-of-line)
    (if (looking-at "\\[\\([0-9]+\\)]:")
        (eval (+ 1 (string-to-number (match-string 1))))
      (eval 0))))

Right now it craps out with a "Wrong type argument: integer-or-marker-p, nil" when it runs x-hugh-get-next-link. Doubtless I'm doing a bad thing in my half-baked attempt to return a number. But still, close!

(UPDATE: I figured it out! To return a number, wrap it with eval. Fixed above. Working!)

(Believe it or not, I started out to write about Github and bioinformatics. Such is the life of the easily distracted.)

ObMusic: "The Balcony" by The Rumour Said Fire. Reminds me of 60s folk. I like it.

Tags: emacs yakshaving music

It's like a new literary form

Another busy day at $WORK, busy enough that I missed the Judea Pearl lecture the CS dep't was hosting. On the way out the door I grabbed the copy of "Communications of the ACM" with his face on the cover, thinking I'd catch up. The two pieces on him were quite small, though, so it was on to other fare.

"The Myth of the Elevator Pitch" caught my eye, but as I read it I became more and more convinced that I was reading literary criticism combined with mystic bullshit. Example:

> At first glance, it appears that the elevator pitch is a component of the envisioning process. The purpose of that practice is to tell a compelling story (a "vision") of how the world would be if the innovation idea were incorporated into it. [...] But the notion that a pitch is a precis of an envisioning story is not quite right. A closer examination reveals that a pitch is actually a combination of a precis of the envisioning story and an offer. The offer to make the vision happen is the heart of the third practice.

Ah, the heart of the third practice. (I feel like that should be capitalized: "the Heart of the Third Practice." Better!)

> The standard idea of a pitch is that it is a communication -- a presentation transmitting information from a speaker to a listener. In contrast, the core idea in the Eight Practices of innovation is that each practice is a conversation that generates a commitment to produce an outcome essential to an innovation. [....] The problem with the communication idea is that communications do not elicit commitments. Conversations elicit commitments. Commitments produce action.

Mistaking it for communication -- yes, an easy mistake to make.

> We can now define a pitch as a short conversation that seeks a commitment to listen to an offer conversation. It is transitional between the envisioning and offering conversations.

And back to the literary criticism. I feel like I've just watched a new literary form being born. I'm only surprised that it seems the CompSci folks scooped the LitCrit dep't.

(Of course, while I'm busy spellchecking all this, I am most definitely not being published in "Communications of the ACM". So I suck.)

Tags: rant

Pickup

Tonight I picked up a grain order I placed with my local homebrew club. Pickup was at Parallel 49, where the president of the club is the brewer. My oldest son came along, and said brewer was kind enough to give him (and his dad!) a quick tour of the place. My son was impressed, and so was I; I'd never seen a 10k litre fermentation tank before. It occurred to me later how overwhelming that would be for me: to be faced with this enormous volume waiting to be filled. I'm in awe of someone who can look at that and say, "Yeah, I know exactly what I want to put in there."

I also came away with a free bottle of their Salty Scot Scottish Ale; haven't tried that yet, but I did like the growler of the India Pale Lager (which I was happy to pay for). It's a nice twist on the usual IPA fare. I do regret having to leave behind the milk stout, though...next time.

Tags: homebrewing geekdad

Hadoop and samtools

I've been asked to revisit Hadoop at $WORK. About a year ago I got a small cluster (3 nodes) working and was plugging away at Myrna...but then our need for Myrna disappeared, and Hadoop was left fallow. The need this time around seems more permanent.

So far I'm trying to get a simple streaming job working. The initial script is pretty simple:

samtools view input.bam | cut -f 3 | uniq -c | sed 's/^[ \t]*//' | sort -k1,1nr > output.txt

This breaks down to:

  • mapper.sh: samtools view
  • reducer.sh: "cut -f 3 | uniq -c | sed ... | sort ..." (both scripts are sketched below)

which, invoked Hadoop-style, should be:

```
hstream -input input.bam \
  -file mapper.sh -mapper "mapper.sh" \
  -file reducer.sh -reducer "reducer.sh" \
  -output output.txt
```
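
For the record, the two wrapper scripts are tiny. These are sketches of what mine look like; in particular, having samtools read the BAM stream from stdin via "-" is my wiring, not anything Hadoop mandates:

```
#!/bin/sh
# mapper.sh -- turn the BAM stream on stdin into SAM text on stdout.
samtools view -
```

```
#!/bin/sh
# reducer.sh -- count records per reference sequence (field 3), strip
# the leading whitespace uniq -c adds, and sort by count descending.
cut -f 3 | uniq -c | sed 's/^[ \t]*//' | sort -k1,1nr
```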


Running the mapper.sh/reducer.sh files works fine; the problem is that
under Hadoop, it fails:

```
2012-11-06 12:07:30,106 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2012-11-06 12:07:30,110 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2012-11-06 12:07:30,111 WARN org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
```


I'm unsure right now if that's [this error][3] or something else I've
done wrong.  Oh well, it'll be fun to turn on debugging and see what's
going on under the hood...

...unless, of course, unless I'm wasting my time.  A quick search
turned up a number of Hadoop-based bioinformatics tools
([Biodoop][4], [SeqPig][5] and [Hadoop-BAM][6]), and I'm sure there
are a crapton more.

Other chores:

* Duplicating pythonbrew/modules work on another server since our
  cluster is busy
* Migrating our mail server to a VM
* Setting up printing accounting with Pykota (latest challenge:
  dealing with usernames that aren't in our LDAP tree)
* Accumulated paperwork
* Renewing lapsed support on a Very Important Server

Oh well, at least I'm registered for [LISA][7].  Woohoo!

Tags: hadoop bioinformatics lisa

Slow MySQL makes Bacula cry

A while back I upgraded the MySQL for Bacula at $WORK. I tested it afterward to make sure it worked well, but evidently not thoroughly enough: I discovered today that the query needed to restore a file was taking forever. I lost patience at 30 minutes and killed the query; fortunately, I was able to find another way to get the file from Bacula. This is a big slowdown from previous experience. Time for some benchmarking and exploratory tweaking...

Incidentally, the faster way to get the file was to select "Enter a list of files to restore", rather than let it build the directory tree for the whole thing. The fileset in question is not that big, so I think I must be doing something pretty damned wrong to get MySQL so slow.
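
When I get to the benchmarking, the obvious first stops -- a sketch, nothing Bacula-specific:

```
# See what the server is doing while the slow restore query runs,
# and how often the buffer pool has to go to disk.
mysql -e "SHOW FULL PROCESSLIST;"
mysqladmin extended-status | grep -i innodb_buffer_pool_read
```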

Tags: mysql bacula

Illness

Last week my 4.5 year old came down with his usual asthma-inflamed cold; this week it's my wife's standard sinus infection, and tonsillitis for my 6 year old. It's been busy: two weekends in a row spent at the doctor's is more than we usually aim for.

It's not all bad, though. In a way it's been nice to sit around and just be with them at home, rather than think constantly about how to entertain them out of the house. We've read books (of course), played games, done crafts and pointed TV at our heads. I've even started installing games on my laptop for them to play. (Complicated succession rules mean I got my wife's old laptop recently.) 4.5 is particularly enjoying TuxRacer (which I've just found out was renamed to Extreme TuxRacer).

Next weekend it's Martinmas, plus a long weekend. I hope to have a 25kg bag of pale malt by then, and it'd be nice to get some beer in...something a bit more relaxing than illness.

Tags: geekdad

Closer to fine

At $WORK I typically have three things open on my desktop:

  1. Firefox
  2. Emacs
  3. 4 x 10^8 xterms

I use Awesome as a window manager, and despite RSI (I keep tucking my thumb under my left hand to reach the Windows key) it has worked out okay for me. There's not a lot I need the mouse for, especially if I'm disciplined enough to keep the mouse far away from the keyboard.

However, there's one thing I need: cutting and pasting. When troubleshooting I take notes in Org/Emacs, often on a remote SSH/tmux session, and I need to cut-and-paste sixty different URLs from Firefox into my notes. (Still not quite hardcore enough to run w3m-mode regularly, which is a bit like saying "I only get drunk after 10am.") And for that, I have to reach for the mouse, which breaks flow, which makes baby Linus cry.

This xdotool script (Hey Jordan, I thought your site was supposed to be up by now!) comes very, very close to making that go away.

#!/bin/sh
# Activate firefox and copy the URL bar contents; also available with "xclip -o".  OMG.

# Get the current window and mouse location so we can come back
pwid=$(xdotool getactivewindow)
eval $(xdotool getactivewindow getwindowgeometry --shell|grep -e X -e Y)
# Find and activate the FF window (search can return several IDs; take the first)
wid=$(xdotool search --name "Mozilla Firefox" | head -n 1)
xdotool windowactivate $wid
sleep 0.2
# Go to location bar, select all, and copy
xdotool key "ctrl+l"
xdotool key "ctrl+a"
xdotool key "ctrl+c"
sleep 0.2

# Go back, move the mouse a bit, and middle-click to paste
xdotool windowactivate $pwid
xdotool mousemove $(( X + 50 )) $(( Y + 50 ))
sleep 0.2
xdotool click 2

It is a trifle flaky: it works in some environments (xterm+emacs shell) but not others (xterm w/SSH/tmux/emacs). Of course, that's just after a few minutes playing around, and I may need to look for help from wmctrl. But oh, that's nice. Thanks, Jordan.

Tags: lookmanomouse

Incipient Speciation

Like most Unix geeks, I've got a lovingly-crafted collection of dotfiles that I've come to depend on. But as a sysadmin, I play around on a larger collection of machines than most geeks probably get to (a fact for which I'm profoundly grateful) and, as a result, a larger set of home directories. I've come to love dfm, a git-and-perl bit of duct tape that makes managing and centralizing these dotfiles much, much easier...but it still depends on manually merging files back into the master repo, rebasing, etc., etc. -- which, let's face it, you don't always have time for.

It occurs to me that this is a lot like speciation -- in particular, allopatric or parapatric speciation, driven by (respectively) complete or partial geographical isolation. At some point (and I've hit this a few times now), everything diverges so much that there's just no way to get them back together without supernatural levels of effort and swearing. They drift apart, and some eventually die off. I wonder if you could use the ability to create a patch as a proxy for genetic drift. There must be a study somewhere on this...

Tags:

Forensic accounting with Cfengine 3

Just had a dream where I'd been called into Sun, just before Oracle's takeover, to figure out why they were spending so much money on eyeglasses for employees. "We think it's part of their benefits, but our accounting department doesn't have a separate line item for it," someone explained. My eyebrows lifted in disbelief. "Well, then, it's damned lucky for you I've got Cfengine."

Tags: cfengine wtf

No, YOU'RE too nested

So at work there's this program called CHARMM that's my responsibility to build. It's not the worst thing I've ever had to compile, but it's close:

  • the build script is an interactive tcsh script; you have to tell it where to find libraries, and can't pass these on as arguments;

  • the whole thing is insanely sensitive to what compiler and library you use, but often that'll only turn up when you actually go to use the damn thing;

  • CHARMM will often just crash if it doesn't like you or your input, and the reaction on the forums is often, "Oh yeah, you should just patch that";

  • there are about 40 different build options, of which an amusing fraction are documented (and an even more amusing fraction documented correctly), and of course the combinations just explode; the wrong combinations will make CHARMM crash, and the reaction on the forums is often, "Oh yeah, they've been meaning to fix the documentation on that."

To get around this I built a script that collects arguments, applies a patch, unpacks the tarball, echoes the right args to the build script (which means, of course, that I am damned for all eternity) and then builds the damned thing. Hurrah! Everything's fine!
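
The wrapper boils down to something like this -- a sketch with hypothetical tarball, patch, and answers, since the real ones depend on version and options (install.com is what the build script is called in the CHARMM versions I've seen):

```
#!/bin/sh
# Unpack, patch, then pipe canned answers into the interactive build
# script.  Every path and value here is a stand-in.
set -e
tar xzf charmm.tar.gz
cd charmm
patch -p1 < ../fix-the-crash.patch
printf '%s\n' /opt/fftw/lib /opt/mkl/lib | ./install.com gnu xxlarge
```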

Except now I have to rebuild. And invoking the script is becoming complicated:

  • I'm using modules to use different compilers;

  • I want to name the destination directory after some combination of compiler, options, and patches;

  • I'm writing qsub scripts that set all this up and then invoke my script, which invokes the build script;

  • And there are, at last count, two versions of CHARMM, two patches that are required for one of those versions, and two sets of options. That's 8 scripts.

So I'm looking at all this and thinking, "Self, you're telling me I need a templating system for my templating system. Have I got that right?" And I think, "Self, you've got it." So I punch myself in the face. And that's why it hurts to build CHARMM.

Tags: software

My life as a consultant

Yesterday I spent the morning in an unaccustomed role: that of the older, wiser homebrewer. My friends Ross and Wes were doing their first batch of homebrew, along with Kevin and Will, and wanted me along to help out. I was happy to oblige. My kids came along to play with their kids, and my wife to make homemade candy corn with the other wives (seriously, it's like a farmhouse social), and Wes brought beer for us all to drink, picked up from Central City Brewing the day before.

It was a damn good time and a shockingly warm and half-sunny day. The guys went all-grain for their first batch, and were using my homegrown hops for bittering and flavour. Ross rocked the clipboard while Wes and Kevin did the lifting:

Rockin the clipboard

A little bit of spilled wort on my shoes later ("Who the hell wears good shoes to a homebrew session?"), everything was in the fermenter and we were cleaned up. Gravity was a little lower than expected -- 1.038 -- but hey, we can clean up efficiency later, right? And word on the street is that the airlock was bubbling away later that night.

Awesome time. I might have to take this up professionally.

Tags: beer

Tracking down a Windows problem

I came across this a while back: a wonderfully informative technical blog post on tracking down a slowdown in Windows.

Tags: windows

Fridays...don't talk to me about Fridays

Last Friday was not a good day. First, I try installing a couple of hard drives I bought for a Dell MD3200 disk array, and it rejects them; turns out that it will not work with drives that are not pre-approved. It's right there in the documentation. I was aware of last year's kerfuffle with Dell announcing that their servers would no longer work w/unapproved drives, and then their backdown on that...but the disk arrays are different. So now I have a couple of extra 3 TB SATA drives to find a place for, and a couple of drives to buy from Dell.

While I'm in the server room staring at the blinking error light on the disk array and wondering if I'd just brought down the server it was attached to, I notice another blinking error light. This one was on the server that hosts Xen VMs that run LDAP, monitoring and a web server. It had a failing drive. Good thing it's in RAID 6, right? Sure, but it failed nearly a month ago -- I had not set up email alerts on the server's ILOM, so I never got notified about this. Fuck.

I send off an email to order a drive, then figure out how to get alerted about this. Email alerts are configured, but belt and suspenders: I get the CLI tool for the RAID card, find a Nagios plugin that runs it, and add the check to Nagios, running on the server's dom0. Hurrah, it alerts me! I ack the alert, and now it's time to head home.

On my way home I start getting pages about the VMs on this machine -- nothing down, but lots of timeouts. The machine recovers, then stumbles and stays down. (These alerts were coming from a second instance of Nagios I have set up, which is mostly there to monitor the main instance that runs on this server.) My commute is 90 minutes, and I have no net access along the way. When I finally get home, I SSH to work and find that the machine is hung; as far as I can tell, the CLI tool was just not exiting, and after enough copies accumulated, the RAID card just stopped responding entirely. I reboot the machine, and ten minutes later we're back up.

Ten minutes after that, I realize I'm still in trouble: I'm getting pages about a few other machines that are not responding. Remember how one of the VMs on the original server ran LDAP? It's one of three LDAP servers I have, because I fucking hate it when LDAP goes down. The clients are configured to fail over if their preferred server (the VM) isn't responding. I check on one of the machines, and nscd had about a thousand open sockets...which makes me think that the sequence was something like this:

  • During the hang, the VM was responding a little bit -- probably just enough to complete a 3-way handshake.

  • nscd would keep that connection open, because it had decided that the server was there and would be answering. But it wouldn't.

  • Another query would come along, and nscd would open another connection. Rinse and repeat.

  • Eventually, nscd bumped up against the open FD limit (1024), and was unable to open up new connections to any LDAP server.

I'm thinking about putting in a check for the number of open FDs nscd has, but I'm starting to second-guess myself; it feels a bit circular somehow. Not the right word, but I'm tired and can't think of a better one.
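
If I do put it in, it'd be something like this -- a sketch that counts all of nscd's open file descriptors, not just LDAP sockets, but near the 1024 limit the distinction hardly matters:

```
# Count open fds for each nscd process (run as root); anything close
# to the default 1024 limit means it's wedged against a half-dead server.
for pid in $(pgrep nscd); do
  printf '%s: %s fds\n' "$pid" "$(ls /proc/$pid/fd 2>/dev/null | wc -l)"
done
```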

Gah.

Tags: sysadmin

IT always takes longer than you think

Yesterday I did a long-anticipated firmware upgrade on a disk array at $work. It's attached to the head node of a small cluster we have, and holds the usual assortment of home directories and data. The process was kind of involved:

  • shut down the cluster to prevent disk I/O ("not mandatory, but strongly recommended" -- thx, I'll just go with "mandatory");

  • remove the current management software from the head node, reboot and then reinstall;

  • X was needed for the installation ("not mandatory, but--" Okay, right, got it, thx): twice via SSH, once by running startx locally;

  • I couldn't upgrade directly to the new firmware itself, but had to install bridge firmware, wait 30 minutes for things to settle out (!), then install the new firmware;

  • oh, and "due to limitations of the Linux environment", I couldn't install the firmware from the head node itself that just had the management software upgraded -- instead, I had to install that software on another machine and install it from there.

Which is why this all took about four hours to do. But that's not all:

  • Before all that, I read the many, many manuals; did a dress rehearsal to shake out problems; and made sure I had a checklist (thank you, Tom Limoncelli and Orgmode) with the exact commands to run

  • During the upgrade, I took notes on things I'd forgotten and problems I'd encountered.

  • After the upgrade, I did a postmortem: updated my documentation and filed bugs, notified the users that things were back up, and watched for problems.

Which is why a 4-hour upgrade took me 9.5 hours. I think there might be a handy rule of thumb for big work like this, though I can't decide if it's "it always takes twice as long" or "it always takes five hours longer than you think." Heh.

One other top tip: stop NFS exports while you're working on a server (but see the next paragraph!). One user started a session on another machine, which automounted her home directory from the head node. This was close to the end of my work, and while I could have used another reboot, I elected not to because I didn't want to mess up her session. Yes, the reboot was important, but I'd neglected to think about this situation, and I didn't think she should have to pay for my mistake.

And if you're going to turn off NFS exports, make damn sure you have your monitoring system checking exports in the first place; that way, you won't forget to turn it back on afterward. (/me scurries to add that test to Nagios right now...)
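
A minimal sketch of what that check might look like, using showmount; "headnode" is a placeholder hostname:

```
#!/bin/sh
# Nagios-style check: CRITICAL if the head node exports nothing.
if showmount -e headnode 2>/dev/null | grep -q '^/'; then
    echo "OK: NFS exports present"
    exit 0
else
    echo "CRITICAL: no NFS exports visible"
    exit 2
fi
```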

Tags: sysadmin

Quotas are on, right?

Tomorrow I'm upgrading firmware on a disk array that's attached to a small cluster I manage; yesterday, in preparation for that, I ran a full backup of the disks in question. I noticed that the home directories were taking longer than I thought, so I checked how full they were. The answer was 97%. Oh, fuck.

The prof whose cluster this is asked for quotas to be set up for everyone; he didn't have a lot of disk space to attach, and wanted to impose some discipline on his lab. And I'd done so...only somehow, the quotas were off now, probably because I'd left them off the last time I'd had to fiddle with quotas. Because of that, one user was taking up nearly half the disk, and another was taking up almost a third. To make things worse, I had not set up my usual Nagios monitoring for this machine (disk space, say) because Ganglia was set up on it, and I'd vaguely thought that two such systems would be silly...so I was not getting my usual "OMG WTF BBQ" messages from Nagios.

It gets worse. I'd put in cron scripts that maintained the quota files, nagged users by email and CC'd me...but the permissions were 544, which meant they never ran. No email? Well, then, everything must be fine, right? Sigh.

So:

  • I talked to the user w/half the disk space, and it turned out that almost all of it was in a directory called "old" which she could delete w/o problems. That got us space.

  • I whipped up a simple Nagios plugin to check that quotas were on (sketched after this list), and made sure I got a complaint; I turned on quotas on another partition, and made sure Nagios told me it was fine.

  • I fixed the permissions on the cron scripts, and made sure they ran (I left the debug setting on, and holy crap is it verbose...I'll need to fix that).

  • I'm considering adding a Nagios plugin that checks for cron files (/etc/cron.*) that are not executable (although if I'm lucky, maybe there's something in the cron runner that'll complain about this).

  • And as a reminder to myself: if repquota gives horribly wrong information, run "quotaon -p" to verify that quotas are, in fact, on.

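The plugin amounts to a few lines -- a sketch, with /home standing in for whichever partition matters:

```
#!/bin/sh
# Nagios-style check that user quotas are actually on.  "quotaon -p"
# prints the current state without changing anything.
if quotaon -pu /home 2>/dev/null | grep -q 'is on'; then
    echo "OK: user quotas on for /home"
    exit 0
else
    echo "CRITICAL: user quotas OFF for /home"
    exit 2
fi
```
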
Tags: sysadmin

Music Monday -- October 15, 2012

"Sunday Under Glass", by Beulah.

Pavement + the Beach Boys, plus wildly intelligent and playful lyrics. Sadly missed.

Tags: musicmonday